Computational Prediction of RNA-based Gene
Regulatory Mechanisms in Human and Tetrahymena
by
Jacob O. Kitzman
Submitted to the Department of Electrical Engineering and Computer Science
in Partial Fulfillment of the Requirements for the Degree of
Master of Engineering in Electrical Engineering and Computer Science
at the Massachusetts Institute of Technology
February 9, 2006
Copyright © 2006 Jacob O. Kitzman. All rights reserved.
The author hereby grants to M.I.T. permission to reproduce and
distribute publicly paper and electronic copies of this thesis
and to grant others the right to do so.
Author: Department of Electrical Engineering and Computer Science, February 9, 2006
Certified by: Christopher B. Burge
Accepted by: Arthur C. Smith, Chairman, Department Committee on Graduate Theses
ABSTRACT
The diversity and profound impact of gene regulation mediated by small RNAs (sRNAs)
is just beginning to come into focus. RNA interference (RNAi) pathways have been
shown to mediate processes such as genomic rearrangement in ciliates and developmental
timing and tissue differentiation in plants and animals. Here we present a computational
study into the function of two distinct classes of sRNAs. In the first section, we examine
an uncharacterized class of sRNAs isolated from the ciliate Tetrahymena thermophila,
present functional comparison to known classes of sRNAs in other organisms, and note a
strong and specific relationship to a novel sequence motif. In the second section, we
examine the evolutionary impact of microRNAs (miRNAs), which mediate potent post-transcriptional repression of their targets. We observe that miRNAs with tissue-specific
expression exert remarkable evolutionary pressure, compelling many preferentially
coexpressed genes to avoid accumulating target sites. We present tissue-specific
patterns of such target depletion and note strong agreement with experimentally obtained
miRNA expression patterns. Conversely, we report enrichment for targeting among
genes with expression patterns spatially or temporally complementary to the miRNAs',
suggesting a widespread role of tissue identity maintenance for miRNA-mediated
regulation.
Thesis Supervisor: Christopher B. Burge
Title: Associate Professor, Department of Biology, M.I.T.
To my best friend, Maggie, with love.
Acknowledgements
I gratefully acknowledge Professor Chris Burge for granting me the privilege of studying
in one of the most exciting centers of bioinformatics research among incredibly gifted
colleagues. His mentoring, and that of my coworkers in the Burge lab, were an essential
source of guidance and have greatly influenced my approach to scientific thinking.
I also thank Dr. Kathleen Collins and Suzanne Lee, both of the University of California-Berkeley, for graciously providing Tetrahymena sRNA sequences prior to publication.
Lastly, I wish to thank my mother Helen and father Craig, as well as my sisters Eva and
Hannah, for their love, support, and work in making educational opportunities available
to me over the many years.
Table of Contents
ABSTRACT
Acknowledgements
Table of Figures
Table of Tables
CHAPTER 1
1.1. Motivation
1.2. Background
1.2.1. The Central Dogma of Molecular Biology
1.2.2. Small RNAs Play Potent Regulatory Roles
1.2.3. Experimental Techniques
1.2.3.1. Gene Expression Microarrays
1.2.3.2. Small RNA Assays
1.3. References Cited
CHAPTER 2
Abstract
2.1. Overview: Tetrahymena thermophila
2.1.1. Introduction
2.1.2. Lifecycle and Nuclear Dualism
2.1.3. Macronuclear Genomic Characteristics
2.1.4. RNA-Guided Genomic Rearrangement
2.2. Overview: ~23 nt sRNAs in Tetrahymena
2.2.1. Introduction
2.2.2. Small RNA Loci Cluster within the MAC Genome
2.2.3. Composition and Editing
2.3. Screen for RNA Secondary Structure
2.3.1. Overview
2.3.2. Methods
2.3.3. Results
2.4. Screen for trans-Targeting Potential
2.4.1. Overview
2.4.2. Methods
2.4.3. Results and Discussion
2.5. Small RNA Cluster Sequences are Related
2.5.1. Overview
2.5.2. Methods
2.5.2.1. Paralog Gene Families
2.5.3. Results
2.5.3.1. Gene Prediction Verification
2.5.3.2. Paralog Gene Families
2.5.3.3. Small RNA Clusters Are Highly Interrelated
2.5.3.4. Adjacent Genes Generally Tend to be Paralogous
2.5.3.5. Genes Overlapping sRNA Clusters Have Average Overall Paralog Counts
2.5.3.6. Rearrangement of Related sRNA Clusters
2.5.3.7. Individual sRNA positions are not conserved
2.5.3.8. Paralogy to non-sRNA Associated Loci
2.5.4. Discussion
2.6. A Novel Sequence Motif Is Strongly and Specifically Associated with ~23 nt sRNA Activity
2.6.1. Overview
2.6.2. Methods
2.6.2.1. General Motif-Finding at sRNA Clusters
2.6.2.2. Motif-Finding Controls
2.6.3. Results
2.6.3.1. General Motif Finding Among sRNA-cognate genes
2.6.3.2. Discriminative Modeling of A-Rich Tracts
2.6.3.3. Discussion
2.7. Conclusion
2.8. References Cited
CHAPTER 3
Abstract
3.1. Introduction
3.2. Methods
3.2.1. Microarray Dataset Processing
3.2.2. Observed Sequence Features
3.2.3. Sequence Feature Background Model
3.2.4. Sequence Feature Sets
3.2.5. Tissue Specificity Index score
3.2.6. Measurement of Feature Depletion and Enrichment
3.2.6.1. Overview
3.2.6.2. Binning Strategy
3.2.6.3. KS Test Statistics
3.2.6.4. Estimation of KS Statistic Background Distribution
3.2.6.5. Application of KS Test
3.2.6.6. False Positive Analysis
3.3. Methods
3.3.1. Comparison of Expected and Observed Sequence Feature Counts
3.3.2. Tissue-Specific Index Score Evaluation
3.3.3. Tissue-Specific Depletion of microRNA Target Sites
3.3.3.1. Weak Depletion Signals also Coincide with microRNA Expression
3.3.3.2. Signal-to-Noise Estimation
3.3.3.3. Comparison to Experimentally Determined MicroRNA Expression
3.3.3.4. Comparison to Results of Farh et al.
3.3.4. Tissue-Specific Enrichment for MicroRNA Target Sites
3.4. Conclusion and Future Directions
3.5. References Cited
APPENDIX
Table of Figures
CHAPTER 1
Figure 1: Mammalian microRNA biogenesis
Figure 2: MicroRNA-target pairing
CHAPTER 2
Figure 1: Minimal flanking folding energies for Tetrahymena sRNAs and controls
Figure 2: Minimal flanking folding energies for Arabidopsis miRNAs and controls
Figure 3: Hit count distributions to different genomic features
Figure 4: Comparison of predicted and curated annotation of gene H2A.Z
Figure 5: Rearrangement of sRNA-cognate gene clusters
Figure 6: A-Rich hidden Markov model
Figure 7: A-rich motif Sequence logo
Figure 8: Distribution of motif hits for sRNAs and randomized controls
Figure 9: Quality and quantity of A-rich motif instances predicted by HMM
CHAPTER 3
Figure 1: Tissue Specificity Index calculation
Figure 2: Gene binning algorithm
Figure 3: Sequence feature background model performance
Figure 4: Expression and average rank correlations in 61 mouse tissues
Figures 5a, 5b: Examination of TSI scoring in three tissues
Figure 6: Selected miRNAs depleted for targeting in non-brain tissues
Figure 7: Selected miRNAs depleted for targeting in brain tissues
Figure 8: Signal to noise estimation for depletion analysis
Figure 9: Signal to noise estimation for enrichment analysis
Figure 10: Enrichment of targeting among tissues spatially/temporally complementary to miRNA expression
APPENDIX
Figure A1: MicroRNA targeting depletion
Figure A2: MicroRNA targeting enrichment
Table of Tables
CHAPTER 2
Table 1: Eukaryotic genomic summary statistics
Table 2: Length and composition of various classes of short RNAs
Table 3: Genome-wide hits from ~23 nt sRNAs
CHAPTER 3
Table 1: MicroRNA sequence feature sets
Table 2: Target depletion compared with in situ miRNA expression data
APPENDIX
Table A1: Genomic loci of sRNAs cloned by Lee et al.
Table A2: Paralogy relationships between sRNA-associated clusters
Table A3: HMM-identified A-rich motif instances near sRNA clusters
Table A4: HMM-identified strong A-rich motif instances genome-wide
CHAPTER 1
Introduction
1.1. Motivation
Following the convergence of whole-genome sequencing and high-throughput assays in
life science research, the unprecedented opportunity has arisen to glimpse life's full
biological complexity at the molecular, cellular, and physiological levels. By contrast,
experimental molecular biology has traditionally been applied with great success to
focused studies, characterizing in detail the actions of individual genes and proteins.
Today, experimental inquiries are often motivated by computational analyses which mine
massive datasets for interesting patterns and trends that have eluded manual detection.
Conversely, computational analysis is also commonly used to follow up experimental
discovery, for instance by measuring genome-wide signal for an event experimentally
observed in limited scope but suspected to have a larger role. In this thesis, we present two
distinct studies each seeking to assess in silico the role or scope of a biological
phenomenon.
1.2. Background
1.2.1. The Central Dogma of Molecular Biology
The central dogma of molecular biology provides a first-order summary of the pathways
and substrates that support the living cell. Elegant in its simplicity, the dogma can be
visualized as a graph with three nodes and three edges: DNA composing the genome,
RNA transcribed from it, and finally, proteins, which are the primary structural and
enzymatic components of living cells. Two edges represent the molecular actions that
carry information encoded in the genome through to a physical manifestation:
transcription, in which genes are copied from the genome to messenger RNA (mRNA),
and translation, in which protein molecules are synthesized based on instructions from
these transcripts. Proteins interact with each other and ultimately act on genomic DNA to
initiate or block transcription, thus feeding back into the graph.
Following this model, protein-protein interactions eventually flowing back to alter DNA
state were considered the primary modes of action sustaining cellular function. This model
appeals by rough analogy to the digital computer, which features robust but inflexible
long-term information storage in some sense similar to genomic DNA. Random-access
memory resembles RNA, transiently preserving state to carry out only the tasks at hand.
DNA and RNA are each encoded by 4 bases, a strong parallel to the electrostatic binary
encoding used in the modern computer. Lastly, instructions encoded in memory direct
input/output devices such as a printer or more exotically, perhaps a robotic arm, which
effect physical action, analogous to the synthesis of proteins with enzymatic activity.
1.2.2. Small RNAs Play Potent Regulatory Roles
RNA, which is chemically much less stable than DNA, was for many years deemed too
transient to be capable of carrying out precise regulation and was relegated to the role of
information carrier. Over the course of the past decade, this view has been shattered by
the discovery of diverse regulatory pathways mediated by tiny RNAs, collectively termed
RNA interference (RNAi). Core elements of these pathways, which are shared among all
plants and animals, as well as some fungi, unify to repress (or "silence") gene expression
despite differences in mode and mechanism. Though initially proposed to mediate only a
few specific developmental timing effects, RNAi is now understood to exert regulatory
control over as many as one-third of all transcribed human genes (1, 2), and has been
implicated in diverse physiological processes, including neuronal differentiation (3), and
muscle (4) and skin morphogenesis (5).
[Figure 1: Mammalian miRNA Biogenesis]
Figure 1. MicroRNA hairpin-loop precursors undergo several rounds of processing before being
loaded into the RISC effector complex. Target repression is thought to be carried out both through
blockage of productive translation and through transcript cleavage. Adapted from Bartel (6).
MicroRNAs (miRNAs) are an endogenous class of ~22 nt RNAs that mediate post-transcriptional gene silencing in both animals and plants, and in humans comprise
approximately 1% of all known genes (6). MicroRNA precursor sequences (pre-miRNAs) characteristically form energetically favorable secondary structures through
Watson-Crick pairs between complementary bases (adenine and uracil; guanine and
cytosine), as depicted in Figure 1. Precursor structures are observed to include an
extended hairpin with imperfect pairing flanking the mature sequence and meeting at a
terminal loop of unpaired bases, structural features thought to be required for recognition
and subsequent processing. Pre-miRNAs are exported to the cytoplasm and are cleaved
by the endonuclease Dicer to yield a double-stranded RNA (dsRNA) containing the
mature miRNA and its complement, the miR*, each bearing a 2-nt 3' overhang
characteristic of RNase III cleavage.
[Figure 2: MicroRNA-Target Pairing]
Target 3' UTR   3'-UAGUGUAA-5'
                   ||||||||
miR-23a         5'-AUCACAUUGCCCGAGGGAUUUCC-3'
Figure 2. Metazoan microRNAs target messages primarily by Watson-Crick base pairing to
sequence complementary to miRNA bases 2 to 7 (m2-m7), the seed region. Pairs at m1 and m8 are also observed
but are not required.
Following processing, the mature miR is loaded into the RNA-Induced Silencing
Complex (RISC), while the complementary miR* strand is degraded. RISC-bound
miRNAs then target messages by binding reverse complementary sequences in their 3'
untranslated regions (3'-UTRs), causing repression of productive translation and in some
cases, target cleavage (Figure 2). Pairing to the miRNA 5' seed region (bases 2-7) is
thought to be the primary determinant of targeting specificity for animal miRNAs, and
experimental data suggest that pairing to this region is necessary and sufficient for target
repression (7). Computational analyses (1, 2) reveal that 6-7 nt microRNA seed matches
are evolutionarily conserved 2- to 3-fold above expectation, further supporting a
widespread role for miRNA-mediated silencing.
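As a rough illustration of the seed-pairing rule described above, the following Python sketch derives the 6-nt target site complementary to miRNA bases 2-7 and scans a 3'-UTR for exact occurrences. The function names and the toy UTR string are hypothetical and chosen only for illustration; the miR-23a sequence is the one shown in Figure 2. It assumes exact Watson-Crick matching with no G:U wobble, and it is not the target-prediction procedure of the cited studies.

```python
# Minimal sketch: locate miRNA seed matches in a 3'-UTR (all sequences written 5'->3').
# Assumption: exact Watson-Crick complementarity to miRNA bases 2-7, no G:U wobble.

COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}

# Mature miR-23a sequence as printed in Figure 2.
MIR_23A = "AUCACAUUGCCCGAGGGAUUUCC"


def reverse_complement(rna):
    """Return the reverse complement of an RNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(rna))


def seed_site(mirna, start=2, end=7):
    """Return the target site (5'->3') that pairs with miRNA bases start..end (1-based, inclusive)."""
    seed = mirna[start - 1:end]
    return reverse_complement(seed)


def find_seed_matches(utr, mirna):
    """Return 0-based start positions of every (possibly overlapping) seed match in the UTR."""
    site = seed_site(mirna)
    hits = []
    pos = utr.find(site)
    while pos != -1:
        hits.append(pos)
        pos = utr.find(site, pos + 1)
    return hits


if __name__ == "__main__":
    # The target fragment from Figure 2, rewritten 5'->3'.
    figure_target = "AAUGUGAU"
    print(seed_site(MIR_23A))                         # AUGUGA
    print(find_seed_matches(figure_target, MIR_23A))  # [1]
```

Extending the window to bases 1-8 (for example `seed_site(mirna, 1, 8)`) would capture the optional m1 and m8 pairs mentioned in the Figure 2 caption.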
MicroRNA targeting has also been shown to play an important role in plant development
(6). Unlike in metazoans, plant microRNAs target sites with complementary pairing
along the full length of the miRNA molecule, though with greater tolerance for bulges or
internal loops. Also, plant miRNA targets are generally silenced by cleavage rather than
by translational repression.
Small interfering RNAs (siRNAs) are another class of small RNAs of both endogenous
and exogenous origin that mediate transcriptional and post-transcriptional gene silencing.
In contrast to microRNAs, these are not derived one or two at a time from hairpin-loop
precursor structures, but instead many at a time in phase from bidirectional transcripts.
Knockdown experiments employ silencing mediated by transfected siRNAs and have
many applications, such as to study a gene's deficiency phenotype. An endogenous class
of siRNAs, the plant trans-acting siRNAs (tasiRNAs), rely partially on the miRNA
biogenesis pathway and differ from many other siRNAs by targeting distant loci in trans.
Numerous other classes of small RNAs have been observed (8), some of which may carry
out novel regulatory function. In Chapter 2, we examine a small RNA of uncharacterized
function isolated from the single-celled eukaryote Tetrahymena thermophila, and
speculate that it may participate in a functionally distinct RNAi pathway.
1.2.3. Experimental Techniques
1.2.3.1. Gene Expression Microarrays
DNA microarrays are a popular large-scale biological assay for measuring the expression
levels of a large number of genes simultaneously. One high-density microarray can
interrogate transcript abundances of every known or predicted gene in the genome.
Typically, a series of identical microarrays is used to measure expression levels in
samples extracted from different tissues, time-points, or treatments. Microarrays have
enabled genome-wide studies of tissue-specific gene expression and alternative splicing,
two cellular activities that have recently been proposed to account for much of the
difference in biological complexity between humans and simpler organisms. In Chapter
3 of this study, we make use of measurements from a microarray study (9) probing the
levels of known and predicted transcripts of over 13,000 genes in 61 mouse tissues. By
comparing the entire transcriptional contexts of different tissues, we observe
potent evolutionary pressure to avoid, or in some cases, gain target sites for tissue-specific miRNAs.
1.2.3.2. Small RNA Assays
Microarrays have also been used to assay miRNA abundance, and we refer to several
such studies to confirm patterns suggesting tissue-specific miRNA expression. Several
other experimental procedures can be used as well. One classical technique is the
northern blot, in some sense a small-scale microarray in which a panel of RNA samples
are size-fractionated in separate lanes by gel electrophoresis and then simultaneously
hybridized with labeled probes. In another technique, in situ hybridization, probes are
hybridized to mounted cells or tissue sections, revealing not only expression level but
also spatial context, providing an excellent basis of comparison for tissue-specific effects.
1.3. References Cited
1. Lewis BP, Burge CB, Bartel DP. Conserved Seed Pairing, often Flanked by
Adenosines, Indicates that Thousands of Human Genes are microRNA Targets. Cell.
2005 Jan 14; 120(1):15-20.
2. Lewis BP, Shih IH, Jones-Rhoades MW, Bartel DP, Burge CB. Prediction of
Mammalian microRNA Targets. Cell. 2003 Dec 26; 115(7):787-98.
3. Schratt GM, Tuebing F, Nigh EA, Kane CG, Sabatini ME, Kiebler M, Greenberg ME.
A Brain-Specific microRNA Regulates Dendritic Spine Development. Nature. 2006 Jan
19;439(7074):283-9.
4. Chen JF, Mandel EM, Thomson JM, Wu Q, Callis TE, Hammond SM, Conlon FL,
Wang DZ. The Role of microRNA-1 and microRNA-133 in Skeletal Muscle Proliferation
and Differentiation. Nat Genet. 2006 Feb;38(2):228-33.
5. Yi R, O'Carroll D, Pasolli HA, Zhang Z, Dietrich FS, Tarakhovsky A, Fuchs E.
Morphogenesis in Skin is Governed by Discrete Sets of Differentially Expressed
microRNAs. Nat Genet. 2006 Feb 5.
6. Bartel DP. MicroRNAs: Genomics, Biogenesis, Mechanism, and Function. Cell. 2004
Jan 23; 116(2):281-97.
7. Brennecke J, Stark A, Russell RB, Cohen SM. Principles of microRNA-Target
Recognition. PLoS Biol. 2005 Mar;3(3):e85.
8. Ambros V, Lee RC, Lavanway A, Williams PT, Jewell D. MicroRNAs and Other Tiny
Endogenous RNAs in C. elegans. Curr Biol. 2003 May 13;13(10):807-18.
9. Su AI, Wiltshire T, Batalov S, Lapp H, Ching KA, Block D, Zhang J, Soden R,
Hayakawa M, Kreiman G, Cooke MP, Walker JR, Hogenesch JB. A Gene Atlas of the
Mouse and Human Protein-Encoding Transcriptomes. Proc Natl Acad Sci U S A. 2004
Apr 20; 101(16):6062-7.
CHAPTER 2
Computational Characterization
of a Recently Discovered Class of Small RNAs
in Tetrahymena thermophila.
Abstract
Tetrahymena thermophila recently became the first unicellular organism reported to have
multiple distinct RNAi pathways, as a novel, ~23 nt class of small RNA (sRNA) of
unknown function (1) joined the previously-described sRNAs that mediate genomic
rearrangement in this ciliated eukaryote (2). Here we report efforts to characterize these
~23 nt sRNAs in silico. We apply computational methods to assess any functional
similarities to known classes of sRNA in higher organisms, showing that these sRNAs do
not appear to be processed from microRNA-like precursors, and moreover do not appear
to target loci in trans, suggesting that they participate in yet another RNAi pathway of
novel mechanism or effect. Lastly, we report that an A-rich motif is specifically
associated with the small RNAs, and apply a discriminative Hidden Markov Model
(HMM) to predict other genomic loci with high potential for sRNA activity.
2.1. Overview: Tetrahymena thermophila
2.1.1. Introduction
Tetrahymena thermophila is a single-celled eukaryote indigenous to temperate freshwater
pond habitats worldwide. Perhaps its most physiologically striking feature is the
covering of cilia on its exterior. A member of the Alveolata, a group that includes the
malaria pathogen Plasmodium, Tetrahymena exhibits remarkable genomic, biochemical,
and structural complexity despite being unicellular. Combined with great ease of
experimental manipulation, this has made it increasingly popular as a model organism.
Several fundamental discoveries have been made in Tetrahymena studies, including self-splicing RNA, telomeres, and chromatin structure.
2.1.2. Lifecycle and Nuclear Dualism
Like other ciliated protozoa but in stark contrast to almost all other eukaryotes,
Tetrahymena possesses two distinct nuclei - a diploid, germline micronucleus (MIC),
which has N=5 pairs of chromosomes, and a haploid somatic macronucleus (MAC),
which contains about 275 chromosomes, most at ~45 copies each. Tetrahymena's
nuclear dualism is complemented by its two distinct life cycles (3, 4). In the first such
cycle, called vegetative growth, cells reproduce through successive rounds of asexual
splitting. During this cycle, the MIC is transcriptionally silent while the MAC expresses
the set of messages necessary to divide and sustain cellular function. Vegetative growth
proceeds indefinitely until cells first undergo nutritional stress and then encounter cells of
a different mating type. At that point, the two mating cells undergo conjugation,
combining micronuclear genetic material through meiosis and cross-fertilization to yield
daughter micronuclei from which new macronuclei are subsequently derived.
2.1.3. Macronuclear Genomic Characteristics
The Tetrahymena macronuclear genome was recently sequenced at the Institute for
Genomic Research (TIGR). With a total assembled length of 104 Mb, of which 77.6%
is adenine and thymine (A+T), the MAC genome is comparable in size and
composition to that of the related ciliate Paramecium tetraurelia, as well as the more
distantly-related Plasmodium. A comparison of its genomic features with those of several other
model organisms is shown in Table 1. Despite having a much shorter genome than higher
eukaryotes, Tetrahymena, like other ciliates, has a very high coding density and is
predicted to have roughly as many genes as mammals. However, in contrast to
mammalian genes, the predicted genes in Tetrahymena have fewer exons and shorter
introns, possibly reflecting simpler spliceosomal processes.
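The A+T percentage quoted above is a plain base-composition fraction. A minimal Python sketch of that calculation follows; the function name and the toy sequence are hypothetical, and it assumes that ambiguous bases such as N are excluded from the denominator, which may differ from how the published figures were computed.

```python
# Minimal sketch: A+T fraction of a DNA sequence.
# Assumption: only unambiguous bases (A, C, G, T) count toward the total.

def at_fraction(seq):
    """Return the fraction of unambiguous bases in seq that are A or T."""
    bases = [b for b in seq.upper() if b in "ACGT"]
    if not bases:
        raise ValueError("sequence contains no unambiguous bases")
    at_count = sum(1 for b in bases if b in "AT")
    return at_count / len(bases)


if __name__ == "__main__":
    toy = "AATTAATTGC"  # hypothetical 10-base example: 8 of 10 bases are A or T
    print(f"{at_fraction(toy):.1%}")  # 80.0%
```

Applied to a whole assembly, the same function would be called once per contig, with the counts summed before dividing.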
Table 1. Genomic summary statistics for several eukaryotes.

Statistic                       Human          Arabidopsis   Paramecium      Plasmodium     Tetrahymena
Genome length                   2.91 Gb        125 Mb        MIC: ~100 Mb    23 Mb          MIC: ~120 Mb
                                                             MAC: ~90 Mb                    MAC: ~105 Mb
%(A+T)                          55%            64%           72%             81%            78%
%(G+C)                          45%            36%           28%             19%            22%
Predicted no. genes             ~32,000        25,498        >30,000         5,266          ~27,300
Average genomic extent / gene   27 kb          2.01 kb       1.43 kb         2.53 kb        1.67 kb (median)
Average exon length             145 bp         250 bp        419 bp          949 bp         218 bp (median)
Average no. exons / gene        7 (median)     5 (median)    3.3             2.4            3 (median)
Average intron length           3.37 kb        168 bp        25.4 bp         178.7 bp       86 bp (median)
                                                                             (median)
Average intergenic distance     51.2 kb        2.26 kb       202 bp          1694 bp        1117 bp (median)
                                (Chr14 only)                 (Chr14 only)    (median)

Averages are means unless otherwise noted. Sources: (5-11)
2.1.4. RNA-Guided Genomic Rearrangement
The MAC genome is derived from its germline cousin by specific elimination of ~15% of
the micronuclear sequence followed by fragmentation into many smaller chromosomes
which are selectively amplified to 45C. Selective deletion of MIC sequence was shown
to be mediated by a class of ~28 nt single-stranded RNAs dubbed the scnRNAs for their
proposed mechanism of "scanning" the MIC genome for complementary sequence to be
deleted (2). Within such internal eliminated sequences (IESs), specific promoter elements
initiate bidirectional transcription, yielding double-stranded transcript that is sequentially
cleaved by the Dicer homolog Dcl1p (12) into the ~28 nt scnRNAs, which target reverse-complement sites within the MIC genome for deletion. By silencing specific target loci,
this process achieves a similar effect to that of other RNAi pathways, but does so by a
unique method - actual deletion of targeted sequence from the nascent somatic
macronucleus.
2.2. Overview: ~23 nt sRNAs in Tetrahymena
2.2.1. Introduction
A novel class of endogenous small RNAs (sRNAs) 23-24nt in length was recently
reported by Lee and colleagues (1). These sRNAs appear to be distinct from other
classes of sRNAs in Tetrahymena such as the scnRNAs or starvation-induced tRNA
degradation products (13), suggesting the presence of an additional RNAi pathway of
as-yet unknown function. To date, no other unicellular organism has been reported to have
multiple distinct classes of small RNAs.
The ~23nt sRNAs are differentiated from other classes of sRNA in Tetrahymena by
several characteristics in addition to their length. First, the ~23nt sRNAs accumulate
throughout the Tetrahymena lifecycle whereas tRNA cleavage products and scnRNAs are
generated primarily in response to starvation and conjugation, respectively. Second, the
~23nt sRNAs are derived in a perfectly strand-specific fashion, in contrast to the scnRNAs
which are endonucleolytic products of both strands of double-stranded transcript. Third, the
~23nt sRNAs rely upon a distinct factor for processing and, unlike the scnRNAs, are
observed in DCR1 knockout progeny. The ubiquitously-expressed protein Dcr2p, the
most canonical of the three Dicer homologs found in the MAC genome, was postulated
as the agent of these sRNAs' biogenesis (1). Significantly, DCR2 is an essential gene,
suggesting that its putative ~23nt substrates are functionally important.
2.2.2. Small RNA Loci Cluster within the MAC Genome
Of the 151 ~23nt sRNAs cloned and sequenced by Lee and colleagues, 107 (71%)
matched the MAC genome in at least one place, and of those, 89 (84%) mapped
unambiguously to a genomic locus (listed in Appendix Table A1). Each sRNA sequence
was cloned exactly once, suggesting high diversity in the population of ~23nt sRNAs.
Remarkably, of the 107 MAC sRNAs, 97 (91%) are grouped in twelve distinct, unlinked
clusters each containing between 2 and 16 sRNAs and spanning between 46 and 8588 bp
(mean = 3237 bp). This clustered organization bears similarity to small RNAs in other
organisms: mammalian microRNA genes are commonly found in clusters (14), plant
tasiRNAs are sequentially derived in 21-nt phase from primarily one strand of a dsRNA
transcript (15), and the "X-cluster" sRNAs of C. elegans densely reside in one
poorly-characterized locus without apparent phase or structure (16). Distinguishing them from
known examples, though, the ~23nt sRNAs are derived in a perfectly strand-specific manner
from each cluster. Moreover, all but two of the clusters overlap predicted genes on the
opposite strand.
2.2.3. Composition and Editing
The ~23nt sRNAs display several notable compositional biases (Table 2). First, they
display a preference for 5' uracil similar to but much stronger than the bias noted for
metazoan and plant mi/siRNAs (17, 18), and most closely resembling that of the plant
tasiRNAs.
In contrast to other reported classes of sRNA in Tetrahymena, the nucleotide
composition of the ~23nt sRNAs appears to complement that of predicted coding sequence,
further suggesting high coding potential among sequences antisense to sRNA loci. Lee et
al. (1) report that over half of the cloned ~23nt sRNA sequences differ from their
genomic sequence by one or two bases, mostly at the 3' terminal nucleotide. Because
this discrepancy was not observed among other classes of sRNAs cloned in the same
study, they propose an untemplated 3' end activity specific to this class of sRNAs.
Table 2. Length and composition for short RNAs of various organisms.
Predicted coding sequence composition in Tetrahymena is shown for comparison.
Overall composition
sRNA Length
Tetrahymena
~23nt sRNAs
matching MAC genome
scnRNAs
Predicted cds
Arabidopsis
tasiRNAs (†)
microRNAs (†)
Human
microRNAs (‡)
Mean 23.5 nt
(22 nt: 1,23 nt: 55,
24 nt: 48, 25 nt: 3)
Mean 28.6 nt e'
%(5' U)
94%
84%
%A
30%
%U
()
%C
%G
)
10% (
45%
15%
34%1'
36% '
17% ('
13% V)
NA
NA
40%
32%
13%
15%
Mean 21.2 nt
Mean 22.0 nt
70%
37%
30%
30%
36%
23%
18%
23%
16%
24%
Mean 21.9 nt
49%
23%
30%
22%
25%
(*) Compositional frequencies reported by Lee et al. (1). (†) Sequences for 23 Arabidopsis tasiRNAs and 298
miRNAs downloaded from ASRP (19). (‡) Sequences for 319 human microRNAs downloaded from miRBase (RFAM
release 7.1) (20, 21).
These sRNAs bear considerable similarities in composition and organization to a variety
of RNAi-mediating molecules in other eukaryotes. We believe that it is unlikely that
Tetrahymena would retain these clusters of ~23nt sRNAs and the cellular machinery
necessary for their synthesis if they did not have an important function, especially given
the organism's notable efficiency. In an attempt to ascertain their function, we compare
them to two well-studied classes of such molecules, namely microRNAs and tasiRNAs.
Finally, we present evidence for common sequence-level features shared by sRNA clusters
and identify additional genomic loci with sRNA-generating potential.
2.3. Screen for RNA Secondary Structure
2.3.1. Overview
The Tetrahymena ~23nt sRNAs share several characteristics with microRNAs such as a
length distribution tightly centered on 23 nt and a strong bias for uracil at position 1,
leading us to investigate the full extent of these molecules' similarity in form and
function. Hairpin-loop secondary structure is a defining characteristic of microRNAs in
both plants and animals and is strictly required for their processing and nuclear export
(22). In order to assess the sRNAs' similarity to miRNAs, we examined the potential of
their flanking sequences to assume an energetically favorable secondary structure.
2.3.2. Methods
The program RNAfold from the freely-available Vienna RNA software package (23) was
used to determine the most energetically-favorable RNA structure of each sequence. For
each of 97 clustered ~23nt sRNAs, every window of length L (tested using L = 150 nt and
L = 350 nt) containing its genomic locus was extracted. RNAfold (arguments "-noLP")
was applied to determine the optimal secondary structure of every such sequence window
and, for each sRNA, the surrounding window with the lowest minimum free energy (MFE) of
folding was selected.
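The window scan described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in this study: the names `windows_containing`, `best_window` and `toy_energy` are hypothetical, and `fold_energy` stands in for any routine that returns the folding energy of a sequence (for example, a wrapper around RNAfold). A toy scoring function is used here so that the sketch runs on its own.

```python
from typing import Callable, Iterator, Optional, Tuple


def windows_containing(seq_len: int, start: int, end: int,
                       L: int) -> Iterator[Tuple[int, int]]:
    """Yield every half-open window (w_start, w_end) of length L that
    fully contains the locus [start, end) and lies within the sequence."""
    lo = max(0, end - L)            # leftmost start that still covers `end`
    hi = min(start, seq_len - L)    # rightmost start that still covers `start`
    for w_start in range(lo, hi + 1):
        yield w_start, w_start + L


def best_window(genome: str, start: int, end: int, L: int,
                fold_energy: Callable[[str], float]
                ) -> Optional[Tuple[float, int, int]]:
    """Return (energy, w_start, w_end) for the lowest-energy window,
    or None if no window of length L can contain the locus."""
    best = None
    for w_start, w_end in windows_containing(len(genome), start, end, L):
        energy = fold_energy(genome[w_start:w_end])
        if best is None or energy < best[0]:
            best = (energy, w_start, w_end)
    return best


def toy_energy(seq: str) -> float:
    """Placeholder energy (assumption): more G/C gives a lower score.
    A real analysis would call a folding program here instead."""
    return -float(seq.count("G") + seq.count("C"))
```

With a real folding routine substituted for `toy_energy`, calling `best_window` once per sRNA and per window length would give the per-sRNA minimum energies that are compared against controls.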
To more systematically assess whether the cloned sRNAs arose from pre-miRNA-like
structures, their flanking sequences' folding energies were compared with those of a
randomized control set. For each sRNA, a control cohort of sequences was obtained by
selecting randomly from the MAC genome non-overlapping sequences having length
equal to and dinucleotide content similar to that of the sRNA. This roughly equalized the
folding energy of the sRNA and its controls under the null hypothesis that the sRNAs
fold no better than random given their length and composition.
To select the control cohort for a given cloned sequence i, equal-length sequences were
sampled from randomly chosen positions in the MAC genome sequence. Each sampled
sequence j was kept if the root mean squared deviation between its dinucleotide
frequencies and those of the cloned sRNA fell below a given threshold r_core, that is, if

$R_{core}(i,j) = \sqrt{\frac{1}{|D|}\sum_{k \in D}\left(f^{core}_{k,i} - f^{core}_{k,j}\right)^{2}} < r_{core}$,

where D is the set of 16 dinucleotides and $f^{core}_{k,i}$ equals the observed frequency
of dinucleotide k in sequence i. As an additional
requirement, a similarly-defined threshold r_flank was applied to the dinucleotide
frequencies of the surrounding window. Parameter values (r_flank = 0.015, r_core = 0.030)
were chosen on the basis that they were high enough to avoid overfitting and allow for
rapid sampling but were far below the average standard deviation of dinucleotide
frequencies in the genome as a whole, ensuring that a reasonably large control cohort
would mirror the composition of the sRNA set. In the same way as for the cloned
sequences, a fixed-length window was slid across each control sequence, and the
lowest-MFE window was selected.
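The acceptance rule for control sequences can be illustrated with a short sketch. It assumes DNA-alphabet input and implements only the core threshold; the flank threshold would be applied in the same way to the surrounding window. The function names and the `max_tries` cutoff are hypothetical and are not taken from the original analysis.

```python
import math
import random
from itertools import product

# The 16 dinucleotides (the set D in the text).
DINUCS = ["".join(p) for p in product("ACGT", repeat=2)]


def dinuc_freqs(seq: str) -> dict:
    """Observed frequency of each overlapping dinucleotide in seq."""
    counts = dict.fromkeys(DINUCS, 0)
    total = 0
    for a, b in zip(seq, seq[1:]):
        if a + b in counts:
            counts[a + b] += 1
            total += 1
    return {k: (v / total if total else 0.0) for k, v in counts.items()}


def dinuc_rmsd(seq_i: str, seq_j: str) -> float:
    """Root mean squared deviation between two dinucleotide profiles."""
    fi, fj = dinuc_freqs(seq_i), dinuc_freqs(seq_j)
    return math.sqrt(sum((fi[k] - fj[k]) ** 2 for k in DINUCS) / len(DINUCS))


def sample_controls(genome: str, query: str, r_core: float, n_controls: int,
                    rng: random.Random, max_tries: int = 100000) -> list:
    """Rejection-sample start positions of non-overlapping, equal-length
    control sequences whose RMSD to the query falls below r_core."""
    L = len(query)
    picked = []
    for _ in range(max_tries):
        if len(picked) >= n_controls:
            break
        pos = rng.randrange(len(genome) - L + 1)
        if any(abs(pos - p) < L for p in picked):
            continue  # would overlap an already accepted control
        if dinuc_rmsd(genome[pos:pos + L], query) < r_core:
            picked.append(pos)
    return picked
```

Rejection sampling keeps the controls drawn from real genomic sequence, at the cost of discarding most candidates when the threshold is tight.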
2.3.3. Results
Only a small subset of the ~23nt sRNAs are flanked by sequence predicted to have
sufficiently low folding energy to potentially exhibit pre-miRNA-like structure. We
manually examined these cases and found that although some of the predicted structures
did contain stem loops, none of the cloned sRNAs resided in stem regions as would be
expected for a mature miRNA.
The distributions of lowest-energy flanking windows surrounding both the cloned
sequences and the control cohorts each take an extreme-valued form as would be
expected for the selection of best-folding window surrounding each sequence.
Distributions were obtained for sequence windows of length L = 150nt and L = 350nt,
with cohorts of 300 and 45 controls per sRNA sequence, respectively, and are shown in
Figure 1.
[Figure 1 plot: Minimal Flanking Folding Energies for Tetrahymena sRNAs and Controls. Panels: window size 150 nt; window size 350 nt. Series: control sequences; cloned RNAs (N=107). x-axis: folding energy of best-folding window, Kcal/mol.]
Figure 1. Distributions of minimum free energy (MFE) for best-folding windows surrounding
Tetrahymena ~23nt sRNAs or control cohorts for window sizes 150nt and 350nt. Small RNA
mean MFE: -24.1 Kcal/mol and -61.5 Kcal/mol, respectively; control cohort mean MFE: -26.9
Kcal/mol, -66.3 Kcal/mol, respectively.
In fact, the ~23nt sRNA sequences have significantly higher folding energies than their
control cohorts (P < 1.3 x 10^-8 and P < 7.5 x 10^-9, respectively; Mann-Whitney-Wilcoxon
rank sum test), exactly the opposite of the effect expected were the sRNAs derived from
highly structured precursors such as pre-miRNAs. A likely explanation for the strength
of the result in this direction is the possible inclusion of other classes of structural RNA
in the control cohort, skewing it to actually fold better than the sRNAs on average.
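For readers unfamiliar with the rank sum test used here, the following sketch shows one way to compute it. It is an illustration only: the text does not say which implementation produced the reported P-values, and this version uses the normal approximation without tie or continuity correction, which is an assumption made for brevity. The function name `mann_whitney_u` is hypothetical.

```python
import math


def mann_whitney_u(x, y):
    """Return (U statistic for sample x, two-sided p-value).

    The p-value uses the large-sample normal approximation with no
    tie or continuity correction (a simplifying assumption).
    """
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        # Extend j over a run of tied values and give them the mean rank.
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[k] = mean_rank
        i = j + 1
    rank_sum_x = sum(r for r, (_, grp) in zip(ranks, pooled) if grp == 0)
    n1, n2 = len(x), len(y)
    u = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return u, p_two_sided
```

In this setting `x` would hold the best-window energies of the cloned sRNAs and `y` those of the control cohort; a small U for `x` means the sRNA energies tend to rank below the controls, and a large U means they rank above.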
To positively control for the ability of this method to identify microRNA-like secondary
structure, it was applied to known microRNAs from the plant Arabidopsis thaliana. The
set of 114 known Arabidopsis microRNAs was downloaded from miRBase (RFAM
release 7, 7/12/2005: (20, 21)), and a control cohort was selected from the Arabidopsis
genome, preserving sequence length and composition as before. Because secondary
structures in plant pre-miRNAs commonly extend beyond 150 nt, this analysis was
performed only with a window size of 350 nt.
[Figure 2 plot: Minimal Flanking Folding Energies for Arabidopsis miRNAs and Controls (window size: 350 nt). Series: control sequences (~80 per miRNA); Arabidopsis miRNAs (N=114). x-axis: folding energy of best-folding window, Kcal/mol.]
Figure 2. Best-folding 350nt sequence windows containing Arabidopsis miRNAs (clear) fold on
average with significantly lower MFE than those of control cohorts (gray). MicroRNA mean:
-103.5 Kcal/mol; control cohort mean: -91.77 Kcal/mol.
As shown in Figure 2, the Arabidopsis miRNAs fold with significantly lower MFE than
their control cohorts (P < 2.0 x 10^-17). In fact, this comparison likely underestimates the
difference in folding energies between the miRNAs and controls, as no attempt was made
to exclude other classes of structural RNA from the control sequences.
We have presented a screen for RNA secondary structure potential in sequences flanking
a set of uncharacterized loci. This screen is fully consistent with the null hypothesis that
sequences flanking Tetrahymena ~23nt sRNAs have no greater folding potential than
expected given their composition and length. Despite the diversity of organisms in which
microRNAs are found, a unifying requirement for their processing is hairpin-loop
secondary structure, and so we conclude that the ~23nt sRNAs of Tetrahymena are not
microRNAs.
2.4. Screen for trans-Targeting Potential
2.4.1. Overview
We next investigated the possibility that the Tetrahymena ~23nt sRNAs target messages
for post-transcriptional repression in the same manner as do the plant tasiRNAs. These
sRNAs bear several similarities to the tasiRNAs, including their organization into several
clusters, strand bias within each cluster, and compositional features. This resemblance,
however, is imperfect: the tasiRNAs show a strong but decidedly incomplete strand
bias, and they are derived in a near-perfect 21 nt phase from their precursor transcripts,
whereas the ~23nt sRNAs show no discernible phase.
Previous computational studies (24, 25) have reported detection of signal above noise for
target sites to mammalian microRNAs, and the signal for endogenous short RNAs in
plants is likely even more evident given their more stringent targeting requirements. We
attempted to detect a similar signal in Tetrahymena for widespread gene targeting by the
~23nt sRNAs.
2.4.2. Methods
To test the trans-targeting potential of the Tetrahymena sRNAs, we searched the genome
for messages with target sites for the 107 sRNA sequences and compared the results to
those obtained by searching with a randomized cohort of sequences. The alignment
program wublastn (26) was used to search the genome for all instances of each sRNA's
reverse complement (settings "dbgcode=6 C=6 W=3 E=1e-7 hspsepSmax=20
hspsepQmax=20 qframe=1 hspmax=2000 V=2000 B=2000"). As interfering RNA has
been shown to target messages with imperfect complementarity in both animals and
plants, up to 8 mismatches and gaps were allowed. Prior to searching, the program
DUST was applied to the genome in order to filter out a potentially large number of hits
to low complexity sequence. Even after filtering, five sRNAs with very low %(G+C)
yielded extensive hits to repetitive elements and were excluded from further analysis.
Lastly, hits were mapped by genomic coordinate to the TIGR-predicted exonic, intronic,
and intergenic regions. Hits overlapping any sRNA's own locus of origin were excluded
and not counted.
A first-order Markov chain model was created from each cloned sRNA sequence and
used to generate a control cohort of sequences with dinucleotide composition equal in
expectation to that of the sRNA. This model comprises four fully-connected states,
each emitting one RNA nucleotide, and a silent start state with one outgoing edge to
each RNA-emitting state. Transition probabilities between states were set using the
conditional probabilities of each dinucleotide within the cloned sRNA sequence:

$P(Y \mid X) = f_{Y|X} = f_{XY} / \sum_{Y'} f_{XY'}$,

where $f_{XY}$ is the observed frequency of dinucleotide XY in the sRNA.
Across all models, transition probabilities from the start
state were set to the pooled mononucleotide frequencies of the sRNA sequences in order
to capture their strong bias towards U in the first position. Because control sequences
equal in length to their paired sRNAs were desired, no end state or other length
distribution was used, and sequences emitted by the chain model were simply trimmed to
have the same length as the appropriate sRNA.
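A minimal sketch of such a first-order chain is given below. The function names are hypothetical, and one detail is an assumption not stated in the text: when a nucleotide has no observed successor in the sRNA (it occurs only at the final position), the sketch falls back to the start distribution.

```python
import random
from collections import defaultdict


def transition_probs(seq: str) -> dict:
    """Estimate P(Y | X) = f_XY / sum over Y' of f_XY' from one sequence."""
    counts = defaultdict(lambda: defaultdict(int))
    for x, y in zip(seq, seq[1:]):
        counts[x][y] += 1
    return {
        x: {y: c / sum(nxt.values()) for y, c in nxt.items()}
        for x, nxt in counts.items()
    }


def emit(model: dict, start_probs: dict, length: int,
         rng: random.Random) -> str:
    """Emit one sequence of exactly `length` nucleotides from the chain."""
    def draw(dist: dict) -> str:
        letters = sorted(dist)
        return rng.choices(letters, weights=[dist[l] for l in letters])[0]

    out = [draw(start_probs)]
    while len(out) < length:
        nxt = model.get(out[-1])
        # Assumption: states with no observed successor reuse the start
        # distribution so that emission can continue.
        out.append(draw(nxt if nxt else start_probs))
    return "".join(out)


def control_cohort(srna: str, start_probs: dict, n: int,
                   rng: random.Random) -> list:
    """Generate n control sequences equal in length to the sRNA."""
    model = transition_probs(srna)
    return [emit(model, start_probs, len(srna), rng) for _ in range(n)]
```

Because emission stops at a fixed length, no end state is needed, which matches the description above; `start_probs` would be the pooled distribution shared across all sRNA models.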
Each sRNA sequence model was used to emit 500 sequences comprising the control
cohort for that sRNA. The DUST-filtered genomic sequence was searched as before for the
reverse complement of each individual control sequence, and the results were grouped by
corresponding sRNA and mapped to genomic features. As the control sequences were
randomly generated rather than being drawn from the genome, they have no locus of
origin per se and so no hits from the control sequences were excluded.
2.4.3. Results and Discussion
Reverse complement hit counts to different genomic features are shown in Table 3.
Imposing a higher stringency level by allowing fewer mismatched or gapped bases in the
hit alignments naturally resulted in fewer hits. Across all stringency levels examined,
cloned sequences had on average more hits in both orientations to every genomic feature
than their controls did. This difference was greater than two-fold for nearly every
combination of stringency level and hit type, suggesting some similarity to distant loci
beyond the controlled attributes of length and mono- or di-nucleotide content.
Table 3. Hit counts from 102 cloned sRNAs and control cohorts categorized
down rows by type and orientation of genomic feature hit, and across columns by
stringency level (maximum number of mismatches or gaps allowed). Hits from
five cloned sequences closely resembling low-complexity repeats and their cohorts
were excluded.
                                  Hits from cloned RNAs          Hits from randomized controls
                                  (excluded five cloned seqs;    (average over 500 controls
                                  excluded hits to exact         per included cloned RNA)
                                  locus of origin)
                                  Stringency level               Stringency level
Hit type                             4       3       2              4        3        2
Antisense to exon                 1546     388      28          685.4    179.9     8.86
Antisense hits to
  exon/intron boundary             187      46       3          78.02    20.01     0.89
Sense hits to exon                 863     189       6          426.3    104.9     4.85
Sense hits to
  exon/intron boundary             102      33       4          46.25    11.54     0.54
Intergenic hits                   3585     946      57           1522    409.0    19.51
Total hits                        6283    1602      98           2758    725.4    34.64
The distributions of hit counts for individual sRNAs and controls are shown in Figure 3.
Notably, at all stringency levels, the number of antisense hits to exons from cloned RNAs
is near or above the third quartile of antisense hits from controls; for no other category of
hits is this the case. Although this disparity may suggest a trans-targeting property of the
sRNAs, a simpler explanation is that it reflects a protein coding sequence on the
sense strand opposite the sRNA loci. Randomized controls are controlled for di- but not
tri-nucleotide composition and therefore do not reflect the codon biases seen in
protein-coding portions of the Tetrahymena genome. If the sRNAs are largely found antisense to
protein-coding genes, this might be reflected in a greater number of hits antisense to other
protein-coding regions of the genome relative to controls.
Count distributions of hits to different genomic elements for 102 Tetrahymena
sRNAs and controls
Figure 3. Hit count distributions for sRNAs and controls to different genomic features. Cloned
sRNAs' antisense hits to genes exceeded those of controls across all stringency levels.
Although cloned sequences had overall more hits to genic regions than controls, among
genic hits there was no detectable disparity between antisense and sense orientation: at
stringency levels of 2 and 3 the difference in preferences towards antisense versus sense
genic hits was insignificant (P < 0.20 and P < 0.13, respectively; chi-squared). At
stringency level 4, the bias towards antisense hits compared to controls was significant
(P<0.01) but in terms of actual hit counts represented an average of less than one hit
above expectation per sRNA.
The short RNAs likewise displayed no significant preference to hit either genic or
intergenic sequence relative to their controls. The cloned sRNAs had very slightly fewer
hits to genes than expected at stringency levels 4 and 3 (P<0.0044 and P<0.034,
respectively). At stringency level 2, the difference was insignificant. As before, these
differences were marginal, in this case averaging to less than two hits below expectation
per cloned sequence.
Although these sRNAs had in sum more reverse-complement hits to the genome than did
their controls, within their respective target sets there was neither any significant bias
above noise for antisense hits to genes nor for genic versus intergenic regions. Thus, as a
group, the ~23 nt sRNAs of Tetrahymena appear unlikely to target messages in trans by
imperfect pairing to reverse-complement sites in the same manner as classical modes of
RNA interference.
2.5. Small RNA Cluster Sequences are Related
2.5.1 Overview
The clustered organization of the ~23nt sRNAs cloned from Tetrahymena led us to
investigate whether any particular sequence feature triggered their expression and
subsequent processing. Indeed, simple pairwise BLAST alignments reveal a high degree
of homology between and within the clusters, in most cases far exceeding homology to
loci not containing observed sRNAs. Lee et al. (1) used this homology to group the
sRNA clusters into three families; we slightly modify and extend these family definitions
(Appendix Table A2).
Extant clusters' grouping into related families, within some of which are shared >1 kb
blocks of near perfect identity, suggests duplication and divergence from a common set
of ancestors. Such sequence duplication is a common event in all organisms, and is
thought to be a necessary step in evolution of new function. Duplicated gene copies,
called "pseudogenes", are free from selective pressure and may accumulate mutations
without phenotypic effect. Most such mutations would be deleterious if not for the
original copy of the gene, leading most pseudogenes to be eventually lost. In some
rare cases, however, the sequence may undergo a so-called "gain-of-function" mutation
which confers new or enhanced function upon its encoded protein. When a
newly-functional pseudogene undergoes such a change, it again comes under selective pressure
to remain in the genome, becomes fixed as a related but distinct copy of the original gene,
and is referred to as a paralog. More drastic events such as the loss, duplication, and
rearrangement of entire blocks of sequence also shape paralogs.
We measured the genome-wide network of gene paralogy in Tetrahymena and used it as
a baseline to assess the degree of relatedness of genes in different sRNA clusters as well
as those within the same cluster. We hypothesize that a series of divergence events led
from several ancestral sequences to eight of the twelve observed sRNA clusters. Lastly,
we present evidence that the clusters are related to other loci which may be centers of
as-yet unobserved ~23 nt sRNA synthesis.
2.5.2. Methods
2.5.2.1 Paralog Gene Families
The protein-coding region (cds) for each of the 27,306 predicted macronuclear genes was
extracted and aligned to the genome using wu-tblastx to find putative paralogs. Even at
seemingly significant E-values (e.g., 10.- < E < 10-25), many alignments did not seem
reflective of true protein-coding homology when subjected to manual examination. In
particular, there were numerous long alignments with matches or positives sparsely
distributed along their length. The matches in these alignments were enriched for a
limited set of amino acids (primarily asparagine, lysine, and glutamine) with A+U-rich
codons not reflective of overall transcriptome composition. We hypothesized that the
genome's heterogeneity in A+T% might be causing wu-tblastx to locally underestimate
the background score distribution, artificially improving (lowering) reported E-values of
these hits.
A stringent set of filters was applied in order to remove as many spurious alignments as
possible. Two predicted genes are considered paralogous under this filtering if an exon
from one was aligned to an exon from the other with a BLAST E-value of less than 10^-10
and the alignment had at least one 70-aa window of at least 55% identity and at least 80%
positive+identity characters. Each exon was also required to be overlapped by the
alignment by the lesser of 100 bp or 50% of the exon's length. Lastly, each
corresponding pair of splice sites was required to align within a distance of 2,000 bp. As
many of the exon boundaries may be incorrectly predicted or simply missed, we expect
some loss of sensitivity to result from the requirements that aligned genes' exons overlap
in this manner.
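The filter criteria above can be written as a single predicate. The sketch below is illustrative only: the record type `ExonAlignment` and its field names are hypothetical, and it assumes the per-alignment statistics (best 70-aa window scores, overlap lengths, splice-site offsets) have already been computed from the wu-tblastx output.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ExonAlignment:
    """Summary of one exon-to-exon alignment (hypothetical record layout)."""
    e_value: float
    window_identity: float      # fraction identical in the best 70-aa window
    window_positive: float      # fraction identical or positive in that window
    overlap_bp_a: int           # bases of exon A covered by the alignment
    overlap_bp_b: int           # bases of exon B covered by the alignment
    exon_len_a: int
    exon_len_b: int
    splice_site_offsets_bp: Sequence[int]  # offsets between paired splice sites


def required_overlap(exon_len: int) -> float:
    """Lesser of 100 bp or 50% of the exon's length."""
    return min(100, 0.5 * exon_len)


def passes_filters(aln: ExonAlignment) -> bool:
    """Apply the E-value, window, overlap and splice-site criteria."""
    return (
        aln.e_value < 1e-10
        and aln.window_identity >= 0.55
        and aln.window_positive >= 0.80
        and aln.overlap_bp_a >= required_overlap(aln.exon_len_a)
        and aln.overlap_bp_b >= required_overlap(aln.exon_len_b)
        and all(abs(d) <= 2000 for d in aln.splice_site_offsets_bp)
    )
```

Keeping every criterion in one conjunction makes it easy to see which single filter an alignment fails, which is how the "probable paralog" cases discussed later would be identified.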
2.5.3. Results
2.5.3.1. Gene Prediction Verification
In order to estimate the accuracy of the TIGR gene predictions, we obtained cDNA
sequences of several manually-curated genes and compared their genomic alignments
with their predicted counterparts. We surveyed only sequences deposited in GenBank
after the gene predictions were issued (May, 2005), as older sequences may have been
used as training data during gene prediction.
In all cases examined, one or more gene(s) were predicted to overlap the curated genes'
verified locations. Given that many Tetrahymena genes lack homology to known protein
families, it appears that the gene finder was fairly sensitive. However, each gene model
we examined differed from the curated gene models, often by the inclusion of apparently
spurious exons. A typical example was the histone H2 variant A gene (H2A.Z) shown in
Figure 4, where the predicted gene structure includes two 5' exons not annotated in the
curated gene structure. Although such differences could reflect legitimate alternative
spliceforms not deposited in GenBank, they may well simply be
erroneous predictions. Unfortunately, neither the predicted genes near ~23 nt
sRNA loci nor their paralogs could be verified by EST or manually-curated gene model
evidence.
[Figure 4 image: Comparison of Predicted and Curated Annotation for Gene H2A.Z, MAC scaffold CH445731. Gene Predictions track: 11.m00369 (protein, no similarity or motifs detected); 11.m00370 (histone variant H2A.Z). GenBank Sequence Alignments track: X15548, Tetrahymena thermophila hv1 gene for histone H2 variant.]
Figure 4. Comparison of predicted and manually-curated models for gene H2A.Z. The predicted
gene structure includes two apparently spurious 5' exons.
2.5.3.2. Paralog Gene Families
Using wu-tblastx alignments and filtering for conservation of likely coding sequence, we
inferred families of paralogs among the predicted macronuclear genes in Tetrahymena.
Alignment of the 27,306 predicted genes against the genome and subsequent filtering
yielded 215,379 hits between exon pairs. Mapping these to gene pairs and discarding
redundant matches, we arrived at a list of 27,491 unique paralogy relationships between
predicted genes, with under a fifth (5,348) of all predicted genes having at least one
paralog.
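The step from exon-pair hits to unique gene-pair relationships amounts to a mapping followed by de-duplication. A small sketch follows; the identifiers and the `exon_to_gene` lookup are hypothetical, and dropping hits between exons of the same gene is an assumption about how redundant matches were discarded.

```python
def unique_gene_pairs(exon_hits, exon_to_gene):
    """Collapse (exon_a, exon_b) hits into unordered gene pairs."""
    pairs = set()
    for exon_a, exon_b in exon_hits:
        gene_a, gene_b = exon_to_gene[exon_a], exon_to_gene[exon_b]
        if gene_a != gene_b:  # assumption: ignore hits within one gene
            pairs.add(frozenset((gene_a, gene_b)))
    return pairs


def genes_with_paralog(pairs):
    """Return the set of genes that appear in at least one pair."""
    genes = set()
    for pair in pairs:
        genes.update(pair)
    return genes
```

Using `frozenset` for each pair makes the relationship unordered, so reciprocal hits (A to B and B to A) and multiple exon hits between the same two genes count once.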
2.5.3.3. Small RNA Clusters Are Highly Interrelated
Initial BLAST analysis showed that several genes at different sRNA cluster loci are
related to each other, suggesting that a common feature may trigger sRNA synthesis. The
genome-wide collection of paralogs includes 8 unique pairs of genes overlapping the
sRNA clusters (designated "predicted paralog" in Table A2). In all eight cases, the two
paralogous genes are unlinked, each overlapping a different sRNA cluster. This is a very
significantly higher degree of paralogy (P ≈ 0, chi-squared) between different clusters
than would be expected had the clusters been placed antisense to genes at random,
controlling for the fact that some sRNA clusters span multiple adjacent genes and are
thus more likely to overlap paralogous genes.
Four families of sRNA-associated genes are listed in Table A2. These combine paralogy
relationships discovered through automated alignment and filtering analysis (designated
"predicted paralog"), as well as those inferred from manual examination of
sRNA-associated genes' alignments (designated "probable paralog"). In the latter case, the
strength of alignment supports a likely paralogy relationship but one or more of the
alignment filters failed. In addition to the BLAST E-value statistic, which can be
misleading, this table lists the percent similarity (% ident, % ident+positives) of the best
360-nt window overlapping the predicted cds of both genes by the lesser of 100 bp or
50% of the overlapped exon's length.
Our paralogy predictions differ somewhat from those of Lee et al. First, we do not find
strong evidence for paralogy between genes within most clusters. Though alignment
results included such intra-cluster matches, the alignments failed automated quality filters
and most appeared to be artifactual upon manual inspection. We group the
sRNA-associated clusters in the same families as do Lee et al., except that we exclude sRNA
cluster 0 from Family I on the basis of weak alignment data. We also include a novel
family (IV) which includes sRNA cluster 2 and putative paralogs. Despite the
differences, both the paralogy relationships we find and those reported by Lee et al. reflect
stronger paralogy between separate clusters than within each, and both suggest an ancient
tandem duplication event and subsequent divergence followed by much more recent
duplication to numerous distant sites.
2.5.3.4. Adjacent Genes Generally Tend to be Paralogous
Whereas the degree of paralogy between unlinked sRNA clusters clearly exceeds
expectation, we find that intra-cluster paralogy is a genome-wide phenomenon not
specific to sRNA clusters. Clusters of genes along the same strand separated by
intergenic stretches of less than 1 kb, such as those overlapping the sRNA clusters, are
found throughout the MAC genome. Such gene clusters contain between 2 and 30 genes
and in sum include over two-thirds of all predicted macronuclear genes. Moreover, gene
clusters are very significantly enriched for paralogous sets of genes. There are
$\binom{27306}{2} \approx 3.7 \times 10^{8}$ possible unique paralogy relationships between
predicted genes, of which an estimated 27,491 are realized. Of these paralog pairs, 2,766
have both genes in the same cluster, leaving 24,520 pairs where genes are in different
clusters (including the case where either or both clusters are of size 1). The total number
of possible paralogy relationships between two genes within the same cluster is
$\sum_{k \ge 2} x_k \binom{k}{2} = 32803$, where $x_k$ is
the number of gene clusters of size k. The chi-square test strongly rejects the hypothesis
that paralogy relationships are distributed homogeneously between gene pairs contained
within a cluster and all other pairs. This tendency of paralogs to reside near each other
in clusters is likely reflective of the higher incidence of tandem duplication compared to
duplication at distant loci.
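The counting argument above can be checked numerically. The sketch below uses the counts exactly as quoted in the text and a standard Pearson chi-square statistic for a 2x2 table; whether the original test used this exact formulation is an assumption, and the names `chi2_2x2` and `within_cluster_pairs` are hypothetical.

```python
import math


def chi2_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square statistic (1 d.f.) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def within_cluster_pairs(cluster_size_counts: dict) -> int:
    """Sum over k of x_k * C(k, 2), where x_k = number of clusters of size k."""
    return sum(x_k * math.comb(k, 2) for k, x_k in cluster_size_counts.items())


# Counts as quoted in the text.
total_pairs = math.comb(27306, 2)        # all possible gene pairs
within_possible = 32803                  # possible pairs inside one cluster
within_realized = 2766                   # paralog pairs inside one cluster
other_possible = total_pairs - within_possible
other_realized = 24520                   # paralog pairs across clusters

stat = chi2_2x2(
    within_realized, within_possible - within_realized,
    other_realized, other_possible - other_realized,
)
```

Rows of the table are "within a cluster" versus "all other pairs", and columns are "paralogous" versus "not paralogous". A statistic this far above the 1-d.f. critical values corresponds to the strong rejection of homogeneity described above.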
2.5.3.5. Genes Overlapping sRNA Clusters Have Average Overall Paralog Counts
Additionally considering paralogs to other loci, genes overlapping sRNA clusters have
somewhat higher numbers of paralogs (Mann-Whitney; P < 0.0398) than other genes in
the genome. Because the background distribution of paralog counts is likely shifted
downward by spurious gene predictions having no paralogs, and given the marginal
significance of this result to begin with, it seems likely that the genes overlapping sRNA
clusters do not individually have significantly more paralogs than other genes around the
genome. That sRNA-yielding genes' paralogy is enriched more significantly amongst
themselves than with other genes suggests that some shared feature triggers antisense
sRNA synthesis at those locations rather than paralogy or high copy number per se.
2.5.3.6. Rearrangement of Related sRNA Clusters
In at least one case, synteny is not preserved for paralogous genes across sRNA clusters.
Clusters 4 and 6 share extensive homology, perhaps following large-scale rearrangement
and insertion as suggested by the alignment visualization in Figure 5. Predicted gene
4800 within cluster 4 is closely homologous to predicted gene 8863 of cluster 6, and
looking upstream in cluster 4 but downstream in cluster 6, predicted genes 4802 and 8862
are largely homologous with the exception of an ~0.9kb indel. This rearrangement could
be explained by tandem duplication of three genes (ABC -> ABCABC) followed by gene
loss and divergence (ABCABC -> C'A').
[Figure 5 image: Rearrangement of sRNA-Cognate Gene Clusters. Upper track: sRNA Cluster 4; lower track: sRNA Cluster 6.]
Figure 5. Alignment visualization generated by the program GATA (27) reveals rearrangement
between sRNA clusters 4 and 6. Predicted gene models are shown above and below the
alignment; connecting lines indicate pairwise blastn hits and are shaded by the strength of
homology.
2.5.3.7. Individual sRNA positions are not conserved
Positions of the ~23nt sRNAs did not appear to be conserved between pairs of related
clusters. Excluding the cases where a single cloned sRNA maps to multiple different
clusters and may have originated from any subset of them, we found only two pairs of
sRNAs whose aligned positions partially overlapped sRNAs of another cluster. Not only
did the sRNAs of related clusters not tend to overlap each other in alignments, they
showed no tendency to localize to regions of high sequence conservation between the two
clusters. Their lack of positional conservation despite sequence homology suggests that
to the extent that can be ascertained by the given sample, the sRNAs are randomly
distributed within each cluster locus. In turn, this suggests that while some shared feature
may serve to define each cluster, no shared feature within the cluster denotes the
individual sRNA positions.
2.5.3.8. Paralogy to non-sRNA Associated Loci
Lastly, we obtained a list of sRNA cluster paralogs to which no cloned sequences
mapped. We hypothesize that some of these paralogs may have sRNA-generating
potential but for some reason were not observed in the cloned sample, due perhaps to
transcriptional inactivity or some other cellular state. Such paralogs of sRNA-associated
clusters, themselves lacking observed sRNAs, are highlighted in Table A2.
Small RNA cluster 2 had especially many paralogs to loci not associated with cloned
sRNAs; we grouped this cluster and its paralogs into Family IV. Genes and flanking
sequences adjacent to the sRNAs of cluster 2 are very closely duplicated on several other
scaffolds. The two most closely duplicated regions are the first ~6.5 kbp of scaffold
8254572, running from the start of the scaffold, including gene 23409, and ending before
predicted gene 23310, and the first ~1.5 kbp of predicted gene 23312. Interestingly, four
of the five sites of near-perfect duplication are at or near the 5' ends of their respective
scaffolds, suggesting that the duplicated sequences could be located near repetitive
elements or MIC chromosomal breakage sites.
2.5.4. Discussion
It is clear from the sRNA clusters' paralogy that they arose from a group of common
ancestral sequences. The exact series of divergence events leading to each extant cluster
is unknown but includes numerous rearrangement and duplication steps. The near-complete
incidence of sRNAs within each cluster's set of paralogs suggests that some
shared sequence feature confers their sRNA-generating potential. In contrast to the
homology between related clusters, we find little or no conservation of individual
sRNAs' positions and propose that they are synthesized more or less randomly within
each cluster. Lastly, we examine the set of exceptions: the few strong sRNA cluster
paralogs to which no observed sRNAs mapped. In the next section, we will search for
shared sequence features of the sRNA clusters and assess their presence in this set of
paralogs.
2.6. A Novel Sequence Motif Is Strongly and Specifically Associated
with ~23 nt sRNA Activity
2.6.1. Overview
Based on the strong and specific homology between sRNA clusters, we hypothesized that
they share a motif responsible for directing nearby sRNA biogenesis. A diverse range of
cellular pathways and functions are directed in cis at genomic and transcriptional loci by
the presence of DNA and RNA sequence motifs. Such motifs commonly serve to
specifically bind the protein factors that actually effect processes such as transcriptional
initiation, alternative splicing, nuclear export, and post-transcriptional repression.
The presence of a sequence motif, namely a variable-length, nearly homopolymeric
stretch of adenosine bases ("A-rich tracts") opposite the sRNA clusters on the predicted
coding strand, was suggested during the course of this study (K. Collins, personal
communication). The existence of this motif near the sRNA clusters was later reported
(1), but its specificity to these loci and its functional role were not explored.
The genomic loci of A-rich motif instances near sRNA clusters are shown in Appendix
Table A3. Motif positions do not appear to be conserved between paralogous sRNA
cluster pairs, suggesting that this shared feature could have arisen convergently rather
than simply by duplication.
We searched systematically for all motifs overrepresented among sequences flanking the
small RNA loci. To determine the specificity of any resulting motifs to sRNA-associated
loci, we derived a randomized control set of predicted genes and performed a similar
motif search. We next developed a model to capture the important elements of the A-rich
tracts and applied it genome-wide to more rigorously demonstrate their specific
association with sRNA activity. Lastly, we predict a set of loci strongly associated with
this motif at which sRNA activity has not yet been observed but may be likely under the
motif's hypothesized function.
2.6.2. Methods
2.6.2.1. General Motif-Finding at sRNA Clusters
The motif-finding software MEME (28) was applied to probe for overrepresented motifs
among the sRNA cluster sequences, including but in no way favoring the A-rich tracts.
The full genomic sequences and downstream flanking regions (including putative 3'
UTRs) of all predicted genes overlapping each of the 10 sRNA clusters were extracted.
Two clusters that did not overlap any gene predictions were excluded, and the resulting
18 sequences were input to MEME. For this and subsequent searches, the "zoops" model
option was used to direct MEME to find either zero or one motif instance per sequence
in order to preclude artifactual motifs that could arise were the software constrained to
find a motif hit in every sequence of the input. Additionally, average genomic mono- and
dinucleotide frequencies were supplied to MEME to ensure that the motifs' expectations
of occurrence were properly estimated.
Motifs reported by MEME were assigned "strength" and "quality" scores. Motif strength
was taken as the negative log of the E-value reported by MEME, thus reflecting the
degree to which the reported motif is overrepresented among the input sequences.
Motif quality is taken as the number of input sequences reported by MEME to match a
given motif. Low-quality motifs - those with instances in only a few of the inputs - were
generally artifacts reflecting those sequences' paralogy.
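
The two scores defined above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the list of per-site sequence identifiers are hypothetical, and base 10 is assumed for the logarithm because the text does not specify a base.

```python
import math

def motif_strength(e_value):
    """Strength: negative log of the E-value reported for a motif.

    Base 10 is an assumption; the text says only "negative log".
    """
    return -math.log10(e_value)

def motif_quality(site_sequence_ids):
    """Quality: number of distinct input sequences with a motif instance.

    site_sequence_ids is a hypothetical list holding, for each reported
    motif site, the identifier of the input sequence containing it.
    """
    return len(set(site_sequence_ids))
```

Counting distinct identifiers keeps the quality score correct even if a motif model that allows repeated sites per sequence is used.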
2.6.2.2. Motif-Finding Controls
A similar motif search was performed among a set of control sequences to provide a
comparative basis for the strength and specificity of any sRNA-associated motifs.
As the input set of 18 genes overlapping sRNA clusters included several sets of adjacent
genes, a randomized cohort was selected to control for the effects of paralogy between
nearby genes by drawing equally-sized sets of genes from gene clusters of corresponding
counts. For example, the sRNA cluster on scaffold 8254600 overlaps three adjacent
predicted genes, so for its corresponding control, a cluster of three or more genes was
randomly taken from the set of all such gene clusters, and then a contiguous block of
three genes was selected randomly from that cluster. This process was repeated 20,000
times, each time yielding a control set of 18 genes. MEME was then run on each control
set using the same options as for the sRNA-associated genes. The resulting motif
strength and quality scores comprise the background joint distribution over these
statistics for motifs found within and downstream of predicted genes in Tetrahymena.
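
The sampling of one control set can be sketched as follows. This is a minimal illustration, not the code used in this study: gene clusters are assumed to be available as lists of gene identifiers in genomic order, block_sizes is assumed to list the sizes of the adjacent-gene blocks among the sRNA-cognate genes, and the sketch does not prevent the same gene from being drawn twice, a detail the description above leaves open.

```python
import random

def sample_control_set(gene_clusters, block_sizes, rng):
    """Draw one control set of genes.

    For each block size k: pick a random gene cluster containing at least
    k genes, then a random contiguous run of k genes from that cluster.
    Raises IndexError if no cluster is large enough for some k.
    """
    control = []
    for k in block_sizes:
        eligible = [c for c in gene_clusters if len(c) >= k]
        cluster = rng.choice(eligible)
        start = rng.randrange(len(cluster) - k + 1)
        control.extend(cluster[start:start + k])
    return control
```

Repeating the call 20,000 times with block sizes summing to 18 would yield control sets of the kind described, each of which would then be passed to the motif finder.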
2.6.2.3. Discriminative Motif Modeling
2.6.2.3.1. A-Rich Motif Model Architecture
MEME employs a fixed-length positional weight matrix (PWM) to model sequence
motifs and was ill-suited to describe the putatively sRNA-associated A-rich tracts
because their lengths varied widely. In place of MEME, a custom Hidden Markov Model
(HMM) was developed to better describe this motif. The HMM state architecture (shown
in Figure 6) was composed of a background sub-model (states named "B_...") and a motif
sub-model (states "M_..."). The former was a first-order Markov chain with five fully-connected
nodes (one for each RNA nucleotide plus the ambiguity character "N"). The
background sub-model was fully unrestricted in transitions and used to collectively
model all non-motif sequence.
The motif sub-model was designed to emit sequences resembling the A-rich motif
instances observed near sRNA clusters. On the strand opposite the sRNAs, these motif
instances were generally composed of runs of A's separated by intervening mono- and
dinucleotides. By disallowing certain transitions, the motif sub-model was designed to
emit with nonzero probability any sequence matching the pattern $(A_{\geq 2}\,[CGU]_{1\text{-}2})_{\geq 1}$. The
motif sub-model contains six nodes: an entry node M0, an "A-repeat" node M1, and lastly
nodes M2X (X ∈ {C,G,U}) and M3, which model intervening mono- or dinucleotides.
[Figure 6 (state diagram not recoverable from extraction). Title: A-Rich Motif Hidden Markov Model.]
Figure 6. HMM architecture for discriminative modeling of putatively sRNA-associated A-rich
motifs and all other macronuclear genomic sequence. Circles denote states and edges
denote transitions having nonzero probability; transitions between motif states are shown in
red. The red character(s) in each non-silent state are the emissible characters at that state.
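
The set of sequences the motif sub-model can emit is easier to see as a regular expression. The sketch below assumes the pattern is read as one or more units, each a run of at least two A's followed by one or two non-A bases, and that the input is already in the RNA alphabet. It captures only which strings have nonzero probability, not the probabilities the HMM assigns, and the function name is hypothetical.

```python
import re

# One or more units of: a run of >= 2 A's, then 1 or 2 bases from {C, G, U}.
A_RICH_PATTERN = re.compile(r"(?:A{2,}[CGU]{1,2})+")

def find_a_rich_tracts(rna, min_length=0):
    """Return half-open (start, end) intervals of non-overlapping matches.

    min_length filters out short matches; it is an illustrative parameter,
    not one described in the text.
    """
    return [
        (m.start(), m.end())
        for m in A_RICH_PATTERN.finditer(rna)
        if m.end() - m.start() >= min_length
    ]
```

In an A+U-rich genome such a pattern matches very often, which is one reason a probabilistic model with a competing background, rather than a plain pattern match, is needed to score candidate tracts.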
2.6.2.3.2. A-Rich Motif Model Training
The model was trained on a fully-labeled version of the same 18 coding and downstream
sequences of sRNA-cognate genes as previously input to MEME. Fully-labeled
sequences were obtained by assigning A-rich regions previously identified by MEME to
the motif sub-model and every remaining region to the background sub-model. Given
such a partition of each input sequence into disjoint motif and background regions, a
unique parse exists, allowing the sequence to be unambiguously labeled. This labeling
method was accommodated by the particular model architecture used and the restriction
of certain transition and emission probabilities. In general, this is not possible with
HMMs, and training would otherwise be performed using an iterative "missing-data"
approach such as the Baum-Welch algorithm.
Several parameters were fixed prior to training in order to avoid over-fitting
characteristics specific to the inputs, thus hopefully improving generalization
performance. In particular, transition probabilities within the background sub-model
were set equal to the corresponding genome-wide dinucleotide frequencies conditioned
on the first base. This was done to avoid biasing the background sub-model towards the
highly paralogous sRNA-cognate genes in the input set. Additionally, transition
probabilities from the start state and to the end state were set equal to the corresponding
average genomic mononucleotide frequencies.
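
Conditioning dinucleotide frequencies on the first base amounts to normalizing each row of the dinucleotide table. The following sketch shows that step; the function name and the dictionary layout (two-character strings mapped to counts or frequencies) are assumptions made for illustration.

```python
def conditional_transitions(dinuc_freq):
    """Convert dinucleotide counts or frequencies into P(second | first).

    dinuc_freq maps two-character strings such as "AU" to a count or
    frequency. The result maps the same keys to conditional probabilities,
    so that entries sharing a first base sum to one.
    """
    totals = {}
    for pair, value in dinuc_freq.items():
        totals[pair[0]] = totals.get(pair[0], 0.0) + value
    return {
        pair: value / totals[pair[0]]
        for pair, value in dinuc_freq.items()
        if totals[pair[0]] > 0
    }
```

Because these values come from the genome at large rather than from the 18 training sequences, they are unaffected by the paralogy among the sRNA-cognate genes.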
Several motif sub-model parameters were likewise fixed. Intervening dinucleotides were
rarer within motifs than mononucleotides, and so it was not possible to reliably estimate
the composition at their second base (state M3). In particular, this state's emission and
outgoing transition probabilities were set equal to appropriately-conditioned
compositional frequencies drawn from the genome at large.
Viterbi training was then performed on the remaining set of free parameters. Each was set
equal to its maximum likelihood value, i.e., the observed frequencies of each transition or
emission in the labeled training data. For each such trained parameter, a pseudo-count of
1 was added, e.g., for a transition from state M2U to a state t,
$$a_{M_{2U},t} \propto A_{M_{2U},t} + 1,$$
where a denotes the trained frequency and A the observed count. All calculations were
performed in logarithm space to improve performance and numerical stability.
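
Estimation from labeled counts with a pseudo-count, reported in log space, can be sketched as below. The function name and the dictionary of counts are hypothetical; the only assumption about the model is that every allowed outcome of a state, including outcomes never observed in training, appears as a key.

```python
import math

def log_probs_with_pseudocount(counts, pseudocount=1.0):
    """Maximum-likelihood estimate with an added pseudo-count, as logs.

    counts maps each allowed outcome of one state (a transition target or
    an emitted character) to its observed count in the labeled data.
    Outcomes with a count of zero must still be listed so that they
    receive nonzero probability.
    """
    total = sum(counts.values()) + pseudocount * len(counts)
    return {k: math.log((c + pseudocount) / total) for k, c in counts.items()}
```

Working with log probabilities turns the long products along a parse into sums, which avoids floating-point underflow on sequences that are thousands of bases long.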
Leave-one-out cross-validation was performed to assess the stability of motif predictions
in the face of limited training data. The model was re-trained 18 times, each time
excluding a different one of the input sequences. Following each re-training, the Viterbi
(maximum-likelihood) parse of the excluded input sequence was determined and the
intervals labeled as motif hits by the HMM compared with the MEME labeling.
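
The cross-validation loop has a simple generic shape, sketched below. The train and predict callables, the half-open interval convention, and the rule that any overlap counts as rediscovery are all assumptions made for illustration; the toy stand-ins at the bottom are not the HMM and exist only to make the sketch runnable.

```python
import re

def overlaps(a, b):
    """True if half-open intervals a and b share at least one position."""
    return a[0] < b[1] and b[0] < a[1]

def leave_one_out(sequences, reference_hits, train, predict):
    """Retrain with each sequence held out and test rediscovery of its hits.

    sequences: dict name -> sequence string.
    reference_hits: dict name -> list of (start, end) reference intervals.
    Returns dict name -> list of booleans, one per reference interval.
    """
    rediscovered = {}
    for held_out, seq in sequences.items():
        training = {n: s for n, s in sequences.items() if n != held_out}
        model = train(training)
        predicted = predict(model, seq)
        rediscovered[held_out] = [
            any(overlaps(ref, hit) for hit in predicted)
            for ref in reference_hits[held_out]
        ]
    return rediscovered

def toy_train(training):
    """Toy model: the shortest A-run (>= 3) seen in the training sequences."""
    runs = [len(m.group()) for s in training.values()
            for m in re.finditer(r"A{3,}", s)]
    return min(runs) if runs else 3

def toy_predict(min_run, seq):
    """Toy prediction: report A-runs at least as long as the learned minimum."""
    return [(m.start(), m.end()) for m in re.finditer(r"A{%d,}" % min_run, seq)]
```

A reference hit that is missed only when its own sequence is held out, as can happen with the toy model, is the kind of instability this procedure is meant to expose.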
2.6.2.3.3. Genome-Wide A-Rich Motif Prediction and Scoring
Finally, the HMM was applied to every gene in the Tetrahymena MAC genome in order
to search for additional instances of the A-rich motif. For each of the 27,306 gene
predictions, the primary transcript sequence including downstream flanking region was
extracted. Because many predicted genes lack 3' UTR annotations, the downstream
flanking region was taken as the sequence following the given gene's STOP codon and
extending to the next annotated start codon or scaffold end, whichever was closer.
The optimal parse $H_{\mathrm{opt},i} = \arg\max_{p}\{\Pr(H_{p,i})\}$ of each extracted gene and flanking
sequence i was determined by the Viterbi dynamic programming algorithm, and its
likelihood $h_{\mathrm{opt},i}$ was recorded. Additionally, the parse containing only background states
was evaluated for the given sequence and its likelihood $h_{\mathrm{bg},i}$ recorded. For each
sequence, a motif quality score was taken as the log-likelihood ratio of the optimal parse,
including any motif intervals, and the background parse, where no motif subintervals
were allowed, scaled by the number of characters $n_i$ predicted to be motif hits:
$$S_i = \frac{1}{n_i} \lg \frac{h_{\mathrm{opt},i}}{h_{\mathrm{bg},i}}.$$
This score is taken as a metric of motif quality, reflecting the confidence
per predicted motif character in each gene with predicted motif runs. In genes where no
motif runs are predicted ($n_i = 0$), $S_i$ was set equal to zero.
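
The scoring procedure can be illustrated with a deliberately simplified model. The sketch below uses a two-state toy HMM (one background state B and one motif state M) with invented parameters, not the architecture of Figure 6 or its trained values, and it assumes a base-2 logarithm for the ratio. It runs Viterbi in log space over all states, again over the background state alone, and scales the log-likelihood ratio by the number of positions labeled as motif.

```python
import math

def viterbi(seq, states, log_start, log_trans, log_emit):
    """Return (log-likelihood of best path, best path) for a non-empty seq."""
    score = {s: log_start[s] + log_emit[s][seq[0]] for s in states}
    back = []
    for ch in seq[1:]:
        new_score, ptr = {}, {}
        for t in states:
            prev = max(states, key=lambda s: score[s] + log_trans[s][t])
            new_score[t] = score[prev] + log_trans[prev][t] + log_emit[t][ch]
            ptr[t] = prev
        score = new_score
        back.append(ptr)
    last = max(states, key=lambda s: score[s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return score[last], path

def log_table(table):
    """Take natural logs of every probability in a nested dict."""
    return {k: log_table(v) if isinstance(v, dict) else math.log(v)
            for k, v in table.items()}

# Toy parameters (invented for illustration only).
START = log_table({"B": 0.9, "M": 0.1})
TRANS = log_table({"B": {"B": 0.95, "M": 0.05}, "M": {"B": 0.10, "M": 0.90}})
EMIT = log_table({"B": {"A": 0.40, "C": 0.10, "G": 0.10, "U": 0.40},
                  "M": {"A": 0.94, "C": 0.02, "G": 0.02, "U": 0.02}})

def motif_score(seq):
    """Return (score, n): per-motif-base log2 likelihood ratio and motif length."""
    log_h_opt, path = viterbi(seq, ["B", "M"], START, TRANS, EMIT)
    log_h_bg, _ = viterbi(seq, ["B"], START, TRANS, EMIT)  # background-only parse
    n = path.count("M")
    if n == 0:
        return 0.0, 0
    return (log_h_opt - log_h_bg) / (n * math.log(2)), n
```

Because the background-only parse is one of the parses considered by the unrestricted search, the ratio is never negative, and dividing by the motif length keeps long tracts from dominating purely through their size.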
2.6.3. Results
2.6.3.1. General Motif Finding Among sRNA-cognate genes
Using MEME, we identified four motifs within the coding and flanking regions of the
input genes. The strongest motif discovered (E = 6.8e-46) occurred in each one of the 18
inputs and corresponds to the A-rich runs noted near sRNA clusters. A sequence logo
representation of the motif as reported by MEME is shown below in Figure 7.
[Figure 7 (sequence logo image not recoverable from extraction), drawn 5' to 3'.]
Figure 7. Sequence logo of A-rich motifs found by MEME. This logo reflects the motif's
frequency of adenosines, but is based on a position weight matrix and does not capture any
dependence between features within the motif.
The three next strongest motifs reported by MEME were significantly weaker (E >=
4.1e-19). Moreover, two of them hit only within predicted coding regions of 6 pairwise
paralogous genes at sRNA loci and thus are unlikely to be general features of
sRNA-associated sequences. Temporarily excluding the predicted coding sequences and
restricting the search to the downstream sequences of the same 18 genes, MEME again
found a strong A-rich motif (E = 7.8e-55) in each input sequence. For the downstream
sequences, all further motifs found by MEME are weak (E >= 9.8e-10) and have instances
in five or fewer of the inputs. Taken together, these results suggest that runs of
adenosines on the coding strand downstream of predicted genes are the strongest shared
sequence feature of sequences at the sRNA loci.
We next applied MEME in the same manner to the control cohort of genes selected to
have potential for common motifs comparable to that of the sRNA-cognate genes. Motif
strength (E-value) and quality (number of inputs hit) are shown in Figure 8 for MEME
results on 20,000 control sets and the sRNA-cognate genes.
[Figure 8 (plot not recoverable from extraction). Title: Strength vs. Quality for Motifs Found Among sRNA-Cognate Genes and Randomized Controls. X-axis: hit count, best motif (max=18). Legend: control runs (n=20000); sRNA-cognates.]
Figure 8. Joint empirical distribution over strength (-log MEME E-value) and quality
(number of inputs hit) of best (lowest E-value) motif hit reported by MEME for sRNA-associated
and control sets of 18 genes each. Adjacent histograms show marginal
distributions of each statistic. All 18 sRNA-associated genes tested contain a motif
exceeding in strength all those found in more than 7 of 18 genes in every control run.
Out of 20,000 control runs, only two motifs were identified with greater strength (i.e., a
lower E-value) than the motif near sRNA clusters, and those two motifs were only found in
half or fewer of the inputs to MEME. Thus, the A-rich motif does not appear to be a general
feature found downstream of Tetrahymena genes, and is much more strongly associated with
sRNA-cognate regions than motifs found within comparable controls.
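
The comparison against the control distribution can be expressed as two counts over the control runs. The sketch below is illustrative only: the function name and the (strength, hit count) tuple layout are assumptions, and any numbers passed to it in an example would be invented rather than taken from this study.

```python
def compare_to_controls(controls, observed):
    """Count control runs that rival the observed motif.

    controls: list of (strength, hit_count), one per control run, for the
    best motif of that run. observed: (strength, hit_count) of the motif
    of interest.
    Returns (stronger, dominating): the number of controls with strictly
    greater strength, and the number at least as strong that also hit at
    least as many input sequences.
    """
    obs_strength, obs_hits = observed
    stronger = sum(1 for s, _ in controls if s > obs_strength)
    dominating = sum(
        1 for s, h in controls if s >= obs_strength and h >= obs_hits
    )
    return stronger, dominating
```

Reporting both counts separates control motifs that are merely strong, which paralogy within a few inputs can produce, from those that are both strong and present in every input.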
2.6.3.2. Discriminative Modeling of A-Rich Tracts
2.6.3.2.1. Genome-Wide Search Using PWM
We initially attempted to use the MEME-companion tool MAST to search for genome-wide
A-rich motif instances. This, however, amounted to aligning a long PWM with
strong bias for "A" to a genome with nearly 80% A+T composition. Because the PWM
could not capture apparently important features of the A-rich tracts - such as the relative
paucity of intervening characters longer than 2-3nt - this approach exhibited very poor
specificity, and motivated our decision to develop a custom model to describe this motif.
2.6.3.2.2. Genome-Wide Search Using HMM
We next applied the trained HMM to find instances of the A-rich motif near other
genes. For every predicted macronuclear gene and downstream region, we obtained two
scores: the number of bases in that gene likely to be within a motif instance, and the per-base
score of the best motif match in that gene's sequence. The joint distribution of these
two metrics is shown in Figure 9. Discarding the lowest sRNA-associated score on each
axis as an outlier, we count 109 genes with A-rich motif instances at least as strong as the
weakest remaining sRNA-associated gene. Genes with strong motif scores are listed in
Table A4; those not located near known sRNA clusters are bolded. Several of these
("family IV") are close paralogs of sRNA cluster 2 (8254572). Four other genes with
strong A-rich motifs (9721, 9739-40, 17099) group by paralogy and are duplicated in
numerous places throughout the genome. All other genes appear to be singletons or are
divergent enough to fail alignment filtering.
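
One reading of the selection rule above can be sketched as a pair of thresholds. The sketch assumes that the lowest sRNA-associated value is dropped independently on each axis and that a gene must meet both remaining minima to count as high-scoring; the function name and the score dictionary layout are hypothetical.

```python
def high_scoring_genes(scores, srna_genes):
    """Select non-sRNA genes scoring at least as well as the sRNA-cognate set.

    scores: dict gene_id -> (n_motif_bases, per_base_score).
    srna_genes: set of gene ids with known sRNA association (at least two).
    The lowest sRNA-associated value on each axis is treated as an outlier,
    so each threshold is the second-lowest sRNA-associated value.
    """
    n_values = sorted(scores[g][0] for g in srna_genes)
    s_values = sorted(scores[g][1] for g in srna_genes)
    n_min, s_min = n_values[1], s_values[1]
    return sorted(
        g for g, (n, s) in scores.items()
        if g not in srna_genes and n >= n_min and s >= s_min
    )
```

Requiring both thresholds excludes genes with a single short but nearly pure A-run as well as genes with long, weakly scoring tracts.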
[Figure 9 (plot not recoverable from extraction). Title: Quality vs. Quantity of Motif Instances for All Tetrahymena Predicted Genes as Determined by HMM. X-axis: number of motif nucleotides (ni). Legend: sRNA-cognate genes (N=18); high-scoring genes (N=109); all other genes (N=27,179).]
Figure 9. All predicted Tetrahymena macronuclear genes plotted by strength and
quantity of A-rich motif instances as scored by HMM. Known sRNA-associated genes
(red) comprised the training set and score highly. 109 genes without known sRNA
activity (light blue) have relatively long motif hits with similar characteristics to those
near sRNA loci.
2.6.3.2.3. Cross-Validation Results
A Hidden Markov Model (HMM) was specifically designed to classify A-rich motif
instances. Promisingly, under leave-one-out cross-validation, each motif hit near sRNA
loci identified by MEME was rediscovered using our HMM during each round of
cross-validation. Because MEME was constrained to return no more than one motif hit per
sequence, the HMM identified additional stretches of A-rich motifs near some of the 18
genes, for a total of 28 motif hits. Of the additional ten hits beyond the MEME-identified
set, all but two were rediscovered in every round of cross-validation testing, and those
two were rediscovered in more than half of the cross-validation rounds. The stability of
these results indicates that this HMM motif model is robust to perturbations in training
data, and suggests the motif runs identified by the model are reflective of those near
observed sRNA clusters.
2.6.3.3. Discussion
We have shown that an HMM trained on sRNA-associated A-rich motif instances
discriminates very well between sRNA-cognate genes and all others. As these A-rich
tracts appear to be the strongest feature at these loci and are nearly exclusive to them, we
propose that they trigger sRNA biosynthesis through an unknown mechanism. Instances
of this motif occur in a small set of other loci throughout the genome which may also
have sRNA-generating potential and would therefore make ideal subjects for follow-up
investigation in vivo.
2.6.3.3.1. Proposed Experiments
We propose to probe some of the poly-A loci for sense and antisense transcription. It
might be best to start at loci with paralogy to sites of known sRNA production. Among
genes with strong A-rich motifs, predicted gene 12216 seems best supported by paralogs
with observed sRNA activity (sRNA clusters 4, 6, and 7). Family IV, which includes
cluster 2, contains gene predictions with nearby A-rich motifs but only two sRNA loci.
Among the remaining motif-associated genes, four group by paralogy (family V; gene
preds. 9721, 9739-40, and 17099) in a manner similar to known sRNA loci. For a
negative control on sRNA activity, we suggest probing for antisense ~23 nt sRNAs at a
number of well-characterized genes expressed throughout vegetative growth.
If it turns out that motif-associated genes have antisense sRNA activity, we propose the
creation of chimeric genes combining the positive control genes with A-rich regions to
attempt inducing sRNA activity at new loci. Also, to investigate whether the observed
sRNA clusters are essential for viability, we suggest the creation of knockout strains
lacking these loci.
2.7. Conclusion
The precise function of the ~23 nt sRNAs of Tetrahymena remains to be seen. Despite
sharing numerous characteristics with well-characterized classes of interfering RNA, our
analysis failed to uncover evidence supporting functional similarity to classical RNAi
pathways. Given the ciliates' unique molecular organization, it would hardly be
surprising to discover that they possess another highly specialized "niche" mode of RNAi
in addition to scanning RNA.
Yet despite the ~23 nt sRNAs' present functional ambiguity, it is tempting for several
reasons to speculate that they play an essential cellular role. First, their putative nuclease,
Dcl2p, is essential for vegetative growth. Second, Tetrahymena grows extremely rapidly given its
size and must come under considerable stress to maintain transcriptional efficiency,
suggesting that the sRNA processing machinery and the cluster loci themselves are under
selective pressure to remain expressed.
Via computational analysis, we uncovered a strong and specific association between
sRNA activity and an A-rich motif. We have predicted additional occurrences of this
motif and suggest that one way to test their association with the sRNAs would be to
probe for their presence at these loci. That the ~23 nt sRNAs were organized in clusters
despite the apparent diversity of their population suggests that they are processed from a
relatively small number of loci genome-wide, consistent with the relatively small number
of A-rich motif occurrences discovered computationally.
Ultimately, the question of these sRNAs' purpose remains. One speculative hypothesis is
that they guide post-transcriptional silencing of their own loci in cis. RNAi can be
induced by feeding in the related ciliate Paramecium (29), so the necessary silencing
factors may exist in Tetrahymena. Silencing could be induced for any of numerous
reasons, including to quickly dampen transcript levels in response to some environmental
stimulus, to buffer against residual transcriptional activity of genes recently made
obsolete, or perhaps to protect against macronuclear contamination from MIC or other
foreign DNA.
2.8. References Cited
1. Lee SR, Collins K. Two Classes of Endogenous Small RNAs in Tetrahymena
Thermophila. Genes Dev. 2006 Jan 1;20(1):28-33.
2. Mochizuki K, Fine NA, Fujisawa T, Gorovsky MA. Analysis of a Piwi-Related Gene
Implicates Small RNAs in Genome Rearrangement in Tetrahymena. Cell. 2002 Sep
20;110(6):689-99.
3. Yao MC, Chao JL. RNA-Guided DNA Deletion in Tetrahymena: An RNAi-Based
Mechanism for Programmed Genome Rearrangements. Annu Rev Genet. 2005;39:537-59.
4. Collins K, Gorovsky MA. Tetrahymena Thermophila. Curr Biol. 2005 May
10;15(9):R317-8.
5. Dessen P, Zagulski M, Gromadka R, Plattner H, Kissmehl R, Meyer E, Betermier M,
Schultz JE, Linder JU, Pearlman RE, Kung C, Forney J, Satir BH, Van Houten JL, Keller
AM, Froissard M, Sperling L, Cohen J. Paramecium Genome Survey: A Pilot Project.
Trends Genet. 2001 Jun;17(6):306-8.
6. Paramecium tetraurelia Genome Sequencing Project. Paramecium tetraurelia Genome
Browser [Internet]. http://www.genoscope.cns.fr/externe/Francais/Projets/Projet_FN/:;
2005 5/25/2005.
7. Zagulski M, Nowak JK, Le Mouel A, Nowacki M, Migdalski A, Gromadka R, Noel B,
Blanc I, Dessen P, Wincker P, Keller AM, Cohen J, Meyer E, Sperling L. High Coding
Density on the Largest Paramecium Tetraurelia Somatic Chromosome. Curr Biol. 2004
Aug 10;14(15):1397-404.
8. Arabidopsis Genome Initiative. Analysis of the Genome Sequence of the Flowering
Plant Arabidopsis Thaliana. Nature. 2000 Dec 14;408(6814):796-815.
9. Ku HM, Vision T, Liu J, Tanksley SD. Comparing Sequenced Segments of the Tomato
and Arabidopsis Genomes: Large-Scale Duplication Followed by Selective Gene Loss
Creates a Network of Synteny. Proc Natl Acad Sci U S A. 2000 Aug 1;97(16):9121-6.
10. Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K,
Dewar K, Doyle M, FitzHugh W, Funke R, Gage D, Harris K, Heaford A, Howland J,
Kann L, Lehoczky J, LeVine R, McEwan P, McKernan K, Meldrim J, Mesirov JP,
Miranda C, Morris W, Naylor J, Raymond C, Rosetti M, Santos R, Sheridan A, Sougnez
C, Stange-Thomann N, Stojanovic N, Subramanian A, Wyman D, Rogers J, Sulston J,
Ainscough R, Beck S, Bentley D, Burton J, Clee C, Carter N, Coulson A, Deadman R,
Deloukas P, Dunham A, Dunham I, Durbin R, French L, Grafham D, Gregory S,
Hubbard T, Humphray S, Hunt A, Jones M, Lloyd C, McMurray A, Matthews L, Mercer
S, Milne S, Mullikin JC, Mungall A, Plumb R, Ross M, Shownkeen R, Sims S,
Waterston RH, Wilson RK, Hillier LW, McPherson JD, Marra MA, Mardis ER, Fulton
LA, Chinwalla AT, Pepin KH, Gish WR, Chissoe SL, Wendl MC, Delehaunty KD,
Miner TL, Delehaunty A, Kramer JB, Cook LL, Fulton RS, Johnson DL, Minx PJ,
Clifton SW, Hawkins T, Branscomb E, Predki P, Richardson P, Wenning S, Slezak T,
Doggett N, Cheng JF, Olsen A, Lucas S, Elkin C, Uberbacher E, Frazier M, Gibbs RA,
Muzny DM, Scherer SE, Bouck JB, Sodergren EJ, Worley KC, Rives CM, Gorrell JH,
Metzker ML, Naylor SL, Kucherlapati RS, Nelson DL, Weinstock GM, Sakaki Y,
Fujiyama A, Hattori M, Yada T, Toyoda A, Itoh T, Kawagoe C, Watanabe H, Totoki Y,
Taylor T, Weissenbach J, Heilig R, Saurin W, Artiguenave F, Brottier P, Bruls T,
Pelletier E, Robert C, Wincker P, Smith DR, Doucette-Stamm L, Rubenfield M,
Weinstock K, Lee HM, Dubois J, Rosenthal A, Platzer M, Nyakatura G, Taudien S,
Rump A, Yang H, Yu J, Wang J, Huang G, Gu J, Hood L, Rowen L, Madan A, Qin S,
Davis RW, Federspiel NA, Abola AP, Proctor MJ, Myers RM, Schmutz J, Dickson M,
Grimwood J, Cox DR, Olson MV, Kaul R, Raymond C, Shimizu N, Kawasaki K,
Minoshima S, Evans GA, Athanasiou M, Schultz R, Roe BA, Chen F, Pan H, Ramser J,
Lehrach H, Reinhardt R, McCombie WR, de la Bastide M, Dedhia N, Blocker H,
Hornischer K, Nordsiek G, Agarwala R, Aravind L, Bailey JA, Bateman A, Batzoglou S,
Birney E, Bork P, Brown DG, Burge CB, Cerutti L, Chen HC, Church D, Clamp M,
Copley RR, Doerks T, Eddy SR, Eichler EE, Furey TS, Galagan J, Gilbert JG, Harmon
C, Hayashizaki Y, Haussler D, Hermjakob H, Hokamp K, Jang W, Johnson LS, Jones
TA, Kasif S, Kaspryzk A, Kennedy S, Kent WJ, Kitts P, Koonin EV, Korf I, Kulp D,
Lancet D, Lowe TM, McLysaght A, Mikkelsen T, Moran JV, Mulder N, Pollara VJ,
Ponting CP, Schuler G, Schultz J, Slater G, Smit AF, Stupka E, Szustakowski J, Thierry-Mieg
D, Thierry-Mieg J, Wagner L, Wallis J, Wheeler R, Williams A, Wolf YI, Wolfe
KH, Yang SP, Yeh RF, Collins F, Guyer MS, Peterson J, Felsenfeld A, Wetterstrand KA,
Patrinos A, Morgan MJ, de Jong P, Catanese JJ, Osoegawa K, Shizuya H, Choi S, Chen
YJ, International Human Genome Sequencing Consortium. Initial Sequencing and
Analysis of the Human Genome. Nature. 2001 Feb 15;409(6822):860-921.
11. Heilig R, Eckenberg R, Petit JL, Fonknechten N, Da Silva C, Cattolico L, Levy M,
Barbe V, de Berardinis V, Ureta-Vidal A, Pelletier E, Vico V, Anthouard V, Rowen L,
Madan A, Qin S, Sun H, Du H, Pepin K, Artiguenave F, Robert C, Cruaud C, Bruls T,
Jaillon O, Friedlander L, Samson G, Brottier P, Cure S, Segurens B, Aniere F, Samain S,
Crespeau H, Abbasi N, Aiach N, Boscus D, Dickhoff R, Dors M, Dubois I, Friedman C,
Gouyvenoux M, James R, Madan A, Mairey-Estrada B, Mangenot S, Martins N, Menard
M, Oztas S, Ratcliffe A, Shaffer T, Trask B, Vacherie B, Bellemere C, Belser C,
Besnard-Gonnet M, Bartol-Mavel D, Boutard M, Briez-Silla S, Combette S, Dufosse-Laurent
V, Ferron C, Lechaplais C, Louesse C, Muselet D, Magdelenat G, Pateau E, Petit
E, Sirvain-Trukniewicz P, Trybou A, Vega-Czarny N, Bataille E, Bluet E, Bordelais I,
Dubois M, Dumont C, Guerin T, Haffray S, Hammadi R, Muanga J, Pellouin V, Robert
D, Wunderle E, Gauguet G, Roy A, Sainte-Marthe L, Verdier J, Verdier-Discala C,
Hillier L, Fulton L, McPherson J, Matsuda F, Wilson R, Scarpelli C, Gyapay G, Wincker
P, Saurin W, Quetier F, Waterston R, Hood L, Weissenbach J. The DNA Sequence and
Analysis of Human Chromosome 14. Nature. 2003 Feb 6;421(6923):601-7.
12. Mochizuki K, Gorovsky MA. A Dicer-Like Protein in Tetrahymena has Distinct
Functions in Genome Rearrangement, Chromosome Segregation, and Meiotic Prophase.
Genes Dev. 2005 Jan 1;19(1):77-89.
13. Lee SR, Collins K. Starvation-Induced Cleavage of the tRNA Anticodon Loop in
Tetrahymena Thermophila. J Biol Chem. 2005 Dec 30;280(52):42744-9.
14. Lee Y, Jeon K, Lee JT, Kim S, Kim VN. MicroRNA Maturation: Stepwise
Processing and Subcellular Localization. EMBO J. 2002 Sep 2;21(17):4663-70.
15. Vazquez F, Vaucheret H, Rajagopalan R, Lepers C, Gasciolli V, Mallory AC, Hilbert
JL, Bartel DP, Crete P. Endogenous Trans-Acting siRNAs Regulate the Accumulation of
Arabidopsis mRNAs. Mol Cell. 2004 Oct 8;16(1):69-79.
16. Ambros V, Lee RC, Lavanway A, Williams PT, Jewell D. MicroRNAs and Other
Tiny Endogenous RNAs in C. Elegans. Curr Biol. 2003 May 13;13(10):807-18.
17. Tang G, Reinhart BJ, Bartel DP, Zamore PD. A Biochemical Framework for RNA
Silencing in Plants. Genes Dev. 2003 Jan 1;17(1):49-63.
18. Lau NC, Lim LP, Weinstein EG, Bartel DP. An Abundant Class of Tiny RNAs with
Probable Regulatory Roles in Caenorhabditis Elegans. Science. 2001 Oct
26;294(5543):858-62.
19. Gustafson AM, Allen E, Givan S, Smith D, Carrington JC, Kasschau KD. ASRP: The
Arabidopsis Small RNA Project Database. Nucleic Acids Res. 2005 Jan 1;33(Database
issue):D637-40.
20. Griffiths-Jones S, Grocock RJ, van Dongen S, Bateman A, Enright AJ. MiRBase:
MicroRNA Sequences, Targets and Gene Nomenclature. Nucleic Acids Res. 2006 Jan
1;34(Database issue):D140-4.
21. Griffiths-Jones S. The microRNA Registry. Nucleic Acids Res. 2004 Jan
1;32(Database issue):D109-11.
22. Bartel DP. MicroRNAs: Genomics, Biogenesis, Mechanism, and Function. Cell. 2004
Jan 23;116(2):281-97.
23. Hofacker IL, Fontana W, Stadler PF, Bonhoeffer S, Tacker M, Schuster P. Fast
Folding and Comparison of RNA Secondary Structures. Monatshefte f Chemie.
1994;125:167-188.
24. Lewis BP, Burge CB, Bartel DP. Conserved Seed Pairing, often Flanked by
Adenosines, Indicates that Thousands of Human Genes are microRNA Targets. Cell.
2005 Jan 14;120(1):15-20.
25. Lewis BP, Shih IH, Jones-Rhoades MW, Bartel DP, Burge CB. Prediction of
Mammalian microRNA Targets. Cell. 2003 Dec 26;115(7):787-98.
26. Gish W. [Internet].; 1996 1996-2003. Available from: http://blast.wustl.edu.
27. Nix DA, Eisen MB. GATA: A Graphic Alignment Tool for Comparative Sequence
Analysis. BMC Bioinformatics. 2005 Jan 17;6(1):9.
28. Bailey TL, Elkan C. Fitting a Mixture Model by Expectation Maximization to
Discover Motifs in Biopolymers. Proc Int Conf Intell Syst Mol Biol. 1994;2:28-36.
29. Galvani A, Sperling L. RNA Interference by Feeding in Paramecium. Trends Genet.
2002 Jan;18(1):11-2.
CHAPTER 3
Estimating Tissue-Specific Patterns of
Depletion and Enrichment for microRNA
Targeting in Mammalian Genes
Abstract
Over the course of evolution, mammalian genes have come under tremendous selective
pressure to avoid certain regulatory sequence motifs while maintaining others.
MicroRNA-mediated RNA interference is a potent repressor of mammalian gene
expression, and it was recently reported (1) that genes preferentially expressed in certain
tissues are under significant and measurable pressure to avoid targeting by coexpressed
microRNAs (miRNAs). An enhanced method to estimate the degree of motif depletion
and enrichment is reported and applied to miRNA target motifs across a panel of 61
mouse tissues. Resulting patterns of target depletion display strong correlation with
miRNAs' reported expression profiles, reinforcing prior reports. Lastly, the inverse
analysis is applied, revealing target enrichment consistent with miRNAs' recently
hypothesized role in tissue identity maintenance.
3.1. Introduction
Numerous regulatory pathways in the cell are mediated by specific sequence motifs.
Transcription, the first step in the central dogma of molecular biology, is activated by
factors that recognize and bind promoter elements within DNA loci to be expressed. For
example, a circuit of transcriptional control largely conserved from humans to
archaebacteria responds to heat shock (2) by activating genes with special promoter
motifs.
As biological complexity grew during the course of evolution, organisms relied upon
such regulatory layers to exert increasingly fine-tuned control over gene expression to
respond to different environmental stimuli and develop specialized molecular and
physiological structures. The mammalian transcriptome comes under the control of
numerous regulatory pathways, including RNAi, signal transduction, splicing, and
nonsense-mediated decay, to name a few.
The benefits of complex regulation are accompanied by selective pressure to maintain or
gain functional motifs and, conversely, to avoid those with deleterious effects. Here we
describe a method to estimate enrichment or depletion for such motifs specific to genes
expressed in a certain tissue or cell population. This method enhances one recently
applied to find tissue-specific patterns of microRNA target depletion (1), and for
comparative purposes we apply it to the same dataset. We additionally propose that this
method could be applied to other regulatory processes, both to measure the tissue
specificities of known motifs and to discover novel ones.
Farh et al (1) reported that genes preferentially expressed in a given tissue show significant
avoidance of targeting by miRNAs coexpressed in the same tissue. Many microRNAs
are expressed in a highly tissue-specific fashion and regulate a small set of target
genes with dramatic effect during tissue development. However, an inevitably large
remainder of genes are required to be expressed unhindered and are thus depleted for
such miRNA target sites. We estimate patterns of targeting depletion largely coinciding
with recent reports (1, 3) as well as with miRNA expression data. Finally, we invert this
analysis to search for target enrichment, and find patterns consistent with miRNAs'
proposed role in silencing aberrant transcription to help maintain tissue identity.
3.2. Methods
3.2.1. Microarray Dataset Processing
The GeneAtlas v2 mouse microarray dataset (4) was downloaded in MAS5-normalized
form from the Novartis Research Foundation (http://wombat.gnf.org). This dataset
comprises intensity levels for 36,182 probes interrogating over 20,000 RefSeq-annotated
genes in cellular extracts of 61 different tissues. The geometric mean was taken over the
two replicate sets of measurements.
Microarray probes were mapped by name to RefSeq-annotated mouse genes using
mapping tables provided by the UCSC Genome Browser (5). Following exclusion of
probes with names ending in "_x_at", which target entire gene families rather than single
genes, 35,224 probes remained. Even so, some genes were targeted by multiple probes;
in these cases, the arithmetic mean was taken over the corresponding probe sets.
Averaging over replicates and redundant probes yielded a 13,894x61 matrix of intensity
values with genes on the rows and tissues on the columns.
To mitigate the noise inherent in array measurements, the sorted rank of every gene's
intensity level in a given tissue was taken with respect to the levels of all other genes in
that tissue. This was repeated separately for each tissue, yielding a 13,894x61 matrix of
tissue-specific ranks with values ranging from 1, representing the lowest expression level, to
13,894, representing the highest level.
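The replicate averaging and within-tissue rank transform described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in this study: the gene names and intensities are hypothetical toy values, and ties are broken by input order, a detail the text does not specify.

```python
from math import exp, log
from statistics import mean

def geometric_mean(values):
    """Geometric mean of positive intensity values (e.g., two replicates)."""
    return exp(mean(log(v) for v in values))

def rank_within_tissue(intensities):
    """Replace each gene's intensity by its sorted rank (1 = lowest).

    Ties are broken by input order (an assumption of this sketch).
    """
    order = sorted(range(len(intensities)), key=lambda g: intensities[g])
    ranks = [0] * len(intensities)
    for rank, gene in enumerate(order, start=1):
        ranks[gene] = rank
    return ranks

# Hypothetical toy data: two replicate intensities per gene in one tissue.
replicates = {"geneA": (100.0, 400.0), "geneB": (50.0, 50.0), "geneC": (900.0, 100.0)}
genes = sorted(replicates)
tissue_column = [geometric_mean(replicates[g]) for g in genes]
tissue_ranks = rank_within_tissue(tissue_column)
```

Repeating the rank transform once per tissue column would give a genes-by-tissues matrix of ranks of the kind described in the text.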
3.2.2. Observed Sequence Features
3' UTR coordinates were obtained from RefSeq annotations for all genes targeted by the
microarray dataset by concatenating exonic ranges not contained within the annotated
coding sequence. Transcripts with 3' UTRs less than 50 nt in length are likely to reflect
incorrect annotation and were discarded for statistical reasons. For the remaining 13,144
genes, a total of 13,289 3' UTR sequences were then extracted from the mm7 assembly
of the mouse genome.
Counts of all hexamers and heptamers were then obtained for the extracted sequences,
resulting in two matrices of 13,144 rows by 4096 and 16384 columns, respectively.
Constituent oligonucleotides (kmers) containing ambiguity characters at any position
were not counted. When a gene had multiple 3' UTR annotations, the arithmetic mean of
kmer counts was taken over its transcript sequences.
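As an illustration of the counting step, the sketch below counts overlapping kmers, skips any window containing an ambiguity character, and averages counts over a gene's alternative 3' UTRs. The function names and the example sequences are hypothetical and are not taken from the original analysis.

```python
from collections import Counter

def count_kmers(seq, k):
    """Count overlapping k-mers, skipping windows with any non-ACGT character."""
    seq = seq.upper()
    counts = Counter()
    for start in range(len(seq) - k + 1):
        kmer = seq[start:start + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts

def mean_counts(count_list):
    """Arithmetic mean of k-mer counts over a gene's alternative 3' UTRs."""
    total = Counter()
    for counts in count_list:
        total.update(counts)
    return {kmer: n / len(count_list) for kmer, n in total.items()}

# Hypothetical sequences; utr_a contains one ambiguity character (N).
utr_a = "ACGTACGTNACGTAC"
utr_b = "ACGTACGT"
gene_counts = mean_counts([count_kmers(utr_a, 6), count_kmers(utr_b, 6)])
```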
3.2.3. Sequence Feature Background Model
The expected counts of all hexamers and heptamers given a second-order Markov model
were calculated for each sequence. Under this model, a sequence's expected count of
every oligo of order k >= 3 is a function of its trinucleotide frequencies and its length:

$$E_{i,x_1 \ldots x_k} = (L_i - k + 1)\, f^i_{x_1 x_2} \prod_{j=3}^{k} b^i_{x_j \mid x_{j-2} x_{j-1}},$$

where i indexes genes, $L_i$ is the length of the sequence, and the $x_j$ are bases. The
background conditional frequency of the trinucleotide $x_1 x_2 x_3$ (or, more precisely, the
conditional frequency of base $x_3$ given the preceding dinucleotide $x_1 x_2$) was taken as

$$b^i_{x_3 \mid x_1 x_2} = \frac{f^i_{x_1 x_2 x_3}}{\sum_{Y \in \{A,C,G,T\}} f^i_{x_1 x_2 Y}},$$

where $f^i_X$ is the observed frequency of oligonucleotide X in the
3' UTR of gene i.
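A minimal sketch of an expected-count calculation under a second-order Markov model is given below. It assumes that the leading term is the observed frequency of the first dinucleotide and that conditional frequencies are estimated from the trinucleotide counts of the same sequence; the function name and the test sequence are hypothetical.

```python
from collections import Counter

def expected_kmer_count(seq, kmer):
    """Expected count of `kmer` in `seq` under a second-order Markov model."""
    k, length = len(kmer), len(seq)
    di = Counter(seq[i:i + 2] for i in range(length - 1))
    tri = Counter(seq[i:i + 3] for i in range(length - 2))
    # Frequency of the leading dinucleotide (assumed starting term).
    prob = di[kmer[:2]] / max(length - 1, 1)
    for j in range(2, k):
        context = kmer[j - 2:j]
        denom = sum(tri[context + y] for y in "ACGT")
        if denom == 0:
            return 0.0  # context never observed in this sequence
        # Conditional frequency of base j given the two preceding bases.
        prob *= tri[context + kmer[j]] / denom
    # Number of k-mer start positions times the per-position probability.
    return (length - k + 1) * prob

seq = "ACGT" * 10  # hypothetical, perfectly periodic sequence
expected = expected_kmer_count(seq, "ACGTAC")
```

For this periodic toy sequence the expected count of ACGTAC is 350/39 (about 8.97), close to the 9 occurrences actually present, which is the behavior one would hope for from a background model that captures local composition.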
3.2.4. Sequence Feature Sets
Observed and expected counts were calculated in every 3' UTR sequence for two sets of
sequence features: all hexamers ("all6mers") and all heptamers ("all7mers").
For analysis of known microRNAs (miRNAs), mature sequences for the 288 mouse
miRNAs in RFAM release 7.1 were downloaded from miRBase (6, 7). MiRNAs were
grouped into 187 families having identical sequence in 5' bases 1-8 (i.e., the "m1" base
and the seed region). Two sequence feature sets were derived from these miRNA
sequences: "rf71_con6", which comprised for each miRNA seed+m1 sequence the
combination of three overlapping reverse-complement hexamers, and "rf71_con7", which
was analogously composed of combinations of two overlapping heptamers. For miRNA
sequences not starting with U at the 5' end, the exact reverse complement of bases 1-8
was investigated as with the other sequences, but additionally a "t1A" record was added,
combining counts of kmers exactly reverse complementary to the miRNA with counts of
those having an "A" at the position opposite the m1 base. The observed and expected counts of
overlapping hexamers and heptamers were then drawn from the all6mers and all7mers
sets, respectively, and summed.
Table 1. Compilation of the rf71_con6 miRNA seed match feature set from constituent overlapping
hexamers; rf71_con7 was compiled analogously using heptamers.

Mouse microRNA       Mature sequence   Seed match            Constituent hexamers
mmu-miR-99b          5'-CACCCGUA...    ...UACGGGUG-3'        UACGGG+ACGGGU+CGGGUG
mmu-miR-99b (+t1A)   5'-CACCCGUA...    ...UACGGGU[G|A]-3'    2*UACGGG+2*ACGGGU+CGGGUG+CGGGUA
mmu-miR-125a         5'-UCCCUGAG...    ...CUCAGGGA-3'        CUCAGG+UCAGGG+CAGGGA
mmu-miR-127          5'-UCGGAUCC...    ...GGAUCCGA-3'        GGAUCC+GAUCCG+AUCCGA
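The compilation of constituent kmers for one miRNA, as tabulated in Table 1, can be sketched as follows. The function and record names ("exact", "t1A" as dictionary keys) are hypothetical; the logic assumes only what the text states: the seed match is the reverse complement of bases 1-8, and a t1A record is added for miRNAs not starting with U.

```python
from collections import Counter

COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}

def reverse_complement(rna):
    """Reverse complement of an RNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(rna))

def overlapping_kmers(site, k):
    """All overlapping k-mers of a short site, left to right."""
    return [site[i:i + k] for i in range(len(site) - k + 1)]

def seed_match_features(mature, k=6):
    """Return {record name: Counter of constituent k-mers} for one miRNA."""
    site = reverse_complement(mature[:8])      # match to miRNA bases 1-8
    records = {"exact": Counter(overlapping_kmers(site, k))}
    if mature[0] != "U":
        t1a_site = site[:-1] + "A"             # "A" opposite the m1 base
        records["t1A"] = records["exact"] + Counter(overlapping_kmers(t1a_site, k))
    return records

mir99b = seed_match_features("CACCCGUA")       # first 8 bases of mmu-miR-99b
mir125a = seed_match_features("UCCCUGAG")      # first 8 bases of mmu-miR-125a
```

With k=7 the same function yields the two overlapping heptamers per site used for the heptamer feature set.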
3.2.5. Tissue Specificity Index score
A Tissue Specificity Index (TSI) score was computed for each gene-tissue pair,
measuring both the strength and specificity of the given gene's expression in that tissue.
To assess specificity of gene expression in one tissue it was first necessary to quantify the
average strength of expression across the set of tissues T. To calculate the
average expression rank $R_i$ of gene i, the median of the expression ranks of that gene in
all tissues, $r_i = \mathrm{median}\{R_i^j : j \in T\}$, was found. The median of a set of ranks is not
necessarily itself a rank, and so these median values were ranked with respect to each
other to give the average expression rank $R_i$. It was expected that genes with ubiquitously
high expression would have large values of $R_i$, whereas those with low expression in
most tissues would have small $R_i$. The use of ranked medians of ranks was intended to
preclude genes expressed highly in only a few tissues from being assigned high average
expression, such as might occur by taking the mean of intensity values.
The average expression rank $R_i$ of a gene across all tissues was subtracted from the
gene's specific expression rank $R_i^j$ in each tissue j to yield the TSI score for that gene-tissue
pair: $TSI_i^j = R_i^j - R_i$. Positive scores indicate strong expression specific to that tissue,
whereas scores near zero indicate a similar level of expression to that found in other
tissues. For example, genes strongly expressed in muscular tissues but not elsewhere, for
instance specific myosin isoforms, would be expected to have positive scores.
Conversely, ribosomal proteins, which are strongly expressed in every cell type, would
be expected to have scores near zero. In contrast, negative scores are reflective of
tissue-specific avoidance of a certain gene's expression.
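The TSI calculation can be sketched in a few lines. The sketch below is illustrative only: the function names and the toy rank matrix are hypothetical, and ties among median values are broken by gene order, which the text does not specify.

```python
from statistics import median

def rank_values(values):
    """Sorted rank of each value (1 = lowest); ties broken by input order."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks

def tsi_scores(tissue_ranks):
    """tissue_ranks[j][i] is the rank of gene i in tissue j; returns TSI[j][i]."""
    n_genes = len(tissue_ranks[0])
    medians = [median(col[i] for col in tissue_ranks) for i in range(n_genes)]
    average_rank = rank_values(medians)        # ranked medians of ranks
    return [[col[i] - average_rank[i] for i in range(n_genes)]
            for col in tissue_ranks]

# Hypothetical toy example: 3 tissues x 4 genes; gene 0 is specific to tissue 0.
toy_ranks = [[4, 1, 2, 3],
             [1, 2, 3, 4],
             [1, 2, 3, 4]]
tsi = tsi_scores(toy_ranks)
```

In this toy case gene 0 receives a TSI of 3 in tissue 0 and 0 elsewhere, mirroring the expected behavior for a tissue-specific gene.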
[Figure 1 (schematic): Calculation of Tissue Specificity Index score. Genes are ranked by tissue-specific expression (example: skeletal muscle; ranks from 1, lowest, to 13144, highest) and by average expression across tissues (or conditions), and are then ranked by TSI score, the tissue-specific rank minus the average rank. Example genes shown: Mylpf (myosin light chain, fast twitch), Rps16 (ribosomal protein S16), and Cox7c (cytochrome c oxidase); muscle-specific genes (myosin, creatine kinase, etc.) are highlighted.]
Figure 1. Calculation of tissue average rank and Tissue Specificity Index scores from expression
atlas microarray data.
Defining the TSI in this way appeals to an intuitive geometric interpretation. Because
across all genes, average ranks were strongly correlated with tissue-specific ranks, it was
natural to fit them to a linear model. This can be visualized as a plot of all genes' average
ranks versus their ranks in a specific tissue (e.g., skeletal muscle) with a linear fit relating
the variables $R_i$ and $R_i^j$. In no tissue examined did the fitted linear relationship deviate
strongly from a zero-intercept, 45-degree line, and thus the linear model was fixed as
$(\beta_0, \beta) = (0, (1, -1))$. As the signed perpendicular distance from a point x to a line
$L = \{x : \beta_0 + \beta \cdot x = 0\}$ is $\frac{1}{\|\beta\|}(\beta \cdot x + \beta_0)$, the distance from the point
$x = (R_i^j, R_i)$ representing each gene's expression ranks to
the imposed line relating the ranks is $\frac{1}{\sqrt{2}}(R_i^j - R_i)$, exactly proportional to the TSI for
that gene. Thus if a slope-one, zero-intercept linear model is considered as the default
relationship between each tissue-specific rank and the average rank, the TSI of each gene
may be viewed as its signed deviation from that model.
3.2.6. Measurement of Feature Depletion and Enrichment
3.2.6.1. Overview
Sequence features were then individually assessed for depletion and enrichment relative
to their tissue-specific expected distributions. The expected distribution of a feature in a
given tissue was set by ranking genes by TSI and binning so as to equalize the total
expected count of that feature in each bin. Observed feature counts were compared in
each bin, and the running difference of expected and observed counts was calculated.
This amounted to finding the maximal difference between the empirical cumulative
distribution functions of two discrete variables. A background distribution on this
statistic was obtained and the empirically-fit KS test was applied to yield two P-values,
one for the given sequence feature's enrichment and one for its depletion among genes
with high TSI ranks in the given tissue.
3.2.6.2. Binning Strategy
For a given tissue j, genes were ranked by increasing $TSI^j$ in that tissue, so that genes
specifically avoided had small ranks, those neutrally expressed had moderate ranks, and
those specifically highly expressed in that tissue had large ranks. Using this ordering,
genes were divided into 100 bins such that the summed expected counts of the given
sequence feature X in each bin was roughly equal (details in Figure 2).
bin_equal_expcounts(X, E, binNumbers, nbins=100)
    curBin <- 0
    acc <- 0
    for gene_i = 1 to size(E)
        acc <- acc + E[gene_i]
        binNumbers[gene_i] <- curBin
        if acc > sum(E) / nbins
            curBin <- curBin + 1
            acc <- 0
Figure 2. Binning algorithm pseudocode. The algorithm scanned through genes ordered by TSI,
accumulating expected counts of a given feature and placing a bin boundary whenever
the accumulator exceeded the equal bin capacity for that feature.
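The pseudocode of Figure 2 translates directly into a short runnable sketch. The Python names below are hypothetical; the logic follows the pseudocode, with the expected counts assumed to be supplied already ordered by TSI.

```python
def bin_equal_expcounts(expected, n_bins=100):
    """Assign bin numbers to TSI-ordered genes so that each bin holds
    roughly equal summed expected counts of one sequence feature."""
    capacity = sum(expected) / n_bins
    bin_numbers = []
    cur_bin, acc = 0, 0.0
    for e in expected:
        acc += e
        bin_numbers.append(cur_bin)
        if acc > capacity:
            # Bin is full: start a new one with an empty accumulator.
            cur_bin += 1
            acc = 0.0
    return bin_numbers

# Hypothetical expected counts for eight genes, already sorted by TSI.
bins = bin_equal_expcounts([1.0] * 8, n_bins=4)
```

Because the accumulator is reset to zero rather than carrying over any overshoot, this sketch, like the pseudocode, can produce fewer than n_bins bins; the toy call above yields three bins rather than four.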
Observed counts were then summed within each of these bins. As a corrective factor, the
summed expected counts in each bin were scaled by the ratio
$\sum_i F_{i,X} / \sum_i E_{i,X}$ to normalize
their totals to equal those of the observed counts $F_{i,X}$.
3.2.6.3. KS Test Statistics
While this tissue-specific binning induced an approximately uniform distribution on the
expected feature counts, the observed counts were free to change. To obtain the
cumulative distribution function over the difference of observed and expected counts, the
summed difference within each bin was found and the running sum of those differences
taken. From the running sum, two one-sided discrete Kolmogorov-Smirnov (KS) test
statistics were calculated: the largest non-negative difference was taken as the enrichment
statistic and the largest negative difference was taken as the depletion statistic. Both
were obtained for each pair of tissue and sequence feature.
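The two one-sided statistics can be sketched as a running sum over bins. Several details here are assumptions rather than statements from the text: the difference is taken as observed minus scaled expected, bins are scanned in the order given, and the statistics are left in raw count units rather than normalized to fractions.

```python
def ks_statistics(observed, expected, bin_numbers):
    """One-sided enrichment and depletion statistics from binned counts."""
    n_bins = max(bin_numbers) + 1
    obs_bin = [0.0] * n_bins
    exp_bin = [0.0] * n_bins
    for o, e, b in zip(observed, expected, bin_numbers):
        obs_bin[b] += o
        exp_bin[b] += e
    # Scale expected counts so their total equals the observed total.
    scale = sum(obs_bin) / sum(exp_bin)
    running, enrichment, depletion = 0.0, 0.0, 0.0
    for o, e in zip(obs_bin, exp_bin):
        running += o - scale * e
        enrichment = max(enrichment, running)   # largest non-negative difference
        depletion = min(depletion, running)     # largest negative difference
    return enrichment, depletion

# Hypothetical per-gene counts, one gene per bin.
enrich, deplete = ks_statistics([0, 0, 4, 4], [2, 2, 2, 2], [0, 1, 2, 3])
```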
3.2.6.4. Estimation of KS Statistic Background Distribution
For each sequence feature, a set of 10,000 random gene orderings was generated and used
to estimate background distributions of the enrichment and depletion statistics. One
randomized ordering was generated by taking, for each gene, the tissue-specific rank
from a tissue randomly chosen with replacement. The resulting vector of values was
sorted and each value was replaced by its sorted rank. This ranked order was then used in
the same way as a tissue-specific ranking: a TSI score was calculated between the control
and average rankings and was used to order genes, and the binning and calculation of
enrichment and depletion statistics was performed as before. This process was repeated
10,000 times per kmer set to yield an empirical background distribution for each of these
two statistics for each sequence feature.
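Generation of one randomized control ordering can be sketched as below. The function name is hypothetical and the random number generator is seeded only to make the example reproducible; ties among drawn ranks are broken by gene order, which the text does not specify.

```python
import random

def control_ranking(tissue_ranks, rng):
    """One control ordering: for each gene, draw its rank from a tissue chosen
    at random with replacement, then replace drawn values by sorted ranks."""
    n_genes = len(tissue_ranks[0])
    drawn = [rng.choice(tissue_ranks)[i] for i in range(n_genes)]
    order = sorted(range(n_genes), key=lambda i: drawn[i])
    ranks = [0] * n_genes
    for rank, gene in enumerate(order, start=1):
        ranks[gene] = rank
    return ranks

# Hypothetical toy rank matrix: 3 tissues x 4 genes.
toy_ranks = [[4, 1, 2, 3],
             [1, 2, 3, 4],
             [1, 2, 3, 4]]
rng = random.Random(0)
controls = [control_ranking(toy_ranks, rng) for _ in range(5)]
```

Each control ranking would then be scored against the average ranking and passed through the same binning and statistic calculation as a real tissue.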
3.2.6.5. Application of KS Test
Enrichment and depletion P-values were determined from the empirical background
distribution using the method employed by Farh et al (1). When the test statistic x was
greater than the 98th percentile of the background distribution, P-values were instead
calculated from the fitted asymptotic KS tail probability $Q = e^{-2nx^2}$. The theoretical and
empirical tails were fitted by setting n so that their 98th percentiles were equal:
$n \leftarrow -\log(0.02) / (2 x_{98}^2)$,
where $x_{98}$ is the empirical 98th percentile of the test statistic x.
The resulting P-values estimate the significance of departure from the null hypotheses
under which sequence feature enrichment and depletion are distributed homogeneously
across scramblings of the tissue labels.
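A sketch of this P-value calculation follows. It assumes that statistics are passed as non-negative magnitudes (so depletion statistics would be negated first), that the logarithm is natural (as required for the fitted tail to equal 0.02 at the 98th percentile), and that a simple nearest-rank percentile is adequate; the function name and background values are hypothetical.

```python
from math import exp, log

def p_value(x, background):
    """Empirical P-value with a fitted asymptotic KS tail beyond the 98th percentile."""
    bg = sorted(background)
    n_bg = len(bg)
    # Nearest-rank 98th percentile, computed with integer arithmetic.
    x98 = bg[(98 * n_bg + 99) // 100 - 1]
    if x > x98:
        # Fit n so that exp(-2 * n * x98**2) equals 0.02 (requires x98 > 0).
        n = -log(0.02) / (2 * x98 ** 2)
        return exp(-2 * n * x ** 2)
    # Otherwise use the fraction of background statistics at least as large as x.
    return sum(1 for b in bg if b >= x) / n_bg

# Hypothetical background of 100 statistics, 0.01 through 1.00.
background = [i / 100 for i in range(1, 101)]
p_mid = p_value(0.5, background)
p_tail = p_value(2.0, background)
```

By construction the fitted tail joins the empirical distribution at the 98th percentile, so P-values vary smoothly across the switch between the two regimes.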
3.2.6.6. False Positive Analysis
False discovery rate was assessed by generating additional control gene reorderings, thus
drawing values from the test statistics' background distributions, and determining the
number of feature-tissue pairs called significant among this contrived set. The
significance threshold in each dataset was then set to the largest P-value for which the
number of expected false positives was less than one per sequence feature.
3.3. Results
3.3.1. Comparison of Expected and Observed Sequence Feature Counts
We evaluated the performance of our 3' UTR background models by comparing the total
expected and observed counts of each feature, summed over all genes (Figure 3). There
were few outliers among all 4,096 hexamers and all 187 sets of miRNA target heptamer
pairs, suggesting the features' expected counts indeed provide a good basis of
comparison for estimation of enrichment or depletion.
[Figure 3: Observed vs. expected counts of each sequence feature, summed over 3' UTRs. Panels: all hexamers; microRNA target seed heptamers. Horizontal axis: expected counts.]
Figure 3. A second-order background model adequately captures sequence compositional effects
for both hexamers and miRNA target heptamers without overfitting or bias. Each point
represents a sequence feature plotted by its total expected and observed counts in all genes.
3.3.2. Tissue-Specific Index Score Evaluation
We next examined the Tissue-Specific Index scores to verify that they properly reflected
tissue-specific expression or avoidance thereof. Our assumption was that the average
tissue ranks were correlated with tissue-specific ranks for the majority of genes such that
outlier genes were those specifically expressed or avoided in each tissue or cell type (8).
In each tissue, we took the Friedman rank-based correlation between the tissue-specific
and average rankings (Figure 4). In 57 tissues, a very strong correlation was observed
($0.799 < \rho < 0.921$; mean $\rho = 0.873$, $\sigma_\rho = 0.152$).
Four tissues had distinctly lower correlation values: ovary, fertilized egg, testis, and
pancreas ($0.688 < \rho < 0.716$). Strikingly, three of these were from reproductive organs.
Under hierarchical clustering, these tissues showed no apparent relationship to each other
(data not shown), suggesting their shared deviation from the average expression ranking
was not simply by virtue of having similar expression profiles to each other.
[Figure 4: Friedman rank correlation between genes' tissue-specific and average ranks, shown for each tissue.]
Figure 4. Average gene rank is highly correlated with tissue-specific expression rank for genes in
all 61 tissues, indicating that tissue-specific genes can be detected as outliers from a linear
regression model. Four tissues, of which three were germline-related, had distinctly weaker (but
still highly significant) correlations, suggesting highly specialized programs of transcription.
Goodness-of-fit analysis indicated that fixed slope-one, zero-intercept lines adequately
modeled the relationships between average and tissue-specific ranks in every tissue.
P-values were approximately zero for all tissues except the four noted to have lower
correlations; P-values in those tissues were each less than 7.4x10^-14 (F-test, 1 and 61 df).
We visually inspected the average-to-tissue-specific ranking relationships for several
tissues. In each tissue, most genes were densely clustered around the best-fit line, though
with some local differences in variance. Most tissues had a marked cluster of presumably
tissue-specific genes in the lower-right corner. We plotted these ranks for three tissues
(Figure 5) and performed literature searches to confirm that high TSI scores identify
genes with tissue-specific function. TSI scoring remained sensitive even among genes
that are only moderately expressed in the tissue, and exhibited high specificity for both
highly- and moderately-expressed genes, assigning low scores to genes ubiquitously
expressed or specific to another tissue.
[Figure 5a: Examination of tissue-specific index scoring in two tissues. Panels: skeletal muscle; cerebellum. Axes include rank in tissue and TSI score (x 10^4).]

Name      Description                                                  TSI score  TSI rank  Tissue-specific rank  Comments
Skeletal muscle: high muscle-specific rank; high TSI
Hrc       histidine rich calcium binding protein                       12732      13144     12971                 Functions in Ca(2+) release during muscle contraction (9)
Trim54    tripartite motif-containing 54                               12662      13143     12803                 Function: myogenic differentiation; disease phenotype: muscle atrophy (10)
Skeletal muscle: high muscle-specific rank; neutral TSI
Ndufa1    NADH dehydrogenase (ubiquinone) 1 alpha subcomplex, 1        1          7034      13058                 Ubiquitous; mitochondrial electron transport chain
Txnip     thioredoxin interacting protein                              1          7032      13004                 Ubiquitous; regulates cellular redox state
Skeletal muscle: moderate muscle-specific rank; high TSI
Mpz       Mus musculus myelin protein zero                             6587       12983     7796                  Myelin component; deficiency: muscle demyelination (11)
Cabc1     Mus musculus chaperone, ABC1 activity of bc1 complex like    7081       13007     7656                  p53/apoptosis; Northern blot shows ubiquitous expression, but enriched in skeletal muscle and heart (12)
Skeletal muscle: moderate muscle-specific rank; neutral TSI
Abcc8     ATP-binding cassette, sub-family C (CFTR/MRP), member 8      0          7025      5831                  No specificity for muscle; known function in beta cells of pancreatic islets
Gtf2ird2  GTF2I repeat domain containing 2                             0          7027      7282                  Basally transcribed in a wide range of tissues
Figure 5a. Tissue-specific and tissue-average ranks for cerebellum and skeletal muscle, two
example tissues in which those ranks were very highly correlated. TSI scoring readily identifies
outlier genes with tissue-specific function at high and moderate expression levels. Two genes
with highest and two genes with lowest (absolute) skeletal muscle TSI were chosen, first from the
top decile of skeletal muscle expression and then from the middle decile.
[Figure 5b: Testis. Axes include rank in testis and TSI score (x 10^4).]
Figure 5b. Testis was one of four tissues displaying higher
deviance from average ranks; however, a cluster of testis-specific
genes was still evident and assigned high TSI score.
TSI scoring identifies genes with specific expression and function in the tissues we
examined. Even so, we remain concerned that this method's sensitivity is suboptimal for
highly-correlated tissues, and propose several improvements (see "Future Directions").
3.3.3. Tissue-Specific Depletion of microRNA Target Sites
Genes with high TSI scores were enriched for tissue-specific function, and many are
likely to be required for proper cellular function in the given tissue. Indeed, some of the
high-scoring genes identified in muscle (e.g., Hrc, Mpz) have severe deficiency
phenotypes and therefore are likely to come under considerable selective pressure to
avoid miRNA-mediated knockdown. Consequently, we expected to observe significant
depletion among high-scoring genes for seed matches to co-expressed miRNAs. Farh et
al (1) recently measured a similar depletion effect among highly (though not necessarily
specifically) expressed genes.
Computational (13, 14) and experimental (15) studies have identified Watson-Crick
pairing to the miRNA seed region (bases 2-7) as the primary determinant of specificity in
miRNA targeting. We searched 3' UTRs of mouse genes for depleted counts of
heptamers reverse complementary to each of 187 mouse miRNA families at these
positions as well as at bases 1-7.
Selected miRNAs with strong signals for targeting depletion are shown in Figure 6 (non-brain
tissues; full set of 181 miRNAs shown in Figure A1). Many miRNAs' strongest
depletion effects coincide with their reported tissues of expression, reinforcing the
hypothesis (1, 3, 16) that preferentially-coexpressed messages avoid accumulation of
these target sites. For example, miR-1/-206 and miR-133 are well-known to be
specifically expressed in muscular tissue (17), and show very strong depletion for
targeting among genes specifically expressed in skeletal muscle (P < 7.6x10^-12 and
P < 5.5x10^-10, respectively), heart (P < 5.7x10^-6 and P < 9.2x10^-5), and to a lesser extent,
another muscular tissue, tongue (P < 0.0085 and P < 0.0097).
We likewise found very strong depletion for targeting among specifically brain-transcribed
messages by a non-overlapping set of 37 miRNAs, many of which are known
to be expressed specifically in the brain (Figure 7). The strongest pattern of depletion
was seen for miR-124a, particularly in frontal cortex (P < 1.2x10^-24), olfactory bulb (P <
2.1x10^-20), cerebral cortex (P < 2.1x10^-9), and cerebellum (P < 1.9x10^-20). Highly significant
depletion was observed for miR-124a targeting in all other measured tissues of the central
and peripheral nervous system. Only one other tissue, testis, was found to be depleted for
miR-124 targeting, and only weakly so (P < 0.0026). This may be a false positive or a signal
from weak and as-yet undocumented expression there; either way, it demonstrates a
high degree of specificity in estimation of target depletion, at least for highly expressed
miRNAs such as miR-124a.
[Figure 6: Selected miRNAs depleted for targeting in non-brain tissues. Scale: -log10 depletion P-values (one-sided KS).]
Figure 6. Selected miRNAs with significant targeting depletion among genes preferentially
expressed in various tissues are shown, manually arranged by their tissues of depletion.
Depletion profiles for the full set of miRNAs are shown in Figure A1.
[Figure 7: MicroRNAs depleted for targeting among brain tissues. Scale: -log10 depletion P-values (one-sided KS).]
Figure 7. 37 miRNAs showed depletion for targeting among genes specifically expressed in the
brain.
Some miRNAs have depletion signals that differ sharply between highly related tissues,
suggesting that this method can resolve differences in miRNA targeting on a
physiologically fine scale. For instance, mammalian microarray (18) and zebrafish in situ
experiments (19) showed that miR-138 is expressed in the brain. However, we find
significant targeting depletion specifically in the trigeminal and dorsal root ganglia (P <
2.8x10^-6 and P < 1.7x10^-8) and not in any other nervous system tissues, suggesting that
miR-138 plays a specialized role in these tissues.
3.3.3.1. Weak Depletion Signals also Coincide with microRNA Expression
Some miRNAs had weak depletion signals (0.1 > P > 0.01) that nevertheless matched
their reported expression patterns. Returning to miR-138, we observed a signal (P <
0.05) in B220+ B cells, which undergo maturation in bone marrow, where miR-138 is
weakly expressed in mammals (19). Similarly, we found no tissues significantly depleted
for miR-217 targeting, but obtained weak signals for the following: pancreas (P< 0.04),
large intestine (P < 0.08), salivary gland (P < 0.04), and frontal cortex (P < 0.03).
Although not reaching statistical significance, these agreed with hybridization evidence
showing miR-217 expression in the brain, spinal cord, eyes, and the pancreas (specifically
in exocrine cell populations). Furthermore, miR-217 was only weakly detected by in situ
hybridization, suggesting high sensitivity even to subtle depletion effects such as for
targeting by tissue-specific miRNAs present at low copy numbers.
Another miRNA with a weak depletion signal was miR-134, which was recently
implicated in neuronal development in the rat hippocampus (20). In fact, we observed a
weak depletion signal for miR-134 in hippocampus (P < 0.0196), second only to
blastocysts (P < 0.0112). The weak depletion seen for miR-134 targeting could be a
consequence of its relatively low expression level (compared to miR-124) or could reflect
a limited potential to target genes due to its specific localization in the synapto-dendritic
compartment.
3.3.3.2. Signal-to-Noise Estimation
We drew additional samples from the background depletion distribution of each miRNA
target heptamer pair in order to estimate the signal-to-noise ratio for the depletion
analysis. MicroRNA seed matches significantly depleted among these background
samples were deemed false positives. As Figure 8 shows, the depletion analysis achieved
positive signal over noise for all P-value cutoffs less than 0.32 (10^-0.5).
[Figure 8: Estimation of false discovery rate for miRNA-target depletion analysis. Horizontal axis: significance cutoff (-log P-value).]
Figure 8. The number of significant tissue-heptamer pairs, constituting positive predictions, is
shown along with the number of control instances deemed positive for each P-value cutoff. At a
cutoff of 10^-1, signal:noise is 1.53; at 10^-2, signal:noise is 3.74. 10,000 shuffled controls were
drawn from the background distribution and the number of false positives was normalized by
dividing by the ratio of shuffles drawn to the number of tissues tested (10,000 / 61).
We conclude that a P-value cutoff of 0.001 was, if anything, conservative. Erroneous
results (significant depletion of targeting not coinciding with miRNA expression) are
more likely to arise from basic limitations of the method (e.g., its inability to resolve
non-tissue-specific effects) than from sampling variability. The choice of P-value cutoff
remains somewhat arbitrary, and there are few, if any, miRNAs with precisely known
expression patterns that could be used to find the optimal cutoff.
3.3.3.3. Comparison to Experimentally Determined MicroRNA Expression
We systematically compared our estimates of miRNAs' targeting depletion across tissues
with their expression patterns as determined by Wienholds et al (19). In their study,
chemically-modified miRNA-targeting probes were hybridized in situ to zebrafish
embryo slices. The method attained sufficiently high binding specificity to
discern between miRNAs differing in sequence by as little as one base, revealing their
patterns of expression to the sub-tissue level in some cases.
Note that by comparing depletion value estimates to expression levels measured in
zebrafish embryos, we make several assumptions. First, there may not in all cases be a
direct correspondence between miRNA expression (even if perfectly tissue-specific) and
depletion of seed-match targeting. To bridge this gap, we first accept the growing
consensus that seed matching is the primary means of metazoan miRNA targeting, and
further assume that no subset of miRNAs systematically target messages by alternate or
complementary means. Studies have demonstrated 3' pairing as an additional targeting
mechanism in specific cases (15), although the extent of such targeting may be limited
(13,21).
Beyond that, there are several difficulties. First, we used gene expression data from
mostly mature tissues, whereas the hybridization study used embryonic samples. The
authors commented, however, that miRNAs' primary role may generally be "not in tissue
fate establishment but in ... maintenance of tissue identity". If so, then their expression
levels may be comparable between late embryonic and mature stages. Second, physiological
differences between zebrafish and mammals are an additional complication, making
the comparison most relevant for ancient miRNAs predating specialized mammalian
physiological features.
Nevertheless, the zebrafish hybridization results generally agreed well with microarray
data for mammalian miRNAs and tissues, while resolving tissue-specific differences on a
much finer basis, making it the best dataset presently available with which to verify our
results. Table 2 shows a comparison between our targeting depletion estimates and the
expression data for miRNAs found to be highly tissue-specific in the hybridization study.
[Table 2. Comparison of predicted targeting depletion with experimentally determined expression for miRNAs found to be highly tissue-specific in the hybridization study.]
In only seven of 49 cases did the predicted targeting depletion contradict the zebrafish in
situ and mammalian microarray expression measurements (miR-34a, miR-103/-107,
miR-199a*, miR-184, miR-9*, miR-203, miR-204). In at least one of these cases, miR-34a, we did
not detect depletion in the tissue of strongest reported expression - brain - but did so in
lung (P < 1.9x10-3), another tissue reported (18) to contain that miRNA, though
at lower abundance. Several other cases were ambiguous, mostly for miRNAs expressed
in zebrafish tissues not represented on the GNF expression atlas and for which no
mammalian microarray measurements were available.
Even if all ambiguous cases are regarded as missed predictions, miRNAs' patterns of
target depletion show remarkable agreement with their tissue-specific expression. That
the depletion statistics were estimated using gene expression in adult tissues reinforces
the hypothesis that miRNAs activated during embryonic development continue to be
expressed and remain in effect, maintaining tissue identity in the adult organism.
3.3.3.4. Comparison to Results of Farh et al.
The present study was motivated in part by the notable finding of Farh and colleagues (1)
that mammalian genes have evolved under considerable pressure to avoid targeting by
coexpressed miRNAs. While we searched for and report a similar effect, our approach
differs in several ways.
Firstly, we measure depletion of targeting by a particular miRNA as the cumulative
number of target sites - rather than the proportion of targeted messages - below
expectation. When measuring enrichment of targeting, this allows us to capture extra
signal when a single miRNA targets a message in several places. Some evidence
suggests that miRNA targeting is cooperative in this fashion (15). When measuring
targeting depletion, using counts allows us to better resolve messages that under neutral
evolution would be expected to have two or three seed matches given their UTR length
and composition, but have avoided all of them.
Another consequence of this difference is that our estimation of the background targeting
level is deterministic, allowing us to calculate the depletion statistic without introducing
sampling variance. Lastly, by using sequence feature counts directly rather than
imposing a Poisson event model, we avoid making an independence assumption between
potentially overlapping sequence features of interest.
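The count-based statistic described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that a seed match is an exact occurrence of a short site in the 3' UTR and that the expected count comes from a zero-order (mononucleotide) composition model; neither the function names nor this background model are taken from the thesis.

```python
from collections import Counter

def count_sites(utr: str, site: str) -> int:
    """Count exact (possibly overlapping) occurrences of `site` in `utr`."""
    k = len(site)
    return sum(1 for i in range(len(utr) - k + 1) if utr[i:i + k] == site)

def expected_sites(utr: str, site: str) -> float:
    """Expected site count under an ASSUMED mononucleotide composition model."""
    n, k = len(utr), len(site)
    if n < k:
        return 0.0
    freq = Counter(utr)
    p = 1.0
    for base in site:
        p *= freq[base] / n          # probability of this base at one position
    return (n - k + 1) * p           # number of start positions times p

def cumulative_deficit(utrs, site):
    """Sum over messages of (expected - observed) site counts."""
    return sum(expected_sites(u, site) - count_sites(u, site) for u in utrs)

# Hypothetical toy sequences, not real UTRs.
print(cumulative_deficit(["ACGTACGT", "TTTTGGGG"], "AC"))
```

Under this assumed model the expectation is a closed-form function of each UTR's length and composition, so the deficit can be computed without drawing shuffled sequences, and a message expected to carry two or three sites but carrying none contributes its full expected count to the total.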
The other major difference between our approach and that of Farh et al is our use of the
Tissue-Specificity Index (TSI), which may be more effective in assigning high rank to
genes which come under tissue-specific pressure to avoid seed matches to coexpressed
miRNAs.
Despite these differences, we arrive at a set of depletion patterns that are, on the whole,
very similar. There are a few notable differences, however. Among them was miR-125,
the mammalian lin-4 ortholog found in the brain. We predict significant depletion for
miR-125 targeting in twelve CNS tissues (see Figure 7), most strongly in amygdala (P <
1.1xl10I ") and lower spinal cord (2.3x10-9). In no other tissues did we find a significant
depletion signal, and in only one (embryo day 10.5, P < 0.05) did we detect any signal at
all. In contrast, Farh et al report significant depletion of miR-125 targeting in three
embryonic stages (blastocyst, 8.5, 9.5d) but not in any of the adult CNS tissues.
In situ staining of 9.5 day-old mouse embryos reveals that miR-125b is specifically
expressed at the midbrain-hindbrain region (22), supporting a role in neuronal
differentiation. However, experimental studies have found miR-125 to be highly
abundant in adult brain tissue, by cloning frequency more abundant than any other brain
miRNA (23). A microarray study (18) directly compared the levels of miR-125a and miR-125b in adult brain tissues and various embryo stages and found both miRNAs to be
present at much higher levels in the former, placing them (along with miR-222) in
a "late brain" cluster of miRNAs.
Finally, we compared our depletion results with the northern-blot assays for miRNA
expression level reported by Farh et al (1). In Figure S1 of that publication, miR-142-3p expression
level is shown to be highest in CD8+ T cells, followed by CD4+ T cells and B cells, in
that order. Our estimates of targeting depletion recapitulate the order of expression in these
three tissues exactly (P < 2.7x10-6, 1.2x10-5, 4.39x10-3, respectively). We also find
statistically significant targeting depletion of miR-124 and miR-7 in exactly the tissues
showing expression of those miRNAs in their Figure 4D.
3.3.4. Tissue-Specific Enrichment for MicroRNA Target Sites
MicroRNAs collectively mediate repression of thousands of target genes (13). While
some of these targets are strongly downregulated with switch-like phenotypic effects,
many more may simply be downregulated as a safety mechanism. Stark and coworkers
(3) recently proposed that miRNAs serve to reinforce the fidelity of tissues'
transcriptional programs and thus help to maintain tissue identity more than to actually
determine it, consistent with the observation that many tissue-specific miRNAs are
expressed after cell fate is determined (19). Under this model, genes that should not be
expressed in a given tissue context come under positive selective pressure to accumulate
target sites for miRNAs active in that context to ensure attenuation of any leaky
transcription.
Stark et al measured tissue-specific targeting enrichment among Drosophila genes with
annotated functional categories. Genes having epidermal, tracheal, or digestive function
were enriched for target sites of miR-124, which is highly expressed in brain tissues,
where expression of such genes could have severe phenotypic consequences. Similarly, genes with
functions in ventral sensory, PNS, and digestive tissues were found to be enriched for
targets of miR-1, which is expressed in muscle tissue.
We repeated the same analysis that we used to measure tissue-specific depletion of
miRNA targeting, this time inverting the direction of the KS test in order to measure
targeting enrichment.
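Inverting the direction of a one-sided KS test amounts to testing the opposite signed deviation between two empirical distribution functions. The sketch below is a generic two-sample version in plain Python; what the two samples contain (for example, tissue-specificity scores of motif-bearing genes versus a background set) and which direction corresponds to depletion are assumptions left to the caller, and the P-value uses the standard large-sample approximation rather than whatever exact procedure the analysis used.

```python
import math

def ks_one_sided(sample, background):
    """Two-sample one-sided KS statistics with asymptotic P-values.

    d_plus  = max over x of F_sample(x) - F_background(x)
    d_minus = max over x of F_background(x) - F_sample(x)
    Each P-value is approximated by exp(-2 * m * D**2),
    where m = n1 * n2 / (n1 + n2).
    """
    xs, ys = sorted(sample), sorted(background)
    n1, n2 = len(xs), len(ys)
    i = j = 0
    d_plus = d_minus = 0.0
    for v in sorted(set(xs) | set(ys)):
        while i < n1 and xs[i] <= v:   # advance empirical CDF of sample
            i += 1
        while j < n2 and ys[j] <= v:   # advance empirical CDF of background
            j += 1
        diff = i / n1 - j / n2
        d_plus = max(d_plus, diff)
        d_minus = max(d_minus, -diff)
    m = n1 * n2 / (n1 + n2)
    p_plus = math.exp(-2.0 * m * d_plus ** 2)
    p_minus = math.exp(-2.0 * m * d_minus ** 2)
    return d_plus, d_minus, p_plus, p_minus
```

Reading `p_plus` in one analysis and `p_minus` in the other reuses the same machinery for both directions, which is the sense in which the test is "inverted".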
As for depletion, we find a significant signal-to-noise ratio (3.48 at P-value cutoff of
0.01; Figure 9) for enrichment of targeting, suggesting similarly widespread impact on
the mammalian transcriptome.
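As a rough illustration of how a signal-to-noise figure at a given cutoff might be computed, the sketch below assumes that "signal" is the number of tests on real miRNAs passing the cutoff and "noise" is the number of tests on control motifs passing it. That definition, the function name, and the example P-values are all hypothetical; the procedure actually used is the one described for target depletion.

```python
def signal_to_noise(real_pvalues, control_pvalues, cutoff):
    """Ratio of real to control tests with P-value below `cutoff`.

    The control count is rescaled in case the two sets differ in size.
    Returns None when no control test passes, since the ratio is undefined.
    """
    if not control_pvalues:
        raise ValueError("control_pvalues must not be empty")
    signal = sum(p < cutoff for p in real_pvalues)
    noise = sum(p < cutoff for p in control_pvalues)
    noise_scaled = noise * len(real_pvalues) / len(control_pvalues)
    return signal / noise_scaled if noise_scaled else None

# Hypothetical P-values for illustration only.
real = [0.001, 0.005, 0.02, 0.5]
control = [0.004, 0.3, 0.6, 0.9]
print(signal_to_noise(real, control, 0.01))
```

Evaluating such a ratio over a range of cutoffs gives a curve of the kind plotted against the significance cutoff in Figure 9.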
[Figure 9 plot: "Estimation of false discovery rate for miRNA-target enrichment analysis"; x-axis: significance cutoff (-log10 P-value).]
Figure 9. False discovery was estimated in the same manner as for target depletion. Enrichment
analysis has estimated signal:noise of 1.56 at P-value cutoff of 0.1, and 3.48 at 0.01.
We find 50 miRNAs with significant target enrichment among genes highly expressed in
3 or more tissues, consistent with the pattern of mutual exclusivity between miRNA and target
expression proposed by Stark et al. Clusters of miRNAs enriched for targets in the
brain, embryo, and epidermal tissues are evident (Figure 10).
[Figure 10 image: "Tissues complementary to microRNA expression display enrichment for targeting"; scale: -log10 enrichment P-values (one-sided KS).]
Figure 10. MicroRNAs show tissue-specific patterns of target enrichment generally mutually
exclusive with their expression contexts. Clustering reveals groups of miRNAs enriched for
preferentially brain, epidermal, and embryonic targets. Enrichment values for full set of mouse
miRNAs are shown in Figure A2.
We found considerable overlap with the enrichment results reported by Stark et al for two
of the three fly miRNAs having mouse orthologs. The muscle-associated miR-1 was
enriched for targeting in olfactory sensory tissues of the mouse, similar to its enrichment
in fly genes with PNS and sensory functions. For miR-124, we observed very strong
targeting enrichment in gut, epidermal, and tracheal tissues, again overlapping the
functional categories enriched in fly.
Lastly, we noted a cluster of 7 miRNA families that show temporally-phased enrichment
for genes preferentially expressed in the developing mouse embryo. The strongest of
these, let-7, is known to control stage-specific development in C. elegans (24), and is
conserved to mammals, in which it has been postulated to mediate differentiation during
development (25).
3.4. Conclusion and Future Directions
We have presented a statistical method to measure motif enrichment or depletion among
genes with tissue-specific expression. We applied this to miRNA targeting of 3' UTRs,
reaffirming the recent reports (1, 3) that miRNA target sites are depleted among genes
preferentially coexpressed in the same tissue. Although we did not attempt experimental
validation of our results, we note that they recapitulate results from in situ hybridization
and microarray (18, 19, 22) studies with great accuracy. Numerous miRNAs for which
we predict tissue-specific targeting depletion have not been fully characterized and may
be subjects of novel tissue-specificity predictions. Lastly, we showed that tissue-specific
enrichment for miRNA targeting is at least as widespread as depletion, possibly helping
cells "clean up" after aberrant transcription, thus supporting but not determining tissue
identity.
We intend to implement several methodological improvements and apply this method to
several new problems. First, we are concerned that the TSI scoring method is not robust
to the introduction of numerous highly-correlated tissues. The TSI score is obtained by
comparing a gene's tissue-specific expression rank to its median rank across tissues.
Introducing many similar tissues inflates the median rank for genes highly expressed
among them, degrading the method's sensitivity to all such tissues and also causing
unwanted effects on the predictions in other tissues. To correct for this, we will employ a
clustering-based distance metric between tissues and use it to recalculate a different
average rank from the perspective of each tissue that downweights expression ranks in
highly correlated tissues. This will allow us to obtain greater resolution on tissues that
only differ in expression of a subset of genes, such as embryonic development or immune
cell differentiation stages. Having revised the average rank measure, we will turn to a
more empirically-motivated fit to determine the TSI score taking into account the local
variability around the linear model.
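The sensitivity of a median-based baseline to correlated tissues, and the effect of downweighting them, can be seen in a small numerical sketch. Everything here is hypothetical: the ranks are invented fractional ranks (1.0 = most highly expressed), the score is simplified to "focal rank minus baseline", and the weights 1 - similarity stand in for the clustering-based distance metric, which is not specified in detail.

```python
from statistics import median

def rank_excess(ranks, focus):
    """Focal-tissue rank minus the gene's median rank across all tissues."""
    return ranks[focus] - median(ranks.values())

def downweighted_excess(ranks, focus, similarity):
    """Focal-tissue rank minus a weighted mean rank of the other tissues.

    Each other tissue gets weight 1 - similarity[tissue], so tissues that
    closely resemble the focal one contribute little to the baseline.
    """
    others = [t for t in ranks if t != focus]
    weights = {t: 1.0 - similarity[t] for t in others}
    total = sum(weights.values())
    baseline = sum(weights[t] * ranks[t] for t in others) / total
    return ranks[focus] - baseline

# Hypothetical gene that is brain-specific in a small tissue panel.
few = {"brain": 0.95, "liver": 0.20, "lung": 0.30, "heart": 0.25, "kidney": 0.30}
# Same gene after adding five brain-like tissues to the panel.
brain_like = ("cortex", "amygdala", "cerebellum", "hippocampus", "spinal_cord")
many = dict(few, cortex=0.93, amygdala=0.94, cerebellum=0.92,
            hippocampus=0.90, spinal_cord=0.91)
# Hypothetical similarity of each tissue to "brain" (1 = identical profile).
sim = {t: (0.9 if t in brain_like else 0.1) for t in many if t != "brain"}

print(rank_excess(few, "brain"))
print(rank_excess(many, "brain"))
print(downweighted_excess(many, "brain", sim))
```

In this toy panel the median-based excess falls from 0.65 to 0.045 once the brain-like tissues are added, while the downweighted baseline keeps it at roughly 0.61, which is the kind of robustness the proposed correction is meant to provide.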
With these enhancements in place, we are eager to pursue several follow-up
investigations. Firstly, we will be able to profile targeting depletion - and thus predict
miRNA activity - in much larger tissue/condition panels. Because this method is
nonparametric in the actual microarray data, relying only on ranks, we are free to
integrate heterogeneous measurements from different labs, reports, and even array
platforms. Secondly, we believe that this method can be readily applied to discover
tissue-specific miRNAs that may be difficult to detect through experimental means
because of low abundance or highly-specialized expression. Lastly, we expect this
method will be applicable to evaluation or discovery of other classes of motifs such as the
splicing regulatory elements.
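Because only ranks enter the method, measurements on very different scales can be placed on a common footing before analysis. The helper below is a minimal sketch of such a rank transform (ties share their average rank); the function name and the example intensities are hypothetical.

```python
def to_fractional_ranks(values):
    """Map values to ranks in (0, 1]; tied values share their average rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    pos = 0
    while pos < n:
        end = pos
        # Extend the group while the next value ties with the current one.
        while end + 1 < n and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg_rank = (pos + end) / 2 + 1      # average 1-based rank of the group
        for k in range(pos, end + 1):
            ranks[order[k]] = avg_rank / n
        pos = end + 1
    return ranks

# Two hypothetical arrays measuring the same three genes on different scales.
platform_a = [10.0, 200.0, 3000.0]
platform_b = [0.1, 0.5, 7.2]
print(to_fractional_ranks(platform_a) == to_fractional_ranks(platform_b))
```

Any monotone rescaling of an array leaves its ranks unchanged, which is why arrays from different labs or platforms can be combined at the rank level without cross-platform normalization of the raw intensities.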
The primary weakness of this method is its inability to discover motifs such as ubiquitous
miRNAs with uniform depletion or enrichment across tissues. Yet this is also its greatest
strength, as it is what confers discriminative power between different tissues. Tissue-specific regulatory effects of cellular processes such as miRNA-mediated silencing or
alternative splicing are generally more difficult to study by experimental means than are
ubiquitous effects, making computational approaches such as the present work a natural
complement.
3.5. References Cited
1. Farh KK, Grimson A, Jan C, Lewis BP, Johnston WK, Lim LP, Burge CB, Bartel DP.
The Widespread Impact of Mammalian MicroRNAs on mRNA Repression and
Evolution. Science. 2005 Dec 16;310(5755):1817-21.
2. Lindquist S, Craig EA. The Heat-Shock Proteins. Annu Rev Genet. 1988;22:631-77.
3. Stark A, Brennecke J, Bushati N, Russell RB, Cohen SM. Animal MicroRNAs Confer
Robustness to Gene Expression and have a Significant Impact on 3'UTR Evolution. Cell.
2005 Dec 16;123(6):1133-46.
4. Su AI, Wiltshire T, Batalov S, Lapp H, Ching KA, Block D, Zhang J, Soden R,
Hayakawa M, Kreiman G, Cooke MP, Walker JR, Hogenesch JB. A Gene Atlas of the
Mouse and Human Protein-Encoding Transcriptomes. Proc Natl Acad Sci U S A. 2004
Apr 20;101(16):6062-7.
5. Hinrichs AS, Karolchik D, Baertsch R, Barber GP, Bejerano G, Clawson H, Diekhans
M, Furey TS, Harte RA, Hsu F, Hillman-Jackson J, Kuhn RM, Pedersen JS, Pohl A,
Raney BJ, Rosenbloom KR, Siepel A, Smith KE, Sugnet CW, Sultan-Qurraie A, Thomas
DJ, Trumbower H, Weber RJ, Weirauch M, Zweig AS, Haussler D, Kent WJ. The UCSC
Genome Browser Database: Update 2006. Nucleic Acids Res. 2006 Jan 1;34(Database
issue):D590-8.
6. Griffiths-Jones S, Grocock RJ, van Dongen S, Bateman A, Enright AJ. MiRBase:
MicroRNA Sequences, Targets and Gene Nomenclature. Nucleic Acids Res. 2006 Jan
1;34(Database issue):D140-4.
7. Griffiths-Jones S. The microRNA Registry. Nucleic Acids Res. 2004 Jan
1;32(Database issue):D109-11.
8. Velculescu VE, Madden SL, Zhang L, Lash AE, Yu J, Rago C, Lal A, Wang CJ,
Beaudry GA, Ciriello KM, Cook BP, Dufault MR, Ferguson AT, Gao Y, He TC,
Hermeking H, Hiraldo SK, Hwang PM, Lopez MA, Luderer HF, Mathews B, Petroziello
JM, Polyak K, Zawel L, Kinzler KW. Analysis of Human Transcriptomes. Nat Genet.
1999 Dec;23(4):387-8.
9. Hong S, Kim TW, Choi I, Woo JM, Oh J, Park WJ, Kim do H, Cho C. Complementary
DNA Cloning, Genomic Characterization and Expression Analysis of a Mammalian
Gene Encoding Histidine-Rich Calcium Binding Protein. Biochim Biophys Acta. 2005
Mar 10;1727(3):188-96.
10. Meroni G, Diez-Roux G. TRIM/RBCC, a Novel Class of 'Single Protein RING
Finger' E3 Ubiquitin Ligases. Bioessays. 2005 Nov;27(11):1147-57.
11. Frei R, Motzing S, Kinkelin I, Schachner M, Koltzenburg M, Martini R. Loss of
Distal Axons and Sensory Merkel Cells and Features Indicative of Muscle Denervation in
Hindlimbs of P0-Deficient Mice. J Neurosci. 1999 Jul 15;19(14):6058-67.
12. Iiizumi M, Arakawa H, Mori T, Ando A, Nakamura Y. Isolation of a Novel Gene,
CABC1, Encoding a Mitochondrial Protein that is Highly Homologous to Yeast Activity
of bc1 Complex. Cancer Res. 2002 Mar 1;62(5):1246-50.
13. Lewis BP, Burge CB, Bartel DP. Conserved Seed Pairing, often Flanked by
Adenosines, Indicates that Thousands of Human Genes are microRNA Targets. Cell.
2005 Jan 14;120(1):15-20.
14. Lewis BP, Shih IH, Jones-Rhoades MW, Bartel DP, Burge CB. Prediction of
Mammalian microRNA Targets. Cell. 2003 Dec 26;115(7):787-98.
15. Brennecke J, Stark A, Russell RB, Cohen SM. Principles of microRNA-Target
Recognition. PLoS Biol. 2005 Mar;3(3):e85.
16. Lai EC. MicroRNAs: Runts of the Genome Assert Themselves. Curr Biol. 2003 Dec
2;13(23):R925-36.
17. Lagos-Quintana M, Rauhut R, Yalcin A, Meyer J, Lendeckel W, Tuschl T.
Identification of Tissue-Specific microRNAs from Mouse. Curr Biol. 2002 Apr
30;12(9):735-9.
18. Thomson JM, Parker J, Perou CM, Hammond SM. A Custom Microarray Platform
for Analysis of microRNA Gene Expression. Nat Methods. 2004 Oct;1(1):47-53.
19. Wienholds E, Kloosterman WP, Miska E, Alvarez-Saavedra E, Berezikov E, de
Bruijn E, Horvitz HR, Kauppinen S, Plasterk RH. MicroRNA Expression in Zebrafish
Embryonic Development. Science. 2005 Jul 8;309(5732):310-1.
20. Schratt GM, Tuebing F, Nigh EA, Kane CG, Sabatini ME, Kiebler M, Greenberg
ME. A Brain-Specific microRNA Regulates Dendritic Spine Development. Nature. 2006
Jan 19;439(7074):283-9.
21. Lai EC. MiRNAs: Whys and Wherefores of miRNA-Mediated Regulation. Curr Biol.
2005 Jun 21;15(12):R458-60.
22. Kloosterman WP, Wienholds E, de Bruijn E, Kauppinen S, Plasterk RH. In Situ
Detection of miRNAs in Animal Embryos using LNA-Modified Oligonucleotide Probes.
Nat Methods. 2006 Jan;3(1):27-9.
23. Kim J, Krichevsky A, Grad Y, Hayes GD, Kosik KS, Church GM, Ruvkun G.
Identification of Many microRNAs that Copurify with Polyribosomes in Mammalian
Neurons. Proc Natl Acad Sci U S A. 2004 Jan 6;101(1):360-5.
24. Reinhart BJ, Slack FJ, Basson M, Pasquinelli AE, Bettinger JC, Rougvie AE, Horvitz
HR, Ruvkun G. The 21-Nucleotide Let-7 RNA Regulates Developmental Timing in
Caenorhabditis Elegans. Nature. 2000 Feb 24;403(6772):901-6.
25. Pasquinelli AE, Reinhart BJ, Slack F, Martindale MQ, Kuroda MI, Maller B,
Hayward DC, Ball EE, Degnan B, Muller P, Spring J, Srinivasan A, Fishman M, Finnerty
J, Corbo J, Levine M, Leahy P, Davidson E, Ruvkun G. Conservation of the Sequence
and Temporal Expression of Let-7 Heterochronic Regulatory RNA. Nature. 2000 Nov
2;408(6808):86-9.
Table A1. Genomic loci of 107 ~23-nt sRNAs cloned by Lee et al. Ninety-seven sequences map to twelve
distinct clusters, each on a separate genomic scaffold. Nearly all sRNA clusters are oriented
antisense to overlapping or nearby predicted genes.
sRNA #  Sequence (forward)  Diffs to genome  Coordinates  Overlapping / nearby gene ID and orientation
Cluster 0 (scaffold 8254028)
A->T
8254028 (22460-22482) +
8254028 (22804-22827) +
C->T
8254028 (22820-22842) +
AS 1038
AS 1038
TAGAATAATTATTTAATTGCTCA
A->C
8254028 (23191-23213) +
8254028 (23423-23445) +
TATTTTTTAGAATAAATCCTATAT
T->A
8254028 (24685-24708) +
AS 1038
0-0
TCATAAGGTATAGCTTCTTAGTA
0-1
TTCGAAGATTTACCCTTACTATAT
0-2
TACTATATTTGTGCGCTCTGTAC
0-3
TGTGACAATCTATTTATCATTAT
0-4
0-5
Cluster 1 (scaffold 8254557)
1-0
TTATCTCGAAGTTTTTCTCTTTGT
AS 1038
AS 1038
AS 1038
8254557 (44124-44147) -
AS 21693
1-1
1-2
1-3
TCTAGTTATGTCTTTATCTCGAAT
T->G
8254557 (44137-44160) -
AS 21693
TAATTCGCTTAATTTTGCTTTTAT
T->A
8254557 (44194-44217) -
AS 21693
TATTCACCATCAATCAGTTAGAT
8254557 (44638-44660) -
AS 21693
1-4
TAAACGTTCTTCTTGCATATCGTT
8254557 (45139-45162)-
AS 21693
1-5
TAACGAAGAGTATTTTCTCTTGT
8254557 (45549-45571) -
AS 21693
1-6
CTGTTGGATCTGTCAATTCTTTTA
8254557 (45902-45925) -
AS 21693
1-7
TAATCTCAGAGCTGTTCTAGATTT
8254557 (46576-46599) -
AS 21693
8254572 (1855-1877) 8254572 (7970-7992) -
AS 23309 (> 1 kb away)
AS 23311
8254597 (5178-5201) +
AS 26106 (> 1 kb away)
8254597 (5202-5224) +
AS 26106 (> 1 kb away)
Cluster 2 (scaffold 8254572)
2-0
TTCAATCCCAACTATTCGTTCAA
2-1
TAAATTATTTCATTATATTTAAT
Cluster 3 (scaffold 8254597)
3-0
3-1
TTATTATATTCACCTTTATTATAT
TAATTACTAAACCTATTTGATTT
3-2
GTAGCCTCTCTTTAAAAGCACGC
G->A
8254597 (5304-5326) +
AS 26106 (> 1 kb away)
3-3
TCTTTGATATCTTTAATATTTTT
T->A
8254597 (5432-5454) +
AS 26106 (> 1 kb away)
3-4
TAAACCAAAGCTTTCATCTAGCT
8254597 (5500-5522) +
AS 26106 (> 1 kb away)
Cluster 4 (scaffold 8254600)
4-0  TGCATACTATTAAGCTTATCCAT  8254600 (158620-158642) +  AS 4800
4-1  TCTATAATTAATAGTTCTACAGAT  T->A  8254600 (159859-159882) +  AS 4800
4-2  TCAACATGGTTTAAGTCTTCGAT  T->A  8254600 (160455-160477) +  AS 4800
4-3  TTGTCAATATTTTGTTGAATAAT  8254600 (160479-160501) +  AS 4800
4-4  TTCGTAGTTCTATCCCAGTACGT  8254600 (162017-162039) +  AS 4801
4-5  TCTACTATTGTTATTAGTTGTTT  T->A, T->C  8254600 (162447-162469) +  AS 4801
4-6  TATATTTATGCTATTTTAAATTGC  C->T  8254600 (162919-162942) +  AS 4801
4-7  TCAATTGTTGTATTTATTATCGT  8254600 (165328-165350) +  AS 4802
4-8  TACAAGGTCAATGCTTGGTTTTT  8254600 (165458-165480) +  AS 4802
4-9  TCATAAAGCATAATATTTTATAAT  8254600 (166559-166582) +  AS 4802
4-10  TATATCCATGTATAGTAGTTAAT  8254600 (166704-166726) +  AS 4802
4-11  TCTATATCAATATGAGATTATAT  8254600 (167186-167208) +  AS 4802
TGAAATACTTCCTATTTTTTTAA
8254617 (879961-879983)-
S 6834, AS 6835 (both >1kb
away)
TATGAAATACTTCCTATTTTTTT
8254617 (879963-879985)-
T->G
T->A
Cluster 5 (scaffold 8254617)
S 6834, AS 6835 (both >1kb
TTCAAAGAAGTGATATTCCCAAT
A->T
8254617 (880452-880474) -
TAACTTTTTAAAGTATAAAGTTGT
T->G
8254617 (880659-880682) -
away)
S 6834, AS 6835 (both >1kb
away)
S 6834, AS 6835 (both >1kb
away)
Cluster 6 (scaffold 8254638)
6-0
TCTTATTCTAATTTAAGATGAAAT
6-1
TCAATTGTTGTATTTATTATCGT
6-2
TACAAGGTCAATGCTTGGTTTTT
TCCAGGTATTTCATCATCATCTTT
6-3
8254638 (961835-961858) +
AS 8861
8254638 (962329-962351) +
AS 8861, 8862
T->G
8254638 (962492-962514) +
AS 8862
T->G
8254638 (963121-963144) +
AS 8862
AS 8862
T->A
6-4
TCATAAAGCATAATATTTTATAAT
8254638 (963617-963640) +
6-5
6-6
TACGGTTCTAACACATATCCGTGT
8254638 (963749-963772) +
AS 8862
8254638 (964060-964083) +
AS 8862
8254638 (964376-964398) +
AS 8862
TGGTGTATAATCAAAATCTTCACC
TCrGTTCTCATCTAATTCAATCT
C->G, T>C
AS 8862
TCAAATCTTATTCTTATTGAGATT
6-9
6-10
6-11
TATATAATACATCCACTTGTTATTA
C->T, A>G
8254638 (964747-964771) +
AS 8862
TATAATACATCCACTTGTTATTGT
C->T, T->C
8254638 (964749-964772) +
AS 8862
TCTATATCAATATGAGATTATAT
T->A
8254638 (965183-965205) +
AS 8862
6-12
TAATTACTGTGTTTTTTCACTATT
T->A
8254638 (965668-965691) +
AS 8862
6-13
6-14
6-15
TACTGTGTTTTTTCACTATAATT
T->A
8254638 (965672-965694) +
AS 8862
TAAATTCTATGCTTAGAAGTTCTT
T->A
8254638 (965990-966013) +
AS 8863
TATAGTACATTTTTCCAAAATGAT
T->A
8254638 (967380-967403) +
AS 8863
Cluster 7 (scaffold 8254659)
7-0
TGAAGTATGTATTCTTCGATTTT
8254659 (1290365-1290387) -
AS 11387
7-1
TGAATATTCTAATCATAAATATAA
A->T
8254659 (1290401-1290424) -
AS 11387
7-2
TCAATTTCATTTAATTTATTGAAA
A->T
8254659 (1290604-1290627) -
AS 11387
7-3
TAATGGATTGCATTTCCCAAATT
T->A
8254659 (1290681-1290703)-
AS 11387
7-4
TATGACTACATACAATCTTCATT
T->C
8254659 (1290918-1290940) -
AS 11387
7-5
TAGTGTTTTATGACTACATACAAT
8254659 (1290925-1290948) -
AS 11387
7-6
ATTGGAATATTTTTTGTTTCCTT
8254659 (1291024-1291046) -
AS 11387
7-7
TTCATACTTAATTAACAGTTTTA
8254659 (1291047-1291069) -
AS 11387
7-8
GTATATTCCCATGTATAGTGGTTT
8254659 (1291207-1291230) -
AS 11387
AS 11387
G,T->A
7-9
TATCTGTGTCATGATTCATTCTAT
8254659 (1291294-1291317) -
7-10
TCTTAGTTTTCTTAGCTATCTGT
8254659 (1291311-1291333)-
AS 11387
7-11
TGTTGTACCCTGATCGTTATCAT
8254659 (1291957-1291979)-
AS 11387
Cluster 8 (scaffold 8254678)
8-0  TATTTTCATTTAACAATCATTCAT  8254678 (17318-17341) -  AS 14809
8-1  TCATATTTATCAAATTCGGTATT  8254678 (18513-18535) -  AS 14810
8-2  TAAGAATATTTAAATCATATTTAT  8254678 (18526-18549) -  AS 14810
8-3  TAAAAAGGTATTTATTCATCTTT  8254678 (19121-19143) -  AS 14810
8-4  TTCAAGAACATTCCTATTGGTTT  8254678 (19498-19520) -  AS 14811
8-5  TATAACTATTGCTTAGCATTGAT  8254678 (19527-19549) -  AS 14811
8-6  TAAGAATCTTTTTTGTTTCTTCAT  8254678 (19780-19803) -  AS 14811
8-7  TGAATATTGTTAATCGTCTTTGCT  8254678 (19818-19841) -  AS 14811
8-8  TTAGACAATATAATTTGTCAAGAA  8254678 (20043-20066) -  AS 14811
8-9  TCTACAGCTATAGGACAACTAATT  8254678 (20332-20355) -  AS 14811
8254697 (272579-272602) -
AS 16336
8254697 (273688-273710) -
AS 16336
8254697 (274888-274911) -
AS 16337
8254697 (275109-275132) -
AS 16337
8254697 (275779-275802) -
AS 16337
8254697 (276041-276063) -
AS 16337
8254697 (276278-276302) -
AS 16337
8254697 (276439-276461) -
AS 16337
Cluster 9 (scaffold 8254697)
9-0
TATGTTACGTTCATAGTTCCAGCA
9-1
TATATAGTTCAATAAACTATCAC
9-2
TGAACAAAGATTAACAGTTCAATT
9-3
9-4
9-5
9-6
9-7
TTAAAATCCAAAACCTTAATTTTT
C->T
T->A
TATGCAAATTCTTTATATGTCTAA
TGAAAATCTAAAAGATTAAAATT
TAATATTTTATTTAAAAATATTTAT
T->A
T->A
TACTTTCCCAAATTATTTCTGAT
TCTTACAATGATTAATGATTTGT
9-8
Cluster 10 (scaffold 8254822)
8254697 (276577-276599)-
AS 16337
10-0
TCATCTTCTGTAAAAGATAGTAT
10-1
TCTAGATTTCCTTGTTTTCTTGT
8254822 (133383-133405) +
8254822 (134096-134118) +
AS 14971
10-2
TACTAATTTATTCGCATAAATAAA
10-3
TAATCTCAGAGCTGTTCTAGATTT
10-4
TATTTAACGTATTGTTGTATTTTT
8254822 (135210-135233) +
AS 14971
10-5
TACTAAACAAGTCATAAAATTAGT
T->C
8254822 (136262-136285) +
AS 14971
10-6
TCTTAAATTCTCTTATTTTTCTT
T->A
8254822 (136673-136695) +
AS 14971
10-7
TCACGAAGATTAAATTTTTGCAT
8254822 (136729-136751) +
AS 14971
10-8
TATAGTCTGAATAATCTTCTAAAT
T->G
T->A
8254822 (136817-136840) +
AS 14971
10-9
TATCGTTATGTTTGCTCATTTAT
T->A
8254822 (137018-137040) +
10-10
TTCGCCAATATTTTCCATTGCGAT
T->A
8254822 (137401-137424) +
AS 14971, 14972
AS 14971, 14972
A->T
8254823 (262686-262708) -
AS 15059
A->G
8254823 (262710-262732) -
AS 15059
8254822 (134516-134539) +
T->G
8254822 (134571-134594) +
AS 14971
AS 14971
AS 14971
Cluster 11 (scaffold 8254823)
11-0
11-1
TACTATTATTATTTCCCCTTATA
TACTTTTGAGGTTACTCTTGAGA
Non-clustered sRNAs
12-0
TAATATTTTATTTAAAAATATTTAT
8254010 (344454-344478) -
S 30245
12-1
TAAGGCCTACCTTATGGTTTTTT
8254233 (2710-2732) -
S 3315
12-2
TCTAATAAACTCCTCAACTTTTAA
8254284 (170445-170468) +
12-3
TCAGTGTTTTACTTTGCTTTCCT
AS 9013, 9014
AS 18483,18484
12-4
CCCTTATACTCATGGCGCTAAACT
8254515 (228925-228948) +
S 20780
12-5
AGGATGAATTGAATTGTTTACC
8254545 (665724-665745) +
AS 25918
12-6
TCTTTTCTTAAGCAACAAGTCAT
8254594 (262178-262200) -
12-7
AATACTGGCCACTGCTCAATTAG
S 9778
AS 12216, S 12217 (both w/in 1kb)
12-8
TAAATTCTATGCTTAGAAGTTCTT
12-9
GCTCTGCTATTCTAGTCTGACACT
T->C
8254495 (340176-340198) -
8254649 (100250-100272) +
8254661 (32971-32994) +
G->A
T->A
8254737 (177118-177141) -
S 25563
S,SRP 7SL RNA
Table A3. Locations of A-rich motif predicted near sRNA clusters by HMM trained on MEME
output.
sRNA Cluster ID  Pred. gene  Scaffold  Gene/motif Strand  HMM-predicted A-rich motif positions
0   1038   8254028  -1  22121-22178
1   21693  8254557   1  48059-48084
2   23309  8254572   1  4987-5067
2   23311  8254572   1  9533-9546, 9512-9531, 9548-9593, 9637-9660
4   4800   8254600  -1  158149-158203
4   4801   8254600  -1  162623-162643, 161780-161812, 161814-161832, 161756-161778
4   4802   8254600  -1  165199-165221, 165223-165239, 165243-165277
6   8861   8254638  -1  961147-961207, 961123-961145, 961009-961039
6   8862   8254638  -1  962199-962221, 962223-962273
6   8863   8254638  -1  965750-965791, 965795-965830
7   11387  8254659   1  1292754-1292779, 1292888-1292907, 1292781-1292802, 1292734-1292752
8   14809  8254678   1  17382-17460, 16738-16755, 17683-17731, 17733-17752
8   14810  8254678   1  19254-19333
8   14811  8254678   1  21232-21253, 20749-20832
9   16336  8254697   1  None
9   16337  8254697   1  276477-276494, 277071-277100, 276991-277007, 276558-276582
10  14971  8254822  -1  132986-133008, 134472-134479
11  15059  8254823   1  263899-263978, 264380-264401
Table A4. Predicted genes with high-scoring downstream A-rich motifs, grouped by paralogy.
Bold genes indicate novel predictions of sRNA activity.
sRNA-Associated Family I
4800
4801
4802
8861
8862
sRNA-Associated Family III
15058
14810
14811
14813
14809
sRNA-Associated Family IV
23309
23310
23311
1848
8220
23410
23409
23407
23408
Motif-Associated Family V (predicted sRNA activity)
9721
9739
9740
17099
Singleton sRNA and/or motif-associated genes
2261
1039
3676
1038
1353
5186
4571
4741
5720
5363
6435
7362
6390
7056
6308
10454
8796
10064
11046
9974
15951
15309
15421
15229
15238
17284
16830
16957
17969
16958
20945
21168
20500
20550
20748
22415
22418
22743
21693
21824
24638
24580
24723
24849
24874
27183
26513
26307
26779 27021
29694
29753
8863
11387
12216
15059
15060
8221
16957
16958
4272
5751
8539
11140
15981
18983
21495
22938
25094
27251
4283
6201
8741
11988
16023
19753
21583
22967
25268
27577
4302
6308
8758
14135
16308
20335
21639
23763
25646
29059
Preliminary sequence data for the T. thermophila scaffolds was obtained from The Institute for
Genomic Research website. Sequencing of the genome is supported by awards from the NIGMS
and NSF.
[Figure A1 image: "Depletion for microRNA targeting in different tissues"; scale: -log10[Depletion P-value (one-sided KS)].]
Figure Al. Tissue-specific patterns of depletion for targeting by the full set of 187 mouse
miRNA families, hierarchically clustered by P-values. (Figure tiled over 3 pages)
[Figure A2 image: "Enrichment for microRNA targeting in different tissues"; scale: -log10[Enrichment P-value (one-sided KS)].]
Figure A2. Tissue-specific patterns of enrichment for targeting by the full set of 187
mouse miRNA families, hierarchically clustered by P-values. (Figure tiled over 3 pages)