Computational Prediction of RNA-based Gene Regulatory Mechanisms in Human and Tetrahymena

by Jacob O. Kitzman

Submitted to the Department of Electrical Engineering and Computer Science in Partial Fulfillment of the Requirements for the Degree of Master of Engineering in Electrical Engineering and Computer Science at the Massachusetts Institute of Technology

February 9, 2006

Copyright © 2006 Jacob O. Kitzman. All rights reserved.

The author hereby grants to M.I.T. permission to reproduce and distribute publicly paper and electronic copies of this thesis and to grant others the right to do so.

Author: Department of Electrical Engineering and Computer Science, February 9, 2006
Certified by: Christopher B. Burge
Accepted by: Arthur C. Smith, Chairman, Department Committee on Graduate Theses

ABSTRACT

The diversity and profound impact of gene regulation mediated by small RNAs (sRNAs) is just beginning to come into focus. RNA interference (RNAi) pathways have been shown to mediate processes such as genomic rearrangement in ciliates and developmental timing and tissue differentiation in plants and animals. Here we present a computational study of the function of two distinct classes of sRNAs. In the first section, we examine an uncharacterized class of sRNAs isolated from the ciliate Tetrahymena thermophila, present a functional comparison to known classes of sRNAs in other organisms, and note a strong and specific relationship to a novel sequence motif.
In the second section, we examine the evolutionary impact of microRNAs (miRNAs), which mediate potent posttranscriptional repression of their targets. We observe that miRNAs with tissue-specific expression exert remarkable evolutionary pressure, compelling many preferentially coexpressed genes to avoid accumulating target sites. We present tissue-specific patterns of such target depletion and note strong agreement with experimentally obtained miRNA expression patterns. Conversely, we report enrichment for targeting among genes with expression patterns spatially or temporally complementary to those of the miRNAs, suggesting a widespread role for miRNA-mediated regulation in maintaining tissue identity.

Thesis Supervisor: Christopher B. Burge
Title: Associate Professor, Department of Biology, M.I.T.

To my best friend, Maggie, with love.

Acknowledgements

I gratefully acknowledge Professor Chris Burge for granting me the privilege of studying in one of the most exciting centers of bioinformatics research among incredibly gifted colleagues. His mentoring, and that of my coworkers in the Burge lab, were an essential source of guidance and have greatly influenced my approach to scientific thinking. I also thank Dr. Kathleen Collins and Suzanne Lee, both of the University of California, Berkeley, for graciously providing Tetrahymena sRNA sequences prior to publication. Lastly, I wish to thank my mother Helen and father Craig, as well as my sisters Eva and Hannah, for their love, support, and work in making educational opportunities available to me over the many years.

Table of Contents

ABSTRACT
Acknowledgements
Table of Figures
Table of Tables

CHAPTER 1
1.1. Motivation
1.2. Background
1.2.1. The Central Dogma of Molecular Biology
1.2.2. Small RNAs Play Potent Regulatory Roles
1.2.3. Experimental Techniques
1.2.3.1. Gene Expression Microarrays
1.2.3.2. Small RNA Assays
1.3. References Cited

CHAPTER 2
Abstract
2.1. Overview: Tetrahymena thermophila
2.1.1. Introduction
2.1.2. Lifecycle and Nuclear Dualism
2.1.3. Macronuclear Genomic Characteristics
2.1.4. RNA-Guided Genomic Rearrangement
2.2. Overview: ~23nt sRNAs in Tetrahymena
2.2.1. Introduction
2.2.2. Small RNA Loci Cluster within the MAC Genome
2.2.3. Composition and Editing
2.3. Screen for RNA Secondary Structure
2.3.1. Overview
2.3.2. Methods
2.3.3. Results
2.4. Screen for trans-Targeting Potential
2.4.1. Overview
2.4.2. Methods
2.4.3. Results and Discussion
2.5. Small RNA Cluster Sequences are Related
2.5.1. Overview
2.5.2. Methods
2.5.2.1. Paralog Gene Families
2.5.3. Results
2.5.3.1. Gene Prediction Verification
2.5.3.2. Paralog Gene Families
2.5.3.3. Small RNA Clusters Are Highly Interrelated
2.5.3.4. Adjacent Genes Generally Tend to be Paralogous
2.5.3.5. Genes Overlapping sRNA Clusters Have Average Overall Paralog Counts
2.5.3.6. Rearrangement of Related sRNA Clusters
2.5.3.7. Individual sRNA positions are not conserved
2.5.3.8. Paralogy to non-sRNA Associated Loci
2.5.4. Discussion
2.6. A Novel Sequence Motif Is Strongly and Specifically Associated with ~23nt sRNA Activity
2.6.1. Overview
2.6.2. Methods
2.6.2.1. General Motif-Finding at sRNA Clusters
2.6.2.2. Motif-Finding Controls
2.6.3. Results
2.6.3.1. General Motif Finding Among sRNA-cognate genes
2.6.3.2. Discriminative Modeling of A-Rich Tracts
2.6.3.3. Discussion
2.7. Conclusion
2.8. References Cited

CHAPTER 3
Abstract
3.1. Introduction
3.2. Methods
3.2.1. Microarray Dataset Processing
3.2.2. Observed Sequence Features
3.2.3. Sequence Feature Background Model
3.2.4. Sequence Feature Sets
3.2.5. Tissue Specificity Index score
3.2.6. Measurement of Feature Depletion and Enrichment
3.2.6.1. Overview
3.2.6.2. Binning Strategy
3.2.6.3. KS Test Statistics
3.2.6.4. Estimation of KS Statistic Background Distribution
3.2.6.5. Application of KS Test
3.2.6.6. False Positive Analysis
3.3. Methods
3.3.1. Comparison of Expected and Observed Sequence Feature Counts
3.3.2. Tissue-Specific Index Score Evaluation
3.3.3. Tissue-Specific Depletion of microRNA Target Sites
3.3.3.1. Weak Depletion Signals also Coincide with microRNA Expression
3.3.3.2. Signal-to-Noise Estimation
3.3.3.3. Comparison to Experimentally Determined MicroRNA Expression
3.3.3.4. Comparison to Results of Farh et al.
3.3.4. Tissue-Specific Enrichment for MicroRNA Target Sites
3.4. Conclusion and Future Directions
3.5. References Cited

APPENDIX

Table of Figures

CHAPTER 1
Figure 1: Mammalian microRNA biogenesis
Figure 2: MicroRNA-target pairing

CHAPTER 2
Figure 1: Minimal flanking folding energies for Tetrahymena sRNAs and controls
Figure 2: Minimal flanking folding energies for Arabidopsis miRNAs and controls
Figure 3: Hit count distributions to different genomic features
Figure 4: Comparison of predicted and curated annotation of gene H2A.Z
Figure 5: Rearrangement of sRNA-cognate gene clusters
Figure 6: A-Rich hidden Markov model
Figure 7: A-rich motif Sequence logo
Figure 8: Distribution of motif hits for sRNAs and randomized controls
Figure 9: Quality and quantity of A-rich motif instances predicted by HMM

CHAPTER 3
Figure 1: Tissue Specificity Index calculation
Figure 2: Gene binning algorithm
Figure 3: Sequence feature background model performance
Figure 4: Expression and average rank correlations in 61 mouse tissues
Figures 5a, 5b: Examination of TSI scoring in three tissues
Figure 6: Selected miRNAs depleted for targeting in non-brain tissues
Figure 7: Selected miRNAs depleted for targeting in brain tissues
Figure 8: Signal to noise estimation for depletion analysis
Figure 9: Signal to noise estimation for enrichment analysis
Figure 10: Enrichment of targeting among tissues spatially/temporally complementary to miRNA expression

APPENDIX
Figure A1: MicroRNA targeting depletion
Figure A2: MicroRNA targeting enrichment

Table of Tables

CHAPTER 2
Table 1: Eukaryotic genomic summary statistics
Table 2: Length and composition of various classes of short RNAs
Table 3: Genome-wide hits from ~23 nt sRNAs

CHAPTER 3
Table 1: MicroRNA sequence feature sets
Table 2: Target depletion compared with in situ miRNA expression data

APPENDIX
Table A1: Genomic loci of sRNAs cloned by Lee et al.
Table A2: Paralogy relationships between sRNA-associated clusters
Table A3: HMM-identified A-rich motif instances near sRNA clusters
Table A4: HMM-identified strong A-rich motif instances genome-wide

CHAPTER 1
Introduction

1.1.
Motivation

Following the convergence of whole-genome sequencing and high-throughput assays in life science research, an unprecedented opportunity has arisen to glimpse life's full biological complexity at the molecular, cellular, and physiological levels. By contrast, experimental molecular biology has traditionally been applied with great success to focused studies, characterizing in detail the actions of individual genes and proteins. Today, experimental inquiries are often motivated by computational analyses which mine massive datasets for interesting patterns and trends that have eluded manual detection. Conversely, computational analysis is also commonly used to follow up experimental discovery, for instance by measuring genome-wide signal for an event experimentally observed in limited scope but suspected to have a larger role. In this thesis, we present two distinct studies, each seeking to assess in silico the role or scope of a biological phenomenon.

1.2. Background

1.2.1. The Central Dogma of Molecular Biology

The central dogma of molecular biology provides a first-order summary of the pathways and substrates that support the living cell. Elegant in its simplicity, the dogma can be visualized as a graph with three nodes and three edges. The nodes are DNA composing the genome, RNA transcribed from it, and finally, proteins, which are the primary structural and enzymatic components of living cells. Two edges represent the molecular actions that carry information encoded in the genome through to a physical manifestation: transcription, in which genes are copied from the genome to messenger RNA (mRNA), and translation, in which protein molecules are synthesized based on instructions from these transcripts. The third edge represents feedback: proteins interact with each other and ultimately act on genomic DNA to initiate or block transcription, thus feeding back into the graph.
Following this model, protein-protein interactions eventually flowing back to alter DNA state were considered the primary modes of action sustaining cellular function. This model appeals by rough analogy to the digital computer, which features robust but inflexible long-term information storage in some sense similar to genomic DNA. Random-access memory resembles RNA, transiently preserving state to carry out only the tasks at hand. DNA and RNA are each encoded by 4 bases, a strong parallel to the electrostatic binary encoding used in the modern computer. Lastly, instructions encoded in memory direct input/output devices such as a printer or, more exotically, perhaps a robotic arm, which effect physical action, analogous to the synthesis of proteins with enzymatic activity.

1.2.2. Small RNAs Play Potent Regulatory Roles

RNA, which is chemically much less stable than DNA, was for many years deemed too transient to be capable of carrying out precise regulation and was relegated to the role of information carrier. Over the course of the past decade, this view has been shattered by the discovery of diverse regulatory pathways mediated by tiny RNAs, collectively termed RNA interference (RNAi). Core elements of these pathways, which are shared among all plants and animals, as well as some fungi, unify to repress (or "silence") gene expression despite differences in mode and mechanism. Though initially proposed to mediate only a few specific developmental timing effects, RNAi is now understood to exert regulatory control over as many as one-third of all transcribed human genes (1, 2), and has been implicated in diverse physiological processes, including neuronal differentiation (3) and muscle (4) and skin morphogenesis (5).

[Figure 1: Mammalian miRNA Biogenesis]

Figure 1. MicroRNA hairpin-loop precursors undergo several rounds of processing before being loaded into the RISC effector complex.
Target repression is thought to be carried out both through blockage of productive translation and through transcript cleavage. Adapted from Bartel (6).

MicroRNAs (miRNAs) are an endogenous class of ~22 nt RNAs that mediate posttranscriptional gene silencing in both animals and plants and in human comprise approximately 1% of all known genes (6). MicroRNA precursor sequences (pre-miRNAs) characteristically form energetically favorable secondary structures by forming Watson-Crick pairs between complementary bases (adenine and uracil; guanine and cytosine), as depicted in Figure 1. Precursor structures are observed to include an extended hairpin with imperfect pairing flanking the mature sequence and meeting at a terminal loop of unpaired bases, structural features thought to be required for recognition and subsequent processing. Pre-miRNAs are exported to the cytoplasm, and are cleaved by the endonuclease Dicer to yield a double-stranded RNA (dsRNA) containing the mature miRNA and its complement, the miR*, each bearing a 2 nt 3' overhang characteristic of RNase III cleavage.

[Figure 2: MicroRNA-Target Pairing. Target 3' UTR (positions t1, t2 ... t7): 3'-UAGUGUAA-5', paired with miR-23a (positions m1, m2 ... m7): 5'-AUCACAUUGCCCGAGGGAUUUCC-3'. The seed region (bases 2-7) is marked.]

Figure 2. Metazoan microRNAs target messages primarily by Watson-Crick base pairing to sequence complementary to miRNA bases 2 to 7 (m2-m7). Pairs at m1 and m8 are also observed but are not required.

Following processing, the mature miR is loaded into the RNA-Induced Silencing Complex (RISC), while the complementary miR* strand is degraded. RISC-bound miRNAs then target messages by binding reverse complementary sequences in their 3' untranslated regions (3'-UTRs), causing repression of productive translation and, in some cases, target cleavage (Figure 2).
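The seed-pairing rule summarized in Figure 2 can be illustrated with a short sketch. This is not code from the thesis: the function names and the example UTR string are hypothetical, and the miR-23a sequence is copied as printed in Figure 2. The sketch takes miRNA bases 2-7, forms their reverse complement, and reports where that 6-mer occurs in a 3' UTR.

```python
# Illustrative sketch only; names and the example UTR are hypothetical.
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(rna):
    """Return the reverse complement of an RNA string, read 5'->3'."""
    return "".join(COMPLEMENT[base] for base in reversed(rna))


def seed_site(mirna, start=2, end=7):
    """Target-site sequence (5'->3') that pairs with miRNA bases start..end.

    Positions are 1-based and inclusive, matching the m1, m2 ... numbering
    used in Figure 2.
    """
    seed = mirna[start - 1:end]
    return reverse_complement(seed)


def find_seed_matches(utr, mirna):
    """Return 0-based start offsets of every seed match in a 3' UTR."""
    site = seed_site(mirna)
    positions = []
    offset = utr.find(site)
    while offset != -1:
        positions.append(offset)
        offset = utr.find(site, offset + 1)  # allow overlapping matches
    return positions


MIR_23A = "AUCACAUUGCCCGAGGGAUUUCC"  # as printed in Figure 2
EXAMPLE_UTR = "GGCAAUGUGAUCC"        # hypothetical UTR fragment

print(seed_site(MIR_23A))                       # AUGUGA
print(find_seed_matches(EXAMPLE_UTR, MIR_23A))  # [4]
```

Extending `start` and `end` to cover m1 or m8 would model the optional flanking pairs mentioned in the Figure 2 caption; the 6-mer default reflects only the minimal seed.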
Pairing to the miRNA 5' seed region (bases 2-7) is thought to be the primary determinant of targeting specificity for animal miRNAs, and experimental data suggest that pairing to this region is necessary and sufficient for target repression (7). Computational analyses (1, 2) reveal that 6-7 nt microRNA seed matches are evolutionarily conserved 2- to 3-fold above expectation, further supporting a widespread role for miRNA-mediated silencing.

MicroRNA targeting has also been shown to play an important role in plant development (6). Unlike in metazoans, plant microRNAs target sites with complementary pairing along the full length of the miRNA molecule, though with greater tolerance for bulges or internal loops. Also, plant miRNA targets are generally silenced by cleavage rather than by translational repression.

Small interfering RNAs (siRNAs) are another class of small RNAs of both endogenous and exogenous origin that mediate transcriptional and post-transcriptional gene silencing. In contrast to microRNAs, these are not derived one or two at a time from hairpin-loop precursor structures, but instead many at a time in phase from bidirectional transcripts. Knockdown experiments employ silencing mediated by transfected siRNAs and have many applications, such as studying a gene's deficiency phenotype. An endogenous class of siRNAs, the plant trans-acting siRNAs (tasiRNAs), rely partially on the miRNA biogenesis pathway and differ from many other siRNAs by targeting distant loci in trans.

Numerous other classes of small RNAs have been observed (8), some of which may carry out novel regulatory function. In Chapter 2, we examine a small RNA of uncharacterized function isolated from the single-celled eukaryote Tetrahymena thermophila, and speculate that it may participate in a functionally distinct RNAi pathway.

1.2.3. Experimental Techniques

1.2.3.1.
Gene Expression Microarrays

DNA microarrays are a popular large-scale biological assay for measuring the expression levels of a large number of genes simultaneously. One high-density microarray can interrogate transcript abundances of every known or predicted gene in the genome. Typically, a series of identical microarrays is used to measure expression levels in samples extracted from different tissues, time-points, or treatments. Microarrays have enabled genome-wide studies of tissue-specific gene expression and alternative splicing, two cellular activities that have recently been proposed to account for much of the difference in biological complexity between humans and simpler organisms. In Chapter 3 of this study, we make use of measurements from a microarray study (9) probing the levels of known and predicted transcripts of over 13,000 genes in 61 mouse tissues. By comparing the entire transcriptional contexts of different tissues, we observe potent evolutionary pressure to avoid or, in some cases, gain target sites for tissue-specific miRNAs.

1.2.3.2. Small RNA Assays

Microarrays have also been used to assay miRNA abundance, and we refer to several such studies to confirm patterns suggesting tissue-specific miRNA expression. Several other experimental procedures can be used as well. One classical technique is the northern blot, in some sense a small-scale microarray in which a panel of RNA samples is size-fractionated in separate lanes by gel electrophoresis and then simultaneously hybridized with labeled probes. In another technique, in situ hybridization, probes are hybridized to mounted cells or tissue sections, revealing not only expression level but also spatial context, providing an excellent basis of comparison for tissue-specific effects.

1.3. References Cited

1. Lewis BP, Burge CB, Bartel DP. Conserved Seed Pairing, often Flanked by Adenosines, Indicates that Thousands of Human Genes are microRNA Targets. Cell. 2005 Jan 14;120(1):15-20.
2. Lewis BP, Shih IH, Jones-Rhoades MW, Bartel DP, Burge CB. Prediction of Mammalian microRNA Targets. Cell. 2003 Dec 26;115(7):787-98.
3. Schratt GM, Tuebing F, Nigh EA, Kane CG, Sabatini ME, Kiebler M, Greenberg ME. A Brain-Specific microRNA Regulates Dendritic Spine Development. Nature. 2006 Jan 19;439(7074):283-9.
4. Chen JF, Mandel EM, Thomson JM, Wu Q, Callis TE, Hammond SM, Conlon FL, Wang DZ. The Role of microRNA-1 and microRNA-133 in Skeletal Muscle Proliferation and Differentiation. Nat Genet. 2006 Feb;38(2):228-33.
5. Yi R, O'Carroll D, Pasolli HA, Zhang Z, Dietrich FS, Tarakhovsky A, Fuchs E. Morphogenesis in Skin is Governed by Discrete Sets of Differentially Expressed microRNAs. Nat Genet. 2006 Feb 5.
6. Bartel DP. MicroRNAs: Genomics, Biogenesis, Mechanism, and Function. Cell. 2004 Jan 23;116(2):281-97.
7. Brennecke J, Stark A, Russell RB, Cohen SM. Principles of microRNA-Target Recognition. PLoS Biol. 2005 Mar;3(3):e85.
8. Ambros V, Lee RC, Lavanway A, Williams PT, Jewell D. MicroRNAs and Other Tiny Endogenous RNAs in C. elegans. Curr Biol. 2003 May 13;13(10):807-18.
9. Su AI, Wiltshire T, Batalov S, Lapp H, Ching KA, Block D, Zhang J, Soden R, Hayakawa M, Kreiman G, Cooke MP, Walker JR, Hogenesch JB. A Gene Atlas of the Mouse and Human Protein-Encoding Transcriptomes. Proc Natl Acad Sci U S A. 2004 Apr 20;101(16):6062-7.

CHAPTER 2
Computational Characterization of a Recently Discovered Class of Small RNAs in Tetrahymena thermophila

Abstract

Tetrahymena thermophila recently became the first unicellular organism reported to have multiple distinct RNAi pathways, when a novel ~23 nt class of small RNA (sRNA) of unknown function (1) joined the previously described sRNAs that mediate genomic rearrangement in this ciliated eukaryote (2). Here we report efforts to characterize these ~23 nt sRNAs in silico.
We apply computational methods to assess any functional similarities to known classes of sRNA in higher organisms, showing that these sRNAs do not appear to be processed from microRNA-like precursors, and moreover do not appear to target loci in trans, suggesting that they participate in yet another RNAi pathway of novel mechanism or effect. Lastly, we report that an A-rich motif is specifically associated with the small RNAs, and apply a discriminative Hidden Markov Model (HMM) to predict other genomic loci with high potential for sRNA activity.

2.1. Overview: Tetrahymena thermophila

2.1.1. Introduction

Tetrahymena thermophila is a single-celled eukaryote indigenous to temperate freshwater pond habitats worldwide. Perhaps its most physiologically striking feature is the covering of cilia on its exterior. A member of the Alveolata, a group that includes the malaria pathogen Plasmodium, Tetrahymena exhibits remarkable genomic, biochemical, and structural complexity despite being unicellular. Combined with great ease of experimental manipulation, this has made it increasingly popular as a model organism. Several fundamental discoveries have been made in Tetrahymena studies, including self-splicing RNA, telomeres, and chromatin structure.

2.1.2. Lifecycle and Nuclear Dualism

Like other ciliated protozoa but in stark contrast to almost all other eukaryotes, Tetrahymena possesses two distinct nuclei - a diploid, germline micronucleus (MIC), which has N=5 pairs of chromosomes, and a haploid somatic macronucleus (MAC), which contains about 275 chromosomes, most at ~45 copies each. Tetrahymena's nuclear dualism is complemented by its two distinct life cycles (3, 4). In the first such cycle, called vegetative growth, cells reproduce through successive rounds of asexual splitting. During this cycle, the MIC is transcriptionally silent while the MAC expresses the set of messages necessary to divide and sustain cellular function.
Vegetative growth proceeds indefinitely until cells first undergo nutritional stress and then encounter cells of a different mating type. At that point, the two mating cells undergo conjugation, combining micronuclear genetic material through meiosis and cross-fertilization to yield daughter micronuclei from which new macronuclei are subsequently derived.

2.1.3. Macronuclear Genomic Characteristics

The Tetrahymena macronuclear genome was recently sequenced at the Institute for Genomic Research (TIGR). With a total assembled length of 104 Mb, of which 77.6% is adenosine and thymine (A+T), the MAC genome is comparable in size and composition to that of the related ciliate Paramecium tetraurelia, as well as the more distantly related Plasmodium. A comparison of its genomic features with those of several other model organisms is shown in Table 1. Despite having a much shorter genome than higher eukaryotes, Tetrahymena, like other ciliates, has a very high coding density and is predicted to have roughly as many genes as mammals. However, in contrast to mammalian genes, the predicted genes in Tetrahymena have fewer exons and shorter introns, possibly reflecting simpler spliceosomal processes.

Table 1. Genomic summary statistics for several eukaryotes.

Genome Length %(A+T) %(G+C) Predicted no. genes Average genomic Human Arabidopsis 2.91 Gb 125 Mb 55% 45% ~32,000 27 kb Plasmodium 64% 36% 25,498 Paramecium MIC: ~100 Mb MAC: ~90 Mb 72% 28% >30,000 2.01 kb 1.43 kb 2.53 kb 23 Mb 81% 19% 5,266 extent / gene Average exon length Average no. no. intron length 1.67 kb (median) 218 bp 145 bp 250 bp 419 bp 949 bp median 7 (median) 5 (median) 3.3 2.4 3 (median) 178.7 bp (median) 1694 bp 1117 bp (median) (median) exons / gene Average Tetrahymena MIC: ~120 Mb MAC: ~105 Mb 78% 22% ~27,300 intron length 3.37 kb 168 bp 25.4 bp Average 51.2 kb intergenic (Chr14 only) 2.26 kb 202 bp distance (Chr14 only) 86 bp (median)

Averages are means unless otherwise noted. Sources: (5-11)

2.1.4.
RNA-Guided Genomic Rearrangement

The MAC genome is derived from its germline cousin by specific elimination of ~15% of the micronuclear sequence followed by fragmentation into many smaller chromosomes which are selectively amplified to 45C. Selective deletion of MIC sequence was shown to be mediated by a class of ~28 nt single-stranded RNAs dubbed the scnRNAs for their proposed mechanism of "scanning" the MIC genome for complementary sequence to be deleted (2). Within such internal eliminated sequences (IESs), specific promoter elements initiate bidirectional transcription, yielding double-stranded transcript that is sequentially cleaved by the Dicer homolog Dcl1p (12) into the ~28 nt scnRNAs, which target reverse-complement sites within the MIC genome for deletion. By silencing specific target loci, this process achieves a similar effect to that of other RNAi pathways, but does so by a unique method - actual deletion of targeted sequence from the nascent somatic macronucleus.

2.2. Overview: ~23nt sRNAs in Tetrahymena

2.2.1. Introduction

A novel class of endogenous small RNAs (sRNAs) 23-24 nt in length was recently reported by Lee and colleagues (1). These sRNAs appear to be distinct from other classes of sRNAs in Tetrahymena such as the scnRNAs or starvation-induced tRNA degradation products (13), suggesting the presence of an additional RNAi pathway of as-yet unknown function. To date, no other unicellular organism has been reported to have multiple distinct classes of small RNAs.

The ~23 nt sRNAs are differentiated from other classes of sRNA in Tetrahymena by several characteristics in addition to their length. First, the ~23 nt sRNAs accumulate throughout the Tetrahymena lifecycle whereas tRNA cleavage products and scnRNAs are generated primarily in response to starvation and conjugation, respectively.
Second, the ~23nt sRNAs are derived in a perfectly strand-specific fashion, in contrast to the scnRNAs, which are endonucleolytic products of both strands of double-stranded transcript. Third, the ~23nt sRNAs rely upon a distinct factor for processing and, unlike the scnRNAs, are observed in DCR1 knockout progeny. The ubiquitously expressed protein Dcr2p, the most canonical of the three Dicer homologs found in the MAC genome, was postulated as the agent of these sRNAs' biogenesis (1). Significantly, DCR2 is an essential gene, suggesting that its putative ~23nt substrates are functionally important.

2.2.2. Small RNA Loci Cluster within the MAC Genome

Of the 151 ~23nt sRNAs cloned and sequenced by Lee and colleagues, 107 (71%) matched the MAC genome in at least one place, and of those, 89 (84%) mapped unambiguously to a single genomic locus (listed in Appendix Table A1). Each sRNA sequence was cloned exactly once, suggesting high diversity in the population of ~23nt sRNAs. Remarkably, of the 107 MAC sRNAs, 97 (91%) are grouped in twelve distinct, unlinked clusters, each containing between 2 and 16 sRNAs and spanning between 46 and 8588 bp (mean = 3237 bp). This clustered organization resembles that of small RNAs in other organisms: mammalian microRNA genes are commonly found in clusters (14), plant tasiRNAs are sequentially derived in 21-nt phase from primarily one strand of a dsRNA transcript (15), and the "X-cluster" sRNAs of C. elegans densely reside in one poorly characterized locus without apparent phase or structure (16). Unlike these known examples, though, the ~23nt sRNAs are derived in a perfectly strand-specific manner from each cluster. Moreover, all but two of the clusters overlap predicted genes on the opposite strand.

2.2.3. Composition and Editing

The ~23nt sRNAs display several notable compositional biases (Table 2).
First, they display a preference for 5' uracil similar to but much stronger than the bias noted for metazoan and plant mi/siRNAs (17, 18), most closely resembling that of the plant tasiRNAs. Second, in contrast to other reported classes of sRNA in Tetrahymena, the nucleotide composition of the ~23nt sRNAs appears to complement that of predicted coding sequence, further suggesting high coding potential among sequences antisense to sRNA loci. Lee et al. (1) report that over half of the cloned ~23nt sRNA sequences differ from their genomic sequence by one or two bases, mostly at the 3' terminal nucleotide. Because this discrepancy was not observed among other classes of sRNAs cloned in the same study, they propose an untemplated 3' end activity specific to this class of sRNAs.

Table 2. Length and composition for short RNAs of various organisms. Predicted coding sequence composition in Tetrahymena is shown for comparison.

Overall composition sRNA Length Tetrahymena ~23nt sRNAs matching MAC genome scnRNAs Predicted cds Arabidopsis tasiRNAs(†) microRNAs(†) Human microRNAs(‡) Mean 23.5 nt (22 nt: 1, 23 nt: 55, 24 nt: 48, 25 nt: 3) Mean 28.6 nt (*) %(5' U) 94% 84% %A 30% %U (*) %C %G (*) 10% (*) 45% 15% 34% (*) 36% (*) 17% (*) 13% (*) NA NA 40% 32% 13% 15% Mean 21.2 nt Mean 22.0 nt 70% 37% 30% 30% 36% 23% 18% 23% 16% 24% Mean 21.9 nt 49% 23% 30% 22% 25%

(*) Compositional frequencies reported by Lee et al. (1). (†) Sequences for 23 Arabidopsis tasiRNAs and 298 miRNAs downloaded from ASRP (19). (‡) Sequences for 319 human microRNAs downloaded from miRBase RFAM 7.1 (20, 21).

These sRNAs bear considerable similarities in composition and organization to a variety of RNAi-mediating molecules in other eukaryotes. We believe it unlikely that Tetrahymena would retain these clusters of ~23nt sRNAs and the cellular machinery necessary for their synthesis if they did not have an important function, especially given the organism's notable efficiency.
In an attempt to ascertain their function, we compare them to two well-studied classes of such molecules, namely microRNAs and tasiRNAs. Finally, we present evidence for common sequence-level features shared by sRNA clusters and identify additional genomic loci with sRNA-generating potential.

2.3. Screen for RNA Secondary Structure

2.3.1. Overview

The Tetrahymena ~23nt sRNAs share several characteristics with microRNAs, such as a length distribution tightly centered on 23 nt and a strong bias for uracil at position 1, leading us to investigate the full extent of these molecules' similarity in form and function. Hairpin-loop secondary structure is a defining characteristic of microRNAs in both plants and animals and is strictly required for their processing and nuclear export (22). In order to assess the sRNAs' similarity to miRNAs, we examined the potential of their flanking sequences to assume an energetically favorable secondary structure.

2.3.2. Methods

The program RNAfold from the freely available Vienna RNA software package (23) was used to determine the most energetically favorable RNA structure of each sequence. For each of 97 clustered ~23nt sRNAs, every window of length L (tested using L = 150 nt and L = 350 nt) containing its genomic locus was extracted. RNAfold (arguments "-noLP") was applied to determine the optimal secondary structure of every such sequence window, and for each sRNA, the surrounding window with the lowest minimum free energy (MFE) of folding was selected.

To assess more systematically whether the cloned sRNAs arose from pre-miRNA-like structures, their flanking sequences' folding energies were compared with those of a randomized control set. For each sRNA, a control cohort of sequences was obtained by randomly selecting from the MAC genome non-overlapping sequences with length equal to, and dinucleotide content similar to, that of the sRNA.
This roughly equalized the folding energy of the sRNA and its controls under the null hypothesis that the sRNAs fold no better than random given their length and composition. To select the control cohort for a given cloned sequence i, equal-length sequences were sampled from randomly chosen positions in the MAC genome sequence. Each sampled sequence j was kept if the root mean squared deviation between its dinucleotide frequencies and those of the cloned sRNA fell below a given threshold $r_{core}$, that is, if

$$R_{core}(i, j) = \sqrt{\frac{1}{|D|} \sum_{k \in D} \left(f_k^{i} - f_k^{j}\right)^2} < r_{core},$$

where $D$ is the set of 16 dinucleotides and $f_k^{i}$ equals the observed frequency of dinucleotide k in sequence i. As an additional requirement, a similarly defined threshold $r_{flank}$ was applied to the dinucleotide frequencies of the surrounding window. Parameter values ($r_{flank}$ = 0.015, $r_{core}$ = 0.030) were chosen on the basis that they were high enough to avoid overfitting and allow for rapid sampling but were far below the average standard deviation of dinucleotide frequencies in the genome as a whole, ensuring that a reasonably large control cohort would mirror the composition of the sRNA set. In the same way as for the cloned sequences, a fixed-length window was slid across each control sequence, and the lowest-MFE window was selected.

2.3.3. Results

Only a small subset of the ~23nt sRNAs are flanked by sequence predicted to have sufficiently low folding energy to potentially exhibit pre-miRNA-like structure. We manually examined these cases and found that although some of the predicted structures did contain stem loops, none of the cloned sRNAs resided in stem regions as would be expected for a mature miRNA.

The distributions of lowest-energy flanking windows surrounding both the cloned sequences and the control cohorts each take an extreme-valued form, as would be expected for the selection of the best-folding window surrounding each sequence.
Distributions were obtained for sequence windows of length L = 150nt and L = 350nt, with cohorts of 300 and 45 controls per sRNA sequence, respectively, and are shown in Figure 1.

[Figure 1 plot: Minimal Flanking Folding Energies for Tetrahymena sRNAs and Controls. Panels: window size 150 nt and window size 350 nt. Series: control sequences; cloned RNAs (N=107). x-axis: folding energy of best-folding window, kcal/mol.]

Figure 1. Distributions of minimum free energy (MFE) for best-folding windows surrounding Tetrahymena ~23nt sRNAs or control cohorts for window sizes 150nt and 350nt. Small RNA mean MFE: -24.1 kcal/mol and -61.5 kcal/mol, respectively; control cohort mean MFE: -26.9 kcal/mol and -66.3 kcal/mol, respectively.

In fact, the ~23nt sRNA sequences have significantly higher folding energies than their control cohorts (P < 1.3 × 10^-8 and P < 7.5 × 10^-9, respectively; Mann-Whitney-Wilcoxon rank sum test), exactly the opposite of the effect expected were the sRNAs derived from highly structured precursors such as pre-miRNAs. A likely explanation for the strength of the result in this direction is the possible inclusion of other classes of structural RNA in the control cohort, skewing it to fold better than the sRNAs on average.

As a positive control for the ability of this method to identify microRNA-like secondary structure, it was applied to known microRNAs from the plant Arabidopsis thaliana. The set of 114 known Arabidopsis microRNAs was downloaded from miRBase (RFAM release 7, 7/12/2005: (20, 21)), and a control cohort was selected from the Arabidopsis genome, preserving sequence length and composition as before. Because secondary structures in plant pre-miRNAs commonly extend beyond 150bp, this analysis was performed only with a window size of 350 nt.
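The composition-matched control selection described in Section 2.3.2 can be illustrated with a short Python sketch. This is not the code used in this study. The function names, the centered placement of each core sequence within its window, the rule used to keep controls from overlapping, and the toy inputs are all assumptions made for illustration; only the root mean squared deviation criterion over the 16 dinucleotides and the threshold values 0.030 (core) and 0.015 (flank) are taken from the text.

```python
import math
import random
from itertools import product

# The set D of 16 dinucleotides (DNA alphabet, as the genome is searched).
DINUCS = ["".join(p) for p in product("ACGT", repeat=2)]


def dinuc_freqs(seq):
    """Observed frequency of each overlapping dinucleotide in seq."""
    counts = dict.fromkeys(DINUCS, 0)
    total = 0
    for a, b in zip(seq, seq[1:]):
        pair = a + b
        if pair in counts:  # skip pairs containing N or other symbols
            counts[pair] += 1
            total += 1
    return {k: (counts[k] / total if total else 0.0) for k in DINUCS}


def dinuc_rmsd(seq_i, seq_j):
    """Root mean squared deviation between two dinucleotide profiles."""
    fi, fj = dinuc_freqs(seq_i), dinuc_freqs(seq_j)
    return math.sqrt(sum((fi[k] - fj[k]) ** 2 for k in DINUCS) / len(DINUCS))


def sample_controls(genome, core_start, core_len, window_len, n_controls,
                    r_core=0.030, r_flank=0.015, max_tries=100_000, seed=0):
    """Rejection-sample control loci matched to one sRNA locus.

    Assumption: the core sits in the middle of its flanking window, and
    controls may not overlap each other's windows.
    Returns a list of (core_start, core_sequence) tuples.
    """
    rng = random.Random(seed)
    half = (window_len - core_len) // 2
    w_start = max(0, core_start - half)
    core = genome[core_start:core_start + core_len]
    window = genome[w_start:w_start + window_len]
    accepted = []  # start coordinates of accepted control windows
    for _ in range(max_tries):
        if len(accepted) == n_controls:
            break
        ws = rng.randrange(0, len(genome) - window_len + 1)
        if any(abs(ws - other) < window_len for other in accepted):
            continue
        cand_window = genome[ws:ws + window_len]
        cand_core = cand_window[half:half + core_len]
        if (dinuc_rmsd(core, cand_core) < r_core
                and dinuc_rmsd(window, cand_window) < r_flank):
            accepted.append(ws)
    return [(ws + half, genome[ws + half:ws + half + core_len])
            for ws in accepted]
```

Rejection sampling keeps the procedure simple at the cost of many discarded candidates when the thresholds are tight, which matches the trade-off noted above between rapid sampling and close compositional matching.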
[Figure 2 plot: Minimal Flanking Folding Energies for Arabidopsis miRNAs and Controls (window size: 350 nt). Series: control sequences (~80 per miRNA); Arabidopsis miRNAs (N=114). x-axis: folding energy of best-folding window, kcal/mol.]

Figure 2. Best-folding 350nt sequence windows containing Arabidopsis miRNAs (clear) fold on average with significantly lower MFE than those of control cohorts (gray). MicroRNA mean: -103.5 kcal/mol; control cohort mean: -91.77 kcal/mol.

As shown in Figure 2, the Arabidopsis miRNAs fold with significantly lower MFE than their control cohorts (P < 2.0 × 10^-17). In fact, this comparison likely underestimates the difference in folding energies between the miRNAs and controls, as no attempt was made to exclude other classes of structural RNA from the control sequences.

We have presented a screen for RNA secondary structure potential in sequences flanking a set of uncharacterized loci. This screen is consistent with the null hypothesis that sequences flanking Tetrahymena ~23nt sRNAs have no greater folding potential than expected given their composition and length. Despite the diversity of organisms in which microRNAs are found, a unifying requirement for their processing is hairpin-loop secondary structure, and so we conclude that the ~23nt sRNAs of Tetrahymena are not microRNAs.

2.4. Screen for trans-Targeting Potential

2.4.1. Overview

We next investigated the possibility that the Tetrahymena ~23nt sRNAs target messages for post-transcriptional repression in the same manner as do the plant tasiRNAs. These sRNAs bear several similarities to the tasiRNAs, including their organization into several clusters, strand bias within each cluster, and compositional features.
This resemblance, however, is imperfect: the tasiRNAs show a strong but decidedly incomplete strand bias, and they are derived in a near-perfect 21 nt phase from their precursor transcripts, whereas the sRNAs show no discernible phase. Previous computational studies (24, 25) have reported detection of signal above noise for target sites to mammalian microRNAs, and the signal for endogenous short RNAs in plants is likely even more evident given their more stringent targeting requirements. We attempted to detect a similar signal in Tetrahymena for widespread gene targeting by the ~23nt sRNAs.

2.4.2. Methods

To test the trans-targeting potential of the Tetrahymena sRNAs, we searched the genome for messages with target sites for the 107 sRNA sequences and compared the results to those obtained by searching with a randomized cohort of sequences. The alignment program wublastn (26) was used to search the genome for all instances of each sRNA's reverse complement (settings "dbgcode=6 C=6 W=3 E=1e-7 hspsepSmax=20 hspsepQmax=20 qframe=1 hspmax=2000 V=2000 B=2000"). As interfering RNA has been shown to target messages with imperfect complementarity in both animals and plants, up to 8 mismatches and gaps were allowed. Prior to searching, the program DUST was applied to the genome in order to filter out a potentially large number of hits to low-complexity sequence. Even after filtering, five sRNAs with very low (G+C)% yielded extensive hits to repetitive elements and were excluded from further analysis. Lastly, hits were mapped by genomic coordinate to the TIGR-predicted exonic, intronic, and intergenic regions. Hits overlapping any sRNA's own locus of origin were excluded and not counted.

A first-order Markov chain model was created from each cloned sRNA sequence and used to generate a control cohort of sequences with dinucleotide composition equal in expectation to that of the sRNA.
This model comprises four fully connected states, each emitting one RNA nucleotide, and a silent start state with one outgoing edge to each RNA-emitting state. Transition probabilities between states were set using the conditional probabilities of each dinucleotide within the cloned sRNA sequence: $P(Y \mid X) = f_{Y|X} = f_{XY} / \sum_{Y'} f_{XY'}$. Across all models, transition probabilities from the start state were set to the pooled mononucleotide frequencies of the sRNA sequences in order to capture their strong bias towards U in the first position. Because control sequences equal in length to their paired sRNAs were desired, no end state or other length distribution was used, and sequences emitted by the chain model were simply trimmed to the same length as the corresponding sRNA. Each sRNA sequence model was used to emit 500 sequences comprising the control cohort for that sRNA. The DUST-filtered genomic sequence was searched as before with the reverse complement of each individual control sequence, and the results were grouped by corresponding sRNA and mapped to genomic features. As the control sequences were randomly generated rather than drawn from the genome, they have no locus of origin per se, and so no hits from the control sequences were excluded.

2.4.3. Results and Discussion

Reverse-complement hit counts to different genomic features are shown in Table 3. Imposing a higher stringency level by allowing fewer mismatched or gapped bases in the hit alignments naturally resulted in fewer hits. Across all stringency levels examined, cloned sequences had on average more hits in both orientations to every genomic feature than their controls did. This difference was greater than two-fold for nearly every combination of stringency level and hit type, suggesting some similarity to distant loci beyond the controlled attributes of length and mono- or di-nucleotide content.

Table 3. Hit counts from 102 cloned sRNAs and control cohorts, categorized down rows by type and orientation of genomic feature hit and across columns by stringency level (maximum number of mismatches or gaps allowed). Hits from five cloned sequences closely resembling low-complexity repeats, and from their cohorts, were excluded. Counts for cloned RNAs exclude hits to the exact locus of origin; counts for randomized controls are averages over 500 controls per included cloned RNA.

                                          Hits from cloned RNAs       Hits from randomized controls
                                          (stringency level)          (stringency level)
Hit type                                  4        3        2         4         3         2
Antisense to exon                         1546     388      28        685.4     179.9     8.86
Antisense hits to exon/intron boundary    187      46       3         78.02     20.01     0.89
Sense hits to exon                        863      189      6         426.3     104.9     4.85
Sense hits to exon/intron boundary        102      33       4         46.25     11.54     0.54
Intergenic hits                           3585     946      57        1522      409.0     19.51
Total hits                                6283     1602     98        2758      725.4     34.64

The distributions of hit counts for individual sRNAs and controls are shown in Figure 3. Notably, at all stringency levels, the number of antisense hits to exons from cloned RNAs is near or above the third quartile of antisense hits from controls; for no other category of hits is this the case. Although this disparity may suggest a trans-targeting property of the sRNAs, a simpler explanation is that it reflects a protein-coding sequence on the sense strand opposite the sRNA loci. Randomized controls are controlled for di- but not tri-nucleotide composition and therefore do not reflect the codon biases seen in protein-coding portions of the Tetrahymena genome. If the sRNAs are largely found antisense to protein-coding genes, this might be reflected in a greater number of hits antisense to other protein-coding regions of the genome relative to controls.

[Figure 3 plot: Count distributions of hits to different genomic elements for 102 Tetrahymena sRNAs and controls.]

Figure 3. Hit count distributions for sRNAs and controls to different genomic features. Cloned sRNAs' antisense hits to genes exceeded those of controls across all stringency levels.
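The first-order Markov chain used to generate randomized controls in Section 2.4.2 can be sketched in Python as follows. This is an illustrative reconstruction, not the code used in this study. The function names and toy sequences are hypothetical. Two modeling choices are assumptions: the start-state distribution is read here as the first-position nucleotide frequencies pooled over all sRNAs (the text says only "pooled mononucleotide frequencies"), and a nucleotide with no observed successor in a given sRNA falls back to a uniform transition row.

```python
import random
from collections import Counter

ALPHABET = "ACGU"


def transition_probs(srna):
    """Estimate P(Y | X) = f_XY / sum over Y' of f_XY' from one sRNA."""
    pair_counts = Counter(zip(srna, srna[1:]))
    probs = {}
    for x in ALPHABET:
        row_total = sum(pair_counts[(x, y)] for y in ALPHABET)
        if row_total == 0:
            # Assumption: X has no observed successor, so use a uniform row.
            probs[x] = {y: 1.0 / len(ALPHABET) for y in ALPHABET}
        else:
            probs[x] = {y: pair_counts[(x, y)] / row_total for y in ALPHABET}
    return probs


def first_position_probs(srnas):
    """Start-state distribution pooled over all sRNAs (assumed reading)."""
    counts = Counter(s[0] for s in srnas if s)
    total = sum(counts[b] for b in ALPHABET)
    return {b: counts[b] / total for b in ALPHABET}


def emit_controls(srna, start, n=500, seed=0):
    """Emit n control sequences, each the same length as srna."""
    rng = random.Random(seed)
    trans = transition_probs(srna)
    start_weights = [start[b] for b in ALPHABET]
    controls = []
    for _ in range(n):
        seq = rng.choices(ALPHABET, weights=start_weights)[0]
        # No end state: emission simply stops at the sRNA's length.
        while len(seq) < len(srna):
            row = trans[seq[-1]]
            seq += rng.choices(ALPHABET, weights=[row[b] for b in ALPHABET])[0]
        controls.append(seq)
    return controls
```

A call such as `emit_controls(srna, first_position_probs(all_srnas))` would yield one cohort of 500 controls for a single cloned sequence. Because each chain is trained on a single short sequence, many transitions have zero probability, so controls reuse only the dinucleotides present in their paired sRNA.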
Although cloned sequences had overall more hits to genic regions than controls, among genic hits there was no detectable disparity between antisense and sense orientation: at stringency levels of 2 and 3 the difference in preference for antisense versus sense genic hits was insignificant (P < 0.20 and P < 0.13, respectively; chi-squared). At stringency level 4, the bias towards antisense hits compared to controls was significant (P < 0.01) but in terms of actual hit counts represented an average of less than one hit above expectation per sRNA.

The short RNAs likewise displayed no substantial preference to hit either genic or intergenic sequence relative to their controls. The cloned sRNAs had very slightly fewer hits to genes than expected at stringency levels 4 and 3 (P < 0.0044 and P < 0.034, respectively). At stringency level 2, the difference was insignificant. As before, these differences were marginal, in this case averaging less than two hits below expectation per cloned sequence.

Although these sRNAs had in sum more reverse-complement hits to the genome than did their controls, within their respective target sets there was no meaningful bias above noise either for antisense hits to genes or for genic versus intergenic regions. Thus, as a group, the ~23nt sRNAs of Tetrahymena appear unlikely to target messages in trans by imperfect pairing to reverse-complement sites in the manner of classical modes of RNA interference.

2.5. Small RNA Cluster Sequences are Related

2.5.1. Overview

The clustered organization of the ~23nt sRNAs cloned from Tetrahymena led us to investigate whether any particular sequence feature triggered their expression and subsequent processing. Indeed, simple pairwise BLAST alignments reveal a high degree of homology between and within the clusters, in most cases far exceeding homology to loci not containing observed sRNAs. Lee et al.
(1) used this homology to group the sRNA clusters into three families; we slightly modify and extend these family definitions (Appendix Table A2).

The grouping of extant clusters into related families, some of which share >1 kb blocks of near-perfect identity, suggests duplication and divergence from a common set of ancestors. Such sequence duplication is a common event in all organisms and is thought to be a necessary step in the evolution of new function. Duplicated gene copies, called "pseudogenes", are free from selective pressure and may accumulate mutations without phenotypic effect. Most such mutations would be deleterious if not for the original copy of the gene, leading most pseudogenes eventually to be lost. In some rare cases, however, the sequence may undergo a so-called "gain-of-function" mutation which confers new or enhanced function upon its encoded protein. When a newly functional pseudogene undergoes such a change, it again comes under selective pressure to remain in the genome, becomes fixed as a related but distinct copy of the original gene, and is referred to as a paralog. More drastic events such as the loss, duplication, and rearrangement of entire blocks of sequence also shape paralogs.

We measured the genome-wide network of gene paralogy in Tetrahymena and used it as a baseline to assess the degree of relatedness of genes in different sRNA clusters as well as those within the same cluster. We hypothesize that a series of divergence events led from several ancestral sequences to eight of the twelve observed sRNA clusters. Lastly, we present evidence that the clusters are related to other loci which may be centers of as-yet unobserved ~23nt sRNA synthesis.

2.5.2. Methods

2.5.2.1. Paralog Gene Families

The protein-coding region (cds) for each of the 27,306 predicted macronuclear genes was extracted and aligned to the genome using wu-tblastx to find putative paralogs.
Even at seemingly significant E-values (e.g., E < 10^-25), many alignments did not seem reflective of true protein-coding homology when subjected to manual examination. In particular, there were numerous long alignments with matches or positives sparsely distributed along their length. The matches in these alignments were enriched for a limited set of amino acids (primarily asparagine, lysine, and glutamine) with A+U-rich codons not reflective of overall transcriptome composition. We hypothesized that the genome's heterogeneity in A+T% might be causing wu-tblastx to locally underestimate the background score distribution, artificially improving (lowering) the reported E-values of these hits.

A stringent set of filters was applied in order to remove as many spurious alignments as possible. Two predicted genes were considered paralogous under this filtering if an exon from one was aligned to an exon from the other with a BLAST E-value of less than 10^-10 and the alignment had at least one 70-aa window with at least 55% identity and at least 80% positive+identity characters. Each exon was also required to be overlapped by the alignment by the lesser of 100 bp or 50% of the exon's length. Lastly, each corresponding pair of splice sites was required to align within a distance of 2,000 bp. As many of the exon boundaries may be incorrectly predicted or simply missed, we expect some loss of sensitivity to result from the requirement that aligned genes' exons overlap in this manner.

2.5.3. Results

2.5.3.1. Gene Prediction Verification

In order to estimate the accuracy of the TIGR gene predictions, we obtained cDNA sequences of several manually curated genes and compared their genomic alignments with their predicted counterparts. We surveyed only sequences deposited in GenBank after the gene predictions were issued (May, 2005), as older sequences may have been used as training data during gene prediction.
In all cases examined, one or more genes were predicted to overlap the curated genes' verified locations. Given that many Tetrahymena genes lack homology to known protein families, it appears that the gene finder was fairly sensitive. However, each predicted gene model we examined differed from the curated gene model, often by the inclusion of apparently spurious exons. A typical example was the histone H2 variant A gene (H2A.Z) shown in Figure 4, where the predicted gene structure includes two 5' exons not annotated in the curated gene structure. Although such differences could reflect legitimate alternative spliceforms not deposited in GenBank, we have no evidence that they are anything other than erroneous predictions. Unfortunately, neither the predicted genes near ~23nt sRNA loci nor their paralogs could be verified by EST or manually curated gene model evidence.

[Figure 4 image: Comparison of Predicted and Curated Annotation for Gene H2A.Z. MAC scaffold CH445731. Tracks: gene predictions 11.m00369 (no similarity or motifs detected) and 11.m00370 (histone variant H2A.Z); GenBank sequence alignment X15548 (Tetrahymena thermophila hv1 gene for histone H2 variant).]

Figure 4. Comparison of predicted and manually curated models for gene H2A.Z. The predicted gene structure includes two apparently spurious 5' exons.

2.5.3.2. Paralog Gene Families

Using wu-tblastx alignments and filtering for conservation of likely coding sequence, we inferred families of paralogs among the predicted macronuclear genes in Tetrahymena. Alignment of the 27,306 predicted genes against the genome and subsequent filtering yielded 215,379 hits between exon pairs. Mapping these to gene pairs and discarding redundant matches, we arrived at a list of 27,491 unique paralogy relationships between predicted genes, with under a fifth (5,348) of all predicted genes having at least one paralog.

2.5.3.3. Small RNA Clusters Are Highly Interrelated

Initial BLAST analysis showed that several genes at different sRNA cluster loci are related to each other, suggesting that a common feature may trigger sRNA synthesis. The genome-wide collection of paralogs includes 8 unique pairs of genes overlapping the sRNA clusters (designated "predicted paralog" in Table A2). In all eight cases, the two paralogous genes are unlinked, each overlapping a different sRNA cluster. This is a very significantly higher degree of paralogy (P ≈ 0, chi-squared) between different clusters than would be expected had the clusters been placed antisense to genes at random, controlling for the fact that some sRNA clusters span multiple adjacent genes and are thus more likely to overlap paralogous genes.

Four families of sRNA-associated genes are listed in Table A2. These combine paralogy relationships discovered through automated alignment and filtering analysis (designated "predicted paralog") with those inferred from manual examination of sRNA-associated genes' alignments (designated "probable paralog"). In the latter case, the strength of alignment supports a likely paralogy relationship, but one or more of the alignment filters failed. In addition to the BLAST E-value statistic, which can be misleading, this table lists the percent similarity (% ident, % ident+positives) of the best 360-nt window overlapping the predicted cds of both genes by the lesser of 100 bp or 50% of the overlapped exon's length.

Our paralogy predictions differ somewhat from those of Lee et al. First, we do not find strong evidence for paralogy between genes within most clusters. Though alignment results included such intra-cluster matches, the alignments failed automated quality filters and most appeared to be artifactual upon manual inspection. We group the sRNA-associated clusters in the same families as do Lee et al., except that we exclude sRNA cluster 0 from Family I on the basis of weak alignment data.
We also include a novel family (IV) which includes sRNA cluster 2 and its putative paralogs. Despite the differences, both the paralogy relationships we find and those reported by Lee et al. reflect stronger paralogy between separate clusters than within each, and both suggest an ancient tandem duplication event and subsequent divergence followed by much more recent duplication to numerous distant sites.

2.5.3.4. Adjacent Genes Generally Tend to be Paralogous

Whereas the degree of paralogy between unlinked sRNA clusters clearly exceeds expectation, we find that intra-cluster paralogy is a genome-wide phenomenon not specific to sRNA clusters. Clusters of genes along the same strand separated by intergenic stretches of less than 1 kb, such as those overlapping the sRNA clusters, are found throughout the MAC genome. Such gene clusters contain between 2 and 30 genes and in sum include over two-thirds of all predicted macronuclear genes.

Moreover, gene clusters are very significantly enriched for paralogous sets of genes. There are $\binom{27306}{2} \approx 3.7 \times 10^{8}$ possible unique paralogy relationships between predicted genes, of which an estimated 27,491 are realized. Of these paralog pairs, 2,766 have both genes in the same cluster, leaving 24,520 pairs where the genes are in different clusters (including the case where either or both clusters are of size 1). The total number of possible paralogy relationships between two genes within the same cluster is $\sum_{k \ge 2} x_k \binom{k}{2} = 32{,}803$, where $x_k$ is the number of gene clusters of size k. The chi-square test strongly rejects the hypothesis that paralogy relationships are distributed homogeneously between gene pairs contained within a cluster and all other pairs. This tendency of paralogs to reside near each other in clusters likely reflects the higher incidence of tandem duplication compared to duplication at distant loci.

2.5.3.5. Genes Overlapping sRNA Clusters Have Average Overall Paralog Counts

When paralogs at other loci are also considered, genes overlapping sRNA clusters have somewhat higher numbers of paralogs (Mann-Whitney; P < 0.0398) than other genes in the genome. Because the background distribution of paralog counts is likely shifted downward by spurious gene predictions having no paralogs, and given the marginal significance of this result to begin with, it seems likely that the genes overlapping sRNA clusters do not individually have significantly more paralogs than other genes around the genome. That sRNA-yielding genes' paralogy is enriched more significantly amongst themselves than with other genes suggests that some shared feature, rather than paralogy or high copy number per se, triggers antisense sRNA synthesis at those locations.

2.5.3.6. Rearrangement of Related sRNA Clusters

In at least one case, synteny is not preserved for paralogous genes across sRNA clusters. Clusters 4 and 6 share extensive homology, perhaps following large-scale rearrangement and insertion, as suggested by the alignment visualization in Figure 5. Predicted gene 4800 within cluster 4 is closely homologous to predicted gene 8863 of cluster 6, and, looking upstream in cluster 4 but downstream in cluster 6, predicted genes 4802 and 8862 are largely homologous with the exception of an ~0.9kb indel. This rearrangement could be explained by tandem duplication of three genes (ABC → ABCABC) followed by gene loss and divergence (ABCABC → C'A').

[Figure 5 image: Rearrangement of sRNA-Cognate Gene Clusters. Alignment between sRNA Cluster 4 and sRNA Cluster 6.]

Figure 5. Alignment visualization generated by the program GATA (27) reveals rearrangement between sRNA clusters 4 and 6.
Predicted gene models are shown above and below the alignment; connecting lines indicate pairwise blastn hits and are shaded by the strength of homology.

2.5.3.7. Individual sRNA positions are not conserved

Positions of the ~23nt sRNAs did not appear to be conserved between pairs of related clusters. Excluding the cases where a single cloned sRNA maps to multiple different clusters and may have originated from any subset of them, we found only two pairs of sRNAs whose aligned positions partially overlapped sRNAs of another cluster. Not only did the sRNAs of related clusters not tend to overlap each other in alignments, they also showed no tendency to localize to regions of high sequence conservation between the two clusters. Their lack of positional conservation despite sequence homology suggests that, to the extent that can be ascertained from the given sample, the sRNAs are randomly distributed within each cluster locus. In turn, this suggests that while some shared feature may serve to define each cluster, no shared feature within the cluster denotes the individual sRNA positions.

2.5.3.8. Paralogy to non-sRNA Associated Loci

Lastly, we obtained a list of sRNA cluster paralogs to which no cloned sequences mapped. We hypothesize that some of these paralogs may have sRNA-generating potential but were not observed in the cloned sample, perhaps because of transcriptional inactivity or some other cellular state. Such paralogs of sRNA-associated clusters that themselves lack observed sRNAs are highlighted in Table A2.

Small RNA cluster 2 had especially many paralogs at loci not associated with cloned sRNAs; we grouped this cluster and its paralogs into Family IV. Genes and flanking sequences adjacent to the sRNAs of cluster 2 are very closely duplicated on several other scaffolds.
The two most closely duplicated regions are the first ~6.5 kbp of scaffold 8254572, running from the start of the scaffold, including gene 23409, and ending before predicted gene 23310, and the first ~1.5 kbp of predicted gene 23312. Interestingly, four of the five sites of near-perfect duplication are at or near the 5' ends of their respective scaffolds, suggesting that the duplicated sequences could be located near repetitive elements or MIC chromosomal breakage sites.

2.5.4. Discussion

It is clear from the sRNA clusters' paralogy that they arose from a group of common ancestral sequences. The exact series of divergence events leading to each extant cluster is unknown but includes numerous rearrangement and duplication steps. The near-complete incidence of sRNAs within each cluster's set of paralogs suggests that some shared sequence feature confers their sRNA-generating potential. In contrast to the homology between related clusters, we find little or no conservation of individual sRNAs' positions and propose that they are synthesized more or less randomly within each cluster. Lastly, we examine the set of exceptions: the few strong sRNA cluster paralogs to which no observed sRNAs mapped. In the next section, we will search for shared sequence features of the sRNA clusters and assess their presence in this set of paralogs.

2.6. A Novel Sequence Motif Is Strongly and Specifically Associated with ~23 nt sRNA Activity

2.6.1. Overview

Based on the strong and specific homology between sRNA clusters, we hypothesized that they share a motif responsible for directing nearby sRNA biogenesis. A diverse range of cellular pathways and functions are directed in cis at genomic and transcriptional loci by the presence of DNA and RNA sequence motifs. Such motifs commonly serve to specifically bind the protein factors that actually effect processes such as transcriptional initiation, alternative splicing, nuclear export, and post-translational repression.
The presence of a sequence motif, namely a variable-length, nearly homopolymeric stretch of adenosine bases ("A-rich tracts") opposite the sRNA clusters on the predicted coding strand, was suggested during the course of this study (K. Collins, personal communication). The existence of this motif near the sRNA clusters was later reported (1), but its specificity to these loci and its functional role were not explored. The genomic loci of A-rich motif instances near sRNA clusters are shown in Appendix Table A3. Motif positions do not appear to be conserved between paralogous sRNA cluster pairs, suggesting that this shared feature could have arisen convergently rather than simply by duplication.

We searched systematically for all motifs overrepresented among sequences flanking the small RNA loci. To determine the specificity of any resulting motifs to sRNA-associated loci, we derived a randomized control set of predicted genes and performed a similar motif search. We next developed a model to capture the important elements of the A-rich tracts and applied it genome-wide to more rigorously demonstrate their specific association with sRNA activity. Lastly, we predict a set of loci strongly associated with this motif at which sRNA activity has not yet been observed but may be likely under the motif's hypothesized function.

2.6.2. Methods

2.6.2.1. General Motif-Finding at sRNA Clusters

The motif-finding software MEME (28) was applied to probe for overrepresented motifs among the sRNA cluster sequences, including but in no way favoring the A-rich tracts. The full genomic sequences and downstream flanking regions (including putative 3' UTRs) of all predicted genes overlapping each of the 10 sRNA clusters were extracted. Two clusters that did not overlap any gene predictions were excluded, and the resulting 18 sequences were input to MEME.
For this and subsequent searches, the "-zoops" option was used to direct MEME to find either zero or one motif instances per sequence in order to preclude artifactual motifs that could arise were the software constrained to find a motif hit in every sequence of the input. Additionally, average genomic mono- and dinucleotide frequencies were supplied to MEME to ensure that the motifs' expectations of occurrence were properly estimated. Motifs reported by MEME were assigned "strength" and "quality" scores. Motif strength was taken as the negative log of the E-value reported by MEME, thus reflecting the degree to which the reported motif is overrepresented among the input sequences. Motif quality is taken as the number of input sequences reported by MEME to match a given motif. Low-quality motifs - those with instances in only a few of the inputs - were generally artifacts reflecting those sequences' paralogy.

2.6.2.2. Motif-Finding Controls

A similar motif search was performed among a set of control sequences to provide a comparative basis for the strength and specificity of any sRNA-associated motifs. As the input set of 18 genes overlapping sRNA clusters included several sets of adjacent genes, a randomized cohort was selected to control for the effects of paralogy between nearby genes by drawing equally sized sets of genes from gene clusters of corresponding counts. For example, the sRNA cluster on scaffold 8254600 overlaps three adjacent predicted genes, so for its corresponding control, a cluster of three or more genes was randomly taken from the set of all such gene clusters, and then a contiguous block of three genes was selected randomly from that cluster. This process was repeated 20,000 times, each time yielding a control set of 18 genes. MEME was then run on each control set using the same options as for the sRNA-associated genes.
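The control-cohort sampling described above can be illustrated with a minimal Python sketch. It is not the code used in this study: the function names, the representation of a gene cluster as an ordered list of gene identifiers, and the example block sizes are all assumptions made for illustration, and the sketch does not prevent two blocks in one control set from overlapping, a detail the text does not specify.

```python
import random

def sample_control_block(gene_clusters, block_size, rng):
    """Pick a random gene cluster with at least `block_size` genes, then a
    random contiguous block of `block_size` genes from within it.

    `gene_clusters` is assumed to be a list of lists of gene identifiers,
    each inner list ordered by genomic position (hypothetical layout).
    """
    eligible = [c for c in gene_clusters if len(c) >= block_size]
    if not eligible:
        raise ValueError(f"no gene cluster has {block_size} or more genes")
    cluster = rng.choice(eligible)
    start = rng.randrange(len(cluster) - block_size + 1)
    return cluster[start:start + block_size]

def sample_control_set(gene_clusters, block_sizes, rng):
    """Draw one control block per block of adjacent sRNA-cognate genes.

    `block_sizes` lists how many adjacent genes each sRNA cluster overlaps;
    the control set therefore has sum(block_sizes) genes in total.
    """
    control = []
    for size in block_sizes:
        control.extend(sample_control_block(gene_clusters, size, rng))
    return control

# Toy usage with made-up gene names and block sizes (not the thesis data).
toy_clusters = [["g1", "g2", "g3", "g4"], ["g5", "g6"], ["g7"]]
toy_control = sample_control_set(toy_clusters, [3, 2, 1], random.Random(0))
```

Repeating `sample_control_set` many times with the real block sizes would yield the collection of control sets on which the motif search is then run.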
The resulting motif strength and quality scores comprise the background joint distribution over these statistics for motifs found within and downstream of predicted genes in Tetrahymena.

2.6.2.3. Discriminative Motif Modeling

2.6.2.3.1. A-Rich Motif Model Architecture

MEME employs a fixed-length positional weight matrix (PWM) to model sequence motifs and was ill-suited to describe the putatively sRNA-associated A-rich tracts because their lengths varied widely. In place of MEME, a custom Hidden Markov Model (HMM) was developed to better describe this motif. The HMM state architecture (shown in Figure 6) was composed of a background sub-model (states named "B_...") and a motif sub-model (states "M_..."). The former was a first-order Markov chain with five fully connected nodes (one for each RNA nucleotide plus the ambiguity character "N"). The background sub-model was fully unrestricted in transitions and used to collectively model all non-motif sequence.

The motif sub-model was designed to emit sequences resembling the A-rich motif instances observed near sRNA clusters. On the strand opposite the sRNAs, these motif instances were generally composed of runs of A's separated by intervening mono- and dinucleotides. By disallowing certain transitions, the motif sub-model was designed to emit with nonzero probability any sequence matching the pattern $(A^{\geq 2}\,[CGU]^{1\text{-}2})^{\geq 1}$. The motif sub-model contains six nodes: an entry node M0, an "A-repeat" node M1, and lastly nodes M2X (X ∈ {C,G,U}) and M3, which model intervening mono- or dinucleotides.

Figure 6. A-Rich Motif Hidden Markov Model. HMM architecture for discriminative modeling of putatively sRNA-associated A-rich motifs and all other macronuclear genomic sequence. Circles denote states and edges denote transitions having nonzero probability; transitions between motif states are shown in red. The red character(s) in each non-silent state are the emissible characters at that state.

2.6.2.3.2.
A-Rich Motif Model Training

The model was trained on a fully-labeled version of the same 18 coding and downstream sequences of sRNA-cognate genes as previously input to MEME. Fully-labeled sequences were obtained by assigning A-rich regions previously identified by MEME to the motif sub-model and every remaining region to the background sub-model. Given such a partition of each input sequence into disjoint motif and background regions, a unique parse exists, allowing the sequence to be unambiguously labeled. This labeling method was accommodated by the particular model architecture used and the restriction of certain transition and emission probabilities. In general, this is not possible with HMMs, and training would otherwise be performed using an iterative "missing-data" approach such as the Baum-Welch algorithm.

Several parameters were fixed prior to training in order to avoid over-fitting characteristics specific to the inputs, thus hopefully improving generalization performance. In particular, transition probabilities within the background sub-model were set equal to the corresponding genome-wide dinucleotide frequencies conditioned on the first base. This was done to avoid biasing the background sub-model towards the highly paralogous sRNA-cognate genes in the input set. Additionally, transition probabilities from the start state and to the end state were set equal to the corresponding average genomic mononucleotide frequencies.

Several motif sub-model parameters were likewise fixed. Intervening dinucleotides were rarer within motifs than mononucleotides, and so it was not possible to reliably estimate the composition at their second base (state M3). In particular, this state's emission and outgoing transition probabilities were set equal to appropriately-conditioned compositional frequencies drawn from the genome at large.

Viterbi training was then performed on the remaining set of free parameters.
Each was set equal to its maximum likelihood value, i.e., the observed frequencies of each transition or emission in the labeled training data. For each such trained parameter, a pseudo-count of 1 was added, e.g., $a_{M_1 \to M_{2U}} \propto A_{M_1 \to M_{2U}} + 1$, where $a$ denotes the trained frequency and $A$ the observed count. All calculations were performed in logarithm space to improve performance and numerical stability.

Leave-one-out cross-validation was performed to assess the stability of motif predictions in the face of limited training data. The model was re-trained 18 times, each time excluding a different one of the input sequences. Following each re-training, the Viterbi (maximum-likelihood) parse of the excluded input sequence was determined and the intervals labeled as motif hits by the HMM compared with the MEME labeling.

2.6.2.3.3. Genome-Wide A-Rich Motif Prediction and Scoring

Finally, the HMM was applied to every gene in the Tetrahymena MAC genome in order to search for additional instances of the A-rich motif. For each of the 27,306 gene predictions, the primary transcript sequence including downstream flanking region was extracted. Because many predicted genes lack 3' UTR annotations, the downstream flanking region was taken as the sequence following the given gene's STOP codon and extending to the next annotated start codon or scaffold end, whichever was closer. The optimal parse $H_{\mathrm{opt},i} = \arg\max_{H} \{\Pr(H)\}$, taken over parses $H$ of each extracted gene and flanking sequence $i$, was determined by the Viterbi dynamic programming algorithm, and its likelihood $h_{\mathrm{opt},i}$ was recorded. Additionally, the parse containing only background states was evaluated for the given sequence and its likelihood $h_{\mathrm{bg},i}$ recorded. For each sequence, a motif quality score was taken as the log-likelihood ratio of the optimal parse, including any motif intervals, and the background parse, where no motif subintervals were allowed, scaled by the number of characters $n_i$ predicted to be motif hits:

$$S_i = \frac{1}{n_i} \lg \frac{h_{\mathrm{opt},i}}{h_{\mathrm{bg},i}}.$$
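As an illustration of the scoring just defined, the following Python sketch runs a log-space Viterbi parse on a deliberately simplified two-state model (one background state "B" and one motif state "M") and computes the per-character log-likelihood ratio. The two-state topology, all probabilities, and the function names are assumptions made for this example; they are not the six-state motif sub-model or the trained parameters of this study, and natural logarithms are used, which changes the score only by a constant factor.

```python
import math

NEG_INF = float("-inf")

def viterbi_log(obs, states, log_start, log_trans, log_emit):
    """Most likely state path for `obs` under a discrete HMM, in log space.

    log_start[s], log_trans[p][s] and log_emit[s][symbol] are log-probabilities;
    missing entries are treated as probability zero.
    Returns (log-likelihood of the best path, best path as a list of states).
    """
    if not obs:
        raise ValueError("obs must be non-empty")
    score = {s: log_start.get(s, NEG_INF) + log_emit[s].get(obs[0], NEG_INF)
             for s in states}
    backpointers = []
    for symbol in obs[1:]:
        new_score, pointers = {}, {}
        for s in states:
            best_prev = max(states,
                            key=lambda p: score[p] + log_trans[p].get(s, NEG_INF))
            pointers[s] = best_prev
            new_score[s] = (score[best_prev] + log_trans[best_prev].get(s, NEG_INF)
                            + log_emit[s].get(symbol, NEG_INF))
        score = new_score
        backpointers.append(pointers)
    last = max(states, key=lambda s: score[s])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return score[last], path

def motif_score(log_h_opt, log_h_bg, n_motif):
    """Log-likelihood ratio per predicted motif character; zero if no motif."""
    if n_motif == 0:
        return 0.0
    return (log_h_opt - log_h_bg) / n_motif

def to_log(table):
    return {key: math.log(p) for key, p in table.items()}

# Toy parameters chosen for illustration only.
LOG_START = to_log({"B": 1.0})
LOG_TRANS = {"B": to_log({"B": 0.95, "M": 0.05}),
             "M": to_log({"M": 0.90, "B": 0.10})}
LOG_EMIT = {"B": to_log({c: 0.25 for c in "ACGU"}),
            "M": to_log({"A": 0.85, "C": 0.05, "G": 0.05, "U": 0.05})}

def score_sequence(seq):
    """Return (score, number of motif characters) for one sequence."""
    log_h_opt, path = viterbi_log(seq, ["B", "M"], LOG_START, LOG_TRANS, LOG_EMIT)
    # Background-only parse: the same model restricted to background states.
    log_h_bg, _ = viterbi_log(seq, ["B"], LOG_START, LOG_TRANS, LOG_EMIT)
    n_motif = path.count("M")
    return motif_score(log_h_opt, log_h_bg, n_motif), n_motif
```

Because the optimal parse is chosen from a set of paths that includes the background-only path, the ratio is never negative, and dividing by the motif length keeps long tracts from dominating purely by size.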
This score is taken as a metric of motif quality, reflecting the confidence per predicted motif character in each gene with predicted motif runs. In genes where no motif runs are predicted ($n_i = 0$), $S_i$ was set equal to zero.

2.6.3. Results

2.6.3.1. General Motif Finding Among sRNA-cognate genes

Using MEME, we identified four motifs within the coding and flanking regions of the input genes. The strongest motif discovered (E = 6.8e-46) occurred in each one of the 18 inputs and corresponds to the A-rich runs noted near sRNA clusters. A sequence logo representation of the motif as reported by MEME is shown below in Figure 7.

Figure 7. Sequence logo of A-rich motifs found by MEME. This logo reflects the motif's frequency of adenosines, but is based on a position weight matrix and does not capture any dependence between features within the motif.

The three next strongest motifs reported by MEME were significantly weaker (E >= 4.1e-19). Moreover, two of them hit only within predicted coding regions of 6 pairwise paralogous genes at sRNA loci and thus are unlikely to be general features of sRNA-associated sequences. Temporarily excluding the predicted coding sequences and restricting the search to the downstream sequences of the same 18 genes, MEME again found a strong A-rich motif (E = 7.8e-55) in each input sequence. For the downstream sequences, all further motifs found by MEME are weak (E >= 9.8e-10) and have instances in five or fewer of the inputs. Taken together, these results suggest that runs of adenosines on the coding strand downstream of predicted genes are the strongest shared sequence feature of sequences at the sRNA loci.

We next applied MEME in the same manner to the control cohort of genes selected to have potential for common motifs comparable to that of the sRNA-cognate genes.
Motif strength (E-value) and quality (number of inputs hit) are shown in Figure 8 for MEME results on 20,000 control sets and the sRNA-cognate genes.

Figure 8. Strength vs. Quality for Motifs Found Among sRNA-Cognate Genes and Randomized Controls (horizontal axis: hit count of best motif, max = 18; series: control runs, n = 20,000, and sRNA-cognates). Joint empirical distribution over strength (-log MEME E-value) and quality (number of inputs hit) of best (lowest E-value) motif hit reported by MEME for sRNA-associated and control sets of 18 genes each. Adjacent histograms show marginal distributions of each statistic. All 18 sRNA-associated genes tested contain a motif exceeding in strength all those found in more than 7 of 18 genes in every control run.

Out of 20,000 control runs, only two motifs were identified with greater strength (i.e., a lower E-value) than the motif near sRNA clusters, and those two motifs were only found in half or fewer of the inputs to MEME. Thus, the A-rich motif does not appear to be a general feature found downstream of Tetrahymena genes, and is much more strongly associated with sRNA-cognate regions than motifs found within comparable controls.

2.6.3.2. Discriminative Modeling of A-Rich Tracts

2.6.3.2.1. Genome-Wide Search Using PWM

We initially attempted to use the MEME-companion tool MAST to search for genome-wide A-rich motif instances. This, however, amounted to aligning a long PWM with strong bias for "A" to a genome with nearly 80% A+T composition. Because the PWM could not capture apparently important features of the A-rich tracts - such as the relative paucity of intervening characters longer than 2-3 nt - this approach exhibited very poor specificity, and motivated our decision to develop a custom model to describe this motif.

2.6.3.2.2. Genome-Wide Search Using HMM

We next applied the trained HMM to find instances of the A-rich motif near other genes.
For every predicted macronuclear gene and downstream region, we obtained two scores: the number of bases in that gene likely to be within a motif instance, and the per-base score of the best motif match in that gene's sequence. The joint distribution of these two metrics is shown in Figure 9. Discarding the lowest sRNA-associated score on each axis as an outlier, we count 109 genes with A-rich motif instances at least as strong as the weakest remaining sRNA-associated gene. Genes with strong motif scores are listed in Table A4; those not located near known sRNA clusters are bolded. Several of these ("family IV") are close paralogs of sRNA cluster 2 (8254572). Four other genes with strong A-rich motifs (9721, 9739-40, 17099) group by paralogy and are duplicated in numerous places throughout the genome. All other genes appear to be singletons or are divergent enough to fail alignment filtering.

Figure 9. Quality vs. Quantity of Motif Instances for All Tetrahymena Predicted Genes as Determined by HMM (horizontal axis: number of motif nucleotides, ni; series: sRNA-cognate genes, N = 18; high-scoring genes, N = 109; all other genes, N = 27,179). All predicted Tetrahymena macronuclear genes plotted by strength and quantity of A-rich motif instances as scored by HMM. Known sRNA-associated genes (red) comprised the training set and score highly. 109 genes without known sRNA activity (light blue) have relatively long motif hits with similar characteristics to those near sRNA loci.

2.6.3.2.3. Cross-Validation Results

A Hidden Markov Model (HMM) was specifically designed to classify A-rich motif instances. Promisingly, under leave-one-out cross-validation, each motif hit near sRNA loci identified by MEME was rediscovered using our HMM during each round of cross-validation.
Because MEME was constrained to return no more than one motif hit per sequence, the HMM identified additional stretches of A-rich motifs near some of the 18 genes, for a total of 28 motif hits. Of the additional ten hits beyond the MEME-identified set, all but two were rediscovered in every round of cross-validation testing, and those two were rediscovered in more than half of the cross-validation rounds. The stability of these results indicates that this HMM motif model is robust to perturbations in training data, and suggests the motif runs identified by the model are reflective of those near observed sRNA clusters.

2.6.3.3. Discussion

We have shown that an HMM trained on sRNA-associated A-rich motif instances discriminates very well between sRNA-cognate genes and all others. As these A-rich tracts appear to be the strongest feature at these loci and are nearly exclusive to them, we propose that they trigger sRNA biosynthesis through an unknown mechanism. Instances of this motif occur in a small set of other loci throughout the genome which may also have sRNA-generating potential and would therefore make ideal subjects for follow-up investigation in vivo.

2.6.3.3.1. Proposed Experiments

We propose to probe some of the poly-A loci for sense and antisense transcription. It might be best to start at loci with paralogy to sites of known sRNA production. Among genes with strong A-rich motifs, predicted gene 12216 seems best supported by paralogs with observed sRNA activity (sRNA clusters 4, 6, and 7). Family IV, which includes cluster 2, contains gene predictions with nearby A-rich motifs but only two sRNA loci. Among the remaining motif-associated genes, four group by paralogy (family V; gene predictions 9721, 9739-40, and 17099) in a manner similar to known sRNA loci. For a negative control on sRNA activity, we suggest probing for antisense ~23 nt sRNAs at a number of well-characterized genes expressed throughout vegetative growth.
If it turns out that motif-associated genes have antisense sRNA activity, we propose the creation of chimeric genes combining the positive control genes with A-rich regions to attempt inducing sRNA activity at new loci. Also, to investigate whether the observed sRNA clusters are essential for viability, we suggest the creation of knockout strains lacking these loci.

2.7. Conclusion

The precise function of the ~23 nt sRNAs of Tetrahymena remains to be seen. Despite sharing numerous characteristics with well-characterized classes of interfering RNA, our analysis failed to uncover evidence supporting functional similarity to classical RNAi pathways. Given the ciliates' unique molecular organization, it would hardly be surprising to discover that they possess another highly specialized "niche" mode of RNAi in addition to scanning RNA. Yet despite the ~23 nt sRNAs' present functional ambiguity, it is tempting for several reasons to speculate that they play an essential cellular role. First, their putative nuclease, Dcl2p, is essential for vegetative growth. Tetrahymena grows extremely rapidly given its size and must come under considerable stress to maintain transcriptional efficiency, suggesting that the sRNA processing machinery and the cluster loci themselves are under selective pressure to remain expressed.

Via computational analysis, we uncovered a strong and specific association between sRNA activity and an A-rich motif. We have predicted additional occurrences of this motif and suggest that one way to test their association with the sRNAs would be to probe for their presence at these loci. That the ~23 nt sRNAs were organized in clusters despite the apparent diversity of their population suggests that they are processed from a relatively small number of loci genome-wide, consistent with the relatively small number of A-rich motif occurrences discovered computationally.

Ultimately, the question of these sRNAs' purpose remains.
One speculative hypothesis is that they guide post-transcriptional silencing of their own loci in cis. RNAi can be induced by feeding in the related ciliate Paramecium (29), so the necessary silencing factors may exist in Tetrahymena. Silencing could be induced for any of numerous reasons, including to quickly dampen transcript levels in response to some environmental stimulus, to buffer against residual transcriptional activity of genes recently made obsolete, or perhaps to protect against macronuclear contamination from MIC or other foreign DNA.

2.8. References Cited

1. Lee SR, Collins K. Two Classes of Endogenous Small RNAs in Tetrahymena Thermophila. Genes Dev. 2006 Jan 1;20(1):28-33. 2. Mochizuki K, Fine NA, Fujisawa T, Gorovsky MA. Analysis of a Piwi-Related Gene Implicates Small RNAs in Genome Rearrangement in Tetrahymena. Cell. 2002 Sep 20;110(6):689-99. 3. Yao MC, Chao JL. RNA-Guided DNA Deletion in Tetrahymena: An RNAi-Based Mechanism for Programmed Genome Rearrangements. Annu Rev Genet. 2005;39:537-59. 4. Collins K, Gorovsky MA. Tetrahymena Thermophila. Curr Biol. 2005 May 10;15(9):R317-8. 5. Dessen P, Zagulski M, Gromadka R, Plattner H, Kissmehl R, Meyer E, Betermier M, Schultz JE, Linder JU, Pearlman RE, Kung C, Forney J, Satir BH, Van Houten JL, Keller AM, Froissard M, Sperling L, Cohen J. Paramecium Genome Survey: A Pilot Project. Trends Genet. 2001 Jun;17(6):306-8. 6. Paramecium tetraurelia Genome Sequencing Project. Paramecium tetraurelia Genome Browser [Internet]. http://www.genoscope.cns.fr/externe/Francais/Projets/Projet_FN/:; 2005 5/25/2005. 7. Zagulski M, Nowak JK, Le Mouel A, Nowacki M, Migdalski A, Gromadka R, Noel B, Blanc I, Dessen P, Wincker P, Keller AM, Cohen J, Meyer E, Sperling L. High Coding Density on the Largest Paramecium Tetraurelia Somatic Chromosome. Curr Biol. 2004 Aug 10;14(15):1397-404. 8. Arabidopsis Genome Initiative. Analysis of the Genome Sequence of the Flowering Plant Arabidopsis Thaliana. Nature.
2000 Dec 14;408(6814):796-815. 9. Ku HM, Vision T, Liu J,Tanksley SD. Comparing Sequenced Segments of the Tomato and Arabidopsis Genomes: Large-Scale Duplication Followed by Selective Gene Loss Creates a Network of Synteny. Proc Natl Acad Sci U S A. 2000 Aug 1;97(16):9121-6. 10. Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, FitzHugh W, Funke R, Gage D, Harris K, Heaford A, Howland J, Kann L, Lehoczky J,LeVine R, McEwan P, McKernan K, Meldrim J, Mesirov JP, Miranda C, Morris W, Naylor J,Raymond C, Rosetti M, Santos R, Sheridan A, Sougnez C,Stange-Thomann N, Stojanovic N, Subramanian A, Wyman D, Rogers J, Sulston J, Ainscough R, Beck S, Bentley D, Burton J,Clee C, Carter N, Coulson A, Deadman R, Deloukas P, Dunham A, Dunham 1,Durbin R, French L, Grafham D,Gregory S, Hubbard T, Humphray S, Hunt A, Jones M, Lloyd C, McMurray A, Matthews L, Mercer S, Milne S, Mullikin JC, Mungall A, Plumb R, Ross M, Shownkeen R, Sims S, Waterston RH, Wilson RK, Hillier LW, McPherson JD, Marra MA, Mardis ER, Fulton LA, Chinwalla AT, Pepin KH, Gish WR, Chissoe SL, Wendl MC, Delehaunty KD, Miner TL, Delehaunty A, Kramer JB, Cook LL, Fulton RS, Johnson DL, Minx PJ, Clifton SW, Hawkins T, Branscomb E, Predki P, Richardson P, Wenning S, Slezak T, Doggett N, Cheng JF, Olsen A, Lucas S, Elkin C, Uberbacher E, Frazier M,Gibbs RA, Muzny DM, Scherer SE, Bouck JB, Sodergren EJ, Worley KC, Rives CM, Gorrell JH, Metzker ML, Naylor SL, Kucherlapati RS, Nelson DL, Weinstock GM, Sakaki Y, Fujiyama A, Hattori M, Yada T, Toyoda A, Itoh T, Kawagoe C, Watanabe H, Totoki Y, Taylor T, Weissenbach J, Heilig R, Saurin W, Artiguenave F, Brottier P, Bruls T, Pelletier E, Robert C, Wincker P, Smith DR, Doucette-Stamm L, Rubenfield M, Weinstock K, Lee HM, Dubois J, Rosenthal A, Platzer M,Nyakatura G, Taudien S, Rump A, Yang H, Yu J, Wang J,Huang G, Gu J, Hood L, Rowen L, Madan A, Qin S, Davis RW, Federspiel NA, Abola AP, Proctor MJ, Myers RM, Schmutz J, Dickson M, 
Grimwood J, Cox DR, Olson MV, Kaul R, Raymond C, Shimizu N, Kawasaki K, Minoshima S, Evans GA, Athanasiou M, Schultz R, Roe BA, Chen F, Pan H, Ramser J, Lehrach H, Reinhardt R, McCombie WR, de la Bastide M, Dedhia N, Blocker H, Hornischer K,Nordsiek G, Agarwala R,Aravind L, Bailey JA, Bateman A, Batzoglou S, Birney E, Bork P, Brown DG, Burge CB, Cerutti L, Chen HC, Church D, Clamp M, Copley RR, Doerks T, Eddy SR, Eichler EE, Furey TS, Galagan J,Gilbert JG, Harmon C, Hayashizaki Y, Haussler D, Hermjakob H,Hokamp K, Jang W, Johnson LS, Jones TA, Kasif S, Kaspryzk A, Kennedy S, Kent WJ, Kitts P, Koonin EV, Korf I, Kulp D, Lancet D, Lowe TM, McLysaght A, Mikkelsen T, Moran JV, Mulder N, Pollara VJ, Ponting CP, Schuler G, Schultz J, Slater G, Smit AF, Stupka E, Szustakowski J, ThierryMieg D, Thierry-Mieg J,Wagner L, Wallis J, Wheeler R, Williams A, Wolf YI, Wolfe KH, Yang SP, Yeh RF, Collins F, Guyer MS, Peterson J, Felsenfeld A, Wetterstrand KA, Patrinos A, Morgan MJ, de Jong P, Catanese JJ, Osoegawa K, Shizuya H, Choi S, Chen YJ, International Human Genome Sequencing Consortium. Initial Sequencing and Analysis of the Human Genome. Nature. 2001 Feb 15;409(6822):860-921. 11. 
Heilig R, Eckenberg R, Petit JL, Fonknechten N, Da Silva C, Cattolico L, Levy M, Barbe V, de Berardinis V, Ureta-Vidal A, Pelletier E,Vico V, Anthouard V, Rowen L, Madan A, Qin S, Sun H, Du H, Pepin K, Artiguenave F, Robert C, Cruaud C, Bruls T, Jaillon O, Friedlander L, Samson G, Brottier P, Cure S, Segurens B, Aniere F, Samain S, Crespeau H, Abbasi N, Aiach N, Boscus D, Dickhoff R, Dors M, Dubois I, Friedman C, Gouyvenoux M, James R, Madan A, Mairey-Estrada B, Mangenot S, Martins N, Menard M, Oztas S, Ratcliffe A, Shaffer T, Trask B, Vacherie B, Bellemere C, Belser C, Besnard-Gonnet M, Bartol-Mavel D, Boutard M, Briez-Silla S, Combette S, DufosseLaurent V, Ferron C, Lechaplais C, Louesse C, Muselet D, Magdelenat G, Pateau E, Petit E, Sirvain-Trukniewicz P, Trybou A, Vega-Czarny N, Bataille E, Bluet E, Bordelais I, Dubois M, Dumont C, Guerin T, Haffray S, Hammadi R, Muanga J, Pellouin V, Robert D, Wunderle E, Gauguet G, Roy A, Sainte-Marthe L, Verdier J, Verdier-Discala C, Hillier L, Fulton L, McPherson J, Matsuda F, Wilson R, Scarpelli C, Gyapay G, Wincker P, Saurin W, Quetier F, Waterston R, Hood L, Weissenbach J. The DNA Sequence and Analysis of Human Chromosome 14. Nature. 2003 Feb 6;421(6923):601-7. 12. Mochizuki K, Gorovsky MA. A Dicer-Like Protein in Tetrahymena has Distinct Functions in Genome Rearrangement, Chromosome Segregation, and Meiotic Prophase. Genes Dev. 2005 Jan 1;19(1):77-89. 13. Lee SR, Collins K. Starvation-Induced Cleavage of the tRNA Anticodon Loop in Tetrahymena Thermophila. J Biol Chem. 2005 Dec 30;280(52):42744-9. 14. Lee Y, Jeon K, Lee JT, Kim S, Kim VN. MicroRNA Maturation: Stepwise Processing and Subcellular Localization. EMBO J. 2002 Sep 2;21(17):4663-70. 15. Vazquez F, Vaucheret H, Rajagopalan R, Lepers C, Gasciolli V, Mallory AC, Hilbert JL, Bartel DP, Crete P. Endogenous Trans-Acting siRNAs Regulate the Accumulation of Arabidopsis mRNAs. Mol Cell. 2004 Oct 8; 16(1):69-79. 16. Ambros V, Lee RC, Lavanway A, Williams PT, Jewell D. 
MicroRNAs and Other Tiny Endogenous RNAs in C. Elegans. Curr Biol. 2003 May 13;13(10):807-18. 17. Tang G, Reinhart BJ, Bartel DP, Zamore PD. A Biochemical Framework for RNA Silencing in Plants. Genes Dev. 2003 Jan 1;17(1):49-63. 18. Lau NC, Lim LP, Weinstein EG, Bartel DP. An Abundant Class of Tiny RNAs with Probable Regulatory Roles in Caenorhabditis Elegans. Science. 2001 Oct 26;294(5543):858-62. 19. Gustafson AM, Allen E, Givan S, Smith D, Carrington JC, Kasschau KD. ASRP: The Arabidopsis Small RNA Project Database. Nucleic Acids Res. 2005 Jan 1;33(Database issue):D637-40. 20. Griffiths-Jones S, Grocock RJ, van Dongen S, Bateman A, Enright AJ. MiRBase: MicroRNA Sequences, Targets and Gene Nomenclature. Nucleic Acids Res. 2006 Jan 1;34(Database issue):D140-4. 21. Griffiths-Jones S. The microRNA Registry. Nucleic Acids Res. 2004 Jan 1;32(Database issue):D109-11. 22. Bartel DP. MicroRNAs: Genomics, Biogenesis, Mechanism, and Function. Cell. 2004 Jan 23;116(2):281-97. 23. Hofacker IL, Fontana W, Stadler PF, Bonhoeffer S, Tacker M, Schuster P. Fast Folding and Comparison of RNA Secondary Structures. Monatshefte f Chemie. 1994;125:167-188. 24. Lewis BP, Burge CB, Bartel DP. Conserved Seed Pairing, often Flanked by Adenosines, Indicates that Thousands of Human Genes are microRNA Targets. Cell. 2005 Jan 14;120(1):15-20. 25. Lewis BP, Shih IH, Jones-Rhoades MW, Bartel DP, Burge CB. Prediction of Mammalian microRNA Targets. Cell. 2003 Dec 26;115(7):787-98. 26. Gish W. [Internet]. 1996-2003. Available from: http://blast.wustl.edu. 27. Nix DA, Eisen MB. GATA: A Graphic Alignment Tool for Comparative Sequence Analysis. BMC Bioinformatics. 2005 Jan 17;6(1):9. 28. Bailey TL, Elkan C. Fitting a Mixture Model by Expectation Maximization to Discover Motifs in Biopolymers. Proc Int Conf Intell Syst Mol Biol. 1994;2:28-36. 29. Galvani A, Sperling L. RNA Interference by Feeding in Paramecium. Trends Genet. 2002 Jan;18(1):11-2.
CHAPTER 3

Estimating Tissue-Specific Patterns of Depletion and Enrichment for microRNA Targeting in Mammalian Genes

Abstract

Over the course of evolution, mammalian genes have come under tremendous selective pressure to avoid certain regulatory sequence motifs while maintaining others. MicroRNA-mediated RNA interference is a potent repressor of mammalian gene expression, and it was recently reported (1) that genes preferentially expressed in certain tissues are under significant and measurable pressure to avoid targeting by coexpressed microRNAs (miRNAs). An enhanced method to estimate the degree of motif depletion and enrichment is reported and applied to miRNA target motifs across a panel of 61 mouse tissues. Resulting patterns of target depletion display strong correlation with miRNAs' reported expression profiles, reinforcing prior reports. Lastly, the inverse analysis is applied, revealing target enrichment consistent with miRNAs' recently hypothesized role in tissue identity maintenance.

3.1. Introduction

Numerous regulatory pathways in the cell are mediated by specific sequence motifs. Transcription, the first step in the central dogma of molecular biology, is activated by factors that recognize and bind promoter elements within DNA loci to be expressed. For example, a circuit of transcriptional control largely conserved from humans to archaebacteria responds to heat shock (2) by activating genes with special promoter motifs. As biological complexity grew during the course of evolution, organisms relied upon such regulatory layers to exert increasingly fine-tuned control over gene expression to respond to different environmental stimuli and develop specialized molecular and physiological structures. The mammalian transcriptome comes under the control of numerous regulatory pathways, including RNAi, signal transduction, splicing, and nonsense-mediated decay, to name a few.
The benefits of complex regulation are accompanied by selective pressure to maintain or gain functional motifs and, conversely, to avoid those with deleterious effects. Here we describe a method to estimate enrichment or depletion for such motifs specific to genes expressed in a certain tissue or cell population. This method enhances one recently applied to find tissue-specific patterns of microRNA target depletion (1), and for comparative purposes we apply it to the same dataset. We additionally propose that this method could be applied to other regulatory processes, both to measure the tissue specificities of known motifs and to discover novel ones.

Farh et al. (1) reported that genes preferentially expressed in a given tissue show significant avoidance of targeting by miRNAs coexpressed in the same tissue. Many microRNAs are expressed in a highly tissue-specific fashion and regulate a small set of target genes with dramatic effect during tissue development. However, an inevitably large remainder of genes must be expressed unhindered and are thus depleted for such miRNA target sites. We estimate patterns of targeting depletion largely coinciding with recent reports (1, 3) as well as with miRNA expression data. Finally, we invert this analysis to search for target enrichment, and find patterns consistent with miRNAs' proposed role in silencing aberrant transcription to help maintain tissue identity.

3.2. Methods

3.2.1. Microarray Dataset Processing

The GeneAtlas v2 mouse microarray dataset (4) was downloaded in MAS5-normalized form from the Novartis Research Foundation (http://wombat.gnf.org). This dataset comprises intensity levels for 36,182 probes interrogating over 20,000 RefSeq-annotated genes in cellular extracts of 61 different tissues. The geometric mean was taken over the two replicate sets of measurements. Microarray probes were mapped by name to RefSeq-annotated mouse genes using mapping tables provided by the UCSC Genome Browser (5).
Following exclusion of probes with names ending in "_x_at", which target entire gene families rather than single genes, 35,224 probes remained. Even so, some genes were targeted by multiple probes; in these cases, the arithmetic mean was taken over the corresponding probe sets. Averaging over replicates and redundant probes yielded a 13,894 x 61 matrix of intensity values with genes on the rows and tissues on the columns.

To mitigate the noise inherent in array measurements, every gene's intensity level in a given tissue was replaced by its sorted rank with respect to the levels of all other genes in that tissue. This was repeated separately for each tissue, yielding a 13,894 x 61 matrix of tissue-specific ranks with values ranging from 1, representing the lowest expression level, to 13,894, representing the highest level.

3.2.2. Observed Sequence Features

3' UTR coordinates were obtained from RefSeq annotations for all genes targeted by the microarray dataset by concatenating exonic ranges not contained within the annotated coding sequence. Transcripts with 3' UTRs less than 50 nt in length are likely to reflect incorrect annotation and were discarded for statistical reasons. For the remaining 13,144 genes, a total of 13,289 3' UTR sequences were then extracted from the mm7 assembly of the mouse genome. Counts of all hexamers and heptamers were then obtained for the extracted sequences, resulting in two matrices of 13,144 rows by 4,096 and 16,384 columns, respectively. Constituent oligonucleotides (kmers) containing ambiguity characters at any position were not counted. When a gene had multiple 3' UTR annotations, the arithmetic mean of kmer counts was taken over its transcript sequences.

3.2.3. Sequence Feature Background Model

The expected counts of all hexamers and heptamers given a second-order Markov model were calculated for each sequence.
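As a rough illustration of a background model of this kind, the sketch below estimates a second-order Markov model from a single sequence and uses it to compute the expected count of one kmer. The function name `expected_kmer_count` is hypothetical, and the treatment of the leading dinucleotide (its probability taken as the marginal of the trinucleotide counts) is an assumption rather than a detail confirmed by the text.

```python
from collections import Counter


def expected_kmer_count(seq: str, kmer: str) -> float:
    """Expected occurrences of `kmer` in `seq` under a second-order Markov
    model whose parameters are estimated from `seq` itself (a sketch)."""
    k = len(kmer)
    if k < 3 or len(seq) < k:
        return 0.0
    # Overlapping trinucleotide counts; words with ambiguity codes are skipped.
    tri = Counter(
        seq[i:i + 3]
        for i in range(len(seq) - 2)
        if set(seq[i:i + 3]) <= set("ACGT")
    )
    total = sum(tri.values())
    if total == 0:
        return 0.0

    def dinuc(prefix: str) -> int:
        # Number of counted trinucleotides that begin with `prefix`.
        return sum(tri[prefix + y] for y in "ACGT")

    # Assumption: probability of the leading dinucleotide is the marginal
    # of the trinucleotide counts.
    prob = dinuc(kmer[:2]) / total
    # Multiply in P(X_j | X_{j-2} X_{j-1}) for every later position.
    for j in range(2, k):
        context = dinuc(kmer[j - 2:j])
        if context == 0:
            return 0.0
        prob *= tri[kmer[j - 2:j + 1]] / context
    # Scale by the number of positions at which the kmer could start.
    return (len(seq) - k + 1) * prob
```

Estimating the model separately from each 3' UTR, as here, lets the expectation absorb that sequence's own base composition and length.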
Under this model, a sequence's expected count of every oligo of order k ≥ 3 is a function of its trinucleotide frequencies and its length:

$$E^i_{X_1 \ldots X_k} = (L_i - k + 1)\, f^i_{X_1 X_2} \prod_{j=3}^{k} b^i_{X_j \mid X_{j-2} X_{j-1}},$$

where $i$ indexes genes, $L_i$ is the sequence length, $f^i_{X_1 X_2}$ is the frequency of the leading dinucleotide, and the $X_j$ are bases. The background conditional frequency of the trinucleotide $X_1 X_2 X_3$ (or, more precisely, the conditional frequency of base $X_3$ given the preceding dinucleotide $X_1 X_2$) was taken as

$$b^i_{X_3 \mid X_1 X_2} = \frac{f^i_{X_1 X_2 X_3}}{\sum_{Y \in \{A,C,G,T\}} f^i_{X_1 X_2 Y}},$$

where $f^i_X$ is the observed frequency of trinucleotide $X$ in the 3' UTR of gene $i$.

3.2.4. Sequence Feature Sets

Observed and expected counts were calculated in every 3' UTR sequence for two sets of sequence features: all hexamers ("all6mers") and all heptamers ("all7mers"). For analysis of known microRNAs (miRNAs), mature sequences for the 288 mouse miRNAs in RFAM release 7.1 were downloaded from miRBase (6, 7). MiRNAs were grouped into 187 families having identical sequence in 5' bases 1-8 (i.e., the "m1" base and the seed region). Two sequence feature sets were derived from these miRNA sequences: "rf71_con6", which comprised for each miRNA seed+m1 sequence the combination of three overlapping reverse-complement hexamers, and "rf71_con7", which was analogously composed of combinations of two overlapping heptamers. For miRNA sequences not starting with U at the 5' end, the exact reverse complement of bases 1-8 was investigated as with the other sequences, but additionally a "t1A" record was added, combining counts of kmers exactly reverse complementary to the miRNA with counts of those having an "A" at the position opposite the m1 base. The observed and expected counts of overlapping hexamers and heptamers were then drawn from the all6mers and all7mers sets, respectively, and added.

Table 1. Compilation of the rf71_con6 miRNA seed match feature set from constituent overlapping hexamers; rf71_con7 was compiled analogously using heptamers.
Mouse microRNA | mature sequence | seed match | constituent hexamers
mmu-miR-99b | 5'-CACCCGUA... | ...UACGGGUG-3' | UACGGG+ACGGGU+CGGGUG
mmu-miR-99b (+t1A) | 5'-CACCCGUA... | ...UACGGGU[G|A]-3' | 2*UACGGG+2*ACGGGU+CGGGUG+CGGGUA
mmu-miR-125a | 5'-UCCCUGAG... | ...CUCAGGGA-3' | CUCAGG+UCAGGG+CAGGGA
mmu-miR-127 | 5'-UCGGAUCC... | ...GGAUCCGA-3' | GGAUCC+GAUCCG+AUCCGA

3.2.5. Tissue Specificity Index score

A Tissue Specificity Index (TSI) score was computed for each gene-tissue pair, measuring both the strength and specificity of the given gene's expression in that tissue. To assess specificity of gene expression in one tissue it was first necessary to quantify the average strength of expression in all others in the set of tissues $T$. To calculate the average expression rank $\bar{R}_i$ of gene $i$, the median of the expression ranks of that gene in all tissues, $r_i = \mathrm{median}\{R_i^j : j \in T\}$, was found. The median of a set of ranks is not necessarily itself a rank, and so these median values were ranked with respect to each other to give the average expression rank $\bar{R}_i$. It was expected that genes with ubiquitously high expression would have large values of $\bar{R}_i$, whereas those with low expression in most tissues would have small $\bar{R}_i$. The use of ranked medians of ranks was intended to preclude genes expressed highly in only a few tissues from being assigned high average expression, such as might occur by taking the mean of intensity values.

The average expression rank $\bar{R}_i$ of a gene across all tissues was subtracted from the gene's specific expression rank in each tissue $j$ to yield the TSI score for that gene-tissue pair: $\mathrm{TSI}_i^j = R_i^j - \bar{R}_i$. Positive scores indicate strong expression specific to that tissue, whereas scores near zero indicate a similar level of expression to that found in other tissues. For example, genes strongly expressed in muscular tissues but not elsewhere, for instance specific myosin isoforms, would be expected to have positive scores.
Conversely, ribosomal proteins, which are strongly expressed in every cell type, would be expected to have scores near zero. In contrast, negative scores are reflective of tissue-specific avoidance of a certain gene's expression.

[Figure 1 schematic: Calculation of Tissue Specificity Index score. Steps shown: rank genes by tissue-specific expression (skeletal muscle; 1 = lowest, 13144 = highest); rank genes by average expression across tissues (or conditions); calculate and rank genes by Tissue Specificity Index score. Example genes: Mylpf (myosin light chain, fast twitch), Rps16 (ribosomal protein S16), Cox7c (cytochrome c oxidase).]

Figure 1. Calculation of tissue average rank and Tissue Specificity Index scores from expression atlas microarray data.

Defining the TSI in this way appeals to an intuitive geometric interpretation. Because across all genes, average ranks were strongly correlated with tissue-specific ranks, it was natural to fit them to a linear model. This can be visualized as a plot of all genes' average ranks versus their ranks in a specific tissue (e.g., skeletal muscle) with a linear fit relating the variables $\bar{R}_i$ and $R_i^j$. In no tissue examined did the fitted linear relationship deviate strongly from a zero-intercept, 45-degree line, and thus the linear model was fixed as $(\beta_0, \beta) = (0, 1, -1)$. As the signed perpendicular distance from a point $x$ to a line $L = \{x : \beta_0 + \beta \cdot x = 0\}$ is $\frac{1}{\lVert \beta \rVert}(\beta \cdot x + \beta_0)$, the distance from the point $(R_i^j, \bar{R}_i)$ representing each gene's expression ranks to the imposed line relating the ranks is $\frac{1}{\sqrt{2}}(R_i^j - \bar{R}_i)$, exactly proportional to the TSI for that gene.
Thus if a slope-one, zero-intercept linear model is considered as the default relationship between each tissue-specific rank and the average rank, the TSI of each gene may be viewed as its signed deviation from that model.

3.2.6. Measurement of Feature Depletion and Enrichment

3.2.6.1. Overview

Sequence features were then individually assessed for depletion and enrichment relative to their tissue-specific expected distributions. The expected distribution of a feature in a given tissue was set by ranking genes by TSI and binning so as to equalize the total expected count of that feature in each bin. Observed feature counts were compared in each bin, and the running difference of expected and observed counts was calculated. This amounted to finding the maximal difference between the empirical cumulative distribution functions of two discrete variables. A background distribution on this statistic was obtained and the empirically-fit KS test was applied to yield two P-values, one for the given sequence feature's enrichment and one for its depletion among genes with high TSI ranks in the given tissue.

3.2.6.2. Binning Strategy

For a given tissue j, genes were ranked by increasing TSI in that tissue, so that genes specifically avoided had small ranks, those neutrally expressed had moderate ranks, and those specifically highly expressed in that tissue had large ranks. Using this ordering, genes were divided into 100 bins such that the summed expected count of the given sequence feature X in each bin was roughly equal (details in Figure 2).

    bin_equal_expcounts(X, E, binNumbers, n_bins=100)
        curBin <- 0
        acc <- 0
        for gene_i = 1 to size(E)
            acc <- acc + E[gene_i]
            binNumbers[gene_i] <- curBin
            if acc > sum(E) / n_bins
                curBin <- curBin + 1
                acc <- 0

Figure 2. Binning algorithm pseudocode. The algorithm scans through genes ordered by TSI, accumulating expected counts of a given feature and placing a bin boundary whenever the accumulator exceeds the equal bin capacity for that feature.
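The TSI ordering and the binning of Figure 2 can be sketched together in Python. This is a minimal illustration, not the code used in the study: the helper names `rank_values` and `tsi_scores` are hypothetical, ties among medians are broken by input order (an assumption), and `bin_equal_expcounts` simply mirrors the pseudocode with zero-based bin numbers.

```python
from statistics import median


def rank_values(values):
    """Ranks 1..n by increasing value; ties broken by original position."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks


def tsi_scores(tissue_ranks, tissue):
    """TSI of every gene in one tissue.

    tissue_ranks[i][j] is the expression rank of gene i within tissue j.
    """
    medians = [median(row) for row in tissue_ranks]
    average_rank = rank_values(medians)  # ranked medians of ranks
    return [row[tissue] - avg for row, avg in zip(tissue_ranks, average_rank)]


def bin_equal_expcounts(expected, n_bins=100):
    """Bin numbers for genes already ordered by increasing TSI, so that each
    bin holds roughly sum(expected) / n_bins expected counts."""
    capacity = sum(expected) / n_bins
    bins, cur_bin, acc = [], 0, 0.0
    for e in expected:
        acc += e
        bins.append(cur_bin)
        if acc > capacity:  # boundary placed once the capacity is exceeded
            cur_bin += 1
            acc = 0.0
    return bins


# Toy example: 4 genes x 3 tissues of within-tissue ranks.
toy_ranks = [[4, 1, 1], [3, 4, 4], [1, 2, 3], [2, 3, 2]]
toy_tsi = tsi_scores(toy_ranks, tissue=0)
toy_bins = bin_equal_expcounts([1, 1, 1, 1, 1, 1], n_bins=3)
```

Because a boundary is placed only after the accumulator strictly exceeds the capacity, fewer than `n_bins` bins may be used, which is consistent with the bins being only roughly equal.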
Observed counts were then summed within each of these bins. As a corrective factor, the summed expected counts in each bin were scaled by the ratio $\sum_i F^i_X / \sum_i E^i_X$ to normalize their totals to equal those of the observed counts $F^i_X$.

3.2.6.3. KS Test Statistics

While this tissue-specific binning induced an approximately uniform distribution on the expected feature counts, the observed counts were free to vary. To obtain the cumulative distribution function over the difference of observed and expected counts, the summed difference within each bin was found and the running sum of those differences taken. From the running sum, two one-sided discrete Kolmogorov-Smirnov (KS) test statistics were calculated: the largest non-negative difference was taken as the enrichment statistic and the largest negative difference was taken as the depletion statistic. Both were obtained for each pair of tissue and sequence feature.

3.2.6.4. Estimation of KS Statistic Background Distribution

For each sequence feature, a set of 10,000 random gene orderings was generated and used to estimate background distributions of the enrichment and depletion statistics. One randomized ordering was generated by taking, for each gene, the tissue-specific rank from a tissue randomly chosen with replacement. The resulting vector of values was sorted and each value was replaced by its sorted rank. This ranked order was then used in the same way as a tissue-specific ranking: a TSI score was calculated between the control and average rankings and was used to order genes, and the binning and calculation of enrichment and depletion statistics were performed as before. This process was repeated 10,000 times per kmer set to yield an empirical background distribution for each of these two statistics for each sequence feature.

3.2.6.5. Application of KS Test

Enrichment and depletion P-values were determined from the empirical background distribution using the method employed by Farh et al. (1).
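The running-sum statistics of Section 3.2.6.3 can be sketched as follows. The function name `ks_statistics` is hypothetical, and two details are assumptions: differences are taken as observed minus rescaled expected counts while scanning from low to high TSI, and the statistics are left in raw count units rather than divided by the total.

```python
from itertools import accumulate


def ks_statistics(observed, expected, bins, n_bins=100):
    """One-sided enrichment and depletion statistics for one feature.

    observed, expected: per-gene counts, genes ordered by increasing TSI.
    bins: bin number of each gene, in the same order.
    """
    obs_bin = [0.0] * n_bins
    exp_bin = [0.0] * n_bins
    for o, e, b in zip(observed, expected, bins):
        obs_bin[b] += o
        exp_bin[b] += e
    # Rescale expected counts so their total equals the observed total.
    total_exp = sum(exp_bin)
    scale = sum(obs_bin) / total_exp if total_exp else 0.0
    diffs = [o - e * scale for o, e in zip(obs_bin, exp_bin)]
    running = list(accumulate(diffs))
    enrichment = max(0.0, max(running))   # largest non-negative excursion
    depletion = max(0.0, -min(running))   # magnitude of largest negative one
    return enrichment, depletion


# Toy example: all observed sites fall in the two highest-TSI bins.
toy_stats = ks_statistics([0, 0, 4, 4], [2, 2, 2, 2], [0, 1, 2, 3], n_bins=4)
```

Applying the same function to each randomized gene ordering would give one draw from the background distribution of each statistic.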
When the test statistic x was greater than the 98th percentile of the background distribution, P-values were instead calculated from the fitted asymptotic KS tail probability $Q = e^{-2nx^2}$. The theoretical and empirical tails were fitted by setting n so that their 98th percentiles were equal: $n \leftarrow -\ln(0.02) / (2 x_{98}^2)$, where $x_{98}$ is the empirical 98th percentile of the test statistic x. The resulting P-values estimate the significance of departure from the null hypotheses under which sequence feature enrichment and depletion are distributed homogeneously across scramblings of the tissue labels.

3.2.6.6. False Positive Analysis

False discovery rate was assessed by generating additional control gene reorderings, thus drawing values from the test statistics' background distributions, and determining the number of feature-tissue pairs called significant among this contrived set. The significance threshold in each dataset was then set to the largest P-value for which the number of expected false positives was less than one per sequence feature.

3.3. Results

3.3.1. Comparison of Expected and Observed Sequence Feature Counts

We evaluated the performance of our 3' UTR background models by comparing the total expected and observed counts of each feature, summed over all genes (Figure 3). There were few outliers among all 4,096 hexamers and all 187 sets of miRNA target heptamer pairs, suggesting the features' expected counts indeed provide a good basis of comparison for estimation of enrichment or depletion.

[Figure 3 plots: Observed vs. expected counts of each sequence feature, summed over 3' UTRs. Panels: all hexamers; microRNA target seed heptamers. x-axis: expected counts.]

Figure 3. A second-order background model adequately captures sequence compositional effects for both hexamers and miRNA target heptamers without overfitting or bias.
Each point represents a sequence feature plotted by its total expected and observed counts in all genes.

3.3.2. Tissue-Specific Index Score Evaluation

We next examined the Tissue-Specific Index scores to verify that they properly reflected tissue-specific expression or avoidance thereof. Our assumption was that the average tissue ranks were correlated with tissue-specific ranks for the majority of genes such that outlier genes were those specifically expressed or avoided in each tissue or cell type (8). In each tissue, we took the Friedman rank-based correlation between the tissue-specific and average rankings (Figure 4). In 57 tissues, a very strong correlation was observed ($0.799 < \rho < 0.921$; $\bar{\rho} = 0.873$, $\sigma_\rho = 0.152$). Four tissues had distinctly lower correlation values: ovary, fertilized egg, testis, and pancreas ($0.688 < \rho < 0.716$). Strikingly, three of these were from reproductive organs. Under hierarchical clustering, these tissues showed no apparent relationship to each other (data not shown), suggesting their shared deviation from the average expression ranking was not simply by virtue of having similar expression profiles to each other.

[Figure 4 plot: Friedman rank correlation between genes' tissue-specific and average ranks, by tissue.]

Figure 4. Average gene rank is highly correlated with tissue-specific expression rank for genes in all 61 tissues, indicating that tissue-specific genes can be detected as outliers from a linear regression model. Four tissues, of which three were germline-related, had distinctly weaker (but still highly significant) correlations, suggesting highly specialized programs of transcription.

Goodness-of-fit analysis indicated that fixed slope-one, zero-intercept lines adequately modeled the relationships between average and tissue-specific ranks in every tissue.
P-values were approximately zero for all tissues except the four noted to have lower correlations; P-values in those tissues were each less than 7.4 × 10^-14 (F-test, 1 and 61 df).

We visually inspected the average-to-tissue-specific ranking relationships for several tissues. In each tissue, most genes were densely clustered around the best-fit line, though with some local differences in variance. Most tissues had a marked cluster of presumably tissue-specific genes in the lower-right corner. We plotted these ranks for three tissues (Figure 5) and performed literature searches to confirm that high TSI scores identify genes with tissue-specific function. TSI scoring remained sensitive even among genes that are only moderately expressed in the tissue, and exhibited high specificity for both highly and moderately expressed genes, assigning low scores to genes ubiquitously expressed or specific to another tissue.

[Figure 5a plots: Examination of tissue-specific index scoring in two tissues (skeletal muscle, cerebellum). Each panel shows rank in the tissue against average rank, with the distribution of TSI scores (x 10^4).]

Name | Description | TSI score | Rank (TSI) | Tissue-specific rank | Comments
Skeletal muscle: high muscle-specific rank; high TSI
Hrc | histidine rich calcium binding protein | 12732 | 13144 | 12971 | Functions in Ca(2+) release during muscle contraction (9)
Trim54 | tripartite motif-containing 54 | 12662 | 13143 | 12803 | Function: myogenic differentiation; disease phenotype: muscle atrophy (10)
Skeletal muscle: high muscle-specific rank; neutral TSI
Ndufa1 | NADH dehydrogenase (ubiquinone) 1 alpha subcomplex, 1 | 1 | 7034 | 13058 | Ubiquitous; mitochondrial electron transport chain
Txnip | thioredoxin interacting protein | 1 | 7032 | 13004 | Ubiquitous; regulates cellular redox state
Skeletal muscle: moderate muscle-specific rank; high TSI
Mpz | Mus musculus myelin protein zero | 6587 | 12983 | 7796 | Myelin component; deficiency: muscle demyelination (11)
Cabc1 | Mus musculus chaperone, ABC1 activity of bc1 complex like | 7081 | 13007 | 7656 | p53/apoptosis; Northern blot shows ubiquitous expression, but enriched in skeletal muscle and heart (12)
Skeletal muscle: moderate muscle-specific rank; neutral TSI
Abcc8 | ATP-binding cassette, subfamily C (CFTR/MRP), member 8 | 0 | 7025 | 5831 | No specificity for muscle; known function in beta cells of pancreatic islets
Gtf2ird2 | GTF2I repeat domain containing 2 | 0 | 7027 | 7282 | Basally transcribed in a wide range of tissues

Figure 5a. Tissue-specific and tissue-average ranks for cerebellum and skeletal muscle, two example tissues in which those ranks were very highly correlated. TSI scoring readily identifies outlier genes with tissue-specific function at high and moderate expression levels. Two genes with highest and two genes with lowest (absolute) skeletal muscle TSI were chosen, first from the top decile of skeletal muscle expression and then from the middle decile.

[Figure 5b plot: Testis. Rank in testis against average rank, with the distribution of TSI scores (x 10^4).]

Figure 5b. Testis was one of four tissues displaying higher deviance from average ranks; however, a cluster of testis-specific genes was still evident and assigned high TSI scores.

TSI scoring identifies genes with specific expression and function in the tissues we examined. Even so, we remain concerned that this method's sensitivity is suboptimal for highly correlated tissues, and propose several improvements (see "Future Directions").

3.3.3. Tissue-Specific Depletion of microRNA Target Sites

Genes with high TSI scores were enriched for tissue-specific function, and many are likely to be required for proper cellular function in the given tissue. Indeed, some of the high-scoring genes identified in muscle (e.g., Hrc, Mpz) have severe deficiency phenotypes and therefore are likely to come under considerable selective pressure to avoid miRNA-mediated knockdown. Consequently, we expected to observe significant depletion among high-scoring genes for seed matches to co-expressed miRNAs. Farh et al. (1) recently measured a similar depletion effect among highly (though not necessarily specifically) expressed genes.

Computational (13, 14) and experimental (15) studies have identified Watson-Crick pairing to the miRNA seed region (bases 2-7) as the primary determinant of specificity in miRNA targeting. We searched 3' UTRs of mouse genes for depleted counts of heptamers reverse complementary to each of 187 mouse miRNA families at these positions as well as at bases 1-7. Selected miRNAs with strong signals for targeting depletion are shown in Figure 6 (non-brain tissues; full set of 181 miRNAs shown in Figure A1). Many miRNAs' strongest depletion effects coincide with their reported tissues of expression, reinforcing the hypothesis (1, 3, 16) that preferentially coexpressed messages avoid accumulation of these target sites. For example, miR-1/-206 and miR-133 are well known to be specifically expressed in muscular tissue (17), and show very strong depletion for targeting among genes specifically expressed in skeletal muscle (P < 7.6 × 10^-12 and P < 5.5 × 10^-10, respectively), heart (P < 5.7 × 10^-6 and P < 9.2 × 10^-5), and to a lesser extent, another muscular tissue, tongue (P < 0.0085 and P < 0.0097).

We likewise found very strong depletion for targeting among specifically brain-transcribed messages by a non-overlapping set of 37 miRNAs, many of which are known to be expressed specifically in the brain (Figure 7).
The strongest pattern of depletion was seen for miR-124a, particularly in frontal cortex (P < 1.2 × 10^-24), olfactory bulb (P < 2.1 × 10^-20), cerebral cortex (P < 2.1 × 10^-9), and cerebellum (P < 1.9 × 10^-20). Highly significant depletion was observed for miR-124a targeting in all other measured tissues of the central and peripheral nervous system. Only one other tissue, testis, was found to be depleted for miR-124 targeting, and only weakly so (P < 0.0026). This may be a false positive or perhaps a signal from weak and as-yet undocumented expression there; either way, it demonstrates a high degree of specificity in estimation of target depletion, at least for highly expressed miRNAs such as miR-124a.

[Figure 6 plot: Selected miRNAs depleted for targeting in non-brain tissues. x-axis: -log10 depletion P-values (one-sided KS).]

Figure 6. Selected miRNAs with significant targeting depletion among genes preferentially expressed in various tissues are shown, manually arranged by their tissues of depletion. Depletion profiles for the full set of miRNAs are shown in Figure A1.
[Figure 7 plot: MicroRNAs depleted for targeting among brain tissues. x-axis: -log10 depletion P-values (one-sided KS).]

Figure 7. 37 miRNAs showed depletion for targeting among genes specifically expressed in the brain.

Some miRNAs have depletion signals that differ sharply between highly related tissues, suggesting that this method can resolve differences in miRNA targeting on a physiologically fine scale. For instance, mammalian microarray (18) and zebrafish in situ experiments (19) showed that miR-138 is expressed in the brain. However, we find significant targeting depletion specifically in the trigeminal and dorsal root ganglia (P < 2.8 × 10^-6 and P < 1.7 × 10^-8) and not in any other nervous system tissues, suggesting that miR-138 plays a specialized role in these tissues.

3.3.3.1. Weak Depletion Signals also Coincide with microRNA Expression

Some miRNAs had weak depletion signals (0.1 > P > 0.01) that nevertheless matched their reported expression patterns. Returning to miR-138, we observed a signal (P < 0.05) in B220+ B cells, which undergo maturation in bone marrow, where miR-138 is weakly expressed in mammals (19).
Similarly, we found no tissues significantly depleted for miR-217 targeting, but obtained weak signals for the following: pancreas (P < 0.04), large intestine (P < 0.08), salivary gland (P < 0.04), and frontal cortex (P < 0.03). Although not reaching statistical significance, these agreed with hybridization evidence showing miR-217 expression in the brain, spinal cord, eyes, and pancreas (specifically in exocrine cell populations). Furthermore, miR-217 was only weakly detected by in situ hybridization, suggesting high sensitivity even to subtle depletion effects such as those for targeting by tissue-specific miRNAs present at low copy numbers.

Another miRNA with a weak depletion signal was miR-134, which was recently implicated in neuronal development in the rat hippocampus (20). In fact, we observed a weak depletion signal for miR-134 in hippocampus (P < 0.0196), second only to blastocysts (P < 0.0112). The weak depletion seen for miR-134 targeting could be a consequence of its relatively low expression level (compared to miR-124) or could reflect a limited potential to target genes due to its specific localization in the synapto-dendritic compartment.

3.3.3.2. Signal-to-Noise Estimation

We drew additional samples from the background depletion distribution of each miRNA target heptamer pair in order to estimate the signal-to-noise ratio for the depletion analysis. MicroRNA seed matches significantly depleted among these background samples were deemed false positives. As Figure 8 shows, the depletion analysis achieved positive signal over noise for all P-value cutoffs less than 0.32 (10^-0.5).

[Figure 8 plot: Estimation of false discovery rate for miRNA-target depletion analysis. x-axis: significance cutoff (-log P-value).]

Figure 8. The number of significant tissue-heptamer pairs, constituting positive predictions, is shown along with the number of control instances deemed positive for each P-value cutoff.
At a cutoff of 10^-1, signal:noise is 1.53; at 10^-2, signal:noise is 3.74. 10,000 shuffled controls were drawn from the background distribution and the number of false positives was normalized by dividing by the ratio of shuffles drawn to the number of tissues tested (10,000 / 61).

We conclude that a P-value cutoff of 0.001 was, if anything, conservative. Erroneous results, that is, significant depletion of targeting not coinciding with miRNA expression, are more likely to arise from basic limitations of the method (e.g., its inability to resolve non-tissue-specific effects) than from sampling variability. The choice of P-value cutoff remains somewhat arbitrary, and there are few, if any, miRNAs with precisely known expression patterns that could be used to find the optimal cutoff.

3.3.3.3. Comparison to Experimentally Determined MicroRNA Expression

We systematically compared our estimates of miRNAs' targeting depletion across tissues with their expression patterns as determined by Wienholds et al. (19). In their study, chemically modified miRNA-targeting probes were hybridized in situ to zebrafish embryo slices. The method attained sufficiently high binding specificity to discern between miRNAs differing in sequence by as little as one base, revealing their patterns of expression to the sub-tissue level in some cases.

Note that by comparing depletion value estimates to expression levels measured in zebrafish embryos, we make several assumptions. First, there may not in all cases be a direct correspondence between miRNA expression (even if perfectly tissue-specific) and depletion of seed-match targeting. To bridge this gap, we first accept the growing consensus that seed matching is the primary means of metazoan miRNA targeting, and further assume that no subset of miRNAs systematically targets messages by alternate or complementary means.
Studies have demonstrated 3' pairing as an additional targeting mechanism in specific cases (15), although the extent of such targeting may be limited (13, 21). Beyond that, there are several difficulties. First, we used gene expression data from mostly mature tissues, whereas the hybridization study used embryonic samples. The authors commented, however, that miRNAs' primary role may generally be "not in tissue fate establishment but in ... maintenance of tissue identity". If so, then their expression levels may be comparable between late embryonic and mature stages. Physiological differences between zebrafish and mammals are an additional complication, making the comparison most relevant for ancient miRNAs predating specialized mammalian physiological features. Nevertheless, the zebrafish hybridization results generally agreed well with microarray data for mammalian miRNAs and tissues, while resolving tissue-specific differences on a much finer basis, making it the best dataset presently available with which to verify our results. Table 2 shows a comparison between our targeting depletion estimates and the expression data for miRNAs found to be highly tissue-specific in the hybridization study.

[Table 2: comparison of predicted targeting depletion with reported expression for tissue-specific miRNAs; table contents not recoverable from the extracted text.]

In only seven of 49 cases did the predicted targeting depletion contradict the zebrafish in situ and mammalian microarray expression measurements (miR-34a, miR-103/-107, miR-199a*, miR-184, miR-9*, miR-203, miR-204). In at least one of these cases, miR-34a, we did not detect depletion in the tissue of strongest reported expression (brain), but did detect it in lung (P < 1.9 x 10^-3), another tissue reported (18) to contain that miRNA, though at lower abundance. Several other cases were ambiguous, mostly for miRNAs expressed in zebrafish tissues not represented on the GNF expression atlas and for which no mammalian microarray measurements were available. Even if all ambiguous cases are regarded as missed predictions, miRNAs' patterns of target depletion show remarkable agreement with their tissue-specific expression. That the depletion statistics were estimated using gene expression in adult tissues reinforces the hypothesis that miRNAs activated during embryonic development continue to be expressed and remain in effect, maintaining tissue identity in the adult organism.

3.3.3.4. Comparison to Results of Farh et al.

The present study was motivated in part by the notable finding of Farh and colleagues (1) that mammalian genes have evolved under considerable pressure to avoid targeting by coexpressed miRNAs. While we searched for and report a similar effect, our approach differs in several ways. First, we measure depletion of targeting by a particular miRNA as the cumulative number of target sites below expectation, rather than the proportion of targeted messages below expectation.
When measuring enrichment of targeting, this allows us to capture extra signal when a single miRNA targets a message in several places. Some evidence suggests that miRNA targeting is cooperative in this fashion (15). When measuring targeting depletion, using counts allows us to better resolve messages that under neutral evolution would be expected to have two or three seed matches given their UTR length and composition, but have avoided all of them. Another consequence of this difference is that our estimation of the background targeting level is deterministic, allowing us to calculate the depletion statistic without introducing sampling variance. Lastly, by using sequence feature counts directly rather than imposing a Poisson event model, we avoid making an independence assumption between potentially overlapping sequence features of interest.

The other major difference between our approach and that of Farh et al. is our use of the Tissue-Specificity Index (TSI), which may be more effective in assigning high rank to genes that come under tissue-specific pressure to avoid seed matches to coexpressed miRNAs. Despite these differences, we arrive at a set of depletion patterns that are, on the whole, very similar.

There are a few notable differences, however. Among them is miR-125, the mammalian lin-4 ortholog found in the brain. We predict significant depletion for miR-125 targeting in twelve CNS tissues (see Figure 7), most strongly in amygdala (P < 1.1xl10I ") and lower spinal cord (2.3 x 10^-9). In no other tissues did we find a significant depletion signal, and in only one (embryo day 10.5, P < 0.05) did we detect any signal at all. In contrast, Farh et al. report significant depletion of miR-125 targeting in three embryonic stages (blastocyst, 8.5d, 9.5d) but not in any of the adult CNS tissues.
In situ staining of 9.5-day-old mouse embryos reveals that miR-125b is specifically expressed at the midbrain-hindbrain region (22), supporting a role in neuronal differentiation. However, experimental studies have found miR-125 to be highly abundant in adult brain tissue, by cloning frequency more abundant than any other brain miRNA (23). A microarray study (18) directly compared the levels of miR-125a and miR-125b in adult brain tissues and various embryo stages and found both miRNAs to be present at much higher levels in the former, placing them (along with miR-222) in a "late brain" cluster of miRNAs.

Finally, we compared our depletion results with the northern-blot assays for miRNA expression level reported by Farh et al. (1). In Figure S1 of that publication, miR-142-3p expression level is shown to be highest in CD8+ T cells, followed by CD4+ T cells and B cells, in that order. Our estimates of targeting depletion recapitulate the order of expression in these three tissues exactly (P < 2.7 x 10^-6, 1.2 x 10^-5, and 4.39 x 10^-3, respectively). We also find statistically significant targeting depletion of miR-124 and miR-7 in exactly the tissues showing expression of those miRNAs in their Figure 4D.

3.3.4. Tissue-Specific Enrichment for MicroRNA Target Sites

MicroRNAs collectively mediate repression of thousands of target genes (13). While some of these targets are strongly downregulated with switch-like phenotypic effects, many more may simply be downregulated as a safety mechanism. Stark and coworkers (3) recently proposed that miRNAs serve to reinforce the fidelity of tissues' transcriptional programs and thus help to maintain tissue identity more than to actually determine it, consistent with the observation that many tissue-specific miRNAs are expressed after cell fate is determined (19).
Under this model, genes that should not be expressed in a given tissue context come under positive selective pressure to accumulate target sites for miRNAs active in that context, ensuring attenuation of any leaky transcription. Stark et al. measured tissue-specific targeting enrichment among Drosophila genes with annotated functional categories. Genes having epidermal, tracheal, or digestive function were enriched for target sites of miR-124, which is highly expressed in brain tissues, where expression of such genes could have a severe phenotype. Similarly, genes with functions in ventral sensory, PNS, and digestive tissues were found to be enriched for targets of miR-1, which is expressed in muscle tissue.

We repeated the same analysis that we used to measure tissue-specific depletion of miRNA targeting, this time inverting the direction of the KS test in order to measure targeting enrichment. As for depletion, we find a significant signal-to-noise ratio (3.48 at P-value cutoff of 0.01; Figure 9) for enrichment of targeting, suggesting similarly widespread impact on the mammalian transcriptome.

[Figure 9 plot: "Estimation of false discovery rate for miRNA-target enrichment analysis"; x-axis: significance cutoff (-log10 P-value).]

Figure 9. False discovery was estimated in the same manner as for target depletion. Enrichment analysis has estimated signal:noise of 1.56 at P-value cutoff of 10^-1, and 3.48 at 10^-2.

We find 50 miRNAs with significant target enrichment among genes highly expressed in 3 or more tissues, indicating a pattern of mutual exclusivity between miRNA and target expression as proposed by Stark et al. Clusters of miRNAs enriched for targets in the brain, embryo, and epidermal tissues are evident (Figure 10).
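The false-discovery normalization described in the Figure 9 caption can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code used for the analysis: the function names and toy P-values are hypothetical, and signal:noise is read here as the number of observed hits divided by the normalized number of shuffled-control hits.

```python
def count_significant(p_values, cutoff):
    """Number of P-values at or below the significance cutoff."""
    return sum(1 for p in p_values if p <= cutoff)


def signal_to_noise(observed_p, shuffled_p, cutoff, n_tests):
    """Observed hits divided by the expected number of false positives.

    Assumption: false positives counted among the shuffled controls are
    divided by (number of shuffles / number of tests), mirroring the
    10,000 / 61 normalization quoted in the text.
    """
    scale = len(shuffled_p) / n_tests
    expected_false = count_significant(shuffled_p, cutoff) / scale
    if expected_false == 0:
        return float("inf")  # no false positives seen at this cutoff
    return count_significant(observed_p, cutoff) / expected_false


# Hypothetical toy data: 4 real tests and 8 shuffled controls.
observed = [0.001, 0.005, 0.02, 0.3]
shuffled = [0.004, 0.009, 0.2, 0.5, 0.6, 0.7, 0.8, 0.9]
ratio = signal_to_noise(observed, shuffled, cutoff=0.01, n_tests=4)
print(ratio)
```

Evaluating the ratio over a range of cutoffs gives the kind of curve summarized in Figure 9, from which a cutoff with an acceptable false-discovery level can be chosen.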
[Figure 10 plot: "Tissues complementary to microRNA expression display enrichment for targeting"; labeled by miRNA family; scale: -log10 enrichment P-values (one-sided KS).]

Figure 10. MicroRNAs show tissue-specific patterns of target enrichment generally mutually exclusive with their expression contexts. Clustering reveals groups of miRNAs enriched for preferentially brain, epidermal, and embryonic targets. Enrichment values for the full set of mouse miRNAs are shown in Figure A2.

We found considerable overlap with the enrichment results reported by Stark et al. for two of the three fly miRNAs having mouse orthologs. The muscle-associated miR-1 was enriched for targeting in olfactory sensory tissues of the mouse, similar to its enrichment in fly genes with PNS and sensory functions. For miR-124, we observed very strong targeting enrichment in gut, epidermal, and tracheal tissues, again overlapping the functional categories enriched in fly. Lastly, we noted a cluster of seven miRNA families that show temporally-phased enrichment for genes preferentially expressed in the developing mouse embryo. The strongest of these, let-7, is known to control stage-specific development in C. elegans (24), and is conserved to mammals, in which it has been postulated to mediate differentiation during development (25).

3.4. Conclusion and Future Directions

We have presented a statistical method to measure motif enrichment or depletion among genes with tissue-specific expression. We applied this to miRNA targeting of 3' UTRs, reaffirming the recent reports (1, 3) that miRNA target sites are depleted among genes preferentially coexpressed in the same tissue. Although we did not attempt experimental validation of our results, we note that they recapitulate results from in situ hybridization and microarray studies (18, 19, 22) with great accuracy. Numerous miRNAs for which we predict tissue-specific targeting depletion have not been fully characterized and may be subjects of novel tissue-specificity predictions. Lastly, we showed that tissue-specific enrichment for miRNA targeting is at least as widespread as depletion, possibly helping cells "clean up" after aberrant transcription, thus supporting but not determining tissue identity.

We intend to implement several methodological improvements and apply this method to several new problems. First, we are concerned that the TSI scoring method is not robust to the introduction of numerous highly-correlated tissues. The TSI score is obtained by comparing a gene's tissue-specific expression rank to its median rank across tissues. Introducing many similar tissues inflates the median rank for genes highly expressed among them, degrading the method's sensitivity to all such tissues and also causing unwanted effects on the predictions in other tissues. To correct for this, we will employ a clustering-based distance metric between tissues and use it to recalculate a different average rank from the perspective of each tissue that downweights expression ranks in highly correlated tissues. This will allow us to obtain greater resolution on tissues that only differ in expression of a subset of genes, such as embryonic development or immune cell differentiation stages.
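The median-inflation problem described above can be illustrated with a toy example. The sketch below is only an assumption-laden stand-in: a plain difference between a tissue rank and the median rank substitutes for the actual TSI score, ranks are hypothetical percentiles (1.0 = most highly expressed), and counting each hand-picked group of correlated tissues once is a crude substitute for the clustering-based weighting proposed in the text.

```python
from statistics import median


def median_excess(ranks, tissue):
    """Rank in one tissue minus the gene's median rank across all tissues."""
    return ranks[tissue] - median(ranks.values())


def weighted_excess(ranks, tissue, groups):
    """Same comparison, but each group of correlated tissues counts once."""
    group_means = [sum(ranks[t] for t in g) / len(g) for g in groups]
    return ranks[tissue] - sum(group_means) / len(group_means)


# Hypothetical percentile ranks for one brain-specific gene.
base = {"liver": 0.20, "kidney": 0.30, "heart": 0.25,
        "muscle": 0.30, "amygdala": 0.90}
# Five additional brain-like tissues (hypothetical values).
brain_like = {"cortex": 0.88, "hippocampus": 0.91, "cerebellum": 0.87,
              "striatum": 0.90, "hypothalamus": 0.89}
crowded = {**base, **brain_like}

# Hand-assigned tissue groups standing in for a clustering step.
groups = [["liver"], ["kidney"], ["heart"], ["muscle"],
          ["amygdala", *brain_like]]

before = median_excess(base, "amygdala")        # few brain tissues
after = median_excess(crowded, "amygdala")      # many correlated tissues
corrected = weighted_excess(crowded, "amygdala", groups)
print(before, after, corrected)
```

In this toy case the amygdala signal nearly vanishes once the brain-like tissues are added, and the group-weighted average restores most of it, which is the behavior the proposed correction is meant to achieve.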
Having revised the average rank measure, we will turn to a more empirically-motivated fit to determine the TSI score, taking into account the local variability around the linear model.

With these enhancements in place, we are eager to pursue several follow-up investigations. First, we will be able to profile targeting depletion, and thus predict miRNA activity, in much larger tissue/condition panels. Because this method is nonparametric in the actual microarray data, relying only on ranks, we are free to integrate heterogeneous measurements from different labs, reports, and even array platforms. Second, we believe that this method can be readily applied to discover tissue-specific miRNAs that may be difficult to detect through experimental means because of low abundance or highly-specialized expression. Lastly, we expect this method will be applicable to evaluation or discovery of other classes of motifs such as splicing regulatory elements.

The primary weakness of this method is its inability to discover motifs such as ubiquitous miRNAs with uniform depletion or enrichment across tissues. Yet this is also its greatest strength, as it is what confers discriminative power between different tissues. Tissue-specific regulatory effects of cellular processes such as miRNA-mediated silencing or alternative splicing are generally more difficult to study by experimental means than are ubiquitous effects, making computational approaches such as the present work a natural complement.

3.5. References Cited

1. Farh KK, Grimson A, Jan C, Lewis BP, Johnston WK, Lim LP, Burge CB, Bartel DP. The Widespread Impact of Mammalian MicroRNAs on mRNA Repression and Evolution. Science. 2005 Dec 16;310(5755):1817-21.
2. Lindquist S, Craig EA. The Heat-Shock Proteins. Annu Rev Genet. 1988;22:631-77.
3. Stark A, Brennecke J, Bushati N, Russell RB, Cohen SM. Animal MicroRNAs Confer Robustness to Gene Expression and have a Significant Impact on 3'UTR Evolution. Cell. 2005 Dec 16;123(6):1133-46.
4. Su AI, Wiltshire T, Batalov S, Lapp H, Ching KA, Block D, Zhang J, Soden R, Hayakawa M, Kreiman G, Cooke MP, Walker JR, Hogenesch JB. A Gene Atlas of the Mouse and Human Protein-Encoding Transcriptomes. Proc Natl Acad Sci U S A. 2004 Apr 20;101(16):6062-7.
5. Hinrichs AS, Karolchik D, Baertsch R, Barber GP, Bejerano G, Clawson H, Diekhans M, Furey TS, Harte RA, Hsu F, Hillman-Jackson J, Kuhn RM, Pedersen JS, Pohl A, Raney BJ, Rosenbloom KR, Siepel A, Smith KE, Sugnet CW, Sultan-Qurraie A, Thomas DJ, Trumbower H, Weber RJ, Weirauch M, Zweig AS, Haussler D, Kent WJ. The UCSC Genome Browser Database: Update 2006. Nucleic Acids Res. 2006 Jan 1;34(Database issue):D590-8.
6. Griffiths-Jones S, Grocock RJ, van Dongen S, Bateman A, Enright AJ. MiRBase: MicroRNA Sequences, Targets and Gene Nomenclature. Nucleic Acids Res. 2006 Jan 1;34(Database issue):D140-4.
7. Griffiths-Jones S. The microRNA Registry. Nucleic Acids Res. 2004 Jan 1;32(Database issue):D109-11.
8. Velculescu VE, Madden SL, Zhang L, Lash AE, Yu J, Rago C, Lal A, Wang CJ, Beaudry GA, Ciriello KM, Cook BP, Dufault MR, Ferguson AT, Gao Y, He TC, Hermeking H, Hiraldo SK, Hwang PM, Lopez MA, Luderer HF, Mathews B, Petroziello JM, Polyak K, Zawel L, Kinzler KW. Analysis of Human Transcriptomes. Nat Genet. 1999 Dec;23(4):387-8.
9. Hong S, Kim TW, Choi I, Woo JM, Oh J, Park WJ, Kim do H, Cho C. Complementary DNA Cloning, Genomic Characterization and Expression Analysis of a Mammalian Gene Encoding Histidine-Rich Calcium Binding Protein. Biochim Biophys Acta. 2005 Mar 10;1727(3):188-96.
10. Meroni G, Diez-Roux G. TRIM/RBCC, a Novel Class of 'Single Protein RING Finger' E3 Ubiquitin Ligases. Bioessays. 2005 Nov;27(11):1147-57.
11. Frei R, Motzing S, Kinkelin I, Schachner M, Koltzenburg M, Martini R. Loss of Distal Axons and Sensory Merkel Cells and Features Indicative of Muscle Denervation in Hindlimbs of P0-Deficient Mice. J Neurosci. 1999 Jul 15;19(14):6058-67.
12. Iiizumi M, Arakawa H, Mori T, Ando A, Nakamura Y. Isolation of a Novel Gene, CABC1, Encoding a Mitochondrial Protein that is Highly Homologous to Yeast Activity of bc1 Complex. Cancer Res. 2002 Mar 1;62(5):1246-50.
13. Lewis BP, Burge CB, Bartel DP. Conserved Seed Pairing, often Flanked by Adenosines, Indicates that Thousands of Human Genes are microRNA Targets. Cell. 2005 Jan 14;120(1):15-20.
14. Lewis BP, Shih IH, Jones-Rhoades MW, Bartel DP, Burge CB. Prediction of Mammalian microRNA Targets. Cell. 2003 Dec 26;115(7):787-98.
15. Brennecke J, Stark A, Russell RB, Cohen SM. Principles of microRNA-Target Recognition. PLoS Biol. 2005 Mar;3(3):e85.
16. Lai EC. MicroRNAs: Runts of the Genome Assert Themselves. Curr Biol. 2003 Dec 2;13(23):R925-36.
17. Lagos-Quintana M, Rauhut R, Yalcin A, Meyer J, Lendeckel W, Tuschl T. Identification of Tissue-Specific microRNAs from Mouse. Curr Biol. 2002 Apr 30;12(9):735-9.
18. Thomson JM, Parker J, Perou CM, Hammond SM. A Custom Microarray Platform for Analysis of microRNA Gene Expression. Nat Methods. 2004 Oct;1(1):47-53.
19. Wienholds E, Kloosterman WP, Miska E, Alvarez-Saavedra E, Berezikov E, de Bruijn E, Horvitz HR, Kauppinen S, Plasterk RH. MicroRNA Expression in Zebrafish Embryonic Development. Science. 2005 Jul 8;309(5732):310-1.
20. Schratt GM, Tuebing F, Nigh EA, Kane CG, Sabatini ME, Kiebler M, Greenberg ME. A Brain-Specific microRNA Regulates Dendritic Spine Development. Nature. 2006 Jan 19;439(7074):283-9.
21. Lai EC. MiRNAs: Whys and Wherefores of miRNA-Mediated Regulation. Curr Biol. 2005 Jun 21;15(12):R458-60.
22. Kloosterman WP, Wienholds E, de Bruijn E, Kauppinen S, Plasterk RH. In Situ Detection of miRNAs in Animal Embryos using LNA-Modified Oligonucleotide Probes. Nat Methods. 2006 Jan;3(1):27-9.
23. Kim J, Krichevsky A, Grad Y, Hayes GD, Kosik KS, Church GM, Ruvkun G. Identification of Many microRNAs that Copurify with Polyribosomes in Mammalian Neurons. Proc Natl Acad Sci U S A. 2004 Jan 6;101(1):360-5.
24. Reinhart BJ, Slack FJ, Basson M, Pasquinelli AE, Bettinger JC, Rougvie AE, Horvitz HR, Ruvkun G. The 21-Nucleotide Let-7 RNA Regulates Developmental Timing in Caenorhabditis elegans. Nature. 2000 Feb 24;403(6772):901-6.
25. Pasquinelli AE, Reinhart BJ, Slack F, Martindale MQ, Kuroda MI, Maller B, Hayward DC, Ball EE, Degnan B, Muller P, Spring J, Srinivasan A, Fishman M, Finnerty J, Corbo J, Levine M, Leahy P, Davidson E, Ruvkun G. Conservation of the Sequence and Temporal Expression of Let-7 Heterochronic Regulatory RNA. Nature. 2000 Nov 2;408(6808):86-9.

Table A1. Genomic loci of 107 ~23 nt sRNAs cloned by Lee et al. 97 sequences map to twelve distinct clusters, each on separate genomic scaffolds. Nearly all sRNA clusters are oriented antisense to overlapping or nearby predicted genes.

sRNA # Sequence (forward) Diffs to genome Coordinates Overlapping / nearby gene ID and orientation Cluster 0 (scaffold 8254028) A->T 8254028 (22460-22482) + 8254028 (22804-22827) + C->T 8254028 (22820-22842) + AS 1038 AS 1038 TAGAATAATTATTTAATTGCTCA A->C 8254028 (23191-23213) + 8254028 (23423-23445) + TATTTTTTAGAATAAATCCTATAT T->A 8254028 (24685-24708) + AS 1038 0-0 TCATAAGGTATAGCTTCTTAGTA 0-1 TTCGAAGATTTACCCTTACTATAT 0-2 TACTATATTTGTGCGCTCTGTAC 0-3 TGTGACAATCTATTTATCATTAT 0-4 0-5 Cluster 1 (scaffold 8254557) 1-0 TTATCTCGAAGTTTTTCTCTTTGT AS 1038 AS 1038 AS 1038 8254557 (44124-44147) - AS 21693 1-1 1-2 1-3 TCTAGTTATGTCTTTATCTCGAAT T->G 8254557 (44137-44160) - AS 21693 TAATTCGCTTAATTTTGCTTTTAT T->A 8254557 (44194-44217) - AS 21693 TATTCACCATCAATCAGTTAGAT 8254557 (44638-44660) - AS 21693 1-4 TAAACGTTCTTCTTGCATATCGTT 8254557 (45139-45162) - AS 21693 1-5 TAACGAAGAGTATTTTCTCTTGT 8254557 (45549-45571) - AS 21693 1-6 CTGTTGGATCTGTCAATTCTTTTA 8254557 (45902-45925) - AS 21693 1-7 TAATCTCAGAGCTGTTCTAGATTT 8254557 (46576-46599) - AS 21693 8254572 (1855-1877) 8254572 (7970-7992) - AS 23309 (> 1 kb away) AS 23311 8254597 (5178-5201) + AS 26106 (> 1 kb away) 8254597
(5202-5224) + AS 26106 (> 1 kb away) Cluster 2 (scaffold 8254572) 2-0 TTCAATCCCAACTATTCGTTCAA 2-1 TAAATTATTTCATTATATTTAAT Cluster 3 (scaffold 8254597) 3-0 3-1 TTATTATATTCACCTTTATTATAT TAATTACTAAACCTATTTGATTT 3-2 GTAGCCTCTCTTTAAAAGCACGC G->A 8254597 (5304-5326) + AS 26106 (> I kb away) 3-3 TCTTTGATATCTTTAATATTTTT T->A 8254597 (5432-5454) + AS 26106 (> 1 kb away) 3-4 TAAACCAAAGCTTTCATCTAGCT 8254597 (5500-5522) + AS 26106 (> 1 kb away) Cluster 4 (scaffold 8254600) 4-0 TGCATACTATTAAGCTTATCCAT 8254600 (158620-158642) + AS 4800 4-1 TCTATAATTAATAGTTCTACAGAT T->A 8254600 (159859-159882)+ AS 4800 4-2 TCAACATGGTTTAAGTCTTCGAT T->A 8254600 (160455-160477) + AS 4800 4-3 TTGTCAATATTTTGTTGAATAAT 8254600 (160479-160501) + AS 4800 4-4 TTCGTAGTTCTATCCCAGTACGT 8254600 (162017-162039) + AS 4801 4-5 TCTACTATTGTTATTAGTTGTTT T->A T->C 8254600 (162447-162469) + AS 4801 4-6 TATATTTATGCTATTTTAAATTGC C->T 8254600 (162919-162942) + AS 4801 4-7 TCAATTGTTGTATTTATTATCGT 8254600 (165328-165350) + AS 4802 4-8 TACAAGGTCAATGCTTGGTTTTT 8254600 (165458-165480) + AS 4802 4-9 TCATAAAGCATAATATTTTATAAT 8254600 (166559-166582) + AS 4802 4-10 TATATCCATGTATAGTAGTTAAT 8254600 (166704-166726) + AS 4802 4-11 TCTATATCAATATGAGATTATAT 8254600 (167186-167208) + AS 4802 TGAAATACTTCCTATTTTTTTAA 8254617 (879961-879983)- S 6834, AS 6835 (both >1kb away) TATGAAATACTTCCTATTTTTTT 8254617 (879963-879985)- T->G T->A Cluster S(scaffold 8254617) S 6834, AS 6835 (both >1kb TTCAAAGAAGTGATATTCCCAAT A->T 8254617 (880452-880474) - TAACTTTTTAAAGTATAAAGTTGT T->G 8254617 (880659-880682) - 109 away) S 6834, AS 6835 (both >1kb away) S 6834, AS 6835 (both >1kb away) Cluster 6 (scaffold 8254638) 6-0 TCTTATTCTAATTTAAGATGAAAT 6-1 TCAATTGTTGTATTTATTATCGT 6-2 TACAAGGTCAATGCTTGGTTTTT TCCAGGTATTTCATCATCATCTTT 6-3 8254638 (961835-961858) + AS 8861 8254638 (962329-962351) + AS 8861, 8862 T->G 8254638 (962492-962514) + AS 8862 T->G 8254638 (963121-963144) + AS 8862 AS 8862 T->A 6-4 TCATAAAGCATAATATTTTATAAT 8254638 (963617-963640) + 6-5 6-6 
TACGGTTCTAACACATATCCGTGT 8254638 (963749-963772) + AS 8862 8254638 (964060-964083) + AS 8862 8254638 (964376-964398) + AS 8862 TGGTGTATAATCAAAATCTTCACC TCrGTTCTCATCTAATTCAATCT C->G, T>C AS 8862 TCAAATCTTATTCTTATTGAGATT 6-9 6-10 6-11 TATATAATACATCCACTTGTTATTA C->T, A>G 8254638 (964747-964771) + AS 8862 TATAATACATCCACTTGTTATTGT C->T, T->C 8254638 (964749-964772) + AS 8862 TCTATATCAATATGAGATTATAT T->A 8254638 (965183-965205) + AS 8862 6-12 TAATTACTGTGTTTTTTCACTATT T->A 8254638 (965668-965691) + AS 8862 6-13 6-14 6-15 TACTGTGTTTTTTCACTATAATT T->A 8254638 (965672-965694) + AS 8862 TAAATTCTATGCTTAGAAGTTCTT T->A 8254638 (965990-966013) + AS 8863 TATAGTACATTTTTCCAAAATGAT T->A 8254638 (967380-967403) + AS 8863 Cluster 7 (scaffold 8254659) 7-0 TGAAGTATGTATTCTTCGATTTT 8254659 (1290365-1290387) - AS 11387 7-1 TGAATATTCTAATCATAAATATAA A->T 8254659 (1290401-1290424) - AS 11387 7-2 TCAATTTCATTTAATTTATTGAAA A->T 8254659 (1290604-1290627) - AS 11387 7-3 TAATGGATTGCATTTCCCAAATT T->A 8254659 (1290681-1290703)- AS 11387 7-4 TATGACTACATACAATCTTCATT T->C 8254659 (1290918-1290940) - AS 11387 7-5 TAGTGTTTTATGACTACATACAAT 8254659 (1290925-1290948) - AS 11387 7-6 ATTGGAATATTTTTTGTTTCCTT 8254659 (1291024-1291046) - AS 11387 7-7 TTCATACTTAATTAACAGTTTTA 8254659 (1291047-1291069) - AS 11387 7-8 GTATATTCCCATGTATAGTGGTTT 8254659 (1291207-1291230) - AS 11387 AS 11387 G,T->A 7-9 TATCTGTGTCATGATTCATTCTAT 8254659 (1291294-1291317) - 7-10 TCTTAGTTTTCTTAGCTATCTGT 8254659 (1291311-1291333)- AS 11387 7-11 TGTTGTACCCTGATCGTTATCAT 8254659 (1291957-1291979)- AS 11387 Cluster 8 (scaffold 8254678) 8-0 TATTTTCATTTAACAATCATTCAT 8254678 (17318-17341) - AS 14809 8-1 TCATATTTATCAAATTCGGTATT 8254678 (18513-18535)- AS 14810 8-2 TAAGAATATTTAAATCATATTTAT 8254678 (18526-18549)- AS 14810 8-3 TAAAAAGGTATTTATTCATCTTT 8254678 (19121-19143) - AS 14810 8-4 TTCAAGAACATTCCTATTGGTTT 8254678 (19498-19520) - AS 14811 8-5 TATAACTATTGCTTAGCATTGAT 8254678 (19527-19549) - AS 14811 8-6 TAAGAATCTTTTTTGTTTCTTCAT 8254678 (19780-19803) - 
AS 14811 8-7 TGAATATTGTTAATCGTCTTTGCT 8254678 (19818-19841) - AS 14811 8-8 TTAGACAATATAATTTGTCAAGAA 8254678 (20043-20066) - AS 14811 8-9 TCTACAGCTATAGGACAACTAATT 8254678 (20332-20355) - AS 14811 8254697 (272579-272602) - AS 16336 8254697 (273688-273710) - AS 16336 8254697 (274888-274911) - AS 16337 8254697 (275109-275132) - AS 16337 8254697 (275779-275802) - AS 16337 8254697 (276041-276063) - AS 16337 8254697 (276278-276302) - AS 16337 8254697 (276439-276461) - AS 16337 Cluster 9 (scaffold 8254697) 9-0 TATGTTACGTTCATAGTTCCAGCA 9-1 TATATAGTTCAATAAACTATCAC 9-2 TGAACAAAGATTAACAGTTCAATT 9-3 9-4 9-5 9-6 9-7 TTAAAATCCAAAACCTTAATTTTT C->T T->A TATGCAAATTCTTTATATGTCTAA TGAAAATCTAAAAGATTAAAATT TAATATTTTATTTAAAAATATTTAT T->A T->A TACTTTCCCAAATTATTTCTGAT 110 TCTTACAATGATTAATGATTTGT 9-8 Cluster 10 (scaffold 8254822) 8254697 (276577-276599)- AS 16337 10-0 TCATCTTCTGTAAAAGATAGTAT 10-1 TCTAGATTTCCTTGTTTTCTTGT 8254822 (133383-133405) + 8254822 (134096-134118) + AS 14971 10-2 TACTAATTTATTCGCATAAATAAA 10-3 TAATCTCAGAGCTGTTCTAGATTT 10-4 TATTTAACGTATTGTTGTATTTTT 8254822 (135210-135233) + AS 14971 10-5 TACTAAACAAGTCATAAAATTAGT T->C 8254822 (136262-136285) + AS 14971 10-6 TCTTAAATTCTCTTATTTTTCTT T->A 8254822 (136673-136695) + AS 14971 10-7 TCACGAAGATTAAATTTTTGCAT 8254822 (136729-136751) + AS 14971 10-8 TATAGTCTGAATAATCTTCTAAAT T->G T->A 8254822 (136817-136840) + AS 14971 10-9 TATCGTTATGTTTGCTCATTTAT T->A 8254822 (137018-137040) + 10-10 TTCGCCAATATTTTCCATTGCGAT T->A 8254822 (137401-137424) + AS 14971, 14972 AS 14971, 14972 A->T 8254823 (262686-262708) - AS 15059 A->G 8254823 (262710-262732) - AS 15059 8254822 (134516-134539) + T->G 8254822 (134571-134594) + AS 14971 AS 14971 AS 14971 Cluster 11 (scaffold 8254823) 11-0 11-1 TACTATTATTATTTCCCCTTATA TACTTTTGAGGTTACTCTTGAGA Non-clustered sRNAks 12-0 TAATATTTTATTTAAAAATATTTAT 8254010 (344454-344478) - S 30245 12-1 TAAGGCCTACCTTATGGTTTTTT 8254233 (2710-2732) - S 3315 12-2 TCTAATAAACTCCTCAACTTTTAA 8254284 (170445-170468) + 12-3 
TCAGTGTTTTACTTTGCTTTCCT AS 9013, 9014 AS 18483,18484 12-4 CCCTTATACTCATGGCGCTAAACT 8254515 (228925-228948) + S 20780 12-5 AGGATGAATTGAATTGTTTACC 8254545 (665724-665745) + AS 25918 12-6 TCTTTTCTTAAGCAACAAGTCAT 8254594 (262178-262200) - 12-7 AATACTGGCCACTGCTCAATTAG S 9778 AS 12216, S 12217 (both w/in 1kb) 12-8 TAAATTCTATGCTTAGAAGTTCTT 12-9 GCTCTGCTATTCTAGTCTGACACT T->C 8254495 (340176-340198) - 8254649 (100250-100272) + 8254661 (32971-32994) + G->A T->A 8254737 (177118-177141) - S 25563 S,SRP 7SL RNA

[Table A2: contents not recoverable from the extracted text.]

Table A3. Locations of A-rich motif predicted near sRNA clusters by HMM trained on MEME output.

sRNA cluster   Pred. gene ID   Scaffold   Gene/motif strand   HMM-predicted A-rich motif positions
0              1038            8254028    -1                  22121-22178
1              21693           8254557    1                   48059-48084
2              23309           8254572    1                   4987-5067
2              23311           8254572    1                   9533-9546, 9512-9531, 9548-9593, 9637-9660
4              4800            8254600    -1                  158149-158203
4              4801            8254600    -1                  162623-162643, 161780-161812, 161814-161832, 161756-161778
4              4802            8254600    -1                  165199-165221, 165223-165239, 165243-165277
6              8861            8254638    -1                  961147-961207, 961123-961145, 961009-961039
6              8862            8254638    -1                  962199-962221, 962223-962273
6              8863            8254638    -1                  965750-965791, 965795-965830
7              11387           8254659    1                   1292754-1292779, 1292888-1292907, 1292781-1292802, 1292734-1292752
8              14809           8254678    1                   17382-17460, 16738-16755, 17683-17731, 17733-17752
8              14810           8254678    1                   19254-19333
8              14811           8254678    1                   21232-21253, 20749-20832
9              16336           8254697    1                   None
9              16337           8254697    1                   276477-276494, 277071-277100, 276991-277007, 276558-276582
10             14971           8254822    -1                  132986-133008, 134472-134479
11             15059           8254823    1                   263899-263978, 264380-264401

Table A4. Predicted genes with high-scoring downstream A-rich motifs, grouped by paralogy. Bold genes indicate novel predictions of sRNA activity.

sRNA-Associated Family I 4800 4801 4802 8861 8862 sRNA-Associated Family III 15058 14810 14811 14813 14809 sRNA-Associated Family IV 23309 23310 23311 1848 8220 23410 23409 23407 23408 Motif-Associated Family V (predicted sRNA activity) 9721 9739 9740 17099 Singleton sRNA and/or motif-associated genes 2261 1039 3676 1038 1353 5186 4571 4741 5720 5363 6435 7362 6390 7056 6308 10454 8796 10064 11046 9974 15951 15309 15421 15229 15238 17284 16830 16957 17969 16958 20945 21168 20500 20550 20748 22415 22418 22743 21693 21824 24638 24580 24723 24849 24874 27183 26513 26307 26779 27021 29694 29753 8863 11387 12216 15059 15060 8221 16957 16958 4272 5751 8539 11140 15981 18983 21495 22938 25094 27251 4283 6201 8741 11988 16023 19753 21583 22967 25268 27577 4302 6308 8758 14135 16308 20335 21639 23763 25646 29059

Preliminary sequence data for the T.
thermophila scaffolds was obtained from The Institute for Genomic Research website. Sequencing of the genome is supported by awards from the NIGMS and NSF.

[Figure A1 plot: "Depletion for microRNA targeting in different tissues"; labeled by miRNA family; scale: -log10 depletion P-value (one-sided KS).]

Figure A1. Tissue-specific patterns of depletion for targeting by the full set of 187 mouse miRNA families, hierarchically clustered by P-values. (Figure tiled over 3 pages)

[Figure A2 plot: "Enrichment for microRNA targeting in different tissues"; labeled by miRNA family; scale: -log10 enrichment P-value (one-sided KS).]

Figure A2. Tissue-specific patterns of enrichment for targeting by the full set of 187 mouse miRNA families, hierarchically clustered by P-values. (Figure tiled over 3 pages)