Supplementary notes on Alu-transcript numbers, location, subfamilies and function

Supplementary Document 1: The transcript repeat element: the human Alu sequence as a component of gene networks influencing disease

Paula Moolhuijzen1*, Jerzy K Kulski1,2,3*, David S Dunn1, David Schibeci1, Roberto Barrero1, Takashi Gojobori4, Matthew Bellgard1

Plant and animal small RNA target genes

Small RNAs have been found in all multicellular plants and animals investigated so far. Plant and animal miRNAs are evolutionarily ancient small RNAs, 19–24 nucleotides in length, generated by cleavage from larger, highly structured precursor molecules. In both plants and animals, miRNAs regulate gene expression post-transcriptionally through interactions with their target mRNAs. These miRNAs, excised from endogenously encoded hairpin RNAs, negatively regulate endogenous target genes by cleavage or translational inhibition of their mRNA (Millar and Waterhouse 2005).

The plant miRNAs characterized to date show much higher complementarity to their target genes (zero to three mismatches) than animal miRNAs do, although in both cases high to perfect complementarity between the target mRNA and the 5′-half of the miRNA is required (Doench and Sharp 2004, Parizotto et al. 2004). Current knowledge of animal miRNA-binding sites is limited and biased because their discovery was based primarily on computer predictions using databases composed only of 3′-UTR sequences (Enright et al. 2003, Lewis et al. 2003, Xie et al. 2003). In comparison, plant miRNA-binding sites are found almost exclusively within the open reading frames of the target genes (Millar and Waterhouse 2005, Xie et al. 2003).
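The mismatch criterion described above (zero to three mismatches overall, with near-perfect pairing in the 5′-half of the miRNA) can be illustrated with a minimal sketch. The function names and the example sequence are hypothetical and are not taken from the original analysis; the sketch assumes ungapped pairing between a miRNA and a target site of equal length, and it counts G:U wobble pairs as mismatches.

```python
# Minimal sketch (hypothetical helpers): count mismatches between a miRNA and
# a candidate target site, both written 5'->3' as RNA.
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}

def reverse_complement(rna: str) -> str:
    """Return the reverse complement of an RNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(rna.upper()))

def count_mismatches(mirna: str, target_site: str) -> int:
    """Number of miRNA positions that do not Watson-Crick pair with the site."""
    if len(mirna) != len(target_site):
        raise ValueError("this sketch assumes ungapped, equal-length pairing")
    # A perfectly paired miRNA equals the reverse complement of its site.
    perfect_partner = reverse_complement(target_site)
    return sum(a != b for a, b in zip(mirna.upper(), perfect_partner))

def five_prime_half_mismatches(mirna: str, target_site: str) -> int:
    """Mismatches restricted to the 5'-half of the miRNA."""
    half = len(mirna) // 2
    perfect_partner = reverse_complement(target_site)
    return sum(a != b for a, b in zip(mirna.upper()[:half], perfect_partner[:half]))

# Arbitrary 22-nt sequence used only for illustration.
example_mirna = "UGAGGUAGUAGGUUGUAUAGUU"
perfect_site = reverse_complement(example_mirna)
# Altering the last base of the site disrupts pairing with the miRNA 5' end.
mutated_site = perfect_site[:-1] + ("C" if perfect_site[-1] != "C" else "A")
```

Under the plant-like criterion, a site would pass if `count_mismatches` is at most three; the separate 5′-half count reflects the stricter requirement at that end.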
Supplementary notes on Alu-cDNA numbers, location, subfamilies and function

Alu-cDNA Functions

The functional categories of the Alu-cDNA collected from the H-Inv human cDNA database were analysed by Gene Ontology (GO), which classifies the known functions of gene products into at least 5,175 categories according to biological processes, cellular components and molecular functions (Ashburner et al. 2000). GO analysis was performed for cDNA with an NCBI gene identifier and CDS to determine the different metabolic and regulatory pathways that are possibly associated with the cDNA-Alu in normal and cancerous tissues. Over 8,000 cDNA-Alu from the H-Inv database were classified into nearly 3,000 of the GO functional categories, and then assigned to 35 GO Slim categories for the Homo sapiens taxon identifier 9636 (http://www.geneontology.org/GO.downloads.shtml#ont).

Over 70% of the cDNA-Alu loci grouped into the binding and catalytic activity molecular functions (Supplementary Figure 3), and more than 80% were involved in biological processes such as physiological process, nucleotide metabolism, cell communication and transport. Fewer than 3% of the cDNA and cDNA-Alu loci were designated “molecular function unknown” or “biological process unknown”. For loci unique to a tissue trait, the molecular function, biological process and cellular component assignments of the full cDNA set and the cDNA-Alu subset showed no cDNA-Alu effect on the GO Slim category proportions. For a few molecular functions, however, the proportion of trait-unique loci did vary between the normal and disease traits, both for all cDNA and for cDNA-Alu. The molecular functions of both cDNA and cDNA-Alu were associated mainly with binding (40%) and catalytic activity (20%).
The number of loci products involved in nucleic acid binding increased nearly two-fold for the disease loci, whereas loci associated with signal transducer activity and transporter activity decreased three-fold and two-fold, respectively, in the cancerous trait. These shifts in function were significant by Fisher’s exact test at P<0.0001 (Supplementary Figure 3). In the GO analysis of approximately half of the cDNA-Alu loci, there was no difference between cDNA and cDNA-Alu for the biological and cellular processes or molecular function. However, the proportions of categories for function, process and cellular component of the expressed loci changed when the normal and cancerous traits of the tissues were compared: loci in both the cDNA and cDNA-Alu subsets increased for nucleic acid binding and decreased for signal transducer and transporter activity in the cancerous trait.

In general, the majority of the cDNA isoforms containing a full-length Alu in the coding region of the transcript translated into hypothetical proteins rather than known proteins. As the function of these cDNA-Alus is not known or is labelled ‘hypothetical’, it is proposed that they are either untranslated, destined for rapid degradation, or serve a binding role against excessive retrotranspositional activity within the genome.

Alu Families and Subfamilies

When the Alu elements of the cDNA-Alu transcripts were categorized into their particular families and subfamilies, the contribution of the Alu families in cDNA was similar to the proportion of Alu families within the genome: AluS 54%, AluJ 26%, AluY 10%, Monomeric 8% and Alu 2% (Price et al. 2004) (Supplementary Figure 2B). The most frequent Alu subtype was AluSx. No family bias was detected between the cancerous and normal tissue sets.
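The Fisher’s exact test applied to the normal versus cancerous functional shifts reported in the functional analysis above can be sketched for a single 2×2 table as follows. This is a minimal, self-contained illustration under stated assumptions: the function name is hypothetical, the locus counts are invented placeholders rather than values from the study, and the two-sided P value is taken as the sum of the probabilities of all tables that are no more likely than the observed one.

```python
from math import comb

def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test P value for the 2x2 table [[a, b], [c, d]]."""
    if min(a, b, c, d) < 0:
        raise ValueError("counts must be non-negative")
    row1, row2 = a + b, c + d
    col1 = a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x in the top-left cell with fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    probs = [prob(x) for x in range(lowest, highest + 1)]
    # Small tolerance so that tables tied with the observed one are included.
    p_value = sum(q for q in probs if q <= p_observed * (1 + 1e-9))
    return min(p_value, 1.0)

# Hypothetical counts, for illustration only:
# rows = cancerous / normal trait, columns = category loci / all other loci.
p_nucleic_acid_binding = fisher_exact_two_sided(120, 880, 60, 940)
```

In practice an established implementation such as SciPy's `scipy.stats.fisher_exact` would normally be preferred; the hand-written version is shown only to make the calculation explicit.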
Alu Locations Within the cDNA

The relative position and size of Alu insertions within the cDNA sequences were determined for 17,819 of 17,861 distinct cDNA sequences with an annotated CDS that contained at least one Alu. The Alu positions were partitioned into three groups according to the regions in which they were found: (1) overlapping the coding sequence (CDS), (2) exclusively in the 5′ UTR, or (3) exclusively in the 3′ UTR. Supplementary Figure 1 shows the number of fragments and distinct transcripts in normal or cancerous tissues; the transcripts with known functions are indicated in red. Transcript Alu content was measured either as the proportion of transcripts containing an Alu fragment within the CDS, 5′ UTR or 3′ UTR, or as Alu sequence length (base pairs) relative to the total sequence length in these three regions.

Alu sequence quantification within the cDNA

More than 28,000 fragmented or full-length Alu elements were found in the 17,000 cDNA-Alu sequences surveyed, an average of 1.6 Alu per sequence. The Alu elements ranged in size from 10 nucleotides in length to full-sized. All of these Alu fragments had scores within the range recommended by RepeatMasker, that is, outside the threshold for a false-positive match (http://www.repeatmasker.org/). Supplementary Figure 2A shows the number of Alu fragments plotted as a percentage of the consensus Alu sequence length for the Alu subfamilies, and for transcripts represented in cancerous and normal tissues, with known and unknown (hypothetical) functions. The plots show that the number of cDNA-Alu in normal tissues is clearly higher than in cancerous tissues.
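The three-way partition of Alu positions and the per-sequence average described above can be sketched as follows. The function name, the coordinate convention (1-based, inclusive positions on the cDNA) and the example hits are assumptions made for illustration; they are not the original pipeline or its data.

```python
from collections import Counter

def classify_alu(alu_start: int, alu_end: int, cds_start: int, cds_end: int) -> str:
    """Assign an Alu hit to the 5'UTR, the 3'UTR, or the CDS (any overlap)."""
    if alu_start > alu_end or cds_start > cds_end:
        raise ValueError("start must not exceed end")
    if alu_end < cds_start:
        return "5'UTR"   # lies entirely upstream of the CDS
    if alu_start > cds_end:
        return "3'UTR"   # lies entirely downstream of the CDS
    return "CDS"         # overlaps the coding sequence

# Hypothetical hits: (accession, alu_start, alu_end, cds_start, cds_end).
hits = [
    ("cDNA_A", 12, 130, 200, 1400),
    ("cDNA_A", 1500, 1790, 200, 1400),
    ("cDNA_B", 350, 640, 90, 900),
]

# Number of Alu fragments per region.
region_counts = Counter(classify_alu(s, e, cs, ce) for _, s, e, cs, ce in hits)

# Average number of Alu fragments per distinct transcript.
alu_per_transcript = len(hits) / len({accession for accession, *_ in hits})
```

The same ratio applied to the reported totals (more than 28,000 elements in about 17,000 sequences) gives the stated average of roughly 1.6 Alu per sequence.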
In the cDNA-Alu plot, minor peaks represent a slightly increased number of the monomeric form of the Alu structure (110-130 bp), and major peaks above 90% of the full-length Alu consensus sequence represent a greatly increased number of the dimeric form of the Alu structure (>250 bp). This trend of minor and major peaks was similar for the cDNA-Alu in both the normal and diseased tissues. While multiple fragments of Alu elements can overlap the CDS within a single transcript, the largest number of Alu contained in a CDS was four copies, in one particular transcript sequence. Full-length dimeric Alu fragments (>280 bp) overlapping the CDS, which represent half of the transcripts, potentially carry cryptic splice sites provided by the Alu sequences.

Alu-cDNA: small non-protein-coding RNA

There are several classes of small non-protein-coding RNA (npcRNA) that play important roles in cellular metabolism, including mRNA decoding, RNA processing and mRNA stability. Indeed, altered expression of some of these npcRNAs has been associated with cancer, neurodegenerative diseases such as Alzheimer's disease, as well as various types of mental retardation and psychiatric disorders (Xie et al. 2008). It has been demonstrated that both free and embedded Alu RNAs play a major role in post-transcriptional regulation of gene expression, for example by affecting protein translation, alternative splicing and mRNA stability (Hasler et al. 2007). Because transposable elements differ in their evolution, frequencies and insertion mechanisms, there remains an opportunity to investigate other classes of retrotransposable elements.

Adenosine to inosine (A-to-I) RNA editing

DNA methylation and differences in histone modifications are the best-studied epigenetic control mechanisms involved in cancer development (Feinberg and Vogelstein 1983, Feinberg et al. 2006).
Recently, RNA silencing mechanisms such as RNA interference and microRNA regulation have also been linked to cancer (Lu et al. 2005; Feinberg et al. 2006). Adenosine to inosine (A-to-I) RNA editing is a site-specific modification in stem-loop structures within precursor mRNAs, catalyzed by members of the double-stranded RNA (dsRNA)-specific ADAR (adenosine deaminase acting on RNA) family (Bass 2002). ADAR-mediated RNA editing is essential for the normal development of both invertebrates (Palladino et al. 2000) and vertebrates (Higuchi et al. 2000; Wang et al. 2000; Hartner et al. 2004). The splicing and translational machineries recognize inosine (I) as guanosine (G). These A-to-I editing events occur in noncoding repetitive sequences, mostly Alu elements, and tend to undergo multi-editing in tight clusters. In addition, RNA editing was shown to be involved in the regulation of nuclear retention and in miRNA biogenesis (Prasanth et al. 2005; Blow et al. 2006; Yang et al. 2006).

Paz et al. (2007) found reduced editing levels at Alu sequences in a human brain tumor. Alterations in editing and alternative splicing of serotonin receptor transcripts were also found to correlate with a decrease in the enzymatic activity of the editing enzyme ADAR2, as deduced from the analysis of ADAR2 self-editing. These results suggest a role for RNA editing in tumor progression (Maas et al. 2001). In the vast majority of edited RNAs, A-to-I substitutions are clustered within transcribed sense or antisense Alu sequences. Edited bases are primarily associated with retained introns, extended UTRs, or transcripts that have no known corresponding gene. Therefore, Alu-associated RNA editing may be a mechanism for marking nonstandard transcripts to bypass the translational machinery (Kim et al. 2004).
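Because inosine is read as guanosine, A-to-I editing appears as A-to-G mismatches when a cDNA is compared with its genomic template, and the events tend to cluster. A minimal sketch of that idea follows, assuming a pre-computed ungapped alignment of equal-length sense-strand sequences; the function names, the gap and cluster-size parameters and the toy sequences are hypothetical. Editing within antisense Alu would appear as T-to-C on the sense strand and is not handled here.

```python
def a_to_g_sites(genomic: str, cdna: str) -> list[int]:
    """0-based positions where the genome has A and the cDNA has G."""
    if len(genomic) != len(cdna):
        raise ValueError("this sketch assumes an ungapped, equal-length alignment")
    return [
        i
        for i, (g, c) in enumerate(zip(genomic.upper(), cdna.upper()))
        if g == "A" and c == "G"
    ]

def editing_clusters(sites: list[int], max_gap: int = 20, min_sites: int = 3) -> list[list[int]]:
    """Group sorted sites into clusters where neighbours lie within max_gap bases."""
    clusters: list[list[int]] = []
    current: list[int] = []
    for pos in sorted(sites):
        if current and pos - current[-1] > max_gap:
            # Gap too large: close the current group, keeping it only if big enough.
            if len(current) >= min_sites:
                clusters.append(current)
            current = []
        current.append(pos)
    if len(current) >= min_sites:
        clusters.append(current)
    return clusters

# Toy sequences for illustration only.
toy_genomic = "AAGCTAACGTA"
toy_cdna = "AGGCTGACGTG"
toy_sites = a_to_g_sites(toy_genomic, toy_cdna)
```

A real analysis would also need to exclude known A/G polymorphisms and sequencing errors before treating a mismatch as a candidate editing site.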
Supplementary Table 1. Function examination of a subset of transcript isoforms with a full-length Alu spanning the CDS.

| accession | gene_id | locus | trait | gene_symbol | location | gene_description | repeat | isoform_protein |
|---|---|---|---|---|---|---|---|---|
| AK022432 | GID:950 | HIX0018713 | normal | SCARB2 | 4q21.1 | scavenger receptor class B, member 2 | FLAM_C | hypothetical |
| AK025390 | GID:25926 | HIX0014090 | normal | NOL11 | 17q24.2 | nucleolar protein 11 | AluSx | hypothetical |
| AK026942 | GID:411 | HIX0018251 | cancer | ARSB | 5p11-q13 | arylsulfatase B | FLAM_A | |
| AK024421 | GID:56996 | HIX0006933 | normal | SLC12A9 | 7q22 | solute carrier family 12 (potassium/chloride transporters), member 9 | AluSx | |
| AK024421 | GID:56996 | HIX0006933 | normal | SLC12A9 | 7q22 | solute carrier family 12 (potassium/chloride transporters), member 9 | AluJb | |
| AK024494 | GID:56996 | HIX0006933 | normal | SLC12A9 | 7q22 | solute carrier family 12 (potassium/chloride transporters), member 9 | AluSx | |
| AK024494 | GID:56996 | HIX0006933 | normal | SLC12A9 | 7q22 | solute carrier family 12 (potassium/chloride transporters), member 9 | AluSx | |
| AK054840 | GID:152641 | HIX0007805 | normal | FLJ30277 | 4q35.1 | hypothetical protein FLJ30277 | AluSq | hypothetical |
| AK055885 | GID:1 | HIX0015537 | normal | A1BG | 19q13.4 | alpha-1-B glycoprotein | AluSq | |
| AK091237 | GID:134266 | HIX0005298 | cancer | GRPEL2 | 5q33.1 | GrpE-like 2, mitochondrial (E. coli) | FLAM_A | hypothetical |
| AK092255 | GID:169611 | HIX0035002 | cancer | OLFML2A | 9q33.3 | olfactomedin-like 2A | AluSq | hypothetical |
| AK094948 | GID:286207 | HIX0008404 | normal | C9orf117 | 9q34.11 | chromosome 9 open reading frame 117 | AluSx | hypothetical |
| AK000385 | GID:8635 | HIX0017958 | normal | RNASET2 | 6q27 | ribonuclease T2 | AluSq | hypothetical |
| AK098245 | GID:5549 | HIX0001490 | normal | PRELP | 1q32 | proline/arginine-rich end leucine-rich repeat protein | FLAM_C | hypothetical |
| BC011823 | GID:55652 | HIX0010579 | disease | FLJ20489 | 12q13.11 | hypothetical protein FLJ20489 | AluSx | hypothetical |
| BC010030 | GID:202020 | HIX0004117 | disease | FLJ39653 | 4p15.32 | hypothetical protein FLJ39653 | AluSg | hypothetical |
| AL162039 | GID:92597 | HIX0004271 | normal | MOBKL1A | 4q13.3 | MOB1, Mps One Binder kinase activator-like 1A (yeast) | AluSx | hypothetical |
| AF010144 | GID:27308 | HIX0028517 | normal | AD7C-NTP | 1p36 | neuronal thread protein AD7c-NTP | AluSc | hypothetical |
| AK129645 | GID:83546 | HIX0024449 | normal | RTBDN | 19p12 | retbindin | AluSx | hypothetical |
| AK124186 | GID:196743 | HIX0009334 | normal | PAOX | 10q26.3 | polyamine oxidase (exo-N4-amino) | AluSx | |
| AK125252 | GID:2523 | HIX0010735 | normal | FUT1 | 19q13.3 | fucosyltransferase 1 (galactoside 2-alpha-L-fucosyltransferase, H blood group) | AluSx | hypothetical |
| AK125657 | GID:142679 | HIX0002652 | disease | DUSP19 | 2q32.1 | dual specificity phosphatase 19 | AluSq | hypothetical |
| AK126660 | GID:59271 | HIX0027783 | normal | C21orf63 | 21q22.11 | chromosome 21 open reading frame 63 | AluSq | hypothetical |
| AK127614 | GID:9462 | HIX0023758 | normal | RASAL2 | 1q24 | RAS protein activator like 2 | AluSx | |
| AF218028 | GID:81607 | HIX0001228 | normal | PVRL4 | 1q22-q23.2 | poliovirus receptor-related 4 | FLAM_A | hypothetical |
| BC009467 | GID:158960 | HIX0056219 | disease | LOC158960 | Xq28 | hypothetical protein BC009467 | FLAM_A | hypothetical |
| BC024593 | GID:11165 | HIX0033170 | disease | NUDT3 | 6p21.2 | nudix (nucleoside diphosphate linked moiety X)-type motif 3 | AluSg | hypothetical |
| BC037327 | GID:22873 | HIX0011407 | normal | DZIP1 | 13q32.1 | DAZ interacting protein 1 | FLAM_C | |
| BC018643 | GID:83716 | HIX0020171 | disease | CRISPLD2 | 16q24.1 | cysteine-rich secretory protein LCCL domain containing 2 | AluSx | hypothetical |
| BC044913 | GID:92597 | HIX0004271 | normal | MOBKL1A | 4q13.3 | MOB1, Mps One Binder kinase activator-like 1A (yeast) | AluSx | hypothetical |
| BC060883 | GID:400581 | HIX0039163 | normal | LOC400581 | 17p11.2 | GRB2-related adaptor protein-like | AluSx | hypothetical |
| CR627453 | GID:57653 | HIX0008209 | normal | KIAA1529 | 9q22.33 | KIAA1529 | AluYb8 | hypothetical |

Acknowledgment

The H-Inv human cDNA data were analysed mostly by Paula Moolhuijzen as part of her PhD requirements.

References

Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G (2000) Gene Ontology: tool for the unification of biology. Nat Genet 25:25-29

Doench JG, Sharp PA (2004) Specificity of microRNA target selection in translational repression. Genes Dev 18:504-511. DOI 10.1101/gad.1184404

Enright AJ, John B, Gaul U, Tuschl T, Sander C, Marks DS (2003) MicroRNA targets in Drosophila. Genome Biol 5

Hasler J, Samuelsson T, Strub K (2007) Useful 'junk': Alu RNAs in the human transcriptome. Cell Mol Life Sci 64:1793-1800

Kim DDY, Kim TTY, Walsh T, Kobayashi Y, Matise TC, Buyske S, Gabriel A (2004) Widespread RNA editing of embedded Alu elements in the human transcriptome. Genome Res 14:1719-1725

Lewis BP, Shih IH, Jones-Rhoades MW, Bartel DP, Burge CB (2003) Prediction of mammalian microRNA targets. Cell 115:787-798

Maas S, Patt S, Schrey M, Rich A (2001) Underediting of glutamate receptor GluR-B mRNA in malignant gliomas. Proc Natl Acad Sci 98:14687-14692

Millar AA, Waterhouse PM (2005) Plant and animal microRNAs: similarities and differences. Funct Integr Genomics 5:129-135

Parizotto EA, Dunoyer P, Rahm N, Himber C, Voinnet O (2004) In vivo investigation of the transcription, processing, endonucleolytic activity, and functional relevance of the spatial distribution of a plant miRNA. Genes Dev 18:2237-2242

Price AL, Eskin E, Pevzner PA (2004) Whole-genome analysis of Alu repeat elements reveals complex evolutionary history. Genome Res 14:2245-2252

Saito Y, Suzuki H, Tsugawa H, Nakagawa I, Matsuzaki J, Kanai Y, Hibi T (2009) Chromatin remodeling at Alu repeats by epigenetic treatment activates silenced microRNA-512-5p with downregulation of Mcl-1 in human gastric cancer cells. Oncogene 28:2738-2744

Xie W, Brown WT, Denman RB (2008) Translational regulation by non-protein-coding RNAs: different targets, common themes. Biochem Biophys Res Commun 373:462-466

Xie Z, Kasschau KD, Carrington JC (2003) Negative feedback regulation of Dicer-Like1 in Arabidopsis by microRNA-guided mRNA degradation. Curr Biol 13:784-789