Organization of the Caenorhabditis elegans small non-coding transcriptome by RNomics, tiling array and bioinformatics

陈润生 (Runsheng CHEN)
Institute of Biophysics, CAS
2007-8-29

How many characters are in the “Heaven Book”?

3 × 10^9 characters = 10,000 books × 100 pages per book × 3,000 characters per page

CCGGTCTCCCCGCCCGCGCGCGAAGTAAAGGCCCAGCGCAGCCCGCGCTCCTGCCCT
GGGGCCTCGTCTTTCTCCAGGAAAACGTGGACCGCTCTCCGCCGACAGTCTCTTCCAC
AGACCCCTGTCGCCTTCGCCCCCCGGTCTCTTCCGGTTCTGTCTTTTCGCTGGCTCGAT
ACGAACAAGGAAGTCGCCCCCAGCGAGCCCCGGCTCCCCCAGGCAGAGGCGGCCCC
GGGGGCGGAGTCAACGGCGGAGGCACGCCCTCTGTGAAAGGGCGGGGCATGCAAATT
CGAAATGAAAGCCCGGGAACGCCGAAGAAGCACGGGTGTAAGATTTCCCTTTTCAAAG
GCGGGAGAATAAGAAATCAGCCCGAGAGTGTAAGGGCGTCAATAGCGCTGTGGACGA
GACAGAGGGAATGGGGCAAGGAGCGAGGCTGGGGCTCTCACCGCGACTTGAATGTGG
ATGAGAGTGGGACGGTGACGGCGGGCGCGAAGGCGAGCGCATCGCTTCTCGGCCTTT
TGGCTAAGATCAAGTGTAGTATCTGTTCTTATCAGTTTAATATCTGATACGTCCTCTATCC
GAGGACAATATATTAAATGGATTGATCAATCCGCTTCAGCCTCCCGAGTAGCTGGGACT
ACAGACGGTGCCATCACGCCCAGCTCATTGTTGATTCCCGCCCCCTTGGTAGAGACGG
GATTCCGCTATATTGCCTGGGCTGGTGTCGAACTCATAGAACAAAGGATCCTCCCTCCT
GGGCCTGGGCGTGGGCTCGCAAAACGCTGGGATTCCCGGATTACAGGCGGGCGCACC
ACACCAGGAGCAAACACTTCCGGTTTTAAAAATTCAGTTTGTGATTGGCTGTCATTCAGT
ATTATGCTAATTAAGCATGCCCGGTTTTAAACCTCTTAAAACAACTTTTAAAATTACCTTT
CCACCTAAAACGTTAAAATTTGTCAAGTGATAATATTCGACAAGCTGTTATTGCCAAACT
ATTTTCCTATTTGTTTCCTAATGGCATCGGAACTAGCGAAAGTTTCTCGCCATCAGTTAA
AAGTTTGCGGCAGATGTAGACCTAGCAGAGGTGTGCGAGGAGGCCGTTAAGACTATAC
TTTCAGGGATCATTTCTATAGTGTGTTACTAGAGAAGTTTCTCTGAACGTGTAGAGCACC
GAAAACCACGAGGAAGAGAGGTAGCGTTTTCATCGGGTTACCTAAGTGCAGTGTCCCC
CCTGGCGCGCAATTGGGAACCCCACACGCGGTGTAGAAATATATTTTAAGGGCGCG
(1250 characters)

Noncoding sequences: sequences in the genome that do not code for any protein.

How much of the human genome is noncoding sequence? More than 97%!

BREAKTHROUGH OF THE YEAR (2001): Science celebrates nine other areas in which important findings were reported this year, from subatomic to atmospheric and beyond.

First runner-up: RNA ascending.
Short RNAs clearly play important biological roles. Dozens of the molecules are now known to exist in the nematode and fruit fly. The coding for these molecules is contained in the DNA sequence. Some 100 of these tiny RNA "genes" have been found in the gut bacterium Escherichia coli, and some 200 were uncovered in DNA from mouse brain tissue. In the nematode and fruit fly, they seem to be involved in development; in E. coli, they may facilitate rapid responses to environmental change and could serve similar functions in mammals.

Nature 391, 806-811 (19 February 1998)
Potent and specific genetic interference by double-stranded RNA in Caenorhabditis elegans
ANDREW FIRE*, SIQUN XU*, MARY K. MONTGOMERY*, STEVEN A. KOSTAS*†, SAMUEL E. DRIVER‡ & CRAIG C. MELLO‡
* Carnegie Institution of Washington, Department of Embryology, 115 West University Parkway, Baltimore, Maryland 21210, USA
† Biology Graduate Program, Johns Hopkins University, 3400 North Charles Street, Baltimore, Maryland 21218, USA
‡ Program in Molecular Medicine, Department of Cell Biology, University of Massachusetts Cancer Center, Two Biotech Suite 213, 373 Plantation Street, Worcester, Massachusetts 01605, USA

Transcriptional output/complexity: how much of the genome is transcribed

Genome and transcription (tiling array data)

Protein-coding sequence
- Human: ~2-3% of genome
- C. elegans: ~25% of genome

Transcriptional activity
- Human: ≥ 60% of genome (20-30× the protein-coding fraction)
- C. elegans: ~70% of genome (2-3× the protein-coding fraction)

The majority of transcripts are non-coding RNAs.
The major differences among organisms lie in their ncRNAs.

Biological Dark Matter
Newfound RNA suggests a hidden complexity inside cells
John Travis

In the early 1990s, Victor Ambros and his colleagues were conducting a gene hunt. In particular, they were searching for the gene that was mutated in a perplexing strain of Caenorhabditis elegans, the small nematode whose development many biologists study.
Unlike most genes, the one identified by Ambros' group doesn't encode a protein. It spawns a small molecule of RNA, a chemical relative of DNA, that somehow turns off other genes that play a role in worm development.

Several groups, including one led by Eddy, Ambros' team, and two other research groups, reported that Escherichia coli, worms, flies, and people contain dozens of previously undetected genes that spawn RNA instead of protein.

The RNA genes found so far are "just the tip of a huge iceberg," says Ruvkun.

Organization of the Caenorhabditis elegans small non-coding transcriptome: Genomic features, biogenesis, and expression

1. Found 100 novel noncoding RNAs and their genes in C. elegans by RNomics

Applying a novel cloning strategy, we have cloned 100 novel and 61 known or predicted Caenorhabditis elegans full-length ncRNAs (distinct from microRNAs).
Genome Research 16: 20-29, 2006; NCBI accession numbers: AY948555-AY948719

Analysis of the genomic environment and transcriptional characteristics showed that two-thirds of all ncRNAs, including many intronic snoRNAs, are independently transcribed under the control of ncRNA-specific upstream promoter elements. Furthermore, the transcription levels of at least 60% of the ncRNAs vary with developmental stage. We identified two new classes of ncRNAs, stem-bulge RNAs (sbRNAs) and snRNA-like RNAs (snlRNAs), both featuring distinct internal motifs, secondary structures, upstream elements, and high and developmentally variable expression. Most of the novel ncRNAs are conserved in Caenorhabditis briggsae, but only one homolog was found outside the nematodes.

Classified two new categories:
- The stem-bulge RNAs of C. elegans
- The snRNA-like RNAs of C. elegans

Confirmed three special upstream motifs of noncoding genes: UM1-3, located within 40-80 bp upstream of the transcription initiation sites of the ncRNA loci, were further revealed by MEME (Bailey and Elkan, 1995).
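To make the 40-80 bp upstream window concrete, the following is a minimal Python sketch that scans that window for a degenerate motif. It is an illustration only: MEME discovers motifs de novo, whereas this sketch merely searches for a pattern that is already known. The function names, the synthetic test sequence, and the choice of TGTCNG as the query (the submotif this talk later reports as shared by UM1, UM3, and UM4) are assumptions made for the example, and only the plus strand is handled.

```python
import re

# IUPAC nucleotide codes mapped to regex character classes (subset, for illustration).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "N": "[ACGT]",
}

def iupac_to_regex(motif):
    """Translate a degenerate motif such as 'TGTCNG' into a regex string."""
    return "".join(IUPAC[base] for base in motif.upper())

def upstream_motif_hits(genome, tss, motif, near=40, far=80):
    """Return upstream distances (in bp from the TSS) at which `motif` starts.

    genome: plus-strand sequence as a string (hypothetical input).
    tss:    0-based index of the transcription initiation site.
    The motif must lie entirely inside genome[tss - far : tss - near].
    """
    start = max(0, tss - far)
    end = max(0, tss - near)
    window = genome[start:end]
    # A lookahead lets overlapping occurrences be reported as well.
    pattern = re.compile("(?=(" + iupac_to_regex(motif) + "))")
    return [tss - (start + m.start()) for m in pattern.finditer(window)]

# Synthetic example: a TGTCAG placed 60 bp upstream of a TSS at index 90.
example = "A" * 30 + "TGTCAG" + "A" * 64
hits = upstream_motif_hits(example, tss=90, motif="TGTCNG")
```

A real analysis would also need the minus strand and a statistical background model, both of which this sketch leaves out.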
Found that many of the ncRNA genes are located in the introns of host protein-coding genes and are under the control of independent promoter elements.

[Figure: host gene EST hits (y-axis) plotted against ncRNA library clone number (x-axis) for Group II and Group V loci]

Comparing the expression levels of non-motif snoRNAs with the frequencies of ESTs corresponding to exons of their host genes produced a distinct positive correlation not found for motif-containing loci.

Built a combined microarray that measures coding and non-coding genes simultaneously

Profiling Caenorhabditis elegans non-coding RNA expression with a combined microarray
Housheng He, Lun Cai, Geir Skogerbø, Wei Deng, Tao Liu, Xiaopeng Zhu, Yudong Wang, Dong Jia, Zhihua Zhang, Yong Tao, Haipan Zeng, Muhammad Nauman Aftab, Yan Cui, Guozhen Liu and Runsheng Chen
Nucleic Acids Research, 2006, Vol. 34, No. 10, 2976–2983

Biogenesis of C. elegans ncRNAs
Arrangements of transcriptional elements and genomic locations of small ncRNA loci, as inferred from genomic and experimental data.

Developmentally regulated ncRNAs
Transcription levels of 106 ncRNA families were analyzed by Northern blot. Sixty-one showed variation exceeding two standard deviations and fell into 6 distinct expression clusters.

Public release date: 9-Jan-2006
Contact: Maria Smit, smit@cshl.edu, 516-422-4013
Cold Spring Harbor Laboratory

'Pregnant' protein-coding genes carry RNA 'babies'
Scientists characterize large numbers of independently expressed, nonprotein-coding RNA genes in the introns of protein-coding genes

BEIJING, China: Scientists from the Chinese Academy of Sciences have performed a comprehensive analysis of small, non-protein-coding RNAs in the model nematode, C. elegans. They characterize 100 heretofore undescribed transcripts, including two novel classes; they provide insights into the genomic structure and transcriptional regulation of non-coding RNAs; and they underscore the importance of non-coding RNAs in nematode development.
Their work appears this month in the journal Genome Research.

"The significance of non-protein-coding RNAs as central components of various cellular processes has risen sharply over the recent years," explains Prof. Runsheng Chen, principal investigator on the study.

Excluding microRNAs (miRNAs), or small transcripts that have recently received widespread attention and are known to play important roles in transcriptional regulation, small non-coding RNAs (or ncRNAs) in C. elegans have not been extensively investigated until now. Using a new, high-throughput procedure to clone small, full-length ncRNAs, Chen's laboratory isolated and characterized 161 unique transcripts.

A major advantage of the new cloning procedure is that it achieves an extraordinarily high detection rate for ncRNAs by current standards. "Studies published over recent years have only been able to reach a detection rate of about 3%, but our method reached a detection rate of 30%, a 10-fold increase in cloning efficiency," explains Chen. "It's like going from a Model T Ford to a Ferrari in one fell swoop!"

Of the 161 transcripts detected by Chen's group, 100 were novel and 61 were previously known or predicted. Among the 100 novel genes, 30 had no known function, whereas 70 belonged to the ubiquitous class of small nucleolar RNAs (snoRNAs).

Based on sequence and structural features, Chen and his colleagues were able to classify more than half of the 30 unknown RNAs into two new categories: stem-bulge RNAs (sbRNAs) and small nuclear-like RNAs (snlRNAs). Both classes of transcripts exhibited enhanced expression during the later stages of worm development, indicating a functional role for these transcripts in developmental processes.

"The interesting thing about nematodes is that their genomic organization of both snoRNAs and other ncRNAs is quite different from other animals," says Chen.
In contrast to the genomes of other metazoans, where most snoRNAs are found in introns and are under the control of independent promoters, nematode snoRNA loci are both intergenic and intronic (with and without promoters). Interestingly, plant snoRNAs are primarily located in intergenic regions.

Other ncRNA genes (i.e., non-snoRNA genes) are mainly located in intergenic regions in both plants and animals. But in nematodes, Chen's team found that many of these other ncRNA genes are located in the introns of host protein-coding genes and are under the control of independent promoter elements.

Finally, Chen and his colleagues estimated that 2700 ncRNA genes are present in the C. elegans genome. "One particularly intriguing aspect of the non-coding transcriptome is its potential to fill the regulatory gap created by the surprisingly low number of protein-coding genes in higher organisms," says Chen. "Between one-celled yeast, thousand-celled nematodes, and trillion-celled mammals, there is a difference of a mere 6,000 to 19,000 to 25,000 in protein-coding gene numbers. We think that regulation by noncoding RNA accounts for this discrepancy and helps to explain the additional biological complexity of higher organisms."

2. Mapping the C. elegans noncoding transcriptome with a whole genome tiling microarray

Tiling

Structure of eukaryotic mRNA
Cap, 5’-UTR, Initiation (AUG), Coding region, Termination (UAG, UGA, UAA), 3’-UTR, Poly-A

RNA was extracted from a mixed-stage population of wild-type C. elegans strain N2.

Three kinds of samples:
- PA: polyA-tailed RNA
- NPA: non-polyA-tailed RNA
- SNPA: small non-polyA-tailed RNA (RNA < 500 nt)

Build transfrags
Find TUFs (Transfrags of Unknown Function)

Transfrag distribution in the three different samples. “Other annotated” mainly includes tandem repeats and pseudogenes.

Detection rates of annotated ncRNAs in the SNPA sample

The NPA sample produced 97,548 transfrags, all of which could potentially represent non-coding transcripts.
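The "build transfrags, find TUFs" steps can be illustrated with a minimal Python sketch. It joins neighbouring above-threshold probes into transcribed fragments, then keeps the fragments that overlap no annotated interval. The talk does not state the procedure or settings actually used, so the function names, the gap and length rules, and all parameter values here (threshold, max_gap, min_run, probe_len) are assumptions for illustration.

```python
def build_transfrags(probes, threshold, max_gap=50, min_run=40, probe_len=25):
    """Merge positive probes into transfrags.

    probes: list of (start, intensity) tuples, sorted by start, for one
            chromosome and strand (hypothetical input format).
    Returns half-open (start, end) intervals. Positive probes separated by at
    most max_gap bp are joined; fragments shorter than min_run bp are dropped.
    """
    transfrags = []
    cur_start = cur_end = None
    for start, intensity in probes:
        if intensity < threshold:
            continue  # probe is not called as transcribed
        end = start + probe_len
        if cur_start is not None and start - cur_end <= max_gap:
            cur_end = max(cur_end, end)  # extend the current fragment
        else:
            if cur_start is not None and cur_end - cur_start >= min_run:
                transfrags.append((cur_start, cur_end))
            cur_start, cur_end = start, end  # open a new fragment
    if cur_start is not None and cur_end - cur_start >= min_run:
        transfrags.append((cur_start, cur_end))
    return transfrags

def find_tufs(transfrags, annotated):
    """Keep transfrags that overlap no annotated (start, end) interval."""
    return [
        (s, e) for s, e in transfrags
        if not any(s < a_end and a_start < e for a_start, a_end in annotated)
    ]

# Toy data: seven probes on one strand, one annotated gene at 0-30.
probes = [(0, 10), (25, 12), (50, 1), (60, 11), (300, 15), (1000, 20), (1020, 20)]
transfrags = build_transfrags(probes, threshold=5)
tufs = find_tufs(transfrags, annotated=[(0, 30)])
```

In this sketch a larger max_gap joins short neighbouring fragments into longer ones, so the number and length of TUFs depend strongly on these assumed settings.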
Nearly 24% of the NPA transfrags are non-annotated intergenic TUFs. RT-PCR analysis confirmed 89% (25/29) of randomly sampled intronic and intergenic TUFs, effectively excluding the possibility that the majority of the NPA TUFs result from nonspecific microarray hybridization. TUFs in the NPA sample are also fairly well conserved, with 54% showing at least some level of conservation (weak WABA (Kent and Zahler, 2000)) in C. briggsae. NPA TUFs are generally short (mean 88 nt, median 75 nt); however, only 557 of them overlapped with the SNPA TUFs. A possible explanation is that short NPA TUFs in close proximity may represent longer transcripts.

Chromosomal distribution of small ncRNAs (tRNAs excluded) and SNPA TUFs

The novel upstream motif 4 (UM4) compared to UM3 and UM1. All three motifs share the submotif TGTCNG (green rectangles), but at different relative positions.

What is an mRNA-like ncRNA?
- Transcribed by RNA polymerase II
- Has a polyA tail
- Often spliced
- Has no ORF or only a very short one

Bioinformatics Research Group, Institute of Computing Technology, CAS.

ENCODE pilot project

September 2003 saw the birth of the ENCODE project (the Encyclopedia of DNA Elements), the goal of which was to identify and document the functional elements within the genome using high-throughput methods. Thirty-five groups took part in the project, bringing expertise that ranged from genome annotation, to RNA-expression analysis, to comparative genomics. Their analysis of 1% of the human genome (distributed among 44 genomic regions) resulted in more than 200 experimental and computational data sets. Some of the most striking results concern transcription and its regulation.

[1]. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature, 2007, 447: 799.
[2]. The ENCODE Project Consortium. Science, 2004, 306: 636.
[3].
Note: For details on ENCODE, visit http://www.genome.gov/ENCODE

We learn that as much as 93% of the interrogated region can be transcribed, indicating that transcription is not confined to what we (for now) identify as genes. Many transcripts are non-coding, whereas others seem to form fusion transcripts between ORFs that had previously been annotated as distinct.

A gene is a union of genomic sequences encoding a coherent set of potentially overlapping functional products.

2. Predicted noncoding genes on human chromosome 3

Producing the finished map of the human genome is an important step in genome research, and internationally the work is proceeding one chromosome at a time. Finished maps of nine chromosomes (6, 7, 13, 14, 19, 20, 21, 22, and Y) have now been completed, each published in Nature. We took part in the finishing of human chromosome 3, led by Baylor College of Medicine (USA), with specific responsibility for ncRNA gene annotation. For this purpose we built a software package for identifying ncRNAs. The paper was published in Nature 440: 1194-1198, 2006.

Noncoding genes found on human chromosome 3

RNA class              Total   Methods (number found)
snRNA                  83      H.F. 72; RFAM 80
Y RNA                  46      H.F. 35; RFAM 45
snoRNA (C/D box)       21      H.F. 6; RFAM 4; snoScan 17
snoRNA (H/ACA box)     22      H.F. 12; Fisher 8; C.M 8
tRNA                   13      H.F. 10; tRNA-Scan 9
SRP RNA                17      H.F. 1; SRP RNA Scan 16
miRNA                  3       H.F. 1; RFAM 3
mRNA-like ncRNA        481     H.F.-FANTOM 1; H.F.-FLJ/H-InV 452; Unigene Filter 28
rRNA                   10      H.F. 3; RFAM 7
telomerase RNA         1       H.F. 3
7SK RNA                3       H.F. 3
snmRNA                 2       H.F. 2
scaRNA                 1       H.F. 1
Total                  713     872 (redundant)

3. Built the noncoding RNA database NONCODE

We collected all experimentally confirmed ncRNA genes published in journals or released on websites, developed the corresponding software and search tools, and built the ncRNA database. The related paper has been submitted to Nucleic Acids Research. A group in Korea has asked to host a mirror of the database. In just over two months online, the database has received more than 120,000 hits (about 2,000 per day on average) from about 60,000 different IP addresses.

The paper was published in the first issue of Nucleic Acids Research in 2005.

ABSTRACT
NONCODE is an integrated knowledge database dedicated to non-coding RNAs (ncRNAs), that is to say, RNAs that function without being translated into proteins. All ncRNAs in NONCODE were filtered automatically from literature and GenBank, and were later manually curated. The distinctive features of NONCODE are as follows: (i) the ncRNAs in NONCODE include almost all the types of ncRNAs, except transfer RNAs and ribosomal RNAs. (ii) All ncRNA sequences and their related information (e.g.
function, cellular role, cellular location, chromosomal information, etc.) in NONCODE have been confirmed manually by consulting relevant literature: more than 80% of the entries are based on experimental data. (iii) Based on the cellular process and function in which a given ncRNA is involved, we introduced a novel classification system, labeled process function class, to integrate existing classification systems. (iv) In addition, some 1100 ncRNAs have been grouped into nine other classes according to whether they are specific to gender or tissue or associated with tumors and diseases, etc. (v) NONCODE provides a user-friendly interface, a visualization platform and a convenient search option, allowing efficient recovery of sequence, regulatory elements in the flanking sequences, secondary structure, related publications and other information. The first release of NONCODE (v1.0) contains 5339 non-redundant sequences from 861 organisms, including eukaryotes, eubacteria, archaebacteria, viruses and viroids. Access is free for all users through a web interface at http://noncode.bioinfo.org.cn.

ncRNA database

Thank you all!