The International Conference on Bioinformatics (InCoB)

Organization of the Caenorhabditis elegans
small non-coding transcriptome by RNomics,
tiling array and bioinformatics
陈润生(Runsheng CHEN)
Institute of Biophysics, CAS
2007-8-29
How many characters are in the “Heaven Book”?
3 × 10^9
= 10,000 books × 100 pages per book × 3,000 characters per page
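The arithmetic behind this estimate can be checked with a short Python sketch. The figures come from the slide; the variable names are illustrative only.

```python
# Figures quoted on the slide; the names are illustrative.
CHARS_PER_PAGE = 3_000
PAGES_PER_BOOK = 100
BOOKS = 10_000

# Total number of "characters" (bases) in the genome "book".
total_chars = CHARS_PER_PAGE * PAGES_PER_BOOK * BOOKS
print(f"{total_chars:,} characters")
```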
CCGGTCTCCCCGCCCGCGCGCGAAGTAAAGGCCCAGCGCAGCCCGCGCTCCTGCCCT
GGGGCCTCGTCTTTCTCCAGGAAAACGTGGACCGCTCTCCGCCGACAGTCTCTTCCAC
AGACCCCTGTCGCCTTCGCCCCCCGGTCTCTTCCGGTTCTGTCTTTTCGCTGGCTCGAT
ACGAACAAGGAAGTCGCCCCCAGCGAGCCCCGGCTCCCCCAGGCAGAGGCGGCCCC
GGGGGCGGAGTCAACGGCGGAGGCACGCCCTCTGTGAAAGGGCGGGGCATGCAAATT
CGAAATGAAAGCCCGGGAACGCCGAAGAAGCACGGGTGTAAGATTTCCCTTTTCAAAG
GCGGGAGAATAAGAAATCAGCCCGAGAGTGTAAGGGCGTCAATAGCGCTGTGGACGA
GACAGAGGGAATGGGGCAAGGAGCGAGGCTGGGGCTCTCACCGCGACTTGAATGTGG
ATGAGAGTGGGACGGTGACGGCGGGCGCGAAGGCGAGCGCATCGCTTCTCGGCCTTT
TGGCTAAGATCAAGTGTAGTATCTGTTCTTATCAGTTTAATATCTGATACGTCCTCTATCC
GAGGACAATATATTAAATGGATTGATCAATCCGCTTCAGCCTCCCGAGTAGCTGGGACT
ACAGACGGTGCCATCACGCCCAGCTCATTGTTGATTCCCGCCCCCTTGGTAGAGACGG
GATTCCGCTATATTGCCTGGGCTGGTGTCGAACTCATAGAACAAAGGATCCTCCCTCCT
GGGCCTGGGCGTGGGCTCGCAAAACGCTGGGATTCCCGGATTACAGGCGGGCGCACC
ACACCAGGAGCAAACACTTCCGGTTTTAAAAATTCAGTTTGTGATTGGCTGTCATTCAGT
ATTATGCTAATTAAGCATGCCCGGTTTTAAACCTCTTAAAACAACTTTTAAAATTACCTTT
CCACCTAAAACGTTAAAATTTGTCAAGTGATAATATTCGACAAGCTGTTATTGCCAAACT
ATTTTCCTATTTGTTTCCTAATGGCATCGGAACTAGCGAAAGTTTCTCGCCATCAGTTAA
AAGTTTGCGGCAGATGTAGACCTAGCAGAGGTGTGCGAGGAGGCCGTTAAGACTATAC
TTTCAGGGATCATTTCTATAGTGTGTTACTAGAGAAGTTTCTCTGAACGTGTAGAGCACC
GAAAACCACGAGGAAGAGAGGTAGCGTTTTCATCGGGTTACCTAAGTGCAGTGTCCCC
CCTGGCGCGCAATTGGGAACCCCACACGCGGTGTAGAAATATATTTTAAGGGCGCG
(1250 characters)
Noncoding sequences: sequences in the
genome that do not code for any
protein.
How much of the human genome is
noncoding sequence?
More than 97%!
BREAKTHROUGH OF THE YEAR (2001):
Science celebrates nine other areas in which
important findings were reported this year, from
subatomic to atmospheric and beyond.
First runner-up: RNA ascending.
Short RNAs clearly play important biological roles.
Dozens of the molecules are now known to exist in
the nematode and fruit fly. The coding for these
molecules is contained in the DNA sequence. Some
100 of these tiny RNA "genes" have been found in
the gut bacterium Escherichia coli, and some 200
were uncovered in DNA from mouse brain tissue. In
the nematode and fruit fly, they seem to be involved
in development; in E. coli, they may facilitate rapid
responses to environmental change and could serve
similar functions in mammals.
Nature 391, 806 - 811 (19 February 1998)
Potent and specific genetic interference by double-stranded RNA in Caenorhabditis elegans
ANDREW FIRE*, SIQUN XU*, MARY K. MONTGOMERY*, STEVEN A. KOSTAS*†,
SAMUEL E. DRIVER‡ & CRAIG C. MELLO‡
* Carnegie Institution of Washington, Department of Embryology, 115 West University
Parkway, Baltimore, Maryland 21210, USA
† Biology Graduate Program, Johns Hopkins University, 3400 North Charles Street,
Baltimore, Maryland 21218, USA
‡ Program in Molecular Medicine, Department of Cell Biology, University of
Massachusetts Cancer Center, Two Biotech Suite 213, 373 Plantation Street, Worcester,
Massachusetts 01605, USA
Transcriptional output/complexity
Transcription of the genome
Genome and transcription (tiling array data)
Protein-coding sequence
– Human: ~2-3% of genome
– C. elegans: ~25% of genome
Transcriptional activity
– Human: ≥ 60% of genome (20-30× the protein-coding fraction)
– C. elegans: ~70% of genome (2-3× the protein-coding fraction)
The majority of transcripts are non-coding RNAs
The major differences among organisms lie in their ncRNAs
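The fold values in parentheses can be read as the ratio of the transcribed fraction to the protein-coding fraction. A minimal sketch of that calculation, using the approximate percentages from the slide (the dictionary and function names are assumptions for illustration):

```python
# Approximate genome fractions quoted on the slide; ranges are (low, high).
coding_fraction = {"human": (0.02, 0.03), "C. elegans": (0.25, 0.25)}
transcribed_fraction = {"human": 0.60, "C. elegans": 0.70}

def fold_over_coding(species):
    """Return (min, max) ratio of transcribed to protein-coding fraction."""
    low, high = coding_fraction[species]
    transcribed = transcribed_fraction[species]
    return transcribed / high, transcribed / low

for species in coding_fraction:
    fold_min, fold_max = fold_over_coding(species)
    print(f"{species}: {fold_min:.1f}x to {fold_max:.1f}x")
```

For human this gives roughly 20 to 30 fold, and for C. elegans just under 3 fold, in line with the slide.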
Biological Dark Matter
Newfound RNA suggests a hidden complexity
inside cells
John Travis
In the early 1990s, Victor Ambros and his colleagues were
conducting a gene hunt. In particular, they were searching for the gene
that was mutated in a perplexing strain of Caenorhabditis elegans, the
small nematode whose development many biologists study. Unlike
most genes, the one identified by Ambros' group doesn't encode a
protein. It spawns a small molecule of RNA—a chemical relative of
DNA—that somehow turns off other genes that play a role in worm
development. Several groups, including one led by Eddy, Ambros' team
and two other research groups reported that Escherichia coli, worms,
flies, and people contain dozens of previously undetected genes that
spawn RNA instead of protein.
The RNA genes found so far are "just the tip of a huge iceberg,"
says Ruvkun.
Organization of the Caenorhabditis elegans
small non-coding transcriptome: Genomic
features, biogenesis, and expression
1. Found 100 novel noncoding RNAs and
their genes in C. elegans by RNomics
Applying a novel cloning strategy, we have cloned
100 novel and 61 known or predicted
Caenorhabditis elegans full-length ncRNAs
(different from microRNA).
Genome Research 16: 20-29, 2006;
NCBI accession numbers: AY948555-AY948719
Studies of the genomic environment and transcriptional
characteristics have shown that two-thirds of all ncRNAs,
including many intronic snoRNAs, are independently
transcribed under the control of ncRNA-specific upstream
promoter elements. Furthermore, the transcription levels of at
least 60% of the ncRNAs vary with developmental stages. We
identified two new classes of ncRNAs, stem–bulge RNAs
(sbRNAs) and snRNA-like RNAs (snlRNAs), both featuring
distinct internal motifs, secondary structures, upstream
elements, and high and developmentally variable expression.
Most of the novel ncRNAs are conserved in Caenorhabditis
briggsae, but only one homolog was found outside the
nematodes.
Classified two new categories:
The stem-bulge RNAs (sbRNAs) of C. elegans
The snRNA-like RNAs (snlRNAs) of C. elegans
Confirmed three special upstream motifs
of noncoding genes: UM1-3
Motifs located within 40-80 bp upstream of the transcription initiation sites
of the ncRNA loci were further revealed by MEME (Bailey and Elkan, 1995).
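Before a motif finder such as MEME can be run, the upstream windows have to be cut out of the genome sequence. The sketch below shows one way to extract the region 40-80 bp upstream of a transcription initiation site on either strand. The function names, the 0-based coordinate convention, and the exact window boundaries are assumptions made for illustration, not the procedure used in the study.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]

def upstream_window(chrom_seq, tss, strand, near=40, far=80):
    """Return the bases lying more than `near` and up to `far` bp upstream
    of a 0-based transcription start site, read 5'->3' on the
    transcript's own strand."""
    if strand == "+":
        lo, hi = tss - far, tss - near
    elif strand == "-":
        lo, hi = tss + near + 1, tss + far + 1
    else:
        raise ValueError("strand must be '+' or '-'")
    if lo < 0 or hi > len(chrom_seq):
        raise ValueError("window runs off the end of the sequence")
    window = chrom_seq[lo:hi]
    return window if strand == "+" else reverse_complement(window)

# Toy sequence, purely illustrative.
chrom = "ACGT" * 50
print(upstream_window(chrom, 100, "+"))
```

The extracted windows, one per ncRNA locus, would then be written to a FASTA file and passed to the motif finder.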
Found that many of the ncRNA genes are located
in the introns of host protein-coding genes and are
under the control of independent promoter
elements.
[Figure: host gene EST hits (y-axis, 0-300) plotted against ncRNA library clone number (x-axis, 0-25), with Group II and Group V marked.]
Comparing the expression levels of non-motif snoRNAs with the
frequencies of ESTs corresponding to exons of their host genes
produced a distinct positive correlation not found for motif-containing loci.
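A correlation of this kind can be quantified with a Pearson coefficient between clone counts and host gene EST hits. The sketch below is a plain-Python illustration; the numbers are hypothetical placeholders, not data from the study, and the slides do not state which statistic the authors used.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sd_x * sd_y)

# Hypothetical placeholder values, NOT measurements from the study.
clone_counts = [1, 3, 5, 8, 12, 20]          # ncRNA library clone number
host_est_hits = [10, 35, 60, 90, 140, 230]   # host gene EST hits

print(f"r = {pearson_r(clone_counts, host_est_hits):.3f}")
```

A coefficient close to 1 for non-motif loci and near 0 for motif-containing loci would correspond to the pattern described in the caption.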
Constructed a combined microarray that measures coding and non-coding genes simultaneously
Profiling Caenorhabditis elegans non-coding RNA expression
with a combined microarray
Housheng He, Lun Cai, Geir Skogerbø, Wei Deng, Tao Liu, Xiaopeng Zhu,
Yudong Wang, Dong Jia, Zhihua Zhang, Yong Tao, Haipan Zeng,
Muhammad Nauman Aftab, Yan Cui, Guozhen Liu and Runsheng Chen
Nucleic Acids Research, 2006, Vol. 34, No. 10, 2976–2983
Biogenesis of C. elegans ncRNAs
Arrangements of transcriptional elements and genomic locations of small ncRNA loci,
as inferred from genomic and experimental data.
Developmentally regulated ncRNAs
Transcription levels of 106 ncRNA families were analyzed by Northern blot.
61 showed variation exceeding two standard deviations and fell into 6 distinct
expression clusters.
Public release date: 9-Jan-2006
Contact: Maria Smit
smit@cshl.edu
516-422-4013
Cold Spring Harbor Laboratory
'Pregnant' protein-coding genes carry RNA 'babies'
Scientists characterize large numbers of independently expressed, non-protein-coding RNA genes in the introns of protein-coding genes
BEIJING, China - Scientists from the Chinese Academy of Sciences have
performed a comprehensive analysis of small, non-protein-coding RNAs in
the model nematode, C. elegans. They characterize 100 heretofore-undescribed transcripts, including two novel classes; they provide insights
into the genomic structure and transcriptional regulation of non-coding
RNAs; and they underscore the importance of non-coding RNAs in
nematode development. Their work appears this month in the journal
Genome Research.
"The significance of non-protein-coding RNAs as central components of
various cellular processes has risen sharply over the recent years," explains
Prof. Runsheng Chen, principal investigator on the study. Excluding
microRNAs (miRNAs), or small transcripts that have recently received
widespread attention and are known to play important roles in transcriptional
regulation, small non-coding RNAs (or ncRNAs) in C. elegans have not been
extensively investigated until now.
Using a new, high-throughput procedure to clone small, full-length ncRNAs,
Chen's laboratory isolated and characterized 161 unique transcripts. A major
advantage of the new cloning procedure is that it achieves an extraordinarily
high detection rate for ncRNAs by current standards. "Studies published over
recent years have only been able to reach a detection rate of about 3%, but our
method reached a detection rate of 30%, a 10-fold increase in cloning
efficiency," explains Chen. "It's like going from a Model T Ford to a Ferrari in
one fell swoop!"
Of the 161 transcripts detected by Chen's group, 100 were novel and 61 were
previously known or predicted. Among the 100 novel genes, 30 had no known
function, whereas 70 belonged to the ubiquitous class of small nucleolar
RNAs (snoRNAs). Based on sequence and structural features, Chen and his
colleagues were able to classify more than half of the 30 unknown RNAs into
two new categories: stem-bulge RNAs (sbRNAs) and small nuclear-like
RNAs (snlRNAs). Both classes of transcripts exhibited enhanced expression
during the later stages of worm development, indicating a functional role for
these transcripts in developmental processes.
"The interesting thing about nematodes is that their genomic organization of
both snoRNAs and other ncRNAs is quite different from other animals," says
Chen. In contrast to the genomes of other metazoans, where most snoRNAs
are found in introns and are under the control of independent promoters,
nematode snoRNA loci are both intergenic and intronic (with and without
promoters). Interestingly, plant snoRNAs are primarily located in intergenic
regions. Other ncRNA genes (i.e., non-snoRNA genes) are mainly located in
intergenic regions in both plants and animals. But in nematodes, Chen's team
found that many of these other ncRNA genes are located in the introns of host
protein-coding genes and are under the control of independent promoter
elements.
Finally, Chen and his colleagues estimated that 2700 ncRNA genes are
present in the C. elegans genome. "One particularly intriguing aspect of the
non-coding transcriptome is its potential to fill the regulatory gap created by
the surprisingly low number of protein-coding genes in higher organisms,"
says Chen. "Between one-celled yeast, thousand-celled nematodes, and
trillion-celled mammals, there is a difference of a mere 6,000 to 19,000 to
25,000 in protein-coding gene numbers. We think that regulation by non-coding RNA accounts for this discrepancy and helps to explain the additional
biological complexity of higher organisms."
2. Mapping the C. elegans noncoding
transcriptome with a whole-genome tiling
microarray
Tiling
Structure of eukaryotic mRNA
Cap
5’-UTR
Coding region
Initiation (AUG)
3’-UTR
Poly-A
Termination (UAG,
UGA, UAA)
RNA was extracted from a mixed stage population
of wild type C. elegans strain N2
Three kinds of samples:
PA: PolyA tailed RNA
NPA: Non-polyA tailed RNA
SNPA: small Non-polyA tailed RNA
(RNA<500nt)
Build transfrags (transcribed fragments)
Find TUFs (Transfrags of Unknown
Function)
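A common way to build transfrags from tiling-array data is to join neighbouring probes whose signal exceeds a threshold, tolerating small gaps and discarding runs that are too short; TUFs are then the transfrags that overlap no existing annotation. The sketch below illustrates only that general idea. The threshold, `max_gap`, and `min_run` values, the probe layout, and the function names are all assumptions, not the parameters used in this study.

```python
def build_transfrags(probes, threshold, max_gap=50, min_run=40):
    """Merge above-threshold probes into transfrags.

    probes: (start, end, intensity) tuples for one chromosome, sorted by
    start, with half-open coordinates. Positive probes separated by at
    most `max_gap` bp are joined; merged intervals shorter than
    `min_run` bp are dropped.
    """
    transfrags, current = [], None
    for start, end, intensity in probes:
        if intensity < threshold:
            continue  # negative probes never start or extend a transfrag
        if current is not None and start - current[1] <= max_gap:
            current[1] = max(current[1], end)
        else:
            if current is not None and current[1] - current[0] >= min_run:
                transfrags.append(tuple(current))
            current = [start, end]
    if current is not None and current[1] - current[0] >= min_run:
        transfrags.append(tuple(current))
    return transfrags

def find_tufs(transfrags, annotations):
    """Keep transfrags that overlap none of the annotated intervals."""
    return [(s, e) for s, e in transfrags
            if not any(s < a_end and a_start < e
                       for a_start, a_end in annotations)]

# Hypothetical probe signals and one hypothetical annotated gene.
probes = [(0, 25, 900), (35, 60, 850), (70, 95, 100), (105, 130, 800),
          (300, 325, 950), (600, 625, 700), (635, 660, 720)]
transfrags = build_transfrags(probes, threshold=500)
tufs = find_tufs(transfrags, annotations=[(0, 200)])
print(transfrags, tufs)
```

In this toy example the isolated probe at 300-325 is discarded as too short, and the transfrag at 0-130 is excluded from the TUFs because it overlaps the annotated interval.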
Transfrag distribution in the three different samples.
“Other annotated” mainly includes tandem repeats
and pseudogenes

Detection rates of annotated ncRNAs in
the SNPA sample
The NPA sample produced 97,548 transfrags which could
potentially all represent non-coding transcripts. Nearly 24%
are non-annotated intergenic TUFs. The RT-PCR analysis
confirmed 89% (25/29) of randomly sampled intronic and
intergenic TUFs, effectively excluding the possibility that the
majority of the NPA TUFs are a result of non-specific microarray hybridization. TUFs in the NPA sample are also fairly
well conserved, with 54% showing at least some level of
conservation (weak WABA (Kent and Zahler, 2000)) in C.
briggsae. NPA TUFs are generally short (mean 88 nt, median
75 nt); however, only 557 of these overlapped with the SNPA
TUFs. A possible explanation is that short NPA TUFs in close
proximity may represent longer transcripts.
Chromosomal distribution of small ncRNAs
(tRNAs excluded) and SNPA TUFs
The novel upstream motif 4 (UM4) compared to UM3 and UM1. All three
motifs share the submotif TGTCNG (green rectangles), but at different
relative positions.
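Locating the shared submotif in a set of upstream sequences is a simple pattern search, with N standing for any base. The sketch below scans the forward strand only; the function name and the example sequence are made up for illustration.

```python
import re

# TGTCNG, with N = any of A, C, G, T. The lookahead lets overlapping
# occurrences be reported as well.
SUBMOTIF = re.compile(r"(?=(TGTC[ACGT]G))")

def find_submotif(seq):
    """Return (0-based position, matched hexamer) for each TGTCNG hit."""
    return [(m.start(), m.group(1)) for m in SUBMOTIF.finditer(seq.upper())]

# Made-up upstream sequence, not taken from the C. elegans genome.
upstream = "aattTGTCAGccgTGTCGGtt"
print(find_submotif(upstream))  # [(4, 'TGTCAG'), (13, 'TGTCGG')]
```

Comparing the hit positions across loci is one way to see that the submotif sits at different relative positions in UM1, UM3 and UM4.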
What is an mRNA-like ncRNA?
– Transcribed by RNA polymerase II
– Has a polyA tail
– Often spliced
– Has no ORF or only a very short one
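The last criterion can be tested computationally by measuring the longest open reading frame in a transcript. The sketch below checks the three forward frames only; the `max_codons` cutoff of 100 is an arbitrary assumption for illustration, not a threshold given in the slides.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def longest_orf(seq):
    """Length in codons (start included, stop excluded) of the longest
    ATG...stop ORF in the three forward reading frames."""
    seq = seq.upper().replace("U", "T")
    best = 0
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                best = max(best, (i - start) // 3)
                start = None
    return best

def has_only_short_orf(seq, max_codons=100):
    """True if no forward-frame ORF exceeds the (assumed) cutoff."""
    return longest_orf(seq) <= max_codons

print(longest_orf("CCATGGCCGCCTGACC"))  # 3
```

A real classifier would also need to consider ORFs without an in-frame stop codon, coding potential, and homology to known proteins.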
Bioinformatics Research Group, Institute of Computing Technology, CAS.
ENCODE pilot project
September 2003 saw the birth of the ENCODE project —
The Encyclopedia of DNA elements — the goal of which was
to identify and document the functional elements within the
genome using high-throughput methods.
Thirty-five groups took part in the project, bringing
expertise that ranged from genome annotation, to RNA-expression analysis, to comparative genomics. Their analysis
of 1% of the human genome (distributed among 44 genomic
regions) resulted in more than 200 experimental and
computational data sets. Some of the most striking results
concern transcription and its regulation.
[1]. Identification and analysis of functional elements in 1% of the human genome by
the ENCODE pilot project. Nature, 2007, 447: 799.
[2]. The ENCODE Project Consortium. Science, 2004, 306: 636.
[3]. Note: for details on ENCODE, visit http://www.genome.gov/ENCODE
We learn that as much as 93% of the interrogated region
can be transcribed, indicating that transcription is not
confined to what we (for now) identify as genes. Many
transcripts are non-coding, whereas others seem to form
fusion transcripts between ORFs that had previously been
annotated as distinct.
A gene is a union of genomic sequences encoding
a coherent set of potentially overlapping functional
products.
2. Predicted noncoding genes on chromosome 3 of the
human genome
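Producing the finished map of the human genome is an important step in genome
research, and internationally the work is proceeding chromosome by chromosome.
Finished maps of nine chromosomes (6, 7, 13, 14, 19, 20, 21, 22 and Y) have been
completed, each published in Nature. We took part in finishing human chromosome 3,
a project led by Baylor College of Medicine (USA), with specific responsibility for
ncRNA gene annotation. For this purpose we built a software package for identifying
ncRNAs. The paper was published in
Nature 440, 1194-1198 (2006).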
Noncoding genes found on human chromosome 3

RNA class              Total number   Methods (number)
snRNA                  83             H.F. (72), RFAM (80)
Y RNA                  46             H.F. (35), RFAM (45)
snoRNA (C/D box)       21             H.F. (6), RFAM (4), snoScan (17)
snoRNA (H/ACA box)     22             H.F. (12), Fisher (8), C.M (8)
tRNA                   13             H.F. (10), tRNA-Scan (9)
SRP RNA                17             H.F. (1), SRP RNA Scan (16)
miRNA                  3              H.F. (1), RFAM (3)
mRNA-like ncRNA        481            H.F.-FANTOM (1), H.F.-FLJ/H-InV (452), Unigene Filter (28)
rRNA                   10             H.F. (3), RFAM (7)
telomerase RNA         1              H.F. (3)
7SK RNA                3              H.F. (3)
snmRNA                 2              H.F. (2)
scaRNA                 1              H.F. (1)
Total                  713            872 (redundant)
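3. Built the noncoding RNA database NONCODE
We collected all experimentally confirmed ncRNA genes published in journals or
released on websites, developed the corresponding software and search tools, and
built the ncRNA database. The related paper has been submitted to Nucleic Acids
Research. Korea has asked to host a mirror of the database. In just over two months
online, the database has received more than 120,000 hits (about 2,000 per day on
average) from about 60,000 different IP addresses.
The paper was published in the first issue of Nucleic Acids
Research in 2005.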
ABSTRACT
NONCODE is an integrated knowledge database dedicated to non-coding RNAs (ncRNAs), that is to say,
RNAs that function without being translated into proteins. All ncRNAs in NONCODE were filtered automatically
from literature and GenBank, and were later manually curated. The distinctive features of NONCODE are as
follows: (i) the ncRNAs in NONCODE include almost all the types of ncRNAs, except transfer RNAs and
ribosomal RNAs. (ii) All ncRNA sequences and their related information (e.g. function, cellular role, cellular
location, chromosomal information, etc.) in NONCODE have been confirmed manually by consulting relevant
literature: more than 80% of the entries are based on experimental data. (iii) Based on the cellular process
and function, which a given ncRNA is involved in, we introduced a novel classification system, labeled process
function class, to integrate existing classification systems. (iv) In addition, some 1100 ncRNAs have been
grouped into nine other classes according to whether they are specific to gender or tissue or associated with
tumors and diseases, etc. (v) NONCODE provides a user-friendly interface, a visualization platform and a
convenient search option, allowing efficient recovery of sequence, regulatory elements in the flanking
sequences, secondary structure, related publications and other information. The first release of NONCODE
(v1.0) contains 5339 non-redundant sequences from 861 organisms, including eukaryotes, eubacteria,
archaebacteria, virus and viroids. Access is free for all users through a web interface at
http://noncode.bioinfo.org.cn.
ncRNA database
Thank you all!