…. and the RNA world

RNAs in the human genome
Sam Griffiths-Jones
The Wellcome Trust Sanger Institute
Outline
• I. Non-coding RNA
• The genome’s dark matter
• Family classification
• Genome annotation
• II. ncRNA genes in the human genome
• Rogue’s gallery
• miRNAs
• Regulatory elements
T. thermophilus - Ramakrishnan et al., Cell, 2002
Protein/RNA genes
DNA → RNA → protein (the RNA → protein step is crossed out for RNA genes)
ncRNA genes
• …. code for functional RNAs
• Many cellular machines contain RNA
• Ribosome: rRNA
• Spliceosome: snRNAs (U1, U2, U4, U5, U6)
• Telomerase: Telomerase RNA
• SRP: SRP RNA
How many genes in the human genome?
Gene sweep
• CSHL 2000-2003
• Rules
• $1 in 2000, $5 in 2001 and $20 in 2002
• A gene is a set of connected transcripts. A transcript is a set of exons connected
via transcription. At least one transcript must be expressed outside of the nucleus
and one transcript must encode a protein.
• One bet per person, per year
• Results
• 165 bets
• Mean 61710
• Lowest 25947
• Highest 153478
• Answer: 21000
Winner: Lee Rowen
• http://www.ensembl.org/Genesweep/
ncRNA genes
• Genomic dark matter
• Ignored by gene prediction methods
• Not in EnsEMBL
• Computational complexity
• ~10% of human gene count?
The RNA World
• Origin of life / central dogma paradox
• DNA needs proteins to replicate
• Proteins coded for by DNA
• RNA can be code and machinery
• SELEX, aptamers
• RNAs are remnants
• Ancient
• Essential
Biological sequence analysis
Protein easy
RNA hard
Gene finding
• Rules
• ATG
• TAA, TGA, TAG
• GT…..AG
• Compositional features
• Exon lengths
• Intron lengths
• Codon bias
• General genomic properties
• Homology
?
?
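The start/stop rules above can be sketched as a toy open reading frame scanner. This is only an illustration of the ATG and TAA/TGA/TAG rules on the forward strand; the function name `find_orfs`, the `min_codons` parameter and the example sequence are hypothetical, and real gene finders also model splice sites, composition and homology.

```python
STOP_CODONS = {"TAA", "TGA", "TAG"}


def find_orfs(dna, min_codons=2):
    """Return (start, end) spans of forward-strand ORFs.

    An ORF here starts at ATG and runs to the first in-frame stop codon.
    `end` is exclusive and includes the stop codon. `min_codons` is the
    minimum number of codons before the stop (an illustrative threshold).
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Walk the sequence codon by codon in this reading frame.
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)


example = "CCATGAAATTTGGGTAACC"  # made-up sequence, not from the slides
for start, end in find_orfs(example):
    print(start, end, example[start:end])
```

No comparable rule set exists for ncRNA genes, which is why this kind of scan does not find them.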
Protein sequence analysis
Query:   1 MKFYTIKLPKFLGGIVRAMLGSFRKD  26
           M+  TIKLPKFL  IVR   G+ + D
Sbjct: 390 MRIMTIKLPKFLAKIVRMFKGNKKSD 467
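A minimal sketch of how the middle line of such a pairwise alignment can be derived, using the two sequences shown above. It marks identities only; the "+" symbols for conservative substitutions need a substitution matrix and are not reproduced here. The function name `identity_midline` is hypothetical.

```python
def identity_midline(query, subject):
    """Show a residue where the two aligned rows agree, else a space."""
    if len(query) != len(subject):
        raise ValueError("aligned rows must have equal length")
    return "".join(q if q == s else " " for q, s in zip(query, subject))


query = "MKFYTIKLPKFLGGIVRAMLGSFRKD"
subject = "MRIMTIKLPKFLAKIVRMFKGNKKSD"

midline = identity_midline(query, subject)
identities = sum(c != " " for c in midline)

print(query)
print(midline)
print(subject)
print(f"{identities}/{len(query)} identical positions")
```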
RNA sequence analysis
Why are families useful?
• Alignments of related sequences
• Phylogenetic trees
• Homologue detection
• Genome annotation
• Secondary structure prediction
S. cerevisiae      UCCUCGUGAGAGGG
P. canadensis      GUCUC.UGAGAGAU
P. strasburgensis  CUCUC.UGAGAGAG
K. thermotolerans  UUCUCGUGAGAGAA
SS                 <<<<<....>>>>>
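The structure line can be read mechanically: each "<" pairs with its matching ">". The sketch below parses the SS line of the alignment above and checks that every sequence can form each annotated pair, counting G-U wobble pairs as allowed. Function and variable names are hypothetical; this is an illustration of why a structure-annotated alignment is informative, not how Rfam itself processes alignments.

```python
def parse_pairs(structure):
    """Return sorted (i, j) index pairs for matching '<' and '>' symbols."""
    stack, pairs = [], []
    for i, ch in enumerate(structure):
        if ch == "<":
            stack.append(i)
        elif ch == ">":
            if not stack:
                raise ValueError("unbalanced structure line")
            pairs.append((stack.pop(), i))
    if stack:
        raise ValueError("unbalanced structure line")
    return sorted(pairs)


# Watson-Crick pairs plus the G-U wobble pair.
CAN_PAIR = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

alignment = {
    "S. cerevisiae": "UCCUCGUGAGAGGG",
    "P. canadensis": "GUCUC.UGAGAGAU",
    "P. strasburgensis": "CUCUC.UGAGAGAG",
    "K. thermotolerans": "UUCUCGUGAGAGAA",
}
ss = "<<<<<....>>>>>"

pairs = parse_pairs(ss)
consistent = {
    name: all((seq[i], seq[j]) in CAN_PAIR for i, j in pairs)
    for name, seq in alignment.items()
}
# Distinct base pairs seen at each paired column; more than one hints at covariation.
pair_variants = {
    (i, j): {seq[i] + seq[j] for seq in alignment.values()} for i, j in pairs
}

for name, ok in consistent.items():
    print(f"{name:18s} all pairs compatible: {ok}")
```

The outermost pair changes between U-G, G-U, C-G and U-A across the four species while remaining pairable, which is the kind of compensatory change a sequence-only comparison would score as mismatches.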
RNA models
• Covariance models (profile-SCFGs)
• Analogous to profile HMMs
• Statistical representation of the alignment with structure
• Homologue detection
• Multiple sequence alignment
• (Sean Eddy)
Protein sequence analysis - HMMs
ERELKKQKKLSNR
ERELKK..KQSNR
ERELKRQRKQSNR
KAAAQRQKMIKNR
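[Figure: profile HMM built from the alignment, with begin (B), match (M), insert (I), delete (D) and end (E) states, shown with the sequence EREKKKRKQSNR]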
RNA sequence analysis - SCFGs
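[Figure: covariance model for a three-base-pair hairpin; pair states (MP) emit the base pairs G–C, G–C and A–U, and left-emitting states (ML) emit the loop bases]
GGAAGAUCC
<<<...>>>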
RNA models - problems
• Problems
• Speed
• Memory
• Sensitivity
• Speed
• 30 billion bases in DBs
• O(N^3) wrt model length
• Small model: 300 b/s
• 28S rRNA: 200 b/day
• Sanger supercomputers
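A back-of-the-envelope calculation shows why these rates matter. The sketch below takes the figures quoted above and assumes, purely for illustration, a single processor searching the whole database once with one model; the variable names are hypothetical.

```python
DATABASE_BASES = 30_000_000_000  # "30 billion bases in DBs"
SECONDS_PER_DAY = 86_400
DAYS_PER_YEAR = 365.25

small_model_rate = 300  # bases per second (small model)
large_model_rate = 200  # bases per day (28S rRNA model)

# Time to scan the database once with each model on one processor.
small_model_days = DATABASE_BASES / small_model_rate / SECONDS_PER_DAY
large_model_years = DATABASE_BASES / large_model_rate / DAYS_PER_YEAR

print(f"small model: about {small_model_days:,.0f} CPU days")
print(f"28S rRNA model: about {large_model_years:,.0f} CPU years")
```

Even the small model needs roughly three CPU years for one pass, so searches have to be spread over many processors, and a model the size of 28S rRNA is out of reach at these rates.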
Rfam 5.0
• http://www.sanger.ac.uk/Software/Rfam/
• http://rfam.wustl.edu/
• 176 ncRNA families
• Structure annotated alignments
• Species distributions
• Keyword searches
• Sequence searches
• >235000 regions in EMBL 76
ncRNA families
What we have:
• tRNA
• 5S, 5.8S rRNAs
• Spliceosomal RNAs
• SRP, RNaseP
• Telomerase, tmRNA, vault
• E. coli screens
• Some snoRNAs
• Some miRNAs
• Some UTR elements
• Self-splicing introns
• …… more
What we don’t:
• 18S, 23S rRNAs
• Other large things (Xist etc)
• Lots of snoRNAs
• Lots of miRNAs
• Many small families
• Unknowns
Genome annotation
• General
One tool fits all
Automatic
Comprehensive
Great for prokaryotes
Compute drain
Eukaryotic complications
• Specific
Heuristics
Increased speed
Increased sensitivity
One family, one gene finder
tRNAscan-SE, BRUCE, SRPscan, snoscan
Outline
• I. Non-coding RNA
• The genome’s dark matter
• Family classification
• Genome annotation
• II. ncRNA genes in the human genome
• Rogue’s gallery
• miRNAs
• Regulatory elements
International Human Genome Sequencing Consortium, Nature, 2001
X chromosome inactivation in mammals
• Dosage compensation between XX and XY
Xist – X inactive-specific transcript
Avner and Heard, Nat. Rev. Genetics 2001 2(1):59-67
International Human Genome Sequencing Consortium, Nature, 2001
microRNAs
• A novel class of ncRNA gene
• Products are ~22 nt RNAs
• Precursors are 70-100 nt hairpins
• Gene regulation by pairing to mRNA
• Unknown before 2001
Timeline
• Late 70’s – lin-4 and let-7 regulate developmental timing in worm
• 1993 – lin-4 codes for a ~22 nt RNA, complementary to 3’ UTR of lin-14
• 2000 – …. so does let-7 (stRNAs)
• 2000 – let-7 is conserved in bilaterally symmetric animals
• 2001 – ~100 miRNAs discovered by cloning in worm, fly and human
• 2002 – miRNAs conserved in plants
• 2002 – Science magazine’s breakthrough of the year
• 2002 – miRNA Registry established
• 2003 – miRNAs may account for 1% of total gene count in animals
• 2003 – a few targets of miRNAs identified
• 2004 – miRNA Registry has 719 miRNAs
[Figure: number of publications matching “miRNA” in PubMed, by year, 1999-2004]
miRNA biogenesis
Adapted from DP Bartel, Cell 116:281-297(2004)
miRNA targets
DP Bartel, Cell 116:281-297(2004)
PNAS 99:15524-15529(2002)
miRNA Registry 3.0
• Searchable database of published miRNAs
• http://www.sanger.ac.uk/Software/Rfam/mirna/
• 719 entries from human, mouse, rat, worm, fly, and plants
• Naming service
• Pre-publication
• Unique names for distinct miRNAs
• Confidentiality for unpublished data
Genomic context
• 180 known miRNAs in human
• 130 intergenic (60 polycistronic, 70 monocistronic)
• 50 intronic
ncRNA gene contexts
tRNA, snRNAs, SRP, RNase P …..
AAAAAAA
Xist
miRNAs
miRNAs, snoRNAs
Inside-out genes
protein
Inside-out genes
snoRNA
degradation
Gas5, UHG, U17HG, U19H
Cis-regulatory RNA elements
PrfA in Listeria
25°C
37°C
PrfA
Virulence gene
expression
UTR elements in human
• IRE: regulation of iron metabolism
• SECIS: UGA -> SeC
• Histone 3’ UTR: 3’ end formation
• Vimentin 3’ UTR: mRNA localisation
• CAESAR: CTGF repression
• …. many more
ncRNAs in human genome
• tRNA: 600
• 18S rRNA: 200
• 5.8S rRNA: 200
• 28S rRNA: 200
• 5S rRNA: 200
• snoRNA: 300
• miRNA: 250
• U1: 40
• U2: 30
• U4: 30
• U5: 30
• U6: 20
• U4atac: 5
• U6atac: 5
• U11: 5
• U12: 5
• SRP RNA: 1
• RNase P RNA: 1
• Telomerase RNA: 1
• RNase MRP: 1
• Y RNA: 5
• Vault: 4
• 7SK RNA: 1
• Xist: 1
• H19: 1
• BIC: 1
• Antisense RNAs: 1000s?
• Cis reg regions: 100s?
• Others: ?
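As a rough check on the "10% of human gene count" estimate, the counts listed above can be summed and compared with the 21000 protein-coding gene figure quoted in the Gene sweep results. The sketch leaves out the open-ended entries (antisense RNAs, cis-regulatory regions, others), and using 21000 as the denominator is an assumption made here for illustration.

```python
# Approximate gene counts as listed above; entries with "?" are omitted.
ncrna_counts = {
    "tRNA": 600, "18S rRNA": 200, "5.8S rRNA": 200, "28S rRNA": 200,
    "5S rRNA": 200, "snoRNA": 300, "miRNA": 250,
    "U1": 40, "U2": 30, "U4": 30, "U5": 30, "U6": 20,
    "U4atac": 5, "U6atac": 5, "U11": 5, "U12": 5,
    "SRP RNA": 1, "RNase P RNA": 1, "Telomerase RNA": 1, "RNase MRP": 1,
    "Y RNA": 5, "Vault": 4, "7SK RNA": 1, "Xist": 1, "H19": 1, "BIC": 1,
}
PROTEIN_CODING_GENES = 21_000  # the Gene sweep "Answer" figure

total_ncrna = sum(ncrna_counts.values())
fraction = total_ncrna / PROTEIN_CODING_GENES

print(f"{total_ncrna} ncRNA genes, {fraction:.1%} of {PROTEIN_CODING_GENES}")
```

The enumerated families alone come to just over 10% of the protein-coding count, before any of the open-ended categories are added.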
Summary
• ncRNA genes ….
• have diverse and essential roles
• may be relics of ancient RNA-based life
• provide major computational challenges
• are often ignored!
• >10% of human gene count?
• Family classifications are useful for ….
• finding homologues
• predicting structure
• automatic genome annotation
Just plain weird
• Vault is huge
• 13 MDa
• 30 x 55 nm
• Described in 1986
• 3 proteins
• MVP
• TEP1
• vPARP
• vRNA
• Conserved in higher euks
http://vaults.arc.ucla.edu/sci/sci_home.htm
Thanks
• Alex Bateman
• Mhairi Marshall
• Simon Moxon
• Ajay Khanna
• Sean Eddy
• Informatics support group
• Ian Holmes
• Bjarne Knudsen
• Robbie Klein
• David Bartel
• Tom Tuschl
• Victor Ambros
Bibliography
• Computational genomics of non-coding RNA genes. Sean R.
Eddy, Cell 109:137-140 (2002)
• Non-coding RNAs: the architects of eukaryotic complexity. John
S. Mattick, EMBO Reports 2:986-991 (2001)
• MicroRNAs: Genomics, biogenesis, mechanism and function.
David P. Bartel, Cell 116:281-297 (2004)
• Rfam: An RNA family database. Sam Griffiths-Jones et al.,
Nucl. Acids Res. 31:439-441 (2003)
sgj@sanger.ac.uk
http://www.sanger.ac.uk/Software/Rfam/
rfam@sanger.ac.uk
http://www.stats.ox.ac.uk/~hein/HumanGenome/