RNAs in the human genome
Sam Griffiths-Jones
The Wellcome Trust Sanger Institute

Outline
• I. Non-coding RNA
  • The genome’s dark matter
  • Family classification
  • Genome annotation
• II. ncRNA genes in the human genome
  • Rogues’ gallery
  • miRNAs
  • Regulatory elements

[Figure: T. thermophilus (Ramakrishnan et al., Cell, 2002)]

Protein/RNA genes
[Diagram: DNA → RNA → protein, with the RNA → protein step crossed out (X)]

ncRNA genes
• … code for functional RNAs
• Many cellular machines contain RNA
  • Ribosome: rRNA
  • Spliceosome: snRNAs (U1, U2, U4, U5, U6)
  • Telomerase: telomerase RNA
  • SRP: SRP RNA

How many genes in the human genome?

Gene sweep
• CSHL 2000-2003
• Rules
  • $1 in 2000, $5 in 2001 and $20 in 2002
  • A gene is a set of connected transcripts. A transcript is a set of exons connected via transcription. At least one transcript must be expressed outside of the nucleus and one transcript must encode a protein.
  • One bet per person, per year
• Results
  • 165 bets
  • Mean 61710
  • Lowest 25947
  • Highest 153478
• Answer: 21000
• Winner: Lee Rowen
• http://www.ensembl.org/Genesweep/

ncRNA genes
• Genomic dark matter
• Ignored by gene prediction methods
• Not in EnsEMBL
• Computational complexity
• ~10% of human gene count?

The RNA World
• Origin of life / central dogma paradox
  • DNA needs proteins to replicate
  • Proteins coded for by DNA
• RNA can be code and machinery
  • SELEX, aptamers
• RNAs are remnants
  • Ancient
  • Essential

Biological sequence analysis
• Protein: easy
• RNA: hard

Gene finding
• Rules
  • ATG
  • TAA, TGA, TAG
  • GT…..AG
• Compositional features
  • Exon lengths
  • Intron lengths
  • Codon bias
  • General genomic properties
• Homology

Protein sequence analysis
  Query: 1   MKFYTIKLPKFLGGIVRAMLGSFRKD 26
             M+  TIKLPKFL  IVR   G+ + D
  Sbjct: 390 MRIMTIKLPKFLAKIVRMFKGNKKSD 467

RNA sequence analysis
[Figures not recoverable]

Why are families useful?
• Alignments of related sequences
• Phylogenetic trees
• Homologue detection
• Genome annotation
• Secondary structure prediction

  S. cerevisiae      UCCUCGUGAGAGGG
  P. canadensis      GUCUC.UGAGAGAU
  P. strasburgensis  CUCUC.UGAGAGAG
  K. thermotolerans  UUCUCGUGAGAGAA
  SS                 <<<<<....>>>>>

RNA models
• Covariance models (profile-SCFGs)
• Analogous to profile-HMMs
• Statistical representation of the alignment with structure
• Homologue detection
• Multiple sequence alignment
• (Sean Eddy)

Protein sequence analysis - HMMs
  ERELKKQKKLSNR
  ERELKK..KQSNR
  ERELKRQRKQSNR
  KAAAQRQKMIKNR
[Diagram: profile HMM with begin (B) and end (E) states and rows of delete (D), match (M) and insert (I) states; labelled with the sequence EREKKKRKQSNR]

RNA sequence analysis - SCFGs
[Diagram: SCFG states for a hairpin. Pair states (MP) emit the base pairs G-C, G-C and A-U; left-emitting states (ML) emit the loop bases G, A, A; structure line <<<...>>>]

RNA models - problems
• Problems
  • Speed
  • Memory
  • Sensitivity
• Speed
  • 30 billion bases in DBs
  • O(N^3) wrt model length
  • Small model: 300 b/s
  • 28S rRNA: 200 b/day
  • Sanger supercomputers

Rfam 5.0
• http://www.sanger.ac.uk/Software/Rfam/
• http://rfam.wustl.edu/
• 176 ncRNA families
  • Structure annotated alignments
  • Species distributions
  • Keyword searches
  • Sequence searches
• >235000 regions in EMBL 76

ncRNA families
What we have:
• tRNA
• 5S, 5.8S rRNAs
• Spliceosomal RNAs
• SRP, RNaseP
• Telomerase, tmRNA, vault
• E. coli screens
• Some snoRNAs
• Some miRNAs
• Some UTR elements
• Self-splicing introns
• …… more
What we don’t:
• 18S, 23S rRNAs
• Other large things (Xist etc)
• Lots of snoRNAs
• Lots of miRNAs
• Many small families
• Unknowns

Genome annotation
• General
  • One tool fits all
  • Automatic
  • Comprehensive
  • Great for prokaryotes
  • Compute drain
  • Eukaryotic complications
• Specific
  • Heuristics
  • Increased speed
  • Increased sensitivity
  • One family, one gene finder
  • tRNAscan-SE, BRUCE, SRPscan, snoscan

[Figure: International Human Genome Sequencing Consortium, Nature, 2001]

X chromosome inactivation in mammals
[Diagram: XX and XY]
• Dosage compensation
• Xist: X inactive-specific transcript
Avner and Heard, Nat. Rev. Genetics 2(1):59-67 (2001)

[Figure: International Human Genome Sequencing Consortium, Nature, 2001]

microRNAs
• A novel class of ncRNA gene
• Products are ~22 nt RNAs
• Precursors are 70-100 nt hairpins
• Gene regulation by pairing to mRNA
• Unknown before 2001

Timeline
• Late 1970s: lin-4 and let-7 regulate developmental timing in worm
• 1993: lin-4 codes for a ~22 nt RNA, complementary to the 3’ UTR of lin-14
• 2000: … so does let-7 (stRNAs)
• 2000: let-7 is conserved in bilaterally symmetric animals
• 2001: ~100 miRNAs discovered by cloning in worm, fly and human
• 2002: miRNAs conserved in plants
• 2002: Science magazine’s breakthrough of the year
• 2002: miRNA Registry established
• 2003: miRNAs may account for 1% of total gene count in animals
• 2003: a few targets of miRNAs identified
• 2004: miRNA Registry has 719 miRNAs

Number of publications
[Chart: “miRNA” in PubMed, publications per year, 1999-2004; y-axis 0-140]

miRNA biogenesis
[Figure adapted from DP Bartel, Cell 116:281-297 (2004)]

miRNA targets
[Figures: DP Bartel, Cell 116:281-297 (2004); PNAS 99:15524-15529 (2002)]

miRNA Registry 3.0
• Searchable database of published miRNAs
  • http://www.sanger.ac.uk/Software/Rfam/mirna/
  • 719 entries from human, mouse, rat, worm, fly, and plants
• Naming service
  • Pre-publication
  • Unique names for distinct miRNAs
  • Confidentiality for unpublished data

Genomic context
• 180 known miRNAs in human
  • 130 intergenic
    • 60 polycistronic
    • 70 monocistronic
  • 50 intronic

ncRNA gene contexts
[Diagram labels:]
• tRNA, snRNAs, SRP, RNase P …..
• AAAAAAA (poly-A tail): Xist, miRNAs
• miRNAs, snoRNAs
• Inside-out genes

Inside-out genes
[Diagram labels: protein; snoRNA; degradation]
• Gas5, UHG, U17HG, U19H

Cis-regulatory RNA elements
• PrfA in Listeria
[Diagram: 25 °C versus 37 °C; PrfA; virulence gene expression]

UTR elements in human
• IRE: regulation of iron metabolism
• SECIS: UGA -> SeC
• Histone 3’ UTR: 3’ end formation
• Vimentin 3’ UTR: mRNA localisation
• CAESAR: CTGF repression
• …. many more

ncRNAs in human genome
• tRNA        600        • SRP RNA            1
• 18S rRNA    200        • RNase P RNA        1
• 5.8S rRNA   200        • Telomerase RNA     1
• 28S rRNA    200        • RNase MRP          1
• 5S rRNA     200        • Y RNA              5
• snoRNA      300        • Vault              4
• miRNA       250        • 7SK RNA            1
• U1           40        • Xist               1
• U2           30        • H19                1
• U4           30        • BIC                1
• U5           30
• U6           20
• U4atac        5        • Antisense RNAs     1000s?
• U6atac        5        • Cis reg regions    100s?
• U11           5        • Others             ?
• U12           5

Summary
• ncRNA genes ….
  • have diverse and essential roles
  • may be relics of ancient RNA-based life
  • provide major computational challenges
  • are often ignored!
  • >10% of human gene count?
• Family classifications are useful for ….
  • finding homologues
  • predicting structure
  • automatic genome annotation

Just plain weird
• Vault is huge
  • 13 MDa
  • 30 x 55 nm
• Described in 1986
• 3 proteins
  • MVP
  • TEP1
  • vPARP
• vRNA
• Conserved in higher eukaryotes
http://vaults.arc.ucla.edu/sci/sci_home.htm

Thanks
• Alex Bateman
• Mhairi Marshall
• Simon Moxon
• Ajay Khanna
• Sean Eddy
• Informatics support group
• Ian Holmes
• Bjarne Knudsen
• Robbie Klein
• David Bartel
• Tom Tuschl
• Victor Ambros

Bibliography
• Computational genomics of non-coding RNA genes. Sean R. Eddy, Cell 109:137-140 (2002)
• Non-coding RNAs: the architects of eukaryotic complexity. John S. Mattick, EMBO Reports 2:986-991 (2001)
• MicroRNAs: Genomics, biogenesis, mechanism and function. David P. Bartel, Cell 116:281-297 (2004)
• Rfam: An RNA family database. Sam Griffiths-Jones et al., Nucl. Acids Res. 31:439-441 (2003)

sgj@sanger.ac.uk
http://www.sanger.ac.uk/Software/Rfam/
rfam@sanger.ac.uk
http://www.stats.ox.ac.uk/~hein/HumanGenome/
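
Worked example: covariation in the family alignment

The "Why are families useful?" slide shows a four-sequence alignment with a structure line (SS). A rough Python sketch of how that structure line can be read against the sequences is given here. Matching species names to rows by their order on the slide is an assumption, and the function names are made up for illustration; this is not how Rfam or covariance-model software is implemented.

```python
# Alignment and structure line (SS) as transcribed from the slide.
# Assumption: species names map to rows in the order they appear.
ALIGNMENT = {
    "S. cerevisiae":     "UCCUCGUGAGAGGG",
    "P. canadensis":     "GUCUC.UGAGAGAU",
    "P. strasburgensis": "CUCUC.UGAGAGAG",
    "K. thermotolerans": "UUCUCGUGAGAGAA",
}
SS = "<<<<<....>>>>>"

# Watson-Crick pairs plus G-U wobble pairs.
VALID_PAIRS = {"AU", "UA", "GC", "CG", "GU", "UG"}


def paired_columns(ss):
    """Return sorted (i, j) column pairs from a '<' / '>' bracket string."""
    stack, pairs = [], []
    for j, symbol in enumerate(ss):
        if symbol == "<":
            stack.append(j)
        elif symbol == ">":
            pairs.append((stack.pop(), j))
    return sorted(pairs)


def pair_report(alignment, ss):
    """For each paired column, list the base pair seen in every sequence."""
    report = {}
    for i, j in paired_columns(ss):
        report[(i, j)] = [seq[i] + seq[j] for seq in alignment.values()]
    return report


REPORT = pair_report(ALIGNMENT, SS)
for columns, pairs in REPORT.items():
    status = "all pair" if all(p in VALID_PAIRS for p in pairs) else "mismatch"
    print(columns, pairs, status, f"{len(set(pairs))} distinct pair type(s)")
```

In this alignment every bracketed column pair forms a Watson-Crick or wobble pair in all four sequences, even where the bases themselves differ between species. That conservation of pairing rather than of sequence is the kind of signal a structure-aware model can use and a sequence-only comparison cannot.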
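
Worked example: the start/stop rules from the "Gene finding" slide

The "Gene finding" slide lists ATG and the stop codons TAA, TGA and TAG as rules for protein-coding genes. As a minimal sketch of why such rules make protein genes comparatively easy to find, the function below scans one strand for open reading frames. The function name, the `min_codons` parameter and the toy sequence are hypothetical; real gene finders also use splice signals (GT…..AG), compositional features and homology, none of which is modelled here.

```python
STOP_CODONS = {"TAA", "TGA", "TAG"}


def find_orfs(seq, min_codons=2):
    """Return (start, end, orf) for each ATG followed by an in-frame stop.

    Only the forward strand is scanned; `end` is exclusive and the
    returned ORF string includes the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for start in range(len(seq) - 2):
        if seq[start:start + 3] != "ATG":
            continue
        # Walk codon by codon from the start until the first stop codon.
        for pos in range(start + 3, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3, seq[start:pos + 3]))
                break
    return orfs


# Hypothetical toy sequence, not taken from the slides.
TOY = "GGATGGCCAAATAAGG"
print(find_orfs(TOY))
```

ncRNA genes have no start codon, stop codon or codon bias, so a scan of this kind has nothing to anchor on. This is one reading of the slides' point that ncRNA genes are ignored by gene prediction methods.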
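
Worked example: search time from the "RNA models - problems" slide

The speed figures on the "RNA models - problems" slide imply very long run times on a single processor. The arithmetic is sketched below under the assumption of one serial scan of the whole database at the quoted rates, with no heuristic pre-filtering; the variable and function names are illustrative only.

```python
DB_BASES = 30_000_000_000   # "30 billion bases in DBs"
SECONDS_PER_DAY = 86_400
DAYS_PER_YEAR = 365.25


def scan_days_at_rate_per_second(bases, bases_per_second):
    """Days needed to scan `bases` at a rate given in bases per second."""
    return bases / bases_per_second / SECONDS_PER_DAY


def scan_days_at_rate_per_day(bases, bases_per_day):
    """Days needed to scan `bases` at a rate given in bases per day."""
    return bases / bases_per_day


small_model_days = scan_days_at_rate_per_second(DB_BASES, 300)  # small model
large_model_days = scan_days_at_rate_per_day(DB_BASES, 200)     # 28S rRNA

print(f"small model: {small_model_days:,.0f} days "
      f"(~{small_model_days / DAYS_PER_YEAR:.1f} years)")
print(f"28S rRNA:    {large_model_days:,.0f} days "
      f"(~{large_model_days / DAYS_PER_YEAR:,.0f} years)")
```

Under these assumptions a single small model takes roughly three years of serial compute and a 28S rRNA model is out of reach altogether. That is consistent with the slides' mention of supercomputers and of family-specific heuristic tools.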