9 January 2009

1. Introduction
2. Methods / Algorithms (Liming Cai)
3. Results - Searching Genomes
4. Results - Telomerase RNAs, Other Utilities
5. Future Plans

Changing Views of RNA

             Information   Catalysis      Regulation
Pre 1944     Protein       Protein        Protein
1944-1982    DNA, RNA      Protein        Protein
Post 1982    DNA, RNA      Protein, RNA   Protein
Post 1998    DNA, RNA      Protein, RNA   Protein, RNA
Post 2008?   80% of Human Genome Transcribed?

Bioinformatic Challenges with ncRNA
Central Paradigm of Bioinformatics, Ability to Predict:
Genome » Expressed Molecules » Phenotype
• Predict structure of an RNA from sequence.
• Align RNAs on the basis of their secondary (2°) structure.
• Find new instances of known ncRNAs in genomes and sequence databases.
• Predict transcripts from genomic sequence (both known and unknown RNAs).

Conserved RNA Secondary Structures
[Figure: covariation at paired positions (C G, A U versus G C, U A)]
RNA structure is often more conserved than its primary sequence.
Search for structures, not sequence.
Campbell, Biology

RNA Pseudoknot
Searls, D.B. (1992) The linguistics of DNA. American Scientist 80: 579-591.
Many non-coding RNAs have pseudoknots that are essential to their function.

Telomerase RNA
1. RNA structure is conserved, but sequence and lengths are not.
2. The pseudoknot is essential to function.
http://tolweb.org/tree/
Lin, J. et al. (2004) PNAS 101: 14713-14718.
Tzfati, Y. et al. (2003) Genes & Dev. 17: 1779-1788.

Bacterial tmRNAs
-A-B-D-E-F-G-H-g-h-I-J-j-i-K-L-M-N-m-O-o-l-k-n-P-p-Q-R-S-r-q-s-T-U-V-W-X-v-u-t-Z-!-z-1-@-#-2-3-x-w-f-e-d-b-$-4-a-
Pseudoknots: PK1, PK2, PK3, PK4
Felden et al. (2001) NAR 29: 1602-1607.
Functions in trans-translation
Large Database of Structures Available

Goals
• Search genomic sequences for RNA structures, not primary sequence.
• Predicting the structure of RNAs from primary sequence.
• Aligning RNAs to each other by structure.
• Study the evolution of RNA structure.
• Eventually, predict new ncRNA from genomic sequence.
Our methods handle pseudoknots as well as stem-loops.

• Free Energy Minimization: Zuker
• Probabilistic Models: Eddy, Durbin
• Tree Decomposition: Cai

Zuker energy minimization: lowest ΔG from stacking.
Durbin et al., Fig. 10.10

Hidden Markov Models - Sequence Alignment
CUCGCACGGUGCUGA-AUGCCCGUA
CUCGCACAG-GCUGAGAUGCUUGUA
Consider aligning the second sequence to the first.
At each position there are three possible states: match, insertion, deletion.
• Match state: 16 possibilities (A/A, A/C, A/G, A/U, G/A, …)
• Insert or Delete states: 4 possibilities (A/-, C/-, G/-, U/-)
[Figure: profile HMM with Begin (B), Match (M), Insert (I), Delete (D) and End (E) states]

Hidden Markov Models - Sequence Alignment
CUCGCACGGUGCUGA-AUGCCCGUA
CUCGCACAG-GCUGAGAUGCUUGUA
• Transitions: switches between Match, Insert, Delete
• Emissions: what each state does in the alignment (A/A, A/C, A/G, A/U, G/A, … or A/-, C/-, G/-, U/-)
• Probabilities for emissions and state transitions
We use HMMs for non-paired regions of RNA structures.

Chomsky Transformational Grammars
Colorless green ideas sleep furiously.
A grammar generates a string of symbols (letters).
Terminals and Non-terminals.
  S → aS
  S → gS
  S → ε
or:
  S → aS | gS | ε
Example derivation:
  S ⇒ aS ⇒ agS ⇒ aggS ⇒ agg
We use transformational grammars for paired regions (stems) of RNA structures.

RNA Stem Loop
Durbin et al., p. 243
  S  → aW1u | cW1g | gW1c | uW1a
  W1 → aW2u | cW2g | gW2c | uW2a
  W2 → aW3u | cW3g | gW3c | uW3a
  W3 → gaaa | gcaa
Example derivation:
  S ⇒ aW1u ⇒ acW2gu ⇒ acgW3cgu ⇒ acggaaacgu
[Figure: parse tree format, with S, W1, W2, W3 generating a c g g a a a c g u]

Profiling Structurally Aligned RNAs
• Loops (unpaired): Hidden Markov Models
• Stems (paired): Stochastic Grammar
[Figure: HMM states (B, M, I, D, E) for the loops combined with grammar non-terminals (S, W1, W2, W3) for the stems of a c g g a a a c g u]

Searching by Profiling
• Durbin & Eddy Model
• Hidden Markov, Stochastic Grammar
• Aligned by structure
[Figure: the profile model is aligned against a scanning window (target sequence) that moves along the genome]

Computational Complexity
Memory and/or Time Requirements

             Stem Loops             Pseudoknots
Grammar      Context Free Grammar   Context Sensitive Grammar
Complexity   O(N^4)                 O(N^6)

Filters (Weinberg & Ruzzo)
Searching sequences for RNA structures by stochastic grammars is very slow.
1. Find a conserved subsequence or simple structure.
2. Search genome or database for all matches.
3. Screen region around match in detail.
NNNNNNNNNNNNNNNNNNNCGCCNNNNNNNNNNNNNNNNNNNNNNNNNN

Liming's New Approach: Conformational Graph / Tree Decomposition
RNA Structure » Graph » Tree
Robertson, Neil; Seymour, Paul D. (1984), "Graph minors III: Planar tree-width", Journal of Combinatorial Theory, Series B 36: 49-64.
1. Represent RNA secondary structure as a graph.
2. Use tree decomposition of the structure graph to reduce the complexity of the problem.
3. The structure-sequence alignment problem becomes a subgraph isomorphism problem.

Conformational Graph of an RNA
• Stem: CM
• Loop: HMM
• Structure: (mixed) graph
[Figure: RNA secondary structure and its conformational graph, drawn 5' to 3']
Campbell, Biology

Tree Decomposition of an RNA Graph
Robertson, N.; Seymour, P.D. (1986), "Graph minors II: Algorithmic aspects of tree-width", Journal of Algorithms 7: 309-322.
[Figure: structure graph on vertices a, a', b, b', c, c', d, d', x, y and its tree of bags]
t = tree width = |bag| - 1 (size of the largest bag, minus one)
O(k^t m^2 n)

Alignment is Calculated in the Tree Bags
[Figure: tree of bags over the vertices a, a', b, b', c, c', d, d', x, y]
• Each bag (node) has a dynamic programming table in it.
• Alignment of vertices (stems and loops) between profile model and target.
• Calculated from bottom to top; traceback.
[Figure: the profile model (stem: CM, loop: HMM, structure: mixed graph) is aligned against a scanning window (target sequence) that moves along the genome]

Yong Wu, Zhibin Huang, Joseph Robertson

RNATOPS
1. Takes input alignment of training sequences.
2. Constructs a profile model of the structure.
3. Searches sequences for instances of the structure.
4. Can use filters (sequence or structure).

Cross Validation Tests
• Genomes where the position is known.
• Leave the ncRNA from the genome to be scanned out of the training set.
• Determine if the program can find the ncRNA.

Bacterial RNase P B, genomes of ~3 x 10^6 bp

Program       Training Set   Sensitivity   Specificity   Avg. Time
RNATOPS 1.0   9              100%          100%          15 min.
Infernal      9              100%          100%          98 min.
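The filter strategy used to speed up these searches (find a conserved subsequence, collect every match, then screen only the region around each match in detail) can be illustrated with a minimal Python sketch. This is not the RNATOPS or Infernal implementation: the function names, the `flank` parameter and the toy scoring function are all assumptions made for illustration, and the scorer is only a stand-in for the expensive structure-profile alignment.

```python
import re

# Watson-Crick and G-U wobble pairs accepted by the toy scorer.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def filter_hits(genome, motif, flank):
    """Yield (start, end, window) for every occurrence of motif.

    The window extends `flank` bases to each side of the match and is
    clipped at the ends of the genome. A lookahead is used so that
    overlapping occurrences are reported as well.
    """
    pattern = re.compile("(?=" + re.escape(motif) + ")")
    for match in pattern.finditer(genome):
        start = max(0, match.start() - flank)
        end = min(len(genome), match.start() + len(motif) + flank)
        yield start, end, genome[start:end]


def toy_stem_score(window):
    """Hypothetical detailed screen: count the consecutive base pairs formed
    when the 5' end of the window is folded back onto its 3' end."""
    score = 0
    i, j = 0, len(window) - 1
    while i < j and (window[i], window[j]) in PAIRS:
        score += 1
        i += 1
        j -= 1
    return score


def search_with_filter(genome, motif, flank, detailed_score, threshold):
    """Run the slow scorer only on the windows that pass the cheap filter."""
    hits = []
    for start, end, window in filter_hits(genome, motif, flank):
        score = detailed_score(window)
        if score >= threshold:
            hits.append((start, end, score))
    return hits


if __name__ == "__main__":
    genome = "UUUUGGGACGCCUCCCUUUU"
    print(search_with_filter(genome, "CGCC", 4, toy_stem_score, 3))
```

The point of the design is that the expensive scorer is called once per filter match rather than once per genome position, which is why a filter only helps when a suitably conserved feature exists.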
Yeast Telomerase RNA (Saccharomyces), genomes of ~10^7 bp

Program       Training Set   Sensitivity   Specificity   Avg. Time
RNATOPS 1.0   5              100%          100%          6 min.
Infernal      5              100%          100%          442 min.

RNATOPS-W
RNATOPS-W is web server software to search sequences for RNA secondary structures, including pseudoknots.
• automatic selection of an HMM (sequence) filter
• interactive selection of a substructure filter
Yingfeng Wang

Input Profile for RNATOPS and -W
Pasta Format (Pairing + Fasta)
>pairs
AAAAA...BBBBBBB...aaaaa.....bbbbbbbAAAAA...aaaaa
>index
11111...1111111...11111.....111111122222...22222
>sequence 1
CGGUGCAGGCUA-GAGACCACCGAUUUUAAUCGUAGCCGGCACCACCG
>sequence 2
CGGUGAAGGCUG-GAGACCACCGUAA--UCGCAGCCGGUGAACCACCG
>sequence 3
CGGUGAAGGCUA-GAGACCACCGUUUUUAAUCGUAGCCGGCACCACCG

Typical RNATOPS-W Output
hit 2
---->genome_seq2(380-415)[379-466]
Plus search result
Hit Positions: 380-415
Alignment score = 2.25878
Alignment to the filter
GTCTCTTGGCCCAGTTGGTTAAGGCACCGTGCTAAT
mmmmmmmmmmmmmmmimmmmmmmmmmmmmmmmmmmm
Extension positions: 379-466
Extension of the hit
GGTCTCTTGGCCCAGTTGGTTAAGGCACCGTGCTAATAACGCGGGGATCA
GCGGTTCGATCCCGCTAGAGACCATAGTGCTGGAGTTG

RNATOPS and RNATOPS-W
• Search genomes for new instances of RNAs on the basis of structure
• Can use filters if appropriate conserved features exist
• Fast and accurate
• Requires a training set
• Version 1.0 does not handle variable structures

Improvements In Process
• 10x - 100x increase in speed
• Allow for presence or absence of stems
• Handle variability in stem and loop length

Stem Length Distribution
[Figure: the profile model is aligned against a scanning window that moves along the genome]

Dong Zhang, Leilei Guo, Mike McEachern

TRFolder
Helps predict/identify complex secondary structures in short sequences.
Used to find telomerase RNA structures in newly sequenced yeast genomes.
Telomerase RNAs
3' ACCCCAGACCCACGAC 5'
[Figure: linear chromosome]
• Telomeres: repeated sequences at the ends of linear chromosomes; protect interior sequences from degradation
• Telomerase: protein plus an RNA containing the template for the repeat
• Implicated: aging and cancer

TRFolder
Useful if you suspect there is a defined structure within a given sequence fragment.
1. Start with ~5000 base segment
2. Predict pseudoknots
3. Input additional stems to predict
4. Input ranges of stem and loop sizes
5. Search for triple helix regions

TRFolder: Telomerase RNAs in Newly Sequenced Yeast Genomes
1. Use template region to find candidate telomerase RNA gene regions.
   3' ACCCCAGACCCACGAC 5'
2. Verify by checking the known flanking genes from species where it was previously identified.
3. Use TRFolder to predict structural features.

TRFolder: Telomerase RNAs in Newly Sequenced Yeast Genomes
• Tested on Known Structures: Saccharomyces and Kluyveromyces species
• Predict New Structure for Known Telomerase RNAs: Schizosaccharomyces pombe and Candida albicans
• Predict New Telomerase RNA Genes and RNA Structure: Candida glabrata, Candida guilliermondii, Candida tropicalis, Ashbya gossypii, Debaryomyces hansenii, Pichia stipitis

TRFolder: Typical Output
The predicted structures of the top 2 candidates for A. gossypii:

Total:70.11 =pk Score:3*13.22 +Triple Score:1*13.77 +boundaryEle Score:
..-302..........-288..-256....-248....-9.....-2.....0..................
.....|.............|.....|.......|.....|......|.....|..................
.....GGAGCGGCUCCUGGA.....UGCCCACUG.....CAUGGGCA.....ACCGCUGAGAGACCCAUAC
.....(((((((((((((((.....((((((.((.....)))))))).....*******************
.......................................................................

Total:69.53 =pk Score:3*13.22 +Triple Score:1*13.77 +boundaryEle Score:
..-267.-262..-256....-248....-9.....-2.....0...........................
.....|....|.....|.......|.....|......|.....|...........................
.....GUCUCA.....UGCCCACUG.....CAUGGGCA.....ACCGCUGAGAGACCCAUACACCACACCG
.....((.(((.....((((((.((.....)))))))).....****************************
.......................................................................

Alignment Utility - RNApasta
Accepts: Rfam alignment files (Stockholm) or Pasta format (Pairing + Fasta)
>pairs
AAAAA...BBBBBBB...aaaaa.....bbbbbbbAAAAA...aaaaa
>index
11111...1111111...11111.....111111122222...22222
>sequence 1
CGGUGCAGGCUA-GAGACCACCGAUUUUAAUCGUAGCCGGCACCACCG
>sequence 2
CGGUGAAGGCUG-GAGACCACCGUAA--UCGCAGCCGGUGAACCACCG
>sequence 3
CGGUGAAGGCUA-GAGACCACCGUUUUUAAUCGUAGCCGGCACCACCG
Example Features:
• Arc Diagram
• Partition into 2 Sets By Stem Length
• Stacking

Alignment Utility - RNApasta
Java Utility with Graphical Interface
Statistical Analysis of Alignments (examples:)
• Basepairing frequencies by position, region, stem position
• Summaries of stem and loop length distributions
Alignment Editing (examples:)
• Identify non-canonical basepairs
• Extract regions
• Partition sequences into subsets
• Remove gap columns

Future Plans
• Evolution of RNA Structure (in process)
• Structural Alignment of RNAs (in process)
• Alternative Splicing?
• Finding All Transcripts?

Anuj Srivastava

Summary
• Introduced Tree Decomposition Method
• RNATOPS: Search Genomes for ncRNA
• TRFolder: Telomerase (& other) RNA Structures
• RNApasta: Statistical Analysis & Alignment Editing
Our software:
• Is very fast.
• Handles pseudoknots as well as stem-loops.
http://www.uga.edu/RNA-Informatics/
Liming Cai
Biomedical Information Science and Technology Initiative (BISTI)
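The Pasta format accepted by RNATOPS and RNApasta lends itself to a short parsing sketch. The Python below is not taken from either tool. It assumes a FASTA-like layout in which each `>` header is followed by its data line, that uppercase letters in the `pairs` line mark the 5' strand of a stem and the matching lowercase letters its 3' strand, and that the `index` line distinguishes stems that reuse a letter. The function names `parse_pasta`, `base_pairs` and `crosses` are hypothetical.

```python
from collections import defaultdict

EXAMPLE = """\
>pairs
AAAAA...BBBBBBB...aaaaa.....bbbbbbbAAAAA...aaaaa
>index
11111...1111111...11111.....111111122222...22222
>sequence 1
CGGUGCAGGCUA-GAGACCACCGAUUUUAAUCGUAGCCGGCACCACCG
>sequence 2
CGGUGAAGGCUG-GAGACCACCGUAA--UCGCAGCCGGUGAACCACCG
>sequence 3
CGGUGAAGGCUA-GAGACCACCGUUUUUAAUCGUAGCCGGCACCACCG
"""


def parse_pasta(text):
    """Return (pairs_line, index_line, {sequence name: aligned sequence})."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = ""
        elif name is not None:
            records[name] += line
    pairs = records.pop("pairs")
    index = records.pop("index")
    return pairs, index, records


def base_pairs(pairs, index):
    """Map each stem, keyed by (letter, index character), to its paired columns.

    Stems are read as nested helices: the first 5' column pairs with the
    last 3' column, the second with the second-to-last, and so on.
    """
    five_prime, three_prime = defaultdict(list), defaultdict(list)
    for col, (letter, idx) in enumerate(zip(pairs, index)):
        if letter.isalpha():
            side = five_prime if letter.isupper() else three_prime
            side[(letter.upper(), idx)].append(col)
    stems = {}
    for key, left in five_prime.items():
        right = three_prime[key]
        if len(left) != len(right):
            raise ValueError(f"stem {key} has unequal strand lengths")
        stems[key] = list(zip(left, reversed(right)))
    return stems


def crosses(stem_a, stem_b):
    """True if the outermost pairs of two stems interleave (a pseudoknot)."""
    a5, a3 = stem_a[0]
    b5, b3 = stem_b[0]
    return a5 < b5 < a3 < b3 or b5 < a5 < b3 < a3


if __name__ == "__main__":
    pairs, index, seqs = parse_pasta(EXAMPLE)
    stems = base_pairs(pairs, index)
    for key in sorted(stems):
        print(key, stems[key])
```

For the example alignment shown in the slides, this reading gives three stems (A with index 1, B with index 1, A with index 2), and `crosses` reports that the first two interleave, which is how a pseudoknot appears in this notation.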