Mira Abraham-Cohen and Haim J. Wolfson
Blavatnik School of Computer Science
Tel Aviv University
Tel Aviv, Israel

Why RNA?
RNA (ribonucleic acid) is:
• not solely a carrier of genetic information (non-coding RNAs)
• a key player in essential cellular processes (e.g. protein synthesis and transport, gene silencing)
• involved in pathological processes (e.g. cancerous tumors, AIDS)
• a potential drug or drug target (e.g. RNAi, bacterial ribosomes as antibiotic targets)
[Figure: The Central Dogma of Molecular Biology (DNA, RNA, Protein)]

RNA Structure
• 1D, 2D, 3D

Why RNA secondary structure?
• "RNA structure" usually refers to 2D structure
• Easier to obtain (more common than 3D structures)

Secondary structure elements
• Helix
• Loop
  – Bulge
  – Internal loop
  – Multi-branch loop
  – Hairpin

GUCUGUCCCCACACGACAGAUAAUCGGGUGCAACUCCCGCCCCUUUUCCGAGGGUCAUCGGAACCA
.((((((.......))))))....((((.......)))).[[[..((((((]]]...))))))...

Pseudoknot structural motif
• Important for the function of many RNAs
• Crossing helices helix1 and helix2: $i_1 < i_2 < j_1 < j_2$

RNA 2D structure alignment
• Disregarding pseudoknots: $O(n^4)$ [Zhang and Shasha 1989]
• Including pseudoknots: NP-hard [Zhang et al. 1999]

Why do pseudoknots make a difference?
• Are they common? Over 30% of the functional groups
• Less than 70% 2D similarity

Previous work – RNA 2D alignment
• Methods disregarding pseudoknots
  – RNAforester [Hofacker et al. 2004]
  – Migals [Allali and Sagot 2005]
  – MARNA [Siebert and Backofen 2005]
• Methods that deal with limited cases
  – rna_align (DP) [Jiang et al. 2001]
  – pkalign (DP) [Mohl et al. 2009]
• A method that deals with the general problem
  – LARA (ILP) [Bauer et al. 2007]
• All current methods dealing with pseudoknots
  – High time and memory complexity
  – Impractical for big structures, on pc-wolfson1 (2GB RAM):
    rna_align < 150 nts
    pkalign < 800 nts
    LARA < 1600 nts

HARP Motivation
• Preserved 3D structure →? Preserved function
• Preserved relative 3D distances → Preserved function
• Preserved relative 2D distances → Preserved function

HARP
• Aligns RNA 2D structures with no limitation on the pseudoknot type
• Exploits inherent RNA distance constraints
  – Distances between 2D elements are usually conserved
  – Pseudoknots often create spatial distance constraints
• Goal: finding the largest set of conserved helices
• Heuristic method based on an analog of Geometric Hashing

Geometric hashing
• Point of "view": each pair of points defines a "view"
• Voting table

HARP - Overview
1. Generate reduced "helix" graph representations (R1 → G1, R2 → G2)
2. Build a look-up table of geodesic distances in all bases
3. Query the look-up table
4. Refine alignments and extend the match

Reduced graph representation
• Vertices: stable helices
  – Helix beginning, termination and length
• Edges connect adjacent helices
  – Direction: polymerization direction
  – Weight: minimal number of nucleotides needed for connection
• Graph representation: $O(n \log n)$
[Figure: example triangles on vertices i, j, k with forward and backward edge weights]

Building a look-up table
• Shortest path between any two vertices
• Any two vertices (i, j) define a "view" (forward / backward)
• Similar views
  – Inserting G1 triangles
  – Querying with G2 triangles
• $O(n^3)$

Querying the vote table
• Querying the table with the indexing edges of G2
• Filtering by
  – Triangle type F/B
  – ε-vicinity
• $O(n^3)$
[Figure: basis edge, indexing edges and ε-vicinity]

Alignment refinement
• Matching cost $C$ between vertices $v_i$ and $v_j$ weighs two terms (weights $w_1$, $w_2$):
  – Distance between the vertices, $d(v_i, v_j)$
  – Correlation between the helices' lengths, $l(v_i)$ and $l(v_j)$
• Hungarian algorithm: $O(n^3)$
• Refinement step: $O(n^7)$

Alignment extension and scoring
• Greedy approach: $O(n^6)$
  – Starting with the largest (pair of bases) match
  – Extending by adding the pair that contributes most to the extension
• Score: $NS_{bp}(R_1, R_2) = \dfrac{S_{bp}(R_1, R_2)}{\min(bp_1, bp_2)}$

Complexity
• Generating reduced graph representations: $O(n \log n)$
• Building a look-up table: $O(n^3)$
• Querying the look-up table: $O(n^3)$
• Generating alignments:
  1. Alignment refinement: $O(n^7)$
  2. Alignment extension: $O(n^6)$
• In practice:
  – Average-size structures: less than a second
  – Big structures (~2800 nucleotides): less than a minute and 10 MB

Results
• HARP's statistics: average score and p-value
• Comparison with LARA
• Alignment examples

HARP's statistics
Functional group (based on DARTS)   Group size   Average size (nts)   Average score   p-value
tRNA                                4            78                   100%            0
Ribosomal 23S subunit               4            2852                 71.9%           0
Ribosomal 5S subunit                4            120                  77.2%           0.18
Ribosomal 16S subunit               2            1530                 86.7%           0
Self splicing group I introns       2            224                  78.0%           0.02
Thi-box riboswitch                  2            80                   95.0%           0.07
Guanine riboswitch                  2            69                   100%            0
SRP S domain                        2            114                  73.2%           0.17
RNase P catalytic domain            2            298                  68.9%           0.02
Total                               24           596                  83.4%           0.05

Similar 2D yet different function
• 5S ribosomal RNA
• SRP

Comparison with LARA
• 23 rRNA
[Figure: ROC curves for HARP and LARA. Vertical axis: Sensitivity = TP / P = TP / (TP + FN). Horizontal axis: 1 - Specificity = FPR = FP / N = FP / (FP + TN)]

Self splicing group I introns
• 68.9% similarity
• (left) PDB id 1zzn chain B, 10 stable helices
• (right) PDB id 1y0q chain A, 13 stable helices

Catalytic domains of ribonuclease P
• (left) PDB id 2a2e chain A, 19 stable helices
• (right) PDB id 2a64 chain A, 16 stable helices

Conclusions
• HARP is a tool for the alignment of RNA secondary structures, which may include pseudoknots
• Accurate tool capable of distinguishing between homologous structures and non-homologous structures
• Highly efficient
  – Takes less than a second for average-size structures
  – Less than a minute and 10 MB for very big structures
• Web server: http://bioinfo3d.cs.tau.ac.il/HARP

Thank you for your attention!
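
Supplementary sketch: the pseudoknot condition from the slides (i1 < i2 < j1 < j2) can be checked directly on the bracket string shown with the example sequence. The Python below is only an illustration and is not part of HARP. The function names `parse_pairs` and `crossing_pairs` are hypothetical, and it assumes that `(`/`)` and `[`/`]` each mark properly nested base pairs and that `.` marks an unpaired nucleotide.

```python
from itertools import combinations

# Assumed bracket alphabet: each opener is matched with its own closer.
BRACKETS = {"(": ")", "[": "]"}


def parse_pairs(structure):
    """Return the sorted list of base pairs (i, j), 0-based, with i < j."""
    closers = {close: open_ for open_, close in BRACKETS.items()}
    stacks = {open_: [] for open_ in BRACKETS}  # one stack per bracket type
    pairs = []
    for pos, ch in enumerate(structure):
        if ch in BRACKETS:
            stacks[ch].append(pos)
        elif ch in closers:
            stack = stacks[closers[ch]]
            if not stack:
                raise ValueError(f"unmatched {ch!r} at position {pos}")
            pairs.append((stack.pop(), pos))
        elif ch != ".":
            raise ValueError(f"unexpected character {ch!r} at position {pos}")
    if any(stacks.values()):
        raise ValueError("unmatched opening bracket")
    return sorted(pairs)


def crossing_pairs(pairs):
    """Return all couples of base pairs with i1 < i2 < j1 < j2."""
    return [
        (p, q)
        for p, q in combinations(sorted(pairs), 2)
        if p[0] < q[0] < p[1] < q[1]
    ]


if __name__ == "__main__":
    # Bracket string taken from the secondary structure slide.
    structure = (
        ".((((((.......))))))....((((.......)))).[[[..((((((]]]...))))))..."
    )
    pairs = parse_pairs(structure)
    crossings = crossing_pairs(pairs)
    print("base pairs:", len(pairs))
    print("crossing couples:", len(crossings))
    print("contains a pseudoknot:", bool(crossings))
```

Keeping one stack per bracket type is what lets the square-bracket helix interleave with the round-bracket helix. A single shared stack would reject the string or pair the wrong positions.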