A brief introduction to computational prediction of RNA secondary structure
Lars Barquist
13.11.2018

Lecture Goals
At the end of this lecture you should be able to:
• Describe the most common algorithm and energy model underlying minimum free energy (MFE) RNA secondary structure prediction
• Name at least one major weakness of standard MFE predictions
• Describe 3 alternative approaches that can complement MFE prediction

RNA Structure
Central Dogma of Genomics: sequence determines structure determines function (Petsko, 2000)
[Figure: tRNA shown as (A) primary structure, (B) secondary structure, and (C) tertiary structure, with the acceptor stem, D loop, anticodon loop, variable loop, and TΨC loop labeled]
Primary structure: 5’ GCGGAUUUAGCUCAGDDGGGAGAGCGCCAGACUGAAYA.CUGGAGGUCCUGUGT.CGAUCCACAGAAUUCGCACCA 3’

How do we build a folding algorithm?
Need three things:
• An encoding of secondary structure we can compute on
• Definition (and restriction) of the problem to be solved
• A method to produce optimal solutions to this problem

Representing RNA secondary structure

Matrix representation of RNA structure

Constraints on secondary structure (assumptions)
Also generally assume only Watson-Crick (G:C, A:U) and wobble pairs (U:G)!

Optimization: Dynamic Programming
• Not dynamic, not programming
• Foundational optimization technique used in many genomics applications (e.g. BLAST)
• Break down the optimization problem into a series of subproblems
• Basic idea: fill out a score matrix using recursion, then trace back along the maximal scoring path to deduce the optimal sequence of events (Viterbi, Needleman-Wunsch)
• First goal: maximize base-pairs (Nussinov)

The Nussinov folding algorithm
Recursion to maximize number of base-pairs:
Nussinov, SIAM 1978; Eddy, Nat. Biotech.
2004

The Nussinov folding algorithm
Nussinov et al, SIAM 1978; Eddy, Nat. Biotech. 2004; Durbin et al, Biological Sequence Analysis

Problems with the Nussinov algorithm
• Solutions frequently not unique
• 77 nt tRNA-His has thousands of “optimal” structures with 26 base-pairs!
• Maximizing base-pairs frequently doesn’t reflect biological structure
• Real tRNA structure has ~21 base-pairs
• Lacks physical realism, folding energy
Need an energy model!

Parameters of the Turner energy model
Base stacking
Optical melting
Doktycz et al., JBC 1995

Nearest Neighbor Model
Stacking free energies (kcal/mol); rows: pair W X, columns: pair Y Z

W X\Y Z     CG      GC      AU      UA      GU      UG
CG        -3.26   -2.36   -2.11   -2.08   -1.41   -2.11
GC        -3.42   -3.26   -2.35   -2.24   -1.53   -2.51
AU        -2.24   -2.08   -0.93   -1.10   -0.55   -1.36
UA        -2.35   -2.11   -1.33   -0.93   -1.00   -1.27
GU        -2.51   -2.11   -1.27   -1.36   +0.47   +1.29
UG        -1.53   -1.41   -1.00   -0.55   +0.30   +0.47

Mathews et al., JMB 1999
Laurberg et al., Nature 2008

Zuker folding
[Figure: hairpin with loop L closed by three stacked base-pair steps S1, S2, S3]
Free Energy = L + S1 + S2 + S3 = 5.00 − 2.11 − 2.35 − 2.24 = −1.70 kcal/mol
Zuker & Stiegler, NAR 1981
Dynamic programming over thousands of measured and inferred parameters for structural components

Exercise: What is this sequence?
UCCUCUGUAGUUCAGUCGGUAGAACGGCGGACUGUUAA
UCCGUAUGUCACUGGUUCGAGUCCAGUCAGAGGAGCCA
(https://sites.google.com/site/rnainformatics/sequence)
Take 5 min, use Vienna RNAfold (http://rna.tbi.univie.ac.at) and NUPACK (http://www.nupack.org) to fold it! Play around with default settings if you’ve got time.

Exercise results

Should we believe these results?
• Benchmarks:
  • By method developers:
    Walters et al. 1994, mean sensitivity 63.6%
    Mathews et al. 1999, mean sensitivity 72.9%
  • Independent:
    Doshi et al. 2004, mean sensitivity 41%
    Dowell & Eddy 2004, sens 56%, PPV 48%
    Gardner & Giegerich 2004, sens 56%, PPV 46%
• Datasets: tRNA, SSU & LSU rRNA, SRP, RNase P, tmRNA
• Energy model only accurate to within 5-10%!

Less obvious results
What is this stuff?

What is optimization?
The energy model defines a space of potential folds; optimization identifies the best scoring fold within this space
[Figure credit: O’Reilly Media]
Centroid: balance point for this space

The problem with optimization on an inaccurate energy model
“There is an embarrassing abundance of structures having a free energy near that of the optimum” (McCaskill 1990)
[Figure: ΔG (kcal/mol), from −20 to −22, plotted against base-pair distance to the MFE structure, d_BP(S_i, S_mfe), from −5 to 35, with the MFE, biological, and suboptimal structures marked]
Back to our results:
[Figure: predicted secondary structure]
Wuchty et al., Biopolymers 1990

Other alternatives to “plain” MFE optimization
• Determine how confident the model is in the global structure
  • Sampling!
• Augmenting the model with additional information
  • Comparative prediction of secondary structure
  • Experimental constraints on secondary structure

Sampling from the energy model
[Figure: cluster tree of the 20 most frequent structures in a sample of 1000]
Ding & Lawrence, NAR 2003
Can provide a more intuitive picture of uncertainty in structure

Additional information
Woese et al., Micro. Rev. 1983

Additional information: covariation
[Figures credit: Eric Nawrocki @NCBI]
• Incorporating evolutionary information into structure prediction is a deep topic!
• Co-estimation of alignment and structure is computationally prohibitive
• Resources:
  • Barquist et al., Curr Protoc Bioinformatics 2016
  • Pitfalls of naïve approaches: Rivas et al., Nat Methods 2017
  • Rfam (rfam.xfam.org)
• Producing these sequence:structure models is more art than science, BUT they are still one of the best sources of evidence for conserved functional structure!

Additional information: experimental
More from Redmond next!
Westhof & Romby, Nat. Methods 2010

Exercise: What is this sequence?
UCCUCUGUAGUUCAGUCGGUAGAACGGCGGACUGUUAA
UCCGUAUGUCACUGGUUCGAGUCCAGUCAGAGGAGCCA
(https://sites.google.com/site/rnainformatics/sequence)
Use Sfold to get an idea of what the structural ensemble looks like: http://sfold.wadsworth.org/cgi-bin/srna.pl
Now say you’ve gotten information that one base is definitely unpaired. Use RNAfold with the following constraint:
.........................x..................................................

Exercise results: Sfold sampling
[Figure: sampled structure clusters labeled 63% and 37%]
Note: results are stochastic, so different runs will give slightly different answers!

Exercise results: experimental constraints
Can be done high-throughput but need to deal with noise (“soft constraints”); see: Eddy, Annu Rev Biophys 2014

Summary
• Don’t trust point (MFE) estimates!
  • They can be useful but treat with skepticism
  • Parameters have uncertainty
  • Some possibilities completely excluded (base modifications, pseudoknots, tertiary contacts)
• The ensemble can be more informative!
  • Base-pairing probabilities can inform on stable substructures
  • Dot plots can be informative about possible local alternatives
  • Sampling can give an idea as to global alternatives
• If you need an accurate structure, bring in more information: covariation and/or experimental!
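
Appendix sketch: Nussinov-style base-pair maximization
The recursion behind the "First goal: maximize base-pairs (Nussinov)" slide can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used by RNAfold, NUPACK, or Sfold. The function names, the `min_loop` parameter (a minimum hairpin loop of 3 unpaired bases), and the `constraint` argument (an 'x' marks a base forced to stay unpaired, loosely mirroring the constraint string in the exercise) are assumptions introduced here for illustration.

```python
def can_pair(a, b):
    """Allow only Watson-Crick (G:C, A:U) and wobble (G:U) pairs."""
    return {a, b} in ({"G", "C"}, {"A", "U"}, {"G", "U"})


def nussinov(seq, constraint=None, min_loop=3):
    """Return (max_pairs, dot_bracket) for an RNA sequence.

    constraint (assumed notation): same length as seq, 'x' = must be unpaired.
    min_loop (assumption): minimum number of unpaired bases inside a hairpin.
    """
    n = len(seq)
    free = [constraint is None or constraint[i] != "x" for i in range(n)]

    def pair_score(score, i, k, j):
        # Score of pairing k with j inside the interval [i, j].
        left = score[i][k - 1] if k > i else 0
        return left + 1 + score[k + 1][j - 1]

    # score[i][j] = maximum number of base-pairs in seq[i..j]
    score = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = score[i][j - 1]  # case 1: j is unpaired
            for k in range(i, j - min_loop):  # case 2: j pairs with k
                if free[k] and free[j] and can_pair(seq[k], seq[j]):
                    best = max(best, pair_score(score, i, k, j))
            score[i][j] = best

    # Traceback: recover one (of possibly many) optimal structures.
    structure = ["."] * n
    stack = [(0, n - 1)]
    while stack:
        i, j = stack.pop()
        if j - i <= min_loop:
            continue  # interval too short to contain a pair
        if score[i][j] == score[i][j - 1]:
            stack.append((i, j - 1))
            continue
        for k in range(i, j - min_loop):
            if (free[k] and free[j] and can_pair(seq[k], seq[j])
                    and pair_score(score, i, k, j) == score[i][j]):
                structure[k], structure[j] = "(", ")"
                if k > i:
                    stack.append((i, k - 1))
                stack.append((k + 1, j - 1))
                break
    return (score[0][n - 1] if n else 0), "".join(structure)


pairs, structure = nussinov("GGGAAAUCC")
print(pairs, structure)
```

Because the score only counts pairs, the traceback returns just one of what may be many equally scoring structures, which is the non-uniqueness problem noted in "Problems with the Nussinov algorithm". Passing a constraint such as `"x........"` excludes the first base from pairing; energy-based tools handle constraints within their own energy models, so this is only an analogy.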