RecitaLon 3-­‐19 CB Lecture #10 RNA Secondary Structure 1 Announcements • Exam 1 grades and answer key will be posted Friday a=ernoon – We will try to make exams available for pickup Friday a=ernoon (probably from 3:30-­‐4pm and 5-­‐5:30pm, before and a=er the Friday secLon) • Pset #3 has been released, due April 3rd – much longer programming problem than Pset #2 – Because of spring break, only one set of formal office hours before due date, but please email us with your quesLons • Updated aims with research strategy will be due Friday April 4th 2 RNA Secondary Structure • Just as protein can form secondary structure (α-­‐helix and β-­‐ sheet), so too can single-­‐stranded RNA by folding back on itself to form double-­‐stranded regions rRNA mRNA tRNA © unknown. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/. Source: http://www.mun.ca/biology/scarr/rRNA_folding.html © source unknown. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/. © Dept. Biol. Penn State. All rights reserved. This content excluded from our Creative Commons license. For more information, see http://ocw.mit. edu/help/faq-fair-use/. hNp://www.uic.edu/classes/phys/phys461/phys450/ANJUM04/RNA_sstrand.jpg hNps://www.mun.ca/biology/scarr/rRNA_folding.html hNps://wikispaces.psu.edu/download/aNachments/54886630/figure_17_12.jpg 3 …and virtually every other RNA! RNA Secondary Structure • RNA’s secondary structure is o=en inLmately Led with its funcLon – rRNA and tRNA always adopt the same structure; funcLon depends on it – mRNA may adopt different structures in different condiLons – due to cell types, temperature, ion concentraLon, etc. • mRNA’s processing may depend on what structure is (or is not) present -­‐ Can inhibit or strengthen ability of RNA binding protein to bind mRNA and affect alternaLve splicing -­‐ Can inhibit the ribosome’s ability to translate through the mRNA due to sequestraLon of ribosome binding site or hiOng structured road block Schematic diagram of an E. coli cell removed due to copyright restrictions. See the image here. Riboswitches are metabolite-­‐sensing RNAs, typically located in the non-­‐coding porLons of messenger RNAs, that control the synthesis of metabolite-­‐related proteins 4 hNp://www-­‐abell.ch.cam.ac.uk/images/riboswitchfig1.jpg RNA is at the catalyLc site of the ribosome (which is a ribozyme) RNA/protein distribution on the 50S ribosome linguini = protein fettucini = RNA © American Association for the Advancement of Science. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/. Source: Ban, Nenad, Poul Nissen, et al. "The Complete Atomic Structure of the Large Ribosomal Subunit at 2.4 Å Resolution." Science 289, no. 5481 (2000): 905-20. 5 -­‐Ribozymes – RNAs capable of catalyzing biochemical reacLons -­‐ provide support for “RNA world” hypothesis – that life evolved from a world with RNAs but no DNA or protein Terminology © Oxford University Press. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/. © Kenyon College Biology Department. All rights reserved. This Source: Ding, Ye, and Charles E. Lawrence. "A Statistical Sampling Algorithm for RNA content is excluded from our Creative Commons license. For Secondary Structure Prediction." Nucleic Acids Research 31, no. 24 (2003): 7280-301. more information, see http://ocw.mit.edu/help/faq-fair-use/. 6 hNp://www.ims.nus.edu.sg/Programs/biomolecular07/files/ clote_tut2b.pdf hNp://biology.kenyon.edu/courses/biol114/KH_lecture_images/ transcripLon/FG03_15b.JPG Arc NotaLon © source unknown. All rights reserved. This content is excluded from our Creaative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/. 7 hNp://www.ims.nus.edu.sg/Programs/biomolecular07/files/clote_tut2b.pdf non-­‐coding RNAs (ncRNAs) • Any RNA molecule that doesn’t code for protein (any non-­‐mRNA molecule) – tRNAs, rRNAs, miRNAs, snRNAs, snoRNAs, ribozymes (RNase P), lnRNAs, riboswitches • Due to the central role of structure in facilitaLng RNA’s funcLon, we’d like the determine structure – 2 different approaches for secondary structure 1. CovariaLon and compensatory changes through evoluLon 2. Energy minimizaLon 8 CovariaLon and compensatory changes • Idea: If structure is contribuLng to funcLon but actual sequence is not, we should see structure conserved but not necessarily sequence – So evoluLon allows as longby as covariation secondary structure is RNAmutaLons 2o structure / maintained compensatory changes 9 Seq1: A C G A A A G U Seq2: U A G U A A U A Seq3: A G G U G A C U Seq4: C G G C A A U G Seq5: G U G G G A A C CovariaLon and compensatory changes • Need sufficient divergence so that a decent number of mutaLons and compensatory mutaLons have occurred, but not so much that sequences can’t be aligned • Need a large number of homologs sequenced to have power to detect compensatory mutaLons 10 Mutual InformaLon (MI) • The most common way of quanLfying sequence covariaLon for the purpose of RNA secondary structure determinaLon • A measure of two variables' mutual dependence – Measures the informaLon that X and Y share: it measures how much knowing one of these variables reduces uncertainty about the other – If X and Y are independent, then knowing X does not give any informaLon about Y and vice versa, so their MI = 0 – At the other extreme, if X is a determinisLc funcLon of Y and Y is a determinisLc funcLon of X, then all informaLon conveyed by X is shared with Y: knowing X determines the value of Y and vice versa • As a result, in this case the mutual informaLon is the same as the uncertainty contained in Y (or X) alone, namely the entropy of Y (= entropy of X) – Mutual informaLon between aligned columns of nucleoLdes that are base-­‐paired should be high 11 • Knowing one of the nucleoLdes tells you everything about the other (if A, other is U; if C, other is G, etc.) Mutual InformaLon (MI) MI between two columns i and j: (i,j) fx,y : fracLon of sequences with x in column i AND y in column j fx(i) : fracLon of sequences with x in column i 12 -­‐RelaLve entropy of the joint distribuLon relaLve to the individual distribuLons of the nucleoLdes in columns i and j -­‐MI is maximal (2 bits) if x and y appear at random (all 4 nts equally likely) but perfectly covary (e.g. always complementary) Mutual InformaLon (MI) MI is maximal (2 bits) if x and y appear at random (all 4 nts equally likely) but perfectly covary (e.g. always complementary) (i,j) f What is x,y ? (i,j) fx,y Because x and y perfectly covary, 1 = for the 25% of covarying events (e.g. (x,y) = (A,U)) 4 (i,j) fx,y = 0 for the 75% of non-existent events (e.g. (x,y) = (A,A)) 13 (i) (j) What is fx fy ? 1 1 1 ⇤ = 4 4 16 Mutual InformaLon (MI) MI is maximal (2 bits) if x and y appear at random (all 4 nts equally likely) but perfectly covary (e.g. always complementary) X 1 = log2 4 (x,y)=(A,U ),(C,G),(G,C),(U,A) X 1 ⇤2 = 4 (x,y)=(A,U ),(C,G),(G,C),(U,A) 14 = 2 bits ✓ 1/4 1/16 ◆ Energy Minimization App nd 2 approach: Energy minimiza'on ΔGfolding = Gunfolded - Gfolded -­‐ Assume that RNA will fold to its lowest energy state re-­‐ are typically many possible folded states Simplest model: all base pairs contribute equally to lowering structure’s energy assumption that minimum energy state(s) will be oc -­‐ Base Pair MaximizaLon (ignores energy contribuLons of base stacking, loops, entropy, etc.): +1 for paired bases, 0 for unpaired -­‐ Use the Nussinov algorithm of recursive maximizaLon of base pairing ΔG = ΔH - TΔS halpy favors folding 15 See the Eddy Nature Biotechnology Primer 2004 Nussinov algorithm • Look at one conLguous sub-­‐sequence from posiLon i to posiLon j in our complete sequence of length N, and calculate the score of the best structure for just that sub-­‐sequence • This opLmal score (call it S(i,j)) can be defined recursively in terms of opLmal scores of smaller sub-­‐sequences • Four possible ways that a structure of nested base pairs on i...j can be constructed RRIIM MEER R PPRRI IMME1.ERi,Rj are a base pair, added on to a structure for i+1 ... j–1 2. i is unpaired, added on to a structure for i+1 ... j 3. j is unpaired, added on to a structure for i ... j–1 i, j are paired, butbest not toscore each other; the structure for i...j together sub-­‐structures Recursive definition ofofthe for i,ji,jadds looks atatfour possibilities: Recursive4. definition the best score foraasub-sequence sub-sequence looks four possibilities: aa Recursive definition of the best score for a sub-sequence i,j looks at four possibilities: for two sub-­‐sequences, i ... k and k+1 ... j (a bifurcaLon) Recursive definition of the best score for a sub-sequence i,j looks at four possibilities: S( +i +1,j1,j––1)1) S(iS( i +i 1,j – 1–)1) S( + 1,j S( i +i + 1,j1,j )) S( S(S( i +i 1,j )) + 1,j S(S( i,ji,j– 1–)1) S( i,ji,j–– 1)1) S( S( i,k ) ) S( S( i,ki,ki,k )) S( i+ i +1i1+i 1+ 1 j –j –1j – 1j 1– 1 ii i i jjj j 16 i i i ii+i i+ 1+i11+ 1 j j j i ii i j j–j–j1–11jj j 1.1.1. i,ji,j pair j junpaired 1.i,j i,jpair pair 2.2. i unpaired 3.3. 3. unpaired 2.i2. i unpaired unpaired pair iunpaired 3. j junpaired unpaired ii i i kkkk S( k +k1,j ) +)1,j S( S(kS( k++1,j 1,j ) ) k111+ 1 j jj j kkk+++ Bifurcation Bifurcation 4.4. 4.Bifurcation Bifurcation theofoptimal is just the maximum ofthe top of thescore four S(i,j) possibilities. We’ve thus defi the four possibilities. We’ve thus defined of the optimal score S(i,j) recursively as a fu sible of ways Nussinov algorithm optimal score S(i,j) recursively as a funcyson i..jthe tion only of optimal scores of smaller s can 1. i,j are a base pair, added on to a structure for i+1 ... j–1 – Theonly scoresequences; we add foroptimal the baseso pair i,jwe isscores independent ofneed any detailsto of the of of smaller subn tion only remem opLmal structure on i + 1...j – 1 – Similarly,these the opLmal structure on i + 1...j – need 1 and its score S(i +remember 1, j – 1) are sequences; so we only to scores, not the combinatorial explos structure unaffected by whether i, j are base paired or not (or anything else that happens in thepossible restnot of the sequence) of structures. Mathematically these scores, the combinatorial explosion P R I M E R re – Therefore, S(i, j) is just S(i + 1, j – 1) plus one, if i, j can base pair. recursion looks like: of possible structures. Mathematically this ucture for a Recursive definition of the best score for a sub-sequence i,j looks at four possibilities: recursion looks like: or S(i + 1,j – 1) +1 [if i,j base p S(i,j – 1) ucture for S(i + 1,j) S(i,k) S(k +1,j ) S(i,j) = S(i max+ 1,j – 1) +1 [if i,j base pair] S(i,j – 1) or S(i + 1,j) S(i,j) = i +1 j –1 max maxi<k<j S(i,k) + S(k + 1,j) ch other; S(i + 1,j ) S(i + 1,j – 1) i j ther sub-1. i,jpair r; 17 i k k +1 j S(i,j –i 1) j –1 j 2. i unpaired max3. j unpaired S(i,k) +4. Bifurcation S(k + 1,j) i i +1 j on i,j conditional on i a ted base pairs the sub-sequence structure onthat i,j conditional on i andstructure j being structure on i,j conditional of nested base pairs that the sub-sequence ture on i,j conditional on i and j being pairedpaired but notbut to not eachtoother. The key is key to recognize that this paired but not to each other. each other. form. The is to recognize that this d but not to each other. Since these are theare only four pos core (call these it(call S(i,j)) canonly becan defined Since are the four possible cases, Since these the only fo mal score it S(i,j)) be defined nce these are the only four possible cases, the optimal score S(i,j) isS(i,j) just is theju y the in optimal terms ofscore optimal scores of maximum S(i,j) is just the the optimal score ursively in terms of optimal scores of ptimal score S(i,j) is just the maximum 2. i is unpaired, added on of to a structure for i+1 ... j of the four possibilities. We’ve We th b-sequences. As shown in the top of the four possibilities. We’ve thus defined of the four possibilities. ller sub-sequences. As shown in the top of • In case 2, the opLmal scorethus S(i + 1,j) is independent of the addiLon of an e four possibilities. We’ve defined the optimal score S(i,j) there are only four possible ways the optimal score as a funcunpaired base i, S(i,j) so four S(i +recursively 1, possible j) + 0 is the score of the opLmal structure onrecursively i,j the optimal score S(i,j) recu ure 1, there are only ways ptimal condiLonal score S(i,j) recursively as a funcon i being unpaired tionsubonly of optimal scores of sm cture ofonly nested base pairsscores on i..jof cansmaller tion of optimal tion only of optimal scores a structure of nested base pairs on i..j can only of optimal scores of smaller subsequences; need to ucted: 3. j is so unpaired, added on to to aremember structure forso i ... we j–1soonly sequences; we only need R I M E R sequences; we only nee onstructed: so we only need to remember Pences; R I •M E R pair, Case added 3 is the same thing, but condiLonal on j being unpaired these scores, not the combinatoria these scores, not the combinatorial explosion base on to a structure these scores, not the combin not the j scores, are a base pair,combinatorial added on to aexplosion structure of possible structures. Mathema 1..jof– 1. possible structures. Mathematically this ofi,j possible structures. Ma ossible structures. Mathematically this Recursive definition of the best score for a sub-sequence looks at four possibilities: or i + 1..j – 1. recursion looks like:possibilities: a recursion Recursive definition of the best score for a sub-sequence i,j looks at four looks like: paired, added on to a structure for recursion looks like: sion looks like: is unpaired, added on to a structure for S( ) S(S( i + 1,j – 1) S(ii++1,j 1,j ) i + 1,j – 1) S(i,ji,j––11)) S( S(i + 1,j – 1) +1 [if i + 1..j. S(i + 1,j – 1) +1 [if i,j base pair] + 1,j – 1) + paired, added a structure for pair] k1,j+S(i S(i on + 1,jto– 1) +1 [if i,j base S(S( i,ki,k ) ) S(iS(+ kS( +1,j) )1,j ) is unpaired, added to a structure forS(i,j) = max S(i on + 1,j) S(i + 1,j) S(i,j) = max S(i + 1,j) S(i,j – 1) S(i,j) = max ,j) .j –=1.max S(i,j – 1)S(i,j – 1) S(i,j – 1)+ S(k maxi<k<j S(i,k) paired, ibut +i1+ 1 not j –j – 11 to each other; maxnot S(i,k) + S(kother; + 1,j) i<k<j+to max S(i,k jucture are paired, but each S(i,k) S(k + 1,j) max i<k<j i j – 1 j i i + 1 j i k k + 1 j i j i j – 1 j i i + 1 i k k + 1 j i j i<k<j for i..j adds together subhe adds together subpair i..j 2. 2. unpaired 4.4. Bifurcation i,ji,jfor pair i i unpaired unpaired Bifurcation res structure for two1.1.sub-sequences, i..k and3.3. jj unpaired To run this recursion efficient Nussinov algorithm 18 quence structure on i,j conditional on i and j being the optimal score S(i,j) is just the maximum at this paired but not to each other. of the four possibilities. We’ve thus defined Since these are the only four possible cases, efined theofoptimal score S(i,j) recursively as amaximum functhe optimal score S(i,j) is just the res 4. i,j are paired, but not to each other; the structure for i...j adds together tion of optimal ofWe’ve smaller sub-­‐structures twoscores sub-­‐sequences, i ...thus k andsubk+1 ... j (a bifurcaLon) of the fourforpossibilities. defined top of only Weoptimal deal two independent sub-­‐structures so with we puOng onlyS(i,j) need to remember score recursively as a func-together, the e sequences; ways -­‐ the opLmal score S(i,k) is independent of anything going on in k+1 ... j, and these scores, not of theoptimal combinatorial only scores ofexplosion smaller subi..j can tion vice versa R IofM possible E R sequences; soall we only remember structures. Mathematically -­‐ Must consider possible k’s bneed etweentoi and jthis these scores, recursion looks like:not the combinatorial explosion ucture of possible structures. Mathematically Recursive definition of the best score for a sub-sequence i,j looksthis at four possibilities: recursionS(i looks like: + S( 1,j – 1) +1 [if i,j base pair] ure for S( i + 1,j ) i + 1,j – 1) S(i,j – 1) S(i + 1,j) S(i,j) = max S(i + 1,j – 1) +1 [if i,j baseS(pair] i,k ) S(k + 1,j ) S(i,j – 1) ure for S(iS(i,k) + 1,j) + S(k + 1,j) maxi<k<j S(i,j) = max Nussinov algorithm i +1 j –1 S(i,j – 1) +j –S(k i 1 j + 1,j) i i + 1max j i<k<j S(i,k) other; i i k k +1 j r subTo run this recursion 1. i,j pair 2. i unpaired efficiently, 3. j unpairedwe just 4. Bifurcation 19 j Since these are the only four possible cases, re (call it S(i,j)) can be defined in terms of optimal scores of the optimal score S(i,j) is just the maximum sequences. As shown in the top of of the four possibilities. We’ve thus defined ere are only four possible ways the optimal score S(i,j) recursively as a func Sincebase these only four cases, the opLmal scoresubtion possible only of optimal scores of smaller ure of•nested pairsare onthe i..j can S(i, j) is just the maximumsequences; of the four sopossibiliLes we only need to remember ed: • We’ve thus the opLmal scorenot S(i,j) as explosion a these scores, the recursively combinatorial ase pair, added on to defined a structure funcLon only of opLmal scores of smaller sub-­‐sequences, so this of possible structures. Mathematically – 1. we only need to remember these looks scores, not the RRred, IIM EER recursion like: Madded R on to a structure for Nussinov algorithm PPRRI IMME ERR combinatorial explosion of possible structures S(i + 1,j – 1) +1 [if i,j base pair] red, addeddefinition on to aof structure for for a sub-sequence i,j S(i +looks 1,j)atatfour Recursive the best score Recursive definition of the best score for aS(i,j) sub-sequence i,jlooks fourpossibilities: possibilities: = max aa Recursive definition of the best score for a sub-sequence i,j looks at four possibilities: Recursive definition of the best score for a sub-sequenceS(i,j i,j looks at four possibilities: – 1) S( i + 1,j ) ) maxi<k<j S(i,k) + S(k + 1,j) i +i +1,j1,j––1to )1) each other; S( aired, butS( not S( S(ii++1,j 1,j ) S(S( i +i 1,j – 1–)1) + 1,j S(i + 1,j ) ture for i..j adds together subfor two sub-sequences, i..k and bifurcation). S(S( i,ji,j– 1–)1) S( i,ji,j–– 1)1) S( S( i,k ) ) S( S( i,ki,ki,k )) S( S( k +k1,j ) +)1,j S( S(kS( k++1,j 1,j ) ) To run this recursion efficiently, we just need to make sure that whenever we try to i+ –j –1j – i +1i1+i 1+If 1j 1–add 1 jwe 1 the first case. on a i,j compute an S(i,j), we already have calculated j j–j–j1–11jj j i i i ii+i i+ 1+i11+ 1 j j j jjj j k111+ 1 j jj j ii i ii i i kkkk kkk+++ i i the iscores for smaller sub-sequences. This nto i + 1..j –i i 1, what is the score 1.1.1. i,ji,j pair 2.2. 3.3. jup Bifurcation 1.i,j i,jpair pair the i unpaired sets 3. unpaired Bifurcation 20 2.i2. i unpaired unpaired junpaired 4.4. a dynamic programming ially, we know (from definipair iunpaired 3. j junpaired unpaired 4.Bifurcation Bifurcation algorithm. IMER Nussinov algorithm Pdefinition R II M M EE R R the best score for a sub-sequence i,j looks at four possibilities: Recursive of P R PRIMER aa S(i + 1,j ) S(i + 1,j – 1) S(for i,j – 1)sub-sequence i,j looks at four possibilities: Recursive definition definition of of the thebest bestscore score Recursive for aasub-sequence i,j looks at four possibilities: S( ) k + 1,j ) a Recursive definition of the best score for a sub-sequence i,ji,klooks atS( four possibilities: S(i i++1,j1,j)) S(ii++1,j 1,j––11)) S( S( S(i,ji,j––1)1) S( S(i + 1,j ) S(i + 1,j – 1) S(i,j – 1) S(i,ki,k) ) S( + 1,j S( S( k k+ 1,j )) i +1 j –1 S(i,k ) S(k + 1,j ) j i 1. i,j pair ii++11 2. ii i + 1 i i +1 i j j –1 j i j j––11 i unpaired 3.j j unpaired i i i +1 j j –1 i j i j i +1 j i i +1 i i j j –11 j j j– j –1 j k k +1 j 4. Bifurcation i k k +1 i k i k k +1 k +1 j j j ww.nature.com/naturebiotechnology ww.nature.com/naturebiotechnology ww.nature.com/naturebiotechnology 1. i,j i,j pair pair 2. i iunpaired unpaired unpaired Bifurcation 1. 2. 3.3.j junpaired 4.4.Bifurcation 1. i,j pair for 2. 3. j unpaired 4. Bifurcation Dynamic programming algorithm alli unpaired sub-sequences i,j, from smallest to largest: j j b Dynamic programming algorithm j smallest to largest: forall allsub-sequences sub-sequencesi,j, i,j,from from ithm for smallest to largest: G G G A G 0 G 0 0 G 0 0 A 0 0i A 0 A U C C 21 bA ADynamic U C C programming G G algorithm G A A A f U C C j G 0 0 0 0 0 0j 1 2 3 G G G Aj A A U C C G G G Aj A A U 1 2 3 G 0 G G 0 G G G A A AG U 0C 0C 0 G 00 00 0 G0 A0 A0 A1 G G 0 0 0 0 0 0 G 00 00 00 01 002 00201 G 0 0 i G G G G 00 00 A i 0 G 0 000 000 001 001 00101 i 0A A G A 0 0A A 0A 0 U U 0 C C C C G 0 0 0 0 0 Ai 0 0 0 0 0 1 01 01 1 A 0 0 0 0 00 00 A 0 0 0 0 A 0 0 0 0 0 0U 0 0 0C 0 0 A A U 0 0 0 0 C C0 0 0 0 Initialization;Initialization; Initialization; 0 0 C A A U C C 0 0 01 001 00101 0 0 1 0 0 0 0 00 0 0 0 0 00 0 0 0 G G 0 C C U2 C3 G C 0 1 i2 G 3 2 3 1 2 1 1 1 1 1 1 1 0 0 0 0 0 recursive fill; recursive fill;fill; recursive 2 A 3 2 2 2 1A 1 1 1 1 A 1 1 1 U 1 0 0 C 0 0 0 C 0 0 0 0 G G A A A U C C j 0 0 0j0 0 1 2 3 GG G G G G A A A A A A U U C C C C 0 0 00 00 0 1 0 21 32 3 G 0 00 G 00 0 0 0 0 1 2 3 0 0 000 000 0 1 0 00 G0G 00 0 0 121 222 3 3 i i GG 0 0 00 0 000 000 0 10 0 111 212 2 2 AA AA AA UU CC CC 0 0 111 111 1 1 0 0 000 000 0 1 0 0 111 111 1 1 00 000 0 1 00 00 1 1 1 1 1 1 0 0 0 0 00 00 0 00 0 0 00 00 0 0 0 0 0 000 0 0 traceback; traceback; traceback; AA AA A A AA U U UC GC GA G CC G GG C GGG C result. result. result. where inside add a score f get S(i,j), be whereinside insideth where available to add a score fo add a score for where inside th already pair getS(i,j), beca add a S(i,j), scorebeca for get optimal sub available get S(i,j), becau available totob hasn’t kept alreadypaired paire available to ba already how the recw optimal subalready paired optimal sub-st hasn’t kepttra tr optimal sub-str to remembe hasn’t kept hasn’t kept trac howthe the recu how the recurs of com how the recursS toremember remember to structures o to remember S of the comb of the combin recursion is of the combin structures structures onon Thereison are structures recursion isinv it recursion recursion isare invR with pseudo There There are RN There are RN withpseudokn pseudo serious limi with with pseudokn serious limita serious limitat cient algorit serious limitati cientalgorithm algorith cient ing) that ca cient algorithm ing)that thatcan cang ing) longthat as can onegu ing) longasasone oneuse u long long as onesyste use scoring scoring system scoring system scoring system, dependent dependent tht dependent the dependent ther plex dynamp plexdynamic dynami plex dynamic plex guaranteeopti op guarantee o guarantee opt guarantee Nussinov algorithm • Storing the S(i, j) matrix requires memory proporLonal to N2, similar to what sequence alignment algorithms need • However, the innermost loop of having to find opLmal potenLal bifurcaLon points k means that the folding algorithm requires Lme proporLonal to N3, a factor of N more Lme-­‐intensive than sequence alignment – RNA folding calculaLons o=en require a large amount of computer power 22 e for just that thing going on in k + 1..j, and vice versa, so imum num- S(i,k) + S(k + 1,j) is the score of the optimal ub-sequence structure on i,j conditional on i and j being ize that this paired but not to each other. We want to fold the following RNA sequence: Since these are the only four possible cases, n be defined AAGUUCG S(i,j) is just the maximum al scores of the optimal score in the top of of the four possibilities. We’ve thus defined (1) Write the sequence along the top and le= side of the matrix ossible ways the optimal score S(i,j) recursively as a funcirs on i..j can tion only of optimal scores of smaller subsequences; so weofonly need toand remember (2) IniLalize the diagonal the matrix one-­‐below to zero o a structure these scores, not the combinatorial explosion of possible structures. Mathematically this recursion looks like: (3) Fill jth entries according to structure forin i, Nussinov Algorithm Example structure for each other; ogether subnces, i..k and 23 S(i,j) = max S(i + 1,j – 1) +1 [if i,j base pair] S(i + 1,j) S(i,j – 1) maxi<k<j S(i,k) + S(k + 1,j) To run this recursion efficiently, we just Nussinov Algorithm -­‐ iniLalizaLon 24 . r t e s d f f s n e r r ; - 25 same thing, but conditional on j being unpaired. In case 4, where we deal with putting two independent sub-structures together, the optimal score S(i,k) is independent of anything going on in k + 1..j, and vice versa, so S(i,k) + S(k + 1,j) is the score of the optimal structure on i,j conditional on i and j being paired but not to each other. Since these are the only four possible cases, the optimal score S(i,j) is just the maximum of the four possibilities. We’ve thus defined the optimal score S(i,j) recursively as a function only of optimal scores of smaller subsequences; so we only need to remember these scores, not the combinatorial explosion of possible structures. Mathematically this recursion looks like: Nussinov Algorithm . r t e s d f f s n e r r ; - 26 same thing, but conditional on j being unpaired. In case 4, where we deal with putting two independent sub-structures together, the optimal score S(i,k) is independent of anything going on in k + 1..j, and vice versa, so S(i,k) + S(k + 1,j) is the score of the optimal structure on i,j conditional on i and j being paired but not to each other. Since these are the only four possible cases, the optimal score S(i,j) is just the maximum of the four possibilities. We’ve thus defined the optimal score S(i,j) recursively as a function only of optimal scores of smaller subsequences; so we only need to remember these scores, not the combinatorial explosion of possible structures. Mathematically this recursion looks like: Nussinov Algorithm same thing, but conditional on j being unpaired. In case 4, where we deal with putting two independent sub-structures together, the optimal score S(i,k) is independent of anything going on in k + 1..j, and vice versa, so S(i,k) + S(k + 1,j) is the score of the optimal structure on i,j conditional on i and j being paired but not to each other. Since these are the only four possible cases, the optimal score S(i,j) is just the maximum of the four possibilities. We’ve thus defined the optimal score S(i,j) recursively as a function only of optimal scores of smaller subsequences; so we only need to remember these scores, not the combinatorial explosion of possible structures. Mathematically this recursion looks like: . r t e s d f f s n Nussinov Algorithm e r r ; - 27 . r t e s d f f s n e r r ; - 28 same thing, but conditional on j being unpaired. In case 4, where we deal with putting two independent sub-structures together, the optimal score S(i,k) is independent of anything going on in k + 1..j, and vice versa, so S(i,k) + S(k + 1,j) is the score of the optimal structure on i,j conditional on i and j being paired but not to each other. Since these are the only four possible cases, the optimal score S(i,j) is just the maximum of the four possibilities. We’ve thus defined the optimal score S(i,j) recursively as a function only of optimal scores of smaller subsequences; so we only need to remember these scores, not the combinatorial explosion of possible structures. Mathematically this recursion looks like: Nussinov Algorithm . r t e s d f f s n e r r ; - 29 same thing, but conditional on j being unpaired. In case 4, where we deal with putting two independent sub-structures together, the optimal score S(i,k) is independent of anything going on in k + 1..j, and vice versa, so S(i,k) + S(k + 1,j) is the score of the optimal structure on i,j conditional on i and j being paired but not to each other. Since these are the only four possible cases, the optimal score S(i,j) is just the maximum of the four possibilities. We’ve thus defined the optimal score S(i,j) recursively as a function only of optimal scores of smaller subsequences; so we only need to remember these scores, not the combinatorial explosion of possible structures. Mathematically this recursion looks like: Nussinov Algorithm same thing, but conditional on j being unpaired. In case 4, where we deal with putting two independent sub-structures together, the optimal score S(i,k) is independent of anything going on in k + 1..j, and vice versa, so S(i,k) + S(k + 1,j) is the score of the optimal structure on i,j conditional on i and j being paired but not to each other. Since these are the only four possible cases, the optimal score S(i,j) is just the maximum of the four possibilities. We’ve thus defined the optimal score S(i,j) recursively as a function only of optimal scores of smaller subsequences; so we only need to remember these scores, not the combinatorial explosion of possible structures. Mathematically this recursion looks like: . r t e s d f f s n Nussinov Algorithm e r r ; - 30 . r t e s d f f s n e r r ; - 31 same thing, but conditional on j being unpaired. In case 4, where we deal with putting two independent sub-structures together, the optimal score S(i,k) is independent of anything going on in k + 1..j, and vice versa, so S(i,k) + S(k + 1,j) is the score of the optimal structure on i,j conditional on i and j being paired but not to each other. Since these are the only four possible cases, the optimal score S(i,j) is just the maximum of the four possibilities. We’ve thus defined the optimal score S(i,j) recursively as a function only of optimal scores of smaller subsequences; so we only need to remember these scores, not the combinatorial explosion of possible structures. Mathematically this recursion looks like: Nussinov Algorithm . r t e s d f f s n e r r ; - same thing, but conditional on j being unpaired. In case 4, where we deal with putting two independent sub-structures together, the j of anyoptimal score S(i,k) is independent versa, so thing going on in k + 1..j, and vice A A G S(i,k) + S(k + 1,j) is the score of the optimal A 0 on 0 structure on i,j conditional i and j 0 being paired but not to each other. A the only 0 four possible 0 0 Since these are cases, i the optimal score G S(i,j) is just the 0 maximum 0 of the four possibilities. We’ve thus defined the optimal score U S(i,j) recursively as a0 function only of optimal scores of smaller subU only need to remember sequences; so we these scores, not the combinatorial explosion C Fill of in possible structures. Mathematically this highlighted recursion looks like: G Nussinov Algorithm square: (i = 1,j = 7) S(i,j) = max S(i + 1,j – 1) +1 [if i,j base pair] S(i + 1,j) S(i,j – 1) maxi<k<j S(i,k) + S(k + 1,j) U U C G 1 2 2 3 1 1 1 1 0 0 1 1 0 0 0 1 0 0 0 1 0 0 1 0 0 A-­‐G don’t base pair = 1 = 2 k = 2: S(1,2) + S(3,7) = 0+1 = 1 k = 3: S(1,3) + S(4,7) = 0+1 = 1 k = 4: S(1,4) + S(5,7) = 1+1 = 2 k = 5: S(1,5) + S(6,7) = 2+1 = 3 k = 6: S(1,6) + S(7,7) = 2+0 = 2 Nussinov Algorithm -­‐traceback j i A A A 0 A 0 G G U U C G 0 0 1 2 2 3 0 0 1 1 1 1 0 0 0 0 1 1 0 0 0 0 1 0 0 0 1 0 0 1 0 0 U U C G start here Can you draw this folded RNA? k = 5: S(1,5) + S(6,7) = 2+1 = 3 Op'mal sub-­‐structure from 1-­‐5 (with 2 matches) Op'mal sub-­‐structure from 6-­‐7 (with 1 match) Nussinov Algorithm -­‐traceback j i A A A 0 A 0 G G U U C G 0 0 1 2 2 3 0 0 1 1 1 1 0 0 0 0 1 1 0 0 0 0 1 0 0 0 1 0 0 1 0 0 U U C G Can you draw this folded RNA? G A A U U C G start here -­‐ note that in reality, stems can’t form if the loop is less than 3bp due to restric'ons on backbone angles Improvements on Nussinov algorithm • Nussinov is the “core” of most RNA folding programs, but they all have bells & whistles 3’…CCAUUCAUAG…5’ RNA Energetics I – Take into account that loop must be 3 or more nucleo'des |||||| – Not all base pairs are equal in reality (we treated them all at +1 in Nussinov) 5’…CGUGAGU…3’ – Base stacking interac'ons • base pairing: G > C A U > G U • base stacking: GpA | | CpU 35 Base stacking contributes more source unknown. All rights reserved. This content is excluded from our Creaative to free energy than base pairing © Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/. Improvements on Nussinov algorithm • Nussinov is the “core” of most RNA folding programs, but they all have bells & whistles Take into account that loop must be 3 or more nucleo'des Not all base pairs are equal in reality (we treated them all at +1 in Nussinov) Base stacking interac'ons Penalizes interior bulges Extra terms at terminal ends of RNA exposed to solvent -­‐Nussinov algorithm cannot detect pseudoknots, since these do not sa'sfy the recursive assump'on that each structure can be split into smaller self-­‐contained sub-­‐structures -­‐ more advanced algorithms – With all these addi'ons, mfold gets ~70% of bases correctly folded; preYy good on average but would likely want to do in vivo structure profiling of your RNA if you really want to know its structure – – – – – – 36 Happy Spring Break! 37 MIT OpenCourseWare http://ocw.mit.edu 7.91J / 20.490J / 20.390J / 7.36J / 6.802J / 6.874J / HST.506J Foundations of Computational and Systems Biology Spring 2014 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.