Document 13496310

advertisement
RecitaLon 3-­‐19
CB Lecture #10
RNA Secondary Structure
1
Announcements
•  Exam 1 grades and answer key will be posted
Friday a=ernoon
–  We will try to make exams available for pickup
Friday a=ernoon (probably from 3:30-­‐4pm and
5-­‐5:30pm, before and a=er the Friday secLon)
•  Pset #3 has been released, due April 3rd
–  much longer programming problem than Pset #2
–  Because of spring break, only one set of formal
office hours before due date, but please email us
with your quesLons
•  Updated aims with research strategy will be
due Friday April 4th
2
RNA Secondary Structure
• Just as protein can form secondary structure (α-­‐helix and β-­‐
  sheet), so too can single-­‐stranded RNA by folding back on
itself to form double-­‐stranded regions
rRNA mRNA tRNA © unknown. All rights reserved. This content is excluded from
our Creative Commons license. For more information,
see http://ocw.mit.edu/help/faq-fair-use/.
Source: http://www.mun.ca/biology/scarr/rRNA_folding.html
© source unknown. All rights reserved. This
content is excluded from our Creative
Commons license. For more information,
see http://ocw.mit.edu/help/faq-fair-use/.
© Dept. Biol. Penn State. All rights reserved. This
content excluded from our Creative Commons
license. For more information, see http://ocw.mit.
edu/help/faq-fair-use/.
hNp://www.uic.edu/classes/phys/phys461/phys450/ANJUM04/RNA_sstrand.jpg
hNps://www.mun.ca/biology/scarr/rRNA_folding.html hNps://wikispaces.psu.edu/download/aNachments/54886630/figure_17_12.jpg
3
…and virtually every other RNA! RNA Secondary Structure
•  RNA’s secondary structure is o=en inLmately Led with its
funcLon
–  rRNA and tRNA always adopt the same structure; funcLon depends on it
–  mRNA may adopt different structures in different condiLons – due to cell
types, temperature, ion concentraLon, etc.
•  mRNA’s processing may
depend on what structure is
(or is not) present
-­‐ Can inhibit or strengthen ability of
RNA binding protein to bind mRNA and affect alternaLve splicing
-­‐ Can inhibit the ribosome’s ability to
translate through the mRNA due to
sequestraLon of ribosome binding
site or hiOng structured road block
Schematic diagram of an E. coli cell removed due
to copyright restrictions. See the image here.
Riboswitches are metabolite-­‐sensing RNAs,
typically located in the non-­‐coding porLons
of messenger RNAs, that control the
synthesis of metabolite-­‐related proteins
4
hNp://www-­‐abell.ch.cam.ac.uk/images/riboswitchfig1.jpg
RNA is at the catalyLc site of the
ribosome (which is a ribozyme)
RNA/protein distribution on the 50S ribosome
linguini = protein
fettucini = RNA
© American Association for the Advancement of Science. All rights reserved. This content is excluded
from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.
Source: Ban, Nenad, Poul Nissen, et al. "The Complete Atomic Structure of the Large Ribosomal
Subunit at 2.4 Å Resolution." Science 289, no. 5481 (2000): 905-20.
5
-­‐Ribozymes – RNAs capable of catalyzing biochemical reacLons -­‐ provide support for “RNA world” hypothesis – that life evolved from a world with RNAs but no DNA or protein
Terminology
© Oxford University Press. All rights reserved. This content is excluded from our Creative
Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.
© Kenyon College Biology Department. All rights reserved. This
Source: Ding, Ye, and Charles E. Lawrence. "A Statistical Sampling Algorithm for RNA
content is excluded from our Creative Commons license. For
Secondary Structure Prediction." Nucleic Acids Research 31, no. 24 (2003): 7280-301.
more information, see http://ocw.mit.edu/help/faq-fair-use/.
6
hNp://www.ims.nus.edu.sg/Programs/biomolecular07/files/
clote_tut2b.pdf
hNp://biology.kenyon.edu/courses/biol114/KH_lecture_images/
transcripLon/FG03_15b.JPG
Arc NotaLon © source unknown. All rights reserved. This content is excluded from our Creaative
Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.
7
hNp://www.ims.nus.edu.sg/Programs/biomolecular07/files/clote_tut2b.pdf non-­‐coding RNAs (ncRNAs) •  Any RNA molecule that doesn’t code for protein (any non-­‐mRNA molecule) –  tRNAs, rRNAs, miRNAs, snRNAs, snoRNAs, ribozymes (RNase P), lnRNAs, riboswitches
•  Due to the central role of structure in facilitaLng RNA’s funcLon,
we’d like the determine structure
–  2 different approaches for secondary structure
1. CovariaLon and compensatory changes through evoluLon
2. Energy minimizaLon
8
CovariaLon and compensatory changes
•  Idea: If structure is contribuLng to funcLon but actual sequence
is not, we should see structure conserved but not necessarily
sequence
–  So evoluLon allows
as longby
as covariation
secondary structure
is
RNAmutaLons
2o structure
/
maintained
compensatory changes
9
Seq1:
A C G A A A G U
Seq2:
U A G U A A U A
Seq3:
A G G U G A C U
Seq4:
C G G C A A U G
Seq5:
G U G G G A A C
CovariaLon and compensatory changes
•  Need sufficient divergence so that a decent number of
mutaLons and compensatory mutaLons have occurred, but not
so much that sequences can’t be aligned
•  Need a large number of homologs sequenced to have power to
detect compensatory mutaLons
10
Mutual InformaLon (MI)
•  The most common way of quanLfying sequence covariaLon for
the purpose of RNA secondary structure determinaLon
•  A measure of two variables' mutual dependence
–  Measures the informaLon that X and Y share: it measures how
much knowing one of these variables reduces uncertainty about the
other
–  If X and Y are independent, then knowing X does not give any informaLon about Y and vice versa, so their MI = 0
–  At the other extreme, if X is a determinisLc funcLon of Y and Y is a
determinisLc funcLon of X, then all informaLon conveyed by X is
shared with Y: knowing X determines the value of Y and vice versa
•  As a result, in this case the mutual informaLon is the same as the
uncertainty contained in Y (or X) alone, namely the entropy of Y (= entropy of X)
–  Mutual informaLon between aligned columns of nucleoLdes that
are base-­‐paired should be high
11
•  Knowing one of the nucleoLdes tells you everything about the other (if A,
other is U; if C, other is G, etc.)
Mutual InformaLon (MI)
MI between two columns i and j:
(i,j)
fx,y : fracLon of sequences with x in column i AND y in column j
fx(i) : fracLon of sequences with x in column i
12
-­‐RelaLve entropy of the joint distribuLon relaLve to the individual distribuLons
of the nucleoLdes in columns i and j
-­‐MI is maximal (2 bits) if x and y appear at random (all 4 nts equally likely) but
perfectly covary (e.g. always complementary)
Mutual InformaLon (MI)
MI is maximal (2 bits) if x and y appear at random (all 4 nts equally
likely) but perfectly covary (e.g. always complementary)
(i,j)
f
What is x,y ? (i,j)
fx,y
Because x and y perfectly covary, 1
= for the 25% of covarying events (e.g. (x,y) = (A,U))
4
(i,j)
fx,y
= 0 for the 75% of non-existent events (e.g. (x,y) = (A,A))
13
(i) (j)
What is fx fy ?
1
1 1
⇤ =
4 4
16
Mutual InformaLon (MI)
MI is maximal (2 bits) if x and y appear at random (all 4 nts equally
likely) but perfectly covary (e.g. always complementary)
X
1
=
log2
4
(x,y)=(A,U ),(C,G),(G,C),(U,A)
X
1
⇤2
=
4
(x,y)=(A,U ),(C,G),(G,C),(U,A)
14
= 2 bits
✓
1/4
1/16
◆
Energy
Minimization
App
nd
2 approach: Energy minimiza'on ΔGfolding = Gunfolded - Gfolded
-­‐ Assume that RNA will fold to its lowest energy state
re-­‐  are
typically many possible folded states
Simplest model: all base pairs contribute equally to lowering structure’s energy
assumption
that minimum
energy
state(s)
will be oc
-­‐  Base Pair MaximizaLon
(ignores energy
contribuLons
of base stacking,
loops, entropy, etc.): +1 for paired bases, 0 for unpaired
-­‐  Use the Nussinov algorithm of recursive maximizaLon of base pairing
ΔG = ΔH - TΔS
halpy favors folding
15
See the Eddy Nature Biotechnology Primer 2004
Nussinov algorithm
•  Look at one conLguous sub-­‐sequence from posiLon i to posiLon j in our
complete sequence of length N, and calculate the score of the best structure
for just that sub-­‐sequence
•  This opLmal score (call it S(i,j)) can be defined recursively in terms of opLmal
scores of smaller sub-­‐sequences
•  Four possible ways that a structure of nested base pairs on i...j can be
constructed
RRIIM
MEER
R
PPRRI IMME1.ERi,Rj are a base pair, added on to a structure for i+1 ... j–1
2. i is unpaired, added on to a structure for i+1 ... j
3. j is unpaired, added on to a structure for i ... j–1
i, j are paired,
butbest
not
toscore
each other;
the
structure for i...j
together
sub-­‐structures
Recursive
definition
ofofthe
for
i,ji,jadds
looks
atatfour
possibilities:
Recursive4.
definition
the
best
score
foraasub-sequence
sub-sequence
looks
four
possibilities:
aa Recursive
definition
of
the
best
score
for
a
sub-sequence
i,j
looks
at
four
possibilities:
for
two
sub-­‐sequences,
i
...
k
and
k+1
...
j
(a
bifurcaLon)
Recursive definition of the best score for a sub-sequence i,j looks at four possibilities:
S(
+i +1,j1,j––1)1)
S(iS(
i +i 1,j
– 1–)1)
S(
+ 1,j
S(
i +i +
1,j1,j
))
S(
S(S(
i +i 1,j
))
+ 1,j
S(S(
i,ji,j– 1–)1)
S(
i,ji,j––
1)1)
S(
S(
i,k ) )
S(
S(
i,ki,ki,k
))
S(
i+
i +1i1+i 1+ 1 j –j –1j –
1j 1– 1
ii i i
jjj j
16
i i i ii+i i+
1+i11+ 1 j j j
i ii i
j j–j–j1–11jj j
1.1.1.
i,ji,j
pair
j junpaired
1.i,j
i,jpair
pair 2.2.
i unpaired 3.3.
3.
unpaired
2.i2.
i unpaired
unpaired
pair
iunpaired
3.
j junpaired
unpaired
ii i i kkkk
S(
k +k1,j
)
+)1,j
S(
S(kS(
k++1,j
1,j
) )
k111+ 1 j jj j
kkk+++
Bifurcation
Bifurcation
4.4.
4.Bifurcation
Bifurcation
theofoptimal
is just the
maximum
ofthe top
of thescore
four S(i,j)
possibilities.
We’ve
thus defi
the four
possibilities.
We’ve
thus defined
of
the
optimal
score
S(i,j)
recursively
as a fu
sible of
ways
Nussinov algorithm
optimal
score
S(i,j)
recursively
as
a
funcyson i..jthe
tion
only
of
optimal
scores
of
smaller
s
can
1. i,j are a base pair, added on to a structure for i+1 ... j–1
–  Theonly
scoresequences;
we add
foroptimal
the baseso
pair i,jwe
isscores
independent
ofneed
any
detailsto
of the
of
of
smaller
subn tion
only
remem
opLmal structure on i + 1...j – 1
–  Similarly,these
the opLmal
structure
on
i + 1...j
– need
1 and
its score
S(i +remember
1, j – 1) are
sequences;
so
we
only
to
scores,
not
the
combinatorial
explos
structure
unaffected by whether i, j are base paired or not (or anything else that
happens
in thepossible
restnot
of the sequence)
of
structures.
Mathematically
these
scores,
the
combinatorial
explosion
P
R
I
M
E
R
re
–  Therefore, S(i, j) is just S(i + 1, j – 1) plus one, if i, j can base pair.
recursion
looks
like:
of
possible
structures.
Mathematically this
ucture for
a Recursive
definition of the
best score
for a sub-sequence i,j looks at four possibilities:
recursion
looks
like:
or
S(i + 1,j – 1) +1 [if i,j base p
S(i,j – 1)
ucture for
S(i + 1,j) S(i,k) S(k +1,j )
S(i,j) = S(i
max+ 1,j – 1) +1 [if i,j base pair]
S(i,j – 1)
or
S(i
+
1,j)
S(i,j)
=
i +1
j –1 max
maxi<k<j S(i,k) + S(k + 1,j)
ch other;
S(i + 1,j )
S(i + 1,j – 1)
i
j
ther sub-1. i,jpair
r;
17
i
k
k +1
j
S(i,j –i 1) j –1 j
2. i unpaired
max3. j unpaired
S(i,k) +4. Bifurcation
S(k + 1,j)
i
i +1
j
on i,j conditional
on i a
ted
base pairs
the
sub-sequence
structure
onthat
i,j conditional
on i andstructure
j being
structure
on
i,j
conditional
of
nested
base
pairs
that
the
sub-sequence
ture on i,j conditional on i and j being
pairedpaired
but notbut
to not
eachtoother.
The
key
is key
to recognize
that this
paired
but
not
to
each
other.
each other.
form.
The
is
to
recognize
that
this
d but not to each other.
Since
these
are
theare
only
four
pos
core
(call these
it(call
S(i,j))
canonly
becan
defined
Since
are
the
four
possible
cases,
Since
these
the
only
fo
mal
score
it
S(i,j))
be
defined
nce these are the only four possible cases,
the
optimal
score S(i,j)
isS(i,j)
just is
theju
y the
in optimal
terms
ofscore
optimal
scores
of maximum
S(i,j)
is
just
the
the
optimal
score
ursively
in
terms
of
optimal
scores
of
ptimal score
S(i,j)
is just
the maximum
2.
i
is
unpaired,
added
on of
to a structure
for i+1
... j
of
the
four
possibilities.
We’ve We
th
b-sequences.
As
shown
in
the
top
of
the
four
possibilities.
We’ve
thus
defined
of
the
four
possibilities.
ller
sub-sequences.
As
shown
in
the
top
of
•  In
case 2, the opLmal
scorethus
S(i + 1,j)
is independent of the addiLon of an
e four
possibilities.
We’ve
defined
the
optimal
score
S(i,j)
there
are
only
four
possible
ways
the
optimal
score
as
a
funcunpaired
base
i, S(i,j)
so four
S(i +recursively
1, possible
j) + 0 is the score
of
the
opLmal
structure
onrecursively
i,j
the
optimal
score
S(i,j) recu
ure
1,
there
are
only
ways
ptimal condiLonal
score S(i,j)
recursively
as
a
funcon i being
unpaired
tionsubonly of optimal scores of sm
cture
ofonly
nested
base
pairsscores
on i..jof
cansmaller
tion
of
optimal
tion only of optimal scores
a
structure
of
nested
base
pairs
on
i..j
can
only
of
optimal
scores
of
smaller
subsequences;
need to
ucted:
3. j is so
unpaired,
added
on to
to aremember
structure
forso
i ... we
j–1soonly
sequences;
we
only
need
R
I
M
E
R
sequences;
we
only nee
onstructed:
so
we
only
need
to
remember
Pences;
R I •M
E
R
  pair,
Case added
3 is the
same
thing,
but condiLonal
on j being
unpaired
these
scores,
not the combinatoria
these
scores,
not
the
combinatorial
explosion
base
on
to
a
structure
these scores, not the combin
not the
j scores,
are a base
pair,combinatorial
added on to aexplosion
structure
of possible
structures. Mathema
1..jof– 1.
possible
structures.
Mathematically
this
ofi,j possible
structures.
Ma
ossible
structures.
Mathematically
this
Recursive
definition
of the
best score for a sub-sequence
looks
at
four
possibilities:
or
i
+
1..j
–
1.
recursion
looks
like:possibilities:
a recursion
Recursive definition
of the best score for a sub-sequence
i,j looks
at four
looks
like:
paired,
added
on
to
a
structure
for
recursion looks like:
sion
looks like:
is
unpaired,
added
on
to
a
structure
for
S(
)
S(S(
i + 1,j – 1)
S(ii++1,j
1,j )
i + 1,j – 1)
S(i,ji,j––11))
S(
S(i + 1,j – 1) +1 [if i
+ 1..j.
S(i
+
1,j
–
1)
+1
[if
i,j
base
pair]
+ 1,j – 1) +
paired, added
a structure
for pair]
k1,j+S(i
S(i on
+ 1,jto– 1)
+1 [if i,j base
S(S(
i,ki,k
) ) S(iS(+
kS(
+1,j)
)1,j )
is unpaired,
added
to a structure forS(i,j) = max
S(i on
+ 1,j)
S(i + 1,j)
S(i,j) = max
S(i
+
1,j)
S(i,j
–
1)
S(i,j)
=
max
,j)
.j –=1.max S(i,j – 1)S(i,j – 1)
S(i,j
– 1)+ S(k
maxi<k<j
S(i,k)
paired, ibut
+i1+ 1 not
j –j –
11 to each other;
maxnot
S(i,k)
+ S(kother;
+ 1,j)
i<k<j+to
max
S(i,k
jucture
are paired,
but
each
S(i,k)
S(k
+
1,j)
max
i<k<j
i
j
–
1
j
i
i
+
1
j
i
k
k
+
1
j
i
j
i
j
–
1
j
i
i
+
1
i
k
k
+
1
j
i
j
i<k<j
for i..j adds together subhe
adds
together
subpair i..j 2.
2.
unpaired
4.4.
Bifurcation
i,ji,jfor
pair
i i unpaired
unpaired
Bifurcation
res structure
for two1.1.sub-sequences,
i..k and3.3. jj unpaired
To run this
recursion efficient
Nussinov algorithm
18
quence structure on i,j conditional on i and j being
the optimal score S(i,j) is just the maximum
at this paired but not to each other.
of
the
four
possibilities.
We’ve
thus
defined
Since these are the only four possible cases,
efined
theofoptimal
score S(i,j)
recursively
as amaximum
functhe
optimal
score
S(i,j)
is
just
the
res
4. i,j are paired, but not to each other; the structure for i...j adds together
tion
of optimal
ofWe’ve
smaller
sub-­‐structures
twoscores
sub-­‐sequences,
i ...thus
k andsubk+1 ... j (a bifurcaLon)
of the
fourforpossibilities.
defined
top
of only
Weoptimal
deal
two
independent
sub-­‐structures
so with
we puOng
onlyS(i,j)
need
to remember
score
recursively
as a func-together, the
e sequences;
ways -­‐  the
opLmal score S(i,k) is independent of anything going on in k+1 ... j, and
these
scores,
not of
theoptimal
combinatorial
only
scores ofexplosion
smaller subi..j can tion
vice versa
R IofM possible
E R sequences;
soall we
only
remember
structures.
Mathematically
-­‐  Must consider
possible
k’s bneed
etweentoi and
jthis
these
scores,
recursion
looks
like:not the combinatorial explosion
ucture
of possible
structures.
Mathematically
Recursive definition
of the best
score for a sub-sequence
i,j looksthis
at four possibilities:
recursionS(i
looks
like:
+ S(
1,j
– 1) +1 [if i,j base pair]
ure for S(
i + 1,j )
i + 1,j – 1)
S(i,j – 1)
S(i
+
1,j)
S(i,j) = max
S(i + 1,j – 1) +1 [if i,j baseS(pair]
i,k )
S(k + 1,j )
S(i,j – 1)
ure for
S(iS(i,k)
+ 1,j) + S(k + 1,j)
maxi<k<j
S(i,j) = max
Nussinov algorithm
i +1
j –1
S(i,j – 1)
+j –S(k
i
1 j + 1,j) i
i + 1max
j i<k<j S(i,k)
other; i
i
k
k +1
j
r subTo run
this recursion
1. i,j pair
2. i unpaired efficiently,
3. j unpairedwe just
4. Bifurcation
19
j
Since these are the only four possible cases,
re (call it S(i,j)) can be defined
in terms of optimal scores of the optimal score S(i,j) is just the maximum
sequences. As shown in the top of of the four possibilities. We’ve thus defined
ere are only four possible ways the optimal score S(i,j) recursively as a func  Sincebase
these
only four
cases, the
opLmal
scoresubtion possible
only of optimal
scores
of smaller
ure of•nested
pairsare
onthe
i..j can
S(i, j) is just the maximumsequences;
of the four
sopossibiliLes
we only need to remember
ed:
•  We’ve thus
the opLmal
scorenot
S(i,j)
as explosion
a
these scores,
the recursively
combinatorial
ase pair,
added on
to defined
a structure
funcLon only of opLmal scores
of smaller
sub-­‐sequences,
so this
of possible
structures.
Mathematically
– 1.
we
only need to remember
these looks
scores,
not the
RRred,
IIM
EER
recursion
like:
Madded
R on
to a structure for
Nussinov algorithm
PPRRI IMME
ERR
combinatorial
explosion of possible structures
S(i + 1,j – 1) +1 [if i,j base pair]
red, addeddefinition
on to aof structure
for for a sub-sequence i,j
S(i
+looks
1,j)atatfour
Recursive
the
best
score
Recursive definition of the best score for aS(i,j)
sub-sequence
i,jlooks
fourpossibilities:
possibilities:
=
max
aa Recursive
definition
of
the
best
score
for
a
sub-sequence
i,j
looks
at
four
possibilities:
Recursive definition of the best score for a sub-sequenceS(i,j
i,j looks
at
four
possibilities:
– 1)
S(
i + 1,j ) )
maxi<k<j S(i,k) + S(k + 1,j)
i +i +1,j1,j––1to
)1) each other;
S(
aired, butS(
not
S(
S(ii++1,j
1,j )
S(S(
i +i 1,j
– 1–)1)
+ 1,j
S(i + 1,j )
ture for i..j adds together subfor two sub-sequences, i..k and
bifurcation).
S(S(
i,ji,j– 1–)1)
S(
i,ji,j––
1)1)
S(
S(
i,k ) )
S(
S(
i,ki,ki,k
))
S(
S(
k +k1,j
)
+)1,j
S(
S(kS(
k++1,j
1,j
) )
To run this recursion efficiently, we just
need to make sure that whenever we try to
i+
–j –1j –
i +1i1+i 1+If
1j 1–add
1 jwe
1
the first case.
on a i,j compute an S(i,j), we already have calculated
j j–j–j1–11jj j
i i i ii+i i+
1+i11+ 1 j j j
jjj j
k111+ 1 j jj j
ii i
ii i i kkkk kkk+++
i i
the iscores
for smaller
sub-sequences.
This
nto i + 1..j –i i 1,
what
is the
score
1.1.1.
i,ji,j
pair
2.2.
3.3.
jup
Bifurcation
1.i,j
i,jpair
pair the
i unpaired sets
3.
unpaired
Bifurcation
20
2.i2.
i unpaired
unpaired
junpaired
4.4.
a dynamic programming
ially,
we know
(from
definipair
iunpaired
3.
j junpaired
unpaired
4.Bifurcation
Bifurcation algorithm.
IMER
Nussinov algorithm
Pdefinition
R II M
M EE R
R the best score for a sub-sequence i,j looks at four possibilities:
Recursive
of
P
R
PRIMER
aa
S(i + 1,j )
S(i + 1,j – 1)
S(for
i,j – 1)sub-sequence i,j looks at four possibilities:
Recursive definition
definition of
of the
thebest
bestscore
score
Recursive
for aasub-sequence
i,j looks at four possibilities:
S(
)
k + 1,j
)
a Recursive definition of the best
score for a sub-sequence i,ji,klooks
atS(
four
possibilities:
S(i i++1,j1,j))
S(ii++1,j
1,j––11))
S(
S(
S(i,ji,j––1)1)
S(
S(i + 1,j )
S(i + 1,j – 1)
S(i,j – 1)
S(i,ki,k) )
S(
+ 1,j
S(
S(
k k+ 1,j
))
i +1
j –1
S(i,k )
S(k + 1,j )
j
i
1. i,j pair
ii++11
2.
ii i + 1
i
i +1
i
j
j –1 j
i
j j––11
i unpaired
3.j j unpaired
i
i i +1
j j –1
i
j
i
j
i +1
j
i i +1
i
i
j
j –11 j j
j–
j –1 j
k
k +1
j
4. Bifurcation
i
k
k +1
i
k
i
k
k +1
k +1
j j
j
ww.nature.com/naturebiotechnology
ww.nature.com/naturebiotechnology
ww.nature.com/naturebiotechnology
1. i,j
i,j pair
pair
2. i iunpaired
unpaired
unpaired
Bifurcation
1.
2.
3.3.j junpaired
4.4.Bifurcation
1. i,j pair for 2.
3. j unpaired
4. Bifurcation
Dynamic programming algorithm
alli unpaired
sub-sequences
i,j, from smallest
to largest:
j
j b Dynamic programming algorithm
j smallest to largest:
forall
allsub-sequences
sub-sequencesi,j,
i,j,from
from
ithm for
smallest to largest:
G G G A
G 0
G 0 0
G
0 0
A
0 0i
A
0
A
U
C
C
21
bA
ADynamic
U C C programming
G G algorithm
G A A A
f U C C
j
G 0 0 0 0 0 0j 1 2 3
G G G Aj A A U C C
G G G Aj A A U
1 2 3
G 0
G
G 0 G G G A A AG U 0C 0C 0 G 00 00
0 G0 A0 A0 A1
G
G
0
0
0
0
0 0 G 00 00 00 01 002 00201
G 0 0
i G
G
G G 00 00
A
i 0 G 0 000 000 001 001 00101
i
0A
A
G
A
0 0A
A
0A 0
U
U 0
C
C
C
C
G
0 0 0 0 0
Ai 0 0 0
0 0
1 01 01 1
A
0 0 0 0
00 00
A
0 0
0 0
A
0 0
0 0
0 0U
0 0
0C 0
0
A
A
U
0 0
0 0
C
C0 0
0 0
Initialization;Initialization;
Initialization;
0 0
C
A
A
U
C
C
0
0 01 001 00101
0 0 1
0 0 0 0 00
0 0
0 0 00
0
0 0
G
G 0
C C
U2 C3 G
C 0
1 i2 G
3
2 3
1
2
1
1
1
1
1
1
1
0
0
0
0
0
recursive
fill;
recursive
fill;fill;
recursive
2 A
3
2
2 2
1A
1 1
1
1 A
1
1
1 U
1
0
0 C
0
0
0 C
0
0
0 0
G G A A A U C C
j
0 0 0j0 0 1 2 3
GG G G G G A A A A A A U U C C C C
0
0 00 00 0 1 0 21 32 3
G
0 00
G 00
0 0 0 0 1 2 3
0 0 000 000 0 1
0 00
G0G 00
0 0 121 222 3 3
i i GG 0 0 00 0 000 000 0 10 0 111 212 2 2
AA
AA
AA
UU
CC
CC
0 0 111 111 1 1
0 0 000 000 0 1
0 0 111 111 1 1
00 000 0 1
00 00 1 1 1 1 1 1
0 0 0 0
00 00 0 00 0
0 00 00 0
0 0 0
0 000 0 0
traceback;
traceback;
traceback;
AA AA
A A
AA U U
UC
GC
GA
G
CC
G
GG C
GGG C
result.
result.
result.
where inside
add a score f
get S(i,j), be
whereinside
insideth
where
available
to
add
a
score
fo
add
a score
for
where
inside
th
already
pair
getS(i,j),
beca
add
a S(i,j),
scorebeca
for
get
optimal
sub
available
get
S(i,j), becau
available
totob
hasn’t
kept
alreadypaired
paire
available
to ba
already
how
the
recw
optimal
subalready
paired
optimal
sub-st
hasn’t
kepttra
tr
optimal
sub-str
to
remembe
hasn’t
kept
hasn’t
kept
trac
howthe
the
recu
how
the
recurs
of
com
how
the recursS
toremember
remember
to
structures
o
to
remember
S
of
the
comb
of
the combin
recursion
is
of
the
combin
structures
structures
onon
Thereison
are
structures
recursion
isinv
it
recursion
recursion
isare
invR
with
pseudo
There
There
are
RN
There
are
RN
withpseudokn
pseudo
serious
limi
with
with
pseudokn
serious
limita
serious
limitat
cient
algorit
serious limitati
cientalgorithm
algorith
cient
ing)
that
ca
cient
algorithm
ing)that
thatcan
cang
ing)
longthat
as can
onegu
ing)
longasasone
oneuse
u
long
long
as onesyste
use
scoring
scoring
system
scoring
system
scoring
system,
dependent
dependent
tht
dependent
the
dependent
ther
plex
dynamp
plexdynamic
dynami
plex
dynamic
plex
guaranteeopti
op
guarantee
o
guarantee
opt
guarantee
Nussinov algorithm
•  Storing the S(i, j) matrix requires memory proporLonal to
N2, similar to what sequence alignment algorithms need
•  However, the innermost loop of having to find opLmal
potenLal bifurcaLon points k means that the folding
algorithm requires Lme proporLonal to N3, a factor of N more Lme-­‐intensive than sequence alignment
–  RNA folding calculaLons o=en require a large amount of computer
power
22
e for just that thing going on in k + 1..j, and vice versa, so
imum num- S(i,k) + S(k + 1,j) is the score of the optimal
ub-sequence structure on i,j conditional on i and j being
ize that this paired but not to each other.
We want to fold the following RNA sequence:
Since these are the only four possible cases,
n be defined
AAGUUCG
S(i,j) is just the maximum
al scores of the optimal score
in the top of of the four possibilities. We’ve thus defined
(1)  Write the sequence along the top and le= side of the matrix
ossible ways the optimal score S(i,j) recursively as a funcirs on i..j can tion only of optimal scores of smaller subsequences;
so weofonly
need toand
remember
(2) IniLalize
the diagonal
the matrix
one-­‐below to zero
o a structure these scores, not the combinatorial explosion
of possible structures. Mathematically this
recursion
looks
like:
(3) Fill
jth entries
according
to
structure
forin i, Nussinov Algorithm Example
structure for
each other;
ogether subnces,
i..k and
23
S(i,j) = max
S(i + 1,j – 1) +1 [if i,j base pair]
S(i + 1,j)
S(i,j – 1)
maxi<k<j S(i,k) + S(k + 1,j)
To run this recursion efficiently, we just
Nussinov Algorithm -­‐ iniLalizaLon
24
.
r
t
e
s
d
f
f
s
n
e
r
r
;
- 25
same thing, but conditional on j being unpaired. In case 4, where we deal with putting
two independent sub-structures together, the
optimal score S(i,k) is independent of anything going on in k + 1..j, and vice versa, so
S(i,k) + S(k + 1,j) is the score of the optimal
structure on i,j conditional on i and j being
paired but not to each other.
Since these are the only four possible cases,
the optimal score S(i,j) is just the maximum
of the four possibilities. We’ve thus defined
the optimal score S(i,j) recursively as a function only of optimal scores of smaller subsequences; so we only need to remember
these scores, not the combinatorial explosion
of possible structures. Mathematically this
recursion looks like:
Nussinov Algorithm
.
r
t
e
s
d
f
f
s
n
e
r
r
;
- 26
same thing, but conditional on j being unpaired. In case 4, where we deal with putting
two independent sub-structures together, the
optimal score S(i,k) is independent of anything going on in k + 1..j, and vice versa, so
S(i,k) + S(k + 1,j) is the score of the optimal
structure on i,j conditional on i and j being
paired but not to each other.
Since these are the only four possible cases,
the optimal score S(i,j) is just the maximum
of the four possibilities. We’ve thus defined
the optimal score S(i,j) recursively as a function only of optimal scores of smaller subsequences; so we only need to remember
these scores, not the combinatorial explosion
of possible structures. Mathematically this
recursion looks like:
Nussinov Algorithm
same thing, but conditional on j being unpaired. In case 4, where we deal with putting
two independent sub-structures together, the
optimal score S(i,k) is independent of anything going on in k + 1..j, and vice versa, so
S(i,k) + S(k + 1,j) is the score of the optimal
structure on i,j conditional on i and j being
paired but not to each other.
Since these are the only four possible cases,
the optimal score S(i,j) is just the maximum
of the four possibilities. We’ve thus defined
the optimal score S(i,j) recursively as a function only of optimal scores of smaller subsequences; so we only need to remember
these scores, not the combinatorial explosion
of possible structures. Mathematically this
recursion looks like:
.
r
t
e
s
d
f
f
s
n
Nussinov Algorithm
e
r
r
;
-
27
.
r
t
e
s
d
f
f
s
n
e
r
r
;
- 28
same thing, but conditional on j being unpaired. In case 4, where we deal with putting
two independent sub-structures together, the
optimal score S(i,k) is independent of anything going on in k + 1..j, and vice versa, so
S(i,k) + S(k + 1,j) is the score of the optimal
structure on i,j conditional on i and j being
paired but not to each other.
Since these are the only four possible cases,
the optimal score S(i,j) is just the maximum
of the four possibilities. We’ve thus defined
the optimal score S(i,j) recursively as a function only of optimal scores of smaller subsequences; so we only need to remember
these scores, not the combinatorial explosion
of possible structures. Mathematically this
recursion looks like:
Nussinov Algorithm .
r
t
e
s
d
f
f
s
n
e
r
r
;
- 29
same thing, but conditional on j being unpaired. In case 4, where we deal with putting
two independent sub-structures together, the
optimal score S(i,k) is independent of anything going on in k + 1..j, and vice versa, so
S(i,k) + S(k + 1,j) is the score of the optimal
structure on i,j conditional on i and j being
paired but not to each other.
Since these are the only four possible cases,
the optimal score S(i,j) is just the maximum
of the four possibilities. We’ve thus defined
the optimal score S(i,j) recursively as a function only of optimal scores of smaller subsequences; so we only need to remember
these scores, not the combinatorial explosion
of possible structures. Mathematically this
recursion looks like:
Nussinov Algorithm same thing, but conditional on j being unpaired. In case 4, where we deal with putting
two independent sub-structures together, the
optimal score S(i,k) is independent of anything going on in k + 1..j, and vice versa, so
S(i,k) + S(k + 1,j) is the score of the optimal
structure on i,j conditional on i and j being
paired but not to each other.
Since these are the only four possible cases,
the optimal score S(i,j) is just the maximum
of the four possibilities. We’ve thus defined
the optimal score S(i,j) recursively as a function only of optimal scores of smaller subsequences; so we only need to remember
these scores, not the combinatorial explosion
of possible structures. Mathematically this
recursion looks like:
.
r
t
e
s
d
f
f
s
n
Nussinov Algorithm e
r
r
;
-
30
.
r
t
e
s
d
f
f
s
n
e
r
r
;
- 31
same thing, but conditional on j being unpaired. In case 4, where we deal with putting
two independent sub-structures together, the
optimal score S(i,k) is independent of anything going on in k + 1..j, and vice versa, so
S(i,k) + S(k + 1,j) is the score of the optimal
structure on i,j conditional on i and j being
paired but not to each other.
Since these are the only four possible cases,
the optimal score S(i,j) is just the maximum
of the four possibilities. We’ve thus defined
the optimal score S(i,j) recursively as a function only of optimal scores of smaller subsequences; so we only need to remember
these scores, not the combinatorial explosion
of possible structures. Mathematically this
recursion looks like:
Nussinov Algorithm .
r
t
e
s
d
f
f
s
n
e
r
r
;
-
same thing, but conditional on j being unpaired. In case 4, where we deal with putting
two independent sub-structures together, the
j
of anyoptimal score S(i,k) is independent
versa, so
thing going on in k + 1..j,
and
vice
A A G S(i,k) + S(k + 1,j) is the score of the optimal
A 0 on 0 structure on i,j conditional
i and j 0 being
paired but not to each other.
A the only
0 four possible
0 0 Since these
are
cases,
i
the optimal
score
G S(i,j) is just the
0 maximum
0 of the four possibilities. We’ve thus defined
the optimal score
U S(i,j) recursively as a0 function only of optimal scores of smaller subU only need to remember
sequences; so we
these scores, not the combinatorial explosion
C Fill of in possible structures. Mathematically this
highlighted recursion looks like:
G Nussinov Algorithm square: (i = 1,j = 7) S(i,j) = max
S(i + 1,j – 1) +1 [if i,j base pair]
S(i + 1,j)
S(i,j – 1)
maxi<k<j S(i,k) + S(k + 1,j)
U U C G 1 2 2 3 1 1 1 1 0 0 1 1 0 0 0 1 0 0 0 1 0 0 1 0 0 A-­‐G don’t base pair = 1 = 2 k = 2: S(1,2) + S(3,7) = 0+1 = 1 k = 3: S(1,3) + S(4,7) = 0+1 = 1 k = 4: S(1,4) + S(5,7) = 1+1 = 2 k = 5: S(1,5) + S(6,7) = 2+1 = 3 k = 6: S(1,6) + S(7,7) = 2+0 = 2 Nussinov Algorithm -­‐traceback j
i
A A A 0 A 0 G G U U C G 0 0 1 2 2 3 0 0 1 1 1 1 0 0 0 0 1 1 0 0 0 0 1 0 0 0 1 0 0 1 0 0 U U C G start here Can you draw this folded RNA? k = 5: S(1,5) + S(6,7) = 2+1 = 3 Op'mal sub-­‐structure from 1-­‐5 (with 2 matches) Op'mal sub-­‐structure from 6-­‐7 (with 1 match) Nussinov Algorithm -­‐traceback j
i
A A A 0 A 0 G G U U C G 0 0 1 2 2 3 0 0 1 1 1 1 0 0 0 0 1 1 0 0 0 0 1 0 0 0 1 0 0 1 0 0 U U C G Can you draw this folded RNA? G A A U
U
C G
start here -­‐ note that in reality, stems can’t form if the loop is less than 3bp due to restric'ons on backbone angles Improvements on Nussinov algorithm •  Nussinov is the “core” of most RNA folding programs, but they all have bells & whistles 3’…CCAUUCAUAG…5’
RNA Energetics I
–  Take into account that loop must be 3 or more nucleo'des ||||||
–  Not all base pairs are equal in reality (we treated them all at +1 in Nussinov) 5’…CGUGAGU…3’
–  Base stacking interac'ons • base pairing:
G
>
C
A
U
>
G
U
• base stacking:
GpA
|
|
CpU
35
Base stacking contributes more
source unknown. All rights reserved. This content is excluded from our Creaative
to free energy than base pairing ©
Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.
Improvements on Nussinov algorithm •  Nussinov is the “core” of most RNA folding programs, but they all have bells & whistles Take into account that loop must be 3 or more nucleo'des Not all base pairs are equal in reality (we treated them all at +1 in Nussinov) Base stacking interac'ons Penalizes interior bulges Extra terms at terminal ends of RNA exposed to solvent -­‐Nussinov algorithm cannot detect pseudoknots, since these do not sa'sfy the recursive assump'on that each structure can be split into smaller self-­‐contained sub-­‐structures -­‐ more advanced algorithms –  With all these addi'ons, mfold gets ~70% of bases correctly folded; preYy good on average but would likely want to do in vivo structure profiling of your RNA if you really want to know its structure – 
– 
– 
– 
– 
– 
36
Happy Spring Break! 37
MIT OpenCourseWare
http://ocw.mit.edu
7.91J / 20.490J / 20.390J / 7.36J / 6.802J / 6.874J / HST.506J Foundations of Computational and Systems Biology
Spring 2014
For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.
Download