5 RNA structure prediction Barquist

A brief introduction to computational prediction of RNA
secondary structure
Lars Barquist
13.11.2018
Lecture Goals
At the end of this lecture you should be able to:
• Describe the most common algorithm and energy model
underlying minimum free energy (MFE) RNA secondary
structure prediction
• Name at least one major weakness of standard MFE
predictions
• Describe 3 alternative approaches that can complement MFE
prediction
RNA Structure
Central Dogma of Genomics: sequence determines structure
determines function (Petsko, 2000)
[Figure: tRNA structure at three levels]
A Primary Structure
5’ GCGGAUUUAGCUCAGDDGGGAGAGCGCCAGACUGAAYA.CUGGAGGUCCUGUGT.CGAUCCACAGAAUUCGCACCA 3’
B Secondary Structure: cloverleaf with Acceptor Stem, D Loop, Anticodon Loop, Variable Loop, and TΨC Loop
C Tertiary Structure: same elements labeled (Acceptor Stem, D Loop, Anticodon Loop, TΨC Loop)
How do we build a folding algorithm?
Need three things:
• An encoding of secondary structure we can compute on
• Definition (and restriction) of the problem to be solved
• A method to produce optimal solutions to this problem
Representing RNA secondary structure
Matrix representation of RNA structure
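A minimal sketch of one such encoding, assuming dot-bracket notation as the input (the notation is not named on the slide; it is the text format RNAfold prints). The function names and the example structure are hypothetical.

```python
def parse_dot_bracket(structure):
    """Return a sorted list of 0-based (i, j) base pairs with i < j."""
    stack, pairs = [], []
    for pos, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(pos)
        elif symbol == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {pos}")
            pairs.append((stack.pop(), pos))
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r} at position {pos}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return sorted(pairs)


def pair_matrix(structure):
    """Symmetric 0/1 matrix: entry [i][j] is 1 if bases i and j are paired."""
    n = len(structure)
    matrix = [[0] * n for _ in range(n)]
    for i, j in parse_dot_bracket(structure):
        matrix[i][j] = matrix[j][i] = 1
    return matrix


example = "((..))"  # hypothetical toy structure
pairs = parse_dot_bracket(example)
matrix = pair_matrix(example)
```

Both forms carry the same information: the pair list is compact, while the matrix is the shape that dynamic programming fills in.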
Constraints on secondary structure
(assumptions)
Also generally assume only Watson-Crick (G:C, A:U) and wobble pairs (U:G)!
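The slide's figure is not reproduced here, so the following check is a sketch under assumptions: it tests the constraints commonly imposed in secondary structure prediction (allowed pair types as stated above, each base in at most one pair, no crossing pairs, and a minimum hairpin loop size). The `min_loop=3` default and all names are assumptions for illustration.

```python
# Watson-Crick and wobble pairs, as listed on the slide.
ALLOWED_PAIRS = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A"), ("G", "U"), ("U", "G")}


def check_structure(seq, pairs, min_loop=3):
    """Return a list of violations for 0-based (i, j) pairs with i < j."""
    problems = []
    used = set()
    for i, j in pairs:
        if (seq[i], seq[j]) not in ALLOWED_PAIRS:
            problems.append(f"non-canonical pair {seq[i]}{i}-{seq[j]}{j}")
        if j - i - 1 < min_loop:
            problems.append(f"loop too short between {i} and {j}")
        for pos in (i, j):
            if pos in used:
                problems.append(f"base {pos} is in more than one pair")
            used.add(pos)
    # Crossing pairs (i < k < j < l) are pseudoknots, excluded by assumption.
    for index, (i, j) in enumerate(pairs):
        for k, l in pairs[index + 1:]:
            if i < k < j < l or k < i < l < j:
                problems.append(f"pairs ({i},{j}) and ({k},{l}) cross")
    return problems


nested = check_structure("GGGAAAUCC", [(0, 8), (1, 7), (2, 6)])
crossing = check_structure("GAGAACAAC", [(0, 5), (2, 8)])
```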
Optimization: Dynamic Programming
• Not dynamic, not programming
• Foundational optimization technique
used in many genomics applications
(e.g. BLAST)
• Break down optimization problem into series of subproblems
• Basic idea: fill out score matrix using recursion, traceback
along maximal scoring path to deduce optimal sequence
of events (Viterbi, Needleman-Wunsch)
• First goal: maximize base-pairs (Nussinov)
The Nussinov folding algorithm
Recursion to maximize number of base-pairs:
Nussinov, SIAM 1978; Eddy, Nat. Biotech. 2004
The Nussinov folding algorithm
Nussinov et al, SIAM 1978; Eddy, Nat. Biotech. 2004; Durbin et al, Biological Sequence Analysis
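The recursion itself did not survive extraction. As a sketch, here is the textbook form (as presented in Durbin et al.): the best score for subsequence i..j is the maximum of leaving i unpaired, leaving j unpaired, pairing i with j, or splitting the interval in two. The `min_loop` parameter and the function names are assumptions added for illustration.

```python
ALLOWED_PAIRS = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A"), ("G", "U"), ("U", "G")}


def can_pair(seq, i, j, min_loop):
    return (seq[i], seq[j]) in ALLOWED_PAIRS and j - i - 1 >= min_loop


def nussinov_fill(seq, min_loop=0):
    """Fill score[i][j] = maximum number of base pairs in seq[i..j]."""
    n = len(seq)
    score = [[0] * n for _ in range(n)]
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            best = max(score[i + 1][j], score[i][j - 1])  # i or j unpaired
            if can_pair(seq, i, j, min_loop):
                best = max(best, score[i + 1][j - 1] + 1)  # i pairs with j
            for k in range(i + 1, j):  # bifurcation
                best = max(best, score[i][k] + score[k + 1][j])
            score[i][j] = best
    return score


def nussinov_traceback(seq, score, min_loop=0):
    """Recover one optimal set of pairs (there may be many)."""
    pairs, stack = [], [(0, len(seq) - 1)]
    while stack:
        i, j = stack.pop()
        if i >= j:
            continue
        if score[i][j] == score[i + 1][j]:
            stack.append((i + 1, j))
        elif score[i][j] == score[i][j - 1]:
            stack.append((i, j - 1))
        elif can_pair(seq, i, j, min_loop) and score[i][j] == score[i + 1][j - 1] + 1:
            pairs.append((i, j))
            stack.append((i + 1, j - 1))
        else:
            for k in range(i + 1, j):
                if score[i][j] == score[i][k] + score[k + 1][j]:
                    stack.extend([(i, k), (k + 1, j)])
                    break
    return sorted(pairs)


seq = "GGGAAAUCC"  # short toy sequence
score = nussinov_fill(seq)
pairs = nussinov_traceback(seq, score)
```

The traceback returns only one of the co-optimal structures, which is exactly the non-uniqueness problem raised on the next slide.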
Problems with the Nussinov algorithm
• Solutions frequently not unique
• 77 nt tRNA-His has thousands of “optimal” structures
with 26 base-pairs!
• Maximizing base-pairs frequently doesn’t reflect biological
structure
• Real tRNA structure has ~21 base-pairs
• Lacks physical realism: no folding energy
Need an energy model!
Parameters of the Turner energy model
Base stacking
Optical melting
Doktycz et al., JBC 1995
Nearest Neighbor Model
Stacking energies (kcal/mol); rows: W X, columns: Y Z
W X\Y Z    CG      GC      AU      UA      GU      UG
CG        -3.26   -2.36   -2.11   -2.08   -1.41   -2.11
GC        -3.42   -3.26   -2.35   -2.24   -1.53   -2.51
AU        -2.24   -2.08   -0.93   -1.10   -0.55   -1.36
UA        -2.35   -2.11   -1.33   -0.93   -1.00   -1.27
GU        -2.51   -2.11   -1.27   -1.36   +0.47   +1.29
UG        -1.53   -1.41   -1.00   -0.55   +0.30   +0.47
Mathews et al., JMB 1999
Laurberg et al., Nature 2008
Zuker folding
[Figure: a small hairpin; the loop contributes L and the three stacked base pairs in the stem contribute S1, S2, S3]
Free Energy = L + S1 + S2 + S3
= 5.00 − 2.11 − 2.35 − 2.24
= −1.70 kcal/mol
Zuker & Stiegler, NAR 1981
Dynamic programming over thousands of measured and
inferred parameters for structural components
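The additive bookkeeping in the hairpin example can be sketched in a few lines. The numbers are the ones on the slide; the function name is hypothetical, and a real implementation looks each term up in the parameter tables rather than taking it as input.

```python
def hairpin_energy(loop_penalty, stack_energies):
    """Free energy (kcal/mol) as loop penalty plus the sum of stacking terms."""
    return loop_penalty + sum(stack_energies)


# L = 5.00, S1..S3 = -2.11, -2.35, -2.24 (values from the slide)
total = hairpin_energy(5.00, [-2.11, -2.35, -2.24])
print(f"{total:.2f} kcal/mol")
```

Because the loop term is positive and each stack is negative, a hairpin only becomes favorable once the stem is long enough to pay for closing the loop.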
Exercise:
What is this sequence?
UCCUCUGUAGUUCAGUCGGUAGAACGGCGGACUGUUAA
UCCGUAUGUCACUGGUUCGAGUCCAGUCAGAGGAGCCA
(https://sites.google.com/site/rnainformatics/sequence)
Take 5 min, use Vienna RNAfold (http://rna.tbi.univie.ac.at)
and NUPACK (http://www.nupack.org) to fold it!
Play around with default settings if you’ve got time.
Exercise results
Should we believe these results?
• Benchmarks:
• By method developers:
Walter et al. 1994, mean sensitivity 63.6%
Mathews et al. 1999, mean sensitivity 72.9%
• Independent:
Doshi et al. 2004, mean sensitivity 41%
Dowell & Eddy 2004, sens 56%, PPV 48%
Gardner & Giegerich 2004, sens 56%, PPV 46%
• Datasets: tRNA, SSU & LSU rRNA, SRP, RNase P,
tmRNA
• Energy model only accurate to within 5-10%!
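For reference, sensitivity and PPV on base pairs can be computed as in the sketch below. It assumes dot-bracket input and counts a predicted pair as correct only if it matches a reference pair exactly; some benchmarks use looser matching rules, so treat this as an illustration rather than the protocol of any study listed above. The structures are hypothetical.

```python
def pairs_from_dot_bracket(structure):
    """Set of 0-based (i, j) pairs from a well-formed dot-bracket string."""
    stack, pairs = [], set()
    for pos, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(pos)
        elif symbol == ")":
            pairs.add((stack.pop(), pos))
    return pairs


def sensitivity_ppv(predicted, reference):
    """Sensitivity = TP / reference pairs; PPV = TP / predicted pairs."""
    pred = pairs_from_dot_bracket(predicted)
    ref = pairs_from_dot_bracket(reference)
    true_positives = len(pred & ref)
    sensitivity = true_positives / len(ref) if ref else 0.0
    ppv = true_positives / len(pred) if pred else 0.0
    return sensitivity, ppv


reference = "(((...)))"  # hypothetical known structure
predicted = "((.....))"  # hypothetical prediction
sens, ppv = sensitivity_ppv(predicted, reference)
```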
Less obvious results
What is this stuff?
What is optimization?
The energy model defines a space of potential folds
– optimization identifies the best scoring fold within this space
O’Reilly Media
Centroid: balance point for this space
The problem with optimization on an
inaccurate energy model
“There is an embarrassing abundance of structures having a free energy near that of the optimum” (McCaskill 1990)
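A rough sketch of why this matters, assuming structures are weighted by Boltzmann factors exp(−ΔG/RT) at 37 °C. The four energies are hypothetical values in the range of the plot, and the probabilities are normalized over these four structures only, not over a full ensemble.

```python
import math

GAS_CONSTANT = 0.0019872  # kcal/(mol*K)


def boltzmann_probabilities(energies, temperature_c=37.0):
    """Relative probabilities of structures given free energies in kcal/mol."""
    rt = GAS_CONSTANT * (temperature_c + 273.15)
    lowest = min(energies)
    # Subtract the minimum first so the exponentials stay well scaled.
    weights = [math.exp(-(energy - lowest) / rt) for energy in energies]
    total = sum(weights)
    return [weight / total for weight in weights]


# Hypothetical near-optimal structures, 0.2 kcal/mol apart.
energies = [-21.8, -21.6, -21.4, -21.2]
probabilities = boltzmann_probabilities(energies)
```

With gaps this small, the lowest-energy structure takes well under half of the weight even in this tiny set, and the gaps are smaller than the uncertainty in the energy model.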
[Figure: ΔG (kcal/mol, axis from −22 to −20) versus base-pair distance to the MFE structure, d_BP(S_i, S_mfe), with drawings of the Biological, MFE, and Suboptimal structures]
Back to our results:
[Figure: secondary structure drawing]
Wuchty et al., Biopolymers 1999
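The base-pair distance on the plot's x-axis can be sketched as the number of pairs present in one structure but not the other. This assumes dot-bracket input; the structures are hypothetical.

```python
def pairs_from_dot_bracket(structure):
    """Set of 0-based (i, j) pairs from a well-formed dot-bracket string."""
    stack, pairs = [], set()
    for pos, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(pos)
        elif symbol == ")":
            pairs.add((stack.pop(), pos))
    return pairs


def base_pair_distance(structure_a, structure_b):
    """Size of the symmetric difference of the two pair sets."""
    pairs_a = pairs_from_dot_bracket(structure_a)
    pairs_b = pairs_from_dot_bracket(structure_b)
    return len(pairs_a ^ pairs_b)


mfe = "(((...)))"         # hypothetical MFE structure
alternative = "((.....))"  # hypothetical suboptimal structure
distance = base_pair_distance(mfe, alternative)
```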
Other alternatives to “plain” MFE
optimization
• Determine how confident the model is in the global structure
• Sampling!
• Augmenting the model with additional information
• Comparative prediction of secondary structure
• Experimental constraints on secondary structure
Sampling from the energy model
[Figure: Cluster tree of the 20 most frequent structures in a sample of 1000, with representative secondary structure drawings]
Ding & Lawrence, NAR 2003
Can provide a more intuitive picture of uncertainty in structure
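A sketch of how a sample might be summarized, assuming the sampler returns dot-bracket strings. The ten structures below are invented for illustration, and the centroid here is taken as the set of pairs seen in more than half of the sample.

```python
from collections import Counter


def pairs_from_dot_bracket(structure):
    """List of 0-based (i, j) pairs from a well-formed dot-bracket string."""
    stack, pairs = [], []
    for pos, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(pos)
        elif symbol == ")":
            pairs.append((stack.pop(), pos))
    return pairs


def summarize_sample(sample):
    """Return structure counts, pair frequencies, and a centroid structure."""
    structure_counts = Counter(sample)
    pair_counts = Counter(p for s in sample for p in pairs_from_dot_bracket(s))
    pair_frequency = {p: c / len(sample) for p, c in pair_counts.items()}
    centroid = ["."] * len(sample[0])
    for (i, j), frequency in pair_frequency.items():
        if frequency > 0.5:  # keep pairs present in most sampled structures
            centroid[i], centroid[j] = "(", ")"
    return structure_counts, pair_frequency, "".join(centroid)


# Hypothetical sample: two competing folds plus the open chain.
sample = ["((...)).."] * 6 + ["..((...))"] * 3 + ["........."]
counts, frequency, centroid = summarize_sample(sample)
```

The pair frequencies are a sampled estimate of base-pairing probabilities, and the structure counts show whether the sample splits into distinct alternatives.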
Additional information
Woese et al., Micro. Rev. 1983
Additional information: covariation
Credit: Eric Nawrocki @NCBI
Additional information: covariation
• Incorporating evolutionary information into structure
prediction is a deep topic!
• Co-estimation of alignment and structure prohibitive
• Resources:
• Barquist et al., Curr Protoc Bioinformatics 2016
• Pitfalls of naïve approaches: Rivas et al., Nat Methods
2017
• Rfam (rfam.xfam.org)
• Producing these sequence:structure models more art than
science, BUT still one of the best sources of evidence for
conserved functional structure!
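As a toy illustration of covariation, the sketch below computes mutual information between two alignment columns. Mutual information is only one simple measure, and it is the kind of naive statistic the pitfalls reference warns about, so treat this as a picture of the signal, not a recommended method. The four-sequence alignment is invented.

```python
import math
from collections import Counter


def entropy(symbols):
    """Shannon entropy in bits of a list of hashable symbols."""
    total = len(symbols)
    counts = Counter(symbols)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def mutual_information(alignment, i, j):
    """MI between columns i and j: H(i) + H(j) - H(i, j)."""
    column_i = [sequence[i] for sequence in alignment]
    column_j = [sequence[j] for sequence in alignment]
    joint = list(zip(column_i, column_j))
    return entropy(column_i) + entropy(column_j) - entropy(joint)


# Hypothetical alignment: columns 0 and 3 change together and stay
# complementary; columns 1 and 2 are fully conserved.
alignment = ["GAAC", "CAAG", "AAAU", "UAAA"]
covarying = mutual_information(alignment, 0, 3)
conserved = mutual_information(alignment, 1, 2)
```

Conserved columns carry no covariation signal at all, which is why a perfectly conserved helix cannot be supported this way.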
Additional information: experimental
More from Redmond next!
Westhof & Romby, Nat. Methods 2010
Exercise:
What is this sequence?
UCCUCUGUAGUUCAGUCGGUAGAACGGCGGACUGUUAA
UCCGUAUGUCACUGGUUCGAGUCCAGUCAGAGGAGCCA
(https://sites.google.com/site/rnainformatics/sequence)
Use Sfold to get an idea of what the structural ensemble looks
like: http://sfold.wadsworth.org/cgi-bin/srna.pl
Now say you’ve gotten information that one base is definitely
unpaired – use RNAfold with the following constraint:
.........................x..................................................
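A constraint line like this is easier to generate than to type. The sketch below assumes the RNAfold convention where `.` means no constraint and `x` forces a base to be unpaired (read when RNAfold is run with `-C`). The helper name is hypothetical, and position 26 is an assumed reading of the line above, so check it against the slide.

```python
def unpaired_constraint(length, unpaired_positions):
    """Build a constraint string with 'x' at the given 1-based positions."""
    symbols = ["."] * length
    for position in unpaired_positions:
        if not 1 <= position <= length:
            raise ValueError(f"position {position} is outside 1..{length}")
        symbols[position - 1] = "x"
    return "".join(symbols)


sequence = (
    "UCCUCUGUAGUUCAGUCGGUAGAACGGCGGACUGUUAA"
    "UCCGUAUGUCACUGGUUCGAGUCCAGUCAGAGGAGCCA"
)
constraint = unpaired_constraint(len(sequence), [26])  # assumed position

# Sequence line followed by constraint line, as input for RNAfold -C.
print(sequence)
print(constraint)
```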
Exercise results: Sfold sampling
[Figure labels: 63%, 37%]
Note: results are stochastic – different runs will give slightly different answers!
Exercise results: experimental constraints
Can be done high-throughput but need to deal with noise – “soft constraints”
see: Eddy, Annu Rev Biophys 2014
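One way soft constraints are implemented is to turn each probing reactivity into a pseudo-energy that is added to the model instead of forbidding pairs outright. The sketch below uses the log-linear form from Deigan et al. (PNAS 2009) with its published slope and intercept. This is an assumption about which formulation to illustrate, not necessarily the one the lecture has in mind, and the reactivity values are invented.

```python
import math


def pseudo_energy(reactivity, slope=2.6, intercept=-0.8):
    """Pseudo free energy (kcal/mol) for a paired nucleotide.

    Assumed form: slope * ln(reactivity + 1) + intercept.
    Low reactivity rewards pairing; high reactivity penalizes it.
    """
    return slope * math.log(reactivity + 1.0) + intercept


# Hypothetical normalized reactivities for three nucleotides.
reactivities = [0.0, 0.3, 1.0]
penalties = [pseudo_energy(value) for value in reactivities]
```

Because the term is graded rather than absolute, a single noisy measurement shifts the prediction without being able to veto a helix.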
Summary
• Don’t trust point (MFE) estimates!
• They can be useful but treat with skepticism
• Parameters have uncertainty
• Some possibilities completely excluded (base
modifications, pseudoknots, tertiary contacts)
• The ensemble can be more informative!
• Base-pairing probabilities can inform on stable substructures
• Dot plots can be informative about possible local
alternatives
• Sampling can give an idea as to global alternatives
• If you need an accurate structure, bring in more information:
covariation and/or experimental!