RNANotes - McMaster Physics and Astronomy

Roles of RNA
• mRNA (messenger)
• rRNA (ribosomal)
• tRNA (transfer)
• other ribonucleoproteins (e.g. spliceosome, signal recognition particle, ribonuclease P)
• viral genomes
• artificial ribozymes
Typical transfer RNA structure
Thermodynamic parameters are measured on real molecules.
Helix formation = hydrogen bonds + stacking.
[Figure: a short hairpin with two stacking contributions, ΔG = -2.1 kcal/mol and ΔG = -1.2 kcal/mol, and a loop contribution, ΔG = +4.5 kcal/mol.]
Entropic penalty for loop formation.
Sum up contributions of helices and loops over the whole structure.
[Figure: secondary structure elements: multi-branched loop, bulges, internal loops, hairpin loop.]
Pairs i-j and k-l are compatible if (a) i < j < k < l, or (b) i < k < l < j.
Case (c), i < k < j < l, is called a pseudoknot. It is usually not counted as secondary structure.
[Figure: arc diagrams of cases (a), (b) and (c), with positions i, j, k and l marked.]
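The three cases can be checked mechanically. The following is a minimal sketch, assuming pairs are given as tuples of 1-based positions; the function names `classify_pairs` and `compatible` are illustrative and not from the slides.

```python
def classify_pairs(pair1, pair2):
    """Classify two base pairs as 'disjoint' (a), 'nested' (b) or 'pseudoknot' (c)."""
    # Order each pair as (smaller, larger), then put the pair that starts first in front.
    (i, j), (k, l) = sorted([tuple(sorted(pair1)), tuple(sorted(pair2))])
    if len({i, j, k, l}) < 4:
        raise ValueError("the two pairs share a base")
    if j < k:          # i < j < k < l
        return "disjoint"
    if l < j:          # i < k < l < j
        return "nested"
    return "pseudoknot"  # i < k < j < l


def compatible(pair1, pair2):
    """Two pairs are compatible unless they form a pseudoknot."""
    return classify_pairs(pair1, pair2) != "pseudoknot"
```

For example, `compatible((1, 10), (5, 15))` is false because the two pairs cross.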
Bracket notation is used to represent structure:
a:
((((....))))..((((....))))
b:
((.((((....)))).))
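Bracket notation can be converted to a list of pairs with a stack. This is a small sketch, assuming 1-based positions and only the characters `(`, `)` and `.`; the name `parse_brackets` is illustrative.

```python
def parse_brackets(structure):
    """Return the sorted list of (i, j) pairs encoded by a bracket string."""
    stack, pairs = [], []
    for pos, ch in enumerate(structure, start=1):
        if ch == "(":
            stack.append(pos)              # remember the opening position
        elif ch == ")":
            if not stack:
                raise ValueError("unmatched ')' at position %d" % pos)
            pairs.append((stack.pop(), pos))  # closes the most recent '('
        elif ch != ".":
            raise ValueError("unexpected character %r" % ch)
    if stack:
        raise ValueError("unmatched '(' at position %d" % stack[-1])
    return sorted(pairs)
```

Because every `)` closes the most recent `(`, the pairs produced are always nested or disjoint, so the notation cannot express a pseudoknot.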
Basic problem: Want an algorithm that considers every allowed secondary structure
for a given sequence and finds the lowest energy state.
Simplest case: find structure which maximizes number of base pairs.
Let $\epsilon_{ij} = -1$ if bases can pair and $+\infty$ if not.
Ignore loop contributions.
E(i,j) = energy of min energy structure for chain segment from i to j.
We want E(1,N).
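[Figure: the segment from i to j either has j unpaired (leaving the segment from i to j-1), or has j paired with some base k.]
$$E(i,j) = \min\left\{ E(i,j-1),\ \min_{i \le k \le j-4} \left[ E(i,k-1) + E(k+1,j-1) + \epsilon_{kj} \right] \right\}$$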
Algorithms that work by recursion relations like this are called dynamic programming.
The algorithm is $O(N^3)$, although the number of structures increases exponentially with N.
Also need to do backtracking to work out the minimum energy structure:
Set B(i,j) = k if j is paired with k, or 0 if unpaired.
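The recursion and the backtracking table B(i,j) can be sketched as follows. This is an illustrative implementation under stated assumptions, not the code used for the lectures: AU, GC and GU are assumed to be the allowed pairs, a pair k-j requires k ≤ j-4, and the names `fold` and `traceback` are invented here.

```python
# Assumed set of allowed pairs (Watson-Crick plus GU wobble).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def fold(seq, min_sep=4):
    """Fill E(i,j) and B(i,j) for 1-based i, j; empty segments have E = 0."""
    n = len(seq)
    E = [[0] * (n + 2) for _ in range(n + 2)]
    B = [[0] * (n + 2) for _ in range(n + 2)]
    for length in range(2, n + 1):
        for i in range(1, n - length + 2):
            j = i + length - 1
            best, best_k = E[i][j - 1], 0          # case: j unpaired
            for k in range(i, j - min_sep + 1):    # case: j paired with k
                if (seq[k - 1], seq[j - 1]) in PAIRS:
                    energy = E[i][k - 1] + E[k + 1][j - 1] - 1
                    if energy < best:
                        best, best_k = energy, k
            E[i][j], B[i][j] = best, best_k
    return E, B


def traceback(B, n):
    """Follow B(i,j) from (1, N) to recover one minimum energy structure."""
    structure = ["."] * n
    stack = [(1, n)]
    while stack:
        i, j = stack.pop()
        if j <= i:
            continue
        k = B[i][j]
        if k == 0:
            stack.append((i, j - 1))
        else:
            structure[k - 1], structure[j - 1] = "(", ")"
            stack.append((i, k - 1))
            stack.append((k + 1, j - 1))
    return "".join(structure)
```

The three nested loops over i, j and k are what give the cubic running time. When several structures share the minimum energy, this sketch returns only one of them.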
Partition Function Algorithm (for simplest energy rules)
[Figure: base j is either unpaired, or paired with some base k.]
$$Z_{ij} = Z_{i,j-1} + \sum_{k=i}^{j-4} Z_{i,k-1}\, Z_{k+1,j-1}\, a_{kj}$$
where
$$a_{kj} = \exp(-\epsilon_{kj} / kT)$$
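The partition function recursion has the same shape as the energy recursion, with sums replacing minima. A minimal sketch under the simplest energy rules, assuming AU, GC and GU pairs, energy -1 per pair, and energies measured in the same units as kT (the name `partition_function` is illustrative):

```python
import math

# Assumed set of allowed pairs (Watson-Crick plus GU wobble).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def partition_function(seq, kT=1.0, min_sep=4):
    """Return Z(1,N) for the pair-counting model with energy -1 per pair."""
    n = len(seq)
    # Empty and single-base segments have exactly one state, so Z = 1.
    Z = [[1.0] * (n + 2) for _ in range(n + 2)]
    a = math.exp(1.0 / kT)   # a_kj = exp(-eps_kj / kT) with eps_kj = -1
    for length in range(2, n + 1):
        for i in range(1, n - length + 2):
            j = i + length - 1
            z = Z[i][j - 1]                        # j unpaired
            for k in range(i, j - min_sep + 1):    # j paired with k
                if (seq[k - 1], seq[j - 1]) in PAIRS:
                    z += Z[i][k - 1] * Z[k + 1][j - 1] * a
            Z[i][j] = z
    return Z[1][n]
```

For the sequence GAAAC there are two states, open and the single pair 1-5, so the result is 1 + exp(1/kT).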
Real Energy Rules : Need to consider many special cases.
What type of loop are you closing?
The algorithm is more complex but is still $O(N^3)$.
Equilibrium probability that base i is paired with j:
$$p_{ij} = \frac{a_{ij}\, Z_{i+1,j-1}\, Z_{ij}^{\text{ends}}}{Z_{1,N}}$$
[Figure: chain from 1 to N with the pair i-j marked.]
Equilibrium probability that base i is unpaired:
$$p_{i,0} = 1 - \sum_{j=1}^{N} p_{ij}$$
[Figure: example of pairing probabilities, taken from the Vienna package web site.]
Is folding kinetics important?
RNA folding kinetics involves reorganisation of secondary structure
[Figure: folding pathway through intermediate structures i, ii and iii, with helices labelled A to I.]
Native structures may not be global minimum free energy states.
Morgan & Higgs (1996) J. Chem. Phys.
Energy Landscapes in RNA Folding
Morgan & Higgs (1998)
Each quantity is fitted by a function that is linear in N (intercept + slope × N). Standard errors are in brackets.
Quantity                        Fitted as         Intercept          Slope
Groundstate energy              E                 C1 = 2.9 (0.2)     -0.368 (0.001)
Total number of states          ln(number)        C2 = -5.6 (0.4)    0.533 (0.001)
Number of groundstates          ln(number)        C3 = 1.75 (0.2)    0.068 (0.001)
Groundstates are degenerate in this model because energies are integers.
Generate many random groundstates.
How far apart are these groundstates?
How high are the barriers between groundstates?
We found Frozen pairs (present in every groundstate)
This figure shows the frozen pairs only.
The molecule is divided into independent unfrozen loops.
Define Neff as the length of the longest loop.
Two groundstates for the same sequence
Minimum Free Energy Prediction
Deterministic. Always gets MFE structure for a given set of energy rules.
If MFE structure is not the same as biological structure, this could be because
(i) energy rules are inaccurate or insufficient
(ii) kinetics is important and molecule is trapped in metastable state.
Monte Carlo simulations of folding kinetics.
Store a current structure.
Estimate rates of removal of existing helices and rates of addition of other
compatible helices.
Choose one helix to be added or removed with probability proportional to its rate.
Repeat this many times. Can simulate structure formation from an unfolded state.
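One step of this procedure can be sketched as below. This is an illustration under assumptions rather than the simulation code behind the slides: the rate estimates are supplied by a caller-provided function `rates_for` (hypothetical), helices are opaque labels, and the exponential waiting time is the usual continuous-time convention, which the slide does not state.

```python
import random


def choose_move(rates, rng):
    """Pick one move with probability proportional to its rate."""
    total = sum(rates.values())
    if total <= 0:
        raise ValueError("no move has a positive rate")
    r = rng.random() * total
    cumulative = 0.0
    for move, rate in rates.items():
        cumulative += rate
        if r < cumulative:
            return move
    return move  # guard against floating-point rounding at the upper end


def kinetic_step(structure, rates_for, rng):
    """Add or remove one helix; return the new structure and the time elapsed."""
    rates = rates_for(structure)        # {("add" | "remove", helix): rate}
    kind, helix = choose_move(rates, rng)
    new_structure = set(structure)
    if kind == "add":
        new_structure.add(helix)
    else:
        new_structure.discard(helix)
    dt = rng.expovariate(sum(rates.values()))   # assumed waiting-time rule
    return frozenset(new_structure), dt
```

Starting from `frozenset()` and calling `kinetic_step` repeatedly simulates structure formation from the unfolded state.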
Qβ is a bacteriophage: an RNA virus with approx. 4000 nucleotides.
Viral RNA has complex secondary structure.
The replicase gene codes for the replicase protein. This is an RNA-dependent RNA polymerase.
It synthesizes the complementary strand. Viral replication needs two steps: plus to minus to plus.
In vitro RNA evolution in the Qβ system
[Figure: serial transfer experiment across a row of tubes.]
• Begin with replicase + nucleotides + viral RNA.
• Later tubes contain replicase + nucleotides only.
• Transfer a small quantity to each successive tube.
• Sequence the RNA after many transfers.
Barrier heights between alternative groundstates
Observation: the mean barrier height between groundstates scales as
$$\langle h \rangle \sim N_{\text{eff}}^{0.5}, \qquad N_{\text{eff}} \sim 0.3\,N$$
Therefore barriers become significant for large enough sequences.
An example where kinetics is important to
control biological function:
the 5’ region of the MS2 phage.
[Figure: map of the MS2 RNA with positions 130 and 3500 marked and the maturation protein gene labelled.]
Time to formation of the 5’ structure influences expression of the maturation
protein more than the stability of this structure.
Simulations compare with experiments on mutant sequences.
[Figure: average probability that the SD is free versus time (s), from 0 to 8 s, with curves for CC3435AA, WT & U32C, and SA.]
RNA in comparison to Proteins
Both have well-defined 3D structures.
The RNA folding problem is easier because secondary structure separates from tertiary structure more easily, but it is still a complex problem.
The RNA model has real parameters, so you can say something about real molecules. The RNA folding algorithm is simple enough to allow statistical physics (cf. 27-mer lattice protein models).
Part of sequence alignment of Mitochondrial Small Sub-Unit rRNA
Full gene is length ~950
11 Primate species with mouse as outgroup
                      *        20         *        40         *        60         *
Mouse      : CUCACCAUCUCUUGCUAAUUCAGCCUAUAUACCGCCAUCUUCAGCAAACCCUAAAAAGG-UAUUAAAGUAAGCAAAAGA : 78
Lemur      : CUCACCACUUCUUGCUAAUUCAACUUAUAUACCGCCAUCCCCAGCAAACCCUAUUAAGGCCC-CAAAGUAAGCAAAAAC : 78
Tarsier    : CUUACCACCUCUUGCUAAUUCAGUCUAUAUACCGCCAUCUUCAGCAAACCCUAAUAAAGGUUUUAAAGUAAGCACAAGU : 79
SakiMonkey : CUUACCACCUCUUGCC-AU-CAGCCUGUAUACCGCCAUCUUCAGCAAACUCUA-UAAUGACAGUAAAGUAAGCACAAGU : 76
Marmoset   : CUCACCACGUCUAGCC-AU-CAGCCUGUAUACCGCCAUCUUCAGCAAACUCCU-UAAUGAUUGUAAAGUAAGCAGAAGU : 76
Baboon     : CCCACCCUCUCUUGCU----UAGUCUAUAUACCGCCAUCUUCAGCAAACCCUGAUGAAGGCUACGAAGUGAGCGCAAAU : 75
Gibbon     : CUCACCAUCUCUUGCU----CAGCCUAUAUACCGCCAUCUUCAGCAAACCCUGACAAAGGCUAUAAAGUAAGCACAAAC : 75
Orangutan  : CUCACCACCCCUUGCU----CAGCCUAUAUACCGCCAUCUUCAGCAAACCCUGAUGAAGGCCACGAAGUAAGCGCAAAC : 75
Gorilla    : CUCACCACCUCUUGCU----CAGCCUAUAUACCGCCAUCUUCAGCAAACCCUGACGAAGGCCACAAAGUAAGCACAAGU : 75
PygmyChimp : CUCACCGCCUCUUGCU----CAGCCUAUAUACCGCCAUCUUCAGCAAACCCUGAUGAAGGUUACAAAGUAAGCGCAAGU : 75
Chimp      : CUCACCGCCUCUUGCU----CAGCCUAUAUACCGCCAUCUUCAGCAAACCCUGAUGAAGGUUACAAAGUAAGCGCAAGU : 75
Human      : CUCACCACCUCUUGCU----CAGCCUAUAUACCGCCAUCUUCAGCAAACCCUGAUGAAGGCUACAAAGUAAGCGCAAGU : 75
Consensus  : CucACC  cuCUuGCu    cAgccUaUAUACCGCCAUCuuCAGCAAACcCu    A G     aAAGUaAGC  AA
Murphy et al. Nature (2001): uses 15 nuclear plus 3 mitochondrial proteins.
[Figure: phylogenetic tree of mammals with percentage support values on the branches, ranging from < 50 to 100. Tip labels, grouped by the clade labels in the figure, with asterisks as in the original:]
• Cetartiodactyla: Megaptera *, Tursiops (Cetacea); Hippopotamus *, Tragelaphus, Okapia, Sus, Lama *
• Perissodactyla: Ceratotherium *, Tapirus, Equus *
• Carnivora: Felis *, Leopardus, Panthera, Canis *, Ursus
• Pholidota: Manis *
• Chiroptera: Artibeus *, Nycteris (Microchiroptera); Pteropus *, Rousettus (Megachiroptera)
• 'Eulipotyphla': Erinaceus *, Sorex *, Asioscalops, Condylura *
• Rodentia: Cavia *, Hydrochaerus, Agouti, Erethizon, Myocastor, Dinomys, Hystrix, Heterocephalus * (clade labels Caviomorpha and Hystricognathi); Mus *, Rattus, Cricetus, Pedetes, Castor, Dipodomys, Tamias *, Muscardinus
• Lagomorpha: Sylvilagus *, Ochotona *
• Primates: Hylobates, Homo *, Macaca *, Ateles *, Callimico (Anthropoidea); Lemur * (Lemuriformes); Tarsius * (Tarsiiformes)
• Dermoptera **: Cynocephalus */ **
• Scandentia: Tupaia *
• Xenarthra: Choloepus did. *, Choloepus hof., Tamandua, Myrmecophaga, Euphractus *, Chaetophractus
• Sirenia: Trichechus *
• Proboscidea: Loxodonta *
• Hyracoidea: Procavia *
• Tenrecidae: Echinops *
• Tubulidentata: Orycteropus *
• Macroscelidea: Macroscelides, Elephantulus *
• Marsupialia: Didelphis *, Macropus *
Higher-level labels in the figure: Glires, Paenungulata, Tethytheria, Afrotheria, and groups I, II, III and IV.
Afrotheria / Laurasiatheria
Striking examples of convergent evolution
Cao et al. (2000) Gene
uses 12 mitochondrial
proteins
RNA pairs model (GR7)
53 complete Mammalian mitochondrial genomes
Complete set of rRNAs + tRNAs from = 973 pairs.
[Figure: phylogenetic tree with support values on the branches.]
Jow et al. (2002)
MCMC searches the rugged landscape in
tree space using the Metropolis algorithm.
Obtains a set of possible trees weighted
according to their likelihood.
1. Rate parameter changes = continuous
2. Branch length changes = continuous
3. Topology changes = discrete
[Figure: topology moves on trees with tips A to E: nearest-neighbour interchange and long-range move.]
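The Metropolis rule behind such a search can be illustrated on a toy problem. This sketch assumes symmetric proposals and a flat prior, and replaces real trees with three labelled states whose log-likelihoods are made up for the example; `metropolis_accept` and `run_chain` are illustrative names, not part of any phylogenetics package.

```python
import math
import random


def metropolis_accept(log_like_current, log_like_proposed, rng):
    """Accept uphill moves always, downhill moves with probability L_new / L_old."""
    delta = log_like_proposed - log_like_current
    return delta >= 0 or rng.random() < math.exp(delta)


def run_chain(log_like, steps, rng):
    """Count visits to each state; proposals pick any other state uniformly."""
    states = list(log_like)
    current = states[0]
    visits = {s: 0 for s in states}
    for _ in range(steps):
        proposal = rng.choice([s for s in states if s != current])
        if metropolis_accept(log_like[current], log_like[proposal], rng):
            current = proposal
        visits[current] += 1
    return visits
```

With likelihoods in the ratio 1 : 2 : 3, a long chain visits the three states in roughly those proportions, which is the sense in which the sampled trees are weighted by their likelihood.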
Models of Sequence Evolution
rij is the rate of substitution from state i to state j
States label bases A,C,G & T
Pij(t) = probability of being in state j at time t
given that ancestor was in state i at time 0.
$$\frac{dP_{ij}}{dt} = \sum_k P_{ik}\, r_{kj}$$
[Figure: a branch of length t from ancestral state i to descendant state j.]
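The solution of this equation with P(0) equal to the identity is the matrix exponential of the rate matrix times t. The sketch below computes it in plain Python by a truncated Taylor series with scaling and squaring; this numerical method is a choice made for the illustration, and the diagonal of the rate matrix is assumed to hold minus the sum of the other elements on the row.

```python
def mat_mul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def transition_probabilities(R, t, squarings=10, terms=12):
    """Return P(t) = exp(R t) for a rate matrix R with rows summing to zero."""
    n = len(R)
    scale = t / (2 ** squarings)
    M = [[R[i][j] * scale for j in range(n)] for i in range(n)]
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    P = [row[:] for row in identity]
    term = [row[:] for row in identity]
    for m in range(1, terms + 1):
        # term becomes M^m / m!
        term = [[x / m for x in row] for row in mat_mul(term, M)]
        P = [[P[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        P = mat_mul(P, P)   # exp(2M) = exp(M) exp(M)
    return P
```

As a check, when all twelve off-diagonal rates of a 4-state model equal 1/3, the result agrees with the closed form P_ii(t) = 1/4 + (3/4) exp(-4t/3).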
The HKY model describes rate of evolution of single sites
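Rate matrix (rows: from; columns: to):
$$
\begin{array}{c|cccc}
 & A & G & C & T \\ \hline
A & * & \kappa\pi_G & \pi_C & \pi_T \\
G & \kappa\pi_A & * & \pi_C & \pi_T \\
C & \pi_A & \pi_G & * & \kappa\pi_T \\
T & \pi_A & \pi_G & \kappa\pi_C & *
\end{array}
$$
The frequencies of the four bases are $\pi_A, \pi_G, \pi_C, \pi_T$.
$\kappa$ is the transition-transversion rate parameter.
* means minus the sum of the other elements on the row.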
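The HKY rate matrix can be built directly from the four frequencies and the transition-transversion parameter. A minimal sketch, assuming the frequencies are passed as a dictionary keyed by base and that the matrix is returned as nested dictionaries (`hky_rate_matrix` is an illustrative name):

```python
BASES = "AGCT"
PURINES = {"A", "G"}


def hky_rate_matrix(freqs, kappa):
    """Return R[i][j], the rate of substitution from base i to base j."""
    R = {}
    for i in BASES:
        row = {}
        for j in BASES:
            if i == j:
                continue
            # A<->G and C<->T are transitions; everything else is a transversion.
            is_transition = (i in PURINES) == (j in PURINES)
            row[j] = (kappa if is_transition else 1.0) * freqs[j]
        row[i] = -sum(row.values())   # the * entry: minus the rest of the row
        R[i] = row
    return R
```

Because each off-diagonal rate is a symmetric factor times the frequency of the target base, the matrix satisfies detailed balance with respect to the base frequencies.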
Compensatory Substitutions
Two sides of the acceptor stem from a tRNA are shown.
Due to structure conservation alignment is possible in widely
different species.
Species                      1234567   7654321
                             (((((((   )))))))
Bacillus subtilis            GGCUCGG   CCGAGCC
Escherichia coli             GCCCGGA   UCCGGGC
Saccharomyces cerevisiae     GCGGAUU   AAUUCGC
Drosophila melanogaster      GCCGAAA   UUUCGGC
Homo sapiens                 GCCGAAA   UUUCGGC
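The pairing in the acceptor stem can be read off by matching position p on the first side with position p on the second side, which is written in the order 7 to 1. The sketch below does this and labels each pair as AU, GU, GC, UA, UG, CG or MM (mismatch); the helper names are illustrative.

```python
# The six pair types that can form; anything else is treated as a mismatch.
PAIR_TYPES = {"AU", "GU", "GC", "UA", "UG", "CG"}


def stem_pairs(side_a, side_b):
    """Pair position p of side_a (written 1..7) with position p of side_b (written 7..1)."""
    if len(side_a) != len(side_b):
        raise ValueError("the two sides of the stem must have equal length")
    return [a + b for a, b in zip(side_a, reversed(side_b))]


def pair_state(pair):
    """Return the pair itself if it is one of the six pair types, else 'MM'."""
    return pair if pair in PAIR_TYPES else "MM"
```

For Saccharomyces cerevisiae this gives GC, CG, GC, GU, AU, UA, UA, so the stem is fully paired even though its sequence differs from the bacterial stems at most positions.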
Model 7A is a General Reversible 7-state Model
7 frequencies $\pi_i$ + 21 rate parameters $\alpha_{ij}$ - 2 constraints = 26 free parameters
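The count generalises to any number of states. In this small sketch the two constraints are assumed to be the usual ones, namely that the frequencies sum to one and that the overall rate scale is fixed; the slide does not spell them out.

```python
def free_parameters(n_states):
    """Free parameters of a general reversible model with n_states states."""
    frequencies = n_states
    rates = n_states * (n_states - 1) // 2   # one symmetric rate per unordered pair of states
    constraints = 2                          # assumed: frequencies sum to 1, fixed rate scale
    return frequencies + rates - constraints
```

With 7 states this gives 7 + 21 - 2 = 26, and with 4 states it gives 4 + 6 - 2 = 8.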
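States: 1 = AU, 2 = GU, 3 = GC, 4 = UA, 5 = UG, 6 = CG, 7 = MM (mismatch).
Rate matrix (rows: from; columns: to):
$$
\begin{array}{c|ccccccc}
 & 1\,(AU) & 2\,(GU) & 3\,(GC) & 4\,(UA) & 5\,(UG) & 6\,(CG) & 7\,(MM) \\ \hline
1\,(AU) & * & \pi_2\alpha_{12} & \pi_3\alpha_{13} & \pi_4\alpha_{14} & \pi_5\alpha_{15} & \pi_6\alpha_{16} & \pi_7\alpha_{17} \\
2\,(GU) & \pi_1\alpha_{12} & * & \pi_3\alpha_{23} & \pi_4\alpha_{24} & \pi_5\alpha_{25} & \pi_6\alpha_{26} & \pi_7\alpha_{27} \\
3\,(GC) & \pi_1\alpha_{13} & \pi_2\alpha_{23} & * & \pi_4\alpha_{34} & \pi_5\alpha_{35} & \pi_6\alpha_{36} & \pi_7\alpha_{37} \\
4\,(UA) & \pi_1\alpha_{14} & \pi_2\alpha_{24} & \pi_3\alpha_{34} & * & \pi_5\alpha_{45} & \pi_6\alpha_{46} & \pi_7\alpha_{47} \\
5\,(UG) & \pi_1\alpha_{15} & \pi_2\alpha_{25} & \pi_3\alpha_{35} & \pi_4\alpha_{45} & * & \pi_6\alpha_{56} & \pi_7\alpha_{57} \\
6\,(CG) & \pi_1\alpha_{16} & \pi_2\alpha_{26} & \pi_3\alpha_{36} & \pi_4\alpha_{46} & \pi_5\alpha_{56} & * & \pi_7\alpha_{67} \\
7\,(MM) & \pi_1\alpha_{17} & \pi_2\alpha_{27} & \pi_3\alpha_{37} & \pi_4\alpha_{47} & \pi_5\alpha_{57} & \pi_6\alpha_{67} & *
\end{array}
$$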
[Figure: probability of remaining in the same state, $P_{ii}$. SSU rRNA sequences from Eubacteria.]
[Figure: probability $P_{ij}$ of changes from CG to other pairs. SSU rRNA from Eubacteria.]
What is going on?
[Figure: diagram of the pair states AU, GU, GC, UA, UG and CG, with arrows between them labelled fast or slow.]
Selection against GU and UG is weaker than against mismatches.
Double transitions are faster than double transversions.
Double transitions are faster than single transitions to GU and UG states.
This is explained by the theory of compensatory substitutions.
Analysis of RNA sequence databases
                         tRNA      tRNA      tRNA      RNase P   SSU
                         mitoch.   general   archaea             rRNA
G+C average              0.339     0.532     0.636     0.594     0.545
G+C helical regions      0.448     0.681     0.829     0.730     0.674
Frequencies
  GC                     0.266     0.372     0.473     0.385     0.352
  CG                     0.121     0.260     0.320     0.296     0.298
  AU                     0.257     0.128     0.057     0.117     0.122
  UA                     0.233     0.142     0.077     0.104     0.173
  GU                     0.046     0.043     0.031     0.050     0.020
  UG                     0.030     0.025     0.020     0.022     0.021
  MM                     0.046     0.030     0.022     0.026     0.014
Number of sequences      884       754       64        84        455
Number of pairs          21        21        21        80        296
Selection for thermodynamically stable structures
Higgs (2000) Quart. Rev. Biophysics
Analysis of RNA Substitution Rates
                              tRNA      tRNA      tRNA      RNase P   SSU
                              mitoch.   general   archaea             rRNA
Mutabilities
  GC                          0.67      0.49      0.45      0.65      0.55
  CG                          0.84      0.83      0.89      0.60      0.66
  AU                          0.86      1.46      4.01      1.46      1.40
  UA                          0.77      1.24      1.78      1.09      0.93
  GU                          2.44      1.96      1.85      1.72      3.92
  UG                          3.32      5.01      3.00      2.84      4.36
  MM                          2.32      0.99      0.86      5.24      7.84
Double transitions /
  Double transversions        4.7       1.7       2.3       3.1       2.1
Double transitions /
  Transitions to GU or UG     1.6       2.0       8.9       3.6       2.8
Thermodynamic properties influence Evolutionary properties