Design of RNA-based transcription switches to control living cells James Skinner

advertisement
Design of RNA-based transcription switches to
control living cells
James Skinner∗,† and Professor Alfonso Jaramillo,‡
The University of Warwick
E-mail: j.r.skinner@warwick.ac.uk
Abstract
In this work we investigate methods of regulation without use of transcription factors. We first produce a design for a synthetic RNA oscillator circuit, then work towards
computationally producing antiterminators. As a first step we computationally generate a large number of Rho-independent transcription terminator sequences, which we
achieve by first constructing a model of termination efficiency for simple terminators —
outperforming previous models—then optimise over this model to generate terminator
sequences. Our promising results demonstrate the feasibility of designing transcription
terminators computationally.
1
Introduction
Synthetic biology and the engineering of living cells holds great potential in medicine and
engineering. In creating a cell with a desired behaviour, DNA sequences with particular
functionality are designed and constructed. The philosophy of synthetic biology is to take
an engineering approach to this task, where DNA circuitry is built from pre-existing, well
characterised parts. Complex circuitry requires genes to interact and regulate each other,
which is normally achieved in nature through proteins called transcription factors. Since
transcription factors are difficult to design de novo, there is interest in re-purposing RNA to
∗
To whom correspondence should be addressed
Centre for Complexity Science
‡
School of Life Sciences
†
1
Loop
RNA
RNAp
Hairpin
Stem
5’
A-tract
3’
antisense
5’
sense
A A A AA A A A A
U U UUU U U U U 3’
U-tract
5’
3’
DNA
T TTT T T T T T
Figure 1: Cartoon mechanics of simple termination. The transcribed mRNA forms a hairpin
inside the transcribing RNAp, causing dissociation. The U:A RNA:DNA binding at the U-tract
stalls transcription, giving the hairpin time to form 3 . Important parts of the translated terminator
are the A-tract, U-tract, stem and loop.
achieve gene regulation 1 . Here we investigate regulation using RNA, and develop software
for computationally generating large number of transcription terminator sequences, which
may be used in a method of RNA regulation—antitermination 2 .
We define a gene to be a sequence of nucleotides which, through the action of various
machinery in the cell, produces a particular protein; a molecule which may bind strongly
and specifically to another molecule, and may catalyse specific reactions. Proteins perform
diverse tasks in the cell, and are produced through the process of transcription followed
by translation. In transcription, RNAp attaches to a region near the start of a gene and
runs along the DNA, ‘unzipping’ it, reading it and producing an RNA copy of one of the
DNA strands. This molecule—mRNA—is used in translation to produce proteins, but this
process is less important to this work. Importantly, transcription stops only when the RNAp
dissociates from the DNA, and it is (normally) nucleotide sequences known as transcription
terminators which cause this. Thus, terminators mark the end of the region which codes for
a protein. Terminator efficiency is a measure of the fraction of RNAp that the terminator
causes to dissociate.
Here we focus only on the class of terminators known as Rho-independent, which are
terminators which function without a protein called the Rho factor, and we look primarily at
terminators in bacteria. We will be referring to rho-independent terminators unless otherwise
specified. Such terminators function by having a sequence that causes the mRNA being
transcribed to form a hairpin structure (Figure 1) inside of the RNAp, causing the RNAp
to dissociate. A terminator typically results in a transcribed section of mRNA composed
of an A-tract, hairpin (made from a stem and loop) and U-tract. The U-tract is a Uracildense region downstream of the hairpin; since U:A is the lowest energy RNA:DNA base
pair 4 , transcription pauses at the U-tract, giving time for the hairpin to form 3 . Whilst the
RNAp is paused at the U-tract, the hairpin forms and is responsible for causing the RNAp
to dissociate 3 , but the exact details of this are not yet known. The A-tract typically has a
high Adenine content, with a role that could possibly be to extend the hairpin stem and help
ratchet the U-tract from the DNA 5 . By surrounding them with other genetic components,
2
terminators may be converted into antiterminators—terminators which may be enabled and
disabled using an RNA—which may be used as a method of RNA-only regulation 2 .
In the current state of synthetic biology, we are able to design synthetic circuitry by
combining known components which we may obtain from the literature or from a catalogue
such as The Registry1 . This circuitry may then be incorporated into an organism such as E.
coli via construction of a plasmid—a ring of DNA. Plasmid sequences are subject to certain
constraints, such as containing a suitable proportion of GC pairs and not containing long
repeated regions, which arise from the methods used in synthesis. In this work we describe
a method of computationally designing terminators; an important component normally required at the end of each gene. We show the feasibility of computationally designing large
numbers of terminators subject to constraints, and provide a software package for doing
this. We make use of an existing technology—CRISPR—to regulate circuitry via RNA, and
combine this technology with antitermination to give the design for an improved oscillator 6 .
In this paper we will be discussing nucleotide positions in DNA; each strand of DNA
is directional, with a 5’ (pronounced “five prime”) and a 3’ end, and transcription can only
occur 5’ → 3’. Paired strands run in opposite directions, so when talking about a particular
gene, the direction of transcription in that gene will be referred to as ‘upstream’, and the
opposite direction ‘downstream’. In translation, the strand which is copied into mRNA
(with Ts substituted to Us) is known as the ‘sense’ strand, and the complementary strand
the ‘antisense’ strand (Figure 1). The abbreviations bp and nt will be used for base pairs
and nucleotides respectively.
2
Background
To design computationally a nucleotide sequence for an efficient terminator requires a good
model of terminator efficiency from the sequence. Ermolaeva et al. 7 describes the TransTerm
algorithm for terminator discovery in bacterial genomes which uses hairpins followed by Utracts as indicators of terminator sequences. TransTerm assigns energy to hairpins through
a linear combination of the number of G-C, A-U and G-U base pairs in the stem, as well
as mismatches and gaps in the stem and the number of nucleotides in the loop, where the
coefficients are scores relating to each of these features. U-tracts are rated using the scoring
function proposed by d’Aubenton Carafa et al. 8 :
(
P15
xn−1 × 0.9 nth nucleotide is a U
score = − i=1 xn xn =
x0 = 1
xn−1 × 0.6 otherwise
The TransTermHP algorithm 9 improves upon TransTerm. Terminators are discovered by
looking first for length-six windows containing at-least three Thymines (which are transcribed
into Uracils), and the 15 nucleotides downstream of the start of this window are scored with
the same heuristic from d’Aubenton Carafa et al. 8 . The sequence up to 59 nt upstream
of this U-tract is given a hairpin score using an efficient algorithm. These scores are then
1
http://parts.igem.org/Main_Page
3
used to predict the quality of the sequence found as a terminator, which is done using the
likelihood of a structure with greater or equal hairpin and U-tract scores arising by chance,
taking into account the GC content of the genome being searched.
Cambray et al. 10 develop a method for the measurement of terminator efficiencies, and
characterise 61 E. coli terminators. A model is constructed using a linear combination of
sequence features mined from literature, and fit this model to their own data. Prediction was
found to be improved when terminators having low efficiency and terminators with variable
folding dynamics were excluded from the training set. In a similar work, Chen et al. 5
measure strengths of over 300 terminators (most are unique, but some are measured in the
forwards and reverse direction). Chen et al. 5 then build a biophysical model of termination
efficiency based off free energies for loop closure, hairpin extension, hairpin base binding and
RNA:DNA binding with the U-tract.
3
RNA oscillator
To gain familiarity with the process of designing synthetic circuitry, a circuit was constructed
to be implemented in a plasmid and inserted into E. coli. An oscillating circuit using only
RNA was chosen as the circuit to construct; the significance of using RNA and no transcription factors is that RNA folding is far easier to deal with computationally than protein
folding, making computational circuit design more feasible, meaning RNA circuitry is likely
to be important in the future. An oscillator was chosen as oscillations require nonlinearity
in the dynamics, so a functional circuit would demonstrate that this nonlinearity is possible
using only RNA.
3.1
A method of regulation using only RNA
Gene regulation using only RNA was facilitated by the CRISPR technology 11,12 . Two different designs inspired by Bikard et al. 11 and Qi et al. 12 were composed in attempt to minimize
repeated sequences, as these introduce difficulties in synthesis. CRISPR (clustered regularly
interspaced short palindromic repeat) loci encode a prokaryotic immune system against foreign DNA, often introduced by phages, and is made up of short "spacer" sequences which
match some target DNA sequence, separated by short repetitive sequences. This entire
repeat-spacer array is transcribed into a precursor CRISPR RNA, which is cleaved into
CRISPR RNA (crRNA), each targeting a particular nucleotide sequence. These crRNAs
then work with Cas (CRISPR-associated) nucleases to bind and cleave the target sequence.
The only restriction on target sequences is that the target (known as the protospacer) must
be have the adjacent 3’ sequence motif NGG—where N means any nucleotide—known as
the Protospacer Adjacent Motif (PAM). The length of the sgRNA base pairing region should
ideally be 20nt, with lengths differing from this negatively impacting repression 12 , meaning
coincidental off-target matches are extremely rare.
The Streptococcus pyogenes CRISPR system is the simplest, requiring the Cas9 nuclease,
crRNA and another RNA (tracrRNA) which is involved in the interaction between crRNA
and Cas9. RNaseIII is also required in the binding and processing of Cas9 and sgRNA, but
4
A
B
C
Figure 2: A repressilator; gene A represses B, B represses C and C represses A using transcription
factors, which results in oscillations. The circuit described here works in the same way using only
RNA.
this is already present in E. coli. Bikard et al. 11 take the Cas9 nuclease from S. pyogenes and
introduce mutations to prevent cleavage, which they call dCas9. Using a crRNA to target
dCas9 to a promoter region represses transcription, likely through blocking the RNAp, so
enables a form of regulation. Qi et al. 12 simplify the system by introducing a gene coding for
a chimera of crRNA and tracrRNA which they name Small Guide RNA (sgRNA), reducing
the entire system to two genes; the deactivated Cas9 and an sgRNA. This reduced system
is named CRISPR interface (CRISPRi).
3.2
Design of circuit
We decided on an oscillator design analogous to the repressilator; a system of 3 genes A,
B and C, where A represses B, B represses C and C represses A (Figure 2). We used the
CRISPRi architecture 12 , meaning that we required the three genes A, B and C, as well as a
gene coding for dCas9. The sequences for these are given in Appendix A. Genes A, B and
C use different promoter regions, two of which are constitutive (always ‘on’) and the other
arabinose inducible (only ‘on’ when arabinose is present), which is necessary for experimental
validation; we only want to observe oscillations when arabinose is present. Following each
promoter is the sequence coding for the sgRNA; 20 nt complementary to the region to which
the sgRNA will bind, followed by 82 nt which code for the machinery used in the interaction
between the sgRNA and dCas9, as well as the terminator for the gene 12 .
Repression is highly efficient when the sgRNA binds to the -35 or -10 box of the promoter
(regions which are important in the binding of the RNAp) 11,12 , so these were the regions
targeted. The architecture of the binding of the sgRNA to the promoters is given graphically
in Figure 3, where it can be seen that every sgRNA obscures either a -35 or -10 box. Note
that the gene A sgRNA binds to the antisense promoter strand of gene B; this was necessary
due to no appropriate PAMs existing on the sense strand, and this has no effect on the level
of repression 12 .
5
Gene A promoter (plLac01)
#-35##
PAM
#-10##
5’-ATAAATGTGAGCGGATAACATTGACATTGTGAGCGGATAACAAGATACTGAGCAC-3’ sense
|||||||||||||
||||||||||||||||||||||
3’-TATTTACACTCGCCTATTGTAACTGTAACACTCGCCTATTGTTCTATGACTCGTG-5’ antisense
||||||||||||||||||||
5’-GAUAACAUUGACAUUGUGAG<dCas9 binding + term>-3’ Gene C sgRNA
Gene B promoter (plTet012)
#-35## PAM
#-10##
3’-TGAGATAGTAACTATCTCAAACTGTAGGGATAGTCACTATCTCTATGACTCGT-5’ antisense
|||||||||||||||||||||||||||||
||||
5’-ACTCTATCATTGATAGAGTTTGACATCCCTATCAGTGATAGAGATACTGAGCA-3’ sense
||||||||||||||||||||
3’-<dCas9 binding + term>AUAGUCACUAUCUCUAUGAC-5’ Gene A sgRNA
Gene C promoter (J23119)
#-35##
PAM#-10##
5’-TTGACAGCTAGCTCAGTCCTAGGTATAATGCTAGC-3’ sense
|||||||||||||||
3’-AACTGTCGATCGAGTCAGGATCCATATTACGATCG-5’ antisense
||||||||||||||||||||
5’-UUGACAGCUAGCUCAGUCCU<dCas9 binding + term>-3’ Gene B sgRNA
Figure 3: Blueprint of sgRNAs binding—thus repressing—their target promoters. It can be seen
that all sgRNAs bind adjacent and 3’ to a PAM, and obscure either a -10 or -35 region of the
promoter, thus maximally repressing the target gene 12 . The sgRNA transcript of gene A binds to
the sense strand of gene B instead of the antisense strand as there was no continently located PAM
on the sense strand. This does not significantly affect the level of repression 11 .
4
Design of software
To facilitate the computational design of terminators, we first constructed a model of terminator efficiency with respect to sequence. This model assigns a score—which we call TS —to a
sequence, with higher scores indicating higher predicted terminator strengths. We used optimisation procedures to come up with a sequence maximizing TS , thus producing terminators
which are predicted to be efficient.
4.1
Models of termination efficiency
Four models of terminator efficiency were constructed of increasing complexity. For these
models we used the same technique as Cambray et al. 10 of using a linear combination of
nonlinear basis functions fi on the sequence supplied (Equation 1), parameterised by coefficients βi . This allowed for sufficient nonlinearity in the model while retaining linearity
in the parameters, thus allowing the optimum parameters (in the least-squares sense) to be
calculated efficiently (in closed form) when fitting the model to data 13 .
6
TS (sequence) =
X
βi fi (sequence) + P (sequence)
(1)
i=0
In every model, f0 = 1 is used so that β0 f0 supplies a constant offset. The P function
penalises sequences containing stop codons; P looks for the reading frame containing the
fewest stop codons, and applies a constant penalty multiplied by the number of stop codons
in that frame. The terms used in each model and their corresponding coefficients are given in
Appendix B. Terms largely correspond to free energies, which are calculated using RNAeval
and RNAfold from the ViennaRNA package 14 . The set of terms and coefficients used in each
model is given in Appendix B.
4.1.1
Model 1
The simplest model was defined using 5 terms (including the constant offset term f0 = 1):
f1 = ∆Ga f2 = ∆Gb f3 = ∆Gl f4 = ∆Gu
where ∆GA , ∆GB , ∆GL , ∆GU are free energies which have been reimplemented from Chen
et al. 5 . These could not be reimplemented exactly, but still represent sensible measures from
which to base a model of termination efficiency.
∆GA is the free energy of the extended hairpin, and is calculated via ∆GA = ∆GHA −
∆GH . ∆GHA is the free energy of the folded RNA beginning 8 nucleotides upstream of
the start of the hairpin and ending 8 nucleotides downstream the end of the hairpin, as
predicted by RNAeval. ∆GH is the free energy of hairpin folding; the free energy reported
by RNAeval for the hairpin sequence using the structure given by RNAfold for the entire
terminator concatenated to just the hairpin region.
∆GB is the free energy of the binding of the hairpin base, which is defined as the 3
dinucleotide pairs in the hairpin furthest from the loop. This is calculated by identifying these
nucleotide pairs and treating each strand as an individual RNA fragment, using RNAeval to
calculate their free energy.
The free energy of the hairpin loop is included in the model through ∆GL . ∆GL is
calculated using RNAeval to calculate the free energy of the terminator sequence with the
structure where only the nucleotides at the top of the stem (just below the loop) are bound.
The binding of this base pair takes up most of the time of the binding of the stem, as once
these nucleotides have paired the rest of the stem rapidly “zips” up 5 .
∆GU is the free energy of binding between the U-tract and the DNA, and is calculated
using:
8
X
0
∆GU = ∆GRNA:DNA +
GRNA:DNA (ni , ni + 1)
i=1
where a U-tract length of 8 is assumed, G0RNA:DNA is the initiation energy of RNA:DNA
hybridization and GRNA:DNA (ni , ni + 1) is the free energy of the binding of the RNA:DNA
nucleotide pair at positions i and i + 1 4 .
7
(a) ∆GHA
(b) ∆GH
(c) ∆GB
(d) ∆GL
Figure 4: Sequence subsets and structures used in free energy calculations. ∆GHA is calculated by
looking at the sequence from 8bp upstream of the hairpin to bp downstream, and calculating the free
energy of the entire structure with remaining bases unpaired. ∆GH is calculated similarly looking
only at hairpin sequence. ∆GB looks at the three nucleotide pairs at the base of the hairpin and
calculates the free energy as if this was a standalone structure. ∆GL is calculated by folding the
entire sequence into a structure where only the top nucleotide pair in the stem is bound. Free energy
calculations were performed with ViennaRNA 14 .
4.1.2
Model 2
The second model extended the first by including interactions between the free energy terms.
Squared terms were omitted to keep the model complexity low.
4.1.3
Models 3 and 4
Model 3 expanded upon the second by introducing two new sequence functions LL and LS
which correspond to the lengths of the loop and the hairpin stem respectively. Again, model
3 included all pairwise interactions between sequence functions as well as each sequence
function alone plus f0 = 1, giving a total of 22 terms. Model 4 extended this again by
adding a new free energy ∆GS ; the free energy of the binding of the stem as calculated by
RNAeval, giving a total of 29 terms.
4.2
Identification of terminator regions
All four models rely on the identification of the A-tract, U-tract, hairpin and loop, which are
identified using a reimplementation of the heuristic described by Chen et al. 5 in supplementary note 8. Given a sequence, RNAfold is used to predict the folded mRNA structure. We
then search the entire structure except the first and last 8 nucleotides for hairpins. Structure
is represented as a dot-bracket notation, where a dot indicates an unpaired nucleotide and a
matched pair of brackets indicate a base pair. As an example, ...(((((.....)))))...
shows a structure with a hairpin. Using this representation, we define hairpins as all top
level pairs of matched brackets; i.e., matched brackets which are not inside another pair of
matched brackets. If more than one hairpin is found, we use the one with the greatest folded
free energy, as calculated by RNAeval. Given the location of this hairpin, we search for the
U-tract by starting at the 6th nucleotide in the right arm of the stem and looking at each 8
bp sequence beginning with a U until 8 nt downstream from the hairpin. The U-tract is the
8
8 bp region with the highest ∆GU . If candidate U-tracts with identical ∆GU are found, the
most 5’ one is chosen. If there are no 8bp sequences beginning with a U, the 8bp sequence
immediately 3’ of the stem is chosen. The A-tract is chosen as the 8bp directly 5’ of the
hairpin. The loop is the largest loop structure in the structure of the hairpin, where we
choose the most 5’ loop if multiple are present.
Identification of these regions relies on being able to find a hairpin in the structure. This
will not be the case for all sequences, and in such instances the model will be unable to
assign a score to the sequence. When this occurs, we assign a constant score which is below
that of all sequences with correctly identified components, but high enough such that an
optimisation routine may still explore regions of failed component identification.
4.3
Algorithms used
We used three algorithms to look for a sequence maximising the score assigned to it by the
model described above.
4.3.1
Random Mutation Hill-climber
Our first optimisation technique used was Random Mutation Hill-climbing (RMHC). We keep
a population of fixed length sequences, and repeatedly performed single point mutations, reevaluated the score of the sequence, and kept the mutated sequence if the score had improved.
This technique can get trapped in local optima, unable to progress.
4.3.2
Simulated Annealing
Simulated Annealing (SA) was implemented to deal with ruggedness in the sequence score
landscape, as it is less prone to get stuck at local optima. Sequence scores were treated as
energies, and RMHC was performed but instead of accepting only beneficial mutations, the
acceptance probabilities were changed to:
(
1
ES1 ≤ ES2
P (accept mutation) =
−(ES −ES )
1
2
T
ES1 > ES2
e
where ES1 and ES2 are the energies (scores) of the sequence before and after mutation
respectively, and T is temperature in units of Boltzmann’s constant. T begins at some
specified value and reduces linearly towards 0 over the course of the optimisation. Unless
specified, we used T=2 for the initial value in optimisation.
4.3.3
Genetic Algorithm
A Genetic Algorithm (GA) was used to discover if there was any structure in the optimisation
problem of the kind that may be taken advantage of by a GA. We used a GA with nonoverlapping generations; that is, a ‘generation’ constitutes generating N offspring from a
population of size N , and replacing the entire population with its offspring. For selection of
9
parents we used tournament selection; select two individuals from a population at random,
and choose the individual with the higher fitness as a parent, then repeat to get a second
parent. Once two parents have been selected, we used two-point crossover to generate two
children; one complementary to the other with regard to which loci were inherited by which
parent. Once offspring were produced, we applied a mutation operator of mutating every
locus to a random nucleotide with probability 1/L, where L is the length of the individual.
5
Results
5.1
5.1.1
Model
Fitting to data
In fitting the model to data, we combined the datasets of Cambray et al. 10 and Chen
et al. 5 giving a total of 636 terminator sequences with experimentally measured termination
efficiencies. From Chen et al. 5 we only used the natural and synthetic terminator datasets,
and omitted the set of removed terminators, as these had unusual expression patterns so
likely had unusual mechanisms of termination, which we are not interested in producing.
Since there are a number of mechanisms of termination more complex than simple hairpin
formation, we restricted the dataset to only those 473 terminators with a length less than 50.
We hypothesize that such small terminators do not have ‘room’ for complex mechanisms,
and we only need to be able to model a single mechanism for successful terminator design.
To fit the model to the data, we needed to choose the set of coefficients minimizing the
difference between the predicted and measured efficiencies. Since our model is linear, we
were able to compute the coefficients minimising the squared error efficiently.
We tested model prediction on unseen data through bootstrapping. That is, we performed
100 iterations of randomly selecting two thirds of the combined dataset, fitting the model to
this training set and testing the model against the remaining data. Measures of predictive
performance are given in Table 1.
Table 1
Correlation Coefficient
σ̂
Mean Squared Error
Model
Terms µ̂
1
2
3
4
5
11
22
29
0.492 0.071
0.531 0.106
0.530 0.094
0.532 0.073
1288 378
1171 296
1264 298
1343 348
Chen et al. 5
Cambray et al. 10
–
5
0.450
0.620
1000 –
498 –
–
–
µ̂
σ̂
As shown in Table 1, we achieved better prediction on average on the combined restricted
10
Predicted score
150
Correlation coefficient: 0.668
100
102
101
50
100
0
10-1
50 100 150 200 250
Measured score
100
101
102
Measured score
Figure 5: Predicted vs actual (experimentally characterised) terminator efficiency. Note logarithmic axes on right hand plot means negative predicted scores have been removed.
datasets than Chen et al. 5 achieved on their natural and synthetic datasets. Testing our
model on only the natural and synthetic datasets from Chen et al. 5 , we achieve a correlation
coefficient of 0.502 which is still an improvement. Refitting the model to all terminators
of length below 50 and correlating the actual and predicted termination scores, again only
against terminators below length 50, whilst restricting only to those terminators which our
model is able to assign a score, we achieve a correlation coefficient of 0.668 (Figure 5).
Including terminators we are unable to predict—those where a hairpin cannot be identified—
and assigning them a score of 0 gives a correlation coefficient of 0.583. Looking at the
standard deviation of the correlation coefficients in Table 1, we see it is low compared to
the mean value, indicating out model accuracy is not sensitive to the training data.
Depending on our goal, the correlation coefficient or the mean squared error may be more
relevant. If we are interested in building a good predictive model of termination efficiency,
then the mean squared error is of greater importance. If, however, we are only intending to
optimise over this model, then we are more interested in the correlation coefficient since a
constant offset between predicted and actual efficiencies does not change the location of the
maxima which the optimisation is searching for.
5.2
5.2.1
Optimisation
Algorithm comparison
Here we look at the solutions each algorithm converges to. Since the GA and SA have a nonzero probability of reaching any solution, including the global optimum, we say an algorithm
has converged once its rate of improvement has become sufficiently slow, even though these
algorithms would eventually find the global maxima if run for enough time (though this
time may be infeasibly large). Running each algorithm 7 times for 10000 generations with a
11
Score
350
300
250
200
150
100
50
00
Genetic
Algorithm
5000
Generations
350
300
250
200
150
100
50
1000000
Simulated
Annealing
5000
Generations
Random Mutation
350 Hill-climbing
300
250
200
150
100
50
1000000
5000
10000
Generations
Figure 6: Highest score in the population for 7 separate runs of each algorithm optimising over
model 4. It can be seen that SA outperforms the other two algorithms. The poor performance of
the GA is likely due to the rapid loss of diversity at the beginning of each run. Each run lasted
10000 generations using a population of 30 and individual size of 50. The SA was started with a
temperature of 2.
population size of 30 and sequence length of 50, we see different runs converging to fitness
values (Figure 6). We also see different algorithms converging to different scores on average;
t-tests for the final values of pairwise different algorithms having the same mean fitness
after 10000 generations give p-values of 4 × 10−4 for GA/hill-climbing, 10−5 for SA/GA
and 2 × 10−3 for hill-climbing/SA, thus all algorithms converge to statistically significantly
different distributions of fitnesses. This tells us that the fitness landscape is rugged and
contains many local maxima in which the optimisation may get stuck.
Simulated annealing converges to the best distribution of fitnesses of the three algorithms.
It is interesting that the basic hill-climbing outperforms the genetic algorithm; investigating
this shows a rapid loss of diversity near the beginning of the optimisation (Figure 7), meaning the GA is less able to explore the fitness landscape. This could be fixed by reducing the
selective pressure in the algorithm by switching to a new method of selection 15 , or by implementing diversity maintenance methods such as deterministic crowding 16 . Adding further
terms to the model or constraints to the optimisation, such as minimising hydrophobicity or
producing antiterminators, is likely to increase the nonlinearity in the fitness landscape and
widen the gap between simple and complex optimisation routines.
5.2.2
Testing
Testing was performed by introducing a fitness function which is simply the count of the
number of T’s in the string. The convergence of each algorithm to the string of all T’s is
illustrated in Figure 8. The GA can be seen to converge fastest, which is due to the initially
diverse population having high fitness sub-sequences crossed over. Simulated Annealing is
12
Mean distance
between individuals
40
35
30
25
20
15
10
50
GA
SA
RMHC
200
400
600
Generations
800
1000
Figure 7: Diversity of the population as each algorithm progresses. To measure diversity we use
the mean distance between individuals in a population, where we define the distance between two
individuals as the number of loci at which the individuals differ. The initial rapid loss of diversity
in the GA can be seen, which is likely the reason behind the poor (sub-RMHC) performance.
the slowest algorithm to converge, and reduces to hill-climbing when the temperature is 0.
Since the fitness landscape is smooth and contains no local optima, all algorithms converge
to the global optimum solution.
One measure of the model not tested in bootstrapping is checking that poor scores are
assigned to non-terminators. We tested this by generating 10000 random sequences of length
50 and investigating the scores assigned, using the rationale that the volume of sequence
space taken up by valid terminators is so tiny that a random sequence is extremely unlikely
to coincide with it. No random sequence generated was able to be assigned a hairpin, so the
model was unable to assign any scores. This is a desirable behaviour; a failure to identify a
hairpin may be treated as a low score.
5.3
Sequences produced
The sequences produced by the optimisation show some strange behaviour. Looking at the
results of the optimisation described above (10000 generations with a population size of 30
and sequence length of 50) (Table 2), we see many sequences with clearly formed U-tracts,
but the highest scoring sequences lack a U-tract and instead have a number of trailing Gs,
and have a sequence of Cs instead of an A-tract. This becomes apparent when looking at
only the highest scoring sequences produced over the 7 runs, as shown in Table 3. This is
possibly a consequence of the model not extrapolating well into non-terminator regions of
sequence space.
Inspecting the sequences produced by each algorithm, we see the GA has significantly
less diversity in the results than the other two algorithms, which is likely the reason for
its poor performance; preventing the premature population convergence would likely solve
this. Looking at the best sequences produces, we can see all algorithms converge on a
13
40
Best fitness
35
30
25
SA:T=2
SA:T=0
RMHC
GA
20
150
200
400
600
Generation
800
1000
Figure 8: Fitness of the best individual in the population as the optimisation progresses using
the simple testing fitness function of counting the number of ‘T’s. Parameters used: population
size=30, individual size=100. Data shown: Genetic Algorithm (GA), Random Mutation Hill-climber
(RMHC), Simulated Annealing (SA). SA was started with temperatures of 2 and 0, and it can be
seen that SA with temperature 0 is equivalent to RMHC
similar solution, but this solution is not a good terminator. These similar solutions differ
at the hairpin, which is unsurprising; optimising the hairpin is hard since a single mutation
can make or break a base pair, and has the possibility of changing the hairpin structure,
introducing significant nonlinearity into the fitness landscape.
14
Table 2: Representative sequences produced after running each algorithm with population 30 for
10000 generations.
Random Mutation Hill-climbing
Sequence
Score
CCCCACCCCGCGCUGCCGUACGGCGGGCAGCGCGGGGUGGGGGGCUAGUA
GCAGGCCGCGGUCGUGGGGUUCCACCCACGACCGCGGCCUGCCGCAUCCG
CUGUAAAAAUCUGUAUGGGAUACACGCUCCGGCUCUAUUUGUUUUUUUUU
AACAACUAGACUCUCUAAAGGGGUCACCCGCUAAUUGGUGUUUUUUUUUU
CGGCCCCCCCCAUACAAAGACCACAAACGCUCCGGCCAGGGGGGGGCCGG
CAGCGAGACUUACCCGUAGGUGGCUCCGGUCCGGGAGUUUUUUUUUUGAC
CCCCCCCCCCGCACGACCUCGGAAUGUAAACGGGGGGGGGGGUCCCACGU
CAGGAUUCAAAUAUUAAUGGCCUCGGUAAAUUAUAACAUUUUUUUUUUUG
UCUUGCAGGUAGACUCCGGUGUGUUUUUUUUUUUUGGCAGGGAUAACACG
UUACCUAUUUACCCGUAUGGAGCUCCGGCCUGUUUUUUUUUUUCGCAAUC
168.671
135.787
185.949
197.391
228.335
183.54
226.792
191.404
176.264
175.348
Simulated Annealing
CCCCCCCCCCCAGUCAAACUCCUCGGCCAUAGAAAAGGGGGGGGGGGCCU
CUCAACACAACAGGUGCUCCGGUGGUUAUAGUUUUUUUUUGAGGUGACGA
GGACUUAAAGUUGGGGCUCACGGAGACUAUCGUUUUAUCUUUUUUUUUAA
UCACGUACGUUAGGGUGGACACGCUGAACUAAUUUUUUUUUCUCGAGUGC
UGUCAAAAGUUUAAUCUAUUGCUCCGGCCAUGCUUUUUUUUUAUUAAGAU
GGUGUUGUGUCUACCACUUAUCAGCUAGUCCUCGGGGUGUUUUUUUUUUC
GUCAAGACUAACAUUCAUAUAUAAUUUCUCCGGAUAGGUGUUUUUUUUUU
AGAUCUAACUUAUAUUAGGAUACUGGAUGGUUUUUUUUUAGGUCACAUAC
CCCCCCCCCCCAACAACUAGGAAACUUGAAGGAGCGACGGGGGGGGGGGG
UAGUCAAGGAAUAUUAGUUAGCUCCGGGUAGUAAGAAUUUUUUUUUUUAU
262.51
179.986
176.856
174.469
181.603
179.526
186.932
175.051
265.701
184.134
Genetic Algorithm
AAACUCUUCGCAUAUAAUCUCCGGAUGAUAGUUUUUUUUUCCUACUUUCC
AAACUCUUAGCAUAUAAUCUCCGGAUGAUGGUUUUUUUUUCCUACUCUCC
AGGCACGCUACAUAAAAUCUACGGAUGAUAGUUUUUUUUCCAUACUUCCU
CGCCACUUCGCAUAUAAUCUCCGGAUUUUGGUUUUUUUUUCCUACUUCCU
CGCCACGCUACAUAUAAUCUCCGGAUUUUGGUUUUUUUUUCCUACUGCCU
AAACUCUUCGCAUAAAAUCUCCGGAUUUUAGUUUCUUUUUCCUACUUACU
AUGCACCUUACAUAGAAUCUCCGGAUUUUAGUUUUUUUUUAUAACUUCCU
AAACACUUCGCAUAUAAUCUCCGGAUGAUGGCUUUUCUUUCCUACUUUCC
AAACUCUUAGCAUAUAAUCUCCGGAUGAUGGUUUUUUUUUCCUACUUUCC
AAACUCUUCGCAUAUAAUCUCCGGAUUUUAGUUUUUUUUUACAACUUUCC
15
178.327
177.732
99.4276
0
99.3639
149.729
175.195
33.6897
177.732
178.219
Table 3: Top 10 most highly scored sequences from each algorithm from 8 runs each of length 10000
with population 30
Random Mutation Hill-climbing
Sequence
Score
CCCCCCCCCCGGUAUACAGCAGGGAACUUAAUACGAAGCGGGGGGGGGGG
CCCCCCCCCCGAUAUACAGAAGGGAACUGAAUAAGAAGCGGGGGGGGGGG
CCCCCCCCCCGAUGGAAAGAAGGGAACUGAAUAAGAAGCGGGGGGGGGGG
CCCCCCCCCCGAUAUACAGAAGGGAACUUAAUACGAAGCGGGGGGGGGGG
CCCCCCCCCCGAUAUACAGAAGGGAACUUAAUACGAAGCGGGGGGGGGGG
CCCCCCCCCCGAUAUACAGAAGGGAACUUAAUACGAAGCGGGGGGGGGGG
CCCCCCCCCCGAUAAACAGAGGGGGACUAAAUACGGAGCGGGGGGGGGGG
CCUCCCCCCCGGUAUACAGCAGGGAACUAAAUAAGAAGCGGGGGGGGGGG
CCCCCCCCCCGCUAUACAGCAGGGAACUGAAUAAGGAGCGGGGGGGGGGG
CGCCCCCCCCGGUAGACAGAAGGGAACUGAAAACGAAGCGGGGGGGGGGG
264.615
262.879
262.879
261.142
261.142
261.142
257.1
248.74
245.947
238.304
Simulated Annealing
CCCCCCCCCCACAAAGCGGCACUCCGGAGUCUACAGCAGGGGGGGGGGGG
CCCCCCCCCUCACAUAGGUUAAACCGCCUCCGGGCAUUGAGGGGGGGGGG
CCCCCCCCCCGGGGACUAUUAUUGGAUACGAGCGCACAGGGGGGGGGGGG
CCCCCCCCCCAGGGGGUAGAAUGACGGGAACGAUGUGCAGGGGGGGGGGG
CCCCCCCCCGAGCGCACAGGGGGACCAAUAAAAAGCGGAAGGGGGGGGGG
CCCCCCCCCCCACUAAAGCUCCGGCACAUACGACAGGGGGGGGGGGGGAU
GGCCCCCCGUGAUUCAUAUUAGGCGCCUCGGGGCGAGCACGGGGGGCCCC
CGGCCCCCGUAACGGUAAUUAGCGUCGGUCUCCCGCAACGGGGGCCGGAA
CCCCCCCCGGAUAACAACCUCCGGGAAUGCGCACCGGGGGGGGGUCAUGU
CCCCCCCCGAGCUAUUAACCGGGGAACUGCCAGCGCGGGGGGGGGCAUAU
274.048
267.084
267.003
267.003
266.352
258.099
233.565
232.309
231.687
230.723
Genetic Algorithm
CCCCCCCCUGUGAGCGUCAACUCCGGUAACCCGAACAAGCAGGGGGGGGG
CCCCCCCCCGAGGAAACCUAGCUCCGGCCGCCCACCGCAAGGGGGGGGGG
CCCCCCCCCGAGCUCUUGUCUCCGGAAUUUAACCGCCACCGGGGGGGGGG
CCCCCCCCCGAGACGACGCCCUCCGGGAGACACAGCCGACGGGGGGGGGG
CCCCCCCCCGGAAAGCUAGUGUACACCUCGGUCUAACCCCGGGGGGGGGG
CCCCCCCCCCGCUACUUGACUCUCCGGACAACACGCCAAGGGGGGGGGGG
CCCCCCCCCCGUCACUGUUACUCCGGAUGCCAGACCCAAGGGGGGGGGGG
CCCCCCCCCCAUAAUGUAUACCUCCGGGCCAGAGAACAGGGGGGGGGGGG
CCCCCCCCCCAACCACUUAAGCUCCGGCACAGACGAAAGGGGGGGGGGGG
CCCCCCCCCCCAUCAGGUACCUCGGUCACAACACAAGCGGGGGGGGGGGC
16
310.603
301.754
296.534
291.795
279.071
275.904
275.672
274.512
274.28
273.609
6
Discussion
In extending this work, one of the major first goals would be to improve the model of
termination efficiency over which the optimisation is run. If a general machine learning
technique is used to learn a model from data, then training will also require negative training
examples; i.e., sequence strings with low experimentally measured efficiencies. It is important
that extrapolation is considered; since this model is to be optimised over all of sequence space,
we must be confident that the efficiency score assigned does not take large optima in regions
of sequence space far away from those that are known to produce good terminators. For this
reason it may be preferable to hand build a model which takes advantage of the physics of
the problem and use data to parametrise this model, as we have done here.
One approach in improving the model may be to identify more candidate features through
the literature and sensible physical assumptions, and to use feature selection techniques to
identify an optimum feature subset to be used in the linear model. The features under
selection here may include interactions between the ‘basic’ features (a single free energy
calculation, the length of the hairpin loop, etc), and a simple way to do this would be
to automate a hypothesis test for each feature checking that its coefficient is statistically
significantly different from 0, and remove the feature if it is not. We may then refit the
model to the data and repeat. Along with producing a more accurate model, this process
may also provide insight as to what kinds of interactions are important in the mechanics of
termination. An alternative approach may be to perform feature extraction on the actual
sequences in the dataset, but this may require more data since this data should not be used
for subsequent training.
It may be useful to minimize the hydrophobicity of the mRNA produced by the terminator. Hydrophobic molecules in the cell have a tendency to form insoluble aggregates
which are toxic to the cell, and it would be desirable to minimize this. We may also produce
bidirectional terminators by changing the objective function to be 0.5 times the score of the
forwards terminator plus 0.5 times the score of the reverse compliment sequence.
Since we do not want to use a single terminator more than once in a synthetic circuit
due to difficulties in synthesis, it may be desirable to optimise a family of terminators which
are maximally different from each other. This comes naturally to a GA, where addition of
diversity maintenance, such as deterministic crowding 16 , would produce a number of clusters
of terminators. Clustering algorithms could be used to identify these clusters, and the best
terminator could be taken from each.
There are constraints on DNA sequences that can be synthesized. Personal contact with
Integrated DNA Technologies (IDT)2 reveals that sequences with any of the properties listed
are unable to be synthesized:
1. Terminal repeat elements greater than 5bp
2. Hairpins greater than 15bp within 100bp of each other
3. Hairpins greater than 19bp
2
https://eu.idtdna.com/
17
4. Poly A or T stretches over 11bp
5. Poly G or C stretches over 7bp
6. Terminal GC content below 30% or above 70%
7. GC content below 28% in a 100bp window
8. Total GC content below 25%
9. Total GC content over 70%
10. GC content over 77% in a 100bp window
11. GC content over 68% in a 600bp window
12. Individual repeats over 7bp which constitute 35% of a sequence
13. Total repeats over 7bp which constitute 70% of a sequence
14. Windowed repeats over 7bp which constitute 90% of a 70bp window
Such constraints would be simple to incorporate into the optimisation, either by assining
a score penalty to sequences which break these rules, or by adding a hard constraint that
sequences breaking these rules cannot be generated. The former may be preferable, as it
would less constrain how the fitness landscape can be explored.
A particularly interesting extension would be to move from design of terminators to antiterminators 2 . To design RNA-controlled antiterminators computationally would allow for
the construction of regulatory circuits entirely in RNA, without being restricted by the difficulties associated with protein folding. As a demonstration of the power of antitermination
regulation, an improved oscillator with greater stability which only requires two genes 6 may
be implemented by combining antitermination and CRISPRi, which is given in Figure 9.
18
c11
TT
STOP
CSY4
RAJ11
c11
TT
STOP
CSY4
sgRNA
CSY4
GFP
Figure 9: Schematic implementing the oscillator described by Stricker et al. 6 without use of transcription factors. The blue blocks are antiterminators, where ‘TT’ is a terminator which may be
produced from the work here, and RAJ11 codes for the RNA able to disable termination (thus activating translation of genes A and B). ‘sgRNA’ is a small guide RNA (CRISPRi — Section 3.1)
which represses the promoters of both A and B, disabling the genes. Once translated, mRNAs are cut
at the CSY4 sites, freeing up the RAJ11 RNA and the RNA transcribed from the site coloured in red,
which will code for Green Fluorescent Protein—used for experimental verification of oscillations.
7
Conclusion
In this work we have investigated RNA based regulation, and tackled the computational
design of transcription terminators with the goal of moving towards computational design
of antiterminators. Computational design of synthetic components is important as, with
the increasing circuit complexity, we want to work at higher levels of abstraction and not
worry about circuits at the nucleotide level. We may also have a number of constraints and
preferences to work with which are too large to consider in designing components by hand.
To enable computational terminator design, we first constructed a model parametrised
on the datasets of Chen et al. 5 and Cambray et al. 10 , which takes a sequence and assigns
it a score predicting its efficiency as a terminator. We then used optimisation techniques
to optimise over this model, arriving at a set of sequences with high predicted terminator
efficiency. The results we obtain are promising, though not convincing enough to move onto
experimental classification of the sequences produced. Our work demonstrates that, with an
improved model, it is feasible to computationally produce transcription terminators subject
to a potentially large number of constraints. Constraints of a particular interest to the circuit
designer will likely include production of a large family of maximally diverse terminators,
producing bidirectional terminators, minimising terminator length, respecting constraints
for synthesis, minimising toxic hydrophobic gene products and others. By extending to
production of antiterminator sequences, we help pave the pay towards computational design
of complex circuits of many interacting genes.
19
References
(1) Rodrigo, G.; Landrain, T. E.; Jaramillo, A. Proceedings of the National Academy of
Sciences 2012, 109, 15271–15276.
(2) Liu, C. C.; Qi, L.; Lucks, J. B.; Segall-Shapiro, T. H.; Wang, D.; Mutalik, V. K.;
Arkin, A. P. Nat Meth 2012, 9, 1088–1094.
(3) Gusarov, I.; Nudler, E. Molecular cell 1999, 3, 495–504.
(4) Sugimoto, N.; Nakano, S.-I.; Katoh, M.; Matsumura, A.; Nakamuta, H.; Ohmichi, T.;
Yoneyama, M.; Sasaki, M. Biochemistry 1995, 34, 11211–11216.
(5) Chen, Y.-J. J.; Liu, P.; Nielsen, A. A.; Brophy, J. A.; Clancy, K.; Peterson, T.;
Voigt, C. A. Nature methods 2013, 10, 659–664.
(6) Stricker, J.; Cookson, S.; Bennett, M. R.; Mather, W. H.; Tsimring, L. S.; Hasty, J.
Nature 2008, 456, 516–519.
(7) Ermolaeva, M. D.; Khalak, H. G.; White, O.; Smith, H. O.; Salzberg, S. L. Journal of
Molecular Biology 2000, 301, 27 – 33.
(8) d’Aubenton Carafa, Y.; Brody, E.; Thermes, C. Journal of Molecular Biology 1990,
216, 835 – 858.
(9) Kingsford, C.; Ayanbule, K.; Salzberg, S. Genome Biology 2007, 8, R22.
(10) Cambray, G.; Guimaraes, J. C.; Mutalik, V. K.; Lam, C.; Mai, Q.-A.; Thimmaiah, T.;
Carothers, J. M.; Arkin, A. P.; Endy, D. Nucleic Acids Res 2013, 41, 5139–48.
(11) Bikard, D.; Jiang, W.; Samai, P.; Hochschild, A.; Zhang, F.; Marraffini, L. A. Nucleic
Acids Research 2013, 41, 7429–7437.
(12) Qi, L.; Larson, M.; Gilbert, L.; Doudna, J.; Weissman, J.; Arkin, A.; Lim, W. Cell
2013, 152, 1173 – 1183.
(13) Barber, D. Bayesian Reasoning and Machine Learning; Cambridge University Press,
2012.
(14) Lorenz, R.; Bernhart, S. H. F.; zu Siederdissen, C. H.; Tafer, H.; Flamm, C.;
Stadler, P. F.; Hofacker, I. L. Algorithms for Molecular Biology 2011, 6, 26.
(15) Jong, K. A. D. Evolutionary computation - a unified approach; MIT Press, 2006; pp
I–IX, 1–256.
(16) Mahfoud, S. W. Crowding and Preselection Revisited. Parallel Problem Solving From
Nature. 1992; pp 27–36.
20
Appendix A
Sequences
The sequences for the genes used for the RNA oscillator are given below:
Gene A
1 ATAAATGTGA GCGGATAACA TTGACATTGT GAGCGGATAA CAAGATACTG
51 AGCACCAGTA TCTCTATCAC TGATAGTTTT AGAGCTAGAA ATAGCAAGTT
101 AAAATAAGGC TAGTCCGTTA TCAACTTGAA AAAGTGGCAC CGAGTCGGTG
151 CTTTTTT
Promoter (plLac01)
sgRNA base pairing region - pairs with plTet012
sgRNA dCas9 handle + terminator
Gene B
1 ACTCTATCAT TGATAGAGTT TGACATCCCT ATCAGTGATA GAGATACTGA
51 GCATTGACAG CTAGCTCAGT CCTGTTTTAG AGCTAGAAAT AGCAAGTTAA
101 AATAAGGCTA GTCCGTTATC AACTTGAAAA AGTGGCACCG AGTCGGTGCT
151 TTTTT
Promoter (plTet012)
sgRNA base pairing region - pairs with J23119
sgRNA dCas9 handle + terminator
Gene C
1 TTGACAGCTA GCTCAGTCCT AGGTATAATG CTAGCGATAA
51 GTGAGGTTTT AGAGCTAGAA ATAGCAAGTT AAAATAAGGC
101 TCAACTTGAA AAAGTGGCAC CGAGTCGGTG CTTTTTT
Promoter (J23119)
sgRNA base pairing region - pairs with plLac01
sgRNA dCas9 handle + terminator
dCas9 gene:
1 ACATTGATTA TTTGCACGGC GTCACACTTT GCTATGCCAT
51 TCCATAAGAT TAGCGGATCC TACCTGACGC TTTTTATCGC
101 TGTTTCTCCA TACCGTTTTT TTGGGCTAGC TCTAGAGAAA
151 ACTAGATGAT GGATAAGAAA TACTCAATAG GCTTAGCTAT
201 AGCGTCGGAT GGGCGGTGAT CACTGATGAA TATAAGGTTC
251 GTTCAAGGTT CTGGGAAATA CAGACCGCCA CAGTATCAAA
301 TAGGGGCTCT TTTATTTGAC AGTGGAGAGA CAGCGGAAGC
351 AAACGGACAG CTCGTAGAAG GTATACACGT CGGAAGAATC
401 TCTACAGGAG ATTTTTTCAA ATGAGATGGC GAAAGTAGAT
451 TTCATCGACT TGAAGAGTCT TTTTTGGTGG AAGAAGACAA
501 CGTCATCCTA TTTTTGGAAA TATAGTAGAT GAAGTTGCTT
551 ATATCCAACT ATCTATCATC TGCGAAAAAA ATTGGTAGAT
21
CATTGACATT
TAGTCCGTTA
AGCTTTTTTA
AACTCTCTAC
GAGGGGACAA
CGGCACAAAT
CGTCTAAAAA
AAAAATCTTA
GACTCGTCTC
GTATTTGTTA
GATAGTTTCT
GAAGCATGAA
ATCATGAGAA
TCTACTGATA
601
651
701
751
801
851
901
951
1001
1051
1101
1151
1201
1251
1301
1351
1401
1451
1501
1551
1601
1651
1701
1751
1801
1851
1901
1951
2001
2051
2101
2151
2201
2251
2301
2351
2401
2451
2501
2551
2601
2651
AAGCGGATTT
CGTGGTCATT
GGACAAACTA
AAAACCCTAT
CGATTGAGTA
TGAGAAGAAA
TGACCCCTAA
CAGCTTTCAA
AATTGGAGAT
ATGCTATTTT
GCTCCCCTAT
CTTGACTCTT
AAGAAATCTT
GGGGGAGCTA
AAAAATGGAT
TGCTGCGCAA
CACTTGGGTG
ATTTTTAAAA
TTCCTTATTA
ATGACTCGGA
TGTCGATAAA
TTGATAAAAA
TATGAGTATT
TGAAGGAATG
TTGTTGATTT
AAAGAAGATT
AGGAGTTGAA
TAAAAATTAT
ATCTTAGAGG
GATTGAGGAA
TGAAACAGCT
AAATTGATTA
TTTTTTGAAA
ATGATGATAG
GGACAAGGCG
TGCTATTAAA
TCAAAGTAAT
CGTGAAAATC
GAAACGAATC
AGCATCCTGT
TATCTCCAAA
TCGTTTAAGT
GCGCTTAATC
TTTTGATTGA
TTTATCCAGT
TAACGCAAGT
AATCAAGACG
AATGGCTTAT
TTTTAAATCA
AAGATACTTA
CAATATGCTG
ACTTTCAGAT
CAGCTTCAAT
TTAAAAGCTT
TTTTGATCAA
GCCAAGAAGA
GGTACTGAGG
GCAACGGACC
AGCTGCATGC
GACAATCGTG
TGTTGGTCCA
AGTCTGAAGA
GGTGCTTCAG
TCTTCCAAAT
TTACGGTTTA
CGAAAACCAG
ACTCTTCAAA
ATTTCAAAAA
GATAGATTTA
TAAAGATAAA
ATATTGTTTT
AGACTTAAAA
TAAACGTCGC
ATGGTATTAG
TCAGATGGTT
TTTGACATTT
ATAGTTTACA
AAAGGTATTT
GGGGCGGCAT
AGACAACTCA
GAAGAAGGTA
TGAAAATACT
ATGGAAGAGA
GATTATGATG
TATTTGGCCT
GGGAGATTTA
TGGTACAAAC
GGAGTAGATG
ATTAGAAAAT
TTGGGAATCT
AATTTTGATT
CGATGATGAT
ATTTGTTTTT
ATCCTAAGAG
GATTAAACGC
TAGTTCGACA
TCAAAAAACG
ATTTTATAAA
AATTATTGGT
TTTGACAACG
TATTTTGAGA
AGAAGATTGA
TTGGCGCGTG
AACAATTACC
CTCAATCATT
GAAAAAGTAC
TAACGAATTG
CATTTCTTTC
ACAAATCGAA
AATAGAATGT
ATGCTTCATT
GATTTTTTGG
AACATTGACC
CATATGCTCA
CGTTATACTG
GGATAAGCAA
TTGCCAATCG
AAAGAAGACA
TGAACATATT
TACAGACTGT
AAGCCAGAAA
AAAGGGCCAG
TCAAAGAATT
CAATTGCAAA
CATGTATGTG
TCGATGCCAT
22
TAGCGCATAT
AATCCTGATA
CTACAATCAA
CTAAAGCGAT
CTCATTGCTC
CATTGCTTTG
TGGCAGAAGA
TTAGATAATT
GGCAGCTAAG
TAAATACTGA
TACGATGAAC
ACAACTTCCA
GATATGCAGG
TTTATCAAAC
GAAACTAAAT
GCTCTATTCC
AGACAAGAAG
AAAAATCTTG
GCAATAGTCG
CCATGGAATT
TATTGAACGC
TACCAAAACA
ACAAAGGTCA
AGGTGAACAG
AAGTAACCGT
TTTGATAGTG
AGGTACCTAC
ATAATGAAGA
TTATTTGAAG
CCTCTTTGAT
GTTGGGGACG
TCTGGCAAAA
CAATTTTATG
TTCAAAAAGC
GCAAATTTAG
AAAAGTTGTT
ATATCGTTAT
AAAAATTCGC
AGGAAGTCAG
ATGAAAAGCT
GACCAAGAAT
TGTTCCACAA
GATTAAGTTT
ATAGTGATGT
TTATTTGAAG
TCTTTCTGCA
AGCTCCCCGG
TCATTGGGTT
TGCTAAATTA
TATTGGCGCA
AATTTATCAG
AATAACTAAG
ATCATCAAGA
GAAAAGTATA
TTATATTGAT
CAATTTTAGA
CGTGAAGATT
CCATCAAATT
ACTTTTATCC
ACTTTTCGAA
TTTTGCATGG
TTGAAGAAGT
ATGACAAACT
TAGTTTGCTT
AATATGTTAC
AAGAAAGCCA
TAAGCAATTA
TTGAAATTTC
CATGATTTGC
AAATGAAGAT
ATAGGGAGAT
GATAAGGTGA
TTTGTCTCGA
CAATATTAGA
CAGCTGATCC
ACAAGTGTCT
CTGGTAGCCC
GATGAATTGG
TGAAATGGCA
GAGAGCGTAT
ATTCTTAAAG
CTATCTCTAT
TAGATATTAA
AGTTTCCTTA
2701 AAGACGATTC AATAGACAAT AAGGTCTTAA
2751 GGTAAATCGG ATAACGTTCC AAGTGAAGAA
2801 CTATTGGAGA CAACTTCTAA ACGCCAAGTT
2851 ATAATTTAAC GAAAGCTGAA CGTGGAGGTT
2901 GGTTTTATCA AACGCCAATT GGTTGAAACT
2951 GGCACAAATT TTGGATAGTC GCATGAATAC
3001 AACTTATTCG AGAGGTTAAA GTGATTACCT
3051 GACTTCCGAA AAGATTTCCA ATTCTATAAA
3101 CCATCATGCC CATGATGCGT ATCTAAATGC
3151 TTAAGAAATA TCCAAAACTT GAATCGGAGT
3201 GTTTATGATG TTCGTAAAAT GATTGCTAAG
3251 AGCAACCGCA AAATATTTCT TTTACTCTAA
3301 CAGAAATTAC ACTTGCAAAT GGAGAGATTC
3351 ACTAATGGGG AAACTGGAGA AATTGTCTGG
3401 CACAGTGCGC AAAGTATTGT CCATGCCCCA
3451 CAGAAGTACA GACAGGCGGA TTCTCCAAGG
3501 AATTCGGACA AGCTTATTGC TCGTAAAAAA
3551 TGGTGGTTTT GATAGTCCAA CGGTAGCTTA
3601 AGGTGGAAAA AGGGAAATCG AAGAAGTTAA
3651 GGGATCACAA TTATGGAAAG AAGTTCCTTT
3701 TTTAGAAGCT AAAGGATATA AGGAAGTTAA
3751 TACCTAAATA TAGTCTTTTT GAGTTAGAAA
3801 GCTAGTGCCG GAGAATTACA AAAAGGAAAT
3851 ATATGTGAAT TTTTTATATT TAGCTAGTCA
3901 GTCCAGAAGA TAACGAACAA AAACAATTGT
3951 TATTTAGATG AGATTATTGA GCAAATCAGT
4001 TTTAGCAGAT GCCAATTTAG ATAAAGTTCT
4051 GAGACAAACC AATACGTGAA CAAGCAGAAA
4101 TTGACGAATC TTGGAGCTCC CGCTGCTTTT
4151 TGATCGTAAA CGATATACGT CTACAAAAGA
4201 TCCATCAATC CATCACTGGT CTTTATGAAA
4251 CTAGGAGGTG ACTGAGTCGA CCCAGGCATC
4301 TCGAAAGACT GGGCCTTTCG TTTTATCTGT
4351 CTACTAGAGT CACACTGGCT CACCTTCGGG
Promoter (K206001 - pBad weak)
RBS (J61100)
Coding sequence (K1026001)
Terminator (B0015)
23
CGCGTTCTGA
GTAGTCAAAA
AATCACTCAA
TGAGTGAACT
CGCCAAATCA
TAAATACGAT
TAAAATCTAA
GTACGTGAGA
CGTCGTTGGA
TTGTCTATGG
TCTGAGCAAG
TATCATGAAC
GCAAACGCCC
GATAAAGGGC
AGTCAATATT
AGTCAATTTT
GACTGGGATC
TTCAGTCCTA
AATCCGTTAA
GAAAAAAATC
AAAAGACTTA
ACGGTCGTAA
GAGCTGGCTC
TTATGAAAAG
TTGTGGAGCA
GAATTTTCTA
TAGTGCATAT
ATATTATTCA
AAATATTTTG
AGTTTTAGAT
CACGCATTGA
AAATAAAACG
TGTTTGTCGG
TGGGCCTTTC
TAAAAATCGT
AGATGAAAAA
CGTAAGTTTG
TGATAAAGCT
CTAAGCATGT
GAAAATGATA
ATTAGTTTCT
TTAACAATTA
ACTGCTTTGA
TGATTATAAA
AAATAGGCAA
TTCTTCAAAA
TCTAATCGAA
GAGATTTTGC
GTCAAGAAAA
ACCAAAAAGA
CAAAAAAATA
GTGGTTGCTA
AGAGTTACTA
CGATTGACTT
ATCATTAAAC
ACGGATGCTG
TGCCAAGCAA
TTGAAGGGTA
GCATAAGCAT
AGCGTGTTAT
AACAAACATA
TTTATTTACG
ATACAACAAT
GCCACTCTTA
TTTGAGTCAG
AAAGGCTCAG
TGAACGCTCT
TGCGTTTATA
Appendix B
Model coefficients
Table 4: Features and coefficients used for model 1
Term
Coefficient
Const offset
-4.28716658e+02
∆GA
-2.44926259e-01
∆GB
3.85406757e+01
∆GL
-5.03126344e+00
∆GU
3.89388333e+00
Table 5: Features and coefficients used for model 2
Term
Coefficient
Const offset
-1.17415969e+03
∆GA
2.17107148e+01
∆GB
9.82247253e+01
∆GL
8.59259503e+01
∆GU
-1.00567505e+02
∆GA ∆GB
-1.66646028e+00
∆GA ∆GL
2.72483217e-01
∆GA ∆GU
3.01939045e-01
∆GB ∆GL
-7.23665163e+00
∆GB ∆GU
8.62127506e+00
∆GL ∆GU
-6.80847451e-01
Table 6: Features and coefficients used for model 3
Term
Coefficient
Const offset
-1.54449902e+03
∆GA
1.94092517e+01
GB
1.27651933e+02
∆GL
7.06966080e+01
∆GU
-1.05605531e+02
LL
4.49623637e+01
LS
1.05131338e+01
∆GA ∆GB
-1.24756363e+00
∆GA ∆GL
5.96358313e-01
∆GA ∆GU
4.47671856e-01
∆GA LL
1.69585495e-01
∆GA LS
-3.21389945e-01
∆GB ∆GL
-5.05766396e+00
∆GB ∆GU
8.91592880e+00
∆GB LL
-4.11082632e+00
∆GB LS
-9.24211257e-01
∆GL ∆GU
-4.42576156e-01
∆GL LS
-5.19086074e-01
∆GL LS
-4.15503552e-01
∆GU ∆LL
-1.47249066e-01
∆GU LS
5.93690124e-02
LL LS
4.69729903e-01
24
Table 7: Features and coefficients used for model 4
Term
Coefficient
Const offset
-1.51184410e+03
∆GA
1.66213509e+01
∆GB
1.21625099e+02
∆GL
5.88613541e+01
∆GU
-1.27621717e+02
LL
5.61511202e+01
LS
-1.96140980e+01
∆GS
-4.00973435e+01
∆GA ∆GB
-1.20492198e+00
∆GA ∆GL
3.52904807e-01
∆GA ∆GU
3.85190822e-01
∆GA LL
9.77057825e-03
∆GA LS
-1.34732917e-02
∆GA ∆GS
1.94543285e-01
∆GB ∆GL
-4.29635388e+00
∆GB ∆GU
1.06674915e+01
∆GB LL
-4.64374486e+00
∆GB LS
1.67025667e+00
∆GB ∆GS
3.09989733e+00
∆GL ∆GU
-1.28934700e-01
∆GL LL
3.38330963e-01
∆GL LS
-6.30864883e-01
∆GL ∆GS
-3.75788116e-01
∆GU LL
2.05456068e-02
∆GU LS
-3.65345779e-01
∆GU ∆GS
-4.59169225e-01
LL LS
-1.47700544e-01
LL ∆GS
-4.57114319e-01
LS ∆GS
9.21028673e-02
Appendix C
C.1
Software usage and maintenance
How to Use
The software has been produced with a simple command-line interface, and may be extended with a graphical interface. To use the software, first call make in the directory containing the software, which will produce two executables; score.exe and termopt.exe.
termopt.exe is the main executable and will be described below, and score.exe is an
executable supplied for convenience which takes a sequence and produces a terminator score.
Since the domain of the terminator model used does not cover the full space of sequences,
score.exe will exit with an exit status of 1 if a sequence is provided which the model is
unable to predict. To run TermOpt, call ./termopt.exe <params>, where <params>
is an unordered list of parameters to be supplied which are given in Table 8. All parameters
are optional and will default to their values given in the file named params. Modifying the
contents of the params file is an alternative way of providing parameter values.
As an example, if we want to use simulated annealing starting from a temperature 1.5
(units of Boltzmann’s constant) on a population of 30 length 40 sequences, we would call:
./termopt -temperature 15 -population-size 30 -individual-size 40 We
may alternatively edit the params file appropriately and simply call ./termopt.exe. The
format of the params file is an unordered sequence of <param>=<val> separated by newlines. Whitespace is ignored, as are all characters including and following a #, allowing
for commenting. The names of parameters are the same as the long names of those that
may be specified on the command-line, except there is no double-dash prefix and all dashes
separating words have been replaced by underscores.
25
Table 8: Optional parameters that may be passed to termopt. The long names may also be used
in the params file to set defaults, where the - prefix has been removed and all other -s replaced
with _s.
Short name
Long name
Type
Description
-a
–algorithm
string
-c
–stop-codon-penalty
float
-g
–generations
int
-h
–history
bool
-p
–population-size
int
-i
-u
–individual-size
–unfolded-energy
int
float
-v
–vienna-location
string
-V
-t
–verbose
–top
bool
int
-T
–temperature
float
Select an algorithm to use. The argument
should be one of evolve, anneal, hillclimb
which selects a GA, simulated annealing
or random mutation hill-climbing respectively.
Set the score penalty for containing stop
codons. The Total stop codon penalty applied is the parameter given multiplied by
the number of stop codons in the reading
frame containing the fewest stop codons.
Since this penalty is added to the score, it
will normally be negative.
Set the number of generations to run the
algorithm for.
If true, after outputting sequences and their
fitnesses, outputs a newline followed by the
history of the best fitness at each generation.
Size of the population of sequences being
optimized.
Length of the sequences being optimised.
Set the fitness assigned to an individual
which the model cannot predict. This can
happen when the full sequence is unable to
form a hairpin, or if the hairpin is unable
to fold without the context of the rest of
the sequence.
The location of the folder containing ViennaRNA executables RNAfold and RNAeval. The current version MUST contain a
trailing slash.
If true, prints more information to stdout.
Only print the top n individuals to stdout.
If passed 0, will print all individuals.
Sets the starting temperature to use in simulated annealing, which decreases linearly
towards 0 during simulation.
26
C.2
C.2.1
How to Modify
Replacing the model
The best way to replace the model of termination efficiency which is being optimised is to
replace the term_score function in the file linear_model.cpp. If the new model does
not use the free energies, then the free_energies.cpp,h files may be removed, as can
the get_components() function in sequence.cpp.
C.2.2
Adding new parameters
To add new parameters, first add a new variable to the Options struct in the argparse.h
file with a default value. It may be preferable to copy the current paradigm of having the
default value set using a function which looks in params for a value. To do this, specify
a function which takes a std::string which will be the string of the value passed in,
checks it for correctness and returns it as the correct type. Calling get_val() with a
string argument will look in params for values corresponding to this argument, which can
be used to pull values from params. To take a value from the commandline, modify the
parse_args function by adding a new term to the else if stack which assigns the validated
parameter value to the new Options variable. The existing code should provide a large
number of examples to follow.
27
Download