Design of RNA-based transcription switches to control living cells

James Skinner∗,† and Professor Alfonso Jaramillo‡
The University of Warwick
E-mail: j.r.skinner@warwick.ac.uk

∗ To whom correspondence should be addressed
† Centre for Complexity Science
‡ School of Life Sciences

Abstract

In this work we investigate methods of gene regulation that do not use transcription factors. We first produce a design for a synthetic RNA oscillator circuit, then work towards computationally producing antiterminators. As a first step we computationally generate a large number of Rho-independent transcription terminator sequences. We achieve this by first constructing a model of termination efficiency for simple terminators, which outperforms previous models, and then optimising over this model to generate terminator sequences. Our promising results demonstrate the feasibility of designing transcription terminators computationally.

1 Introduction

Synthetic biology and the engineering of living cells hold great potential in medicine and engineering. In creating a cell with a desired behaviour, DNA sequences with particular functionality are designed and constructed. The philosophy of synthetic biology is to take an engineering approach to this task, where DNA circuitry is built from pre-existing, well characterised parts. Complex circuitry requires genes to interact and regulate each other, which is normally achieved in nature through proteins called transcription factors. Since transcription factors are difficult to design de novo, there is interest in re-purposing RNA to achieve gene regulation [1]. Here we investigate regulation using RNA, and develop software for computationally generating large numbers of transcription terminator sequences, which may be used in a method of RNA regulation known as antitermination [2].

Figure 1: Cartoon mechanics of simple termination. The transcribed mRNA forms a hairpin inside the transcribing RNAp, causing dissociation. The U:A RNA:DNA binding at the U-tract stalls transcription, giving the hairpin time to form [3]. Important parts of the transcribed terminator are the A-tract, U-tract, stem and loop.

We define a gene to be a sequence of nucleotides which, through the action of various machinery in the cell, produces a particular protein: a molecule which may bind strongly and specifically to another molecule, and may catalyse specific reactions. Proteins perform diverse tasks in the cell, and are produced through the process of transcription followed by translation. In transcription, RNA polymerase (RNAp) attaches to a region near the start of a gene and runs along the DNA, ‘unzipping’ it, reading it and producing an RNA copy of one of the DNA strands. This molecule, mRNA, is used in translation to produce proteins, but that process is less important to this work. Importantly, transcription stops only when the RNAp dissociates from the DNA, and it is (normally) nucleotide sequences known as transcription terminators which cause this. Thus, terminators mark the end of the region which codes for a protein. Terminator efficiency is a measure of the fraction of RNAp that the terminator causes to dissociate.

Here we focus only on the class of terminators known as Rho-independent, which function without a protein called the Rho factor, and we look primarily at terminators in bacteria. We will be referring to Rho-independent terminators unless otherwise specified. Such terminators function by having a sequence that causes the mRNA being transcribed to form a hairpin structure (Figure 1) inside the RNAp, causing the RNAp to dissociate. A terminator typically results in a transcribed section of mRNA composed of an A-tract, a hairpin (made from a stem and loop) and a U-tract.
The U-tract is a uracil-dense region downstream of the hairpin; since U:A is the lowest energy (weakest) RNA:DNA base pair [4], transcription pauses at the U-tract, giving time for the hairpin to form [3]. Whilst the RNAp is paused at the U-tract, the hairpin forms and causes the RNAp to dissociate [3], but the exact details of this are not yet known. The A-tract typically has a high adenine content; its role may be to extend the hairpin stem and help ratchet the U-tract from the DNA [5]. By surrounding them with other genetic components, terminators may be converted into antiterminators (terminators which may be enabled and disabled using an RNA), which may be used as a method of RNA-only regulation [2].

In the current state of synthetic biology, we are able to design synthetic circuitry by combining known components which we may obtain from the literature or from a catalogue such as The Registry (footnote 1). This circuitry may then be incorporated into an organism such as E. coli via construction of a plasmid (a ring of DNA). Plasmid sequences are subject to certain constraints arising from the methods used in synthesis, such as containing a suitable proportion of GC pairs and not containing long repeated regions.

In this work we describe a method of computationally designing terminators, an important component normally required at the end of each gene. We show the feasibility of computationally designing large numbers of terminators subject to constraints, and provide a software package for doing this. We make use of an existing technology, CRISPR, to regulate circuitry via RNA, and combine this technology with antitermination to give the design for an improved oscillator [6].

In this paper we will be discussing nucleotide positions in DNA; each strand of DNA is directional, with a 5’ (pronounced “five prime”) and a 3’ end, and transcription can only occur 5’ → 3’.
Paired strands run in opposite directions, so when talking about a particular gene, the direction of transcription in that gene will be referred to as ‘downstream’, and the opposite direction ‘upstream’. In transcription, the strand whose sequence matches the mRNA (with Ts replaced by Us) is known as the ‘sense’ strand, and the complementary strand the ‘antisense’ strand (Figure 1). The abbreviations bp and nt will be used for base pairs and nucleotides respectively.

Footnote 1: http://parts.igem.org/Main_Page

2 Background

Designing a nucleotide sequence for an efficient terminator computationally requires a good model of terminator efficiency as a function of sequence. Ermolaeva et al. [7] describe the TransTerm algorithm for terminator discovery in bacterial genomes, which uses hairpins followed by U-tracts as indicators of terminator sequences. TransTerm assigns energy to hairpins through a linear combination of the number of G-C, A-U and G-U base pairs in the stem, as well as mismatches and gaps in the stem and the number of nucleotides in the loop, where the coefficients are scores relating to each of these features. U-tracts are rated using the scoring function proposed by d’Aubenton Carafa et al. [8]:

\[
\text{score} = -\sum_{n=1}^{15} x_n, \qquad
x_n =
\begin{cases}
0.9\, x_{n-1} & \text{if the $n$th nucleotide is a U} \\
0.6\, x_{n-1} & \text{otherwise}
\end{cases}
\qquad x_0 = 1
\]

The TransTermHP algorithm [9] improves upon TransTerm. Terminators are discovered by looking first for length-six windows containing at least three thymines (which are transcribed into uracils), and the 15 nucleotides downstream of the start of this window are scored with the same heuristic from d’Aubenton Carafa et al. [8]. The sequence up to 59 nt upstream of this U-tract is given a hairpin score using an efficient algorithm. These scores are then used to predict the quality of the sequence found as a terminator, which is done using the likelihood of a structure with greater or equal hairpin and U-tract scores arising by chance, taking into account the GC content of the genome being searched.
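The U-tract heuristic and the initial window search described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the TransTerm or TransTermHP implementation: the function names `utract_score` and `t_rich_windows` are hypothetical, and the hairpin scoring and significance calculation are not shown.

```python
def utract_score(tail: str) -> float:
    """Score up to 15 nucleotides as score = -sum(x_n), with x_0 = 1.

    Assumption: T (DNA) and U (RNA) are treated as the same base.
    """
    x = 1.0
    total = 0.0
    for base in tail[:15].upper():
        # A U multiplies the running weight by 0.9, any other base by 0.6.
        x *= 0.9 if base in "UT" else 0.6
        total += x
    return -total


def t_rich_windows(dna: str, width: int = 6, min_t: int = 3) -> list[int]:
    """Start index of every length-`width` window with at least `min_t` Ts."""
    dna = dna.upper()
    return [
        i
        for i in range(len(dna) - width + 1)
        if dna[i : i + width].count("T") >= min_t
    ]
```

Because each term is subtracted, a U-rich stretch gives a more negative score than a U-poor one: `utract_score("U" * 15)` is lower than `utract_score("A" * 15)`.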
Cambray et al. [10] develop a method for the measurement of terminator efficiencies, and characterise 61 E. coli terminators. They construct a model as a linear combination of sequence features mined from the literature, and fit this model to their own data. Prediction was found to be improved when terminators having low efficiency and terminators with variable folding dynamics were excluded from the training set. In a similar work, Chen et al. [5] measure the strengths of over 300 terminators (most are unique, but some are measured in both the forward and reverse directions). Chen et al. [5] then build a biophysical model of termination efficiency based on free energies for loop closure, hairpin extension, hairpin base binding and RNA:DNA binding with the U-tract.

3 RNA oscillator

To gain familiarity with the process of designing synthetic circuitry, a circuit was constructed to be implemented in a plasmid and inserted into E. coli. An oscillating circuit using only RNA was chosen as the circuit to construct. The significance of using RNA and no transcription factors is that RNA folding is far easier to deal with computationally than protein folding, which makes computational circuit design more feasible, so RNA circuitry is likely to be important in the future. An oscillator was chosen because oscillations require nonlinearity in the dynamics, so a functional circuit would demonstrate that this nonlinearity is possible using only RNA.

3.1 A method of regulation using only RNA

Gene regulation using only RNA was facilitated by the CRISPR technology [11,12]. Two different designs inspired by Bikard et al. [11] and Qi et al. [12] were composed in an attempt to minimise repeated sequences, as these introduce difficulties in synthesis.
CRISPR (clustered regularly interspaced short palindromic repeat) loci encode a prokaryotic immune system against foreign DNA, often introduced by phages, and are made up of short "spacer" sequences which match some target DNA sequence, separated by short repetitive sequences. This entire repeat-spacer array is transcribed into a precursor CRISPR RNA, which is cleaved into CRISPR RNAs (crRNAs), each targeting a particular nucleotide sequence. These crRNAs then work with Cas (CRISPR-associated) nucleases to bind and cleave the target sequence. The only restriction on target sequences is that the target (known as the protospacer) must have the adjacent 3’ sequence motif NGG, where N means any nucleotide, known as the Protospacer Adjacent Motif (PAM). The length of the sgRNA base pairing region should ideally be 20 nt, with lengths differing from this negatively impacting repression [12], meaning coincidental off-target matches are extremely rare.

The Streptococcus pyogenes CRISPR system is the simplest, requiring the Cas9 nuclease, crRNA and another RNA (tracrRNA) which is involved in the interaction between crRNA and Cas9. RNaseIII is also required in the binding and processing of Cas9 and sgRNA, but this is already present in E. coli.

Figure 2: A repressilator; gene A represses B, B represses C and C represses A using transcription factors, which results in oscillations. The circuit described here works in the same way using only RNA.

Bikard et al. [11] take the Cas9 nuclease from S. pyogenes and introduce mutations to prevent cleavage, giving a protein which they call dCas9. Using a crRNA to target dCas9 to a promoter region represses transcription, likely through blocking the RNAp, and so enables a form of regulation. Qi et al. [12] simplify the system by introducing a gene coding for a chimera of crRNA and tracrRNA which they name Small Guide RNA (sgRNA), reducing the entire system to two genes: the deactivated Cas9 and an sgRNA.
This reduced system is named CRISPR interference (CRISPRi).

3.2 Design of circuit

We decided on an oscillator design analogous to the repressilator: a system of 3 genes A, B and C, where A represses B, B represses C and C represses A (Figure 2). We used the CRISPRi architecture [12], meaning that we required the three genes A, B and C, as well as a gene coding for dCas9. The sequences for these are given in Appendix A. Genes A, B and C use different promoter regions, two of which are constitutive (always ‘on’) and the other arabinose inducible (only ‘on’ when arabinose is present). The inducible promoter is necessary for experimental validation, as we only want to observe oscillations when arabinose is present. Following each promoter is the sequence coding for the sgRNA: 20 nt complementary to the region to which the sgRNA will bind, followed by 82 nt which code for the machinery used in the interaction between the sgRNA and dCas9, as well as the terminator for the gene [12]. Repression is highly efficient when the sgRNA binds to the -35 or -10 box of the promoter (regions which are important in the binding of the RNAp) [11,12], so these were the regions targeted. The architecture of the binding of the sgRNA to the promoters is given graphically in Figure 3, where it can be seen that every sgRNA obscures either a -35 or -10 box. Note that the gene A sgRNA binds to the antisense promoter strand of gene B; this was necessary due to no appropriate PAMs existing on the sense strand, and this has no effect on the level of repression [12].
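The PAM constraint that drives the choice of target sites can be illustrated with a short search for 20 nt protospacers followed by NGG. This is a sketch under assumptions rather than the procedure actually used to design the circuit: the helper names (`find_protospacers`, `reverse_complement`, `guide_rna`) are hypothetical, only the strand passed in is searched, and overlap with the -35 or -10 box is not checked.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(dna: str) -> str:
    """Opposite strand, also read 5'->3' (use it to search the other strand)."""
    return "".join(COMPLEMENT[base] for base in reversed(dna.upper()))


def find_protospacers(strand: str, length: int = 20) -> list[tuple[int, str, str]]:
    """Return (start, protospacer, PAM) for every site followed by an NGG PAM.

    `strand` is read 5'->3'; the PAM is the three nucleotides immediately
    3' of the protospacer.
    """
    strand = strand.upper()
    hits = []
    for start in range(len(strand) - length - 2):
        pam = strand[start + length : start + length + 3]
        if pam[1:] == "GG":
            hits.append((start, strand[start : start + length], pam))
    return hits


def guide_rna(protospacer: str) -> str:
    """Base-pairing region of the guide: the protospacer written as RNA."""
    return protospacer.upper().replace("T", "U")
```

Applied to the sense strand of the J23119 promoter shown in Figure 3, `find_protospacers("TTGACAGCTAGCTCAGTCCTAGGTATAATGCTAGC")` returns a single site starting at position 0 with PAM `AGG`, and `guide_rna` of that protospacer gives the base-pairing region drawn for the gene B sgRNA.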
Gene A promoter (plLac01)
                       #-35##       PAM       #-10##
5’-ATAAATGTGAGCGGATAACATTGACATTGTGAGCGGATAACAAGATACTGAGCAC-3’ sense
   |||||||||||||                    ||||||||||||||||||||||
3’-TATTTACACTCGCCTATTGTAACTGTAACACTCGCCTATTGTTCTATGACTCGTG-5’ antisense
                ||||||||||||||||||||
             5’-GAUAACAUUGACAUUGUGAG<dCas9 binding + term>-3’ Gene C sgRNA

Gene B promoter (plTet012)
                      #-35## PAM             #-10##
3’-TGAGATAGTAACTATCTCAAACTGTAGGGATAGTCACTATCTCTATGACTCGT-5’ antisense
   |||||||||||||||||||||||||||||                    ||||
5’-ACTCTATCATTGATAGAGTTTGACATCCCTATCAGTGATAGAGATACTGAGCA-3’ sense
                                ||||||||||||||||||||
       3’-<dCas9 binding + term>AUAGUCACUAUCUCUAUGAC-5’ Gene A sgRNA

Gene C promoter (J23119)
   #-35##              PAM#-10##
5’-TTGACAGCTAGCTCAGTCCTAGGTATAATGCTAGC-3’ sense
                       |||||||||||||||
3’-AACTGTCGATCGAGTCAGGATCCATATTACGATCG-5’ antisense
   ||||||||||||||||||||
5’-UUGACAGCUAGCUCAGUCCU<dCas9 binding + term>-3’ Gene B sgRNA

Figure 3: Blueprint of sgRNAs binding, and thus repressing, their target promoters. It can be seen that all sgRNAs bind adjacent and 3’ to a PAM, and obscure either a -10 or -35 region of the promoter, thus maximally repressing the target gene [12]. The sgRNA transcript of gene A binds to the sense strand of gene B instead of the antisense strand as there was no conveniently located PAM on the sense strand. This does not significantly affect the level of repression [11].

4 Design of software

To facilitate the computational design of terminators, we first constructed a model of terminator efficiency with respect to sequence. This model assigns a score, which we call T_S, to a sequence, with higher scores indicating higher predicted terminator strengths. We used optimisation procedures to find a sequence maximising T_S, thus producing terminators which are predicted to be efficient.

4.1 Models of termination efficiency

Four models of terminator efficiency of increasing complexity were constructed. For these models we used the same technique as Cambray et al.
[10] of using a linear combination of nonlinear basis functions f_i on the sequence supplied (Equation 1), parameterised by coefficients β_i. This allowed for sufficient nonlinearity in the model while retaining linearity in the parameters, thus allowing the optimum parameters (in the least-squares sense) to be calculated efficiently (in closed form) when fitting the model to data [13].

\[
T_S(\text{sequence}) = \sum_{i=0} \beta_i f_i(\text{sequence}) + P(\text{sequence}) \tag{1}
\]

In every model, f_0 = 1 is used so that β_0 f_0 supplies a constant offset. The P function penalises sequences containing stop codons; P looks for the reading frame containing the fewest stop codons, and applies a constant penalty multiplied by the number of stop codons in that frame. The terms used in each model and their corresponding coefficients are given in Appendix B. Terms largely correspond to free energies, which are calculated using RNAeval and RNAfold from the ViennaRNA package [14].

4.1.1 Model 1

The simplest model was defined using 5 terms (including the constant offset term f_0 = 1):

f_1 = ∆G_A, f_2 = ∆G_B, f_3 = ∆G_L, f_4 = ∆G_U

where ∆G_A, ∆G_B, ∆G_L and ∆G_U are free energies which have been reimplemented from Chen et al. [5]. These could not be reimplemented exactly, but still represent sensible measures on which to base a model of termination efficiency.

∆G_A is the free energy of the extended hairpin, and is calculated via ∆G_A = ∆G_HA − ∆G_H. ∆G_HA is the free energy of the folded RNA beginning 8 nucleotides upstream of the start of the hairpin and ending 8 nucleotides downstream of the end of the hairpin, as predicted by RNAeval. ∆G_H is the free energy of hairpin folding: the free energy reported by RNAeval for the hairpin sequence, using the structure given by RNAfold for the entire terminator restricted to just the hairpin region.
∆G_B is the free energy of the binding of the hairpin base, which is defined as the 3 dinucleotide pairs in the hairpin furthest from the loop. This is calculated by identifying these nucleotide pairs and treating each strand as an individual RNA fragment, using RNAeval to calculate their free energy.

The free energy of the hairpin loop is included in the model through ∆G_L. ∆G_L is calculated using RNAeval to calculate the free energy of the terminator sequence with the structure where only the nucleotides at the top of the stem (just below the loop) are bound. The binding of this base pair takes up most of the time of the binding of the stem, as once these nucleotides have paired the rest of the stem rapidly “zips” up [5].

∆G_U is the free energy of binding between the U-tract and the DNA, and is calculated using:

\[
\Delta G_U = \Delta G^{0}_{\mathrm{RNA:DNA}} + \sum_{i=1}^{8} \Delta G_{\mathrm{RNA:DNA}}(n_i, n_{i+1})
\]

where a U-tract length of 8 is assumed, \Delta G^{0}_{\mathrm{RNA:DNA}} is the initiation energy of RNA:DNA hybridization and \Delta G_{\mathrm{RNA:DNA}}(n_i, n_{i+1}) is the free energy of the binding of the RNA:DNA nucleotide pair at positions i and i + 1 [4].

Figure 4: Sequence subsets and structures used in free energy calculations, with panels (a) ∆G_HA, (b) ∆G_H, (c) ∆G_B and (d) ∆G_L. ∆G_HA is calculated by looking at the sequence from 8 bp upstream of the hairpin to 8 bp downstream, and calculating the free energy of the entire structure with remaining bases unpaired. ∆G_H is calculated similarly, looking only at the hairpin sequence. ∆G_B looks at the three nucleotide pairs at the base of the hairpin and calculates the free energy as if this was a standalone structure. ∆G_L is calculated by folding the entire sequence into a structure where only the top nucleotide pair in the stem is bound. Free energy calculations were performed with ViennaRNA [14].

4.1.2 Model 2

The second model extended the first by including interactions between the free energy terms. Squared terms were omitted to keep the model complexity low.
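The way terms are assembled in Equation 1 can be made concrete with a small sketch. It expands a list of already-computed sequence functions into the constant term, the individual terms and all pairwise products without squares, and adds a stop-codon penalty. The function names and the penalty value of -10 per stop codon are illustrative assumptions, as is the choice to scan only the three forward reading frames; the fitted coefficients themselves are in Appendix B and are not reproduced here.

```python
from itertools import combinations

STOP_CODONS = {"UAA", "UAG", "UGA"}


def expand_terms(features: list[float]) -> list[float]:
    """Constant term, each feature, then every pairwise product (no squares)."""
    pairs = [a * b for a, b in combinations(features, 2)]
    return [1.0] + list(features) + pairs


def fewest_stop_codons(rna: str) -> int:
    """Stop-codon count in whichever of the three forward frames has fewest."""
    counts = []
    for frame in range(3):
        codons = [rna[i : i + 3] for i in range(frame, len(rna) - 2, 3)]
        counts.append(sum(codon in STOP_CODONS for codon in codons))
    return min(counts)


def terminator_score(
    features: list[float],
    betas: list[float],
    rna: str,
    penalty_per_stop: float = -10.0,  # hypothetical constant, not from the paper
) -> float:
    """T_S = sum_i beta_i * f_i + P, with P proportional to the stop-codon count."""
    terms = expand_terms(features)
    if len(terms) != len(betas):
        raise ValueError("need one coefficient per term")
    linear_part = sum(b * t for b, t in zip(betas, terms))
    return linear_part + penalty_per_stop * fewest_stop_codons(rna)
```

With four free energies this expansion gives 1 + 4 + 6 = 11 terms, and with six or seven sequence functions it gives 22 and 29 terms, matching the term counts listed for models 2 to 4 in Table 1.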
4.1.3 Models 3 and 4

Model 3 expanded upon the second by introducing two new sequence functions L_L and L_S, which correspond to the lengths of the loop and the hairpin stem respectively. Again, model 3 included all pairwise interactions between sequence functions as well as each sequence function alone plus f_0 = 1, giving a total of 22 terms. Model 4 extended this again by adding a new free energy ∆G_S, the free energy of the binding of the stem as calculated by RNAeval, giving a total of 29 terms.

4.2 Identification of terminator regions

All four models rely on the identification of the A-tract, U-tract, hairpin and loop, which are identified using a reimplementation of the heuristic described by Chen et al. [5] in supplementary note 8. Given a sequence, RNAfold is used to predict the folded mRNA structure. We then search the entire structure except the first and last 8 nucleotides for hairpins. Structure is represented in dot-bracket notation, where a dot indicates an unpaired nucleotide and a matched pair of brackets indicates a base pair. As an example, ...(((((.....)))))... shows a structure with a hairpin. Using this representation, we define hairpins as all top level pairs of matched brackets; i.e., matched brackets which are not inside another pair of matched brackets. If more than one hairpin is found, we use the one with the greatest folded free energy, as calculated by RNAeval. Given the location of this hairpin, we search for the U-tract by starting at the 6th nucleotide in the right arm of the stem and looking at each 8 bp sequence beginning with a U until 8 nt downstream from the hairpin. The U-tract is the 8 bp region with the highest ∆G_U. If candidate U-tracts with identical ∆G_U are found, the most 5’ one is chosen. If there are no 8 bp sequences beginning with a U, the 8 bp sequence immediately 3’ of the stem is chosen. The A-tract is chosen as the 8 bp directly 5’ of the hairpin.
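The definition of a hairpin as a top level pair of matched brackets translates directly into a single pass over the dot-bracket string with a depth counter. The sketch below is an illustration only: the name `top_level_hairpins` is hypothetical, and it assumes that excluding the first and last 8 nucleotides means keeping only hairpins that lie entirely inside that interior region. Folding itself (RNAfold) and the choice between several hairpins (RNAeval) are not shown.

```python
def top_level_hairpins(structure: str, flank: int = 8) -> list[tuple[int, int]]:
    """Return (open_index, close_index) for each top level bracket pair.

    A pair is top level when it is not enclosed by any other pair. Pairs that
    touch the first or last `flank` positions are discarded (an assumption
    about how the excluded flanks are handled).
    """
    depth = 0
    start = 0
    hairpins = []
    for i, symbol in enumerate(structure):
        if symbol == "(":
            if depth == 0:
                start = i  # opening bracket of a new top level pair
            depth += 1
        elif symbol == ")":
            depth -= 1
            if depth < 0:
                raise ValueError("unbalanced structure: unmatched ')'")
            if depth == 0:
                hairpins.append((start, i))
    if depth != 0:
        raise ValueError("unbalanced structure: unmatched '('")
    first, last = flank, len(structure) - flank
    return [(s, e) for s, e in hairpins if s >= first and e < last]
```

For the example structure in the text, `top_level_hairpins("...(((((.....)))))...", flank=0)` returns `[(3, 17)]`; with the default flank of 8 the same 21 nt structure yields no hairpin, which is the case the model handles by assigning a constant low score.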
The loop is the largest loop structure in the structure of the hairpin, where we choose the most 5’ loop if multiple are present.

Identification of these regions relies on being able to find a hairpin in the structure. This will not be the case for all sequences, and in such instances the model will be unable to assign a score to the sequence. When this occurs, we assign a constant score which is below that of all sequences with correctly identified components, but high enough that an optimisation routine may still explore regions of failed component identification.

4.3 Algorithms used

We used three algorithms to look for a sequence maximising the score assigned to it by the model described above.

4.3.1 Random Mutation Hill-climber

Our first optimisation technique was Random Mutation Hill-climbing (RMHC). We kept a population of fixed-length sequences, and repeatedly performed single point mutations, re-evaluated the score of the sequence, and kept the mutated sequence if the score had improved. This technique can get trapped in local optima, unable to progress.

4.3.2 Simulated Annealing

Simulated Annealing (SA) was implemented to deal with ruggedness in the sequence score landscape, as it is less prone to getting stuck at local optima. Sequence scores were treated as energies, and RMHC was performed, but instead of accepting only beneficial mutations the acceptance probabilities were changed to:

\[
P(\text{accept mutation}) =
\begin{cases}
1 & E_{S_1} \le E_{S_2} \\
e^{-(E_{S_1} - E_{S_2})/T} & E_{S_1} > E_{S_2}
\end{cases}
\]

where E_{S_1} and E_{S_2} are the energies (scores) of the sequence before and after mutation respectively, and T is temperature in units of Boltzmann’s constant. T begins at some specified value and reduces linearly towards 0 over the course of the optimisation. Unless specified, we used T = 2 for the initial value in optimisation.

4.3.3 Genetic Algorithm

A Genetic Algorithm (GA) was used to discover if there was any structure in the optimisation problem of the kind that may be taken advantage of by a GA.
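The acceptance rule and cooling schedule of Section 4.3.2 can be written down compactly. The following is a minimal sketch rather than the software's actual code: the function names are hypothetical, and returning 0 for a worsening move at T = 0 is an assumption that takes the limit of the exponential instead of dividing by zero.

```python
import math
import random


def acceptance_probability(score_before: float, score_after: float, temperature: float) -> float:
    """Probability of keeping a mutation when scores are being maximised."""
    if score_before <= score_after:
        return 1.0  # the mutation did not lower the score
    if temperature <= 0.0:
        return 0.0  # limit T -> 0: behaves like a pure hill-climber
    return math.exp(-(score_before - score_after) / temperature)


def linear_temperature(step: int, total_steps: int, initial: float = 2.0) -> float:
    """Temperature falling linearly from `initial` at step 0 to 0 at the end."""
    return initial * (1.0 - step / total_steps)


def accept(score_before: float, score_after: float, temperature: float, rng=random) -> bool:
    """Draw a uniform number in [0, 1) and compare it with the acceptance probability."""
    return rng.random() < acceptance_probability(score_before, score_after, temperature)
```

A drop of 2 in score at T = 2 is accepted with probability e^(-1), roughly 0.37, whereas the same drop at T = 0 is always rejected, consistent with SA reducing to hill-climbing at zero temperature.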
We used a GA with nonoverlapping generations; that is, a ‘generation’ constitutes generating N offspring from a population of size N, and replacing the entire population with its offspring. For selection of parents we used tournament selection: select two individuals from the population at random, and choose the individual with the higher fitness as a parent, then repeat to get a second parent. Once two parents have been selected, we used two-point crossover to generate two children, one complementary to the other with regard to which loci were inherited from which parent. Once offspring were produced, we applied a mutation operator which mutates every locus to a random nucleotide with probability 1/L, where L is the length of the individual.

5 Results

5.1 Model

5.1.1 Fitting to data

In fitting the model to data, we combined the datasets of Cambray et al. [10] and Chen et al. [5], giving a total of 636 terminator sequences with experimentally measured termination efficiencies. From Chen et al. [5] we only used the natural and synthetic terminator datasets, and omitted the set of removed terminators, as these had unusual expression patterns and so likely had unusual mechanisms of termination, which we are not interested in producing. Since there are a number of mechanisms of termination more complex than simple hairpin formation, we restricted the dataset to only those 473 terminators with a length less than 50. We hypothesise that such small terminators do not have ‘room’ for complex mechanisms, and we only need to be able to model a single mechanism for successful terminator design.

To fit the model to the data, we needed to choose the set of coefficients minimising the difference between the predicted and measured efficiencies. Since our model is linear in its parameters, we were able to compute the coefficients minimising the squared error efficiently. We tested model prediction on unseen data through bootstrapping.
That is, we performed 100 iterations of randomly selecting two thirds of the combined dataset, fitting the model to this training set and testing the model against the remaining data. Measures of predictive performance are given in Table 1.

Table 1

Model                 Terms   Correlation coefficient   Mean squared error
                              µ̂         σ̂              µ̂        σ̂
1                     5       0.492     0.071           1288     378
2                     11      0.531     0.106           1171     296
3                     22      0.530     0.094           1264     298
4                     29      0.532     0.073           1343     348
Chen et al. [5]       –       0.450     –               1000     –
Cambray et al. [10]   5       0.620     –               498      –

As shown in Table 1, we achieved better prediction on average on the combined restricted datasets than Chen et al. [5] achieved on their natural and synthetic datasets. Testing our model on only the natural and synthetic datasets from Chen et al. [5], we achieve a correlation coefficient of 0.502, which is still an improvement. Refitting the model to all terminators of length below 50 and correlating the actual and predicted termination scores, again only against terminators below length 50, whilst restricting only to those terminators to which our model is able to assign a score, we achieve a correlation coefficient of 0.668 (Figure 5). Including terminators we are unable to predict (those where a hairpin cannot be identified) and assigning them a score of 0 gives a correlation coefficient of 0.583.

Figure 5: Predicted vs actual (experimentally characterised) terminator efficiency (axes: measured score against predicted score; correlation coefficient 0.668). Note that the logarithmic axes on the right hand plot mean negative predicted scores have been removed.

Looking at the standard deviation of the correlation coefficients in Table 1, we see it is low compared to the mean value, indicating our model accuracy is not sensitive to the training data. Depending on our goal, the correlation coefficient or the mean squared error may be more relevant.
If we are interested in building a good predictive model of termination efficiency, then the mean squared error is of greater importance. If, however, we are only intending to optimise over this model, then we are more interested in the correlation coefficient, since a constant offset between predicted and actual efficiencies does not change the location of the maxima which the optimisation is searching for.

5.2 Optimisation

5.2.1 Algorithm comparison

Here we look at the solutions each algorithm converges to. Since the GA and SA have a nonzero probability of reaching any solution, including the global optimum, we say an algorithm has converged once its rate of improvement has become sufficiently slow, even though these algorithms would eventually find the global maximum if run for enough time (though this time may be infeasibly large). Running each algorithm 7 times for 10000 generations with a population size of 30 and sequence length of 50, we see different runs converging to different fitness values (Figure 6).

Figure 6: Highest score in the population against generation for 7 separate runs of each algorithm (Genetic Algorithm, Simulated Annealing, Random Mutation Hill-climbing) optimising over model 4. It can be seen that SA outperforms the other two algorithms. The poor performance of the GA is likely due to the rapid loss of diversity at the beginning of each run. Each run lasted 10000 generations using a population of 30 and individual size of 50. The SA was started with a temperature of 2.
We also see different algorithms converging to different scores on average. Pairwise t-tests of the hypothesis that two algorithms have the same mean fitness after 10000 generations give p-values of 4 × 10^-4 for GA/hill-climbing, 10^-5 for SA/GA and 2 × 10^-3 for hill-climbing/SA, thus all algorithms converge to statistically significantly different distributions of fitnesses. This tells us that the fitness landscape is rugged and contains many local maxima in which the optimisation may get stuck. Simulated annealing converges to the best distribution of fitnesses of the three algorithms. It is interesting that basic hill-climbing outperforms the genetic algorithm; investigating this shows a rapid loss of diversity near the beginning of the optimisation (Figure 7), meaning the GA is less able to explore the fitness landscape. This could be fixed by reducing the selective pressure in the algorithm by switching to a new method of selection [15], or by implementing diversity maintenance methods such as deterministic crowding [16]. Adding further terms to the model or constraints to the optimisation, such as minimising hydrophobicity or producing antiterminators, is likely to increase the nonlinearity in the fitness landscape and widen the gap between simple and complex optimisation routines.

Figure 7: Diversity of the population as each algorithm (GA, SA, RMHC) progresses. To measure diversity we use the mean distance between individuals in a population, where we define the distance between two individuals as the number of loci at which the individuals differ. The initial rapid loss of diversity in the GA can be seen, which is likely the reason behind the poor (sub-RMHC) performance.

5.2.2 Testing

Testing was performed by introducing a fitness function which is simply the count of the number of T’s in the string. The convergence of each algorithm to the string of all T’s is illustrated in Figure 8. The GA can be seen to converge fastest, which is due to the initially diverse population having high fitness sub-sequences crossed over. Simulated Annealing is the slowest algorithm to converge, and reduces to hill-climbing when the temperature is 0. Since the fitness landscape is smooth and contains no local optima, all algorithms converge to the global optimum solution.

One property of the model not tested in bootstrapping is whether poor scores are assigned to non-terminators. We tested this by generating 10000 random sequences of length 50 and investigating the scores assigned, using the rationale that the volume of sequence space taken up by valid terminators is so tiny that a random sequence is extremely unlikely to coincide with it. No random sequence generated was able to be assigned a hairpin, so the model was unable to assign any scores. This is a desirable behaviour; a failure to identify a hairpin may be treated as a low score.

5.3 Sequences produced

The sequences produced by the optimisation show some strange behaviour. Looking at the results of the optimisation described above (10000 generations with a population size of 30 and sequence length of 50) (Table 2), we see many sequences with clearly formed U-tracts, but the highest scoring sequences lack a U-tract and instead have a number of trailing Gs, and have a sequence of Cs instead of an A-tract. This becomes apparent when looking at only the highest scoring sequences produced over the 7 runs, as shown in Table 3. This is possibly a consequence of the model not extrapolating well into non-terminator regions of sequence space.
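The sanity check of Section 5.2.2 and the diversity measure of Figure 7 are both simple enough to sketch. The code below is an illustration under assumptions, not the software package itself: it climbs from a single random sequence rather than maintaining a population, and the names `count_t`, `hill_climb` and `mean_pairwise_distance` are hypothetical.

```python
import random
from itertools import combinations

NUCLEOTIDES = "ACGT"


def count_t(sequence: str) -> int:
    """Test fitness function: the number of Ts in the sequence."""
    return sequence.count("T")


def hill_climb(length: int, steps: int, fitness=count_t, seed: int = 0) -> str:
    """Random mutation hill-climbing from a single random starting sequence."""
    rng = random.Random(seed)
    current = "".join(rng.choice(NUCLEOTIDES) for _ in range(length))
    for _ in range(steps):
        pos = rng.randrange(length)
        mutant = current[:pos] + rng.choice(NUCLEOTIDES) + current[pos + 1 :]
        # Keep the point mutation only if the score strictly improves.
        if fitness(mutant) > fitness(current):
            current = mutant
    return current


def mean_pairwise_distance(population: list[str]) -> float:
    """Mean number of differing loci over all pairs (needs two or more members)."""
    pairs = list(combinations(population, 2))
    total = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs)
    return total / len(pairs)
```

On this smooth landscape the climber cannot get stuck, so given enough steps a call such as `hill_climb(20, 5000)` ends at the all-T string, and a population that has collapsed to identical individuals has a mean pairwise distance of 0.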
Inspecting the sequences produced by each algorithm, we see the GA has significantly less diversity in the results than the other two algorithms, which is likely the reason for its poor performance; preventing the premature population convergence would likely solve this. Looking at the best sequences produced, we can see all algorithms converge on a similar solution, but this solution is not a good terminator. These similar solutions differ at the hairpin, which is unsurprising; optimising the hairpin is hard since a single mutation can make or break a base pair, and has the possibility of changing the hairpin structure, introducing significant nonlinearity into the fitness landscape.

Figure 8: Fitness of the best individual in the population as the optimisation progresses, using the simple testing fitness function of counting the number of ‘T’s. Parameters used: population size=30, individual size=100. Data shown: Genetic Algorithm (GA), Random Mutation Hill-climber (RMHC), Simulated Annealing (SA). SA was started with temperatures of 2 and 0, and it can be seen that SA with temperature 0 is equivalent to RMHC.

Table 2: Representative sequences produced after running each algorithm with population 30 for 10000 generations.
Random Mutation Hill-climbing
Sequence                                            Score
CCCCACCCCGCGCUGCCGUACGGCGGGCAGCGCGGGGUGGGGGGCUAGUA  168.671
GCAGGCCGCGGUCGUGGGGUUCCACCCACGACCGCGGCCUGCCGCAUCCG  135.787
CUGUAAAAAUCUGUAUGGGAUACACGCUCCGGCUCUAUUUGUUUUUUUUU  185.949
AACAACUAGACUCUCUAAAGGGGUCACCCGCUAAUUGGUGUUUUUUUUUU  197.391
CGGCCCCCCCCAUACAAAGACCACAAACGCUCCGGCCAGGGGGGGGCCGG  228.335
CAGCGAGACUUACCCGUAGGUGGCUCCGGUCCGGGAGUUUUUUUUUUGAC  183.54
CCCCCCCCCCGCACGACCUCGGAAUGUAAACGGGGGGGGGGGUCCCACGU  226.792
CAGGAUUCAAAUAUUAAUGGCCUCGGUAAAUUAUAACAUUUUUUUUUUUG  191.404
UCUUGCAGGUAGACUCCGGUGUGUUUUUUUUUUUUGGCAGGGAUAACACG  176.264
UUACCUAUUUACCCGUAUGGAGCUCCGGCCUGUUUUUUUUUUUCGCAAUC  175.348

Simulated Annealing
Sequence                                            Score
CCCCCCCCCCCAGUCAAACUCCUCGGCCAUAGAAAAGGGGGGGGGGGCCU  262.51
CUCAACACAACAGGUGCUCCGGUGGUUAUAGUUUUUUUUUGAGGUGACGA  179.986
GGACUUAAAGUUGGGGCUCACGGAGACUAUCGUUUUAUCUUUUUUUUUAA  176.856
UCACGUACGUUAGGGUGGACACGCUGAACUAAUUUUUUUUUCUCGAGUGC  174.469
UGUCAAAAGUUUAAUCUAUUGCUCCGGCCAUGCUUUUUUUUUAUUAAGAU  181.603
GGUGUUGUGUCUACCACUUAUCAGCUAGUCCUCGGGGUGUUUUUUUUUUC  179.526
GUCAAGACUAACAUUCAUAUAUAAUUUCUCCGGAUAGGUGUUUUUUUUUU  186.932
AGAUCUAACUUAUAUUAGGAUACUGGAUGGUUUUUUUUUAGGUCACAUAC  175.051
CCCCCCCCCCCAACAACUAGGAAACUUGAAGGAGCGACGGGGGGGGGGGG  265.701
UAGUCAAGGAAUAUUAGUUAGCUCCGGGUAGUAAGAAUUUUUUUUUUUAU  184.134

Genetic Algorithm
Sequence                                            Score
AAACUCUUCGCAUAUAAUCUCCGGAUGAUAGUUUUUUUUUCCUACUUUCC  178.327
AAACUCUUAGCAUAUAAUCUCCGGAUGAUGGUUUUUUUUUCCUACUCUCC  177.732
AGGCACGCUACAUAAAAUCUACGGAUGAUAGUUUUUUUUCCAUACUUCCU  99.4276
CGCCACUUCGCAUAUAAUCUCCGGAUUUUGGUUUUUUUUUCCUACUUCCU  0
CGCCACGCUACAUAUAAUCUCCGGAUUUUGGUUUUUUUUUCCUACUGCCU  99.3639
AAACUCUUCGCAUAAAAUCUCCGGAUUUUAGUUUCUUUUUCCUACUUACU  149.729
AUGCACCUUACAUAGAAUCUCCGGAUUUUAGUUUUUUUUUAUAACUUCCU  175.195
AAACACUUCGCAUAUAAUCUCCGGAUGAUGGCUUUUCUUUCCUACUUUCC  33.6897
AAACUCUUAGCAUAUAAUCUCCGGAUGAUGGUUUUUUUUUCCUACUUUCC  177.732
AAACUCUUCGCAUAUAAUCUCCGGAUUUUAGUUUUUUUUUACAACUUUCC  178.219

Table 3: Top 10 most highly scored sequences from each algorithm from 8 runs each of length 10000 with population 30

Random Mutation Hill-climbing
Sequence                                            Score
CCCCCCCCCCGGUAUACAGCAGGGAACUUAAUACGAAGCGGGGGGGGGGG  264.615
CCCCCCCCCCGAUAUACAGAAGGGAACUGAAUAAGAAGCGGGGGGGGGGG  262.879
CCCCCCCCCCGAUGGAAAGAAGGGAACUGAAUAAGAAGCGGGGGGGGGGG  262.879
CCCCCCCCCCGAUAUACAGAAGGGAACUUAAUACGAAGCGGGGGGGGGGG  261.142
CCCCCCCCCCGAUAUACAGAAGGGAACUUAAUACGAAGCGGGGGGGGGGG  261.142
CCCCCCCCCCGAUAUACAGAAGGGAACUUAAUACGAAGCGGGGGGGGGGG  261.142
CCCCCCCCCCGAUAAACAGAGGGGGACUAAAUACGGAGCGGGGGGGGGGG  257.1
CCUCCCCCCCGGUAUACAGCAGGGAACUAAAUAAGAAGCGGGGGGGGGGG  248.74
CCCCCCCCCCGCUAUACAGCAGGGAACUGAAUAAGGAGCGGGGGGGGGGG  245.947
CGCCCCCCCCGGUAGACAGAAGGGAACUGAAAACGAAGCGGGGGGGGGGG  238.304

Simulated Annealing
Sequence                                            Score
CCCCCCCCCCACAAAGCGGCACUCCGGAGUCUACAGCAGGGGGGGGGGGG  274.048
CCCCCCCCCUCACAUAGGUUAAACCGCCUCCGGGCAUUGAGGGGGGGGGG  267.084
CCCCCCCCCCGGGGACUAUUAUUGGAUACGAGCGCACAGGGGGGGGGGGG  267.003
CCCCCCCCCCAGGGGGUAGAAUGACGGGAACGAUGUGCAGGGGGGGGGGG  267.003
CCCCCCCCCGAGCGCACAGGGGGACCAAUAAAAAGCGGAAGGGGGGGGGG  266.352
CCCCCCCCCCCACUAAAGCUCCGGCACAUACGACAGGGGGGGGGGGGGAU  258.099
GGCCCCCCGUGAUUCAUAUUAGGCGCCUCGGGGCGAGCACGGGGGGCCCC  233.565
CGGCCCCCGUAACGGUAAUUAGCGUCGGUCUCCCGCAACGGGGGCCGGAA  232.309
CCCCCCCCGGAUAACAACCUCCGGGAAUGCGCACCGGGGGGGGGUCAUGU  231.687
CCCCCCCCGAGCUAUUAACCGGGGAACUGCCAGCGCGGGGGGGGGCAUAU  230.723

Genetic Algorithm
Sequence                                            Score
CCCCCCCCUGUGAGCGUCAACUCCGGUAACCCGAACAAGCAGGGGGGGGG  310.603
CCCCCCCCCGAGGAAACCUAGCUCCGGCCGCCCACCGCAAGGGGGGGGGG  301.754
CCCCCCCCCGAGCUCUUGUCUCCGGAAUUUAACCGCCACCGGGGGGGGGG  296.534
CCCCCCCCCGAGACGACGCCCUCCGGGAGACACAGCCGACGGGGGGGGGG  291.795
CCCCCCCCCGGAAAGCUAGUGUACACCUCGGUCUAACCCCGGGGGGGGGG  279.071
CCCCCCCCCCGCUACUUGACUCUCCGGACAACACGCCAAGGGGGGGGGGG  275.904
CCCCCCCCCCGUCACUGUUACUCCGGAUGCCAGACCCAAGGGGGGGGGGG  275.672
CCCCCCCCCCAUAAUGUAUACCUCCGGGCCAGAGAACAGGGGGGGGGGGG  274.512
CCCCCCCCCCAACCACUUAAGCUCCGGCACAGACGAAAGGGGGGGGGGGG  274.28
CCCCCCCCCCCAUCAGGUACCUCGGUCACAACACAAGCGGGGGGGGGGGC  273.609

6 Discussion

In extending this work, one of the major first goals would be to improve the model of termination efficiency over which the optimisation is run.
If a general machine learning technique is used to learn a model from data, then training will also require negative training examples, i.e. sequences with low experimentally measured efficiencies. Extrapolation must also be considered: since the model is to be optimised over all of sequence space, we must be confident that the efficiency score does not have large optima in regions of sequence space far from those known to produce good terminators. For this reason it may be preferable to hand-build a model which takes advantage of the physics of the problem and to use data to parametrise this model, as we have done here.

One approach to improving the model may be to identify more candidate features through the literature and sensible physical assumptions, and to use feature selection techniques to identify an optimal feature subset for the linear model. The features under selection may include interactions between the ‘basic’ features (a single free energy calculation, the length of the hairpin loop, etc.). A simple way to perform the selection would be to automate a hypothesis test for each feature, checking whether its coefficient is statistically significantly different from 0, and to remove the feature if it is not. We may then refit the model to the data and repeat. Along with producing a more accurate model, this process may also provide insight into which kinds of interactions are important in the mechanics of termination. An alternative approach may be to perform feature extraction on the actual sequences in the dataset, but this may require more data, since the data used for feature extraction should not be used for subsequent training.

It may be useful to minimise the hydrophobicity of the mRNA produced by the terminator. Hydrophobic molecules in the cell have a tendency to form insoluble aggregates which are toxic to the cell, and it would be desirable to minimise this.
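The interaction features discussed earlier in this section can be enumerated mechanically from the basic features. The sketch below is illustrative only: the function names, feature names and numerical values are hypothetical and are not taken from the fitted models. It assumes that an interaction is the product of two basic features and that the score is a constant offset plus a weighted sum of terms, which is how the coefficient tables in Appendix B are laid out.

```python
from itertools import combinations


def with_interactions(basic: dict[str, float]) -> dict[str, float]:
    """Return the basic features plus every pairwise product of two of them."""
    terms = dict(basic)
    for a, b in combinations(basic, 2):
        terms[f"{a}*{b}"] = basic[a] * basic[b]
    return terms


def linear_score(terms: dict[str, float],
                 coefficients: dict[str, float],
                 offset: float = 0.0) -> float:
    """Offset plus the coefficient-weighted sum of terms.

    A term with no entry in `coefficients` contributes nothing, so removing a
    feature during selection amounts to deleting its coefficient.
    """
    return offset + sum(coefficients.get(name, 0.0) * value
                        for name, value in terms.items())


if __name__ == "__main__":
    # Made-up feature values for a single sequence (assumption, not real data).
    basic = {"dG_A": -2.0, "dG_B": -1.0, "dG_L": 3.0, "dG_U": 0.5}
    terms = with_interactions(basic)
    print(len(terms))  # 4 basic features + 6 pairwise products
    print(linear_score(terms, {"dG_A": 1.0, "dG_A*dG_B": 0.5}, offset=1.0))
```

With four basic features this gives ten non-constant terms, the same number as in Table 5.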
We may also produce bidirectional terminators by changing the objective function to be 0.5 times the score of the forward terminator plus 0.5 times the score of the reverse complement sequence.

Since we do not want to use a single terminator more than once in a synthetic circuit due to difficulties in synthesis, it may be desirable to optimise a family of terminators which are maximally different from each other. This comes naturally to a GA, where the addition of diversity maintenance, such as deterministic crowding (16), would produce a number of clusters of terminators. Clustering algorithms could be used to identify these clusters, and the best terminator could be taken from each.

There are constraints on the DNA sequences that can be synthesised. Personal contact with Integrated DNA Technologies (IDT, https://eu.idtdna.com/) indicates that sequences with any of the following properties cannot be synthesised:

1. Terminal repeat elements greater than 5bp
2. Hairpins greater than 15bp within 100bp of each other
3. Hairpins greater than 19bp
4. Poly A or T stretches over 11bp
5. Poly G or C stretches over 7bp
6. Terminal GC content below 30% or above 70%
7. GC content below 28% in a 100bp window
8. Total GC content below 25%
9. Total GC content over 70%
10. GC content over 77% in a 100bp window
11. GC content over 68% in a 600bp window
12. Individual repeats over 7bp which constitute 35% of a sequence
13. Total repeats over 7bp which constitute 70% of a sequence
14. Windowed repeats over 7bp which constitute 90% of a 70bp window

Such constraints would be simple to incorporate into the optimisation, either by assigning a score penalty to sequences which break these rules, or by adding a hard constraint that sequences breaking these rules cannot be generated. The former may be preferable, as it places fewer restrictions on how the fitness landscape can be explored.

A particularly interesting extension would be to move from design of terminators to antiterminators (2).
Designing RNA-controlled antiterminators computationally would allow the construction of regulatory circuits entirely in RNA, without being restricted by the difficulties associated with protein folding. As a demonstration of the power of antitermination regulation, an improved oscillator with greater stability which requires only two genes (6) may be implemented by combining antitermination and CRISPRi, as shown in Figure 9.

Figure 9: Schematic implementing the oscillator described by Stricker et al. (6) without use of transcription factors. The blue blocks are antiterminators, where ‘TT’ is a terminator which may be produced from the work here, and RAJ11 codes for the RNA able to disable termination (thus activating expression of genes A and B). ‘sgRNA’ is a small guide RNA (CRISPRi, Section 3.1) which represses the promoters of both A and B, disabling the genes. Once transcribed, mRNAs are cut at the CSY4 sites, freeing up the RAJ11 RNA and the RNA transcribed from the site coloured in red, which codes for Green Fluorescent Protein, used for experimental verification of oscillations.

7 Conclusion

In this work we have investigated RNA-based regulation, and tackled the computational design of transcription terminators with the goal of moving towards computational design of antiterminators. Computational design of synthetic components is important because, with increasing circuit complexity, we want to work at higher levels of abstraction and not worry about circuits at the nucleotide level. We may also have constraints and preferences that are too numerous to consider when designing components by hand. To enable computational terminator design, we first constructed a model parametrised on the datasets of Chen et al. (5) and Cambray et al. (10), which takes a sequence and assigns it a score predicting its efficiency as a terminator.
We then used optimisation techniques to optimise over this model, arriving at a set of sequences with high predicted terminator efficiency. The results we obtain are promising, though not convincing enough to move on to experimental classification of the sequences produced. Our work demonstrates that, with an improved model, it is feasible to computationally produce transcription terminators subject to a potentially large number of constraints. Constraints of particular interest to the circuit designer will likely include production of a large family of maximally diverse terminators, producing bidirectional terminators, minimising terminator length, respecting constraints for synthesis, and minimising toxic hydrophobic gene products, among others. By extending to the production of antiterminator sequences, we help pave the way towards computational design of complex circuits of many interacting genes.

References

(1) Rodrigo, G.; Landrain, T. E.; Jaramillo, A. Proceedings of the National Academy of Sciences 2012, 109, 15271–15276.
(2) Liu, C. C.; Qi, L.; Lucks, J. B.; Segall-Shapiro, T. H.; Wang, D.; Mutalik, V. K.; Arkin, A. P. Nature Methods 2012, 9, 1088–1094.
(3) Gusarov, I.; Nudler, E. Molecular Cell 1999, 3, 495–504.
(4) Sugimoto, N.; Nakano, S.-I.; Katoh, M.; Matsumura, A.; Nakamuta, H.; Ohmichi, T.; Yoneyama, M.; Sasaki, M. Biochemistry 1995, 34, 11211–11216.
(5) Chen, Y.-J. J.; Liu, P.; Nielsen, A. A.; Brophy, J. A.; Clancy, K.; Peterson, T.; Voigt, C. A. Nature Methods 2013, 10, 659–664.
(6) Stricker, J.; Cookson, S.; Bennett, M. R.; Mather, W. H.; Tsimring, L. S.; Hasty, J. Nature 2008, 456, 516–519.
(7) Ermolaeva, M. D.; Khalak, H. G.; White, O.; Smith, H. O.; Salzberg, S. L. Journal of Molecular Biology 2000, 301, 27–33.
(8) d’Aubenton Carafa, Y.; Brody, E.; Thermes, C. Journal of Molecular Biology 1990, 216, 835–858.
(9) Kingsford, C.; Ayanbule, K.; Salzberg, S. Genome Biology 2007, 8, R22.
(10) Cambray, G.; Guimaraes, J. C.; Mutalik, V. K.; Lam, C.; Mai, Q.-A.; Thimmaiah, T.; Carothers, J. M.; Arkin, A. P.; Endy, D. Nucleic Acids Research 2013, 41, 5139–48.
(11) Bikard, D.; Jiang, W.; Samai, P.; Hochschild, A.; Zhang, F.; Marraffini, L. A. Nucleic Acids Research 2013, 41, 7429–7437.
(12) Qi, L.; Larson, M.; Gilbert, L.; Doudna, J.; Weissman, J.; Arkin, A.; Lim, W. Cell 2013, 152, 1173–1183.
(13) Barber, D. Bayesian Reasoning and Machine Learning; Cambridge University Press, 2012.
(14) Lorenz, R.; Bernhart, S. H. F.; zu Siederdissen, C. H.; Tafer, H.; Flamm, C.; Stadler, P. F.; Hofacker, I. L. Algorithms for Molecular Biology 2011, 6, 26.
(15) De Jong, K. A. Evolutionary computation - a unified approach; MIT Press, 2006; pp I–IX, 1–256.
(16) Mahfoud, S. W. Crowding and Preselection Revisited. Parallel Problem Solving From Nature. 1992; pp 27–36.

Appendix A Sequences

The sequences for the genes used for the RNA oscillator are given below:

Gene A
1 ATAAATGTGA GCGGATAACA TTGACATTGT GAGCGGATAA CAAGATACTG
51 AGCACCAGTA TCTCTATCAC TGATAGTTTT AGAGCTAGAA ATAGCAAGTT
101 AAAATAAGGC TAGTCCGTTA TCAACTTGAA AAAGTGGCAC CGAGTCGGTG
151 CTTTTTT
Promoter (plLac01); sgRNA base pairing region - pairs with plTet012; sgRNA dCas9 handle + terminator

Gene B
1 ACTCTATCAT TGATAGAGTT TGACATCCCT ATCAGTGATA GAGATACTGA
51 GCATTGACAG CTAGCTCAGT CCTGTTTTAG AGCTAGAAAT AGCAAGTTAA
101 AATAAGGCTA GTCCGTTATC AACTTGAAAA AGTGGCACCG AGTCGGTGCT
151 TTTTT
Promoter (plTet012); sgRNA base pairing region - pairs with J23119; sgRNA dCas9 handle + terminator

Gene C
1 TTGACAGCTA GCTCAGTCCT AGGTATAATG CTAGCGATAA
51 GTGAGGTTTT AGAGCTAGAA ATAGCAAGTT AAAATAAGGC
101 TCAACTTGAA AAAGTGGCAC CGAGTCGGTG CTTTTTT
Promoter (J23119); sgRNA base pairing region - pairs with plLac01; sgRNA dCas9 handle + terminator

dCas9 gene:
1 ACATTGATTA TTTGCACGGC GTCACACTTT GCTATGCCAT 51 TCCATAAGAT TAGCGGATCC TACCTGACGC TTTTTATCGC 101 TGTTTCTCCA TACCGTTTTT TTGGGCTAGC TCTAGAGAAA 151 ACTAGATGAT GGATAAGAAA TACTCAATAG GCTTAGCTAT 201 AGCGTCGGAT GGGCGGTGAT CACTGATGAA TATAAGGTTC 251
GTTCAAGGTT CTGGGAAATA CAGACCGCCA CAGTATCAAA 301 TAGGGGCTCT TTTATTTGAC AGTGGAGAGA CAGCGGAAGC 351 AAACGGACAG CTCGTAGAAG GTATACACGT CGGAAGAATC 401 TCTACAGGAG ATTTTTTCAA ATGAGATGGC GAAAGTAGAT 451 TTCATCGACT TGAAGAGTCT TTTTTGGTGG AAGAAGACAA 501 CGTCATCCTA TTTTTGGAAA TATAGTAGAT GAAGTTGCTT 551 ATATCCAACT ATCTATCATC TGCGAAAAAA ATTGGTAGAT 21 CATTGACATT TAGTCCGTTA AGCTTTTTTA AACTCTCTAC GAGGGGACAA CGGCACAAAT CGTCTAAAAA AAAAATCTTA GACTCGTCTC GTATTTGTTA GATAGTTTCT GAAGCATGAA ATCATGAGAA TCTACTGATA 601 651 701 751 801 851 901 951 1001 1051 1101 1151 1201 1251 1301 1351 1401 1451 1501 1551 1601 1651 1701 1751 1801 1851 1901 1951 2001 2051 2101 2151 2201 2251 2301 2351 2401 2451 2501 2551 2601 2651 AAGCGGATTT CGTGGTCATT GGACAAACTA AAAACCCTAT CGATTGAGTA TGAGAAGAAA TGACCCCTAA CAGCTTTCAA AATTGGAGAT ATGCTATTTT GCTCCCCTAT CTTGACTCTT AAGAAATCTT GGGGGAGCTA AAAAATGGAT TGCTGCGCAA CACTTGGGTG ATTTTTAAAA TTCCTTATTA ATGACTCGGA TGTCGATAAA TTGATAAAAA TATGAGTATT TGAAGGAATG TTGTTGATTT AAAGAAGATT AGGAGTTGAA TAAAAATTAT ATCTTAGAGG GATTGAGGAA TGAAACAGCT AAATTGATTA TTTTTTGAAA ATGATGATAG GGACAAGGCG TGCTATTAAA TCAAAGTAAT CGTGAAAATC GAAACGAATC AGCATCCTGT TATCTCCAAA TCGTTTAAGT GCGCTTAATC TTTTGATTGA TTTATCCAGT TAACGCAAGT AATCAAGACG AATGGCTTAT TTTTAAATCA AAGATACTTA CAATATGCTG ACTTTCAGAT CAGCTTCAAT TTAAAAGCTT TTTTGATCAA GCCAAGAAGA GGTACTGAGG GCAACGGACC AGCTGCATGC GACAATCGTG TGTTGGTCCA AGTCTGAAGA GGTGCTTCAG TCTTCCAAAT TTACGGTTTA CGAAAACCAG ACTCTTCAAA ATTTCAAAAA GATAGATTTA TAAAGATAAA ATATTGTTTT AGACTTAAAA TAAACGTCGC ATGGTATTAG TCAGATGGTT TTTGACATTT ATAGTTTACA AAAGGTATTT GGGGCGGCAT AGACAACTCA GAAGAAGGTA TGAAAATACT ATGGAAGAGA GATTATGATG TATTTGGCCT GGGAGATTTA TGGTACAAAC GGAGTAGATG ATTAGAAAAT TTGGGAATCT AATTTTGATT CGATGATGAT ATTTGTTTTT ATCCTAAGAG GATTAAACGC TAGTTCGACA TCAAAAAACG ATTTTATAAA AATTATTGGT TTTGACAACG TATTTTGAGA AGAAGATTGA TTGGCGCGTG AACAATTACC CTCAATCATT GAAAAAGTAC TAACGAATTG CATTTCTTTC ACAAATCGAA AATAGAATGT ATGCTTCATT GATTTTTTGG AACATTGACC CATATGCTCA CGTTATACTG GGATAAGCAA TTGCCAATCG AAAGAAGACA TGAACATATT 
TACAGACTGT AAGCCAGAAA AAAGGGCCAG TCAAAGAATT CAATTGCAAA CATGTATGTG TCGATGCCAT 22 TAGCGCATAT AATCCTGATA CTACAATCAA CTAAAGCGAT CTCATTGCTC CATTGCTTTG TGGCAGAAGA TTAGATAATT GGCAGCTAAG TAAATACTGA TACGATGAAC ACAACTTCCA GATATGCAGG TTTATCAAAC GAAACTAAAT GCTCTATTCC AGACAAGAAG AAAAATCTTG GCAATAGTCG CCATGGAATT TATTGAACGC TACCAAAACA ACAAAGGTCA AGGTGAACAG AAGTAACCGT TTTGATAGTG AGGTACCTAC ATAATGAAGA TTATTTGAAG CCTCTTTGAT GTTGGGGACG TCTGGCAAAA CAATTTTATG TTCAAAAAGC GCAAATTTAG AAAAGTTGTT ATATCGTTAT AAAAATTCGC AGGAAGTCAG ATGAAAAGCT GACCAAGAAT TGTTCCACAA GATTAAGTTT ATAGTGATGT TTATTTGAAG TCTTTCTGCA AGCTCCCCGG TCATTGGGTT TGCTAAATTA TATTGGCGCA AATTTATCAG AATAACTAAG ATCATCAAGA GAAAAGTATA TTATATTGAT CAATTTTAGA CGTGAAGATT CCATCAAATT ACTTTTATCC ACTTTTCGAA TTTTGCATGG TTGAAGAAGT ATGACAAACT TAGTTTGCTT AATATGTTAC AAGAAAGCCA TAAGCAATTA TTGAAATTTC CATGATTTGC AAATGAAGAT ATAGGGAGAT GATAAGGTGA TTTGTCTCGA CAATATTAGA CAGCTGATCC ACAAGTGTCT CTGGTAGCCC GATGAATTGG TGAAATGGCA GAGAGCGTAT ATTCTTAAAG CTATCTCTAT TAGATATTAA AGTTTCCTTA 2701 AAGACGATTC AATAGACAAT AAGGTCTTAA 2751 GGTAAATCGG ATAACGTTCC AAGTGAAGAA 2801 CTATTGGAGA CAACTTCTAA ACGCCAAGTT 2851 ATAATTTAAC GAAAGCTGAA CGTGGAGGTT 2901 GGTTTTATCA AACGCCAATT GGTTGAAACT 2951 GGCACAAATT TTGGATAGTC GCATGAATAC 3001 AACTTATTCG AGAGGTTAAA GTGATTACCT 3051 GACTTCCGAA AAGATTTCCA ATTCTATAAA 3101 CCATCATGCC CATGATGCGT ATCTAAATGC 3151 TTAAGAAATA TCCAAAACTT GAATCGGAGT 3201 GTTTATGATG TTCGTAAAAT GATTGCTAAG 3251 AGCAACCGCA AAATATTTCT TTTACTCTAA 3301 CAGAAATTAC ACTTGCAAAT GGAGAGATTC 3351 ACTAATGGGG AAACTGGAGA AATTGTCTGG 3401 CACAGTGCGC AAAGTATTGT CCATGCCCCA 3451 CAGAAGTACA GACAGGCGGA TTCTCCAAGG 3501 AATTCGGACA AGCTTATTGC TCGTAAAAAA 3551 TGGTGGTTTT GATAGTCCAA CGGTAGCTTA 3601 AGGTGGAAAA AGGGAAATCG AAGAAGTTAA 3651 GGGATCACAA TTATGGAAAG AAGTTCCTTT 3701 TTTAGAAGCT AAAGGATATA AGGAAGTTAA 3751 TACCTAAATA TAGTCTTTTT GAGTTAGAAA 3801 GCTAGTGCCG GAGAATTACA AAAAGGAAAT 3851 ATATGTGAAT TTTTTATATT TAGCTAGTCA 3901 GTCCAGAAGA TAACGAACAA AAACAATTGT 3951 TATTTAGATG AGATTATTGA GCAAATCAGT 4001 
TTTAGCAGAT GCCAATTTAG ATAAAGTTCT 4051 GAGACAAACC AATACGTGAA CAAGCAGAAA 4101 TTGACGAATC TTGGAGCTCC CGCTGCTTTT 4151 TGATCGTAAA CGATATACGT CTACAAAAGA 4201 TCCATCAATC CATCACTGGT CTTTATGAAA 4251 CTAGGAGGTG ACTGAGTCGA CCCAGGCATC 4301 TCGAAAGACT GGGCCTTTCG TTTTATCTGT 4351 CTACTAGAGT CACACTGGCT CACCTTCGGG Promoter (K206001 - pBad weak) RBS (J61100) Coding sequence (K1026001) Terminator (B0015) CGCGTTCTGA GTAGTCAAAA AATCACTCAA TGAGTGAACT CGCCAAATCA TAAATACGAT TAAAATCTAA GTACGTGAGA CGTCGTTGGA TTGTCTATGG TCTGAGCAAG TATCATGAAC GCAAACGCCC GATAAAGGGC AGTCAATATT AGTCAATTTT GACTGGGATC TTCAGTCCTA AATCCGTTAA GAAAAAAATC AAAAGACTTA ACGGTCGTAA GAGCTGGCTC TTATGAAAAG TTGTGGAGCA GAATTTTCTA TAGTGCATAT ATATTATTCA AAATATTTTG AGTTTTAGAT CACGCATTGA AAATAAAACG TGTTTGTCGG TGGGCCTTTC TAAAAATCGT AGATGAAAAA CGTAAGTTTG TGATAAAGCT CTAAGCATGT GAAAATGATA ATTAGTTTCT TTAACAATTA ACTGCTTTGA TGATTATAAA AAATAGGCAA TTCTTCAAAA TCTAATCGAA GAGATTTTGC GTCAAGAAAA ACCAAAAAGA CAAAAAAATA GTGGTTGCTA AGAGTTACTA CGATTGACTT ATCATTAAAC ACGGATGCTG TGCCAAGCAA TTGAAGGGTA GCATAAGCAT AGCGTGTTAT AACAAACATA TTTATTTACG ATACAACAAT GCCACTCTTA TTTGAGTCAG AAAGGCTCAG TGAACGCTCT TGCGTTTATA

Appendix B Model coefficients

Table 4: Features and coefficients used for model 1
Term          Coefficient
Const offset  -4.28716658e+02
∆GA           -2.44926259e-01
∆GB           3.85406757e+01
∆GL           -5.03126344e+00
∆GU           3.89388333e+00

Table 5: Features and coefficients used for model 2
Term          Coefficient
Const offset  -1.17415969e+03
∆GA           2.17107148e+01
∆GB           9.82247253e+01
∆GL           8.59259503e+01
∆GU           -1.00567505e+02
∆GA ∆GB       -1.66646028e+00
∆GA ∆GL       2.72483217e-01
∆GA ∆GU       3.01939045e-01
∆GB ∆GL       -7.23665163e+00
∆GB ∆GU       8.62127506e+00
∆GL ∆GU       -6.80847451e-01

Table 6: Features and coefficients used for model 3
Term          Coefficient
Const offset  -1.54449902e+03
∆GA           1.94092517e+01
∆GB           1.27651933e+02
∆GL           7.06966080e+01
∆GU           -1.05605531e+02
LL            4.49623637e+01
LS            1.05131338e+01
∆GA ∆GB       -1.24756363e+00
∆GA ∆GL       5.96358313e-01
∆GA ∆GU       4.47671856e-01
∆GA LL        1.69585495e-01
∆GA LS        -3.21389945e-01
∆GB ∆GL       -5.05766396e+00
∆GB ∆GU       8.91592880e+00
∆GB LL        -4.11082632e+00
∆GB LS        -9.24211257e-01
∆GL ∆GU       -4.42576156e-01
∆GL LL        -5.19086074e-01
∆GL LS        -4.15503552e-01
∆GU LL        -1.47249066e-01
∆GU LS        5.93690124e-02
LL LS         4.69729903e-01

Table 7: Features and coefficients used for model 4
Term          Coefficient
Const offset  -1.51184410e+03
∆GA           1.66213509e+01
∆GB           1.21625099e+02
∆GL           5.88613541e+01
∆GU           -1.27621717e+02
LL            5.61511202e+01
LS            -1.96140980e+01
∆GS           -4.00973435e+01
∆GA ∆GB       -1.20492198e+00
∆GA ∆GL       3.52904807e-01
∆GA ∆GU       3.85190822e-01
∆GA LL        9.77057825e-03
∆GA LS        -1.34732917e-02
∆GA ∆GS       1.94543285e-01
∆GB ∆GL       -4.29635388e+00
∆GB ∆GU       1.06674915e+01
∆GB LL        -4.64374486e+00
∆GB LS        1.67025667e+00
∆GB ∆GS       3.09989733e+00
∆GL ∆GU       -1.28934700e-01
∆GL LL        3.38330963e-01
∆GL LS        -6.30864883e-01
∆GL ∆GS       -3.75788116e-01
∆GU LL        2.05456068e-02
∆GU LS        -3.65345779e-01
∆GU ∆GS       -4.59169225e-01
LL LS         -1.47700544e-01
LL ∆GS        -4.57114319e-01
LS ∆GS        9.21028673e-02

Appendix C Software usage and maintenance

C.1 How to Use

The software has been produced with a simple command-line interface, and may be extended with a graphical interface. To use the software, first call make in the directory containing the software, which will produce two executables: score.exe and termopt.exe. termopt.exe is the main executable and is described below; score.exe is supplied for convenience and takes a sequence and produces a terminator score. Since the domain of the terminator model does not cover the full space of sequences, score.exe will exit with an exit status of 1 if it is given a sequence for which the model is unable to make a prediction.

To run TermOpt, call ./termopt.exe <params>, where <params> is an unordered list of the parameters given in Table 8. All parameters are optional and default to the values given in the file named params. Modifying the contents of the params file is an alternative way of providing parameter values.
As an example, if we want to use simulated annealing starting from a temperature of 1.5 (in units of Boltzmann’s constant) on a population of 30 sequences of length 40, we would call:

./termopt.exe --temperature 1.5 --population-size 30 --individual-size 40

We may alternatively edit the params file appropriately and simply call ./termopt.exe. The format of the params file is an unordered sequence of <param>=<val> entries separated by newlines. Whitespace is ignored, as are all characters including and following a #, allowing for commenting. The names of parameters are the same as the long names that may be specified on the command line, except that there is no double-dash prefix and all dashes separating words are replaced by underscores.

Table 8: Optional parameters that may be passed to termopt. The long names may also be used in the params file to set defaults, where the -- prefix has been removed and all other -s replaced with _s.

Short name  Long name  Type  Description
-a  --algorithm  string  Select an algorithm to use. The argument should be one of evolve, anneal, hillclimb, which select a GA, simulated annealing or random mutation hill-climbing respectively.
-c  --stop-codon-penalty  float  Set the score penalty for containing stop codons. The total stop codon penalty applied is the parameter given multiplied by the number of stop codons in the reading frame containing the fewest stop codons. Since this penalty is added to the score, it will normally be negative.
-g  --generations  int  Set the number of generations to run the algorithm for.
-h  --history  bool  If true, after outputting sequences and their fitnesses, outputs a newline followed by the history of the best fitness at each generation.
-p  --population-size  int  Size of the population of sequences being optimised.
-i  --individual-size  int  Length of the sequences being optimised.
-u  --unfolded-energy  float  Set the fitness assigned to an individual which the model cannot predict. This can happen when the full sequence is unable to form a hairpin, or if the hairpin is unable to fold without the context of the rest of the sequence.
-v  --vienna-location  string  The location of the folder containing the ViennaRNA executables RNAfold and RNAeval. In the current version the path MUST contain a trailing slash.
-V  --verbose  bool  If true, prints more information to stdout.
-t  --top  int  Only print the top n individuals to stdout. If passed 0, will print all individuals.
-T  --temperature  float  Sets the starting temperature to use in simulated annealing, which decreases linearly towards 0 during the run.

C.2 How to Modify

C.2.1 Replacing the model

The best way to replace the model of termination efficiency which is being optimised is to replace the term_score function in the file linear_model.cpp. If the new model does not use the free energies, then the free_energies.cpp and free_energies.h files may be removed, as can the get_components() function in sequence.cpp.

C.2.2 Adding new parameters

To add a new parameter, first add a new variable with a default value to the Options struct in the argparse.h file. It may be preferable to follow the current paradigm of setting the default value using a function which looks in params for a value. To do this, specify a function which takes a std::string containing the value passed in, checks it for correctness and returns it as the correct type. Calling get_val() with a string argument will look in params for values corresponding to this argument, which can be used to pull values from params. To take a value from the command line, modify the parse_args function by adding a new branch to the else if chain which assigns the validated parameter value to the new Options variable. The existing code should provide a large number of examples to follow.
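To make the params file format described in C.1 concrete, the following is a minimal Python sketch of a parser for it. It is not the software's own implementation (which is written in C++ and whose get_val() internals are not shown here); the function names parse_params and long_option_to_param are hypothetical. It assumes that "whitespace is ignored" means all whitespace on a line is dropped, and that a line without an = is an error.

```python
def parse_params(text: str) -> dict[str, str]:
    """Parse <param>=<val> lines; '#' starts a comment, whitespace is dropped."""
    params: dict[str, str] = {}
    for line_number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0]      # discard the '#' and everything after it
        line = "".join(line.split())     # ignore all whitespace (assumption)
        if not line:
            continue                     # blank line or comment-only line
        name, sep, value = line.partition("=")
        if not sep or not name:
            raise ValueError(f"line {line_number}: expected <param>=<val>")
        params[name] = value             # values stay as strings; validate later
    return params


def long_option_to_param(option: str) -> str:
    """Map a long option such as --population-size to its params-file name."""
    return option.lstrip("-").replace("-", "_")


if __name__ == "__main__":
    example = """
    # defaults for a simulated annealing run (illustrative values)
    temperature = 1.5
    population_size = 30   # number of sequences
    individual_size = 40
    """
    print(parse_params(example))
    print(long_option_to_param("--individual-size"))
```

Keeping the values as strings mirrors the approach described in C.2.2, where each parameter has its own function that checks the string and converts it to the correct type.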