pubs.acs.org/accounts Article
Machine Learning for Designing Next-Generation mRNA Therapeutics
Published as part of the Accounts of Chemical Research special issue “mRNA Therapeutics”.
Sebastian M. Castillo-Hair and Georg Seelig*
Cite This: Acc. Chem. Res. 2022, 55, 24−34
CONSPECTUS: Over just the last 2 years, mRNA therapeutics and vaccines have undergone a rapid transition from an intriguing concept to real-world impact. However, whereas some aspects of mRNA therapeutics, such as the use of chemical modifications to increase stability and reduce immunogenicity, have been extensively optimized for over two decades, other aspects, particularly the selection and design of the noncoding leader and trailer sequences which control translation efficiency and stability, have received comparably less attention. In practice, such 5′ and 3′ untranslated regions (UTRs) are often borrowed from highly expressed human genes with few or no modifications, as in the case for the Pfizer/BioNTech Covid vaccine. Focusing on the 5′UTR, we here argue that model-driven design is a promising alternative that provides unprecedented control over 5′UTR function. We review recent work that combines synthetic biology with machine learning to build quantitative models that relate ribosome loading, and thus translation efficiency, to the 5′UTR sequence. We first introduce an experimental approach that uses polysome profiling and high-throughput sequencing to quantify ribosome loading for hundreds of thousands of 5′UTRs in parallel. We apply this approach to measure ribosome loading in synthetic RNA libraries with a random sequence inserted into the 5′UTR. We then review Optimus 5-Prime, a convolutional neural network model trained on the experimental data. 
We highlight that very accurate models of biological regulation can be learned from synthetic data sets with degenerate 5′UTRs. We validate model predictions not only on held-out data sets from our random library but also on a large library of over 30 000 human 5′UTR fragments and using translation reporter data collected independently by other groups. Both the experiment and model are compatible with commonly used chemically modified nucleosides, in particular, pseudouridine (Ψ) and 1-methyl-pseudouridine (m1Ψ). We find that, in general, 5′UTRs have very similar impacts when combined with different protein-coding sequences and even in the context of different chemical modifications. We demonstrate that Optimus 5-Prime can be combined with design algorithms to generate de novo sequences with precisely defined translation efficiencies. We emphasize recent developments in design algorithms that rely on activation maximization and generative modeling to improve both the fitness and diversity of designed sequences. Compared with prior approaches such as genetic algorithms, we show that these approaches are not only faster but also less likely to get stuck in local sequence optima. Finally, we discuss how the approach reviewed here can be generalized to other gene regions and applications.
■ KEY REFERENCES
• Linder, J.; Seelig, G. Fast Activation Maximization for Molecular Sequence Design. BMC Bioinformatics 2021, 22, 510.2 Fast SeqProp is a computational design algorithm based on activation maximization that can be combined
• Sample, P. J.; Wang, B.; Reid, D. W.; Presnyak, V.; McFadyen, I. J.; Morris, D. R.; Seelig, G. Human 5′ UTR Design and Variant Effect Prediction from a Massively Parallel Translation Assay. 
Nature Biotechnology 2019, 37, 803−809.1 A neural network model, Optimus 5-Prime, trained on data from a massively parallel translation assay accurately predicts how the 5′UTR sequence controls ribosome loading and, together with a genetic algorithm, enables the design of high-performing 5′UTR sequences for mRNA therapeutics.
© 2021 The Authors. Published by American Chemical Society. Received: October 7, 2021. Published: December 14, 2021. https://doi.org/10.1021/acs.accounts.1c00621 Acc. Chem. Res. 2022, 55, 24−34
Figure 1. Workflow combining high-throughput assays and machine learning for characterizing mRNA regulation and engineering UTR sequences with high performance.
involved in the regulation of translation and mRNA degradation; however, how their sequence affects these processes is not completely understood, and poor UTR design can negatively impact the expression of the therapeutic protein. Thus, most mRNA therapies to date take their UTRs from highly expressed human genes such as α- and β-globin.30 Recent studies, however, have shown that alternative UTRs can result in higher expression,31,32 suggesting that there is significant room for improvement. Moreover, when targeting different cell types or tissues, UTRs may need further tuning to account for the differential expression of regulators such as RNA-binding proteins (RBPs)33 and microRNAs (miRNAs).34 To further complicate matters, there is a complex interplay between the different sequence-dependent regulatory mechanisms that ultimately control protein expression. 
Recent studies have shown that 5′UTR elements that repress translation can also diminish mRNA stability,35 but 5′UTRs with very high translation efficiencies can also have a destabilizing effect and ultimately reduce expression.36 Clearly, quantitative models that take into account these effects and predict protein expression from sequence are crucial to unlocking the full potential of mRNA therapeutics. In this Account, we describe a framework that combines high-throughput assays and deep-learning techniques to develop predictive models, obtain biological insights, and engineer de novo sequences that optimize protein expression for mRNA therapeutics applications (Figure 1). The first component of this framework consists of massively parallel reporter assays (MPRAs) based on large synthetic gene or transcript libraries, whereby sequence variation is targeted to the region under study. Such an approach makes it possible to experimentally interrogate and quantify how a particular part of the mRNA (5′UTR, coding sequence (CDS), 3′UTR) contributes to protein production. In particular, here we
with sequence-function models such as Optimus 5-Prime to rapidly design functional and high-fitness sequences.
• Linder, J.; Bogard, N.; Rosenberg, A. B.; Seelig, G. A Generative Neural Network for Maximizing Fitness and Diversity of Synthetic DNA and Protein Sequences. Cell Systems 2020, 11, 49−62.e16.3 Deep Exploration Networks (DENs) are a class of generative sequence design models that can be used to design sequence libraries while simultaneously maximizing the performance of all sequences and minimizing similarity between them.
■ INTRODUCTION
We are witnessing the beginning of the mRNA therapeutics revolution. 
Indeed, this technology is now in the public spotlight thanks to its role in fighting the COVID-19 pandemic: It resulted in two of the most effective vaccines available,4,5 with more currently in clinical trials,6 and has continued to enable the rapid development of potential booster shots against variants of concern.7 But its scope is not limited to vaccines for infectious diseases, as mRNA therapeutics are being evaluated for clinical applications such as cancer immunotherapy,8−10 regenerative medicine,11−13 and protein replacement therapy,14,15 among others.16,17 This wave of mRNA therapeutics is the result of decades of work on many fronts, including the development of novel 5′-cap analogs18−21 and capping methods,22−25 chemically modified bases,26−28 and lipid nanoparticles,29 all of which improve mRNA stability, decrease immunogenicity, improve delivery efficiency, and, in general, result in the successful in vivo expression of the therapeutic protein. Nevertheless, there is still untapped potential in optimizing the primary sequence, either to further improve protein expression or to encode more complex pharmacokinetics. For example, 5′ and 3′ untranslated regions (UTRs) are heavily
Figure 2. Polysome profiling data from a random 5′UTR library captures known regulatory effects. (A) Schematic of polysome profiling experiment. (B) Influence of upstream start codon position along the 5′UTR. Plots show the average MRL of sequences containing a canonical (AUG) or noncanonical (CUG, GUG) start codon at the indicated position with respect to the primary AUG. (C) Influence of context around upstream start codon. Sequences containing an upstream AUG, GUG, or CUG between positions −21 and −8 were grouped by their surrounding context (strong, moderate, weak) and whether they occur in-frame or out-of-frame with the primary AUG. 
p values were calculated using two-sided t tests. (D) Relationship between the 5′UTR secondary structure and the measured ribosome load. 20 000 5′UTR sequences were grouped by their predicted minimum free energy, and the MRL distribution of each group was plotted.
includes a purine at −3 and a G at +4, promotes recognition.38,39 Finally, initiation factors dissociate, the 60S subunit is recruited, the full 80S ribosome is assembled, and peptide elongation begins.37 Translation of the primary open reading frame (ORF) can be heavily influenced by sequence elements within the 5′UTR. For example, upstream start codons (uAUGs) and upstream ORFs (uORFs) can have a repressive effect by capturing ribosomes that would otherwise initiate at the primary start codon.40 Secondary structure and RNA-binding proteins may block or interfere with ribosome scanning.41 Other elements, such as the 5′ terminal oligopyrimidine tract (5′TOP), regulate changes in translation in response to stress.42 Finally, internal ribosome entry sites (IRESs) allow translation initiation in a cap-independent manner.43 All of these cis-regulatory elements may interact with one another and influence translation in ways that remain challenging to predict. Machine-learning approaches can, in principle, be used to build models that predict translation efficiency from 5′UTR sequence, but such models require very large-scale and high-quality training data. A potential solution could be found in high-throughput translation data sets obtained from the human transcriptome. Most of these used Ribo-seq,44 a method wherein ribosome-bound transcripts are digested and the ribosome-protected mRNA fragments are sequenced. Ribo-seq provides mRNA translation efficiencies and even identifies the reading frame being translated, but it has difficulty distinguishing between transcript isoforms of the same gene due to a short ∼30 nt fragment length. 
TrIP-seq,45 wherein mRNAs are fractionated based on the number of elongating ribosomes before sequencing, can distinguish transcript isoforms and has
focus on recent work characterizing the influence of the 5′UTR sequence on translation efficiency. The second component consists of deep-learning methods capable of identifying complex regulatory relationships from the resulting data sets. Specifically, we show that a convolutional neural network (CNN) model trained on MPRA data accurately predicts ribosome loading from 5′UTR sequences. Finally, we discuss methods for engineering novel sequences that achieve specified performance levels or exceed the performance of endogenous sequences. We first report results from designing 5′UTRs using a genetic algorithm, an iterative discrete search method that “evolves” sequences in silico until the model prediction matches a prespecified target. We then describe more recent methods that are capable of rapidly designing large libraries of diverse sequences while avoiding many of the inefficiencies and pitfalls of genetic algorithms. We conclude with a short discussion on how this work can be extended to 3′UTR and CDS sequences and to other molecular phenomena such as mRNA degradation.
■ A MASSIVELY PARALLEL TRANSLATION ASSAY FOR CHARACTERIZING 5′UTRs
Translation of most mRNAs in eukaryotes starts with the assembly of the eIF4F complex at the 5′cap followed by recruitment of the 43S preinitiation complex containing the 40S ribosomal subunit, several eukaryotic initiation factors (eIFs), and the Met-tRNAiMet anticodon.37 Next, the assembled 43S complex scans the 5′UTR in the 5′ to 3′ direction until a start codon is recognized. Successful recognition depends on the start codon identity (the canonical AUG is more likely to be recognized than CUG or GUG) and the context around it. The “Kozak consensus sequence”, which
Figure 3. Polysome profiling data from the random 5′UTR eGFP library generalizes to different CDSs and mRNA chemistries. (A) MRL from a library of 3110 5′UTRs with an eGFP versus mCherry CDS. (B) Schematic of uridine compared with the modified nucleosides pseudouridine (Ψ) and 1-methyl-pseudouridine (m1Ψ). (C,D) MRL comparison for modified versus unmodified chemistries. The eGFP library was resynthesized using Ψ (C) or m1Ψ (D), and polysome profiling was performed as with the unmodified library. r2 values were calculated from 20 000 sequences with the highest read coverage. Plots show 3000 sequences randomly chosen from this subset.
splicing52 and even translation49 often used short (6−10 nt) random regions such that every possible sequence combination was covered. However, building on our own previous work on alternative splicing53 and translation regulation in yeast,51 we used a longer random sequence to allow for more diverse motif combinations. Whereas we cannot cover every possible 50-mer (there are >10^30), short regulatory sequences such as start and stop codons should appear frequently and in many different positions and combinations. mRNA was synthesized, capped, and polyadenylated using an in vitro transcription (IVT) system, which, compared with transfecting plasmid DNA, allowed us to remove confounding transcriptional and RNA processing effects. Synthesized mRNA was transfected into HEK293T cells and incubated for 12 h. Cells were then lysed in the presence of cycloheximide, an antibiotic that halts elongating ribosomes. The lysate was run through a sucrose gradient, and fractions containing mRNAs bound to distinct numbers of ribosomes (polysomes) were collected, barcoded, and sequenced. The resulting data set contains read counts for 280 000 5′UTR sequences. 
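The per-fraction read counts produced by this assay are condensed into one number per sequence, the mean ribosome load (MRL): a weighted sum over fractions of the ribosome number of each fraction. The following is a minimal sketch of that calculation, not the authors' pipeline. The function name `mean_ribosome_load`, the input layout, and in particular the normalization (read counts divided by the fraction total, then rescaled to sum to one across fractions) are assumptions made for illustration.

```python
from typing import Dict, List


def mean_ribosome_load(
    counts: Dict[str, List[int]],
    ribosomes_per_fraction: List[float],
) -> Dict[str, float]:
    """Return the MRL of each sequence from per-fraction read counts.

    counts[seq][i] is the read count of `seq` in polysome fraction i;
    ribosomes_per_fraction[i] is the ribosome number assigned to fraction i.
    """
    n_frac = len(ribosomes_per_fraction)
    # Total reads per fraction, used to correct for sequencing depth (assumption).
    totals = [sum(c[i] for c in counts.values()) for i in range(n_frac)]
    mrl = {}
    for seq, c in counts.items():
        # Relative abundance of this sequence within each fraction.
        rel = [c[i] / totals[i] if totals[i] else 0.0 for i in range(n_frac)]
        norm = sum(rel)
        if norm == 0:
            continue  # sequence never observed, so its MRL is undefined
        # Proportion in each fraction times ribosome number, summed over fractions.
        mrl[seq] = sum(r / norm * k for r, k in zip(rel, ribosomes_per_fraction))
    return mrl


# Toy example with three fractions holding 1, 2, and 3 ribosomes per mRNA.
toy_counts = {"utr_a": [90, 10, 0], "utr_b": [10, 40, 50]}
toy_mrl = mean_ribosome_load(toy_counts, ribosomes_per_fraction=[1, 2, 3])
print(toy_mrl)
```

In this toy input, `utr_b` is enriched in the heavier fractions and therefore receives the higher MRL.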
From here, we obtained the mean ribosome load (MRL) for each sequence by multiplying, for each fraction, the proportion of reads corresponding to a specific sequence times the number of associated ribosomes and summing these products. The MRL is thus a quantitative measure of translation efficiency. Our polysome profiling data set recapitulated previously known regulatory effects. 5′UTRs with upstream AUGs had, on average, lower MRL values when the AUG was out-of-frame with respect to the primary start codon (Figure 2B). This effect was also present with noncanonical upstream start codons (CUG, GUG) but to a significantly lesser extent (Figure 2B). Notably, the context around upstream start codons strongly influenced their repressive effect: Out-of-frame uAUGs were more repressive when surrounded by a purine (A,G) at position −3 and a guanine at +4, matching the Kozak consensus sequence, whereas uCUGs and uGUGs had statistically significant effects only within a similarly strong context (Figure 2C). Similar effects were observed for uORFs. These observations are consistent with stronger uAUGs and uORFs redirecting ribosomes that would otherwise initiate at the primary start codon. Additionally, sequences with lower predicted free energies had, on average, lower MRLs, consistent with stable secondary structures interfering with ribosome scanning (Figure 2D).
been successfully applied to studying the impact of alternative 5′ and 3′ UTRs.45,46 Still, endogenous transcript data may not be optimal for training predictive models for multiple reasons. First, endogenous transcripts contain highly variable UTR and CDS sequences, making it difficult to reliably isolate how a specific part of the mRNA, such as the 5′UTR, influences translation. Second, the size of an endogenous data set is fundamentally limited by the size of the human transcriptome. 
Because deep learning can take advantage of extremely large data sets to achieve exceptional performance,47 obtaining more examples than what the genome can provide is desirable. Finally, sequences with deleterious effects are likely to be underrepresented in endogenous data, potentially resulting in major model blind spots. An alternative approach is to use MPRAs, where large libraries of synthetic reporter sequences are assayed. Here variation is restricted to a particular sequence element, in the form of either fully degenerate or endogenous sequence fragments.48 In addition, the MPRA library size can be orders of magnitude larger than the number of genomic examples. Previous MPRAs for characterizing translation in human cells have used stable single-copy integration of DNA libraries followed by fluorescence-activated sorting and sequencing.49,50 Similarly, we have previously used DNA libraries combined with a growth selection assay to study translation in yeast.51 Measurements from DNA libraries are affected by 5′UTRs influencing transcription or RNA processing as well as translation, which, in some studies, has been compensated for by placing a fluorescent reporter downstream of an IRES in the same transcript.49,50 To characterize the influence of the 5′UTR on translation, we developed an MPRA in which we measure the ribosome loading of a random library of hundreds of thousands of mRNAs (Figure 2A). This assay uses polysome profiling followed by sequencing, as in TrIP-seq;45 however, we use synthetic mRNA libraries where sequence variation is targeted to the 5′UTR region. Specifically, our reporter design contains a constant enhanced green fluorescent protein (eGFP) CDS, a 3′UTR derived from bovine growth hormone (BGH), and a 5′UTR with an initial 25 nt-long fixed segment followed by a 50 nt fully degenerate region. 
Two out-of-frame stop codons present at the beginning of the eGFP CDS ensured that initiation at a randomly generated out-of-frame start codon would not result in extended translation. Prior MPRAs for
Figure 4. Optimus 5-Prime can predict ribosome loading and protein expression from a given 5′UTR sequence. (A) Optimus 5-Prime architecture. An input sequence represented as a 50 × 4 one-hot encoded vector (bottom) is fed into two convolutional layers (middle) followed by a fully dense layer to generate an MRL prediction (top). (B) Measured versus predicted MRLs on a held-out test set of 20 000 sequences. Red: 5′UTRs with no uAUGs. Blue: 5′UTRs with uAUGs. (C) Predicted MRL versus eGFP fluorescence for 10 mRNAs selected to have a wide range of MRL values. mRNAs were independently transfected into HEK293 cells and imaged using an IncuCyte S3 live-cell analysis system. The maximum fluorescence over a 20.5 h time window is shown.
Figure 5. Optimus 5-Prime predictions generalize across mRNA chemistries, cell lines, and endogenous 5′UTRs. (A) Coefficients of determination (r2) for model predictions when training and test data sets are taken from one of two replicates of the original eGFP data set without modification (U), with pseudouridine (Ψ), or with 1-methyl-pseudouridine (m1Ψ). (B,C) Optimus 5-Prime predictions compared with translation efficiency measurements for 77 5′UTRs designed and characterized in six different cell lines by Ferreira et al.56 mRNA reporters contained a GFP ORF preceded by a designed 5′ UTR and a red fluorescent protein (RFP) ORF preceded by an IRES to be used as a normalization control. 
Cell lines included human embryonic kidney cells (293T), mouse pre-B lymphocytes (PD31), human chronic myelogenous leukemia cells (K562), human colon cancer cells (HCT116), Chinese hamster ovary cells (CHO-K1), and mouse plasmacytoma (MPC11). 5′UTRs were used in this analysis only if their GFP ORFs started with ATGG. 5′UTRs shorter than 50 bp were zero-padded before being used with Optimus 5-Prime. (B) Direct comparison with measurements in PD31 cells. (C) Coefficients of determination (r2) of measurements versus MRL predictions in all cell lines. (D) Predicted versus observed MRL for wild-type and SNV-containing human 5′UTR sequences.
■ DEVELOPING PREDICTIVE MODELS OF RIBOSOME LOADING
Deep learning has been highly successful at various tasks in molecular biology54 due to at least two factors. First, the tiered nature of the molecular interactions involved in a particular process (sequence motifs recruit effector proteins, which form complexes with other proteins, which, in turn, interact with other complexes and sequence motifs) is efficiently captured by the layered architecture of a deep-learning network. Second, because deep learning is capable of capturing complex, nonlinear interactions, these models are uniquely suited to take advantage of extremely large data sets to obtain improved performance.47,55 To predict ribosome loading from 5′UTR sequence, we developed a convolutional neural network (CNN) model named Optimus 5-Prime. The model contains two convolutional layers with filters that identify short motifs from the input and one fully connected layer that ultimately computes the MRL prediction (Figure 4A). We first trained this model using 260 000 random 5′UTRs and associated MRLs from the polysome profiling data set. Optimus 5-Prime was able to
Associations between 5′UTR sequences and MRL measurements generalized to other coding sequences and nucleotide chemistries, both relevant to mRNA therapeutic applications. 
We tested the same ∼3000 5′UTRs together with either the eGFP or mCherry CDS, two fluorescent proteins of different origins and with widely differing sequences. We observed excellent correlation between MRLs measured in both contexts (r2 = 0.732, Figure 3A), suggesting that identical UTRs result in similar MRLs, even if the CDS context changes. Similarly, we resynthesized the original 280 000-member 5′UTR library but replaced uracil in the IVT reaction with the chemically modified nucleosides pseudouridine (Ψ) and 1-methyl-pseudouridine (m1Ψ) (Figure 3B). mRNAs incorporating these modifications avoid activation of intracellular pattern recognition receptors, which would result in an immune response that suppresses translation and promotes mRNA degradation. Consequently, these modifications are commonly used in mRNA therapeutics.6,11 MRLs were found to be highly correlated across chemistries (Figure 3C,D).
predict ribosome loading with remarkable precision when tested against 20 000 samples held out from training (r2 = 0.93, Figure 4B, compared with r2 = 0.64 for the best k-mer linear model with k ≤ 6). To further validate whether MRL predictions were indicative of output protein expression, we selected 10 sequences and performed individual eGFP fluorescence measurements, which were highly correlated with MRL predictions (Figure 4C). To be maximally useful for the design of mRNA therapeutics, Optimus 5-Prime needs to generalize to different coding sequences, chemical modifications, cell types, sequence “types” (e.g., human rather than random), or lengths. As detailed above, we found UTR sequences to be highly transferable between different CDS contexts and even chemical modifications (Figure 3). 
Accordingly, despite being trained on eGFP library data, Optimus 5-Prime predictions could explain 78 and 77% of the observed MRL variation in data from two replicates with ∼200 000 random 5′UTRs preceding an mCherry CDS (Figure 5A). Similarly, despite being trained only on unmodified mRNA data, Optimus 5-Prime could explain 69−73% of the observed MRL variation in the Ψ library and 68−76% in the m1Ψ library. Still, the model accuracy could be increased to 84−85, 77−82, and 72−81% by retraining directly on the mCherry, Ψ, and m1Ψ data sets, respectively (Figure 5B). Therefore, whereas a model trained on unmodified RNA data is reasonably accurate, training directly on modified RNA data will be ideal for predicting the impact of such modifications in mRNA therapeutics contexts. To test whether Optimus 5-Prime would perform well on sequences designed by others, we turned to translation measurements conducted with six different cell lines and 77 5′UTRs designed by Ferreira et al.56 and found that Optimus 5-Prime could explain 73−85% of the reported variation (Figure 5C). We also note that measurements reported for different cell types are very highly correlated, suggesting that the basic regulatory rules (e.g., strengths of Kozak, role of uORFs, etc.) remain similar between cell types. These observations also suggest that a model trained on data collected in a single cell type can generalize to other, more clinically relevant cell types. We also showed that Optimus 5-Prime generates accurate MRL predictions on human 5′UTRs despite being trained on random sequences only. We synthesized 35 212 5′UTRs extracted from the 50 nt long region immediately preceding the start codon in human transcripts and 3577 single nucleotide variant (SNV) sequences from ClinVar57 and assayed them as described above. MRL predictions were highly correlated with experimental observations (r2 = 0.82, Figure 5D). 
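Scoring an endogenous 5′UTR with a fixed-length model requires converting it into the same input format as the random library: the 50 nt immediately preceding the start codon, one-hot encoded as a 50 × 4 matrix, with shorter sequences zero-padded. The sketch below illustrates one way to do this. The function name `one_hot_utr`, the A/C/G/T column order, the handling of U as T, and the choice to pad on the left are assumptions for illustration rather than details of the published model code.

```python
from typing import List

BASES = "ACGT"  # column order is an assumption; U is treated as T


def one_hot_utr(utr: str, length: int = 50) -> List[List[int]]:
    """Encode the `length` nt closest to the start codon as a length x 4 matrix."""
    # Keep only the 3' end of the UTR, i.e., the region next to the start codon.
    seq = utr.upper().replace("U", "T")[-length:]
    # One row per nucleotide; unknown symbols such as N become all-zero rows.
    rows = [[int(base == b) for b in BASES] for base in seq]
    # Left-pad short inputs with all-zero rows so every input has the same shape.
    padding = [[0, 0, 0, 0] for _ in range(length - len(rows))]
    return padding + rows


encoded = one_hot_utr("GGGACAUCGUAGAGAGUCGUACUUA")
print(len(encoded), len(encoded[0]))
```

Padding on the left keeps the distance between each nucleotide and the start codon unchanged, which matters because the effect of elements such as uAUGs depends on their position and frame relative to the primary start codon.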
A limitation of the initial version of Optimus 5-Prime is its fixed 50 nt long input, as UTRs used for mRNA therapeutics can be longer, whereas human 5′UTRs range from tens to thousands of bases with a median length of 218 nt.41 Thus we constructed and characterized a new library with a degenerate 5′UTR region ranging from 25 to 100 bases. We then retrained Optimus 5-Prime using a longer 100-base input layer. Input sequences shorter than 100 bases were accommodated by left-padding their one-hot-encoded vector with zeros. On a test data set of held-out random and human 5′UTR sequences, we found MRL predictions to be highly correlated with measurements (r2 from 0.75 to 0.84). Recently, Gagneur and coworkers used our MPRA data to train a model based on convolutions staggered every three bases to further extend MRL predictions to arbitrary-length 5′UTR sequences.58 Finally, we use Optimus 5-Prime to score the translation efficiencies of 5′UTRs previously used in mRNA therapeutics. First, the commonly used α- and β-globin 5′UTRs6 have predicted MRLs of 6.1 and 6.6, respectively. Therefore, compared with the 25−100 nt library data set, these sequences can be placed in the 65th and 86th percentiles. The BioNTech/Pfizer BNT-162b2 COVID-19 vaccine4 uses a modified α-globin with a consensus Kozak and, according to our model, results in an MRL of 6.3 (76th percentile). Finally, the Moderna mRNA-1273 vaccine5 uses a synthetic 5′UTR, which our model predicts to have an MRL of 5.7 (52nd percentile). Whereas most of these UTRs result in higher-than-average ribosome loading and generally high expression, our results suggest that further optimization could be beneficial for strongly expressing proteins in therapeutic applications. 
In fact, in very recent work, Exposito and coworkers compared six 5′UTRs selected from our eGFP library because of their high measured MRLs to the β-globin 5′UTR and found that at least one of the six synthetic sequences (“UTR4”) resulted in higher GFP expression across three different cell types. Most notably, an 80% increase in fluorescence compared with the β-globin control was reported in primary human-monocyte-derived dendritic cells.59
■ DESIGNING SEQUENCES FOR ENHANCED AND SPECIFIC mRNA TRANSLATION
Methods to rationally design regulatory sequences with custom performance, such as 5′UTRs that achieve target translation efficiencies, have been a major focus of synthetic biology. Early genetic engineering relied on using regulatory sequences from endogenous sources, with the expectation that they would perform as well in their new synthetic context,60,61 an approach still largely used with UTRs for mRNA therapeutics.6,30 However, further work, in particular, related to promoters, demonstrated that engineered sequences could allow for the finer tuning of performance62,63 and even outperform native sequences64 while being more robust to context changes.65 Methods to design these sequences included building chimeras from native sequences,64 rationally inserting66 or deleting67 regulatory motifs, and screening libraries containing random mutations62,63 or permutations of sequence elements.68,69 An alternative approach is model-based design, where a predictive sequence-to-function model is used alongside a search algorithm to generate fully synthetic sequences with target performance. 
An early demonstration of this approach was the ribosome binding site (RBS) calculator, a software package that designs bacterial 5′UTRs for specified translation efficiencies.70 The RBS calculator was successful in part because bacterial translation initiation relies on binding of the 16S ribosomal RNA to a sequence element in the mRNA 5′UTR; therefore, a model based entirely on RNA hybridization thermodynamics was sufficiently accurate. However, detailed biophysical models may not be available for other processes. Machine-learning models, such as Optimus 5-Prime, that can be trained on large-scale example data sets even in the absence of a quantitative biophysical model provide a powerful alternative as “oracles” for sequence design. We first demonstrated the machine-learning-guided design of functional sequence elements in the context of yeast 5′UTR regulation. Specifically, we used a neural network model trained on 500 000 random UTRs together with random
Figure 6. (A) Sequence design using Optimus 5-Prime and a genetic algorithm. (B) Predicted and observed MRLs of 12 000 sequences designed for different target MRLs. (C,D) Model performance before (C) and after (D) retraining with a subset of the designed sequences evaluated on a held-out designed sequence test set.
Figure 7. Fast SeqProp, a gradient-based sequence design method with PWM sampling and per-base logit normalization, rapidly finds high-performing sequences. (A) Methods based on search heuristics introduce random changes to a candidate sequence. Performance needs to be evaluated for several candidates before finding an improvement. (B) Gradient-based methods move in the direction of increased performance, as measured by the gradient of the cost function. (C) Fast SeqProp optimizes logits via gradient descent. 
Logits are normalized across positions, and one-hot encoded sequences are sampled from PWMs before being presented to the pretrained predictive model. (D) Cost function over number of iterations with and without logit normalization and PWM sampling when optimizing for high MRLs using Optimus 5-Prime.
mutagenesis to computationally evolve high fitness sequences.51 This work gave a proof of principle for machine-learning-guided sequence design, but regulatory sequences optimized for gene expression in yeast are unlikely to be optimal for mRNA therapeutics applications. Given the high accuracy achieved by Optimus 5-Prime, we similarly evaluated its application in designing 5′UTRs with specified translation efficiencies.1 Our design approach was based on genetic algorithms,71 a discrete search heuristic previously used for designing bacterial 5′UTRs70 and RNAs with pseudoknotted structures.72 Starting with a set of random 50 bp 5′UTRs, each iteration consisted of random mutations and crossovers in silico, followed by scoring using Optimus 5-Prime and selection of the best sequences for the next round (Figure 6A). To test this approach, we designed 12 000 sequences to either achieve one of seven discrete MRL values between 3 and 9 or to maximize the MRL. These sequences were then synthesized and tested via polysome profiling. We found excellent agreement between the target and experimental MRLs when the target was 8 or lower. However, for larger target MRLs, the experimental measurements were lower than predicted (Figure 6B). A closer inspection of sequences in this stage revealed the appearance of long poly-U stretches not present in the training library. Therefore, the genetic algorithm was likely exploiting a blind spot in Optimus 5-Prime (a region in sequence space where predictions would have low quality) to further increase MRLs. 
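The genetic-algorithm loop described above (random starting sequences, mutation and crossover, scoring with a predictive model, selection of the best candidates) can be sketched in a few lines. This is an illustration under stated assumptions, not the published design code: `toy_score` is a made-up stand-in for a trained predictor such as Optimus 5-Prime, and the population size, mutation rate, number of generations, and elitist selection scheme are arbitrary choices.

```python
import random
from typing import Callable

BASES = "ACGT"


def toy_score(seq: str) -> float:
    """Hypothetical stand-in for a trained predictor; NOT a model of translation."""
    gc = sum(seq.count(b) for b in "GC") / len(seq)
    return 8.0 - 1.5 * seq.count("ATG") - 3.0 * gc


def mutate(seq: str, rate: float, rng: random.Random) -> str:
    """Replace each base with a random base with probability `rate`."""
    return "".join(rng.choice(BASES) if rng.random() < rate else b for b in seq)


def crossover(a: str, b: str, rng: random.Random) -> str:
    """Join a prefix of one parent to the matching suffix of the other."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]


def evolve(score: Callable[[str], float], target: float, length: int = 50,
           pop_size: int = 100, generations: int = 50, keep: int = 20,
           rate: float = 0.02, seed: int = 0) -> str:
    """Evolve a sequence whose predicted score is as close as possible to `target`."""
    rng = random.Random(seed)

    def fitness(s: str) -> float:
        return -abs(score(s) - target)

    pop = ["".join(rng.choice(BASES) for _ in range(length)) for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: keep the candidates whose prediction is closest to the target.
        parents = sorted(pop, key=fitness, reverse=True)[:keep]
        # Variation: recombine two random parents, then mutate the child.
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rate, rng)
            for _ in range(pop_size - keep)
        ]
        pop = parents + children  # parents survive unchanged (elitism)
    return max(pop, key=fitness)


best = evolve(toy_score, target=6.0)
print(best, round(toy_score(best), 2))
```

Because the search only ever sees model predictions, it can drift toward sequences that the model scores well but that lie outside its training distribution, which is the failure mode observed for the highest target MRLs.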
Notably, retraining the model using a subset of the designed sequences and their measured MRLs improved the prediction accuracy (Figure 6C,D). Recent work by Lu and coworkers also used a genetic algorithm to design 5′UTRs for DNA gene therapy applications, but their approach jointly optimized transcription and translation.73
■ IMPROVED ALGORITHMS FOR SEQUENCE DESIGN
Sequence design based on search heuristics such as genetic algorithms has several issues that we have tried to address in recent work. For example, design can be slow and inefficient: The mutation and crossover operations change a few nucleotides at a time and are not guaranteed to increase the performance at every step, resulting in small improvements despite multiple model evaluations (Figure 7A). A more efficient approach is activation maximization through gradient descent, where the gradient of the performance metric with respect to the model input is used to iteratively refine a candidate sequence.74 Each update moves in the direction of locally increasing performance, and fewer model evaluations are required (Figure 7B). However, gradients can only be taken with respect to continuous real-valued inputs, and thus some modifications are needed to design sequences consisting of discrete letters. Attempts at addressing this limitation include representing sequences via unstructured real-valued matrices75 and introducing a “softmax” layer that transforms unbounded real-valued inputs (“logits”) into position-weight matrices (PWMs) before feeding them to the model.74 However, these approaches may result in poor performance because models are trained on one-hot encoded data (i.e., unambiguous sequences), not real-valued inputs.
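In symbols, the softmax relaxation described above can be written as follows. The notation is introduced here for illustration and is not taken from the cited works: $l_{ij}$ is the logit for nucleotide $j$ at position $i$, $\sigma(l)$ is the resulting PWM, $f$ is the pretrained predictor, and $\eta$ is a step size.

```latex
% Softmax turns each row of logits into a probability distribution over the four nucleotides.
\sigma(l)_{ij} = \frac{e^{l_{ij}}}{\sum_{k=1}^{4} e^{l_{ik}}}
% Activation maximization then updates the logits along the gradient of the predicted performance.
\qquad
l \leftarrow l + \eta \, \nabla_{l}\, f\bigl(\sigma(l)\bigr)
```

Because each row of $\sigma(l)$ is a probability distribution rather than a one-hot vector, $f$ is evaluated on inputs unlike those it was trained on, which is the mismatch noted above.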
We previously demonstrated a hybrid continuous/discrete solution to this problem: At every iteration, model evaluations are made using one-hot encoded sequences sampled from the PWM, but gradients are evaluated with respect to the continuous input logits. We successfully used this approach to design sequences with custom alternative polyadenylation isoform ratios.55 In more recent work,2 we evaluated the effect of normalizing logits across all positions and introducing per-base scaling and bias factors (Figure 7C). These additions helped us avoid issues with vanishing gradients, a failure mode where gradients become too small to drive meaningful updates. As a result, our Fast SeqProp method converges rapidly in a variety of design tasks, including maximizing transcription factor binding, transcriptional activity, alternative polyadenylation, and translation using Optimus 5-Prime (Figure 7D).
Still, despite the speed improvement of Fast SeqProp, activation maximization methods share a few limitations. First, the algorithm needs to be run from scratch for every newly generated sequence. Furthermore, optimization might get stuck in local minima or converge to a region in the sequence space far from the training data set, where the model is not accurate. A related issue is the lack of an explicit mechanism to force generated sequences to be distinct, thus limiting the diversity of sequences available for experimental testing and reducing the likelihood of finding one with high performance.
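The hybrid continuous/discrete update described above (the model sees only sampled one-hot sequences, while the update is applied to continuous logits) can be sketched as below. This is a simplified illustration, not Fast SeqProp itself: the logit normalization and per-base scaling and bias factors are omitted, and the linear `oracle` with random weights `W` is a hypothetical stand-in for a trained predictor such as Optimus 5-Prime.

```python
import math
import random

ALPHABET = "ACGU"
L = 50  # sequence length (illustrative)

rng = random.Random(0)
# Hypothetical linear "oracle": score = sum of per-position, per-base weights.
# A real predictor is a neural network; this only keeps the example self-contained.
W = [[rng.gauss(0.0, 1.0) for _ in ALPHABET] for _ in range(L)]


def oracle(onehot):
    """Score a one-hot encoded sequence (L rows of 4 entries)."""
    return sum(W[i][j] * onehot[i][j] for i in range(L) for j in range(4))


def oracle_grad(onehot):
    """Gradient of the score w.r.t. the one-hot input (constant for a linear model)."""
    return W


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def sample_onehot(pwm):
    """Draw one nucleotide per position from the PWM and one-hot encode it."""
    rows = []
    for probs in pwm:
        j = rng.choices(range(4), weights=probs)[0]
        rows.append([1.0 if k == j else 0.0 for k in range(4)])
    return rows


logits = [[0.0] * 4 for _ in range(L)]
lr = 1.0
for step in range(200):
    pwm = [softmax(row) for row in logits]
    x = sample_onehot(pwm)   # the model only ever sees discrete input
    g = oracle_grad(x)       # gradient evaluated at the sampled sequence
    for i in range(L):
        # Straight-through step: treat g as the gradient w.r.t. the PWM row,
        # then apply the softmax Jacobian to reach the logits.
        avg = sum(pwm[i][k] * g[i][k] for k in range(4))
        for j in range(4):
            logits[i][j] += lr * pwm[i][j] * (g[i][j] - avg)

# Read out the consensus sequence and score it with the oracle.
consensus = [max(range(4), key=lambda j: row[j]) for row in logits]
designed = "".join(ALPHABET[j] for j in consensus)
designed_score = oracle([[1.0 if k == j else 0.0 for k in range(4)] for j in consensus])
print(designed, round(designed_score, 2))
```

With a neural-network predictor the gradient would depend on the sampled sequence, so sampling changes the optimization path; with the linear stand-in used here it does not, and the example only shows the mechanics of the update. Each run still yields a single sequence, in line with the limitation noted above.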
A different class of design methods is based on deep generative models, neural networks trained to learn the distribution of a training data set to generate completely new examples with similar properties.74,76,77 A major advantage over gradient methods is speed: After training, generating new examples requires a single evaluation of the generative model without any iterations; however, the basic versions of these methods do not optimize sequence performance or explicitly maximize diversity. We recently developed Deep Exploration Networks (DENs),3 an activation-maximizing deep generative model that addresses these limitations (Figure 8). DENs are trained via gradient descent by minimizing a cost function composed of two terms: one related to the performance of a generated sequence as given by an independent, pretrained predictive model (e.g., Optimus 5-Prime) and the other computed from a similarity metric between two generated sequences (Figure 8A). By simultaneously maximizing performance and minimizing similarity, DENs learn to generate highly diverse sequences with high performance. Furthermore, DENs can be restricted from generating sequences that deviate too much from the sequence space defined by the training data set of the predictor by using a variational autoencoder (VAE)76 to penalize deviation during training (Figure 8A). We showed that DENs can be used to design sequences with specified alternative polyadenylation isoform ratios and custom cleavage positions, splicing regulatory sequences with maximal differential splicing in two different cell lines, and highly diverse GFP sequences with high fluorescence. Still, a potential drawback of generative models is the up-front requirement for model training. Thus we recommend using Fast SeqProp for simple design tasks and DENs when maximum performance and large numbers of diverse sequences are desirable.
Figure 8. Deep Exploration Networks. (A) A DEN is a neural network that transforms random real-valued vectors (Z) into PWMs, from which sequences can be sampled. During training, generated sequences are scored based on their predicted performance and their similarity to each other. Optionally, another generative network such as a VAE, trained on the same data as the predictive model, can be used to make sure DEN-generated sequences do not dramatically deviate from the training data set. (B) During generation, each evaluation of the trained DEN on a new random vector yields a different generated sequence.
■ SUMMARY AND OUTLOOK
Improving our ability to map mRNA sequence to function and vice versa is key to developing a new generation of mRNA therapeutics. Here we reviewed how an approach combining high-throughput MPRA data, deep learning, and sequence design algorithms can be used to characterize mRNA regulation, extract biological insights, and design novel sequences with high performance. We expect this work to be extended in several directions in the near future. First, MPRAs targeting regions other than the 5′UTR and processes other than translation will be developed. As a recent example, Qian and coworkers developed an MPRA with a short randomized uORF in the 5′UTR to study the interplay between translation and mRNA stability.35 Similarly, MPRAs targeting the 3′UTR region have been developed to study the effect of variants on mRNA abundance78 and subcellular transcript localization in neurons.79 Second, models capable of predicting multiple biomolecular processes from sequence should be further developed. For example, recent work found that sequences selected to have an exceptionally high MRL could result in low mRNA stability,36 suggesting that optimizing sequences against a predictor that models translation alone may not be an optimal strategy for maximizing protein expression. Similarly, in a test of six synthetic 5′UTRs with very high MRLs (7.8−10), a range of expression levels was observed, possibly because of confounding effects of the 5′UTR sequence on stability or even cell toxicity.59 Finally, we expect understanding and engineering cell-type-specific expression to be a major goal going forward, as targeting expression to specific cell or tissue types will limit the side effects of future mRNA therapeutics. Whereas some cell-type specificity can currently be achieved by inserting miRNA binding elements into the 3′UTR,80,81 we expect model-based design to further increase the specificity and allow the targeting of cell types and tissues that are currently inaccessible.
■ AUTHOR INFORMATION
Corresponding Author
Georg Seelig − Department of Electrical & Computer Engineering and Paul G. Allen School of Computer Science & Engineering, University of Washington, Seattle, Washington 98195, United States; orcid.org/0000-0002-3163-8782; Email: gseelig@uw.edu
Author
Sebastian M. Castillo-Hair − Department of Electrical & Computer Engineering and eScience Institute, University of Washington, Seattle, Washington 98195, United States; orcid.org/0000-0002-2384-3129
Complete contact information is available at: https://pubs.acs.org/10.1021/acs.accounts.1c00621
Notes
The authors declare no competing financial interest.
Biographies
Sebastian M. Castillo-Hair is a Data Science Postdoctoral Fellow at the eScience Institute and the Department of Electrical and Computer Engineering at University of Washington. He studies how high-throughput assays and machine learning can be used to engineer novel synthetic biological systems. Previously, he obtained his Ph.D. in Bioengineering at Rice University.
Georg Seelig is a Professor at the University of Washington. His research interests are in synthetic biology and genomics.
■ ACKNOWLEDGMENTS
We thank Johannes Linder for feedback on this manuscript. This work was supported by NIH Awards R01GM120379 and R01HG009892 to G.S., and by the University of Washington eScience Institute with support from the Washington Research Foundation to S.M.C.
■ REFERENCES
(1) Sample, P. J.; Wang, B.; Reid, D. W.; Presnyak, V.; McFadyen, I. J.; Morris, D. R.; Seelig, G. Human 5′ UTR Design and Variant Effect Prediction from a Massively Parallel Translation Assay. Nat. Biotechnol. 2019, 37, 803−809.
(2) Linder, J.; Seelig, G. Fast Activation Maximization for Molecular Sequence Design. BMC Bioinf. 2021, 22, 510.
(3) Linder, J.; Bogard, N.; Rosenberg, A. B.; Seelig, G. A Generative Neural Network for Maximizing Fitness and Diversity of Synthetic DNA and Protein Sequences. Cell Systems 2020, 11, 49−62.
(4) Polack, F. P.; Thomas, S. J.; Kitchin, N.; Absalon, J.; Gurtman, A.; Lockhart, S.; Perez, J. L.; Pérez Marc, G.; Moreira, E. D.; Zerbini, C.; Bailey, R.; Swanson, K. A.; Roychoudhury, S.; Koury, K.; Li, P.; Kalina, W. V.; Cooper, D.; Frenck, R. W.; Hammitt, L. L.; Türeci, Ö.; Nell, H.; Schaefer, A.; Ünal, S.; Tresnan, D. B.; Mather, S.; Dormitzer, P. R.; Şahin, U.; Jansen, K. U.; Gruber, W. C. Safety and Efficacy of the BNT162b2 MRNA Covid-19 Vaccine. N. Engl. J. Med. 2020, 383, 2603−2615.
(5) Baden, L. R.; El Sahly, H. M.; Essink, B.; Kotloff, K.; Frey, S.; Novak, R.; Diemert, D.; Spector, S. A.; Rouphael, N.; Creech, C. B.; McGettigan, J.; Khetan, S.; Segall, N.; Solis, J.; Brosz, A.; Fierro, C.; Schwartz, H.; Neuzil, K.; Corey, L.; Gilbert, P.; Janes, H.; Follmann, D.; Marovich, M.; Mascola, J.; Polakowski, L.; Ledgerwood, J.; Graham, B. S.; Bennett, H.; Pajon, R.; Knightly, C.; Leav, B.; Deng, W.; Zhou, H.; Han, S.; Ivarsson, M.; Miller, J.; Zaks, T. Efficacy and Safety of the MRNA-1273 SARS-CoV-2 Vaccine. N. Engl. J. Med. 2021, 384, 403−416.
(6) Chaudhary, N.; Weissman, D.; Whitehead, K. A. MRNA Vaccines for Infectious Diseases: Principles, Delivery and Clinical Translation. Nat. Rev. Drug Discovery 2021, 20, 817−838.
(7) Wu, K.; Choi, A.; Koch, M.; Elbashir, S.; Ma, L.; Lee, D.; Woods, A.; Henry, C.; Palandjian, C.; Hill, A.; Jani, H.; Quinones, J.; Nunna, N.; O’Connell, S.; McDermott, A. B.; Falcone, S.; Narayanan, E.; Colpitts, T.; Bennett, H.; Corbett, K. S.; Seder, R.; Graham, B. S.; Stewart-Jones, G. B.; Carfi, A.; Edwards, D. K. Variant SARS-CoV-2 mRNA vaccines confer broad neutralization as primary or booster series in mice. bioRxiv 2021, DOI: 10.1101/2021.04.13.439482.
(8) Sebastian, M.; Schröder, A.; Scheel, B.; Hong, H. S.; Muth, A.; von Boehmer, L.; Zippelius, A.; Mayer, F.; Reck, M.; Atanackovic, D.; Thomas, M.; Schneller, F.; Stöhlmacher, J.; Bernhard, H.; Gröschel, A.; Lander, T.; Probst, J.; Strack, T.; Wiegand, V.; Gnad-Vogt, U.; Kallen, K.-J.; Hoerr, I.; von der Muelbe, F.; Fotin-Mleczek, M.; Knuth, A.; Koch, S. D. A Phase I/IIa Study of the MRNA-Based Cancer Immunotherapy CV9201 in Patients with Stage IIIB/IV Non-Small Cell Lung Cancer. Cancer Immunol. Immunother. 2019, 68, 799−812.
(9) Papachristofilou, A.; Hipp, M. M.; Klinkhardt, U.; Früh, M.; Sebastian, M.; Weiss, C.; Pless, M.; Cathomas, R.; Hilbe, W.; Pall, G.; Wehler, T.; Alt, J.; Bischoff, H.; Geißler, M.; Griesinger, F.; Kallen, K.-J.; Fotin-Mleczek, M.; Schröder, A.; Scheel, B.; Muth, A.; Seibel, T.; Stosnach, C.; Doener, F.; Hong, H. S.; Koch, S. D.; Gnad-Vogt, U.; Zippelius, A. Phase Ib Evaluation of a Self-Adjuvanted Protamine Formulated MRNA-Based Active Cancer Immunotherapy, BI1361849 (CV9202), Combined with Local Radiation Treatment in Patients with Stage IV Non-Small Cell Lung Cancer. J. Immunother. Cancer 2019, 7, 38.
(10) Beck, J. D.; Reidenbach, D.; Salomon, N.; Sahin, U.; Türeci, Ö.; Vormehr, M.; Kranz, L. M. MRNA Therapeutics in Cancer Immunotherapy. Mol. Cancer 2021, 20, 69.
(11) Kwon, H.; Kim, M.; Seo, Y.; Moon, Y. S.; Lee, H. J.; Lee, K.; Lee, H. Emergence of Synthetic MRNA: In Vitro Synthesis of MRNA and Its Applications in Regenerative Medicine. Biomaterials 2018, 156, 172−193.
(12) Warren, L.; Lin, C. MRNA-Based Genetic Reprogramming. Mol. Ther. 2019, 27, 729−734.
(13) Chanda, P. K.; Sukhovershin, R.; Cooke, J. P. MRNA-Enhanced Cell Therapy and Cardiovascular Regeneration. Cells 2021, 10, 187.
(14) Magadum, A.; Kaur, K.; Zangi, L. MRNA-Based Protein Replacement Therapy for the Heart. Mol. Ther. 2019, 27, 785−793.
(15) Trepotec, Z.; Lichtenegger, E.; Plank, C.; Aneja, M. K.; Rudolph, C. Delivery of MRNA Therapeutics for the Treatment of Hepatic Diseases. Mol. Ther. 2019, 27, 794−802.
(16) Sahin, U.; Karikó, K.; Türeci, Ö. MRNA-Based Therapeutics Developing a New Class of Drugs. Nat. Rev. Drug Discovery 2014, 13, 759−780.
(17) Pardi, N.; Hogan, M. J.; Porter, F. W.; Weissman, D. MRNA Vaccines a New Era in Vaccinology. Nat. Rev. Drug Discovery 2018, 17, 261−279.
(18) Stepinski, J.; Waddell, C.; Stolarski, R.; Darzynkiewicz, E.; Rhoads, R. E. Synthesis and Properties of MRNAs Containing the Novel “Anti-Reverse” Cap Analogs 7-Methyl(3′-O-Methyl)GpppG and 7-Methyl(3′-Deoxy)GpppG. RNA 2001, 7, 1486−1495.
(19) Jemielity, J.; Fowler, T.; Zuberek, J.; Stepinski, J.; Lewdorowicz, M.; Niedzwiecka, A.; Stolarski, R.; Darzynkiewicz, E.; Rhoads, R. E. Novel “Anti-Reverse” Cap Analogs with Superior Translational Properties. RNA 2003, 9, 1108−1122.
(20) Kuhn, A. N.; Diken, M.; Kreiter, S.; Selmi, A.; Kowalska, J.; Jemielity, J.; Darzynkiewicz, E.; Huber, C.; Türeci, Ö.; Sahin, U. Phosphorothioate Cap Analogs Increase Stability and Translational Efficiency of RNA Vaccines in Immature Dendritic Cells and Induce Superior Immune Responses in Vivo. Gene Ther. 2010, 17, 961−971.
(21) Kocmik, I.; Piecyk, K.; Rudzinska, M.; Niedzwiecka, A.; Darzynkiewicz, E.; Grzela, R.; Jankowska-Anyszka, M. Modified ARCA Analogs Providing Enhanced Translational Properties of Capped MRNAs. Cell Cycle 2018, 17, 1624−1636.
(22) Ensinger, M. J.; Martin, S.
A.; Paoletti, E.; Moss, B. Modification of the 5′-Terminus of MRNA by Soluble Guanylyl and Methyl Transferases from Vaccinia Virus. Proc. Natl. Acad. Sci. U. S. A. 1975, 72, 2525−2529.
(23) Yisraeli, J. K.; Melton, D. A. [4] Synthesis of Long, Capped Transcripts in Vitro by SP6 and T7 RNA Polymerases. In Methods in Enzymology; RNA Processing Part A: General Methods; Academic Press, 1989; Vol. 180, pp 42−50.
(24) Yunus, M. A.; Chung, L. M. W.; Chaudhry, Y.; Bailey, D.; Goodfellow, I. Development of an Optimized RNA-Based Murine Norovirus Reverse Genetics System. J. Virol. Methods 2010, 169, 112−118.
(25) Henderson, J. M.; Ujita, A.; Hill, E.; Yousif-Rosales, S.; Smith, C.; Ko, N.; McReynolds, T.; Cabral, C. R.; Escamilla-Powers, J. R.; Houston, M. E. Cap 1 Messenger RNA Synthesis with Co-Transcriptional CleanCap® Analog by In Vitro Transcription. Current Protocols 2021, 1, e39.
(26) Karikó, K.; Muramatsu, H.; Welsh, F. A.; Ludwig, J.; Kato, H.; Akira, S.; Weissman, D. Incorporation of Pseudouridine Into MRNA Yields Superior Nonimmunogenic Vector With Increased Translational Capacity and Biological Stability. Mol. Ther. 2008, 16, 1833−1840.
(27) Andries, O.; Mc Cafferty, S.; De Smedt, S. C.; Weiss, R.; Sanders, N. N.; Kitada, T. N1-Methylpseudouridine-Incorporated MRNA Outperforms Pseudouridine-Incorporated MRNA by Providing Enhanced Protein Expression and Reduced Immunogenicity in Mammalian Cell Lines and Mice. J. Controlled Release 2015, 217, 337−344.
(28) Li, B.; Luo, X.; Dong, Y. Effects of Chemically Modified Messenger RNA on Protein Expression. Bioconjugate Chem. 2016, 27, 849−853.
(29) Hou, X.; Zaks, T.; Langer, R.; Dong, Y. Lipid Nanoparticles for MRNA Delivery. Nat. Rev. Mater. 2021, 1−17.
(30) Weng, Y.; Li, C.; Yang, T.; Hu, B.; Zhang, M.; Guo, S.; Xiao, H.; Liang, X.-J.; Huang, Y. The Challenge and Prospect of MRNA Therapeutics Landscape. Biotechnol. Adv. 2020, 40, 107534.
(31) Orlandini von Niessen, A. G.; Poleganov, M.
A.; Rechner, C.; Plaschke, A.; Kranz, L. M.; Fesser, S.; Diken, M.; Löwer, M.; Vallazza, B.; Beissert, T.; Bukur, V.; Kuhn, A. N.; Türeci, Ö.; Sahin, U. Improving MRNA-Based Therapeutic Gene Delivery by Expression-Augmenting 3′ UTRs Identified by Cellular Library Screening. Mol. Ther. 2019, 27, 824−836.
(32) Roth, N.; Schön, J.; Hoffmann, D.; Thran, M.; Thess, A.; Mueller, S. O.; Petsch, B.; Rauch, S. CV2CoV, an Enhanced MRNA-Based SARS-CoV-2 Vaccine Candidate, Supports Higher Protein Expression and Improved Immunogenicity in Rats. bioRxiv 2021, DOI: 10.1101/2021.05.13.443734.
(33) Gerstberger, S.; Hafner, M.; Tuschl, T. A Census of Human RNA-Binding Proteins. Nat. Rev. Genet. 2014, 15, 829−845.
(34) Sood, P.; Krek, A.; Zavolan, M.; Macino, G.; Rajewsky, N. Cell-Type-Specific Signatures of MicroRNAs on Target MRNA Expression. Proc. Natl. Acad. Sci. U. S. A. 2006, 103, 2746−2751.
(35) Jia, L.; Mao, Y.; Ji, Q.; Dersh, D.; Yewdell, J. W.; Qian, S.-B. Decoding MRNA Translatability and Stability from the 5′ UTR. Nat. Struct. Mol. Biol. 2020, 27, 814−821.
(36) Leppek, K.; Byeon, G. W.; Kladwang, W.; Wayment-Steele, H. K.; Kerr, C. H.; Xu, A. F.; Kim, D. S.; Topkar, V. V.; Choe, C.; Rothschild, D.; Tiu, G. C.; Wellington-Oguri, R.; Fujii, K.; Sharma, E.; Watkins, A. M.; Nicol, J. J.; Romano, J.; Tunguz, B.; Participants, E.; Barna, M.; Das, R. Combinatorial Optimization of MRNA Structure, Stability, and Translation for RNA-Based Therapeutics. bioRxiv 2021, DOI: 10.1101/2021.03.29.437587.
(37) Jackson, R. J.; Hellen, C. U. T.; Pestova, T. V. The Mechanism of Eukaryotic Translation Initiation and Principles of Its Regulation. Nat. Rev. Mol. Cell Biol. 2010, 11, 113−127.
(38) Kozak, M. Point Mutations Define a Sequence Flanking the AUG Initiator Codon That Modulates Translation by Eukaryotic Ribosomes. Cell 1986, 44, 283−292.
(39) Kozak, M. Structural Features in Eukaryotic MRNAs That Modulate the Initiation of Translation. J. Biol. Chem. 1991, 266, 19867−19870.
(40) Hinnebusch, A. G.; Ivanov, I. P.; Sonenberg, N. Translational Control by 5′-Untranslated Regions of Eukaryotic MRNAs. Science 2016, 352, 1413−1416.
(41) Leppek, K.; Das, R.; Barna, M. Functional 5′ UTR MRNA Structures in Eukaryotic Translation Regulation and How to Find Them. Nat. Rev. Mol. Cell Biol. 2018, 19, 158−174.
(42) Avni, D.; Biberman, Y.; Meyuhas, O. The 5′ Terminal Oligopyrimidine Tract Confers Translational Control on TOP MRNAs in a Cell Type- and Sequence Context-Dependent Manner. Nucleic Acids Res. 1997, 25, 995−1001.
(43) Hellen, C. U. T.; Sarnow, P. Internal Ribosome Entry Sites in Eukaryotic MRNA Molecules. Genes Dev. 2001, 15, 1593−1612.
(44) Ingolia, N. T.; Ghaemmaghami, S.; Newman, J. R. S.; Weissman, J. S. Genome-Wide Analysis in Vivo of Translation with Nucleotide Resolution Using Ribosome Profiling. Science 2009, 324, 218−223.
(45) Floor, S. N.; Doudna, J. A. Tunable Protein Synthesis by Transcript Isoforms in Human Cells. eLife 2016, 5, No. e10921.
(46) Blair, J. D.; Hockemeyer, D.; Doudna, J. A.; Bateup, H. S.; Floor, S. N. Widespread Translational Remodeling during Human Neuronal Differentiation. Cell Rep. 2017, 21, 2005−2016.
(47) Sun, C.; Shrivastava, A.; Singh, S.; Gupta, A. Revisiting Unreasonable Effectiveness of Data in Deep Learning Era. arXiv [cs.CV], August 4, 2017, 1707.02968. https://arxiv.org/abs/1707.02968 (accessed 2021-09-06).
(48) Kinney, J. B.; McCandlish, D. M. Massively Parallel Assays and Quantitative Sequence−Function Relationships. Annu. Rev. Genomics Hum. Genet. 2019, 20, 99−127.
(49) Noderer, W. L.; Flockhart, R. J.; Bhaduri, A.; Diaz de Arce, A. J.; Zhang, J.; Khavari, P. A.; Wang, C. L. Quantitative Analysis of Mammalian Translation Initiation Sites by FACS-Seq. Mol. Syst. Biol. 2014, 10, 748.
(50) Diaz de Arce, A. J.; Noderer, W. L.; Wang, C. L. Complete Motif Analysis of Sequence Requirements for Translation Initiation at Non-AUG Start Codons. Nucleic Acids Res. 2018, 46, 985−994.
(51) Cuperus, J.
T.; Groves, B.; Kuchina, A.; Rosenberg, A. B.; Jojic, N.; Fields, S.; Seelig, G. Deep Learning of the Regulatory Grammar of Yeast 5′ Untranslated Regions from 500,000 Random Sequences. Genome Res. 2017, 27, 2015−2024.
(52) Ke, S.; Shang, S.; Kalachikov, S. M.; Morozova, I.; Yu, L.; Russo, J. J.; Ju, J.; Chasin, L. A. Quantitative Evaluation of All Hexamers as Exonic Splicing Elements. Genome Res. 2011, 21, 1360−1374.
(53) Rosenberg, A. B.; Patwardhan, R. P.; Shendure, J.; Seelig, G. Learning the Sequence Determinants of Alternative Splicing from Millions of Random Sequences. Cell 2015, 163, 698−711.
(54) Eraslan, G.; Avsec, Ž.; Gagneur, J.; Theis, F. J. Deep Learning: New Computational Modelling Techniques for Genomics. Nat. Rev. Genet. 2019, 20, 389−403.
(55) Bogard, N.; Linder, J.; Rosenberg, A. B.; Seelig, G. A Deep Neural Network for Predicting and Engineering Alternative Polyadenylation. Cell 2019, 178, 91−106.
(56) Ferreira, J. P.; Overton, K. W.; Wang, C. L. Tuning Gene Expression with Synthetic Upstream Open Reading Frames. Proc. Natl. Acad. Sci. U. S. A. 2013, 110, 11284−11289.
(57) Landrum, M. J.; Lee, J. M.; Benson, M.; Brown, G.; Chao, C.; Chitipiralla, S.; Gu, B.; Hart, J.; Hoffman, D.; Hoover, J.; Jang, W.; Katz, K.; Ovetsky, M.; Riley, G.; Sethi, A.; Tully, R.; Villamarin-Salomon, R.; Rubinstein, W.; Maglott, D. R. ClinVar: Public Archive of Interpretations of Clinically Relevant Variants. Nucleic Acids Res. 2016, 44, D862−D868.
(58) Karollus, A.; Avsec, Ž.; Gagneur, J. Predicting Mean Ribosome Load for 5′UTR of Any Length Using Deep Learning. PLoS Comput. Biol. 2021, 17, No. e1008982.
(59) Linares-Fernández, S.; Moreno, J.; Lambert, E.; Mercier-Gouy, P.; Vachez, L.; Verrier, B.; Exposito, J.-Y.
Combining an Optimized MRNA Template with a Double Purification Process Allows Strong Expression of in Vitro Transcribed MRNA. Mol. Ther.–Nucleic Acids 2021, 26, 945−956.
(60) Breathnach, R.; Harris, B. A. Plasmids for the Cloning and Expression of Full-Length Double-Stranded CDNAs under Control of the SV40 Early or Late Gene Promoter. Nucleic Acids Res. 1983, 11, 7119−7136.
(61) Studier, F. W.; Moffatt, B. A. Use of Bacteriophage T7 RNA Polymerase to Direct Selective High-Level Expression of Cloned Genes. J. Mol. Biol. 1986, 189, 113−130.
(62) Alper, H.; Fischer, C.; Nevoigt, E.; Stephanopoulos, G. Tuning Genetic Control through Promoter Engineering. Proc. Natl. Acad. Sci. U. S. A. 2005, 102, 12678−12683.
(63) Nevoigt, E.; Kohnke, J.; Fischer, C. R.; Alper, H.; Stahl, U.; Stephanopoulos, G. Engineering of Promoter Replacement Cassettes for Fine-Tuning of Gene Expression in Saccharomyces Cerevisiae. Appl. Environ. Microbiol. 2006, 72, 5266−5273.
(64) de Boer, H. A.; Comstock, L. J.; Vasser, M. The Tac Promoter: A Functional Hybrid Derived from the Trp and Lac Promoters. Proc. Natl. Acad. Sci. U. S. A. 1983, 80, 21−25.
(65) Mutalik, V. K.; Guimaraes, J. C.; Cambray, G.; Lam, C.; Christoffersen, M. J.; Mai, Q.-A.; Tran, A. B.; Paull, M.; Keasling, J. D.; Arkin, A. P.; Endy, D. Precise and Reliable Gene Expression via Standard Transcription and Translation Initiation Elements. Nat. Methods 2013, 10, 354−360.
(66) Lutz, R.; Bujard, H. Independent and Tight Regulation of Transcriptional Units in Escherichia Coli Via the LacR/O, the TetR/O and AraC/I1-I2 Regulatory Elements. Nucleic Acids Res. 1997, 25, 1203−1210.
(67) Chao, S.-H.; Harada, J. N.; Hyndman, F.; Gao, X.; Nelson, C. G.; Chanda, S. K.; Caldwell, J. S. PDX1, a Cellular Homeoprotein, Binds to and Regulates the Activity of Human Cytomegalovirus Immediate Early Promoter. J. Biol. Chem. 2004, 279, 16111−16120.
(68) Magnusson, T.; Haase, R.; Schleef, M.; Wagner, E.; Ogris, M.
Sustained, High Transgene Expression in Liver with Plasmid Vectors Using Optimized Promoter-Enhancer Combinations. Journal of Gene Medicine 2011, 13, 382−391.
(69) Blazeck, J.; Garg, R.; Reed, B.; Alper, H. S. Controlling Promoter Strength and Regulation in Saccharomyces Cerevisiae Using Synthetic Hybrid Promoters. Biotechnol. Bioeng. 2012, 109, 2884−2895.
(70) Salis, H. M.; Mirsky, E. A.; Voigt, C. A. Automated Design of Synthetic Ribosome Binding Sites to Control Protein Expression. Nat. Biotechnol. 2009, 27, 946−950.
(71) Eiben, A. E.; Smith, J. From Evolutionary Computation to the Evolution of Things. Nature 2015, 521, 476−482.
(72) Taneda, A. Multi-Objective Genetic Algorithm for Pseudoknotted RNA Sequence Design. Front. Genet. 2012, 3, 36.
(73) Cao, J.; Novoa, E. M.; Zhang, Z.; Chen, W. C. W.; Liu, D.; Choi, G. C. G.; Wong, A. S. L.; Wehrspaun, C.; Kellis, M.; Lu, T. K. High-Throughput 5′ UTR Engineering for Enhanced Protein Production in Non-Viral Gene Therapies. Nat. Commun. 2021, 12, 4138.
(74) Killoran, N.; Lee, L. J.; Delong, A.; Duvenaud, D.; Frey, B. J. Generating and Designing DNA with Deep Generative Models. arXiv [cs.LG], December 17, 2017, 1712.06148. https://arxiv.org/abs/1712.06148 (accessed 2021-09-28).
(75) Lanchantin, J.; Singh, R.; Lin, Z.; Qi, Y. Deep Motif: Visualizing Genomic Sequence Classifications. arXiv [cs.LG], June 2, 2016, 1605.01133. https://arxiv.org/abs/1605.01133 (accessed 2021-09-28).
(76) Kingma, D. P.; Welling, M. Auto-Encoding Variational Bayes. arXiv [stat.ML], May 1, 2014, 1312.6114. https://arxiv.org/abs/1312.6114 (accessed 2021-08-28).
(77) Goodfellow, I. J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Networks. arXiv [stat.ML], June 10, 2014, 1406.2661. https://arxiv.org/abs/1406.2661 (accessed 2021-08-28).
(78) Griesemer, D.; Xue, J. R.; Reilly, S. K.; Ulirsch, J. C.; Kukreja, K.; Davis, J. R.; Kanai, M.; Yang, D. K.; Butts, J. C.; Guney, M.
H.; Luban, J.; Montgomery, S. B.; Finucane, H. K.; Novina, C. D.; Tewhey, R.; Sabeti, P. C. Genome-Wide Functional Screen of 3′UTR Variants Uncovers Causal Variants for Human Disease and Evolution. Cell 2021, 184, 5247−5260.e19.
(79) Mikl, M.; Eletto, D.; Lee, M.; Lafzi, A.; Mhamedi, F.; Sain, S. B.; Handler, K.; Moor, A. E. A Massively Parallel Reporter Assay Reveals Focused and Broadly Encoded RNA Localization Signals in Neurons. bioRxiv 2021, DOI: 10.1101/2021.04.27.441590.
(80) Xie, Z.; Wroblewska, L.; Prochazka, L.; Weiss, R.; Benenson, Y. Multi-Input RNAi-Based Logic Circuit for Identification of Specific Cancer Cells. Science 2011, 333, 1307−1311.
(81) Jain, R.; Frederick, J. P.; Huang, E. Y.; Burke, K. E.; Mauger, D. M.; Andrianova, E. A.; Farlow, S. J.; Siddiqui, S.; Pimentel, J.; Cheung-Ong, K.; McKinney, K. M.; Köhrer, C.; Moore, M. J.; Chakraborty, T. MicroRNAs Enable MRNA Therapeutics to Selectively Program Cancer Cells to Self-Destruct. Nucleic Acid Ther. 2018, 28, 285−296.