pubs.acs.org/accounts
Article
Machine Learning for Designing Next-Generation mRNA
Therapeutics
Published as part of the Accounts of Chemical Research special issue “mRNA Therapeutics”.
Sebastian M. Castillo-Hair and Georg Seelig*
Cite This: Acc. Chem. Res. 2022, 55, 24−34
CONSPECTUS: Over just the last 2 years, mRNA therapeutics and
vaccines have undergone a rapid transition from an intriguing concept to
real-world impact. However, whereas some aspects of mRNA
therapeutics, such as the use of chemical modifications to increase
stability and reduce immunogenicity, have been extensively optimized for
over two decades, other aspects, particularly the selection and design of
the noncoding leader and trailer sequences which control translation
efficiency and stability, have received comparably less attention. In
practice, such 5′ and 3′ untranslated regions (UTRs) are often borrowed
from highly expressed human genes with few or no modifications, as in
the case for the Pfizer/BioNTech Covid vaccine. Focusing on the
5′UTR, we here argue that model-driven design is a promising alternative
that provides unprecedented control over 5′UTR function. We review
recent work that combines synthetic biology with machine learning to
build quantitative models that relate ribosome loading, and thus translation efficiency, to the 5′UTR sequence. We first introduce an
experimental approach that uses polysome profiling and high-throughput sequencing to quantify ribosome loading for hundreds of
thousands of 5′UTRs in parallel. We apply this approach to measure ribosome loading in synthetic RNA libraries with a random
sequence inserted into the 5′UTR. We then review Optimus 5-Prime, a convolutional neural network model trained on the
experimental data. We highlight that very accurate models of biological regulation can be learned from synthetic data sets with
degenerate 5′UTRs. We validate model predictions not only on held-out data sets from our random library but also on a large library
of over 30 000 human 5′UTR fragments and using translation reporter data collected independently by other groups. Both the
experiment and model are compatible with commonly used chemically modified nucleosides, in particular, pseudouridine (Ψ) and 1-methyl-pseudouridine (m1Ψ). We find that, in general, 5′UTRs have very similar impacts when combined with different protein-coding sequences and even in the context of different chemical modifications. We demonstrate that Optimus 5-Prime can be
combined with design algorithms to generate de novo sequences with precisely defined translation efficiencies. We emphasize recent
developments in design algorithms that rely on activation maximization and generative modeling to improve both the fitness and
diversity of designed sequences. Compared with prior approaches such as genetic algorithms, we show that these approaches are not
only faster but also less likely to get stuck in local sequence optima. Finally, we discuss how the approach reviewed here can be
generalized to other gene regions and applications.
■
KEY REFERENCES
• Linder, J.; Seelig, G. Fast Activation Maximization for
Molecular Sequence Design. BMC Bioinformatics 2021,
22, 510.2 Fast SeqProp is a computational design algorithm
based on activation maximization that can be combined
• Sample, P. J.; Wang, B.; Reid, D. W.; Presnyak, V.;
McFadyen, I. J.; Morris, D. R.; Seelig, G. Human 5′
UTR Design and Variant Effect Prediction from a
Massively Parallel Translation Assay. Nature Biotechnology 2019, 37, 803−809.1 A neural network model,
Optimus 5-Prime, trained on data from a massively
parallel translation assay accurately predicts how the
5′UTR sequence controls ribosome loading and, together
with a genetic algorithm, enables the design of high-performing 5′UTR sequences for mRNA therapeutics.
© 2021 The Authors. Published by
American Chemical Society
Received: October 7, 2021
Published: December 14, 2021
https://doi.org/10.1021/acs.accounts.1c00621
Acc. Chem. Res. 2022, 55, 24−34
Accounts of Chemical Research
Figure 1. Workflow combining high-throughput assays and machine learning for characterizing mRNA regulation and engineering UTR sequences
with high performance.
involved in the regulation of translation and mRNA
degradation; however, how their sequence affects these
processes is not completely understood, and poor UTR design
can negatively impact the expression of the therapeutic protein.
Thus, most mRNA therapies to date take their UTRs from
highly expressed human genes such as α- and β-globin.30
Recent studies, however, have shown that alternative UTRs
can result in higher expression,31,32 suggesting that there is
significant room for improvement. Moreover, when targeting
different cell types or tissues, UTRs may need further tuning to
account for the differential expression of regulators such as
RNA-binding proteins (RBPs)33 and microRNAs (miRNAs).34
To further complicate matters, there is a complex interplay
between the different sequence-dependent regulatory mechanisms that ultimately control protein expression. Recent
studies have shown that 5′UTR elements that repress
translation can also diminish mRNA stability,35 but 5′UTRs
with very high translation efficiencies can also have a
destabilizing effect and ultimately reduce expression.36 Clearly,
quantitative models that take into account these effects and
predict protein expression from sequence are crucial to
unlocking the full potential of mRNA therapeutics.
In this Account, we describe a framework that combines
high-throughput assays and deep-learning techniques to
develop predictive models, obtain biological insights, and
engineer de novo sequences that optimize protein expression
for mRNA therapeutics applications (Figure 1). The first
component of this framework consists of massively parallel
reporter assays (MPRAs) based on large synthetic gene or
transcript libraries, whereby sequence variation is targeted to
the region under study. Such an approach makes it possible to
experimentally interrogate and quantify how a particular part of
the mRNA (5′UTR, coding sequence (CDS), 3′UTR)
contributes to protein production. In particular, here we
with sequence-function models such as Optimus 5-Prime to
rapidly design functional and high-fitness sequences.
• Linder, J.; Bogard, N.; Rosenberg, A. B.; Seelig, G. A
Generative Neural Network for Maximizing Fitness and
Diversity of Synthetic DNA and Protein Sequences. Cell
Systems 2020, 11, 49−62.e16.3 Deep Exploration Networks (DENs) are a class of generative sequence design
models that can be used to design sequence libraries while
simultaneously maximizing the performance of all sequences
and minimizing similarity between them.
■
INTRODUCTION
We are witnessing the beginning of the mRNA therapeutics
revolution. Indeed, this technology is now in the public
spotlight thanks to its role in fighting the COVID-19
pandemic: It resulted in two of the most effective vaccines
available,4,5 with more currently in clinical trials,6 and has
continued to enable the rapid development of potential
booster shots against variants of concern.7 But its scope is
not limited to vaccines for infectious diseases, as mRNA
therapeutics are being evaluated for clinical applications such
as cancer immunotherapy,8−10 regenerative medicine,11−13 and
protein replacement therapy,14,15 among others.16,17 This wave
of mRNA therapeutics is the result of decades of work on
many fronts, including the development of novel 5′-cap
analogs18−21 and capping methods,22−25 chemically modified
bases,26−28 and lipid nanoparticles,29 all of which improve
mRNA stability, decrease immunogenicity, improve delivery
efficiency, and, in general, result in the successful in vivo
expression of the therapeutic protein.
Nevertheless, there is still untapped potential in optimizing
the primary sequence, either to further improve protein
expression or to encode more complex pharmacokinetics. For
example, 5′ and 3′ untranslated regions (UTRs) are heavily
Figure 2. Polysome profiling data from a random 5′UTR library captures known regulatory effects. (A) Schematic of polysome profiling
experiment. (B) Influence of upstream start codon position along the 5′UTR. Plots show the average MRL of sequences containing a canonical
(AUG) or noncanonical (CUG, GUG) start codon at the indicated position with respect to the primary AUG. (C) Influence of context around
upstream start codon. Sequences containing an upstream AUG, GUG, or CUG between positions −21 and −8 were grouped by their surrounding
context (strong, moderate, weak) and whether they occur in-frame or out-of-frame with the primary AUG. p values were calculated using two-sided
t tests. (D) Relationship between the 5′UTR secondary structure and the measured ribosome load. 20 000 5′UTR sequences were grouped by their
predicted minimum free energy, and the MRL distribution of each group was plotted.
includes a purine at −3 and a G at +4, promotes
recognition.38,39 Finally, initiation factors dissociate, the 60S
subunit is recruited, the full 80S ribosome is assembled, and
peptide elongation begins.37 Translation of the primary open
reading frame (ORF) can be heavily influenced by sequence
elements within the 5′UTR. For example, upstream start
codons (uAUGs) and upstream ORFs (uORFs) can have a
repressive effect by capturing ribosomes that would otherwise
initiate at the primary start codon.40 Secondary structure and
RNA-binding proteins may block or interfere with ribosome
scanning.41 Other elements, such as the 5′ terminal
oligopyrimidine tract (5′TOP), regulate changes in translation
in response to stress.42 Finally, internal ribosome entry sites
(IRESs) allow translation initiation in a cap-independent
manner.43 All of these cis-regulatory elements may interact
with one another and influence translation in ways that remain
challenging to predict.
Machine-learning approaches can, in principle, be used to
build models that predict translation efficiency from 5′UTR
sequence, but such models require very large-scale and high-quality training data. A potential solution could be found in
high-throughput translation data sets obtained from the human
transcriptome. Most of these used Ribo-seq,44 a method
wherein ribosome-bound transcripts are digested and the
ribosome-protected mRNA fragments are sequenced. Ribo-seq
provides mRNA translation efficiencies and even identifies the
reading frame being translated, but it has difficulty distinguishing between transcript isoforms of the same gene due to a
short ∼30 nt fragment length. TrIP-seq,45 wherein mRNAs are
fractionated based on the number of elongating ribosomes
before sequencing, can distinguish transcript isoforms and has
focus on recent work characterizing the influence of the 5′UTR
sequence on translation efficiency.
The second component consists of deep-learning methods
capable of identifying complex regulatory relationships from
the resulting data sets. Specifically, we show that a convolutional neural network (CNN) model trained on MPRA data
accurately predicts ribosome loading from 5′UTR sequences.
Finally, we discuss methods for engineering novel sequences
that achieve specified performance levels or exceed the
performance of endogenous sequences. We first report results
from designing 5′UTRs using a genetic algorithm, an iterative
discrete search method that “evolves” sequences in silico until
the model prediction matches a prespecified target. We then
describe more recent methods that are capable of rapidly
designing large libraries of diverse sequences while avoiding
many of the inefficiencies and pitfalls of genetic algorithms. We
conclude with a short discussion on how this work can be
extended to 3′UTR and CDS sequences and to other
molecular phenomena such as mRNA degradation.
■
A MASSIVELY PARALLEL TRANSLATION ASSAY
FOR CHARACTERIZING 5′UTRs
Translation of most mRNAs in eukaryotes starts with the
assembly of the eIF4F complex at the 5′cap followed by
recruitment of the 43S preinitiation complex containing the
40S ribosomal subunit, several eukaryotic initiation factors
(eIFs), and the Met-tRNAiMet anticodon.37 Next, the
assembled 43S complex scans the 5′UTR in the 5′ to 3′
direction until a start codon is recognized. Successful
recognition depends on the start codon identity (the canonical
AUG is more likely to be recognized than CUG or GUG) and
the context around it. The “Kozak consensus sequence”, which
Figure 3. Polysome profiling data from the random 5′UTR eGFP library generalizes to different CDSs and mRNA chemistries. (A) MRL from a
library of 3110 5′UTRs with an eGFP versus mCherry CDS. (B) Schematic of uridine compared with the modified nucleosides pseudouridine (Ψ)
and 1-methyl-pseudouridine (m1Ψ). (C,D) MRL comparison for modified versus unmodified chemistries. The eGFP library was resynthesized
using Ψ (C) or m1Ψ (D), and polysome profiling was performed as with the unmodified library. r2 values were calculated from 20 000 sequences
with the highest read coverage. Plots show 3000 sequences randomly chosen from this subset.
splicing52 and even translation49 often used short (6−10 nt)
random regions such that every possible sequence combination
was covered. However, building on our own previous work on
alternative splicing53 and translation regulation in yeast,51 we
used a longer random sequence to allow for more diverse motif
combinations. Whereas we cannot cover every possible 50-mer
(there are >10^30), short regulatory sequences such as start and
stop codons should appear frequently and in many different
positions and combinations.
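As a rough check on these numbers, the sketch below counts the possible 50-mers and estimates how often a given 3-nt motif such as AUG is expected in one random 50-nt region. It assumes equal base frequencies at every position, which a synthesized library only approximates, and the function names are illustrative.

```python
# Back-of-the-envelope numbers for a fully degenerate 50-nt region.
# Assumption: each base (A, C, G, U) is equally likely at every position.

REGION_LENGTH = 50


def sequence_space_size(length: int) -> int:
    """Number of distinct RNA sequences of the given length."""
    return 4 ** length


def expected_motif_count(length: int, motif: str) -> float:
    """Expected number of occurrences of `motif` in one random sequence."""
    start_positions = length - len(motif) + 1
    return start_positions / 4 ** len(motif)


print(f"{sequence_space_size(REGION_LENGTH):.2e}")
print(expected_motif_count(REGION_LENGTH, "AUG"))
```

Under these assumptions, about 0.75 AUG occurrences are expected per sequence, so a library with hundreds of thousands of members samples each short motif at every position many times over.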
mRNA was synthesized, capped, and polyadenylated using
an in vitro transcription (IVT) system, which, compared with
transfecting plasmid DNA, allowed us to remove confounding
transcriptional and RNA processing effects. Synthesized
mRNA was transfected into HEK293T cells and incubated
for 12 h. Cells were then lysed in the presence of
cycloheximide, an antibiotic that halts elongating ribosomes.
The lysate was run through a sucrose gradient, and fractions
containing mRNAs bound to distinct numbers of ribosomes
(polysomes) were collected, barcoded, and sequenced. The
resulting data set contains read counts for 280 000 5′UTR
sequences. From here, we obtained the mean ribosome load
(MRL) for each sequence by multiplying, for each fraction, the
proportion of reads corresponding to a specific sequence times
the number of associated ribosomes and summing these
products. The MRL is thus a quantitative measure of
translation efficiency.
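The MRL calculation described above can be sketched as a weighted mean. This is one reading of the definition, assuming that read counts have already been normalized for sequencing depth across fractions; the function name and the example values are hypothetical.

```python
# Mean ribosome load (MRL) from polysome-fraction read counts.
# Assumptions: `counts[k]` is the read count of one 5'UTR in fraction k,
# already normalized for sequencing depth, and `ribosomes[k]` is the number
# of ribosomes assigned to that fraction. The example values are made up.

def mean_ribosome_load(counts, ribosomes):
    """Mean ribosome number, weighted by the read proportion per fraction."""
    if len(counts) != len(ribosomes):
        raise ValueError("counts and ribosomes must have the same length")
    total = sum(counts)
    if total == 0:
        raise ValueError("sequence has no reads in any fraction")
    return sum((c / total) * n for c, n in zip(counts, ribosomes))


ribosomes_per_fraction = [0, 1, 2, 3, 4]
example_counts = [10, 20, 40, 20, 10]
print(mean_ribosome_load(example_counts, ribosomes_per_fraction))
```

A sequence whose reads concentrate in heavier polysome fractions receives a higher MRL, which is why the value serves as a proxy for translation efficiency.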
Our polysome profiling data set recapitulated previously
known regulatory effects. 5′UTRs with upstream AUGs had,
on average, lower MRL values when the AUG was out-of-frame with respect to the primary start codon (Figure 2B).
This effect was also present with noncanonical upstream start
codons (CUG, GUG) but to a significantly lesser extent
(Figure 2B). Notably, the context around upstream start
codons strongly influenced their repressive effect: Out-of-frame
uAUGs were more repressive when surrounded by a purine
(A,G) at position −3 and a guanine at +4, matching the Kozak
consensus sequence, whereas uCUGs and uGUGs had
statistically significant effects only within a similarly strong
context (Figure 2C). Similar effects were observed for uORFs.
These observations are consistent with stronger uAUGs and
uORFs redirecting ribosomes that would otherwise initiate at
the primary start codon. Additionally, sequences with lower
predicted free energies had, on average, lower MRLs,
consistent with stable secondary structures interfering with
ribosome scanning (Figure 2D).
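The grouping used in this analysis (upstream start codon position, reading frame, and surrounding context) can be illustrated with a short annotation routine. The sketch below is hypothetical: the three-level context rule (strong when both the −3 purine and the +4 G are present, moderate when one is, weak when neither is) is an assumption made for illustration and may differ from the categories used in the study.

```python
# Find upstream AUGs (uAUGs) in a 5'UTR and annotate frame and context.
# Assumptions (not taken from the original study): context is called
# "strong" when both the -3 purine and the +4 G are present, "moderate"
# when only one is present, and "weak" when neither is.

def find_uaugs(utr, cds):
    """Return a list of dicts describing each AUG that starts in `utr`.

    `cds` must begin with the primary AUG. Positions are reported relative
    to the A of the primary AUG (the last UTR base is -1).
    """
    utr = utr.upper().replace("T", "U")
    cds = cds.upper().replace("T", "U")
    if not cds.startswith("AUG"):
        raise ValueError("cds must start with the primary AUG")
    full = utr + cds
    hits = []
    for i in range(len(utr)):
        if full[i:i + 3] != "AUG":
            continue
        purine_at_minus3 = i >= 3 and full[i - 3] in "AG"
        g_at_plus4 = i + 3 < len(full) and full[i + 3] == "G"
        matches = int(purine_at_minus3) + int(g_at_plus4)
        hits.append({
            "position": i - len(utr),
            "in_frame": (len(utr) - i) % 3 == 0,
            "context": ("weak", "moderate", "strong")[matches],
        })
    return hits


example_utr = "GGGACCAUGGCUUCCCAUGUCGCCACC"
for hit in find_uaugs(example_utr, "AUGGUGAGC"):
    print(hit)
```

The same scan extends to noncanonical start codons by replacing the `"AUG"` comparison with a set such as `{"AUG", "CUG", "GUG"}`.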
been successfully applied to studying the impact of alternative
5′ and 3′ UTRs.45,46
Still, endogenous transcript data may not be optimal for
training predictive models for multiple reasons. First,
endogenous transcripts contain highly variable UTR and
CDS sequences, making it difficult to reliably isolate how a
specific part of the mRNA, such as the 5′UTR, influences
translation. Second, the size of an endogenous data set is
fundamentally limited by the size of the human transcriptome.
Because deep learning can take advantage of extremely large
data sets to achieve exceptional performance,47 obtaining more
examples than what the genome can provide is desirable.
Finally, sequences with deleterious effects are likely to be
underrepresented in endogenous data, potentially resulting in
major model blind spots.
An alternative approach is to use MPRAs, where large
libraries of synthetic reporter sequences are assayed. Here
variation is restricted to a particular sequence element, in the
form of either fully degenerate or endogenous sequence
fragments.48 In addition, the MPRA library size can be orders
of magnitude larger than the number of genomic examples.
Previous MPRAs for characterizing translation in human cells
have used stable single-copy integration of DNA libraries
followed by fluorescence-activated sorting and sequencing.49,50
Similarly, we have previously used DNA libraries combined
with a growth selection assay to study translation in yeast.51
Measurements from DNA libraries are affected by 5′UTRs
influencing transcription or RNA processing as well as
translation, which, in some studies, has been compensated
for by placing a fluorescent reporter downstream of an IRES in
the same transcript.49,50
To characterize the influence of the 5′UTR on translation,
we developed an MPRA in which we measure the ribosome
loading of a random library of hundreds of thousands of
mRNAs (Figure 2A). This assay uses polysome profiling
followed by sequencing, as in TrIP-seq;45 however, we use
synthetic mRNA libraries where sequence variation is targeted
to the 5′UTR region. Specifically, our reporter design contains
a constant enhanced green fluorescent protein (eGFP) CDS, a
3′UTR derived from bovine growth hormone (BGH), and a
5′UTR with an initial 25 nt-long fixed segment followed by a
50 nt fully degenerate region. Two out-of-frame stop codons
present at the beginning of the eGFP CDS ensured that
initiation at a randomly generated out-of-frame start codon
would not result in extended translation. Prior MPRAs for
Figure 4. Optimus 5-Prime can predict ribosome loading and protein expression from a given 5′UTR sequence. (A) Optimus 5-Prime architecture.
An input sequence represented as a 50 × 4 one-hot encoded vector (bottom) is fed into two convolutional layers (middle) followed by a fully dense
layer to generate an MRL prediction (top). (B) Measured versus predicted MRLs on a held-out test set of 20 000 sequences. Red: 5′UTRs with no
uAUGs. Blue: 5′UTRs with uAUGs. (C) Predicted MRL versus eGFP fluorescence for 10 mRNAs selected to have a wide range of MRL values.
mRNAs were independently transfected into HEK293 cells and imaged using an IncuCyte S3 live-cell analysis system. The maximum fluorescence
over a 20.5 h time window is shown.
Figure 5. Optimus 5-Prime predictions generalize across mRNA chemistries, cell lines, and endogenous 5′UTRs. (A) Coefficients of determination
(r2) for model predictions when training and test data sets are taken from one of two replicates of the original eGFP data set without modification
(U), with pseudouridine (Ψ), or with 1-methyl-pseudouridine (m1Ψ). (B,C) Optimus 5-Prime predictions compared with translation efficiency
measurements for 77 5′UTRs designed and characterized in six different cell lines by Ferreira et al.56 mRNA reporters contained a GFP ORF
preceded by a designed 5′ UTR and a red fluorescent protein (RFP) ORF preceded by an IRES to be used as a normalization control. Cell lines
included human embryonic kidney cells (293T), mouse pre-B lymphocytes (PD31), human chronic myelogenous leukemia cells (K562), human
colon cancer cells (HCT116), Chinese hamster ovary cells (CHO-K1), and mouse plasmacytoma (MPC11). 5′UTRs were used in this analysis
only if their GFP ORFs started with ATGG. 5′UTRs shorter than 50 bp were zero-padded before being used with Optimus 5-Prime. (B) Direct
comparison with measurements in PD31 cells. (C) Coefficients of determination (r2) of measurements versus MRL predictions in all cell lines. (D)
Predicted versus observed MRL for wild-type and SNV-containing human 5′UTR sequences.
■
DEVELOPING PREDICTIVE MODELS OF
RIBOSOME LOADING
Deep learning has been highly successful at various tasks in
molecular biology54 due to at least two factors. First, the tiered
nature of the molecular interactions involved in a particular
process (sequence motifs recruit effector proteins, which form
complexes with other proteins, which, in turn, interact with
other complexes and sequence motifs) is efficiently captured
by the layered architecture of a deep-learning network. Second,
because deep learning is capable of capturing complex,
nonlinear interactions, these models are uniquely suited to
take advantage of extremely large data sets to obtain improved
performance.47,55
To predict ribosome loading from 5′UTR sequence, we
developed a convolutional neural network (CNN) model
named Optimus 5-Prime. The model contains two convolutional layers with filters that identify short motifs from the
input and one fully connected layer that ultimately computes
the MRL prediction (Figure 4A). We first trained this model
using 260 000 random 5′UTRs and associated MRLs from the
polysome profiling data set. Optimus 5-Prime was able to
Associations between 5′UTR sequences and MRL measurements generalized to other coding sequences and nucleotide
chemistries, both relevant to mRNA therapeutic applications.
We tested the same ∼3000 5′UTRs together with either the
eGFP or mCherry CDS, two fluorescent proteins of different
origins and with widely differing sequences. We observed
excellent correlation between MRLs measured in both contexts
(r2 = 0.732, Figure 3A), suggesting that identical UTRs result
in similar MRLs, even if the CDS context changes.
Similarly, we resynthesized the original 280 000-member
5′UTR library but replaced uracil in the IVT reaction with the
chemically modified nucleosides pseudouridine (Ψ) and 1-methyl-pseudouridine (m1Ψ) (Figure 3B). mRNAs incorporating these modifications avoid activation of intracellular
pattern recognition receptors, which would result in an
immune response that suppresses translation and promotes
mRNA degradation. Consequently, these modifications are
commonly used in mRNA therapeutics.6,11 MRLs were found
to be highly correlated across chemistries (Figure 3C,D).
predict ribosome loading with remarkable precision when
tested against 20 000 samples held out from training (r2 = 0.93,
Figure 4B, compared with r2 = 0.64 for the best k-mer linear
model with k ≤ 6). To further validate whether MRL
predictions were indicative of output protein expression, we
selected 10 sequences and performed individual eGFP
fluorescence measurements, which were highly correlated
with MRL predictions (Figure 4C).
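For readers who want to reproduce this kind of comparison, the sketch below computes r2 between measured and predicted values. It assumes that r2 is the square of Pearson's correlation coefficient, which is a common convention but is not stated explicitly here; the function name is illustrative.

```python
# Squared Pearson correlation between measured and predicted MRLs.
# Assumption: r2 is computed as the square of Pearson's r.

def r_squared(measured, predicted):
    """Return the squared Pearson correlation of two equal-length lists."""
    if len(measured) != len(predicted) or len(measured) < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    mean_m = sum(measured) / len(measured)
    mean_p = sum(predicted) / len(predicted)
    cov = sum((m - mean_m) * (p - mean_p) for m, p in zip(measured, predicted))
    var_m = sum((m - mean_m) ** 2 for m in measured)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_m == 0 or var_p == 0:
        raise ValueError("r2 is undefined when either list is constant")
    return cov ** 2 / (var_m * var_p)


# Invented values for illustration only.
print(r_squared([1.0, 2.0, 3.0, 4.0], [1.0, 3.0, 2.0, 4.0]))
```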
To be maximally useful for the design of mRNA
therapeutics, Optimus 5-Prime needs to generalize to different
coding sequences, chemical modifications, cell types, sequence
“types” (e.g., human rather than random), or lengths. As
detailed above, we found UTR sequences to be highly
transferable between different CDS contexts and even chemical
modifications (Figure 3). Accordingly, despite being trained on
eGFP library data, Optimus 5-Prime predictions could explain
78 and 77% of the observed MRL variation in data from two
replicates with ∼200 000 random 5′UTRs preceding an
mCherry CDS (Figure 5A). Similarly, despite being trained
only on unmodified mRNA data, Optimus 5-Prime could
explain 69−73% of the observed MRL variation in the Ψ
library and 68−76% in the m1Ψ library. Still, the model
accuracy could be increased to 84−85, 77−82, and 72−81% by
retraining directly on the mCherry, Ψ, and m1Ψ data sets,
respectively (Figure 5B). Therefore, whereas a model trained
on unmodified RNA data is reasonably accurate, training
directly on modified RNA data will be ideal for predicting the
impact of such modifications in mRNA therapeutics contexts.
To test whether Optimus 5-Prime would perform well on
sequences designed by others, we turned to translation
measurements conducted with six different cell lines and 77
5′UTRs designed by Ferreira et al.56 and found that Optimus
5-Prime could explain 73−85% of the reported variation
(Figure 5C). We also note that measurements reported for
different cell types are very highly correlated, suggesting that
the basic regulatory rules (e.g., strengths of Kozak, role of
uORFs, etc.) remain similar between cell types. These
observations also suggest that a model trained on data
collected in a single cell type can generalize to other, more
clinically relevant cell types.
We also showed that Optimus 5-Prime generates accurate
MRL predictions on human 5′UTRs despite being trained on
random sequences only. We synthesized 35 212 5′UTRs
extracted from the 50 nt long region immediately preceding
the start codon in human transcripts and 3577 single
nucleotide variant (SNV) sequences from ClinVar57 and
assayed them as described above. MRL predictions were highly
correlated with experimental observations (r2 = 0.82, Figure
5D).
A limitation of the initial version of Optimus 5-Prime is its
fixed 50 nt long input, as UTRs used for mRNA therapeutics
can be longer, whereas human 5′UTRs range from tens to
thousands of bases with a median length of 218.41 Thus we
constructed and characterized a new library with a degenerate
5′UTR region ranging from 25 to 100 bases. We then retrained
Optimus 5-Prime using a longer 100-base input layer. Input
sequences shorter than 100 bases were accommodated by left-padding their one-hot-encoded vector with zeros. On a test
data set of held-out random and human 5′UTR sequences, we
found MRL predictions to be highly correlated with measurements (r2 from 0.84 to 0.75). Recently, Gagneur and
coworkers used our MPRA data to train a model based on
convolutions staggered every three bases to further extend
MRL predictions to arbitrary-length 5′UTR sequences.58
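The input handling described above can be sketched as follows. The channel order (A, C, G, U) and the helper name are assumptions made for illustration; only the 100-base window and the left-padding with zeros come from the text.

```python
# One-hot encode a 5'UTR and left-pad it with zero rows to a fixed length.
# Assumptions: channel order A, C, G, U; the order used by the original
# model is not specified here.

ALPHABET = "ACGU"
INPUT_LENGTH = 100


def one_hot_left_padded(utr, input_length=INPUT_LENGTH):
    """Return an input_length x 4 list of lists (rows are positions)."""
    utr = utr.upper().replace("T", "U")
    if len(utr) > input_length:
        raise ValueError("sequence is longer than the model input")
    rows = [[0, 0, 0, 0] for _ in range(input_length - len(utr))]
    for base in utr:
        if base not in ALPHABET:
            raise ValueError(f"unexpected base: {base!r}")
        rows.append([1 if base == b else 0 for b in ALPHABET])
    return rows


encoded = one_hot_left_padded("GCCACC")
print(len(encoded), encoded[-6:])
```

Left-padding keeps the bases nearest the start codon at the same input positions regardless of UTR length.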
Finally, we use Optimus 5-Prime to score the translation
efficiencies of 5′UTRs previously used in mRNA therapeutics.
First, the commonly used α- and β-globin 5′UTRs6 have
predicted MRLs of 6.1 and 6.6, respectively. Therefore,
compared with the 25−100 nt library data set, these sequences
can be placed in the 65th and 86th percentiles. The
BioNTech/Pfizer BNT-162b2 COVID-19 vaccine4 uses a
modified α-globin with a consensus Kozak and, according to
our model, results in an MRL of 6.3 (76th percentile). Finally,
the Moderna mRNA-1273 vaccine5 uses a synthetic 5′UTR,
which our model predicts to have an MRL of 5.7 (52nd
percentile). Whereas most of these UTRs result in higher-than-average ribosome loading and generally high expression, our
results suggest that further optimization could be beneficial for
strongly expressing proteins in therapeutic applications. In fact,
in very recent work, Exposito and coworkers compared six
5′UTRs selected from our eGFP library because of their high
measured MRLs to the β-globin 5′UTR and found that at least
one of the six synthetic sequences (“UTR4”) resulted in higher
GFP expression across three different cell types. Most notably,
an 80% increase in fluorescence compared with the β-globin
control was reported in primary human-monocyte-derived
dendritic cells.59
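Placing a predicted MRL within a library distribution, as done above for the globin and vaccine 5′UTRs, amounts to a percentile-rank calculation. A minimal sketch follows; the library values are invented placeholders, not measurements from the 25−100 nt data set, and the function name is illustrative.

```python
# Percentile rank of a predicted MRL within a reference distribution.
# Assumption: the percentile is the share of library values strictly below
# the query. The library values are placeholders, not real data.
from bisect import bisect_left


def percentile_rank(value, library_values):
    """Return the percentage of library values that fall below `value`."""
    if not library_values:
        raise ValueError("library_values must not be empty")
    ordered = sorted(library_values)
    return 100.0 * bisect_left(ordered, value) / len(ordered)


placeholder_library = [3.0, 4.5, 5.0, 5.7, 6.0, 6.3, 6.6, 7.0, 7.5, 8.0]
print(percentile_rank(6.1, placeholder_library))
```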
■
DESIGNING SEQUENCES FOR ENHANCED AND
SPECIFIC mRNA TRANSLATION
Methods to rationally design regulatory sequences with custom
performance, such as 5′UTRs that achieve target translation
efficiencies, have been a major focus of synthetic biology. Early
genetic engineering relied on using regulatory sequences from
endogenous sources, with the expectation that they would
perform as well in their new synthetic context,60,61 an approach
still largely used with UTRs for mRNA therapeutics.6,30
However, further work, in particular, related to promoters,
demonstrated that engineered sequences could allow for the
finer tuning of performance62,63 and even outperform native
sequences64 while being more robust to context changes.65
Methods to design these sequences included building chimeras
from native sequences,64 rationally inserting66 or deleting67
regulatory motifs, and screening libraries containing random
mutations62,63 or permutations of sequence elements.68,69
An alternative approach is model-based design, where a
predictive sequence-to-function model is used alongside a
search algorithm to generate fully synthetic sequences with
target performance. An early demonstration of this approach
was the ribosome binding site (RBS) calculator, a software
package that designs bacterial 5′UTRs for specified translation
efficiencies.70 The RBS calculator was successful in part
because bacterial translation initiation relies on binding of the
16S ribosomal RNA to a sequence element in the mRNA
5′UTR; therefore, a model based entirely on RNA hybridization thermodynamics was sufficiently accurate. However,
detailed biophysical models may not be available for other
processes. Machine-learning models, such as Optimus 5-Prime,
that can be trained on large-scale example data sets even in the
absence of a quantitative biophysical model provide a powerful
alternative as “oracles” for sequence design.
We first demonstrated the machine-learning-guided design
of functional sequence elements in the context of yeast 5′UTR
regulation. Specifically, we used a neural network model
trained on 500 000 random UTRs together with random
Figure 6. (A) Sequence design using Optimus 5-Prime and a genetic algorithm. (B) Predicted and observed MRLs of 12 000 sequences designed
for different target MRLs. (C,D) Model performance before (C) and after (D) retraining with a subset of the designed sequences evaluated on a
held-out designed sequence test set.
Figure 7. Fast SeqProp, a gradient-based sequence design method with PWM sampling and per-base logit normalization, rapidly finds high-performing sequences. (A) Methods based on search heuristics introduce random changes to a candidate sequence. Performance needs to be
evaluated for several candidates before finding an improvement. (B) Gradient-based methods move in the direction of increased performance, as
measured by the gradient of the cost function. (C) Fast SeqProp optimizes logits via gradient descent. Logits are normalized across positions, and
one-hot encoded sequences are sampled from PWMs before being presented to the pretrained predictive model. (D) Cost function over number of
iterations with and without logit normalization and PWM sampling when optimizing for high MRLs using Optimus 5-Prime.
mutagenesis to computationally evolve high fitness sequences.51 This work gave a proof of principle for machine-learning-guided sequence design, but regulatory sequences optimized
for gene expression in yeast are unlikely to be optimal for
mRNA therapeutics applications. Given the high accuracy
achieved by Optimus 5-Prime, we similarly evaluated its
application in designing 5′UTRs with specified translation
efficiencies.1 Our design approach was based on genetic
algorithms,71 a discrete search heuristic previously used for
designing bacterial 5′UTRs70 and RNAs with pseudoknotted
structures.72 Starting with a set of random 50 bp 5′UTRs, each
iteration consisted of random mutations and crossovers in
silico, followed by scoring using Optimus 5-Prime and selection
of the best sequences for the next round (Figure 6A). To test
this approach, we designed 12 000 sequences to either achieve
one of seven discrete MRL values between 3 and 9 or to
maximize the MRL. These sequences were then synthesized
and tested via polysome profiling. We found excellent
agreement between the target and experimental MRLs when
the target was eight or lower. However, for larger target MRLs,
the experimental measurements were lower than predicted
(Figure 6B). A closer inspection of sequences in this stage
revealed the appearance of long poly-U stretches not present in
the training library. Therefore, the genetic algorithm was likely
exploiting a blind spot in Optimus 5-Prime (a region in
sequence space where predictions would have low quality) to
further increase MRLs. Notably, retraining the model using a
subset of the designed sequences and their measured MRLs
improved the prediction accuracy (Figure 6C,D). Recent work
by Lu and coworkers also used a genetic algorithm to design
5′UTRs for DNA gene therapy applications, but their approach
jointly optimized transcription and translation.73
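The design loop described above (random mutation and crossover, scoring, and selection) can be sketched in a few lines of Python. This is a minimal illustration rather than the published implementation: `toy_predict_mrl` is a hypothetical stand-in for Optimus 5-Prime with made-up scoring rules, and the population size, mutation rate, and number of generations are arbitrary assumptions.

```python
import random

ALPHABET = "ACGU"


def toy_predict_mrl(seq):
    """Hypothetical stand-in for a trained predictor such as Optimus 5-Prime.

    The rule is invented for illustration only: the score rises with the
    fraction of A and drops for every AUG found in the sequence.
    """
    score = 3.0 + 6.0 * seq.count("A") / len(seq) - 0.5 * seq.count("AUG")
    return max(score, 0.0)


def mutate(seq, rate, rng):
    """Replace each base with a random base with probability `rate`."""
    return "".join(rng.choice(ALPHABET) if rng.random() < rate else b for b in seq)


def crossover(a, b, rng):
    """Single-point crossover between two equal-length sequences."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]


def design(target_mrl=None, length=50, pop_size=100, n_keep=20,
           generations=50, mutation_rate=0.02, seed=0):
    """Evolve a sequence toward `target_mrl`, or maximize the score if None."""
    rng = random.Random(seed)

    def fitness(seq):
        pred = toy_predict_mrl(seq)
        return pred if target_mrl is None else -abs(pred - target_mrl)

    # Start from random sequences, as in the text.
    population = ["".join(rng.choice(ALPHABET) for _ in range(length))
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: keep the best-scoring sequences unchanged (elitism).
        parents = sorted(population, key=fitness, reverse=True)[:n_keep]
        # Variation: refill the population by crossover plus mutation.
        children = []
        while len(children) < pop_size - n_keep:
            a, b = rng.sample(parents, 2)
            children.append(mutate(crossover(a, b, rng), mutation_rate, rng))
        population = parents + children
    return max(population, key=fitness)
```

Calling `design(target_mrl=6.0)` returns a 50 nt sequence whose toy score approaches the target, and `design()` maximizes the toy score instead. Because the search only sees the predictor's output, it will readily exploit any region of sequence space where the predictor is wrong, which is the failure mode reported above for the largest target MRLs.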
■ IMPROVED ALGORITHMS FOR SEQUENCE DESIGN
Sequence design based on search heuristics such as genetic
algorithms has several issues that we have tried to address in
recent work. For example, design can be slow and inefficient:
The mutation and crossover operations change a few
nucleotides at a time and are not guaranteed to increase the
performance at every step, resulting in small improvements
despite multiple model evaluations (Figure 7A). A more
efficient approach is activation maximization through gradient
descent, where the gradient of the performance metric with
respect to the model input is used to iteratively refine a
candidate sequence.74 Progress is always made in the direction
of increased performance, and fewer model evaluations are
required (Figure 7B). However, gradients can only be taken
with respect to continuous real-valued inputs, and thus some
modifications are needed to design sequences consisting of
discrete letters. Attempts at addressing this limitation include
representing sequences via unstructured real-valued matrices75
and introducing a “softmax” layer that transforms unbounded
real-valued inputs (“logits”) into position-weight matrices
(PWMs) before feeding them to the model.74 However,
these approaches may result in poor performance because
models are trained on one-hot encoded data (i.e., unambiguous sequences), not real-valued inputs.
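To make the softmax relaxation concrete, the sketch below performs gradient ascent on logits whose row-wise softmax gives a PWM. It assumes a deliberately trivial, hypothetical predictor (`toy_score`, a linear function of the PWM with weights `weights`) so that the gradient can be written by hand; a trained network would instead supply this gradient through automatic differentiation. The step size and iteration count are arbitrary.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax: turns unbounded logits into a PWM (rows sum to 1)."""
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)


def toy_score(pwm, weights):
    """Hypothetical differentiable predictor: a linear score of the PWM."""
    return float((pwm * weights).sum())


def maximize_activation(weights, steps=500, lr=0.2, seed=0):
    """Gradient ascent on the logits; returns the final PWM."""
    rng = np.random.default_rng(seed)
    logits = rng.normal(size=weights.shape)
    for _ in range(steps):
        pwm = softmax(logits)
        # Gradient of the score with respect to the PWM (constant for a
        # linear toy model; a network would compute it by backpropagation).
        grad_pwm = weights
        # Chain rule through the softmax, applied row by row.
        grad_logits = pwm * (grad_pwm - (pwm * grad_pwm).sum(axis=1, keepdims=True))
        logits += lr * grad_logits
    return softmax(logits)
```

A discrete sequence can be read off as `pwm.argmax(axis=1)`. With a linear toy model the PWM and its consensus sequence score almost identically, so this example hides the problem raised in the text: a real predictor trained only on one-hot inputs gives no guarantee for intermediate PWM values.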
We previously demonstrated a hybrid continuous/discrete
solution to this problem: At every iteration, model evaluations
are made using one-hot encoded sequences sampled from the
PWM, but gradients are evaluated with respect to the
continuous input logits. We successfully used this approach
to design sequences with custom alternative polyadenylation
isoform ratios.55 In more recent work,2 we evaluated the effect
of normalizing logits across all positions and introducing per-base scaling and bias factors (Figure 7C). These additions
helped us avoid issues with vanishing gradients, an error mode
where gradients become too small to drive meaningful updates.
As a result, our Fast SeqProp method converges rapidly in a
variety of design tasks, including maximizing transcription
factor binding, transcriptional activity, alternative polyadenylation, and translation using Optimus 5-Prime (Figure 7D).
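The hybrid scheme described above can be sketched with NumPy as follows. This is a simplified illustration under stated assumptions, not the Fast SeqProp implementation: `toy_score` and `toy_grad` are a hypothetical predictor and its hand-written gradient, "per-base" is interpreted here as one scale (`gamma`) and one bias (`beta`) per nucleotide channel, logits are normalized per channel across positions, and plain gradient ascent is used for the updates.

```python
import numpy as np


def toy_score(x, W, lam=0.5):
    """Hypothetical predictor: per-base weights minus a penalty on identical neighbors."""
    return float((W * x).sum() - lam * (x[:-1] * x[1:]).sum())


def toy_grad(x, W, lam=0.5):
    """Gradient of toy_score with respect to its input, evaluated at x."""
    g = W.copy()
    g[:-1] -= lam * x[1:]
    g[1:] -= lam * x[:-1]
    return g


def softmax(s):
    e = np.exp(s - s.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)


def sample_one_hot(pwm, rng):
    """Draw one base per position from the PWM; return a one-hot matrix."""
    u = rng.random((pwm.shape[0], 1))
    idx = (pwm.cumsum(axis=1) < u).sum(axis=1)
    idx = np.minimum(idx, pwm.shape[1] - 1)
    return np.eye(pwm.shape[1])[idx]


def design(W, steps=500, lr=0.1, eps=1e-5, seed=0):
    """Return the best sampled one-hot sequence and the last PWM."""
    rng = np.random.default_rng(seed)
    logits = rng.normal(size=W.shape)
    gamma = np.ones(W.shape[1])   # per-channel scale
    beta = np.zeros(W.shape[1])   # per-channel bias
    best_x, best_score = None, -np.inf
    for _ in range(steps):
        # Forward pass: normalize logits, build the PWM, sample a sequence.
        mu = logits.mean(axis=0)
        sigma = np.sqrt(logits.var(axis=0) + eps)
        n = (logits - mu) / sigma
        pwm = softmax(gamma * n + beta)
        x = sample_one_hot(pwm, rng)
        score = toy_score(x, W)          # the model only sees one-hot input
        if score > best_score:
            best_x, best_score = x, score
        # Backward pass: the gradient at the sample is passed straight
        # through to the PWM, then through softmax and normalization.
        g_pwm = toy_grad(x, W)
        g_s = pwm * (g_pwm - (pwm * g_pwm).sum(axis=1, keepdims=True))
        g_gamma = (g_s * n).sum(axis=0)
        g_beta = g_s.sum(axis=0)
        g_n = g_s * gamma
        g_logits = (g_n - g_n.mean(axis=0) - n * (g_n * n).mean(axis=0)) / sigma
        logits += lr * g_logits
        gamma += lr * g_gamma
        beta += lr * g_beta
    return best_x, pwm
```

Keeping track of the best sampled sequence is a convenience added for this sketch. Because the normalized logits always have unit variance, the sharpness of the PWM is controlled by the learned scale rather than by the raw magnitude of the logits, which is one way to see why normalization can help keep gradients from vanishing.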
Still, despite the speed improvement of Fast SeqProp,
activation maximization methods share a few limitations. First,
the algorithm needs to be run from scratch for every new
generated sequence. Furthermore, optimization might get
stuck in local minima or converge to a region in the sequence
space far from the training data set, where the model is not
accurate. A related issue is the lack of an explicit mechanism to
force generated sequences to be distinct, thus limiting the
diversity of sequences available for experimental testing and
reducing the likelihood of finding one with high performance.
A different class of design methods is based on deep
generative models, neural networks trained to learn the
distribution of a training data set to generate completely new
examples with similar properties.74,76,77 A major advantage
over gradient methods is speed: After training, generating new
examples requires a single evaluation of the generative model
without any iterations; however, the basic versions of these
methods do not optimize sequence performance or explicitly
maximize diversity. We recently developed Deep Exploration
Networks (DENs),3 an activation-maximizing deep generative
model that addresses these limitations (Figure 8). DENs are
trained via gradient descent by minimizing a cost function
composed of two terms: one related to the performance of a
generated sequence as given by an independent, pretrained
predictive model (e.g., Optimus 5-Prime) and the other
computed from a similarity metric between two generated
sequences (Figure 8A). By simultaneously maximizing
performance and minimizing similarity, DENs learn to
generate highly diverse sequences with high performance.
Furthermore, DENs can be restricted from generating
sequences that deviate too much from the sequence space
defined by the training data set of the predictor by using a
variational autoencoder (VAE)76 to penalize deviation during
training (Figure 8A). We showed that DENs can be used to
design sequences with specified alternative polyadenylation
isoform ratios and custom cleavage positions, splicing
regulatory sequences with maximal differential splicing in two
different cell lines, and highly diverse GFP sequences with high
fluorescence. Still, a potential drawback of generative models is
the up-front requirement for model training. Thus we
recommend using Fast SeqProp for simple design tasks and
DENs when maximum performance and large numbers of
diverse sequences are desirable.
Figure 8. Deep Exploration Networks. (A) A DEN is a neural
network that transforms random real-valued vectors (Z) into PWMs,
from which sequences can be sampled. During training, generated
sequences are scored based on their predicted performance and their
similarity to each other. Optionally, another generative network such
as a VAE, trained on the same data as the predictive model, can be
used to make sure DEN-generated sequences do not dramatically
deviate from the training data set. (B) During generation, a single
evaluation of the trained DEN results in a different generated
sequence.
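The two-term cost used to train a DEN can be illustrated with a small sketch. The functional form below is an assumption chosen for clarity and is not the published DEN loss: the performance term is the mean predicted score of two generated PWMs, the similarity term is their expected per-position identity above a margin, and `toy_fitness`, `similarity_weight`, and `similarity_margin` are hypothetical names and values.

```python
import numpy as np


def one_hot(seq, alphabet="ACGU"):
    """Encode a sequence as a one-hot matrix (a PWM with no ambiguity)."""
    idx = [alphabet.index(b) for b in seq]
    return np.eye(len(alphabet))[idx]


def toy_fitness(pwm):
    """Hypothetical predictor: mean probability of A across positions."""
    return float(pwm[:, 0].mean())


def expected_identity(pwm_a, pwm_b):
    """Expected fraction of positions at which samples from the two PWMs agree."""
    return float((pwm_a * pwm_b).sum(axis=1).mean())


def den_style_cost(pwm_a, pwm_b, predict=toy_fitness,
                   similarity_weight=1.0, similarity_margin=0.3):
    """Cost = -(mean predicted performance) + weight * (similarity above margin)."""
    fitness = 0.5 * (predict(pwm_a) + predict(pwm_b))
    excess = max(0.0, expected_identity(pwm_a, pwm_b) - similarity_margin)
    return -fitness + similarity_weight * excess
```

For two outputs with equal predicted performance, the cost is lower when the sequences differ: with `a = one_hot("AC" * 5)` and `b = one_hot("AG" * 5)`, `den_style_cost(a, b)` is smaller than `den_style_cost(a, a)`. Minimizing such a cost over generator parameters therefore rewards both high predicted performance and diversity.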
■ SUMMARY AND OUTLOOK
Improving our ability to map mRNA sequence to function and
vice versa is key to developing a new generation of mRNA
therapeutics. Here we reviewed how an approach combining
high-throughput MPRA data, deep learning, and sequence
design algorithms can be used to characterize mRNA
regulation, extract biological insights, and design novel
sequences with high performance.
We expect this work to be extended in several directions in
the near future. First, MPRAs targeting regions other than the
5′UTR and processes other than translation will be developed.
As a recent example, Qian and coworkers developed an MPRA
with a short randomized uORF in the 5′UTR to study the
interplay between translation and mRNA stability.35 Similarly,
MPRAs targeting the 3′UTR region have been developed to
study the effect of variants on mRNA abundance78 and
subcellular transcript localization in neurons.79 Second, models
capable of predicting multiple biomolecular processes from
sequence should be further developed. For example, recent
work found that sequences selected to have an exceptionally
high MRL could result in low mRNA stability,36 suggesting
that optimizing a predictor that models translation alone may
not be an optimal strategy for maximizing protein expression.
Similarly, in a test of six synthetic 5′UTRs with very high
MRLs (7.8−10), a range of expression levels was observed,
possibly because of confounding effects of the 5′UTR
sequence on stability or even cell toxicity.59 Finally, we expect
understanding and engineering cell-type-specific expression to
be a major goal going forward, as targeting expression to
specific cell or tissue types will limit the side effects of future
mRNA therapeutics. Whereas some cell-type specificity can
currently be achieved by pasting miRNA binding elements into
the 3′UTR,80,81 we expect model-based design to further
increase the specificity and allow the targeting of cell types and
tissues that are currently inaccessible.
■ AUTHOR INFORMATION
Corresponding Author
Georg Seelig − Department of Electrical & Computer
Engineering and Paul G. Allen School of Computer Science &
Engineering, University of Washington, Seattle, Washington
98195, United States; orcid.org/0000-0002-3163-8782;
Email: gseelig@uw.edu
Author
Sebastian M. Castillo-Hair − Department of Electrical &
Computer Engineering and eScience Institute, University of
Washington, Seattle, Washington 98195, United States;
orcid.org/0000-0002-2384-3129
Complete contact information is available at:
https://pubs.acs.org/10.1021/acs.accounts.1c00621
Notes
The authors declare no competing financial interest.
Biographies
Sebastian M. Castillo-Hair is a Data Science Postdoctoral Fellow at
the eScience Institute and the Department of Electrical and Computer
Engineering at University of Washington. He studies how high-throughput
assays and machine learning can be used to engineer novel
synthetic biological systems. Previously, he obtained his Ph.D. in
Bioengineering at Rice University.
Georg Seelig is a Professor at the University of Washington. His
research interests are in synthetic biology and genomics.
■ ACKNOWLEDGMENTS
We thank Johannes Linder for feedback on this manuscript.
This work was supported by NIH Awards R01GM120379 and
R01HG009892 to G.S., and by the University of Washington
eScience Institute with support from the Washington Research
Foundation to S.M.C.
■ REFERENCES
(1) Sample, P. J.; Wang, B.; Reid, D. W.; Presnyak, V.; McFadyen, I.
J.; Morris, D. R.; Seelig, G. Human 5′ UTR Design and Variant Effect
Prediction from a Massively Parallel Translation Assay. Nat.
Biotechnol. 2019, 37, 803−809.
(2) Linder, J.; Seelig, G. Fast Activation Maximization for Molecular
Sequence Design. BMC Bioinf. 2021, 22, 510.
(3) Linder, J.; Bogard, N.; Rosenberg, A. B.; Seelig, G. A Generative
Neural Network for Maximizing Fitness and Diversity of Synthetic
DNA and Protein Sequences. Cell Systems 2020, 11, 49−62.
(4) Polack, F. P.; Thomas, S. J.; Kitchin, N.; Absalon, J.; Gurtman,
A.; Lockhart, S.; Perez, J. L.; Pérez Marc, G.; Moreira, E. D.; Zerbini,
C.; Bailey, R.; Swanson, K. A.; Roychoudhury, S.; Koury, K.; Li, P.;
Kalina, W. V.; Cooper, D.; Frenck, R. W.; Hammitt, L. L.; Türeci, Ö.;
Nell, H.; Schaefer, A.; Ünal, S.; Tresnan, D. B.; Mather, S.; Dormitzer,
P. R.; Şahin, U.; Jansen, K. U.; Gruber, W. C. Safety and Efficacy of
the BNT162b2 mRNA Covid-19 Vaccine. N. Engl. J. Med. 2020, 383,
2603−2615.
(5) Baden, L. R.; El Sahly, H. M.; Essink, B.; Kotloff, K.; Frey, S.;
Novak, R.; Diemert, D.; Spector, S. A.; Rouphael, N.; Creech, C. B.;
McGettigan, J.; Khetan, S.; Segall, N.; Solis, J.; Brosz, A.; Fierro, C.;
Schwartz, H.; Neuzil, K.; Corey, L.; Gilbert, P.; Janes, H.; Follmann,
D.; Marovich, M.; Mascola, J.; Polakowski, L.; Ledgerwood, J.;
Graham, B. S.; Bennett, H.; Pajon, R.; Knightly, C.; Leav, B.; Deng,
W.; Zhou, H.; Han, S.; Ivarsson, M.; Miller, J.; Zaks, T. Efficacy and
Safety of the MRNA-1273 SARS-CoV-2 Vaccine. N. Engl. J. Med.
2021, 384, 403−416.
(6) Chaudhary, N.; Weissman, D.; Whitehead, K. A. MRNA
Vaccines for Infectious Diseases: Principles, Delivery and Clinical
Translation. Nat. Rev. Drug Discovery 2021, 20, 817−838.
(7) Wu, K.; Choi, A.; Koch, M.; Elbashir, S.; Ma, L.; Lee, D.; Woods,
A.; Henry, C.; Palandjian, C.; Hill, A.; Jani, H.; Quinones, J.; Nunna,
N.; O’Connell, S.; McDermott, A. B.; Falcone, S.; Narayanan, E.;
Colpitts, T.; Bennett, H.; Corbett, K. S.; Seder, R.; Graham, B. S.;
Stewart-Jones, G. B.; Carfi, A.; Edwards, D. K. Variant SARS-CoV-2
mRNA vaccines confer broad neutralization as primary or booster
series in mice. bioRxiv 2021, DOI: 10.1101/2021.04.13.439482.
(8) Sebastian, M.; Schröder, A.; Scheel, B.; Hong, H. S.; Muth, A.;
von Boehmer, L.; Zippelius, A.; Mayer, F.; Reck, M.; Atanackovic, D.;
Thomas, M.; Schneller, F.; Stöhlmacher, J.; Bernhard, H.; Gröschel,
A.; Lander, T.; Probst, J.; Strack, T.; Wiegand, V.; Gnad-Vogt, U.;
Kallen, K.-J.; Hoerr, I.; von der Muelbe, F.; Fotin-Mleczek, M.; Knuth,
A.; Koch, S. D. A Phase I/IIa Study of the MRNA-Based Cancer
Immunotherapy CV9201 in Patients with Stage IIIB/IV Non-Small
Cell Lung Cancer. Cancer Immunol. Immunother. 2019, 68, 799−812.
(9) Papachristofilou, A.; Hipp, M. M.; Klinkhardt, U.; Früh, M.;
Sebastian, M.; Weiss, C.; Pless, M.; Cathomas, R.; Hilbe, W.; Pall, G.;
Wehler, T.; Alt, J.; Bischoff, H.; Geißler, M.; Griesinger, F.; Kallen, K.-J.;
Fotin-Mleczek, M.; Schröder, A.; Scheel, B.; Muth, A.; Seibel, T.;
Stosnach, C.; Doener, F.; Hong, H. S.; Koch, S. D.; Gnad-Vogt, U.;
Zippelius, A. Phase Ib Evaluation of a Self-Adjuvanted Protamine
Formulated MRNA-Based Active Cancer Immunotherapy, BI1361849
(CV9202), Combined with Local Radiation Treatment in Patients
with Stage IV Non-Small Cell Lung Cancer. J. Immunother. Cancer
2019, 7, 38.
(10) Beck, J. D.; Reidenbach, D.; Salomon, N.; Sahin, U.; Türeci, Ö.;
Vormehr, M.; Kranz, L. M. MRNA Therapeutics in Cancer
Immunotherapy. Mol. Cancer 2021, 20, 69.
(11) Kwon, H.; Kim, M.; Seo, Y.; Moon, Y. S.; Lee, H. J.; Lee, K.;
Lee, H. Emergence of Synthetic MRNA: In Vitro Synthesis of MRNA
and Its Applications in Regenerative Medicine. Biomaterials 2018,
156, 172−193.
(12) Warren, L.; Lin, C. MRNA-Based Genetic Reprogramming.
Mol. Ther. 2019, 27, 729−734.
(13) Chanda, P. K.; Sukhovershin, R.; Cooke, J. P. MRNA-Enhanced
Cell Therapy and Cardiovascular Regeneration. Cells 2021,
10, 187.
(14) Magadum, A.; Kaur, K.; Zangi, L. MRNA-Based Protein
Replacement Therapy for the Heart. Mol. Ther. 2019, 27, 785−793.
(15) Trepotec, Z.; Lichtenegger, E.; Plank, C.; Aneja, M. K.;
Rudolph, C. Delivery of MRNA Therapeutics for the Treatment of
Hepatic Diseases. Mol. Ther. 2019, 27, 794−802.
(16) Sahin, U.; Karikó, K.; Türeci, Ö. MRNA-Based Therapeutics:
Developing a New Class of Drugs. Nat. Rev. Drug Discovery 2014, 13,
759−780.
(17) Pardi, N.; Hogan, M. J.; Porter, F. W.; Weissman, D. MRNA
Vaccines: a New Era in Vaccinology. Nat. Rev. Drug Discovery 2018,
17, 261−279.
(18) Stepinski, J.; Waddell, C.; Stolarski, R.; Darzynkiewicz, E.;
Rhoads, R. E. Synthesis and Properties of MRNAs Containing the
Novel “Anti-Reverse” Cap Analogs 7-Methyl(3′-O-Methyl)GpppG
and 7-Methyl(3′-Deoxy)GpppG. RNA 2001, 7, 1486−1495.
(19) Jemielity, J.; Fowler, T.; Zuberek, J.; Stepinski, J.; Lewdorowicz,
M.; Niedzwiecka, A.; Stolarski, R.; Darzynkiewicz, E.; Rhoads, R. E.
Novel “Anti-Reverse” Cap Analogs with Superior Translational
Properties. RNA 2003, 9, 1108−1122.
(20) Kuhn, A. N.; Diken, M.; Kreiter, S.; Selmi, A.; Kowalska, J.;
Jemielity, J.; Darzynkiewicz, E.; Huber, C.; Türeci, Ö.; Sahin, U.
Phosphorothioate Cap Analogs Increase Stability and Translational
Efficiency of RNA Vaccines in Immature Dendritic Cells and Induce
Superior Immune Responses in Vivo. Gene Ther. 2010, 17, 961−971.
(21) Kocmik, I.; Piecyk, K.; Rudzinska, M.; Niedzwiecka, A.;
Darzynkiewicz, E.; Grzela, R.; Jankowska-Anyszka, M. Modified
ARCA Analogs Providing Enhanced Translational Properties of
Capped MRNAs. Cell Cycle 2018, 17, 1624−1636.
(22) Ensinger, M. J.; Martin, S. A.; Paoletti, E.; Moss, B.
Modification of the 5′-Terminus of MRNA by Soluble Guanylyl
and Methyl Transferases from Vaccinia Virus. Proc. Natl. Acad. Sci. U.
S. A. 1975, 72, 2525−2529.
(23) Yisraeli, J. K.; Melton, D. A. [4] Synthesis of Long, Capped
Transcripts in Vitro by SP6 and T7 RNA Polymerases. In Methods in
Enzymology; RNA Processing Part A: General Methods; Academic
Press, 1989; Vol. 180, pp 42−50.
(24) Yunus, M. A.; Chung, L. M. W.; Chaudhry, Y.; Bailey, D.;
Goodfellow, I. Development of an Optimized RNA-Based Murine
Norovirus Reverse Genetics System. J. Virol. Methods 2010, 169,
112−118.
(25) Henderson, J. M.; Ujita, A.; Hill, E.; Yousif-Rosales, S.; Smith,
C.; Ko, N.; McReynolds, T.; Cabral, C. R.; Escamilla-Powers, J. R.;
Houston, M. E. Cap 1 Messenger RNA Synthesis with Co-Transcriptional CleanCap® Analog by In Vitro Transcription. Current
Protocols 2021, 1, e39.
(26) Karikó, K.; Muramatsu, H.; Welsh, F. A.; Ludwig, J.; Kato, H.;
Akira, S.; Weissman, D. Incorporation of Pseudouridine Into MRNA
Yields Superior Nonimmunogenic Vector With Increased Translational Capacity and Biological Stability. Mol. Ther. 2008, 16, 1833−
1840.
(27) Andries, O.; Mc Cafferty, S.; De Smedt, S. C.; Weiss, R.;
Sanders, N. N.; Kitada, T. N1-Methylpseudouridine-Incorporated
MRNA Outperforms Pseudouridine-Incorporated MRNA by Providing Enhanced Protein Expression and Reduced Immunogenicity in
Mammalian Cell Lines and Mice. J. Controlled Release 2015, 217,
337−344.
(28) Li, B.; Luo, X.; Dong, Y. Effects of Chemically Modified
Messenger RNA on Protein Expression. Bioconjugate Chem. 2016, 27,
849−853.
(29) Hou, X.; Zaks, T.; Langer, R.; Dong, Y. Lipid Nanoparticles for
MRNA Delivery. Nat. Rev. Mater. 2021, 1−17.
(30) Weng, Y.; Li, C.; Yang, T.; Hu, B.; Zhang, M.; Guo, S.; Xiao,
H.; Liang, X.-J.; Huang, Y. The Challenge and Prospect of MRNA
Therapeutics Landscape. Biotechnol. Adv. 2020, 40, 107534.
(31) Orlandini von Niessen, A. G.; Poleganov, M. A.; Rechner, C.;
Plaschke, A.; Kranz, L. M.; Fesser, S.; Diken, M.; Löwer, M.; Vallazza,
B.; Beissert, T.; Bukur, V.; Kuhn, A. N.; Türeci, Ö.; Sahin, U.
Improving MRNA-Based Therapeutic Gene Delivery by Expression-Augmenting 3′ UTRs Identified by Cellular Library Screening. Mol.
Ther. 2019, 27, 824−836.
(32) Roth, N.; Schön, J.; Hoffmann, D.; Thran, M.; Thess, A.;
Mueller, S. O.; Petsch, B.; Rauch, S. CV2CoV, an Enhanced MRNA-Based SARS-CoV-2 Vaccine Candidate, Supports Higher Protein
Expression and Improved Immunogenicity in Rats. bioRxiv 2021,
DOI: 10.1101/2021.05.13.443734.
(33) Gerstberger, S.; Hafner, M.; Tuschl, T. A Census of Human
RNA-Binding Proteins. Nat. Rev. Genet. 2014, 15, 829−845.
(34) Sood, P.; Krek, A.; Zavolan, M.; Macino, G.; Rajewsky, N. Cell-Type-Specific Signatures of MicroRNAs on Target MRNA
Expression. Proc. Natl. Acad. Sci. U. S. A. 2006, 103, 2746−2751.
(35) Jia, L.; Mao, Y.; Ji, Q.; Dersh, D.; Yewdell, J. W.; Qian, S.-B.
Decoding MRNA Translatability and Stability from the 5′ UTR. Nat.
Struct. Mol. Biol. 2020, 27, 814−821.
(36) Leppek, K.; Byeon, G. W.; Kladwang, W.; Wayment-Steele, H.
K.; Kerr, C. H.; Xu, A. F.; Kim, D. S.; Topkar, V. V.; Choe, C.;
Rothschild, D.; Tiu, G. C.; Wellington-Oguri, R.; Fujii, K.; Sharma, E.;
Watkins, A. M.; Nicol, J. J.; Romano, J.; Tunguz, B.; Participants, E.;
Barna, M.; Das, R. Combinatorial Optimization of MRNA Structure,
Stability, and Translation for RNA-Based Therapeutics. bioRxiv 2021,
DOI: 10.1101/2021.03.29.437587.
(37) Jackson, R. J.; Hellen, C. U. T.; Pestova, T. V. The Mechanism
of Eukaryotic Translation Initiation and Principles of Its Regulation.
Nat. Rev. Mol. Cell Biol. 2010, 11, 113−127.
(38) Kozak, M. Point Mutations Define a Sequence Flanking the
AUG Initiator Codon That Modulates Translation by Eukaryotic
Ribosomes. Cell 1986, 44, 283−292.
(39) Kozak, M. Structural Features in Eukaryotic MRNAs That
Modulate the Initiation of Translation. J. Biol. Chem. 1991, 266,
19867−19870.
(40) Hinnebusch, A. G.; Ivanov, I. P.; Sonenberg, N. Translational
Control by 5′-Untranslated Regions of Eukaryotic MRNAs. Science
2016, 352, 1413−1416.
(41) Leppek, K.; Das, R.; Barna, M. Functional 5′ UTR MRNA
Structures in Eukaryotic Translation Regulation and How to Find
Them. Nat. Rev. Mol. Cell Biol. 2018, 19, 158−174.
(42) Avni, D.; Biberman, Y.; Meyuhas, O. The 5′ Terminal
Oligopyrimidine Tract Confers Translational Control on Top Mrnas
in a Cell Type-and Sequence Context-Dependent Manner. Nucleic
Acids Res. 1997, 25, 995−1001.
(43) Hellen, C. U. T.; Sarnow, P. Internal Ribosome Entry Sites in
Eukaryotic MRNA Molecules. Genes Dev. 2001, 15, 1593−1612.
(44) Ingolia, N. T.; Ghaemmaghami, S.; Newman, J. R. S.;
Weissman, J. S. Genome-Wide Analysis in Vivo of Translation with
Nucleotide Resolution Using Ribosome Profiling. Science 2009, 324,
218−223.
(45) Floor, S. N.; Doudna, J. A. Tunable Protein Synthesis by
Transcript Isoforms in Human Cells. eLife 2016, 5, No. e10921.
(46) Blair, J. D.; Hockemeyer, D.; Doudna, J. A.; Bateup, H. S.;
Floor, S. N. Widespread Translational Remodeling during Human
Neuronal Differentiation. Cell Rep. 2017, 21, 2005−2016.
(47) Sun, C.; Shrivastava, A.; Singh, S.; Gupta, A. Revisiting
Unreasonable Effectiveness of Data in Deep Learning Era. arXiv
[cs.CV], August 4, 2017, 1707.02968.
https://arxiv.org/abs/1707.02968 (accessed 2021-09-06).
(48) Kinney, J. B.; McCandlish, D. M. Massively Parallel Assays and
Quantitative Sequence−Function Relationships. Annu. Rev. Genomics
Hum. Genet. 2019, 20, 99−127.
(49) Noderer, W. L.; Flockhart, R. J.; Bhaduri, A.; Diaz de Arce, A.
J.; Zhang, J.; Khavari, P. A.; Wang, C. L. Quantitative Analysis of
Mammalian Translation Initiation Sites by FACS-Seq. Mol. Syst. Biol.
2014, 10, 748.
(50) Diaz de Arce, A. J.; Noderer, W. L.; Wang, C. L. Complete
Motif Analysis of Sequence Requirements for Translation Initiation at
Non-AUG Start Codons. Nucleic Acids Res. 2018, 46, 985−994.
(51) Cuperus, J. T.; Groves, B.; Kuchina, A.; Rosenberg, A. B.; Jojic,
N.; Fields, S.; Seelig, G. Deep Learning of the Regulatory Grammar of
Yeast 5′ Untranslated Regions from 500,000 Random Sequences.
Genome Res. 2017, 27, 2015−2024.
(52) Ke, S.; Shang, S.; Kalachikov, S. M.; Morozova, I.; Yu, L.;
Russo, J. J.; Ju, J.; Chasin, L. A. Quantitative Evaluation of All
Hexamers as Exonic Splicing Elements. Genome Res. 2011, 21, 1360−
1374.
(53) Rosenberg, A. B.; Patwardhan, R. P.; Shendure, J.; Seelig, G.
Learning the Sequence Determinants of Alternative Splicing from
Millions of Random Sequences. Cell 2015, 163, 698−711.
(54) Eraslan, G.; Avsec, Ž.; Gagneur, J.; Theis, F. J. Deep Learning:
New Computational Modelling Techniques for Genomics. Nat. Rev.
Genet. 2019, 20, 389−403.
(55) Bogard, N.; Linder, J.; Rosenberg, A. B.; Seelig, G. A Deep
Neural Network for Predicting and Engineering Alternative
Polyadenylation. Cell 2019, 178, 91−106.
(56) Ferreira, J. P.; Overton, K. W.; Wang, C. L. Tuning Gene
Expression with Synthetic Upstream Open Reading Frames. Proc.
Natl. Acad. Sci. U. S. A. 2013, 110, 11284−11289.
(57) Landrum, M. J.; Lee, J. M.; Benson, M.; Brown, G.; Chao, C.;
Chitipiralla, S.; Gu, B.; Hart, J.; Hoffman, D.; Hoover, J.; Jang, W.;
Katz, K.; Ovetsky, M.; Riley, G.; Sethi, A.; Tully, R.; Villamarin-Salomon, R.; Rubinstein, W.; Maglott, D. R. ClinVar: Public Archive
of Interpretations of Clinically Relevant Variants. Nucleic Acids Res.
2016, 44, D862−D868.
(58) Karollus, A.; Avsec, Ž.; Gagneur, J. Predicting Mean Ribosome
Load for 5′UTR of Any Length Using Deep Learning. PLoS Comput.
Biol. 2021, 17, No. e1008982.
(59) Linares-Fernández, S.; Moreno, J.; Lambert, E.; Mercier-Gouy,
P.; Vachez, L.; Verrier, B.; Exposito, J.-Y. Combining an Optimized
MRNA Template with a Double Purification Process Allows Strong
Expression of in Vitro Transcribed MRNA. Mol. Ther.–Nucleic Acids
2021, 26, 945−956.
(60) Breathnach, R.; Harris, B. A. Plasmids for the Cloning and
Expression of Full-Length Double-Stranded CDNAs under Control of
the SV40 Early or Late Gene Promoter. Nucleic Acids Res. 1983, 11,
7119−7136.
(61) Studier, F. W.; Moffatt, B. A. Use of Bacteriophage T7 RNA
Polymerase to Direct Selective High-Level Expression of Cloned
Genes. J. Mol. Biol. 1986, 189, 113−130.
(62) Alper, H.; Fischer, C.; Nevoigt, E.; Stephanopoulos, G. Tuning
Genetic Control through Promoter Engineering. Proc. Natl. Acad. Sci.
U. S. A. 2005, 102, 12678−12683.
(63) Nevoigt, E.; Kohnke, J.; Fischer, C. R.; Alper, H.; Stahl, U.;
Stephanopoulos, G. Engineering of Promoter Replacement Cassettes
for Fine-Tuning of Gene Expression in Saccharomyces Cerevisiae.
Appl. Environ. Microbiol. 2006, 72, 5266−5273.
(64) de Boer, H. A.; Comstock, L. J.; Vasser, M. The Tac Promoter:
A Functional Hybrid Derived from the Trp and Lac Promoters. Proc.
Natl. Acad. Sci. U. S. A. 1983, 80, 21−25.
(65) Mutalik, V. K.; Guimaraes, J. C.; Cambray, G.; Lam, C.;
Christoffersen, M. J.; Mai, Q.-A.; Tran, A. B.; Paull, M.; Keasling, J.
D.; Arkin, A. P.; Endy, D. Precise and Reliable Gene Expression via
Standard Transcription and Translation Initiation Elements. Nat.
Methods 2013, 10, 354−360.
(66) Lutz, R.; Bujard, H. Independent and Tight Regulation of
Transcriptional Units in Escherichia Coli Via the LacR/O, the TetR/
O and AraC/I1-I2 Regulatory Elements. Nucleic Acids Res. 1997, 25,
1203−1210.
(67) Chao, S.-H.; Harada, J. N.; Hyndman, F.; Gao, X.; Nelson, C.
G.; Chanda, S. K.; Caldwell, J. S. PDX1, a Cellular Homeoprotein,
Binds to and Regulates the Activity of Human Cytomegalovirus
Immediate Early Promoter. J. Biol. Chem. 2004, 279, 16111−16120.
(68) Magnusson, T.; Haase, R.; Schleef, M.; Wagner, E.; Ogris, M.
Sustained, High Transgene Expression in Liver with Plasmid Vectors
Using Optimized Promoter-Enhancer Combinations. Journal of Gene
Medicine 2011, 13, 382−391.
(69) Blazeck, J.; Garg, R.; Reed, B.; Alper, H. S. Controlling
Promoter Strength and Regulation in Saccharomyces Cerevisiae
Using Synthetic Hybrid Promoters. Biotechnol. Bioeng. 2012, 109,
2884−2895.
(70) Salis, H. M.; Mirsky, E. A.; Voigt, C. A. Automated Design of
Synthetic Ribosome Binding Sites to Control Protein Expression. Nat.
Biotechnol. 2009, 27, 946−950.
(71) Eiben, A. E.; Smith, J. From Evolutionary Computation to the
Evolution of Things. Nature 2015, 521, 476−482.
(72) Taneda, A. Multi-Objective Genetic Algorithm for Pseudoknotted RNA Sequence Design. Front. Genet. 2012, 3, 36.
(73) Cao, J.; Novoa, E. M.; Zhang, Z.; Chen, W. C. W.; Liu, D.;
Choi, G. C. G.; Wong, A. S. L.; Wehrspaun, C.; Kellis, M.; Lu, T. K.
High-Throughput 5′ UTR Engineering for Enhanced Protein
Production in Non-Viral Gene Therapies. Nat. Commun. 2021, 12,
4138.
(74) Killoran, N.; Lee, L. J.; Delong, A.; Duvenaud, D.; Frey, B. J.
Generating and Designing DNA with Deep Generative Models. arXiv
[cs.LG], December 17, 2017, 1712.06148.
https://arxiv.org/abs/1712.06148 (accessed 2021-09-28).
(75) Lanchantin, J.; Singh, R.; Lin, Z.; Qi, Y. Deep Motif: Visualizing
Genomic Sequence Classifications. arXiv [cs.LG], June 2, 2016,
1605.01133. https://arxiv.org/abs/1605.01133 (accessed 2021-09-28).
(76) Kingma, D. P.; Welling, M. Auto-Encoding Variational Bayes.
arXiv [stat.ML], May 1, 2014, 1312.6114.
https://arxiv.org/abs/1312.6114 (accessed 2021-08-28).
(77) Goodfellow, I. J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial
Networks. arXiv [stat.ML], June 10, 2014, 1406.2661.
https://arxiv.org/abs/1406.2661 (accessed 2021-08-28).
(78) Griesemer, D.; Xue, J. R.; Reilly, S. K.; Ulirsch, J. C.; Kukreja,
K.; Davis, J. R.; Kanai, M.; Yang, D. K.; Butts, J. C.; Guney, M. H.;
Luban, J.; Montgomery, S. B.; Finucane, H. K.; Novina, C. D.;
Tewhey, R.; Sabeti, P. C. Genome-Wide Functional Screen of 3′UTR
Variants Uncovers Causal Variants for Human Disease and Evolution.
Cell 2021, 184, 5247−5260.e19.
(79) Mikl, M.; Eletto, D.; Lee, M.; Lafzi, A.; Mhamedi, F.; Sain, S.
B.; Handler, K.; Moor, A. E. A Massively Parallel Reporter Assay
Reveals Focused and Broadly Encoded RNA Localization Signals in
Neurons. bioRxiv 2021, DOI: 10.1101/2021.04.27.441590.
(80) Xie, Z.; Wroblewska, L.; Prochazka, L.; Weiss, R.; Benenson, Y.
Multi-Input RNAi-Based Logic Circuit for Identification of Specific
Cancer Cells. Science 2011, 333, 1307−1311.
(81) Jain, R.; Frederick, J. P.; Huang, E. Y.; Burke, K. E.; Mauger, D.
M.; Andrianova, E. A.; Farlow, S. J.; Siddiqui, S.; Pimentel, J.;
Cheung-Ong, K.; McKinney, K. M.; Köhrer, C.; Moore, M. J.;
Chakraborty, T. MicroRNAs Enable MRNA Therapeutics to
Selectively Program Cancer Cells to Self-Destruct. Nucleic Acid
Ther. 2018, 28, 285−296.