Methods for Gamification of Molecular Engineering by Directed Evolution
I. Introduction
The tremendous promise of synthetic biology, and in particular of molecular
engineering by directed evolution, has yet to be realized. Relevant techniques still
require expensive reagents, finicky instrumentation, and great expertise in
molecular biology protocols. It would be a superlative boon to this field if
more people with the requisite curiosity, scientific creativity, and raw above-average intellect normally found in a full-time professional scientist could
independently play the game of research without the many years of formal training
and significant financial and infrastructural resources typically required.
Meanwhile, billions of person-hours’ worth of human brainpower are expended
every day to solve the problems encountered within videogames. While this
phenomenon has been put to good use to solve scientific problems requiring
imagination, calculation, intuition, and pattern matching, it has not been deployed to
allow game players to independently experiment upon actual physical systems. We
describe methods that turn the physical process of molecular engineering by
directed evolution into a game that anyone can play with no more infrastructure
than access to the Internet. This will vastly increase the number of people who can
contribute to synthetic biology research and will thus vastly accelerate the rate of
progress in this life-saving, world-changing, mind-blowing field.
II. List of figures
Figure 1 Diagram of user interface for continuous directed evolution via synthesis
and sequencing
Figure 2 Flowchart showing how the game is played
Figure 3 Diagram of exemplary “genetic avatars”, “constructavars”, or “molavatars”
compared to sequences of the constructs for which they act as avatars, or visual
representations.
Figure 4 Process flow showing what happens physically/remotely during
gameplay
Figure 5 Representative program written in OODLES, Object Oriented Design
Language for Experimental Science
Figure 6 Representative screenshot of one embodiment of the invention
Figure 7 List of positions and possible qualities for describing one particular
“genetic avatar”, “constructavar”, or “molavatar” as a monkey based on data from
Seelig and Szostak (2007).
Figure 8 Reference, figure 3 from Seelig and Szostak (2007).
III. Description
The invention includes methods enabling a person with little scientific knowledge or
physical or financial resources to perform real-world experiments in order to
engineer molecules with desired properties. These methods include but are not
limited to 1) procedures for interfacing between player turns/actions and what
must happen physically at a remote semi-automated facility in correspondence to
those inputs, 2) software and user interface implementations of game flow that
enable the player to completely ignore any and all details of experimental protocol
underlying the game, 3) game design features that allow a player to provide rational
and intuitive guidance to directed evolution of specific biomolecular constructs
without any knowledge of their sequence or biochemical or structural details, 4)
game design features that allow a self-regulating economy of physical resources in
terms of virtual actions, and 5) software implementations that allow experts to hack,
refine, repeat, and scale all the actions taken by presumably naïve but successful
players.
Figure 1 shows a diagram of the minimal elements required of the invention in a
generically described main user interface. The game consists of any number of
missions that represent desired goals in molecular engineering, such as enzymes
that take the carbon atoms in C12H22O11 and add them to a growing lattice of sp3-hybridized carbon atoms, i.e., turning table sugar into diamonds. The player can be
treated to such details in some embodiments, but it is noteworthy that game play
can proceed usefully toward engineering goals without knowledge of these details.
The main user interface has four minimal generic requirements: 1a) the gene pool, a
pool of genes that the player can use, 1b) the ability to define new genes or edit old
ones, 1c) elective evolutionary protocol submission, i.e., the ability to choose to subject
genes to evolutionary procedures, and 1d) display of results, i.e., information on how
specific genes fare when subjected to evolutionary protocols.
In some embodiments, the gene pool can be augmented using genes shown to have
various qualities by other players, and/or the gene pool can be augmented by
designing genes individually from scratch and/or editing existing genes. Of note is
that component 1b only requires the player to be able to perceive and design details
of easily visualized, concrete, or familiar objects, as described below in our
description of figure 3. Component 1c does not require any knowledge on the
player’s part of the details of the underlying evolutionary protocols; the player only knows
that the genes will be subjected to evolutionary pressure of a specific kind, and that
afterwards the player will see, per component 1d, whether and how much any aspects of the
genes changed during the process. Thus 1a-1d is a set of requirements for a useful
direct molecular engineering game in which a player can complete the loop of turn
cycles described in figure 2, which we shall now describe.
Figure 2 is a flowchart that tracks a player’s series of actions in the course of the
game, i.e., how the game can be played at a generic level. First, the player sees if he
or she has any genes to use. If no genes are available, genes must be constructed
virtually. Genes may be constructed using the technology described later in figure
3, so that no knowledge of genetics or biochemistry is needed.
Once genes are available, the player faces a virtual economic choice that
corresponds to the actual real-world economic costs of doing experiments. The
player’s virtual resources may be represented by a virtual currency, such as a
statistic associated with the player’s “character” (in a role-playing game-like
context) that corresponds to this virtual wealth. In one embodiment, the currency is
referred to as “nucleomana”, which evokes “nucleic acid” and also the concept of
“mana”, the spiritual or magical power often used as a reservoir for the ability to cast
spells in fantasy-based role-playing games. This aspect of the invention allows
straightforward harmonization between virtual and real-world economic resources,
which is foreseen to be necessary until the underlying technologies are themselves
free to use in their entirety. It is important to note that for clarity of play in some
embodiments the need to spend “nucleomana” may happen prior to gene
construction, as it is simpler to think of paying for the construction of a gene than
to imagine paying for the more complicated, multifarious, and à la carte
processes involved in the next step. In practice, this hardly makes a difference.
In reality, as will be noted later, the genes are not actually physically made until the
player has committed to experimenting with them; they are not synthesized in the
real world at the same time as they are synthesized virtually.
It is notable that during gene construction, players can set the precise degree to
which they would like the gene to be randomized during synthesis, with full
granular control over every element’s degree of synthetic randomization. Players
may even copy and paste genes in the gene pool in order to edit the underlying
randomization factors assumed during synthesis. This randomization allows for
tuning of the inputs to an “evolution” so that the optimal degree of randomization
can be determined in terms of optimal evolutionary outputs.
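As a minimal sketch of such per-element randomization, and not a description of any actual synthesis chemistry, a gene design could be stored as a list of positions that each carry their own randomization factor. The names `Position` and `synthesize_variant`, the use of amino-acid letters rather than nucleotides, and the uniform choice among the 20 standard residues are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes


@dataclass(frozen=True)
class Position:
    residue: str          # residue chosen by the player for this position
    randomization: float  # assumed meaning: probability (0..1) of a random draw


def synthesize_variant(design, rng):
    """Sample one sequence from a design, position by position."""
    out = []
    for pos in design:
        if rng.random() < pos.randomization:
            # A random draw may happen to reproduce the designed residue.
            out.append(rng.choice(AMINO_ACIDS))
        else:
            out.append(pos.residue)
    return "".join(out)


# Hypothetical three-position design: fixed, half-randomized, fully randomized.
design = [Position("E", 0.0), Position("S", 0.5), Position("Y", 1.0)]
rng = random.Random(0)
library = [synthesize_variant(design, rng) for _ in range(5)]
```

Copying a design and changing only the `randomization` values would correspond to the copy-and-paste editing of randomization factors described above.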
Further, total virtual resources will be depleted in a manner corresponding to actual
physical resources. As one example, a player may commit genes to an “evolution”,
but wish to have more physical resources committed to the synthesis and input of
that particular gene. Thus, players will be able to place the same gene multiple
times into one evolution, with each instance representing actual physical resource
use. However, the design of the economic cycle should take into account startup
costs, where the cost to synthesize a small amount of one precisely built gene is
greater than the per-unit cost of synthesizing more of it.
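One simple way to model such startup costs, offered only as a sketch, is a fixed setup charge per gene plus a smaller charge per additional unit. The function name `synthesis_cost` and the currency amounts below are hypothetical and do not come from any embodiment.

```python
def synthesis_cost(units, setup_cost=100, unit_cost=10):
    """Virtual currency charged for placing one gene `units` times in an evolution.

    setup_cost and unit_cost are illustrative placeholders.
    """
    if units < 1:
        raise ValueError("units must be at least 1")
    return setup_cost + unit_cost * units


cost_one = synthesis_cost(1)      # setup charge plus one unit
cost_three = synthesis_cost(3)    # setup charge plus three units
average_three = cost_three / 3    # average cost per unit falls as units rise
```

Under this model the first instance of a gene is the most expensive, and each repeated instance in the same evolution adds only the per-unit charge.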
Once the player is assured of adequate virtual resources in the form of
“nucleomana” or “hit points”, or whatever form the particular embodiment uses, the
player then decides how to apportion the available genes between various
“evolutions”, a broadly encompassing pseudo-neologism corresponding to
everything between input and output in the realm of molecular engineering via
directed evolution. The neologism may be employed to gloss over underlying
complexities of design and implementation with which the player need not be
concerned. To a PHOSITA (Person Having Ordinary Skill In The Art), “evolutions”
will be understood as referring to particular selection and screening procedures
that result in different physical outcomes for the constructs subjected to them,
where some outcomes are considered to indicate greater evolutionary fitness than
others. Once the player has apportioned various genes to various “evolutions”, he
can then inspect the results, which comprise a “results gene pool” from which he can
draw to continue the cycle. Importantly, the “results gene pool” may consist of
many new variants that were measured as having come to dominate a greater
portion of the population than they did in the starting materials. Thus it is useful
though not strictly necessary to have the player define the degree to which the
specific details of his gene may be randomized, so that unexpected variants can
emerge and greater swaths of sequence space can be explored.
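The notion of a variant coming to dominate a greater portion of the population can be sketched as an enrichment ratio computed from sequencing reads taken before and after an evolution. The function name `enrichment` and the toy reads are assumptions, and variants absent from the starting material are simply skipped here rather than handled properly.

```python
from collections import Counter


def enrichment(before, after):
    """Ratio of each variant's population fraction after vs. before selection."""
    before_counts = Counter(before)
    after_counts = Counter(after)
    ratios = {}
    for variant, n_after in after_counts.items():
        n_before = before_counts.get(variant, 0)
        if n_before == 0:
            continue  # not seen in the input; ratio is undefined in this sketch
        fraction_after = n_after / len(after)
        fraction_before = n_before / len(before)
        ratios[variant] = fraction_after / fraction_before
    return ratios


# Toy reads (hypothetical): "AAB" grows from 1/4 to 3/4 of the population.
before = ["AAA", "AAA", "AAB", "AAC"]
after = ["AAB", "AAB", "AAB", "AAA"]
ratios = enrichment(before, after)
```

A ratio above 1 marks a variant that gained ground during the evolution; a ratio below 1 marks one that lost ground.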
It is important to note that the “results gene pool” will contain consensus sequences
representing the most strongly selected-for groups of related inputs among those given by the
player. Thus genes in the results pool do not necessarily represent actual individual
physical entities that were found to have been selected or screened for during the
“evolutions”. Rather, results pool genes are formed by algorithms that look for
optimal residues in various positions within related individual genes sequenced
post-evolution. A user might ideally select and/or edit these to use as starting
materials to be synthesized for the “next round” of the game. Importantly, players
will also be able to view individual sequencing results not built as consensus
estimates, and may select among these.
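A minimal sketch of one such consensus algorithm is a per-position majority vote over aligned reads. The actual algorithm is not specified in this description, so the function name `consensus`, the alphabetical tie-break, and the toy reads are all assumptions.

```python
from collections import Counter


def consensus(aligned_sequences):
    """Return the most frequent residue at each position of equal-length reads."""
    if not aligned_sequences:
        raise ValueError("need at least one sequence")
    length = len(aligned_sequences[0])
    if any(len(s) != length for s in aligned_sequences):
        raise ValueError("sequences must be aligned to equal length")
    result = []
    for column in zip(*aligned_sequences):
        counts = Counter(column)
        # Highest count wins; ties are broken alphabetically for determinism.
        best = min(counts, key=lambda residue: (-counts[residue], residue))
        result.append(best)
    return "".join(result)


# Toy post-evolution reads (hypothetical), each differing at one position.
reads = ["ASYHK", "ESYHR", "EAYHK", "ESFHK"]
result = consensus(reads)  # "ESYHK"
```

In this toy case the consensus matches none of the individual reads, which illustrates why a results pool gene need not correspond to any physical molecule that was actually sequenced.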
Also importantly, the results pool may be viewable as a 3D topological fitness
landscape in which three variables are arrayed along X, Y, and Z. Typically the
variable most important to the fitness ideal of a specific mission is chosen as the Z
coordinate, with the optimal value defined as the highest Z point. The player may
then simply observe where genes sit in this 3D (or higher-dimensional) fitness
landscape: the genes in the results pool, other genes that have been copied and
derived from the same evolutions, or genes that have not actually been synthesized
but whose placement on the fitness landscape is algorithmically estimated, with
caveats given. This “fitness landscape view”, where genes may be
directly and intuitively observed, is an important aspect of the software and
visualization methods of the invention.
In one embodiment, the X, Y, and Z values correspond to chosen outputs of specific
evolutions. As an example, a mission might have the goal of finding enzymes that
fulfill three different criteria: cleaving a target molecule, doing so mainly at a certain pH
value, and doing so mainly in the presence of a specified cofactor. A 3D
fitness landscape exploration of such data from three different experiments, with
predefined optima, should help guide the player to making better choices in pursuit
of the mission’s ends.
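The data behind such a landscape view can be sketched as one point per gene, with the mission's primary criterion stored as Z. The class `LandscapePoint`, the function `rank_by_height`, the assignment of criteria to axes, and all scores below are hypothetical.

```python
from typing import NamedTuple


class LandscapePoint(NamedTuple):
    gene: str
    x: float  # e.g. preference for the target pH (hypothetical score)
    y: float  # e.g. dependence on the specified cofactor (hypothetical score)
    z: float  # e.g. cleavage activity, assumed here to be the primary criterion


def rank_by_height(points):
    """Order genes from the highest Z (closest to the optimum) to the lowest."""
    return sorted(points, key=lambda p: p.z, reverse=True)


points = [
    LandscapePoint("gene_a", 0.2, 0.9, 0.4),
    LandscapePoint("gene_b", 0.8, 0.7, 0.9),
    LandscapePoint("gene_c", 0.5, 0.1, 0.6),
]
ranked = rank_by_height(points)
```

A rendering layer would plot these points as a surface or scatter; the ranking alone already tells the player which genes sit highest on the landscape.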
Figure 3 illustrates a method for how to create a game that can be played
intelligently (as a function of the person’s raw ability to pattern match) without
knowledge of genetics or biochemistry. Rather than having to directly manipulate
sequence data in the form of strings of nucleotides, players manipulate
visualizations of objects more suited to human intuition. These objects correspond
in fine detail to the underlying actual sequence of the molecules they represent.
Here we consider one particular unmodified scaffold protein, retinoid-X-receptor-alpha, or RXR-alpha, as a monkey. Following Seelig and Szostak (2007), we may
wish in our game to use this very handy scaffold in order to evolve other novel
catalytic activities within two variable loop regions- regions consisting of a total of
21 amino acids. We consider a “base monkey” who has up to 58 different definable
features that can each take on between 2 and 26 different values, in order to visually
code for all observed variants in the loops and other contingencies that arise as the
scaffold itself mutates. We list the “monkey’s features” and all the different values
that those features might have- from all the different types of hats it might wear to
the color of its fur in different places to the length of its tail, etc., and we show the
“gene avatar” for ligase 1 that was “selected for” in Seelig and Szostak (2007), in
figure 2 of that paper. Note as well that this is a simple 2D visualization of the
monkey. There can be embodiments of the game involving 3D, 4D, and
higher-“dimensional” inspection of the “gene avatar”. For instance, in 3D the full
body can be inspected from all around; in 4D, walking and other behaviors can be
variable traits; in 5D, with sound as a fifth dimension, traits related to noisemaking
and speech can be variable traits; and in 6D, with psychological artificial
intelligence and interactive traits as a sixth dimension,
players could inspect these by, well… interacting.
One aspect of this method of the invention is the concept of an “overflow buffer”,
which is activated when mutations become too numerous or complex for concrete
visual representation. The overflow buffer comprises a set of generic image
manipulation procedures (filters, distortions, color manipulation, inversions, etc.)
that don’t need to be defined for a particular scaffold/avatar. For instance, RXR-alpha’s avatar may be a monkey, but extreme directed evolution could result in
mutations to its scaffold and its variable region so extensive that the predesigned
image variants and changes simply can’t accommodate them. At this point, the
program would call up the overflow buffer procedures, which will allow
encodement of a large number (e.g. ten kilobytes’ worth) of sequence changes (an
outlandishly large number in the context of engineering even a highly complex
protein and treating it all as variable- almost every amino acid would have to change
in order to overflow the overflow buffer). It is critical to note that the example
presented is extremely graphically primitive, but a PHOSITA in programming and
graphic design skills will be able to create much less clunky-looking graphics based
on the principles of the method.
Figure 4 shows a process flow indicating what happens remotely and physically in
order to correctly correspond to the actions taken by the player within the game.
The process relies heavily on two key technologies, DNA sequencing and DNA
synthesis. A copy of Figure 2 is underlaid on Figure 4 in order to show
correspondence. First, a player must have genes to play with. However, as a matter
of real-world economics, no genes are actually synthesized until the player has
committed to subjecting them to a set of defined “evolutions”, or
selection/screening protocols. At this point, genes are physically synthesized, and
then processed into the particular molecular constructs that may be needed for the
“evolutions”. Examples include phage display, where the gene
must be put into a phage, and mRNA display, where the gene must instead
be processed into an mRNA that is covalently linked to its protein product. Then the
selection/screening protocols are performed, and deep sequencing is done on
various selected/screened portions. This sequencing serves to define the “results
pool” from which players can draw for subsequent turns of game play. The “results
pool” will typically show sequence consensuses rather than actual found sequences.
At this point the cycle is repeated.
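The correspondence between game actions and physical actions can be sketched as follows: designing a gene records nothing physical, while committing it to an evolution queues the physical steps in order. The class `RemoteFacility`, its method names, and the step labels are assumptions chosen to mirror the process flow just described.

```python
# Physical steps triggered by a commitment, in the order described for Figure 4.
PHYSICAL_STEPS = (
    "synthesize_gene",
    "prepare_construct",
    "run_selection_or_screen",
    "deep_sequence",
    "build_results_pool",
)


class RemoteFacility:
    def __init__(self):
        self.virtual_genes = []  # designs that exist only in the game
        self.work_log = []       # physical actions queued at the facility

    def design(self, gene):
        """Virtual construction only; nothing is synthesized yet."""
        self.virtual_genes.append(gene)

    def commit(self, gene, evolution):
        """Commitment to an evolution is what triggers physical work."""
        if gene not in self.virtual_genes:
            raise ValueError("gene must be designed before it is committed")
        for step in PHYSICAL_STEPS:
            self.work_log.append((step, gene, evolution))


facility = RemoteFacility()
facility.design("monkey_1")
steps_before_commit = len(facility.work_log)
facility.commit("monkey_1", "mRNA display")
```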
Figure 5 shows code examples from OODLES, Object Oriented Design Language for
Experimental Science, an automation-directing programming language into which
player actions can be compiled. At its deepest level, OODLES directs the actions of
automation systems. At the highest level, OODLES provides easily human-parseable
experimental strategy content and protocol descriptions. Storing player actions in
standardized representations makes them amenable to later repeating,
optimizing, hacking, changing, and reporting, which
increases the odds in the long run that players will make meaningful
scientific contributions while playing the game. This method of the invention
therefore increases the invention’s overall usefulness.
The figure shows code compiled directly from player actions within “Sequence
Space Explorer”, an embodiment of the invention closest to that described herein.
The figure also shows code compiled directly from actions of someone playing
“Temple of Genomic Freedom”, a game that acts as an example of another method of
the invention: a game that more elite and knowledgeable players can use in order to
perform R&D at a more granular and detailed level than in “Sequence Space
Explorer”. Thus, players within “Temple of Genomic Freedom” will be able to design
missions and control automated facilities that implement evolutions. Playing
“Temple of Genomic Freedom” is analogous to being a “dungeonmaster” in the
traditional role-playing game Dungeons & Dragons, where the normal players are
playing “Sequence Space Explorer” and other games.
Figure 6 shows a representative screenshot from one embodiment of the invention,
a game entitled “Sequence Space Explorer”. The main user interface screen shown
here has all the necessary components outlined in Figure 1. A corral of genes is
shown in the upper left. The ability to construct new genes is offered by simply
right-clicking on any gene to open an editing box. The evolutions are listed by their
technical names (not a necessity) in a center column in white. Inputs are on the left: genes or “gene avatars” that are queued up for commitment to the select
“evolutions” with which they line up in the middle. Results are shown on the right, including results specific to this player’s genes (she’s just started playing so there
are none yet), and also, under “All Results”, global results from everyone who’s
playing the game. This lets even unsophisticated players form an idea of consensus
values for particular aspects of genes undergoing evolutions. For players who wish
to look at real details of the underlying molecular biology, the box that opens upon
right-clicking allows any player to change the avatar of any gene to a crystal
structure of the closest homologous entity, with or without energy minimization to
match the changes between the homologous entity and the gene selected.
Finally, Figure 7 is a list by which the images in Figure 3 were generated.
In Figure 7, the first code set corresponds only to possible straightforward
modifications of the variable loop regions within RXR-alpha. Standard amino acid
one-letter codes are used below in alphabetical order, together with “B” for either N or D,
“Z” for either Q or E, “J” for possible unnatural amino acid residues, “O” for
deletions that would be specified in a third code set, and “U” for inserts that in this
case don’t need to be described, but would be described in a fourth code set if
necessary. Further, “X” is used to indicate an undefined (presumably random)
position. All “random loop” residues begin as set to “X” in terms of the monkey’s
features. Thus, all 26 letters of the alphabet are used, so features can be read in
alphabetical order based on A-Z. Note that the “original monkey”, the starting
scaffold with no mutations other than the “X”’s, has “unmodified” in each place. In
all cases, there is an attempt to create descriptors that alphabetically match to the
starting letter of each amino acid as an aid to mnemonic interpretation by
developers and elite players. More ideally than in the example presented here, an
embodiment would use feature descriptors that mimetically or visually represent
the continuum of chemical similarities between amino acids or other forms of
biomolecular polymer residues, so that players could intuit similarities between
different feature values without knowing anything about the underlying
biochemistry.
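The 26-letter scheme described above can be sketched as a lookup table that combines the 20 standard one-letter codes with the six special letters. The names `STANDARD_RESIDUES`, `SPECIAL_CODES`, and `describe` are hypothetical; the meanings of the special letters are taken from the description of the first code set.

```python
STANDARD_RESIDUES = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes

SPECIAL_CODES = {
    "B": "either N or D",
    "Z": "either Q or E",
    "J": "possible unnatural amino acid residue",
    "O": "deletion (size specified in a third code set)",
    "U": "insert (described in a fourth code set if necessary)",
    "X": "undefined, presumably random, position",
}


def describe(letter):
    """Explain what one letter of the 26-letter feature code stands for."""
    if letter in SPECIAL_CODES:
        return SPECIAL_CODES[letter]
    if letter in STANDARD_RESIDUES:
        return f"standard residue {letter}"
    raise ValueError(f"not part of the code: {letter!r}")


# Together the two groups cover every letter from A to Z exactly once.
all_letters = "".join(sorted(set(STANDARD_RESIDUES) | set(SPECIAL_CODES)))
```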
First, Figure 7(a) lists the 21 random loop positions. An attempt is also made to assign
these positions in clockwise order around the particular generic monkey.
Finally, for the sake of brevity, only features relevant to
LIGASE 1, described in figure 3 and in figure 8, are presented.
Secondly, Figure 7(b) encodes changes to the non-variable scaffold in terms of
three different non-variable regions. Up to 2 different changes as far apart as 12
residues need to be accounted for in the first region, up to 8 different changes as far
apart as 7 residues need to be accounted for in the second region, and up to 7
different changes as far apart as 14 residues need to be accounted for in the third
region. In order to avoid having an unwieldy number of different variable features,
these changes are encoded in a compressed manner. First, for all scaffold areas the
total number of changes is feature-encoded. Second, for each change, a feature
encodes the distance in amino acids from the previous change (disregarding
intervening variable regions), with up to a number of possibilities equivalent to the
length of the scaffold or the greatest observed distance between mutations. Third,
for each change, a feature encodes the change in standard 26-possible-letter
mnemonic format. Thus, the total number of features needing to be encoded to give
a full specification of mutations to scaffold regions, not counting the size of
deletions, is 1 total number specifier plus X distances between them plus Y change
specifications, where X=Y. Thus, 1+17+17 features are needed to describe all
observed scaffold mutations, for a total of 35 different features to fully specify
changes to the scaffold region. Including 21 features for variable regions, and 2
features to specify deletion sizes, there are a total of 58 features necessary to specify
all 7 ligase mutants in Seelig and Szostak’s paper. Note however that we here
present only ligase 1, which does not require this many features to be specified.
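The feature count above can be checked with a few lines of arithmetic. The variable names are illustrative; the numbers are those stated in the text and in Figure 7(a), where the two random loops have 12 and 9 positions.

```python
# Maximum observed changes in the three non-variable scaffold regions.
scaffold_changes_per_region = [2, 8, 7]
n_changes = sum(scaffold_changes_per_region)      # 17

# One total-count feature, plus a distance and a letter for every change.
scaffold_features = 1 + n_changes + n_changes     # 35

# One feature per random loop position (loop 1 and loop 2).
loop_features = 12 + 9                            # 21

# One size feature for each of the two observed scaffold deletions.
deletion_features = 2

total_features = scaffold_features + loop_features + deletion_features  # 58
```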
A (small) third code set would be required to represent the size of deletions. This
would encode the size of deletions as expressed by “O”-variable features in the
scaffold regions. Note that the smaller variable region’s deletions are fully
described inherently, so additional descriptors are not required. However, the
length of deletions needs to be described in order to provide a complete
specification of mutations to scaffold regions. In this example, we need only provide
as many features as there were deletional “O” values in the scaffold mutation
descriptions. As at most one deletion was observed in each of two scaffold
regions, we need to provide only two features that can take on a number of values
equivalent to the longest possible or, in this case, longest observed deletion, which is
fifteen amino acids in scaffold region two and thirteen amino acids in scaffold region
three. We make these notes only theoretically, as we present only a gene
avatar for ligase 1 in this example, and deletion encodements are not necessary to
represent ligase 1 with respect to the original scaffold. If we were to
represent ligases 2, 3, 5, 6, and 7 herein, however, the deletions would need to be
represented by following the feature-encodement methods presented above.
Note that no provision has been made for insertional mutations in this embodiment,
but that such provision could be made straightforwardly based on the encoding
principles described herein.
Note as well that this particular embodiment does not include provision for an
“overflow buffer” allowing for feature-based visualization of genes with greater
complexity (including greater scaffold variant distances, greater number of scaffold
variants, and greater number of deletions) than the ligase mutants represented
here. However, the value of the method is in its general applicability conceptually to
any gene systems that require feature-based visualization, and so all that is required
in order to represent greater complexity is the input of more features and possible
variable values for those features into a system for creating and representing those
variants visually. It is also important to note that such a system could be
implemented with all variants produced manually, but that for a PHOSITA in
computer programming and video graphics it will be a straightforward matter of
programming to automate the procedures that generated the
present embodiment through manual input.
Finally, Figure 8 is a reference with acknowledged copyright from the British
journal Nature, showing the key active mutants uncovered in that paper, which
correspond to Figure 3’s monkeys in all their multifariousness.
Figure 1 Minimal elements of the invention
Figure 2 Very basic flowchart of play
Figure 3 Gene avatars for gamification of molecular engineering by DE
Left: “basic monkey” “gene avatar”, representing original RXR-alpha gene with all
“X”’s in variable region and no changes to the scaffold.
Right: “complex monkey” “gene avatar”, representing ligase 1 from Seelig and
Szostak (2007).
basic monkey RXR-alpha
complex monkey Ligase 1
Figure 4 Real world management actions overlaid on gameplay flow
Figure 5 OODLES example code
Figure 6 Example embodiment of the invention as a game called “Sequence Space
Explorer”
Figure 7 Lists by which images/genetic avatars in Figure 3 were constructed
7(a) random loop encodements
Random loop 1
Position 1- top of monkey’s head, for I
Iridescent hair bubble
Pos 2- monkey’s eyes, for L
Loving heart eyes
Pos 3- monkey’s left ear, for D
Devil-ette inside
Pos 4- monkey’s nose, for D
Doubling for total of four nostrils
Pos 5- monkey’s mouth, for A
Apple stuck inside it
Pos 6- monkey’s right hand holding something, for Y
Yin-Yang symbol
Pos 7- monkey’s left hand changes, for D
Distended laterally
Pos 8- monkey’s right foot changes, for Y
Yellowish glow
Pos 9- monkey’s right leg changes for K
Kinked leg
Pos 10- monkey’s left arm changes for Q
Quilled (feather appearance)
Pos 11- monkey’s left foot changes for T
Tripled
Pos 12- monkey’s left leg changes for D
Doubled
Random loop 2 (note: for convenience, in the current embodiment these are performed first upon the basic monkey)
In ligase 1: ESYHKCQDL
Pos 1- monkey’s distal tail change for E- Elongated
Pos 2- monkey’s proximal tail change for S- Sharpened (contrast increase)
Pos 3- monkey’s lower abdomen change for Y- Yellowed
Pos 4- monkey’s midsection change for H- Hot, shown on fire
Pos 5- monkey’s chest region change for K- “kooled”, made blue
Pos 6- monkey’s neck region change for C- corned, an ear of corn is put here
Pos 7- monkey’s mouth change for Q- Quintupled
Pos 8- monkey’s back, thing riding on it, for D- a donkey!
Pos 9- monkey’s left ear change, for L- Larger
7(b) changes to scaffold encodements
Pos 1: total number of changes in scaffold region
Feature: # of short green lines drawn on outsides of monkey’s arm
For ligase 1: 14
For each of the 14 changes, the distance from the previous change (or from the beginning of the scaffold) to the change must be encoded onto the
body parts listed below, in order.
To each of areas 1-14, apply a 100-pixel-radius vortex distortion filter using Pixelmator for Mac, with an
angle in degrees equal to 300*the distance.
Then paste a small (~50x50 pixel) “tattoo” at the center of each of these vortex areas, based on the starting letter of the amino
acid.
1. top of monkey’s head, 1600˚, Narwhal
2. monkey’s eyes, 4800˚, robot
3. monkey’s left ear, 900˚, robot
4. monkey’s nose, 300˚, queenbee
5. monkey’s mouth, 2100˚, lucky charm
6. monkey’s right hand palm area, 2100˚, sword
7. monkey’s left hand, 1200˚, yak
8. Right foot, 600˚, yak
9. Right leg, 300˚, robot
10. Left arm, 2400˚, robot
11. Left foot, 1200˚, king
12. Left leg, 600˚, tiger
13. Proximal tail, 3300˚, inkpot
14. Distal tail, 900˚, queenbee
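The scaffold encoding rule can be sketched as a function from a distance and a one-letter code to a vortex angle and a tattoo. The name `encode_change` is hypothetical, and the tattoo table contains only the letters that appear in the list above. Note that the 1600˚ given for entry 1 is not a multiple of 300, so that entry is not reproduced by this formula and is left out of the example.

```python
# Tattoo names read off the list above, keyed by the one-letter code they begin with.
TATTOOS = {
    "N": "Narwhal",
    "R": "robot",
    "Q": "queenbee",
    "L": "lucky charm",
    "S": "sword",
    "Y": "yak",
    "K": "king",
    "T": "tiger",
    "I": "inkpot",
}

DEGREES_PER_RESIDUE = 300  # vortex angle per amino acid of distance


def encode_change(distance, new_residue):
    """Return (vortex angle in degrees, tattoo name) for one scaffold change."""
    if distance < 1:
        raise ValueError("distance must be at least 1")
    return DEGREES_PER_RESIDUE * distance, TATTOOS[new_residue]


# A distance of 16 residues and a change to R gives the values listed for entry 2.
angle, tattoo = encode_change(16, "R")
```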
Figure 8 Literature basis for gene avatar design
This is a screen capture of figure 3 from “Selection and evolution of enzymes from a
partially randomized non-catalytic scaffold”, Seelig B and Szostak J, Nature 448,
828-831 (16 August 2007)