Systems and Synthetic Biology: A programming languages point of view
Saurabh Srivastava
Assistant Research Engineer
Computer Science + Bioengineering
University of California, Berkeley
1
In perspective
• Systems biology (one view: biological specifications)
– Modeling complete biological processes
• Genomics (one view: syntax of components)
– Reading DNA sequences got incredibly fast and cheap
– Algorithms for sequence data analytics within organisms
• Synthetic biology (one view: semantics of components)
– DNA synthesis got incredibly cheap
– Functional characterization on the way
– Algorithms to predict tweaks to organisms
2
Different scales in biology
[Figure: scales in biology, from large to small: organisms, intercellular processes, intracellular processes, protein interactions, protein function. Callouts: "Systems Biology operates here" and "Synthetic Biology operates here".]
3
Results and techniques used
Results:
• A model of cell communication that helps in understanding cancer
• A constructed bacterium that produces paracetamol (Depon)
Techniques:
• Automatically generating concurrent programs from input-output examples
• Big data analysis and abstraction
5
PART I
Synthesizing models for Systems Biology
Synthesizing Systems Biology “Programs”
Synthesizing concurrent programs from examples
Programs ≡ biological models
Examples ≡ biological experiments
We assist natural sciences with formal methods
• Given experiments, are there other models?
• If so, compute a new, disambiguating experiment
Part I: how stem cells coordinate their fates
7
Understanding Diseases
• “Cancer is fundamentally a disease of failure
of regulation of tissue growth. In order for a
normal cell to transform into a cancer cell, the
genes which regulate cell growth and
differentiation must be altered.” – from Wikipedia
• Research on cell differentiation helps us understand diseases such as cancer.
8
C. elegans: A Model Organism
A nematode (roundworm) widely used in developmental biology.
959 cells; many of its organs have counterparts in other animals.
Differentiation is studied in vulval development.
9
Initial division of embryo
Identical precursor cells collaborate to decide their fate
Differentiation and then development into organ parts
10
Modeling Goal
What is the mechanism (program) within each cell
for deciding fates through communication?
11
Building Blocks of these Programs
Cells contain communicating proteins.
Protein interaction: a protein senses the
concentration of other proteins.
Interaction is either activation or inhibition.
[Diagram: two pairs of proteins A and B, one pair illustrating activation and the other inhibition]
12
How the Vulval Cells Differentiate
If cells sense the same signal strength, data races occur.
[Figure: the anchor cell sends a graded signal (high, med, low) to the vulval precursor cells. Each cell contains the proteins let-23, sem-5, let-60, mpk-1, lst, and lin-12. Resulting fates: 1º, 2º, 3º.]
13
How Biologists Discover Interactions
• Measuring protein levels over time is infeasible
• If such “cell tracing” is infeasible, infer protein
interaction from end-to-end experiments
• That is, mutate cells → observe resulting fates
14
A Mutation Experiment
[Figure: a mutation experiment. The anchor cell signal (high, med, low) reaches cells that contain let-23 and lin-12. Fate labels shown: 1º, 2º, 1º, 3º.]
15
Putting Experiments Together
Fate decisions: fate of six neighboring cells

Experiment  AC  lin-12  lin-15  let-23  lst  Fate decisions             Note
1           ON  ON      ON      ON      ON   {332123}                   No protein is mutated.
2           ON  OFF     ON      ON      ON   {331113}                   lin-12 is turned off.
3           ON  ON      OFF     ON      ON   {112121, 122121, 212121}   Multiple outcomes observed
...
48          ...
Experiments over 35 years by 11 groups
16
How to Build Accurate Models?
Inhibition discovered by predictive modeling [Fisher et al. 2007]
[Figure: the vulval cell signaling diagram from "How the Vulval Cells Differentiate", repeated]
17
Semantics of the Modeling Language
• Program has cells
• Non-deterministic outcomes via schedule interleaving
• Cell has proteins
• All proteins advance synchronously
• Proteins have discrete state and update functions.
[Diagram: Cell 1 and Cell 2, each containing let-23, sem-5, lin-12, lst, let-60, mpk-1]
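The semantics listed on this slide can be tried out on a toy program. The sketch below is only an illustration under stated assumptions: the two proteins (`act`, `fate`), their update rules, and the helper names are made up here and are not the model from the talk. It shows the two ingredients named above: all proteins of a cell advance synchronously (every update reads the old state), and different schedule interleavings can produce different outcomes.

```python
from itertools import permutations

# Hypothetical discrete fate values for a toy two-cell program.
UNDECIDED, PRIMARY, SECONDARY = 0, 1, 2

def step_cell(cell, neighbor):
    """Advance all proteins of one cell synchronously (reads use the old state)."""
    # Toy rule: the activator is inhibited once the neighbor is PRIMARY.
    act = 0 if neighbor["fate"] == PRIMARY else 1
    if cell["fate"] != UNDECIDED:
        fate = cell["fate"]            # a decided fate never changes
    elif cell["act"] == 1:
        fate = PRIMARY                 # old activator level was high
    elif neighbor["fate"] == PRIMARY:
        fate = SECONDARY               # laterally inhibited
    else:
        fate = UNDECIDED
    return {"act": act, "fate": fate}

def run(schedule):
    """Run two cells under a schedule, a sequence of cell indices (0 or 1)."""
    cells = [{"act": 0, "fate": UNDECIDED} for _ in range(2)]
    for i in schedule:
        cells[i] = step_cell(cells[i], cells[1 - i])
    return tuple(c["fate"] for c in cells)

def outcomes(steps_per_cell=2):
    """Collect the fates reachable over all interleavings of the two cells."""
    schedules = set(permutations([0] * steps_per_cell + [1] * steps_per_cell))
    return {run(s) for s in schedules}
```

In this toy, a schedule that lets one cell finish first yields one primary and one secondary cell, while tightly interleaved schedules let both cells become primary, which is the kind of race the vulval cell slide refers to.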
18
Synthesizing Cellular Programs
19
Synthesis of Programs
[Diagram: biological insight, given as a partial program, and a specification (the experiment table from "Putting Experiments Together") are fed to the synthesizer, which outputs a completed program]
20
Partial Programs
Partial programs express biological insight:
• Which proteins are in the cell
• Which proteins may interact
Update functions can be unknown.
[Diagram: a cell with the proteins let-23, lin-12, sem-5, lst, let-60, mpk-1; two update functions are marked "?" (unknown)]
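One way to picture an unknown update function is as a lookup table whose entries are still open. The sketch below assumes a hole with two Boolean inputs and a Boolean output; the input names, the example observations, and the helper functions are hypothetical and chosen only for illustration.

```python
from itertools import product

# All input combinations (a, b) that the unknown update function may see.
INPUTS = list(product([0, 1], repeat=2))

def completions():
    """Yield every possible filling of the hole as a lookup table."""
    for outputs in product([0, 1], repeat=len(INPUTS)):
        yield dict(zip(INPUTS, outputs))

def consistent(table, examples):
    """Check a candidate table against observed (input, output) pairs."""
    return all(table[inp] == out for inp, out in examples)

# Hypothetical observations: high a alone switches the protein on,
# but high b suppresses it.
examples = [((1, 0), 1), ((1, 1), 0)]
candidates = [t for t in completions() if consistent(t, examples)]
```

Two observations fix two of the four table entries, so 4 of the 16 possible completions remain. The partial program narrows the search in the same way by fixing everything that is already known.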
21
Synthesis Algorithm
22
Classical CEGIS
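[Diagram: starting from an initial input/output set, the synthesizer returns either UNSAT (no solution) or SAT with a candidate solution. The verifier returns either UNSAT (the candidate is correct) or SAT with an input-output counterexample, which is added to the set.]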
23
Correctness Condition
[Table repeated from "Putting Experiments Together": for each mutation experiment m, the set E(m) of observed fate decisions]
Safety: all schedules must lead the program to produce
experiment outcomes observed in the wet lab.
∀mutation m. ∀schedule s. P(m, s)∈E(m)
Completeness: each observed experiment outcome must
be reproducible by the program for some schedule.
∀mutation m. ∀fate f∈E(m). ∃schedule s. P(m, s) = f
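For a small, finite set of schedules, the two conditions can be checked by direct enumeration. The sketch below is a minimal illustration, assuming `P` is a function from a mutation and a schedule to a fate and `E` maps each mutation to its set of observed fates; the mutation names, schedules, and both toy models are made up.

```python
def is_safe(P, E, schedules):
    """Safety: every schedule yields an outcome that was observed."""
    return all(P(m, s) in E[m] for m in E for s in schedules)

def is_complete(P, E, schedules):
    """Completeness: every observed outcome is produced by some schedule."""
    return all(any(P(m, s) == f for s in schedules) for m in E for f in E[m])

# Hypothetical data: two schedules, two experiments.
SCHEDULES = ["left_first", "right_first"]
E = {"wild_type": {"12", "21"}, "mutant": {"11"}}

def good_model(m, s):
    """Toy model whose outcome depends on the schedule in the wild type."""
    if m == "mutant":
        return "11"
    return "12" if s == "left_first" else "21"

def biased_model(m, s):
    """Toy model that ignores the schedule: safe, but misses outcome "21"."""
    return "11" if m == "mutant" else "12"
```

Here `biased_model` never produces an unobserved fate, so it is safe, but it cannot reproduce the observed fate "21", so it is not complete.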
24
Counterexample-Guided Inductive Synthesis
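[Diagram: the inductive synthesizer starts from an initial set of input/output examples and either reports that there is no candidate completion or emits a candidate completion. The candidate is checked by the safety verifier and then by the completeness verifier. The safety verifier may return a counterexample: an execution P(m, s) with a bad outcome. The completeness verifier may return a counterexample: an observation (m, f) that is not reproducible. Counterexamples go back to the synthesizer; if both checks pass, the result is ok.]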
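The loop in this diagram can be sketched with explicit enumeration standing in for the symbolic synthesizer and verifiers. Everything below is a toy under stated assumptions: the hole `h` is a two-entry lookup table, and `P`, `E`, and `SCHEDULES` are invented for illustration. The talk's setting searches symbolically rather than by enumeration.

```python
from itertools import product

SCHEDULES = [0, 1]
# Hypothetical observations: mutation 0 always gives (0, 0);
# mutation 1 gives (1, 0) or (1, 1) depending on the schedule.
E = {0: {(0, 0)}, 1: {(1, 0), (1, 1)}}

def P(h, m, s):
    """Toy program with hole h: h[m] decides whether the schedule matters."""
    return (1, s) if h[m] == 1 else (0, 0)

def synthesize(safety_cex, completeness_cex):
    """Inductive synthesizer: first completion consistent with all counterexamples."""
    for h in product([0, 1], repeat=2):
        ok_safe = all(P(h, m, s) in E[m] for m, s in safety_cex)
        ok_comp = all(any(P(h, m, s) == f for s in SCHEDULES)
                      for m, f in completeness_cex)
        if ok_safe and ok_comp:
            return h
    return None  # no candidate completion

def verify_safety(h):
    """Return a bad execution (m, s), or None if the candidate is safe."""
    for m in E:
        for s in SCHEDULES:
            if P(h, m, s) not in E[m]:
                return (m, s)
    return None

def verify_completeness(h):
    """Return an observation (m, f) that no schedule reproduces, or None."""
    for m in E:
        for f in sorted(E[m]):
            if all(P(h, m, s) != f for s in SCHEDULES):
                return (m, f)
    return None

def cegis():
    safety_cex, completeness_cex = [], []
    while True:
        h = synthesize(safety_cex, completeness_cex)
        if h is None:
            return None
        cex = verify_safety(h)
        if cex is not None:
            safety_cex.append(cex)
            continue
        cex = verify_completeness(h)
        if cex is not None:
            completeness_cex.append(cex)
            continue
        return h
```

Each counterexample rules out at least the candidate that produced it, so the loop terminates on a finite candidate space.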
25
Verifying for Safety
• Safety:
∀mutation m. ∀schedule s. P(m, s)∈E(m)
• Attempt to disprove by searching for a
demonic schedule:
∃mutation m. ∃schedule s. P(m, s) ∉ E(m)
Unroll over the set of performed experiments
Search symbolically for a demonic schedule
26
Verifying for Completeness
• Completeness:
∀mutation m. ∀fate f∈E(m). ∃schedule s. P(m, s) = f
• Attempt to disprove by showing lack of an
angelic schedule for some outcome:
∃mutation m. ∃fate f∈E(m). ∀schedule s. P(m, s) ≠ f
Unroll over pairs of mutation and fate
Query symbolically for an angelic schedule
27
Synthesized Models
• We synthesized two models of VPCs.
• Input: Partial model that specifies known,
simple protein behaviors.
• Output: Synthesized update functions for two
key proteins.
[Figures: model 1 and model 2]
28
Additional Algorithms for Going Beyond Synthesis to Assist Scientists
• Does there exist an alternative model that differs on a new experiment?
• What is the minimal number of experiments that constrains the space to the current model?
• For two or more models within the space, do there exist disambiguating experiments?
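The last question can be made concrete: an experiment disambiguates two models when the sets of fates they can produce differ. The sketch below assumes a model is a function from a mutation (a set of knocked-out genes) and a schedule to a fate string. Both models and their behaviors are invented for illustration and only borrow gene names from the slides.

```python
def outcome_set(model, m, schedules):
    """All fates a model can produce for mutation m over the given schedules."""
    return frozenset(model(m, s) for s in schedules)

def disambiguating(model_a, model_b, candidate_experiments, schedules):
    """Experiments on which the two models predict different outcome sets."""
    return [m for m in candidate_experiments
            if outcome_set(model_a, m, schedules) != outcome_set(model_b, m, schedules)]

SCHEDULES = [0, 1]

def model_a(m, s):
    """Hypothetical model: only a lin-12 knockout removes the race."""
    if "lin-12" in m:
        return "11"
    return "12" if s == 0 else "21"

def model_b(m, s):
    """Hypothetical model: any knockout removes the race."""
    if m:
        return "11"
    return "12" if s == 0 else "21"

CANDIDATES = [frozenset(), frozenset({"lin-12"}), frozenset({"lst"})]
new_experiments = disambiguating(model_a, model_b, CANDIDATES, SCHEDULES)
```

In this toy, the two models agree on the wild type and on the lin-12 knockout, so only the lst knockout is worth running in the wet lab.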
29
PART II
Predicting DNA insertions for Synthetic Biology
Synthetic Biology inserts DNA
Applications enabled
• Microbial chemical factories
– Sustainable bacterial production of chemicals. E.g., drugs, polymers
• Bacteria as a sensor
– Agricultural apps: e.g., sense nitrogen depletion in soil, change color
• Tumor-killing bacteria
– Sense multiple environmental factors (lower oxygen, high lactic acid)
– Invade cancerous cell
– Release drug inside cell
31
Does it actually work?
• Jay D. Keasling
– Artemisinin (antimalarial): from Artemisia annua to yeast
– Amyris (company): 219M incidence, 300M cure target
– 8 years of work; manual insights
• Computationally predicted
– Our Tylenol E. coli strain
Sugar → Tylenol
How do cells manipulate chemicals
Incoming chemicals
Some proteins are enzymes
Output chemicals
Changing the cellular chemical machinery
What happens when we add unnatural/external enzymes?
Current status: Sugar → Acetaminophen (Tylenol)
• Bio repositories + data dedup/correlation + search
• DNA synthesis + plasmid transformation
Opportunity: Sugar → biofuels, polymers, nylon, etc.
• Abstractions or rules from data
• Rule application to predict
35
Prediction for Tylenol
Chorismate pathway → 4-aminobenzoate → 4-aminophenol → Tylenol
Inserted: 4ABH gene from mushroom
36
[Figure: result with the 4ABH gene inserted; labels: 4AP, Tylenol]
Biochemical rules/abstractions?
• f, f’, f’’: graph transformation functions
• fclass: operator taking a (chemical) graph and transforming it
Open problem 1: Language for chemicals and transforms
C1=CC=CC=C1
3D
2D
XYZ, CML, PDB
Good for crystallographers
1D
SMILES, SYBYL
Good for biochemists
Good for computational storage, retrieval
Alternative new representation
• Transformation representation
– encmolecule
– Be able to efficiently compute fclass(encmolecule)
• Conflicting objectives:
– Trace-based encoding
• Difficulty at cycle cuts
– True graph encoding
• Subgraph isomorphism
Need midway encoding
Applying reaction operators
We can now predict new edges, i.e., external enzymes or even newly constructed enzymes
Open problem 2: Deriving biochemical programs
• fclass(encmolecule) is a single step
• Function composition to get “pathway programs”
– fclass0 (fclass1 (fclass2 (fclass3 (encmolecule))))
• Complications
– Path can be “acyclic” not just straight-line
– Mixed concrete and rule instantiation
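A straight-line "pathway program" is just a composition of single-step operators. The sketch below assumes a deliberately crude encoding, a molecule as a set of functional-group labels on a ring, which is not the encoding discussed in the talk. The operator names and the `pathway` helper are hypothetical. The two steps mirror the Tylenol route from the earlier slide: 4-aminobenzoate → 4-aminophenol → acetaminophen.

```python
from functools import reduce

def replace_group(old, new):
    """Build a single-step operator that swaps one functional group for another."""
    def op(mol):
        if mol is None or old not in mol:
            return None  # operator does not apply to this molecule
        return frozenset((mol - {old}) | {new})
    return op

def pathway(*ops):
    """Compose operators left to right: pathway(f, g)(x) == g(f(x))."""
    return lambda mol: reduce(lambda m, f: f(m), ops, mol)

# Hypothetical rule instances for the two steps.
carboxyl_to_hydroxyl = replace_group("COOH", "OH")
n_acetylate = replace_group("NH2", "NHCOCH3")

aminobenzoate = frozenset({"NH2", "COOH"})
to_tylenol = pathway(carboxyl_to_hydroxyl, n_acetylate)
product = to_tylenol(aminobenzoate)
```

The slide writes the same composition in nested form, innermost step first. Acyclic (non-straight-line) pathways would need operators with several inputs, which this sketch leaves out.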
Crux of the research problem
Intelligent rule instantiation: dashed (------) edges
Given target chemical
Need better search methods
We phrase it as a model-checking reachability problem. Initial experiments point to the need for a more efficient representation of the search space.
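Seen as reachability, the question is whether the target chemical can be reached from a starting chemical by applying rules. The sketch below uses plain breadth-first search over a toy encoding (a molecule as a set of functional-group labels). The rule table and names are assumptions for illustration, and explicit enumeration like this is exactly what does not scale, which is the point the slide makes about the search space.

```python
from collections import deque

# Hypothetical rule table: rule name -> (group consumed, group produced).
RULES = {
    "carboxyl_to_hydroxyl": ("COOH", "OH"),
    "n_acetylate": ("NH2", "NHCOCH3"),
    "hydroxyl_to_methoxy": ("OH", "OCH3"),
}

def apply_rule(mol, rule):
    """Apply one rule to a molecule, or return None if it does not match."""
    old, new = RULES[rule]
    return frozenset((mol - {old}) | {new}) if old in mol else None

def find_pathway(start, target, max_steps=4):
    """Breadth-first search for a shortest rule sequence from start to target."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        mol, path = queue.popleft()
        if mol == target:
            return path
        if len(path) == max_steps:
            continue
        for rule in RULES:
            nxt = apply_rule(mol, rule)
            if nxt is not None and nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [rule]))
    return None  # target not reachable within max_steps

start = frozenset({"NH2", "COOH"})
target = frozenset({"NHCOCH3", "OH"})
route = find_pathway(start, target)
```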
Eventual targets
• Arbekacin (semi-synthetic): MRSA drug, $20,900/gram
• Amikacin (natural): nothing interesting, $118/gram
Modular decomposition through function summaries
[Figure: Arbekacin (semi-synthetic, MRSA drug, $20,900/gram) and Amikacin (natural, nothing interesting, $118/gram)]
Acknowledgements
48
http://pathway.berkeley.edu
Current status: Sugar → Acetaminophen (Tylenol)
• Bio repositories + data dedup/correlation + search
• DNA synthesis + plasmid transformation
Opportunity: Sugar → biofuels, polymers, nylon, etc.
• Chemical transformation functions from data
• Language for biochemistry
• Synthesis of biochemical program (i.e., DNA modifications)
• Synthesis of internals of transformation function
49
Backup slides
Nylon precursor (polymer)
51
Butanol (fuel)
52
Architecture
[Diagram labels]
• Pubchem: 53M entries
• 55,000 entries
• Pubchem names/Brenda names: 55k
• Organisms:
• Uniprot: 23M entries