Adleman and computing on a surface

Course outline
1 Introduction
2 Theoretical background Biochemistry/molecular biology
3 Theoretical background computer science
4 History of the field
5 Splicing systems
6 P systems
7 Hairpins
8 Detection techniques
9 Micro technology introduction
10 Microchips and fluidics
11 Self assembly
12 Regulatory networks
13 Molecular motors
14 DNA nanowires
15 Protein computers
16 DNA computing - summary
17 Presentation of essay and discussion

Who’s who?

Tom Head
Department of Mathematical Sciences, Binghamton University
Areas of interest:
• Algebra
• Computing with biomolecules
• Formal representations of communication
http://www.math.binghamton.edu/tom/

Leonard Adleman
Department of Computer Science
Turing Award 2002
Areas of interest:
• Method for Obtaining Digital Signatures and Public-Key Cryptosystems
• Distinguishing Prime Numbers From Composite Numbers
• The First Case of Fermat's Last Theorem
• Primality Testing And Two Dimensional Abelian Varieties Over Finite Fields
• Molecular Computation of Solutions To Combinatorial Problems
http://www.usc.edu/dept/molecular-science/fm-adleman.htm

Richard Lipton
Theoretical Computer Science, College of Computing, Georgia Tech
Areas of interest:
• Algorithms and Complexity Theory
• Cryptography
• DNA Computing
http://www.cc.gatech.edu/computing/Theory/theory.html

Laura Landweber
Dept. of Ecology and Evolutionary Biology, Princeton University
Areas of interest:
• Origins of the Genetic Code
• Genes, Genomes
• Early Pathways of RNA Evolution
• RNA Editing
• Scrambled Genes
• Gene Scrambling
• DNA Computing
http://www.princeton.edu/~lfl/

John Reif
Computer Science, Duke University
Areas of interest:
• DNA nanostructures
• Molecular Computation
• Efficient Algorithms
• Parallel Computation
• Robotic Motion Planning
• Optical Computing
http://www.cs.duke.edu/~reif/

Erik Winfree
Computer Science, Computation and Neural Systems, Caltech
MacArthur Fellow 2000
Areas of interest:
• DNA-based computers
• Computing by self-assembly
• Genetic Regulatory Networks
• Signal Transduction Cascades
• Ribosomal Translation
• DNA and RNA folding
http://www.dna.caltech.edu/~winfree/

Nadrian Seeman
Department of Chemistry, New York University
Areas of interest:
• DNA Nanotechnology
• Macromolecular Design and Topology
• Biophysical Chemistry of Recombinational Intermediates
• DNA-Based Computation
• Crystallography
http://www.nyu.edu/pages/chemistry/faculty/seeman.html

Robert Corn
Chemistry Department, University of Wisconsin
Areas of interest:
• surface plasmon resonance (SPR) to monitor biopolymer adsorption
• the chemical modification of surfaces
• characterization of molecular monolayers
• electron transfer processes at liquid/liquid electrochemical interfaces
• DNA computing algorithms at surfaces
• multilayer polyelectrolyte films for ion transport applications
http://corninfo.chem.wisc.edu/

Hagiya Masami
Department of Computer Science, University of Tokyo
Areas of interest:
• Automated Deduction, Formal Verification and Programming Languages
• Bio-Computing
• Hybrid Systems...
http://hagi.is.s.u-tokyo.ac.jp

Akira Suyama
Graduate School of Arts and Sciences, University of Tokyo
Areas of interest:
• SNPs
• Probe design
• DNA chips
• Quantitative gene expression
• Hybrid Systems...
http://talent.c.u-tokyo.ac.jp/suyama/

John Rose
Department of Computer Science, University of Tokyo
Areas of interest:
• the DNA chip, especially Tag-Antitag Systems
• Whiplash PCR, a simple autonomous DNA computer
• equilibrium chemistry/statistical thermodynamic model
http://hagi.is.s.u-tokyo.ac.jp/~johnrose/

Gheorghe Păun
Institute of Mathematics of the Romanian Academy
Areas of interest:
• Formal language theory (and applications)
• Combinatorics on words
• Semiotics
• operational research
• DNA Computing
• Membrane Computing
http://stoilow.imar.ro/~gpaun/

Grzegorz Rozenberg
Institute of Advanced Computer Science, University of Leiden
Areas of interest:
• Molecular Computing
• Evolutionary Algorithms
• Neural Networks
http://www.wi.leidenuniv.nl/~rozenber/

Giancarlo Mauri
Dipartimento di Informatica, Sistemistica e Comunicazione (DISCo), Milano
Areas of interest:
• H systems
• P systems
• Neural Networks
http://bioinformatics.bio.disco.unimib.it/

Ehud Shapiro
Computer Science and Applied Mathematics, the Weizmann Institute
Areas of interest:
• DNA as input and fuel
• Biological nanocomputer
• Turing machine-like model
http://www.weizmann.ac.il/mathusers/lbn/index.html

Byoung-Tak Zhang
School of Computer Science and Engineering, Seoul National University
Areas of interest:
• Evolutionary Intelligence
• Neural Intelligence
• Molecular Intelligence
• Computational Learning Theory
http://scai.snu.ac.kr/~btzhang/

Danny van Noort
School of Computer Science and Engineering, Seoul National University
Areas of interest:
• microstructure design and fabrication
• DNA-hybridisation
• instrumentation
• fluorescent microscopy
• affinity biosensors
• protein chips
• DNA computing
• cell behaviour
http://bi.snu.ac.kr/~danny/

NP complete problems

The theory of NP-completeness
• Tractable and intractable problems
• NP-complete problems

Classifying problems
Classify problems as tractable or intractable.
Problem is tractable if there exists at least one polynomial bound algorithm that solves it.
An algorithm is polynomial bound if its worst case growth rate can be bound by a polynomial p(n) in the size n of the problem:
$p(n) = a_k n^k + \dots + a_1 n + a_0$, where k is a constant.

Intractable problems
• Problem is intractable if it is not tractable.
• All algorithms that solve the problem are not polynomial bound.
• It has a worst case growth rate f(n) which cannot be bound by a polynomial p(n) in the size n of the problem.
• For intractable problems the bounds are: $f(n) = c^n$, or $n^{\log n}$, etc.

Hard practical problems
There are many practical problems for which no one has yet found a polynomial bound algorithm.
Examples: traveling salesperson, 0/1 knapsack, graph coloring, bin packing etc.
Most design automation problems such as testing and routing.
Many networks, database and graph problems.

The theory of NP-completeness
The theory of NP-completeness enables showing that these problems are at least as hard as NP-complete problems.
Practical implication of knowing a problem is NP-complete is that it is probably intractable (whether it is or not has not been proved yet).
So any algorithm that solves it will probably be very slow for large inputs.

Decision problems
A decision problem answers yes or no for a given input.
Examples:
• Given a graph G, is there a path from s to t of length at most k?
• Does graph G contain a Hamiltonian cycle?
• Given a graph G, is it bipartite?

Decision problem: Hamiltonian cycle
A Hamiltonian cycle of a graph G is a cycle that includes each vertex of the graph exactly once.
Problem: Given a graph G, does G have a Hamiltonian cycle?

The class P
P is the class of decision problems that are polynomial bounded.
Is the following problem in P?
Given a weighted graph G, is there a spanning tree of weight at most B?
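The spanning-tree question above can be decided with a polynomial bound algorithm. The following is a minimal sketch, assuming an undirected graph with vertices numbered 0..n-1 and edges given as (u, v, weight) triples. It uses Kruskal's algorithm, which is one standard choice and not something the slides prescribe, and the function names are illustrative only.

```python
from typing import List, Optional, Tuple

Edge = Tuple[int, int, float]  # (u, v, weight), undirected


def min_spanning_tree_weight(n: int, edges: List[Edge]) -> Optional[float]:
    """Return the weight of a minimum spanning tree, or None if the graph is disconnected."""
    parent = list(range(n))  # union-find forest over the vertices

    def find(x: int) -> int:
        # Follow parent links to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    total = 0.0
    used = 0
    # Kruskal: scan edges by increasing weight, keep those joining two components.
    for u, v, w in sorted(edges, key=lambda e: e[2]):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            total += w
            used += 1
    return total if used == n - 1 else None


def has_spanning_tree_at_most(n: int, edges: List[Edge], bound: float) -> bool:
    """Decision version: is there a spanning tree of weight at most `bound`?"""
    weight = min_spanning_tree_weight(n, edges)
    return weight is not None and weight <= bound
```

Sorting dominates the running time, so the whole decision procedure is bounded by a polynomial in the number of edges and vertices.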
The decision versions of problems such as shortest distance and minimum spanning tree belong to P.

The class NP
NP is the class of decision problems for which there is a polynomial bounded verification algorithm.
It can be shown that:
• all decision problems in P, and
• decision problems such as traveling salesman, knapsack, bin pack,
are also in NP.

The relation between P and NP
$P \subseteq NP$
If a problem is solvable in polynomial time, a polynomial time verification algorithm can easily be designed that ignores the certificate and answers “yes” for all inputs with the answer “yes”.

The relation between P and NP
• Problems in P can be solved “quickly”.
• Problems in NP can be verified “quickly”.
• It is easier to verify a solution than to solve a problem.
• It is not known whether P = NP.
• Some researchers believe that P and NP are not the same class.

NP-complete problems
A problem A is NP-complete if
1. It is in NP and
2. For every other problem A’ in NP, $A' \le_p A$ (A’ is polynomial-time reducible to A).
A problem A is NP-hard if
For every other problem A’ in NP, $A' \le_p A$.

Examples of NP-complete problems
Cook’s theorem: Satisfiability is NP-complete.
This was the first problem shown to be NP-complete.
Other problems:
• the decision version of knapsack,
• the decision version of traveling salesman

Satisfiability problem

The satisfiability problem
First, Conjunctive Normal Form (CNF) will be defined.
Then, the Satisfiability problem will be defined.

Conjunctive normal form (CNF)
• A logical (Boolean) variable is a variable that may be assigned the value true or false (x, y, w and z are Boolean variables).
• A literal is a logical variable or the negation of a logical variable ($x$ and $\lnot y$ are literals).
• A clause is a disjunction of literals ($(w \lor x \lor y)$ and $(\lnot x \lor y)$ are clauses).

Conjunctive normal form (CNF)
A logical (Boolean) expression is in Conjunctive Normal Form if it is a conjunction of clauses.
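The definitions above also show why satisfiability is in NP: given a truth assignment as a certificate, checking it against a CNF expression takes time proportional to the size of the expression. Below is a minimal sketch of such a verifier. The list-of-clauses representation and all names are illustrative, and the placement of negations in `EXAMPLE` is an assumption, chosen to be consistent with the four solutions listed in the 4-variable SAT demo later in this deck.

```python
from typing import Dict, List, Tuple

Literal = Tuple[str, bool]  # (variable name, True if the variable is NOT negated)
Clause = List[Literal]      # disjunction of literals
CNF = List[Clause]          # conjunction of clauses


def satisfies(formula: CNF, assignment: Dict[str, bool]) -> bool:
    """Verify a certificate: every clause must contain at least one true literal."""
    return all(
        any(assignment[var] == positive for var, positive in clause)
        for clause in formula
    )


# (w or x or y) and (w or not y or z) and (not x or y) and (not w or not y)
# Negation placement is an assumption (see the lead-in above).
EXAMPLE: CNF = [
    [("w", True), ("x", True), ("y", True)],
    [("w", True), ("y", False), ("z", True)],
    [("x", False), ("y", True)],
    [("w", False), ("y", False)],
]
```

For instance, `satisfies(EXAMPLE, {"w": True, "x": False, "y": False, "z": True})` checks the assignment x=F, y=F, w=T, z=T in a single pass over the clauses.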
The following expression is in conjunctive normal form:
$(w \lor x \lor y) \land (w \lor \lnot y \lor z) \land (\lnot x \lor y) \land (\lnot w \lor \lnot y)$

The satisfiability problem
Is there a truth assignment to the n variables of a logical expression in Conjunctive Normal Form which makes the value of the expression true?
For the answer to be yes, all clauses must evaluate to true.
Otherwise the answer is no.

The satisfiability problem
x=F, y=F, w=T and z=T is a truth assignment for:
$(w \lor x \lor y) \land (w \lor \lnot y \lor z) \land (\lnot x \lor y) \land (\lnot w \lor \lnot y)$
Note that if y=F then $\lnot y$=T.
Each clause evaluates to true.

Adleman’s experiment

The 1994 experiment
DNA computer

The 1994 experiment
Basic Idea
Perform molecular biology experiment to find solution to math problem.

Hamiltonian path (Proposed by William Hamilton)
Given a network of nodes and directed connections between them, is there a path through the network that begins with the start node and concludes with the end node, visiting each node only once (“Hamiltonian path”)?
Does a Hamiltonian path exist, or not?

Hamiltonian path does exist
[Figure: network of Atlanta (start city), Boston, Chicago and Detroit (end city)]

Hamiltonian path does not exist
[Figure: network of Detroit, Chicago, Boston and Atlanta with start city and end city marked]

Solving the Hamiltonian problem
Generation-&-Test Algorithm
Step 1 Generate random paths on the network.
Step 2 Keep only those paths that begin with start city and conclude with end city.
Step 3 If there are N cities, keep only those paths of length N.
Step 4 Keep only those that enter all cities at least once.
Step 5 Any remaining paths are solutions (i.e., Hamiltonian paths).

The paths
[X] D -> B -> A
[X] B -> C -> D -> B -> A -> B
[X] A -> B -> C -> B
[X] C -> D -> B -> A
[X] A -> B -> A -> D
[O] A -> B -> C -> D
[X] A -> B -> A -> B -> C -> D

Solving the Hamiltonian problem
Combinatorial explosion
The total number of paths grows exponentially as the network size increases: (e.g.) $10^{6}$ paths for N=10 cities, $10^{12}$ paths (N=20), $10^{100}$ paths!! (N=100)
The Generation-&-Test algorithm takes “forever”.
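The five Generation-&-Test steps can be sketched serially as follows. This is an illustration under stated assumptions, not the procedure from the slides: Step 1 is replaced by an exhaustive enumeration of walks up to a fixed number of cities instead of random generation, the edge list in the usage note is an assumed four-city network, and the function names are hypothetical.

```python
from typing import List, Optional, Set, Tuple

Edge = Tuple[str, str]  # directed connection (from_city, to_city)


def generate_walks(edges: Set[Edge], max_cities: int) -> List[List[str]]:
    """Step 1 (exhaustive stand-in for random generation): all walks of 2..max_cities cities."""
    frontier = [[a, b] for a, b in sorted(edges)]
    walks = list(frontier)
    # Every walk in the frontier has the same length, so checking the first is enough.
    while frontier and len(frontier[0]) < max_cities:
        frontier = [
            walk + [b]
            for walk in frontier
            for a, b in sorted(edges)
            if a == walk[-1]
        ]
        walks.extend(frontier)
    return walks


def hamiltonian_paths(cities: List[str], edges: Set[Edge], start: str, end: str,
                      max_cities: Optional[int] = None) -> List[List[str]]:
    n = len(cities)
    walks = generate_walks(edges, max_cities or n + 2)
    walks = [w for w in walks if w[0] == start and w[-1] == end]  # Step 2
    walks = [w for w in walks if len(w) == n]                     # Step 3
    walks = [w for w in walks if set(w) == set(cities)]           # Step 4
    return walks                                                  # Step 5
```

With the assumed network `edges = {("A","B"), ("A","D"), ("B","C"), ("B","D"), ("B","A"), ("C","D")}`, calling `hamiltonian_paths(["A","B","C","D"], edges, "A", "D")` leaves only the path A, B, C, D. The number of walks produced in Step 1 grows exponentially with `max_cities`, which is the combinatorial explosion described above.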
Some sort of smart algorithm must be devised; none has been found so far (NP-hard).

Finding a solution with DNA
The key to solving the problem is using DNA to perform the five steps of the Generation-&-Test algorithm in parallel search, instead of serial search.

Intermezzo: DNA polymerase
• Protein that produces complementary DNA strand
• A -> T, T -> A, C -> G, G -> C
• Requires primer and starter
• Enables DNA to reproduce
The bio-nanomachine:
• hops onto DNA strand
• slides along
• reads each base
• writes its complement onto new strand

Experimental set-up
Ingredients and tools needed:
• DNA strands that encode city names and connections between them
• Polymerases, ligase, water, salt, other ingredients
• Polymerase chain reaction (PCR) set
• Gel electrophoresis tool (that filters out non-solution strands)

Gel electrophoresis
[Figure]

Solving a Hamiltonian path problem
[Figure: network of Atlanta (start city), Boston, Chicago and Detroit (end city)]

City coding

CITY      DNA NAME   COMPLEMENT
ATLANTA   ACTTGCAG   TGAACGTC
BOSTON    TCGGACTG   AGCCTGAC
CHICAGO   GGCTATGT   CCGATACA
DETROIT   CCGAGCAA   GGCTCGTT

CONNECTING PATH   DNA PATH
ATLANTA-BOSTON    GCAGTCGG
ATLANTA-DETROIT   GCAGCCGA
BOSTON-CHICAGO    ACTGGGCT
BOSTON-DETROIT    ACTGCCGA
BOSTON-ATLANTA    ACTGACTT
CHICAGO-DETROIT   ATGTCCGA

City coding with DNA
[Figure: the Atlanta-Boston path strand GCAGTCGG shown against the Atlanta complement TGAACGTC and the Boston complement AGCCTGAC]

Possible paths
[Figure: network with start city Atlanta and end city Detroit; strands Atlanta-Boston, Boston-Chicago and Chicago-Detroit shown with the starred strands Atlanta*, Boston*, Chicago* and Detroit*]

Possible paths
[Figure: network with start city Atlanta and end city Detroit; strands Boston-Atlanta and Atlanta-Detroit shown with the starred strands Boston*, Atlanta* and Detroit*]

In pictures

The DNA experiment
1. In a test tube, mix the prepared DNA pieces together (which will randomly link with each other, forming all different paths).
2. Perform PCR with two ‘start’ and ‘end’ DNA pieces as primers (which creates millions of copies of DNA strands with the right start and end).
3. Perform gel electrophoresis to identify only those pieces of right length (e.g., N=4).
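The city coding table follows a simple rule that can be checked in a few lines: each connecting-path strand is the second half of the origin city's code followed by the first half of the destination city's code, and each complement is the base-wise complement of the city code. The sketch below reproduces the table under those two observations. Function names are illustrative, and strand orientation (5'/3') is ignored, as it is in the table.

```python
from typing import List

CITY_CODE = {
    "ATLANTA": "ACTTGCAG",
    "BOSTON": "TCGGACTG",
    "CHICAGO": "GGCTATGT",
    "DETROIT": "CCGAGCAA",
}

# Base pairing: A <-> T, C <-> G
_PAIRING = str.maketrans("ACGT", "TGCA")


def complement(strand: str) -> str:
    """Base-wise complement, written in the same direction as the input."""
    return strand.translate(_PAIRING)


def path_strand(origin: str, destination: str) -> str:
    """Second half of the origin code joined to the first half of the destination code."""
    half = len(CITY_CODE[origin]) // 2
    return CITY_CODE[origin][half:] + CITY_CODE[destination][:half]


def route_strand(route: List[str]) -> str:
    """Sequence obtained when the path strands along a route are linked end to end."""
    return "".join(path_strand(a, b) for a, b in zip(route, route[1:]))
```

For the route Atlanta, Boston, Chicago, Detroit, the junction between the Atlanta-Boston and Boston-Chicago strands spells out the full Boston code, which is why the Boston complement can hold the two path strands together while they are linked.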
The DNA experiment (continued)
4. Use DNA ‘probe’ molecules to check whether their paths pass through all intermediate cities.
5. All DNA pieces that are left in the tube should be precisely those representing Hamiltonian paths.
If the tube contains any DNA at all, then conclude that a Hamiltonian path exists, and otherwise not.
When it does, the DNA sequence represents the specific path of the solution.

Summary and conclusion
Why does it work?
• Enormous parallelism, with $10^{23}$ DNA pieces working in parallel to find solution simultaneously.
• Takes less than a week (vs. thousands of years for supercomputer)
• Extraordinarily energy efficient ($10^{-10}$ of supercomputer energy use)
Note this is a Universal Turing machine

Experimental set-up
[Figure sequence: experimental set-up showing a CAPTURE LAYER (-R or G), − and + labels, and a HOT label]

DNA computing on a surface

DNA computing on surfaces
Advantages over “solution phase” chemistry:
• Facile purification steps
• Reduced interference between strands
• Easily automated
Disadvantages:
• Loss of information density (2D)
• Lower surface hybridization efficiency
• Slower surface enzyme kinetics

DNA surface model: input
DNA strands representing the set {0,1}^n are synthesized and subsequently immobilized on a surface in a non-addressed fashion.

Encoding binary information
[Figure: a strand drawn as a sequence of words, each word holding bits 1 2 3 4]
A strand is comprised of words.
Each word is a short DNA strand (16mer) representing one or more bits.
DNA word design problem
Requirements of a “DNA code”:
• Success in specific hybridization between a DNA code word and its Watson-Crick complement
• Few false positive signals
Virtually all designs enforce combinatorial constraints on the code words.
Applications:
• Information storage, retrieval for DNA computing
• Molecular bar codes for chemical libraries

DNA word design problem
• Hamming: distance between two code words should be large
• Reverse complement: distance between a word and the reverse complement of another word should be large
• Also: frame shift, distinct sub-words, forbidden sub-words, …

Work on DNA code design
• Seeman (1990): de novo design of sequences for nucleic acid structural engineering
• Brenner (1997): sorting polynucleotides using DNA tags
• Shoemaker et al. (1996): analysis of yeast deletion mutants using a parallel molecular bar-coding strategy
• Many other examples in DNA computing

Word design example
[Figure]

DNA surface model: process
• MARK strands in which bit j = 0 (or 1): hybridize with Watson-Crick complements of word containing bit j, followed by polymerization
• DESTROY unmarked strands: exonuclease degradation
• UNMARK strands: wash in distilled water

DNA surface model: output
Detect remaining strands (if any) by detaching strands from surface and amplifying using PCR (polymerase chain reaction).

Computational power
Theorem: Any CNF-SAT formula of size m can be computed using O(m) mark, unmark and destroy operations.
Theorem: Any circuit of size m can be computed using O(m) mark, unmark, destroy, and append operations.

The satisfiability problem
Input: 16 strands
Process:
MARK if bit z = 1
MARK if bit w = 1
MARK if bit y = 0
DESTROY
UNMARK
MARK if bit w = 0
MARK if bit y = 0
DESTROY
UNMARK
…
Output: exactly those strands that satisfy the circuit remain on the surface.
[Figure: circuit built from “and”, “or” and “not” gates over w, x, y and z]

4-variable SAT demo
$(w \lor x \lor y) \land (w \lor \lnot y \lor z) \land (\lnot x \lor y) \land (\lnot w \lor \lnot y)$
{0000} {0010} {0100} {0110} {1000} {1010} {1100} {1110}
{0001} {0011} {0101} {0111} {1001} {1011} {1101} {1111}

4-variable SAT demo
The logic of the DNA computation at each cycle, leading in the end to four types of DNA molecules remaining on the surface.
The identity of those molecules that correspond to the solutions was determined by PCR.
Solution: S3 S7 S8 S9

4-variable SAT, the answers
S3: w=0, x=0, y=1, z=1
S7: w=0, x=1, y=1, z=1
S8: w=1, x=0, y=0, z=0
S9: w=1, x=0, y=0, z=1

y=1: $(w \lor x \lor y)$
z=1: $(w \lor \lnot y \lor z)$
x=0 or y=1: $(\lnot x \lor y)$
w=0: $(\lnot w \lor \lnot y)$

4-variable SAT demo
[Figure: Synthesize; Attach, then a cycle of Mark, Destroy, Unmark, then Readout]

Conclusions
• Solid-phase chemistry is a promising approach to DNA computing
• DNA computing will require greatly improved DNA surface attachment chemistries and control of chemical and enzymatic processes
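The MARK, DESTROY, UNMARK cycle of the surface model can be simulated in software. The sketch below is an illustration only: it models each strand as a tuple of bits in the order (w, x, y, z), treats the surface as a dictionary from strand to a "marked" flag, and uses hypothetical names throughout. The negations in `FORMULA` are an assumption, chosen so that the result agrees with the four answers S3, S7, S8 and S9 reported in the demo.

```python
from itertools import product
from typing import Dict, List, Tuple

VARIABLES = ("w", "x", "y", "z")
Strand = Tuple[int, ...]

# Each clause lists (variable, bit value that satisfies the literal).
# (w or x or y) and (w or not y or z) and (not x or y) and (not w or not y)
FORMULA: List[List[Tuple[str, int]]] = [
    [("w", 1), ("x", 1), ("y", 1)],
    [("w", 1), ("y", 0), ("z", 1)],
    [("x", 0), ("y", 1)],
    [("w", 0), ("y", 0)],
]


def solve_on_surface(formula, variables=VARIABLES) -> List[Strand]:
    index = {name: i for i, name in enumerate(variables)}
    # Input: all 2^n strands immobilized on the surface, none marked.
    surface: Dict[Strand, bool] = {
        bits: False for bits in product((0, 1), repeat=len(variables))
    }
    for clause in formula:
        # MARK: one operation per literal marks every strand with the matching bit.
        for name, value in clause:
            for strand in surface:
                if strand[index[name]] == value:
                    surface[strand] = True
        # DESTROY: remove all unmarked strands.
        surface = {strand: flag for strand, flag in surface.items() if flag}
        # UNMARK: clear the marks on the surviving strands.
        surface = {strand: False for strand in surface}
    # Output: the strands left on the surface satisfy every clause.
    return sorted(surface)
```

Each clause costs one MARK per literal plus one DESTROY and one UNMARK, so the number of operations is linear in the formula size, in line with the O(m) bound stated in the first theorem. Running `solve_on_surface(FORMULA)` on the 16 input strands leaves four strands on the simulated surface.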