Vyacheslav V. Rykov

Outline
• DNA Hybridization/Cross Hybridization
• DNA Codes
• Nearest Neighbor Thermodynamics – complete computations – bounds
• Overview of Applications and Purposes
• DNA Bitstring Library
• Biomolecular Computing

DNA Hybridization
• DNA strands are modeled by directed 3’--> 5’ sequences of letters from the alphabet {A, C, G, T}.
• (A, T) and (C, G) are complementary pairs.
• Two oppositely directed DNA sequences are capable of coalescing into a duplex.
• Because an A (C) in one strand can (usually) only bind to a T (G) in the oppositely directed strand, the greatest energy of duplex formation is obtained when the two sequences are reverse-complements (complements).
Orientation of single DNA strands is important for hybridization.

A DNA Code
Coding Strands for Ligation (5’ to 3’): TACGCGACTTTC, ATCAAACGATGC, TGTGTGCTCGTC, ATTTTTGCGTTA, CACTAAATACAA, GAAAAAGAAGAA
Probing Complement Strands for Reading (5’ to 3’): GAAAGTCGCGTA, GCATCGTTTGAT, GACGAGCACACA, TAACGCAAAAAT, TTGTATTTAGTG, TTCTTCTTTTTC
Must Have, Watson Crick (WC) Duplexes: 5’TACGCGACTTTC3’ with 5’GAAAGTCGCGTA3’; ATCAAACGATGC with GCATCGTTTGAT
Must Avoid, Cross Hybridized (CH) Duplexes: TACGCGACTTTC with GCATCGTTTGAT; ATTTTTGCGTTA with GAAAAAGAAGAA
Example: x = 5’ ggCaCaTcatAct 3’, y = 5’ agatgGttAAccT 3’. The CH duplex aligns 5’ ggCaCaTcatAct 3’ with 3’ TccAAttGgtaga 5’ (= y); equivalently, 5’ ggCaCaTcatAct 3’ is compared with 5’ AggTTaaCcatct 3’.

DNA codes serve as universal components for biomolecular computing. DNA codes are closed under reverse-complementation. The strands in a DNA code have such binding specificity that a code strand will hybridize only with its reverse-complement and will not cross hybridize with any other code strand in the DNA code. Such collections of strands are crucial to the success of biomolecular computing and biomolecular nanotechnology. The basic idea is to have correct, parallel and autonomous addressing.
“Characterization of synthetic DNA bar codes in Saccharomyces cerevisiae gene-deletion strains” (Eason et al., PNAS).
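The reverse-complement relation between coding strands and probing strands can be checked mechanically. The following is a minimal sketch in Python; the function names are hypothetical and not from the slides, and strands are assumed to be written 5’ to 3’.

```python
# Hypothetical helper names; strands are plain strings written 5' -> 3'.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(strand: str) -> str:
    """Return the reverse-complement of a strand, also written 5' -> 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(strand.upper()))


def is_closed_under_reverse_complement(code) -> bool:
    """True if the reverse-complement of every strand is also in the code."""
    strands = {s.upper() for s in code}
    return all(reverse_complement(s) in strands for s in strands)


# Coding strands taken from the "A DNA Code" slide.
coding = [
    "TACGCGACTTTC", "ATCAAACGATGC", "TGTGTGCTCGTC",
    "ATTTTTGCGTTA", "CACTAAATACAA", "GAAAAAGAAGAA",
]
# Probing strands are derived here rather than typed in.
probes = [reverse_complement(s) for s in coding]
```

Under these assumptions, `coding + probes` is closed under reverse-complementation, while `coding` alone is not.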
DNA codes enable self-assembly of any components that can be attached to DNA. Their size presents the potential for increased complexity and location control in nanostructures produced by assembly that is driven by DNA duplex formation. Fundamental physical limits and increasing costs of fabrication facilities will force the development of alternatives to conventional microelectronics manufacturing. In self-assembly, weak, local interactions among molecular components spontaneously organize those components into aggregates with properties that range from simple to complex.

DNA memory: The capacity and storage density of such memories are potentially very large. Information could be mined through massively parallel template-matching reactions. In addition, information could be processed based upon context, and matched associatively based upon content.

Interest in DNA computing was sparked in 1994 by Len Adleman, who showed how DNA molecules can be used to solve a mathematical problem (the Hamiltonian path problem). DNA computing relies on the fact that DNA strands can be represented as sequences of bases (4-ary sequences) and on the property of hybridization. Errors can occur in hybridization. Thus, error-correcting codes are required for efficient synthesis of DNA strands to be used in computing.
DNA Computing Strand Engineering No codeword-codewode CH (cc-CH) No codeword-probe CH (cp-CH) No probe-probe CH (pp-CH) A A A A A A A A C C=T1 G G T T T T T T T T =BEAD PROBE (T1) A T C T T T T C A A=F3 T T G A A A A G A T= BEAD PROBE (F3) T T T C C A A A A A =F1 T T T T T G G A A A = BEAD PROBE (F1) C A A T C C A T T A=T4 T A A T G G A T T G= BEAD PROBE (T4) T T T C T T A A C C=T2 G G T T A A G A A A= BEAD PROBE (T2) C C T T C T A A A T=F4 A T T T A G A A G G= BEAD PROBE (F4) A C T A A C A A A A=F2 T T T T G T T A G T= BEAD PROBE (F2) A C T C C T A A T A=T5 T A T T A G G A G T= BEAD PROBE (T5) C A T A A A A C A C=T3 G T G T T T T A T G= BEAD PROBE (T3) T C T C T C T A C T=F1 A G T A G A G A G A= BEAD PROBE (F5) Only Allowed Hybridizations C C A A C C A A A A A A = T1 A A A A A A A C C A C C=F1 T T T T T C C T T C C A =T2 T T A C C T C A A A C C =F2 T T T C A C A A C T C C=T3 T T C A A T C C A C A A =F3 T C A C T C T C T C A A =T4 T C T T T C T C C T C T=F4 C A T C T C A C C A T C =T5 No cp-CH T T T T T T G G T T G G=Probe(T1) G G T G G T T T T T T T=Probe(F1) T G G A A G G A A A A A=Probe(T2) G G T T T G A G G T A A =Probe(F2) G G A G T T G T G A A A=Probe(T3) T T G T G G A T T G A A=Probe(F3) T T G A G A G A G T G A=Probe(T4) A G A G G A G A A A G A=Probe(F4) G A T G G T G A G A T G=Probe(T5) G T G T G T A G T G T T=Probe(F5) A A C A C T A C A C A C =F5 No cc-CH No pp-CH DNA Computing Strand Engineering No codeword cp-CH T T C A A T C C A C A A =F3 T T G T G G A T T G A A=Probe(F3) T C A C T C T C T C A A =T4 T T G A G A G A G T G A=Probe(T4) T C T T T C T C C T C T=F4 A G A G G A G A A A G A=Probe(F4) C A T C T C A C C A T C =T5 G A T G G T G A G A T G=Probe(T5) A A C A C T A C A C A C =F5 G T G T G T A G T G T T=Probe(F5) C C A A C C A A A A A A = T1 T T T T T T G G T T G G=Probe(T1) A A A A A A A C C A C C=F1 G G T G G T T T T T T T=Probe(F1) T T T T T C C T T C C A =T2 T G G A A G G A A A A A=Probe(T2) T T A C C T C A A A C C =F2 G G T T T G A G G T A A 
=Probe(F2) T T T C A C A A C T C C=T3 G G A G T T G T G A A A=Probe(T3) PROBE(F2) GGTTTGAGGTAA C A A C C A A A A A A- T T A C C T C A A A C C- T T C A A T C C A C A A- T C A C T C T C T C A A - C A T C T C A C C A T C Yes WC bonding Yes, bitstring is F2 Good read T1-F2-F3-T4-T5 1 0 0 1 1 PROBE(T2) GGAGTTGTGAA C A A C C A A A A A A- T T A C C T C A A A C C- T T C A A T C C A C A A- T C A C T C T C T C A A - C A T C T C A C C A T C Darn! CH bonding No, bitstring is not T2 Bad read T1-F2-F3-T4-T5 DNA Computing Strand Engineering No codeword pp-CH, cc-CH GGTTTGAGGTAA pp-CH interferes with reading PROBE(F2) T T G A G A G A GT G PROBE(T4) PROBE(F2) GGTTTGAGGTAA bonding site competition C A A C C A A A A A A- T T A C C T C A A A C C- T T C A A T C C A C A A- T C A C T C T C T C A A - C A T C T C A C C A T C cc-CH interferes with separation and leads to unwanted library strand interaction T1-F2-F3-T4-T5 F1-F2-T3-T4-f5 C A A C C A A A A A A- T T A C C T C A A A C C- T T C A A T C C A C A A- T C A C T C T C T C A A - C A T C T C A C C A T C T T T C C A A A A-A T T A C C T C A A A C C- T T T C A C A A C T C C- T C A C T C T C T C A A - A A C A C T A C A C A C Watson-Crick Nearest Neighbor Computation AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT AA 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 AC 0 1.44 0 0 0 0.52 0 0 0 0 0 0 0 0 0 0 AG 0 0.13 1.28 0 0 0 0 0 0 0 0 0 0 0 0 0 AT 0 0 0 0.88 0 0 0 0 0 0 0 0 0 0 0 0 CA 0 0 0 0 1.45 0 0 0 0 0 0 0 0 0 0 0 CC 0 0 0 0 0 1.84 0 0 0 0 0 0 0 0 0 0 CG 0 0 0 0 0.47 0.11 2.17 0 0 0 0 0 0 0 0 0 CT 0 0 0 0 0.12 0.32 0 1.28 0 0 0 0 0 0 0 0 GA 0 0 0 0 0 0 0 0 1.3 0.25 0 0 0 0 0 0 GC 0 0.59 0 0 0 1.11 0 0 0 2.24 0 0.27 0 0.25 0 0 GG 0 0 0.32 0 0 0 0.11 0 0 1.11 1.84 0.52 0 0 0 0 GT 0 0 0 0 0 0 0 0.13 0 0.59 0 1.44 0 0 0 0 TA 0 0 0 0 0 0 0 0 0 0 0 0 0.58 0 0 0 TC 0 0 0 0 0 0 0 0 0 0 0 0 0 1.3 0 0 TG 0 0 0.12 0 0 0 0.47 0 0 0 0 0 0 0 1.45 0 TT 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 2.24 WC Duplex 5’ g g c a c a 3’ 3’ c c g t g t 5’ 5’ g g c a c a 3’ 5’ g g c a c a 
3’ 1.44 5’ g g c a c a 3’ 5’ g g c a c a 3’ 1.84 1.45 1.45 NNFE=8.42 Cross Hybridized Nearest Neighbor Upper Bound Computation AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT AA 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 AC 0 1.44 0 0 0 0.52 0 0 0 0 0 0 0 0 0 0 AG 0 0.13 1.28 0 0 0 0 0 0 0 0 0 0 0 0 0 AT 0 0 0 0.88 0 0 0 0 0 0 0 0 0 0 0 0 CA 0 0 0 0 1.45 0 0 0 0 0 0 0 0 0 0 0 CC 0 0 0 0 0 1.84 0 0 0 0 0 0 0 0 0 0 CG 0 0 0 0 0.47 0.11 2.17 0 0 0 0 0 0 0 0 0 CT 0 0 0 0 0.12 0.32 0 1.28 0 0 0 0 0 0 0 0 GA 0 0 0 0 0 0 0 0 1.3 0.25 0 0 0 0 0 0 GC 0 0.59 0 0 0 1.11 0 0 0 2.24 0 0.27 0 0.25 0 0 GG 0 0 0.32 0 0 0 0.11 0 0 1.11 1.84 0.52 0 0 0 0 GT 0 0 0 0 0 0 0 0.13 0 0.59 0 1.44 0 0 0 0 TA 0 0 0 0 0 0 0 0 0 0 0 0 0.58 0 0 0 TC 0 0 0 0 0 0 0 0 0 0 0 0 0 1.3 0 0 TG 0 0 0.12 0 0 0 0.47 0 0 0 0 0 0 0 1.45 0 TT 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1.45 5’ g ggCaCaTcatAct gC aCaTcatAct 3’ 3’ 3’ Tc TccAAttGgtaga cA AttGgtaga5’5’ 5’ ggCaCaTcatAct 3’ 5’ AggTTaaCcatct 3’ 1.28 5’ g g C a C a T c a t A ct 3’ 5’ A g g T T a a C c a t ct 3’ 1.84 .27 0.88 NNFE~<5.45 NNFE~<5.72 Intermolecular Interactions Duplexes loop symmetric loop asymmetric 2.90 (5.3) 2.20 (5.3) Intramolecular Interactions CAAGACTTTTTGGTAGTAAA ***TTTCCC*********GGAA***GGGAAA***********TTCC*** NNFE~<5.66 NNFE~<5.66+ .59 + .32=6.57 5’ gg C a C a T c a t A ct 3’ 3’ T cc AA t t G g t a ga 5’ 5’ ggCaCaTcatAct 3’ 3’ TccAAttGgtaga 5’ 5’ ggCaCaTcatAct 3’ 5’ AggTTaaCcatc 3’ 5’ gg C a C a T c a t A ct 3’ 5’ A gg TT a a C c a t ct 3’ Virtual Stacked Pairs Virtual Duplex 5’ GGCACATCATACT 3’ 5’ AGTATGATGTGCC 3’ 5’ AGGTTAACCATCT3’ 5’ AGATGGTTAACCT3’ 5’ GGCACATCATACT 3’ Neareast Neighbor Appr. 
Free Energy of duplex formation (WC): 5’ GGCACATCATACT 3’ with 5’ AGTATGATGTGCC 3’; stacked-pair terms 1.84, 2.24, 1.45, … = 18.8
Cross hybridization: 5’ GGCACATCATACT 3’ with 5’ AGGTTAACCATCT 3’, i.e. 5’ ggCaCaTcatAct 3’ over 3’ TccAAttGgtaga 5’, or 5’ ggCaCaTcatAct 3’ compared with 5’ AggTTaaCcatct 3’; stacked-pair terms 1.28, 1.45, 1.84, 0.88; NNFE CH = 6.45
[Plot: our FE bound versus precise FE; length 16; 435 random; correlation = .737]

Let $F_q^n$ denote the set of all vectors (codewords) of length $n$ built over $A_q = \{0, 1, \ldots, q-1\}$, i.e.
$x \in F_q^n \iff x = (x_1, x_2, \ldots, x_n),\ x_i \in A_q$.

Let $d : F_q^n \times F_q^n \to \{0, 1, 2, \ldots\}$ be such that for all $x, y, z \in F_q^n$:
1) $d(x, y) = 0 \iff x = y$
2) $d(x, y) = d(y, x)$
3) $d(x, y) \le d(x, z) + d(z, y)$

Let $C_q(n, M, d) \subseteq F_q^n$ be such that $d(x, y) \ge d$ for all distinct $x, y \in C_q(n, M, d)$, and $|C_q(n, M, d)| = M$. Then $C_q(n, M, d)$ is referred to as a code of length $n$, size $M$, and minimum distance $d$.

A sphere in $F_q^n$ centered at $x$ having radius $d$:
$Sph_q(n, x, d) = \{y : y \in F_q^n,\ d(x, y) = d\}$

Volume of the sphere around $x$, of radius $d$:
$V_q(n, x, d) = \sum_{i=0}^{d} |Sph_q(n, x, i)| = |\{y : y \in F_q^n,\ d(x, y) \le d\}|$

A space is HOMOGENEOUS when the volume of a sphere does not depend on where it is centered, i.e.
$(\forall x, y \in F_q^n)(\forall d \in [0, n])\ (V_q(n, x, d) = V_q(n, y, d))$
A space is NON-HOMOGENEOUS when the volume of a sphere does depend on where it is centered.

Sequence $z = (z_1, z_2, \ldots, z_k)$ is a subsequence of $x = (x_1, x_2, \ldots, x_n)$ if and only if there exists a strictly increasing sequence of indices $(i_1, i_2, \ldots, i_k)$ such that $z_j = x_{i_j}$ for all $j$.

$LCS(x, y)$ is defined to be the set of longest common subsequences of $x$ and $y$.
$|LCS(x, y)|$ is defined to be the length of the longest common subsequence of $x$ and $y$.
Just what it says:
x = <A C G T C G A G C>
y = <G A C G C T G A G>
LCS(x, y) = {<A C G C G A G>}
|LCS(x, y)| = 7

Original Insertion-Deletion metric (Levenshtein 1966): for $x, y \in F_q^n$, with $L(x, y) = |LCS(x, y)|$,
$d(x, y) = (n - L(x, y)) + (n - L(x, y)) = 2(n - L(x, y))$
This metric results from the number of deletions and insertions that need to be made to obtain ‘y’ from ‘x’.
For vectors that have the same length, the number of deletions that will be made is $n - L(x, y)$; likewise, the number of insertions that will be made is $n - L(x, y)$.

Better Metric?
• LCS is simple and easy to compute.
• LCS is essentially a count of the number of base pairings between two sequences, and thus does approximate bonding energy.
• Clue: if a base pair bonds, but neither its neighbor to the right nor its neighbor to the left bonds, it really doesn’t contribute much.
• We might call such inconsequential bonds “lone” bonds.

“Lone Bonds”
[Figure: two strands of bases B joined by bonds.] The red bonds are “lone bonds” that don’t contribute to the binding energy.

Block LCS
The longest common subsequence SUCH THAT: if $x_i$ is matched to $y_j$, THEN EITHER $x_{i-1}$ is matched to $y_{j-1}$, OR $x_{i+1}$ is matched to $y_{j+1}$.

A common subsequence $z = (z_1, z_2, \ldots, z_m)$, $2 \le m \le n$, is called a common stacked pair subsequence of length $|z| = m$ between $x$ and $y$ if any two elements $z_i, z_{i+1}$, $i = 1, 2, \ldots, m-1$, are consecutive in $x$ and consecutive in $y$, or, if they are non-consecutive in $x$ or non-consecutive in $y$, then $z_{i-1}, z_i$ and $z_{i+1}, z_{i+2}$ are consecutive in $x$ and $y$.

Let $S(x, y)$, $0 \le S(x, y) \le n$, denote the length $|z|$ of the longest sequence occurring as a common stacked pair subsequence $z$ between sequences $x$ and $y$. The number $S(x, y)$ is called the similarity of blocks between $x$ and $y$. The metric is defined to be
$d(x, y) = n - S(x, y)$

We will be working in a NON-HOMOGENEOUS space, which makes obtaining exact formulas for sphere volumes and code sizes VERY HARD.

[6] L. M. G. M. Tolhuizen (1997): The Generalized Varshamov-Gilbert Bound is Implied by Turán’s Theorem, IEEE Transactions on Information Theory, 43:05.

Varshamov-Gilbert Lower Bound on Code Size in $F_q^n$ with any metric:
$C_q^{\max}(n, d) \ge \dfrac{q^n}{V_q^{\max}(n, d-1)}$

Let G be a simple graph on $q^n$ vertices and $e$ edges.
G contains an M-clique if:
$e > \left(1 - \dfrac{1}{M-1}\right) \dfrac{q^{2n}}{2}$

CLIQUES: The edge set of G is constructed as follows: an edge (x, y) exists in G if and only if $d(x, y) \ge d$. The first question is: how many edges does G have? This can be found by taking spheres of radius $d - 1$ around each vector and counting how many vectors are outside the particular sphere. Since edges will be double counted, we must divide by 2:
$e = \dfrac{1}{2} \sum_{x \in F_q^n} \left(q^n - V_q(n, x, d-1)\right) = \dfrac{1}{2}\left(q^{2n} - \sum_{x \in F_q^n} V_q(n, x, d-1)\right) = \dfrac{q^n}{2}\left(q^n - V_q^{avg}(n, d-1)\right)$
where $V_q^{avg}(n, d-1) = q^{-n} \sum_{x \in F_q^n} V_q(n, x, d-1)$.

If:
$\dfrac{q^n}{2}\left(q^n - V_q^{avg}(n, d-1)\right) > \left(1 - \dfrac{1}{M-1}\right) \dfrac{q^{2n}}{2}$
then there exists a code of size M. Equivalently:
$M - 1 < \dfrac{q^n}{V_q^{avg}(n, d-1)} \implies M \le C_q^{\max}(n, d)$

Let $M = \left\lceil \dfrac{q^n}{V_q^{avg}(n, d-1)} \right\rceil$. Then:
$\dfrac{q^n}{V_q^{avg}(n, d-1)} \le M < \dfrac{q^n}{V_q^{avg}(n, d-1)} + 1$
Hence there exists a code of size M, and so:
$C_q^{\max}(n, d) \ge \dfrac{q^n}{V_q^{avg}(n, d-1)}$

The upper bound for the average sphere volume in this metric will be: avg q V (n, d 1) q d 1 n d 1 min 2 ( d 1) 1, 2 j 1 n d j 1 j min( j 1,d 1) j 1 d d 1 k j k max( 0, j d ) k j k j
The Varshamov-Gilbert bound becomes: q n d 1 n d 1 min 2 ( d 1) 1, 2 j 1 n d j 1 j min( j 1,d 1) j 1 d d 1 k j k max( 0, j d ) k j k Cqmax (n, d ) j

Min. size by Length (n) and distance d:

Length (n)   d = 6     d = 7    d = 8   d = 9   d = 10
15           8         2        -       -       -
16           15        3        -       -       -
17           28        5        -       -       -
18           53        8        -       -       -
19           107       13       -       -       -
20           223       24       4       1       -
21           479       46       7       2       -
22           1055      90       12      2       -
23           2386      183      21      4       -
24           5524      381      39      6       -
25           13068     815      75      10      2
26           31545     1783     149     18      3
27           77600     3988     304     33      5
28           1943016   9102     635     62      8
29           494758    21174    1354    121     15
30           1279652   50155    2946    243     27

Thermodynamic weight of virtual stacked pairs. The entry in column X, row Y is the weight of the stacked pair XY (for example, column G, row C gives GC = 2.24):

       A      C      G      T
A      1.00   1.45   1.30   0.58
C      1.44   1.84   2.24   1.30
G      1.28   2.17   1.84   1.45
T      0.88   1.28   1.44   1.00

• Can use statistical estimation of sphere volume.
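The stacked-pair weights tabulated above can be summed along a strand to approximate the nearest-neighbor free energy (NNFE) of its Watson-Crick duplex. The sketch below is a minimal Python illustration; the dictionary layout and the function name `wc_nnfe` are assumptions for illustration. It uses only the 16 weights shown in the slides and ignores any further terms a full thermodynamic model would add.

```python
# Weights of the 16 stacked pairs, keyed by the two consecutive bases
# read 5' -> 3' on one strand (values copied from the slides).
WEIGHTS = {
    "AA": 1.00, "AC": 1.44, "AG": 1.28, "AT": 0.88,
    "CA": 1.45, "CC": 1.84, "CG": 2.17, "CT": 1.28,
    "GA": 1.30, "GC": 2.24, "GG": 1.84, "GT": 1.44,
    "TA": 0.58, "TC": 1.30, "TG": 1.45, "TT": 1.00,
}


def wc_nnfe(strand: str) -> float:
    """Sum the stacked-pair weights over all adjacent base pairs of a strand.

    This models the duplex formed by the strand and its reverse-complement,
    so every adjacent pair of bases contributes one stacked-pair term.
    """
    s = strand.upper()
    return sum(WEIGHTS[s[i:i + 2]] for i in range(len(s) - 1))


# The 6-mer worked in the slides: gg + gc + ca + ac + ca.
example = wc_nnfe("ggcaca")
```

For the strand 5’ ggcaca 3’ the terms are 1.84 + 2.24 + 1.45 + 1.44 + 1.45, which agrees with the value NNFE = 8.42 quoted for the WC duplex earlier in the slides.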
Case 1: sequences end with the same symbol
A C G C G T T A
C T G A T A C A
Get the LCS of the two sequences with their final symbols removed, and add 1 for the A’s that have to match.

Case 2: sequences end with different symbols
A C G C G T T A
C T G A T A C C
Drop the last symbol of one sequence or of the other, and take the best LCS of these two.

Solve Problem Recursively
• If x(i) and y(j) end with the same symbol, say A, then: LCS(x(i), y(j)) = LCS(x(i – 1), y(j – 1)) + A
• If x(i) and y(j) do NOT end with the same symbol, then: LCS(x(i), y(j)) = max[LCS(x(i – 1), y(j)), LCS(x(i), y(j – 1))]
• Inefficient: we keep evaluating the same LCS(i, j) over and over.
• Instead, use dynamic programming.
• Fill in a table of LCS(i, j) values by i and j.
• You only have to figure each LCS(i, j) once.
• O(n^2).

      A  C  T  G  A  C  T
A     1  1  1  1  1  1  1
G     1  1  1  2  2  2  2
C     1  2  2  2  2  3  3
T     1  2  3  3  3  3  4
A     1  2  3  3  4  4  4
C     1  2  3  3  4  5  5
G     1  2  3  4  4  5  5

In terms of the dynamic programming table: [Figure: table marking the cell we are trying to figure out and the neighboring cells holding the information we use.]

Stacked pair metric
The longest common subsequence SUCH THAT there are no lone bonds: if x_i is matched to y_j, THEN EITHER x_{i-1} is matched to y_{j-1}, OR x_{i+1} is matched to y_{j+1}.

Cannot “break” a block LCS
Big regular LCS: A C T G C T and G A C G C T. Break to get two smaller regular LCS’s: A C T G C T and G A C G C T.
Big block LCS: G G T A G G and C C T A C C. CANNOT break to get two smaller block LCS’s: G G T A G G and C C T A C C.

Adding a single symbol to a string can have effects arbitrarily far back
A C T C C C C T and G G G G G A C T G: these three bonds make the LCSP.
A C T C C C C T and G G G G G G A C T G: add just one symbol, G, and the red bond must be moved to make the new LCSP.
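The dynamic-programming recurrence for the ordinary LCS described above (same final symbol: diagonal cell plus one; different final symbols: best of the cell above and the cell to the left) can be sketched as follows. This is a minimal Python illustration; the function names are hypothetical, and `indel_distance` assumes equal-length inputs, as in the insertion-deletion metric d(x, y) = 2(n - L(x, y)).

```python
def lcs_length(x: str, y: str) -> int:
    """Length of the longest common subsequence, filled in table order."""
    rows, cols = len(x), len(y)
    # table[i][j] holds the LCS length of the prefixes x[:i] and y[:j].
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            if x[i - 1] == y[j - 1]:
                # Case 1: prefixes end with the same symbol.
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                # Case 2: drop the last symbol of one prefix or the other.
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[rows][cols]


def indel_distance(x: str, y: str) -> int:
    """Insertion-deletion distance 2(n - L(x, y)) for equal-length strings."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    return 2 * (len(x) - lcs_length(x, y))
```

For the example pair from the slides, x = ACGTCGAGC and y = GACGCTGAG, the LCS length is 7, so the insertion-deletion distance is 2(9 - 7) = 4. Each cell is computed once, which gives the O(n^2) running time stated above.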
Two algorithms
“Efficient” examines cases of INPUT: the two sequences.
“Simple” examines cases of OUTPUT: the matching.

Tail equality of two sequences
Tail equality 3: A G C T C and A T C T C
Tail equality 0: A G C T G and A T C T A

End count of a matching
End count 2: A G C T C and A T C T C
End count 0: A G C T G and A T C T A

• The end count of a matching between x and y cannot exceed the tail equality of x and y.
• Let $LCSP^{(k)}(i, j)$ be the length of the longest LCSP(i, j) achievable with a matching of end count k. Then
$LCSP(i, j) = \max_{0 \le k \le e,\ k \ne 1} LCSP^{(k)}(i, j)$
• where e is the tail equality of x and y.

Example: Figure LCSP(3): A C T G C T A T A C T G C T A T: best of these two + 3
$LCSP^{(k)}(i, j) = \max\{LCSP(i-k-1,\ j-k),\ LCSP(i-k,\ j-k-1)\} + k$
Substituting:
$LCSP(i, j) = \max_{0 \le k \le e,\ k \ne 1} \left[\max\{LCSP(i-k-1,\ j-k),\ LCSP(i-k,\ j-k-1)\} + k\right]$

• O(n) worst case for one cell.
• O(n^3) for the algorithm.
• In practice, only 56% more time.
• “Efficient” algorithm takes O(n) memory.
• “Simple” algorithm takes O(n^2) memory.

[Plot: Matches / Sequence Length vs. Sequence Length, for LCS and Block LCS; sequence lengths 0 to 1000; 100 pairs evaluated at each data point. Program: PP Mean LCS-LCSP vs length.]
[Plot: LCS and Block LCS of 100,000 Random Pairs of Length 500.]

• Start with an empty code.
• Repeatedly generate random codewords and add them if they meet the distance requirement.
When to stop?
• After “n” trials?
• When “n” trials in a row have failed?
• When fewer than “i” of the last “n” trials have succeeded?
• When the size of the code is near a maximum predicted by theory?
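The “simple” O(n^3) recurrence and the random code construction described above can be combined in one short sketch. This is a hedged Python illustration, not the authors’ program. All names are hypothetical. The recurrence is written in a slightly simplified form, LCSP(i - k, j - k) + k for end counts k >= 2, on the assumption that appending a block of at least two tail matches to any valid matching never creates a lone bond. The stopping rule is one of the options listed above (a fixed number of failed trials in a row). Reverse-complement constraints are not checked.

```python
import random


def lcsp_length(x: str, y: str) -> int:
    """Longest common subsequence with no lone bonds (block LCS)."""
    n, m = len(x), len(y)
    tail = [[0] * (m + 1) for _ in range(n + 1)]  # tail equality of prefixes
    best = [[0] * (m + 1) for _ in range(n + 1)]  # LCSP of prefixes
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if x[i - 1] == y[j - 1]:
                tail[i][j] = tail[i - 1][j - 1] + 1
            # End count 0: the last symbols are not matched to each other.
            value = max(best[i - 1][j], best[i][j - 1])
            # End count k >= 2 (k = 1 would be a lone bond).
            for k in range(2, tail[i][j] + 1):
                value = max(value, best[i - k][j - k] + k)
            best[i][j] = value
    return best[n][m]


def block_distance(x: str, y: str) -> int:
    """Stacked pair metric d(x, y) = n - S(x, y) for equal-length strings."""
    return len(x) - lcsp_length(x, y)


def random_code(length, min_distance, max_failures=200, seed=0):
    """Greedy random construction; stops after max_failures misses in a row."""
    rng = random.Random(seed)
    code, failures = [], 0
    while failures < max_failures:
        candidate = "".join(rng.choice("ACGT") for _ in range(length))
        if all(block_distance(candidate, w) >= min_distance for w in code):
            code.append(candidate)
            failures = 0
        else:
            failures += 1
    return code
```

For x = ACGTCGAGC and y = GACGCTGAG, the ordinary LCS has length 7 but contains a lone C, so the block LCS keeps only the blocks ACG and GAG and has length 6. The memory use here is O(n^2), matching the “simple” variant.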
Empirical Relation Between Codeword Length and Code Size n Cq (n, d ) 2 1.5 0.088 for distance 8 0.104 for distance 7 3500 3000 Code size, predicted and observed vs word length at distance 8 2500 2000 1500 1000 Observed Predicted 500 0 10 12 14 16 18 20 22 24 26 28 6000 5000 Code size, predicted and observed vs word length at distance 7 4000 3000 2000 Observed Predicted 1000 0 8 10 12 14 16 18 20 22 24 26 A DNA Computing Paradigm Let Q1, Q2 ,...Qk be fixed subsets of {1,2,...,n}. a. Find all subsets S {1,2,...,n} with S Qi for i with 1 i k. b. Find all subsets T {1,2,...,n} with Qi T for i with 1 i k The identification of maximal frequent sets in data fields are the computational bottleneck in association rule discovery. This is an important problem and the independent sets and maximal cliques problems fit this paradigm. DNA Code and DNA Bitstring Library A A A A A A A A C C=T1 G G T T T T T T T T =BEAD PROBE (T1) T T T C C A A A A A =F1 T T T T T G G A A A = BEAD PROBE (F1) T T T C T T A A C C=T2 G G T T A A G A A A= BEAD PROBE (T2) A C T A A C A A A A=F2 T T T T G T T A G T= BEAD PROBE (F2) C A T A A A A C A C=T3 G T G T T T T A T G= BEAD PROBE (T3) A T C T T T T C A A=F3 T T G A A A A G A T= BEAD PROBE (F3) C A A T C C A T T A=T4 T A A T G G A T T G= BEAD PROBE (T4) C C T T C T A A A T=F4 A T T T A G A A G G= BEAD PROBE (F4) A C T C C T A A T A=T5 T A T T A G G A G T= BEAD PROBE (T5) T C T C T C T A C T=F5 A G T A G A G A G A= BEAD PROBE (F5) DNA CODE 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. 16. 17. 18. 19. 20. 21. 22. 23. 24. 25. 26. 27. 28. 29. 30. 31. 32. 
A A A A A A A A C C -T T T C T T A A C C-C A T A A A A C A C-T4-T5 A A A A A A A A C C -T T T C T T A A C C-C A T A A A A C A C-T4-F5 A A A A A A A A C C -T T T C T T A A C C-C A T A A A A C A C-F4-T5 A A A A A A A A C C -T T T C T T A A C C-C A T A A A A C A C-F4-F5 A A A A A A A A C C -T T T C T T A A C C -A T C T T T T C A A-T4-T5 A A A A A A A A C C -T T T C T T A A C C -A T C T T T T C A A-T4-F5 A A A A A A A A C C -T T T C T T A A C C -A T C T T T T C A A-F4-T5 A A A A A A A A C C -T T T C T T A A C C -A T C T T T T C A A-F4-F5 A A A A A A A A C C-A C T A A C A A A A-C A T A A A A C A C-T4-T5 A A A A A A A A C C-A C T A A C A A A A-C A T A A A A C A C-T4-F5 A A A A A A A A C C-A C T A A C A A A A-C A T A A A A C A C-F4-T5 A A A A A A A A C C-A C T A A C A A A A-C A T A A A A C A C-F4-F5 A A A A A A A A C C-A C T A A C A A A A -A T C T T T T C A A-T4-T5 A A A A A A A A C C-A C T A A C A A A A -A T C T T T T C A A-T4-F5 A A A A A A A A C C-A C T A A C A A A A -A T C T T T T C A A-F4-T5 A A A A A A A A C C-A C T A A C A A A A -A T C T T T T C A A-F4-F5 T T T C C A A A A A -T T T C T T A A C C-C A T A A A A C A C-T4-T5 T T T C C A A A A A -T T T C T T A A C C-C A T A A A A C A C-T4-F5 T T T C C A A A A A -T T T C T T A A C C-C A T A A A A C A C-F4-T5 T T T C C A A A A A -T T T C T T A A C C-C A T A A A A C A C-F4-F5 T T T C C A A A A A -T T T C T T A A C C -A T C T T T T C A A-T4-T5 T T T C C A A A A A -T T T C T T A A C C -A T C T T T T C A A-T4-F5 T T T C C A A A A A -T T T C T T A A C C -A T C T T T T C A A-F4-T5 T T T C C A A A A A -T T T C T T A A C C -A T C T T T T C A A-F4-F5 T T T C C A A A A A -A C T A A C A A A A-C A T A A A A C A C-T4-T5 T T T C C A A A A A -A C T A A C A A A A-C A T A A A A C A C-T4-F5 T T T C C A A A A A -A C T A A C A A A A-C A T A A A A C A C-F4-T5 T T T C C A A A A A -A C T A A C A A A A-C A T A A A A C A C-F4-F5 T T T C C A A A A A -A C T A A C A A A A -A T C T T T T C A A-T4-T5 T T T C C A A A A A -A C T A A C A A A A -A T C T T 
T T C A A-T4-F5 T T T C C A A A A A -A C T A A C A A A A -A T C T T T T C A A-F4-T5 T T T C C A A A A A -A C T A A C A A A A -A T C T T T T C A A-F4-F5 DNA LIBRARY= DNA BITSTRINGS Example: Independent Sets and Cliques 3 Edges in G are {1,2}, {2,3}, {3,4}, {4,5},{1,4},{2,5} Edges in G’ are {1,3}, {1,5}, {2,4}, {3,5}, 4 2 3 3 2 G= G’= 4 1 1 5 2 1 4 5 5 An independent set is a collection of vertices that contains no edge. A clique is a subgraph were every pair of vertices has an edge between them. For a graph G, its complement G’ is the set of edges not in G A maximal independent set in G is a maximal clique in G’, e.g., {1,3,5}. 3 1 5 DNA Computing for Independent Sets and Cliques Let Q1, Q2 ,...Qk be fixed subsets of {1,2,...,n}. a. Find all subsets S {1,2,...,n} with S Qi for i with 1 i k. b. Find all subsets T {1,2,...,n} with Qi T for i with 1 i k 3 3 2 G= 4 1 G’= 2 1 4 5 5 Let {1,2},{1,4},{2,3},{2,5},{3,4},{4,5}=Q1,...,Q6 be fixed subsets of {1,2,...,5}. Finding all subsets T {1,2,...,n} with Qi T for i with 1 i 6, is finding all independent sets in G or all cliques in the complement G'. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. 16. 
A A A A A A A A C C -T T T C T T A A C C-C A T A A A A C A C-T4-T5 A A A A A A A A C C -T T T C T T A A C C-C A T A A A A C A C-T4-F5 A A A A A A A A C C -T T T C T T A A C C-C A T A A A A C A C-F4-T5 A A A A A A A A C C -T T T C T T A A C C-C A T A A A A C A C-F4-F5 A A A A A A A A C C -T T T C T T A A C C -A T C T T T T C A A-T4-T5 A A A A A A A A C C -T T T C T T A A C C -A T C T T T T C A A-T4-F5 A A A A A A A A C C -T T T C T T A A C C -A T C T T T T C A A-F4-T5 A A A A A A A A C C -T T T C T T A A C C -A T C T T T T C A A-F4-F5 A A A A A A A A C C-A C T A A C A A A A-C A T A A A A C A C-T4-T5 A A A A A A A A C C-A C T A A C A A A A-C A T A A A A C A C-T4-F5 A A A A A A A A C C-A C T A A C A A A A-C A T A A A A C A C-F4-T5 A A A A A A A A C C-A C T A A C A A A A-C A T A A A A C A C-F4-F5 A A A A A A A A C C-A C T A A C A A A A -A T C T T T T C A A-T4-T5 A A A A A A A A C C-A C T A A C A A A A -A T C T T T T C A A-T4-F5 A A A A A A A A C C-A C T A A C A A A A -A T C T T T T C A A-F4-T5 A A A A A A A A C C-A C T A A C A A A A -A T C T T T T C A A-F4-F5 17. 18. 19. 20. 21. 22. 23. 24. 25. 26. 27. 28. 29. 30. 31. 32. 
TTTCCAAAAA TTTCCAAAAA TTTCCAAAAA TTTCCAAAAA TTTCCAAAAA TTTCCAAAAA TTTCCAAAAA TTTCCAAAAA TTTCCAAAAA TTTCCAAAAA TTTCCAAAAA TTTCCAAAAA TTTCCAAAAA TTTCCAAAAA TTTCCAAAAA TTTCCAAAAA -T T T C T T A A C C-C A T A A A A C A C-T4-T5 -T T T C T T A A C C-C A T A A A A C A C-T4-F5 -T T T C T T A A C C-C A T A A A A C A C-F4-T5 -T T T C T T A A C C-C A T A A A A C A C-F4-F5 -T T T C T T A A C C -A T C T T T T C A A-T4-T5 -T T T C T T A A C C -A T C T T T T C A A-T4-F5 -T T T C T T A A C C -A T C T T T T C A A-F4-T5 -T T T C T T A A C C -A T C T T T T C A A-F4-F5 -A C T A A C A A A A-C A T A A A A C A C-T4-T5 -A C T A A C A A A A-C A T A A A A C A C-T4-F5 -A C T A A C A A A A-C A T A A A A C A C-F4-T5 -A C T A A C A A A A-C A T A A A A C A C-F4-F5 -A C T A A C A A A A -A T C T T T T C A A-T4-T5 -A C T A A C A A A A -A T C T T T T C A A-T4-F5 -A C T A A C A A A A -A T C T T T T C A A-F4-T5 -A C T A A C A A A A -A T C T T T T C A A-F4-F5 DNA Library 2^( # Coding Strands / 2) # Coding Strands / 2 Bits TTTTTGGAAA 24. T T T C C A A A A A -T T T C T T A A C C -A T C T T T T C A A-F4-F5 TTTTGTTAGT 10.A A A A A A A A C C-A C T A A C A A A A-C A T A A A A C A C-T4-F5 X1=F or X2=F TTTTGTTAGT TTTTTGGAAA 29. T T T C C A A A A A -A C T A A C A A A A -A T C T T T T C A A-T4-T5 All subsets not containing {1,2} T T T T T G G A A A=Probe(F1) Edge {1,2} STM X1=T and X2=T 1. 2. 3. 4. 5. 6. 7. 8. A A A A A A A A C C -T A A A A A A A A C C -T A A A A A A A A C C -T A A A A A A A A C C -T A A A A A A A A C C -T A A A A A A A A C C -T A A A A A A A A C C -T A A A A A A A A C C -T T T C T T A A C C-C A T A A A A C A C-T4-T5 T T C T T A A C C-C A T A A A A C A C-T4-F5 T T C T T A A C C-C A T A A A A C A C-F4-T5 T T C T T A A C C-C A T A A A A C A C-F4-F5 T T C T T A A C C -A T C T T T T C A A-T4-T5 T T C T T A A C C -A T C T T T T C A A-T4-F5 T T C T T A A C C -A T C T T T T C A A-F4-T5 T T C T T A A C C -A T C T T T T C A A-F4-F5 9. 10. 11. 12. 13. 14. 15. 16. 17. 18. 19. 20. 
A A A A A A A A C C-A C T A A C A A A A-C A T A A A A C A C-T4-T5 A A A A A A A A C C-A C T A A C A A A A-C A T A A A A C A C-T4-F5 A A A A A A A A C C-A C T A A C A A A A-C A T A A A A C A C-F4-T5 A A A A A A A A C C-A C T A A C A A A A-C A T A A A A C A C-F4-F5 A A A A A A A A C C-A C T A A C A A A A -A T C T T T T C A A-T4-T5 A A A A A A A A C C-A C T A A C A A A A -A T C T T T T C A A-T4-F5 A A A A A A A A C C-A C T A A C A A A A -A T C T T T T C A A-F4-T5 A A A A A A A A C C-A C T A A C A A A A -A T C T T T T C A A-F4-F5 T T T C C A A A A A -T T T C T T A A C C-C A T A A A A C A C-T4-T5 T T T C C A A A A A -T T T C T T A A C C-C A T A A A A C A C-T4-F5 T T T C C A A A A A -T T T C T T A A C C-C A T A A A A C A C-F4-T5 T T T C C A A A A A -T T T C T T A A C C-C A T A A A A C A C-F4-F5 21. T T T C C A A A A A -T T T C T T A A C C -A T C T T T T C A A-T4-T5 22. T T T C C A A A A A -T T T C T T A A C C -A T C T T T T C A A-T4-F5 23. T T T C C A A A A A -T T T C T T A A C C -A T C T T T T C A A-F4-T5 24. T T T C C A A A A A -T T T C T T A A C C -A T C T T T T C A A-F4-F5 25. T T T C C A A A A A -A C T A A C A A A A-C A T A A A A C A C-T4-T5 26. T T T C C A A A A A -A C T A A C A A A A-C A T A A A A C A C-T4-F5 27. T T T C C A A A A A -A C T A A C A A A A-C A T A A A A C A C-F4-T5 28. T T T C C A A A A A -A C T A A C A A A A-C A T A A A A C A C-F4-F5 29. T T T C C A A A A A -A C T A A C A A A A -A T C T T T T C A A-T4-T5 30. T T T C C A A A A A -A C T A A C A A A A -A T C T T T T C A A-T4-F5 31. T T T C C A A A A A -A C T A A C A A A A -A T C T T T T C A A-F4-T5 32. T T T C C A A A A A -A C T A A C A A A A -A T C T T T T C A A-F4-F5 DNA Library {1,2} {2,4} {1,3} {2,5} {1,4} {3,4} {1,5} {2,3} {3,5} {4,5} Black ON, Red OFF =Independent Sets in G Black OFF, Red ON =Cliques in G Universal DNA Computer for any Graph on n Vertices DNA Library Every Graph G on n vertices has G union G’= all possible pairs on n vertices. This enables the construction of a universal device. 
Each possible edge, {1,2}, {1,3}, ..., {n-2,n}, {n-1,n}, is an STM. Then, depending on the problem, the flow is directed by the edges present (or absent) in the given graph.
Edges in G ON, edges in G’ OFF = independent sets in G when the flow is completed.
Edges in G OFF, edges in G’ ON = cliques in G when the flow is completed.
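The flow through the edge STMs can be mimicked in software by starting from the full bitstring library and, for each edge that is ON, discarding every library strand in which both endpoint bits are T. The sketch below is a minimal Python simulation under that reading of the slides; the function name `filter_library` and the data layout are assumptions for illustration, and no chemistry is modeled.

```python
from itertools import product


def filter_library(n, active_edges):
    """Return the vertex subsets that survive every active edge filter.

    The library is all 2^n bitstrings (X1, ..., Xn). The filter for edge
    {u, v} removes strands with Xu = T and Xv = T, keeping those with
    Xu = F or Xv = F.
    """
    library = list(product([True, False], repeat=n))
    for u, v in active_edges:
        library = [bits for bits in library
                   if not (bits[u - 1] and bits[v - 1])]
    # Report each surviving bitstring as the set of vertices whose bit is T.
    return [frozenset(i + 1 for i, bit in enumerate(bits) if bit)
            for bits in library]


# The example graph G on 5 vertices and its complement G' from the slides.
edges_g = [(1, 2), (2, 3), (3, 4), (4, 5), (1, 4), (2, 5)]
edges_g_complement = [(1, 3), (1, 5), (2, 4), (3, 5)]

independent_sets_in_g = filter_library(5, edges_g)        # edges of G ON
cliques_in_g = filter_library(5, edges_g_complement)      # edges of G' ON
```

With the edges of G switched ON, the survivors include {1,3,5}, the maximal independent set named in the example. The empty set and single vertices also survive, since they contain no edge.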