Vyacheslav V. Rykov

Outline
• DNA Hybridization/Cross Hybridization
• DNA Codes
• Nearest Neighbor Thermodynamics
– complete computations
– bounds
• Overview of Applications and Purposes
• DNA Bitstring Library
• Biomolecular Computing
DNA Hybridization
• DNA strands are modeled by directed 3’--> 5’ sequences of letters
from the alphabet {A, C, G, T}
• (A, T) and (C, G) are complementary pairs.
• Two oppositely directed DNA sequences are capable of coalescing into
a duplex.
• Because an A (C) in one strand can (usually) only bind to a T (G) in
the oppositely directed strand, the greatest energy of duplex formation
is obtained when the two sequences are reverse-complements
(complements)
Orientation of single DNA strands is important for hybridization.
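The reverse-complement relation described above can be sketched in a few lines of Python. This is only an illustration: the names `COMPLEMENT`, `reverse_complement`, and `is_wc_pair` are hypothetical and not from the slides. Both strands are written 5' to 3'.

```python
# Complementary base pairs: (A, T) and (C, G).
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(strand: str) -> str:
    """Return the reverse-complement of a strand written 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(strand.upper()))


def is_wc_pair(x: str, y: str) -> bool:
    """True when y (written 5' to 3') is the reverse-complement of x."""
    return y.upper() == reverse_complement(x)


if __name__ == "__main__":
    print(reverse_complement("TACGCGACTTTC"))
    print(is_wc_pair("ATCAAACGATGC", "GCATCGTTTGAT"))
```

The two strands used here appear later in the deck as a coding strand and its probing complement.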
A DNA Code
Coding Strands
for Ligation
TACGCGACTTTC
ATCAAACGATGC
TGTGTGCTCGTC
ATTTTTGCGTTA,
CACTAAATACAA
GAAAAAGAAGAA,
5’
3’
Must Have
Must Avoid
5’TACGCGACTTTC3’
5’GAAAGTCGCGTA3’
TACGCGACTTTC
GCATCGTTTGAT
Probing Complement Strands
for Reading
GAAAGTCGCGTA
GCATCGTTTGAT
GACGAGCACACA
TAACGCAAAAAT
TTGTATTTAGTG
TTCTTCTTTTTC
5’
3’
ATCAAACGATGC
GCATCGTTTGAT
ATTTTTGCGTTA
GAAAAAGAAGAA
Watson Crick
(WC) Duplexes
Cross Hybridized
(CH) Duplexes
x=5’ ggCaCaTcatAct 3’
y=5’agatgGttAAccT 3’
5’ ggCaCaTcatAct 3’
3’ TccAAttGgtaga 5’
5’ ggCaCaTcatAct 3’
5’ AggTTaaCcatct 3’
5’agatgGttAAccT 3’ =y
DNA codes serve as universal components for biomolecular
computing. DNA codes are closed under reverse-complementation.
The strands in a DNA code have such binding specificity that a code
strand will only hybridize with its reverse-complement and will not
cross hybridize with any other code strand in the DNA code.
Such collections of strands are crucial to the success of biomolecular
computing and biomolecular nanotechnology.
Basic idea is to have correct, parallel and autonomous addressing
“Characterization of synthetic DNA bar codes in
Saccharomyces cerevisiae gene-deletion strains”
(Eason et al., PNAS).
DNA codes for self-assembly of any
components that can be attached
to DNA. Their size presents the potential
for increased complexity and location
control in nanostructures produced by
assembly that is driven by DNA duplex
formation. Fundamental physical limits and
increasing costs of fabrication facilities will
force alternatives to conventional
microelectronics manufacturing to be
developed.
In self-assembly, weak, local
interactions among molecular
components spontaneously
organize those components into
aggregates with properties that
range from simple to complex
DNA memory: The capacity and storage
density of such memories is potentially very
large. Information could
be mined through massively parallel template-matching
reactions. In addition, information
could be processed based upon context, and
information matched associatively based upon
content.
Interest in DNA computing was sparked in 1994 by Len
Adleman.
Adleman showed how DNA molecules can be used to solve
a mathematical problem (the Hamiltonian path problem).
DNA computing relies on the fact that DNA strands can be
represented as sequences of bases (4-ary sequences) and
on the property of hybridization.
In hybridization, errors can occur. Thus, error-correcting
codes are required for efficient synthesis of DNA strands
to be used in computing.
DNA Computing Strand Engineering
No codeword-codeword CH (cc-CH)
No codeword-probe CH (cp-CH)
No probe-probe CH (pp-CH)
A A A A A A A A C C=T1
G G T T T T T T T T =BEAD PROBE (T1)
A T C T T T T C A A=F3
T T G A A A A G A T= BEAD PROBE (F3)
T T T C C A A A A A =F1
T T T T T G G A A A = BEAD PROBE (F1)
C A A T C C A T T A=T4
T A A T G G A T T G= BEAD PROBE (T4)
T T T C T T A A C C=T2
G G T T A A G A A A= BEAD PROBE (T2)
C C T T C T A A A T=F4
A T T T A G A A G G= BEAD PROBE (F4)
A C T A A C A A A A=F2
T T T T G T T A G T= BEAD PROBE (F2)
A C T C C T A A T A=T5
T A T T A G G A G T= BEAD PROBE (T5)
C A T A A A A C A C=T3
G T G T T T T A T G= BEAD PROBE (T3)
T C T C T C T A C T=F5
A G T A G A G A G A= BEAD PROBE (F5)
Only Allowed Hybridizations
C C A A C C A A A A A A = T1
A A A A A A A C C A C C=F1
T T T T T C C T T C C A =T2
T T A C C T C A A A C C =F2
T T T C A C A A C T C C=T3
T T C A A T C C A C A A =F3
T C A C T C T C T C A A =T4
T C T T T C T C C T C T=F4
C A T C T C A C C A T C =T5
No cp-CH
T T T T T T G G T T G G=Probe(T1)
G G T G G T T T T T T T=Probe(F1)
T G G A A G G A A A A A=Probe(T2)
G G T T T G A G G T A A =Probe(F2)
G G A G T T G T G A A A=Probe(T3)
T T G T G G A T T G A A=Probe(F3)
T T G A G A G A G T G A=Probe(T4)
A G A G G A G A A A G A=Probe(F4)
G A T G G T G A G A T G=Probe(T5)
G T G T G T A G T G T T=Probe(F5)
A A C A C T A C A C A C =F5
No cc-CH
No pp-CH
DNA Computing Strand Engineering
No codeword cp-CH
T T C A A T C C A C A A =F3
T T G T G G A T T G A A=Probe(F3)
T C A C T C T C T C A A =T4
T T G A G A G A G T G A=Probe(T4)
T C T T T C T C C T C T=F4
A G A G G A G A A A G A=Probe(F4)
C A T C T C A C C A T C =T5
G A T G G T G A G A T G=Probe(T5)
A A C A C T A C A C A C =F5
G T G T G T A G T G T T=Probe(F5)
C C A A C C A A A A A A = T1
T T T T T T G G T T G G=Probe(T1)
A A A A A A A C C A C C=F1
G G T G G T T T T T T T=Probe(F1)
T T T T T C C T T C C A =T2
T G G A A G G A A A A A=Probe(T2)
T T A C C T C A A A C C =F2
G G T T T G A G G T A A =Probe(F2)
T T T C A C A A C T C C=T3
G G A G T T G T G A A A=Probe(T3)
PROBE(F2)
GGTTTGAGGTAA
C A A C C A A A A A A- T T A C C T C A A A C C- T T C A A T C C A C A A- T C A C T C T C T C A A - C A T C T C A C C A T C
Yes WC bonding
Yes, bitstring is F2
Good read
T1-F2-F3-T4-T5
1 0 0 1 1
PROBE(T2)
GGAGTTGTGAA
C A A C C A A A A A A- T T A C C T C A A A C C- T T C A A T C C A C A A- T C A C T C T C T C A A - C A T C T C A C C A T C
Darn! CH bonding
No, bitstring is not T2
Bad read
T1-F2-F3-T4-T5
DNA Computing Strand Engineering
No codeword pp-CH, cc-CH
GGTTTGAGGTAA
pp-CH
interferes with reading
PROBE(F2)
T T G A G A G A GT G
PROBE(T4)
PROBE(F2)
GGTTTGAGGTAA
bonding site
competition
C A A C C A A A A A A- T T A C C T C A A A C C- T T C A A T C C A C A A- T C A C T C T C T C A A - C A T C T C A C C A T C
cc-CH
interferes with separation and leads to unwanted library strand interaction
T1-F2-F3-T4-T5
F1-F2-T3-T4-F5
C A A C C A A A A A A- T T A C C T C A A A C C- T T C A A T C C A C A A- T C A C T C T C T C A A - C A T C T C A C C A T C
T T T C C A A A A-A T T A C C T C A A A C C- T T T C A C A A C T C C- T C A C T C T C T C A A - A A C A C T A C A C A C
Watson-Crick Nearest Neighbor Computation
        AA    AC    AG    AT    CA    CC    CG    CT    GA    GC    GG    GT    TA    TC    TG    TT
AA       1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
AC       0  1.44     0     0     0  0.52     0     0     0     0     0     0     0     0     0     0
AG       0  0.13  1.28     0     0     0     0     0     0     0     0     0     0     0     0     0
AT       0     0     0  0.88     0     0     0     0     0     0     0     0     0     0     0     0
CA       0     0     0     0  1.45     0     0     0     0     0     0     0     0     0     0     0
CC       0     0     0     0     0  1.84     0     0     0     0     0     0     0     0     0     0
CG       0     0     0     0  0.47  0.11  2.17     0     0     0     0     0     0     0     0     0
CT       0     0     0     0  0.12  0.32     0  1.28     0     0     0     0     0     0     0     0
GA       0     0     0     0     0     0     0     0   1.3  0.25     0     0     0     0     0     0
GC       0  0.59     0     0     0  1.11     0     0     0  2.24     0  0.27     0  0.25     0     0
GG       0     0  0.32     0     0     0  0.11     0     0  1.11  1.84  0.52     0     0     0     0
GT       0     0     0     0     0     0     0  0.13     0  0.59     0  1.44     0     0     0     0
TA       0     0     0     0     0     0     0     0     0     0     0     0  0.58     0     0     0
TC       0     0     0     0     0     0     0     0     0     0     0     0     0   1.3     0     0
TG       0     0  0.12     0     0     0  0.47     0     0     0     0     0     0     0  1.45     0
TT       0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     1
WC Duplex
5’ g g c a c a 3’
3’ c c g t g t 5’
Stacked pairs: GG 1.84, GC 2.24, CA 1.45, AC 1.44, CA 1.45
NNFE = 1.84 + 2.24 + 1.45 + 1.44 + 1.45 = 8.42
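The Watson-Crick sum above can be reproduced with a short Python sketch. It assumes that only the diagonal entries of the nearest-neighbor table matter for a perfect WC duplex, and that each entry is keyed by the dinucleotide read 5' to 3' on one strand. The names `WC_WEIGHT` and `wc_nnfe` are hypothetical.

```python
# Diagonal of the 16 x 16 nearest-neighbor table: the weight of each
# stacked pair in a perfect Watson-Crick duplex.
WC_WEIGHT = {
    "AA": 1.00, "AC": 1.44, "AG": 1.28, "AT": 0.88,
    "CA": 1.45, "CC": 1.84, "CG": 2.17, "CT": 1.28,
    "GA": 1.30, "GC": 2.24, "GG": 1.84, "GT": 1.44,
    "TA": 0.58, "TC": 1.30, "TG": 1.45, "TT": 1.00,
}


def wc_nnfe(strand: str) -> float:
    """Sum the stacked-pair weights over consecutive dinucleotides."""
    s = strand.upper()
    return sum(WC_WEIGHT[s[i:i + 2]] for i in range(len(s) - 1))


if __name__ == "__main__":
    # GG + GC + CA + AC + CA for the duplex 5' ggcaca 3'
    print(round(wc_nnfe("ggcaca"), 2))
```

This sketch covers only the stacked-pair terms shown on the slide; any other contribution to a duplex free energy is outside its scope.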
Cross Hybridized Nearest Neighbor Upper Bound Computation
5’ ggCaCaTcatAct 3’
3’ TccAAttGgtaga 5’
5’ ggCaCaTcatAct 3’
5’ AggTTaaCcatct 3’
Stacked-pair weights: 1.45 + 1.28 + 1.84 + 0.88
NNFE~<5.45
With the additional 0.27 term:
NNFE~<5.72
Intermolecular Interactions
Duplexes
loop
symmetric
loop
asymmetric
2.90 (5.3)
2.20 (5.3)
Intramolecular Interactions
CAAGACTTTTTGGTAGTAAA
***TTTCCC*********GGAA***GGGAAA***********TTCC***
NNFE~<5.66
NNFE~<5.66+ .59 + .32=6.57
5’ gg C a C a T c a t A ct 3’
3’ T cc AA t t G g t a
ga 5’
5’ ggCaCaTcatAct 3’
3’ TccAAttGgtaga 5’
5’ ggCaCaTcatAct 3’
5’ AggTTaaCcatc 3’
5’ gg C a C a T c a t A ct 3’
5’ A gg TT a a C c a t
ct 3’
Virtual
Stacked
Pairs
Virtual
Duplex
5’ GGCACATCATACT 3’
5’ AGTATGATGTGCC 3’
5’ AGGTTAACCATCT3’
5’ AGATGGTTAACCT3’
5’ GGCACATCATACT 3’
Nearest Neighbor Approx. Free Energy
of duplex formation (WC)
5’ AGTATGATGTGCC 3’
…= 18.8
1.84 2.24
1.45
5’ GGCACATCATACT 3’
5’ AGGTTAACCATCT3’
5’ ggCaCaTcatAct 3’
3’ TccAAttGgtaga 5’
5’ ggCaCaTcatAct 3’
5’ AggTTaaCcatct3’
5’ gg C a C a T c a t A ct 3’
3’ T cc AA t t G g t a
ga 5’
1.28
1.45
5’ gg C a C a T c a t A ct 3’
5’ A gg TT a a C c a t
ct 3’
1.84
0.88
NNFE
CH =6.45
[Figure: our FE bound vs. precise FE; length 16, 435 random; correlation = .737]
Let $F_q^n$ denote the set consisting of all vectors (codewords) of
length $n$ built over $A_q = \{0, 1, \dots, q-1\}$, i.e.
$x \in F_q^n \iff x = (x_1, x_2, \dots, x_n),\ x_i \in A_q$
Let $d : F_q^n \times F_q^n \to \{0, 1, 2, \dots\}$ be such that for all $x, y, z \in F_q^n$:
1) $d(x, y) = 0 \iff x = y$
2) $d(x, y) = d(y, x)$
3) $d(x, y) \le d(x, z) + d(z, y)$
Let $C_q(n, M, d) \subseteq F_q^n$ be such that for all $x, y \in C_q(n, M, d)$ with $x \ne y$:
$d(x, y) \ge d$
$|C_q(n, M, d)| = M$
$C_q(n, M, d)$ is referred to as a code of length n, size M, and minimum
distance d.
A sphere in $F_q^n$ centered at x having radius d:
$\mathrm{Sph}_q(n, x, d) = \{y : y \in F_q^n,\ d(x, y) = d\}$
Volume of the sphere around x, of radius d:
$V_q(n, x, d) = \sum_{i=0}^{d} |\mathrm{Sph}_q(n, x, i)| = |\{y : y \in F_q^n,\ d(x, y) \le d\}|$
A space is HOMOGENEOUS when the volume of a
sphere does not depend on where it is centered, i.e.
$(\forall x, y \in F_q^n)(\forall d,\ 0 \le d \le n)\ (V_q(n, x, d) = V_q(n, y, d))$
A space is NON-HOMOGENEOUS when the volume of a
sphere does depend on where it is centered.
Sequence $z = (z_1, z_2, \dots, z_k)$
is a subsequence of $x = (x_1, x_2, \dots, x_n)$
if and only if there exists a strictly increasing
sequence of indices $(i_1, i_2, \dots, i_k)$
such that $z_j = x_{i_j}$ for all $j$.
LCS(x, y)
is defined to be the set of longest common subsequences
of x and y
|LCS(x, y)|
is defined to be the length of the longest common
subsequence of x and y
Just what it says:
x = <A C G T C G A G C>
y = <G A C G C T G A G>
LCS(x, y) = {<A C G C G A G>, <A C G T G A G>}
|LCS(x, y)| = 7
Original Insertion-Deletion metric (Levenshtein 1966):
For $x, y \in F_q^n$:
$d(x, y) = (n - L(x, y)) + (n - L(x, y)) = 2(n - L(x, y))$
where $L(x, y)$ is the length of the longest common subsequence of x and y.
This metric results from the number of deletions and insertions that
need to be made to obtain ‘y’ from ‘x’.
For vectors that have the same length:
the number of deletions that will be made is:
$n - L(x, y)$
likewise, the number of insertions that will be made is:
$n - L(x, y)$
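As a minimal sketch of the insertion-deletion metric, the distance between two equal-length sequences can be computed from the LCS length. The function names `lcs_length` and `indel_distance` are hypothetical. The LCS routine keeps only one previous row, which is an implementation choice made here and not something the slides prescribe.

```python
def lcs_length(x: str, y: str) -> int:
    """Length of a longest common subsequence, using a rolling row."""
    prev = [0] * (len(y) + 1)
    for a in x:
        cur = [0]
        for j, b in enumerate(y, start=1):
            if a == b:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def indel_distance(x: str, y: str) -> int:
    """d(x, y) = 2 (n - L(x, y)) for two sequences of the same length n."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    return 2 * (len(x) - lcs_length(x, y))


if __name__ == "__main__":
    # The pair from the LCS example slide: n = 9 and L = 7.
    print(indel_distance("ACGTCGAGC", "GACGCTGAG"))
```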
Better Metric ?
• LCS is simple and easy to compute.
• LCS essentially is a count of the number of
base pairings between two sequences, and
thus does approximate bonding energy.
• Clue: if two base pairs bond, but neither their
neighbors to the right or left bond, it really
doesn’t contribute much.
• We might call such inconsequential bonds
“lone” bonds.
“Lone Bonds”
B B B B B B B B B B B B B B B B
B B B B B B B B B B B B B B B B
The red bonds are “lone bonds” that
don’t contribute to the binding energy.
Block LCS
The longest common subsequence
SUCH THAT:
If xi is matched to yj, THEN EITHER
xi-1 is matched to yj-1, OR
xi+1 is matched to yj+1
A common subsequence $z = (z_1, z_2, \dots, z_m)$, $2 \le m \le n$, is called a
common stacked pair subsequence of length $|z| = m$ between x
and y
if two elements $z_i, z_{i+1}$, $i = 1, 2, \dots, m-1$, are consecutive
in x and consecutive in y or, if they are non-consecutive in x and/or
non-consecutive in y, then $z_{i-1}, z_i$ and $z_{i+1}, z_{i+2}$ are consecutive
in x and y.
Let $S(x, y)$, $0 \le S(x, y) \le n$, denote the length $|z|$ of the
longest sequence occurring as a common stacked pair
subsequence z between sequences x and y.
The number $S(x, y)$ is called a similarity of blocks between x
and y. The metric is defined to be $d(x, y) = n - S(x, y)$.
We will be working in a NON-HOMOGENEOUS space
making the obtainment of exact formulas for sphere
volumes and code sizes VERY HARD.
[6] L. M. G. M. Tolhuizen (1997): The Generalized Varshamov-Gilbert Bound is Implied
by Turan’s Theorem, IEEE Transactions on Information Theory, 43:05.
Varshamov-Gilbert Lower Bound on Code Size in $F_q^n$ with any
metric:
$q^n \le |C_q^{\max}(n, M, d)| \cdot V_q^{\max}(n, d-1)$
Let G be a simple graph on $q^n$ vertices and $e$
edges. G contains an M-clique if:
$e > \left(1 - \frac{1}{M-1}\right) \frac{q^{2n}}{2}$
CLIQUES:
The edge set of G is constructed as follows:
an edge (x, y) exists in G if and only if
$d(x, y) \ge d$. The first question is: how many
edges does G have? This can be found by
taking spheres of radius d − 1 around each
vector and counting how many vectors are
outside the particular sphere. Since
edges will be double counted, we must divide
by 2:
$e = \frac{1}{2} \sum_{x \in F_q^n} \left(q^n - V_q(n, x, d-1)\right) = \frac{1}{2}\left(q^{2n} - q^n \cdot \frac{\sum_{x \in F_q^n} V_q(n, x, d-1)}{q^n}\right) = \frac{q^n}{2}\left(q^n - V_q^{avg}(n, d-1)\right)$
If:
$\left(1 - \frac{1}{M-1}\right) \frac{q^{2n}}{2} < \frac{q^n}{2}\left(q^n - V_q^{avg}(n, d-1)\right)$
Then there exists a code of size M.
$M < 1 + \frac{q^n}{V_q^{avg}(n, d-1)} \implies M \le |C_q^{\max}(n, d)|$
Let
$M = \left\lceil \frac{q^n}{V_q^{avg}(n, d-1)} \right\rceil$
Then:
$\frac{q^n}{V_q^{avg}(n, d-1)} \le M < \frac{q^n}{V_q^{avg}(n, d-1)} + 1$
Hence there exists a code of size M and so:
$\frac{q^n}{V_q^{avg}(n, d-1)} \le |C_q^{\max}(n, d)|$
The upper bound for the average sphere volume in this metric will
be:
avg
q
V
(n, d  1)  q
d 1

 n  d 1  
min  2 ( d 1) 1, 
 
 2 


j 1
n  d 

 j 1
j   min( j 1,d 1)  j  1 d  d  1  k 
   


j
 k max( 0, j d )  k  j  k 
j 


The Varshamov-Gilbert bound becomes:
q n d 1

 n  d 1  
min  2 ( d 1) 1, 
 
 2 


j 1
n  d 

 j 1
j   min( j 1,d 1)  j  1 d  d  1  k 
   


j
 k max( 0, j d )  k  j  k 
 Cqmax (n, d )
j 


       d=6               d=7               d=8               d=9              d=10
Length  Min.      Length  Min.      Length  Min.      Length  Min.      Length  Min.
(n)     size      (n)     size      (n)     size      (n)     size      (n)     size
15            8   15            2   20            4   20            1   25            2
16           15   16            3   21            7   21            2   26            3
17           28   17            5   22           12   22            2   27            5
18           53   18            8   23           21   23            4   28            8
19          107   19           13   24           39   24            6   29           15
20          223   20           24   25           75   25           10   30           27
21          479   21           46   26          149   26           18
22         1055   22           90   27          304   27           33
23         2386   23          183   28          635   28           62
24         5524   24          381   29         1354   29          121
25        13068   25          815   30         2946   30          243
26        31545   26         1783
27        77600   27         3988
28      1943016   28         9102
29       494758   29        21174
30      1279652   30        50155
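Thermodynamic weight of virtual stacked pairs.
(Row: first base of the pair; column: second base. Values agree with the diagonal of the 16 x 16 nearest-neighbor table.)
        A      C      G      T
A    1.00   1.44   1.28   0.88
C    1.45   1.84   2.17   1.28
G    1.30   2.24   1.84   1.44
T    0.58   1.30   1.45   1.00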
•Can use statistical estimation of sphere volume.
Case 1:
sequences end with the same
symbol
A C G C G T T A
C T G A T A C A
Get LCS of this
match
and add 1 for the
A’s that have to
Case 2:
sequences end with different
symbols
A C G C G T T A
C T G A T A C C
Take the best LCS of these two
Solve Problem Recursively
• If x(i) and y(j) end with the same symbol, say
A, then:
LCS(x(i), y(j)) = LCS(x(i – 1), y(j – 1)) + A
• If x(i) and y(j) do NOT end with the same symbol,
then:
LCS(x(i), y(j)) = max[LCS(x(i – 1), y(j)),
LCS(x(i), y(j – 1))]
• Inefficient: we keep evaluating the
same LCS(i, j) over and over.
• Instead, use dynamic
programming.
• Fill in a table of LCS(i, j) values by
i and j.
• You only have to figure each
LCS(i, j) once.
• O(n^2).
     A  C  T  G  A  C  T
A    1  1  1  1  1  1  1
G    1  1  1  2  2  2  2
C    1  2  2  2  2  3  3
T    1  2  3  3  3  3  4
A    1  2  3  3  4  4  4
C    1  2  3  3  4  5  5
G    1  2  3  4  4  5  5
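The table-filling procedure can be sketched as follows. The function name `lcs_table` is hypothetical. Row 0 and column 0 hold the empty-prefix values, so cell (i, j) is the LCS length of the first i row symbols and the first j column symbols.

```python
def lcs_table(rows: str, cols: str) -> list[list[int]]:
    """Fill the dynamic programming table of LCS lengths, one cell at a time."""
    table = [[0] * (len(cols) + 1) for _ in range(len(rows) + 1)]
    for i, r in enumerate(rows, start=1):
        for j, c in enumerate(cols, start=1):
            if r == c:
                # Same last symbol: extend the LCS of the two shorter prefixes.
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                # Different last symbols: take the better of the two neighbors.
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table


if __name__ == "__main__":
    # Row sequence AGCTACG against column sequence ACTGACT, as in the table.
    for label, row in zip("AGCTACG", lcs_table("AGCTACG", "ACTGACT")[1:]):
        print(label, *row[1:])
```

Each cell is computed once from three already-filled neighbors, which is where the O(n^2) running time comes from.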
In terms of dynamic programming
table:
A C G T A A C
G
C
Cell we are
trying to
figure out
T
G
C
A
Information we
use
Stacked pair metric
The longest common subsequence SUCH
THAT there are no lone bonds.
If xi is matched to yj, THEN EITHER
xi-1 is matched to yj-1, OR
xi+1 is matched to yj+1
Cannot “break” a block LCS
Big regular LCS:
A C T G C T
G A C G C T
Break to get two smaller regular LCS’s:
A C T
G C T
G A C
G C T
Cannot “break” a block LCS
Big block LCS:
G G T A G G
C C T A C C
CANNOT break to get two smaller block
LCS’s:
G G T
A G G
C C T
A C C
Adding a single symbol to a string
can have effects arbitrarily far back
A C T C C C C T
These three bonds make
the LCSP.
G G G G G A C T G
A C T C C C C T G
G G G G G A C T G
Add just one symbol, G,
and the red bond must be
moved to make the new
LCSP.
Two algorithms
“Efficient”
examines cases of INPUT:
the two sequences
“Simple”
examines cases of OUTPUT:
the matching
Tail equality of two sequences
Tail equality 3:
A G C T C
A T C T C
Tail equality 0:
A G C T G
A T C T A
End count of a matching
End count 2:
A G C T C
A T C T C
End count 0:
A G C T G
A T C T A
• The end count of a matching between x and y
cannot exceed the tail equality of x and y.
• Let LCSP(k)(i, j) be the length of the longest
LCSP(i, j) achievable with a matching of end
count k.
$\mathrm{LCSP}(i, j) = \max_{0 \le k \le e,\ k \ne 1} \mathrm{LCSP}^{(k)}(i, j)$
• where e is the tail equality of x and y.
Example: Figure LCSP(3):
A
C
T
G
C
T
A
T
A
C
T
G
C
T
A
T
best of these two + 3
$\mathrm{LCSP}^{(k)}(i, j) = \max[\mathrm{LCSP}(i-k-1, j-k),\ \mathrm{LCSP}(i-k, j-k-1)] + k$
Substituting:
$\mathrm{LCSP}(i, j) = \max_{0 \le k \le e,\ k \ne 1} \left\{ \max[\mathrm{LCSP}(i-k-1, j-k),\ \mathrm{LCSP}(i-k, j-k-1)] + k \right\}$
• O(n) worst case for one cell.
• O(n^3) for algorithm.
• In practice, only 56% more time.
• “Efficient” algorithm takes O(n) memory.
• “Simple” algorithm takes O(n^2) memory.
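A minimal Python sketch of a block LCS computation follows. It is one cubic-time dynamic program that forbids lone bonds by only ever adding matched runs of length at least 2. It is not claimed to be either of the two algorithms named on the slides, and the names `tail_equality` and `block_lcs_length` are hypothetical. It uses O(n^2) memory.

```python
def tail_equality(x: str, y: str) -> int:
    """Length of the longest common suffix of x and y."""
    k = 0
    while k < len(x) and k < len(y) and x[-1 - k] == y[-1 - k]:
        k += 1
    return k


def block_lcs_length(x: str, y: str) -> int:
    """Longest common subsequence in which every match has a matched neighbor."""
    n, m = len(x), len(y)
    # tail[i][j] is the tail equality of the prefixes x[:i] and y[:j].
    tail = [[0] * (m + 1) for _ in range(n + 1)]
    best = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if x[i - 1] == y[j - 1]:
                tail[i][j] = tail[i - 1][j - 1] + 1
            # End count 0: the last symbol of one prefix is left unmatched.
            value = max(best[i - 1][j], best[i][j - 1])
            # End count k >= 2: close the matching with a block of k pairs.
            for k in range(2, tail[i][j] + 1):
                value = max(value, best[i - k][j - k] + k)
            best[i][j] = value
    return best[n][m]


if __name__ == "__main__":
    print(block_lcs_length("ACTCCCCT", "GGGGGACT"))
    print(block_lcs_length("ACTCCCCTG", "GGGGGACTG"))
```

The two calls use the pair of strings from the slide on adding a single symbol: appending G raises the block LCS from the single block ACT to the two blocks AC and TG.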
[Figure: Matches / Sequence Length vs. Sequence Length, for LCS and Block LCS; 100 pairs evaluated at each data point; sequence lengths 0 to 1000.]
[Figure: LCS and Block LCS of 100,000 Random Pairs of Length 500.]
• Start with empty code.
• Repeatedly generate random codewords and add them if they meet
the distance requirement.
When to stop?
• After “n” trials?
• When “n” trials in a row have failed?
• When fewer than “i” of the last “n” trials have succeeded?
• When the size of the code is near a maximum predicted by
theory?
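The random construction above can be sketched as follows, using the second stopping rule (a fixed number of failed trials in a row). Everything named here is an assumption for illustration: `random_code`, `max_failures`, and the `distance` parameter are hypothetical, and Hamming distance is only a stand-in so the block stays self-contained. A block or insertion-deletion metric would be passed in its place, and closure under reverse-complementation is not enforced.

```python
import random


def hamming(x: str, y: str) -> int:
    """Stand-in distance: number of positions where x and y differ."""
    return sum(a != b for a, b in zip(x, y))


def random_code(n, d, max_failures, distance=hamming, alphabet="ACGT", seed=0):
    """Add random words of length n while they keep pairwise distance >= d."""
    rng = random.Random(seed)
    code = []
    failures = 0
    # Stop when max_failures trials in a row have failed.
    while failures < max_failures:
        word = "".join(rng.choice(alphabet) for _ in range(n))
        if all(distance(word, c) >= d for c in code):
            code.append(word)
            failures = 0
        else:
            failures += 1
    return code


if __name__ == "__main__":
    code = random_code(n=8, d=4, max_failures=200)
    print(len(code), code[:3])
```

The result is a lower bound on the achievable code size for the chosen metric, and it depends on the seed and on the stopping rule.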
Empirical Relation Between
Codeword Length and Code Size
n
Cq (n, d )  2
1.5
  0.088 for distance 8
  0.104 for distance 7
[Figure: Code size, predicted and observed, vs word length at distance 8.]
[Figure: Code size, predicted and observed, vs word length at distance 7.]
A DNA Computing Paradigm
Let $Q_1, Q_2, \dots, Q_k$ be fixed subsets of {1,2,...,n}.
a. Find all subsets $S \subseteq \{1, 2, \dots, n\}$ with $S \not\subseteq Q_i$ for i with $1 \le i \le k$.
b. Find all subsets $T \subseteq \{1, 2, \dots, n\}$ with $Q_i \not\subseteq T$ for i with $1 \le i \le k$.
The identification of maximal frequent sets in data fields is the
computational bottleneck in association rule discovery. This is
an important problem, and the independent sets and maximal
cliques problems fit this paradigm.
DNA Code and DNA Bitstring Library
A A A A A A A A C C=T1
G G T T T T T T T T =BEAD PROBE (T1)
T T T C C A A A A A =F1
T T T T T G G A A A = BEAD PROBE (F1)
T T T C T T A A C C=T2
G G T T A A G A A A= BEAD PROBE (T2)
A C T A A C A A A A=F2
T T T T G T T A G T= BEAD PROBE (F2)
C A T A A A A C A C=T3
G T G T T T T A T G= BEAD PROBE (T3)
A T C T T T T C A A=F3
T T G A A A A G A T= BEAD PROBE (F3)
C A A T C C A T T A=T4
T A A T G G A T T G= BEAD PROBE (T4)
C C T T C T A A A T=F4
A T T T A G A A G G= BEAD PROBE (F4)
A C T C C T A A T A=T5
T A T T A G G A G T= BEAD PROBE (T5)
T C T C T C T A C T=F5
A G T A G A G A G A= BEAD PROBE (F5)
DNA CODE
1. A A A A A A A A C C-T T T C T T A A C C-C A T A A A A C A C-T4-T5
2. A A A A A A A A C C-T T T C T T A A C C-C A T A A A A C A C-T4-F5
3. A A A A A A A A C C-T T T C T T A A C C-C A T A A A A C A C-F4-T5
4. A A A A A A A A C C-T T T C T T A A C C-C A T A A A A C A C-F4-F5
5. A A A A A A A A C C-T T T C T T A A C C-A T C T T T T C A A-T4-T5
6. A A A A A A A A C C-T T T C T T A A C C-A T C T T T T C A A-T4-F5
7. A A A A A A A A C C-T T T C T T A A C C-A T C T T T T C A A-F4-T5
8. A A A A A A A A C C-T T T C T T A A C C-A T C T T T T C A A-F4-F5
9. A A A A A A A A C C-A C T A A C A A A A-C A T A A A A C A C-T4-T5
10. A A A A A A A A C C-A C T A A C A A A A-C A T A A A A C A C-T4-F5
11. A A A A A A A A C C-A C T A A C A A A A-C A T A A A A C A C-F4-T5
12. A A A A A A A A C C-A C T A A C A A A A-C A T A A A A C A C-F4-F5
13. A A A A A A A A C C-A C T A A C A A A A-A T C T T T T C A A-T4-T5
14. A A A A A A A A C C-A C T A A C A A A A-A T C T T T T C A A-T4-F5
15. A A A A A A A A C C-A C T A A C A A A A-A T C T T T T C A A-F4-T5
16. A A A A A A A A C C-A C T A A C A A A A-A T C T T T T C A A-F4-F5
17. T T T C C A A A A A-T T T C T T A A C C-C A T A A A A C A C-T4-T5
18. T T T C C A A A A A-T T T C T T A A C C-C A T A A A A C A C-T4-F5
19. T T T C C A A A A A-T T T C T T A A C C-C A T A A A A C A C-F4-T5
20. T T T C C A A A A A-T T T C T T A A C C-C A T A A A A C A C-F4-F5
21. T T T C C A A A A A-T T T C T T A A C C-A T C T T T T C A A-T4-T5
22. T T T C C A A A A A-T T T C T T A A C C-A T C T T T T C A A-T4-F5
23. T T T C C A A A A A-T T T C T T A A C C-A T C T T T T C A A-F4-T5
24. T T T C C A A A A A-T T T C T T A A C C-A T C T T T T C A A-F4-F5
25. T T T C C A A A A A-A C T A A C A A A A-C A T A A A A C A C-T4-T5
26. T T T C C A A A A A-A C T A A C A A A A-C A T A A A A C A C-T4-F5
27. T T T C C A A A A A-A C T A A C A A A A-C A T A A A A C A C-F4-T5
28. T T T C C A A A A A-A C T A A C A A A A-C A T A A A A C A C-F4-F5
29. T T T C C A A A A A-A C T A A C A A A A-A T C T T T T C A A-T4-T5
30. T T T C C A A A A A-A C T A A C A A A A-A T C T T T T C A A-T4-F5
31. T T T C C A A A A A-A C T A A C A A A A-A T C T T T T C A A-F4-T5
32. T T T C C A A A A A-A C T A A C A A A A-A T C T T T T C A A-F4-F5
DNA LIBRARY=
DNA BITSTRINGS
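As an illustration of how the 32 library strands arise from the 10 coding strands, the following Python sketch concatenates one of T_i or F_i for each bit position. The names `CODE` and `build_library` are hypothetical; the codeword sequences are the ones listed in the DNA code above.

```python
from itertools import product

# The ten coding strands of the DNA code, written without spaces.
CODE = {
    "T1": "AAAAAAAACC", "F1": "TTTCCAAAAA",
    "T2": "TTTCTTAACC", "F2": "ACTAACAAAA",
    "T3": "CATAAAACAC", "F3": "ATCTTTTCAA",
    "T4": "CAATCCATTA", "F4": "CCTTCTAAAT",
    "T5": "ACTCCTAATA", "F5": "TCTCTCTACT",
}


def build_library(code: dict, bits: int = 5) -> list[tuple[str, str]]:
    """Return (label, strand) pairs for every T/F assignment of the bits."""
    library = []
    for choice in product("TF", repeat=bits):
        labels = [f"{value}{position}" for position, value in enumerate(choice, start=1)]
        strand = "".join(code[label] for label in labels)
        library.append(("-".join(labels), strand))
    return library


if __name__ == "__main__":
    library = build_library(CODE)
    print(len(library))        # 2 ** (number of coding strands / 2)
    print(*library[0])
    print(*library[23])        # strand 24 in the numbered list
```

The enumeration order of `product("TF", repeat=5)` happens to match the numbering of the list above, with T before F at each position.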
Example: Independent Sets and Cliques
Edges in G are
{1,2}, {2,3}, {3,4},
{4,5}, {1,4}, {2,5}
Edges in G’ are
{1,3}, {1,5}, {2,4},
{3,5}
[Figure: the graphs G and G’ on vertices 1 to 5.]
An independent set is a collection of vertices that contains no edge.
A clique is a subgraph where every pair of vertices has an edge between them.
For a graph G, its complement G’ is the set of edges not in G.
A maximal independent set in G is a maximal clique in G’, e.g., {1,3,5}.
DNA Computing for Independent Sets and Cliques
Let $Q_1, Q_2, \dots, Q_k$ be fixed subsets of {1,2,...,n}.
a. Find all subsets $S \subseteq \{1, 2, \dots, n\}$ with $S \not\subseteq Q_i$ for i with $1 \le i \le k$.
b. Find all subsets $T \subseteq \{1, 2, \dots, n\}$ with $Q_i \not\subseteq T$ for i with $1 \le i \le k$.
[Figure: the graphs G and G’ on vertices 1 to 5.]
Let {1,2},{1,4},{2,3},{2,5},{3,4},{4,5} = $Q_1, \dots, Q_6$ be fixed subsets of {1,2,...,5}.
Finding all subsets $T \subseteq \{1, 2, \dots, n\}$ with $Q_i \not\subseteq T$ for i with $1 \le i \le 6$ is finding all independent sets
in G or all cliques in the complement G'.
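The filtering idea can be simulated in software as a rough sketch. Each bitstring of length n stands for one library strand, and each edge {i, j} acts as one separation step that keeps only the strings with X_i = F or X_j = F. The function name `independent_sets` is hypothetical, and the simulation says nothing about the laboratory steps themselves.

```python
from itertools import product


def independent_sets(n: int, edges: list[tuple[int, int]]) -> list[frozenset]:
    """Keep the bitstrings that contain no edge; return them as vertex sets."""
    # Start from the full library of 2 ** n bitstrings.
    survivors = list(product([True, False], repeat=n))
    for i, j in edges:
        # One separation step per edge: discard strings with X_i = T and X_j = T.
        survivors = [bits for bits in survivors if not (bits[i - 1] and bits[j - 1])]
    return [frozenset(v for v in range(1, n + 1) if bits[v - 1]) for bits in survivors]


if __name__ == "__main__":
    edges_g = [(1, 2), (1, 4), (2, 3), (2, 5), (3, 4), (4, 5)]
    sets = independent_sets(5, edges_g)
    print(len(sets))
    print(sorted(max(sets, key=len)))
```

Running the same function on the edges of G' instead of G yields the cliques of G.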
DNA Library
2^( # Coding Strands / 2)
# Coding Strands / 2 Bits
TTTTTGGAAA
24. T T T C C A A A A A -T T T C T T A A C C -A T C T T T T C A A-F4-F5
TTTTGTTAGT
10.A A A A A A A A C C-A C T A A C A A A A-C A T A A A A C A C-T4-F5
X1=F or X2=F
TTTTGTTAGT TTTTTGGAAA
29. T T T C C A A A A A -A C T A A C A A A A -A T C T T T T C A A-T4-T5
All subsets
not containing
{1,2}
T T T T T G G A A A=Probe(F1)
Edge {1,2} STM
X1=T and X2=T
DNA Library
{1,2}
{2,4}
{1,3}
{2,5}
{1,4}
{3,4}
{1,5}
{2,3}
{3,5}
{4,5}
Black ON, Red OFF =Independent Sets in G
Black OFF, Red ON =Cliques in G
Universal DNA Computer for any
Graph on n Vertices
DNA Library
Every Graph G on n vertices has
G union G’= all possible pairs on
n vertices. This enables the construction
of a universal device.
{1,2}
{1,3}
Each possible edge is an STM. Then depending
on the problem, the flow is directed by the edges
present (or absent) in the given graph
.
{n-2,n}
{n-1,n}
Edges in G ON, Edges in G’ OFF =Independent Sets in G
when flow completed
Edges in G OFF, Edges in G’ ON =Cliques in G
when flow completed