PowerPoint Presentation - Computationally

DNA7
Experimental Construction of Very
Large Scale DNA Databases
with Associative Search Capability
John H. Reif (1), Thomas H. LaBean (1), Michael Pirrung (2), Vipul S. Rana (2), Bo Guo (1), Carl Kingsford (1), and Gene S. Wickham (1)
Departments of (1) Computer Science and (2) Chemistry
Duke University
Goal:
Construction of a
Petabit DNA Database with Associative
Search Capability
Each database element encoded by a DNA molecule.
10^15 database elements stored (with 10-fold redundancy).
10 milligrams of DNA holds the entire database.
10^8 times the storage density of conventional storage media!
Massively parallel associative search via DNA annealing.
Parallel I/O to digital media can be done via optically addressed DNA arrays.
Organization of Talk:
Introduction to DNA & Overview of Biotechnology
The Associative Search Problem & Relevance
Overview of DNA Search Project
Preliminary Pre-Processing of Image Data Base
Computer Simulation of DNA Search
Design of the DNA Library Coding Sequences
Experimental Construction of the DNA Library
Experiments of Associative Search in DNA Library
High Rate Input/Output via DNA Chips
Current Status DNA Search Project
Future Work
Introduction to DNA &
Overview of Biotechnology
Extremely compact DNA storage:
1 gram of DNA:
contains 2.1 x 10^21 DNA bases
can store approximately
4.2 x 10^21 = 4.2 billion trillion bits.
Potential Data Storage Capacity of DNA:
a factor of 4.2 x 10^12 more compact than conventional storage technologies
Actual Experiments:
10^15 database elements stored (10-fold redundancy)
• use only 10 milligrams of DNA:
10^8 times more compact than conventional storage media.
• When in solution in 10 milliliters H2O:
10^5 times more compact than conventional storage media.
The 4 DNA bases form two sets of complementary pairs, known as
Watson-Crick complements.
Recombinant DNA Operations:
• DNA annealing operation:
two single-stranded DNA sections combine into a double-stranded DNA if the DNA bases of these sections are complementary to each other.
• DNA ligation:
two abutting single-stranded DNA sections are joined.
• Primer-extension operation:
a DNA strand known as a primer anneals to another DNA strand S which has the complement of that primer as a subsequence; an enzyme known as polymerase then extends that primer to form the full sequence complementary to S.
• The Polymerase Chain Reaction (PCR):
uses repeated primer-extension to amplify only those DNA strands with particular end flanking subsequences defined by the primers.
Biotechnology Techniques Used:
Recombinant DNA operations:
cut & splice DNA in massively parallel fashion
PCR: uses DNA annealing to amplify very small
quantities of DNA having a chosen sequence
DNA Chip
DNA Annealing Arrays provide:
surfaces for optically addressed
parallel DNA synthesis
optical detection of DNA sequences via
fluorescent labels
input and output of DNA databases to
conventional digital storage media.
MicroBeads provide:
surfaces for parallel DNA synthesis
[Figure: DNA chip with DNA hybridization.]
Associative Search
[Figure: Associative matching on n-vectors. Database vectors at distance < d from the query vector v match the query; database vectors at distance > d are non-matching.]
Database:
Ordered list of elements of n-vectors, whose elements range over a finite range.
Each vector of database has a unique identifying index in the database.
Associative Search Query:
Query vector v
Distance bound d.
distance(u,v) = |u_1 - v_1| + |u_2 - v_2| + … + |u_n - v_n|
Find distance d near-matches:
Search the entire database for those vectors of the database that are of distance at
most d from the query vector.
Closest match:
Find the index to a vector of the database of smallest distance from the query vector.
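The two query types above can be sketched on a conventional computer as follows. This is a minimal illustration, not the DNA procedure itself; the function names and the tiny example database are assumptions introduced here.

```python
from typing import List, Sequence


def distance(u: Sequence[int], v: Sequence[int]) -> int:
    """Sum of coordinate-wise absolute differences, as defined on the slide."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(abs(a - b) for a, b in zip(u, v))


def near_matches(database: List[Sequence[int]], query: Sequence[int], d: int) -> List[int]:
    """Indices of all database vectors within distance d of the query."""
    return [i for i, u in enumerate(database) if distance(u, query) <= d]


def closest_match(database: List[Sequence[int]], query: Sequence[int]) -> int:
    """Index of a database vector of smallest distance from the query."""
    return min(range(len(database)), key=lambda i: distance(database[i], query))


# Hypothetical three-element database of 3-vectors.
database = [(1, 5, 3), (2, 5, 4), (9, 0, 7)]
query = (2, 5, 3)
print(near_matches(database, query, 1))
print(closest_match(database, query))
```

The linear scan here visits every element in turn; the point of the DNA approach described in the talk is to perform the comparison against all elements in parallel.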
Associative Search in Image Database

Preprocess using a procedure A forming an attribute database:
• List of low-level image attributes for each image or sub-image.
Given an input image I:
• Use A to determine its vector A(I) of image attributes.
• Associative search in the attribute database provides the closest match to A(I).
• Provides an index to that image in the image database whose attributes best match those of the input image I.
DNA Associative Search: Massively parallel
associative search in extremely large databases encoded as DNA strands.
[Baum95]:
proposed using known recombinant DNA methods for DNA ligation affinity
separation.
Use of DNA words for vector elements
Each element of a vector of the database is encoded by a DNA word.
Each n-vector v of database encoded by sequence of n DNA words, followed by DNA
word for identifying index to v.
Advantages:
-Ultra-compact DNA storage media
-Supports highly parallel associative searches within media:
Possible methods:
Use known recombinant DNA methods for detection of matches:
- PCR for amplification:
uses DNA annealing to amplify the frequency of those DNA strands that have a particular chosen sequence.
- DNA ligation affinity separation.
- Scalable: if below maximum concentration, the number of recombinant DNA operations and the volume are independent of database size.
DNA Annealing as a Massively Parallel Associative Search Engine.
Major Challenges:
(a) Experimental Construction of a Large Scale DNA Library Data Base
(b) Experimental Testing of Large Scale DNA Associative Search
(c) Refining the Associative Search to Exact Affinity Separation:
• The query may not be an exact match or even partial match with any data in the database.
• DNA annealing affinity methods:
  - Work best annealing on complementary sequences.
  - Do not perform well for associative matching in case of partial matches with scattered mismatches in the interior of vectors.
(d) Input and Output (I/O) to Conventional Media:
• Goals: error-resiliency and optimal I/O rate for a given error rate.
(e) Extension to Include Boolean Conditionals:
• Extend associative search queries to Boolean formula conditionals (with a bounded number of Boolean variables), by combining our methods for DNA associative search with BMC (biomolecular computing) methods for solving the SAT problem.
• Example: extended queries executed on:
  - Natural DNA strands (from blood or other tissues)
  - Appended with DNA words encoding binary information about each strand (e.g., the social security number of the person whose DNA was sampled, cell type, the date, further medical data, etc.).
Our New Techniques for DNA associative search:
J.H. Reif and T. H. LaBean, Computationally Inspired
Biotechnologies: Improved DNA Synthesis and Associative
Search Using Error-Correcting Codes and Vector-Quantization,
Sixth International Meeting on DNA Based Computers (DNA6), DIMACS Series
in Discrete Mathematics and Theoretical Computer Science,
Leiden, The Netherlands, (June, 2000) ed. A. Condon. Springer-Verlag
volume in Lecture Notes in Computer Science, (2000).
URL: http://www.cs.duke.edu/~reif/paper/SELFASSEMBLE/selfassemble.pdf
We use improved biotechnology techniques based on Error-Correction and VQ Coding.
New Techniques:
• The database may initially be in conventional (electronic, magnetic, or optical) media, rather than in the form of DNA strands.
  Proposed Solution: Apply DNA chip technology improved by Error-Correction and VQ Coding methods for error-correction and compression.
• The query may not be an exact match or even partial match with any data in the database, but DNA annealing affinity methods work best on exact complementary matches.
  Proposed Solution: Apply various VQ Coding methods for refining the associative search to exact matches.
• Extend associative search queries in DNA databases to include Boolean formula conditionals (with bounded # of Boolean variables).
  Proposed Solution: Combine our methods for DNA associative search with known BMC methods for solving small size SAT problems.
Relevance to Image Databases:
•To Execute Associative Search in Digital Image
Databases of Huge Size.
(Potentially Thousands of Terabytes)
Constructing DNA Databases & Executing
Associative Search:
Preprocess Image Data:
- We use image segmentation, wavelet transforms and
vector quantization (via C code which we developed)
One-time conversion of digital database to DNA database:
- Can use parallel optical synthesis on DNA arrays
(via known biotechnology)
- For experimental purposes, we instead artificially
synthesized a huge DNA database.
Parallel Associative Search in DNA DataBase:
- Executed via PCR amplification and DNA Annealing.
Overview of DNA Search Project
Goals of Laboratory Experiments:
• Artificially synthesize random DNA database of huge size
• Test associative search in the DNA database
• Use 10-fold redundancy (10 identical DNA strands per database element).
• The distinct DNA for any chosen database element is amplified by PCR and detected.
Only 10 milligrams of DNA used in total.
• Can store 10^15 = 1,000 trillion data elements.
Can give 10^8 times the storage density of conventional storage media!
Preliminary Pre-Processing of Image
Data Base
• Software Preprocessing of Image Database
• Carl Kingsford, Graduate student, Princeton University
Initial Image Preprocessing on Conventional Computer
Image Segmentation:
- When Image database was created, the image was broken into tiles.
- Tiles are small subimages (typically 8 by 8 or 16 by 16 pixels) that were extracted from the image.
- For each pixel in the image:
a tile is extracted with its upper-left corner at that pixel.
Wavelet Transformation:
- Tiles are wavelet transformed and stored in a file on the server (called the DAT file).
Vector Quantization(VQ) Transformation:
- The index is encoded using the search template.
- The resulting tiles are then VQ encoded
- The VQ index as well as the position of the tile is encoded into DNA (as specified by the template).
(After you click search, the selected tile is wavelet transformed and VQ encoded. )
Vector Quantization (VQ) Coding
[Figure: Vector quantization. Each database vector is mapped to the center point of its cluster; the cluster radius is drawn around the center point.]

Partition vectors of database into clusters of vectors. For each cluster:
• The center vector is the average of all the vectors of the cluster.
• Radius of cluster = maximum distance between any vector of the cluster and the center vector.
• Cluster index uniquely identifies the cluster.
Well-known algorithms (Jain, Dubes [JD88]) compute clusters:
• Minimizing cluster radius
• Cluster size parameter m = average number of vectors in each cluster.
• Number of clusters is a multiple 1/m of original number of vectors of database.
Used in computer science for compressing data (e.g. speech and images) within bounded error.
• Each vector is approximated by the center point of its cluster and coded by the cluster index.
• VQ coding induces errors tuned by choice of parameter m.
• Data-rate/distortion is asymptotically optimal, assuming various statistical source models for the data (memoryless or finite-state stationary processes [Gray90]).
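The cluster quantities defined above (center vector, radius, cluster index, size parameter m) can be sketched as follows. This is a minimal illustration that assumes the clusters have already been chosen by some clustering algorithm; the helper names and the example vectors are hypothetical, and the distance is the same sum of absolute differences used for associative search.

```python
def l1(u, v):
    """Sum of coordinate-wise absolute differences."""
    return sum(abs(a - b) for a, b in zip(u, v))


def center(cluster):
    """Center vector: the average of all vectors of the cluster."""
    n = len(cluster)
    return tuple(sum(column) / n for column in zip(*cluster))


def radius(cluster):
    """Maximum distance between any vector of the cluster and its center."""
    c = center(cluster)
    return max(l1(u, c) for u in cluster)


def vq_encode(vector, centers):
    """Code a vector by the index of its nearest cluster center."""
    return min(range(len(centers)), key=lambda i: l1(vector, centers[i]))


# Hypothetical pre-computed clusters of 2-vectors.
clusters = [[(0, 0), (2, 0), (1, 3)], [(10, 10), (12, 10)]]
centers = [center(c) for c in clusters]
radii = [radius(c) for c in clusters]
m = sum(len(c) for c in clusters) / len(clusters)  # average cluster size

print(centers, radii, m)
print(vq_encode((11, 9), centers))
```

Replacing each vector by its cluster index is what introduces the bounded error: the approximation is off by at most the cluster radius.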
Applying VQ Coding Methods
[Figure: Vector quantization maps each database vector to the center point of its cluster, within the cluster radius.]
Applying VQ Coding Methods To Associative Search: Refining the Associative Search to Exact Matches:
• DNA annealing affinity methods work best on complementary sequences.
• Yet, we need to process an associative match query, even if the query is not an exact match or even partial match with any data in the database.
• We use VQ-Coding clustering techniques:
  - Reduces associative search problem to finding just exact matches via complementary hybridization.
  - Can be done very effectively by known DNA annealing methods (e.g., PCR).
To Increase DNA Chip I/O:
• Use VQ data clustering techniques to determine the clusters.
• Only the center points need to be transmitted (at 1/m the cost of transmitting the entire set of the database).
• Each vector v of the database is represented by a DNA strand encoding:
  - Identification tag for v and
  - Identification tag for center point of cluster containing v.
Reducing Associative Search with Given Match
Distance d to the Problem of Exact Match
"Possibility Vectors"
[Figure: Using VQ for associative matching. The query vector and a database vector are each mapped to a cluster center point; the database vector's center point is matched with the query.]

For each cluster G of database vectors:
• The "possibility vectors" of the cluster are those vectors that are within distance d of the center point of G.
• Query vector v will be included among the "possibility vectors" of those clusters whose centers are of distance at most d from v.
• Vectors in these clusters are at most distance 2d to v (by the triangle inequality, when the cluster radius is at most d), and they include all database vectors that are at most distance d to the query vector, as required.
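The reduction can be sketched in software as follows: the query is compared only against cluster centers, and every member of a selected cluster is returned. This is a minimal illustration under the assumption that each cluster radius is at most d, which is what gives the 2d bound; the function names and example data are hypothetical.

```python
def l1(u, v):
    """Sum of coordinate-wise absolute differences."""
    return sum(abs(a - b) for a, b in zip(u, v))


def candidates(clusters, centers, query, d):
    """Return the members of every cluster whose center is within d of the query."""
    selected = []
    for members, c in zip(clusters, centers):
        if l1(query, c) <= d:          # query is a "possibility vector" of this cluster
            selected.extend(members)
    return selected


# Hypothetical clusters with centers (1, 1) and (11, 10); radii are 2 and 1.
clusters = [[(0, 0), (2, 0), (1, 3)], [(10, 10), (12, 10)]]
centers = [(1, 1), (11, 10)]
d = 2
query = (2, 2)

found = candidates(clusters, centers, query, d)
print(found)
print(max(l1(u, query) for u in found))  # never more than 2 * d here
```

In the DNA setting, the comparison against a center becomes an exact match on the center's identification tag, which is the kind of match that annealing handles well.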
Software Simulation
of DNA Search
•Carl Kingsford
Graduate student,
Princeton University
•DNA Associative Search Simulation
- When you click the "Search" button, the tile is converted into a representation of DNA
- BIND is used to simulate hybridization between the search strand and each database strand
- The product is a set of DNA strands from the database, called the result set.
- Each of the strands in the result set is decoded into the VQ index and the tile position. A white box is drawn on the image at the tile position to indicate that this tile was found.
Software Simulation
of DNA Annealing
•Xavier Berni
Graduated Masters
Duke University
- Determine the Search Strand:
the DNA strand that represents the tile you are looking for
(created from image you selected in the image selection box).
This search strand is "dipped" into the DNA database that represents the image.
Simulation of DNA Annealing:
- between search strand and database strands
- Determines Probability Database Strand is Annealed to the Search Strand.
- BIND: Software Used for DNA Annealing Simulation
Conditions used by BIND simulation:
Temperature - the temperature of the solution.
Strand concentration - what percentage of the solution is this strand.
Salt concentration - what percentage of the solution is salt.
 Mathematical Annealing model: represents the binding energies between strands of DNA.
Max. Mismatch distance.
Input to BIND:
the conditions of the solution and two strands
Output from BIND: the likelihood that these strands will anneal.
Design of the DNA Library Coding
Sequences
DNA sequences of Hamming distance 1 from ACGT.
[Figure: 2D projection of a local region in sequence space.]
Neighboring sequences are shown for a central tetramer (ACGT) with substitutions in the first position to the north (CCGT, GCGT, TCGT), second to the east (AAGT, AGGT, ATGT), third position south (ACAT, ACCT, ACTT), and fourth west (ACGA, ACGC, ACGG).
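The neighborhood in the figure can be enumerated directly. The sketch below lists every sequence at Hamming distance 1 from a given word; the function name is an assumption introduced for illustration.

```python
def hamming_neighbors(seq, alphabet="ACGT"):
    """All sequences that differ from seq in exactly one position."""
    neighbors = []
    for i, base in enumerate(seq):
        for substitute in alphabet:
            if substitute != base:
                neighbors.append(seq[:i] + substitute + seq[i + 1:])
    return neighbors


neighbors = hamming_neighbors("ACGT")
print(len(neighbors))
print(neighbors)
```

A word of length n over 4 bases has 3n such neighbors, so a tetramer has 12, matching the sequences shown around ACGT.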
DNA Word Encoding:
• Vectors of the database are encoded by single-stranded DNA sequences.
• DNA sequences use 4 bases, but we use a base-12 encoding:
  - Each block has 12 distinct 5-base DNA words.
  - A number is encoded by a DNA sequence with consecutive blocks of 5 bases.
[Figure: Consecutive blocks (BLOCK 1, BLOCK 2, BLOCK 3, …), each listing its 12 candidate words with labels of the form 1.1, 1.2, 1.3, 1.4, …, 1.12.]
Redundant Encoding of each Database Element:
• use approx. 10 identical single-stranded DNAs per element
Word Design to Minimize DNA Annealing Mismatches:
To discriminate exact matches:
Distinct DNA words in block differ by at least 3 DNA bases
Data Base Values:
Each database element holds only a single bit value:
value is 1 <=> element is in the library.
Further values can easily be appended to a DNA strand using flanking sequences
Simple Example of
DNA Code Word Design:
[Figure: Example strand built from six blocks of 4-base words (Block 1 to Block 6), with one word chosen per block, e.g. CAAT - TCTT - ATCA. The block region is flanked by CACCACCTTTAAACCTCC, which is attached to the bead and contains the Dra I site, and by CCACATTAATCCTCCACC, which contains the Ase I site. Strand ends are labeled 3' and 5'.]
Each DNA Sequence uses distinct elements of Blocks
[Figure: Schematic with BLOCK 1 to BLOCK 4, each offering four words (Word1A to Word1D, …, Word4A to Word4D); a sequence takes one word from each block.]
Words available in each block of the six-block example:
Block 1: ATAC CAAT ACTA TTCA
Block 2: AACT TCTT TTAC CATA
Block 3: AAAC ATCA TCTA CATT
Block 4: TATC CAAA CTTT ACAT
Block 5: AATC ACAA CTAT TTCA
Block 6: CTAA TACT TTTC AAAC
DNA Library Size:
[Figure: Library Diversity. Log(Diversity) plotted against Number of Blocks, with one curve each for 4, 6, 8, 10, and 12 words per block.]
Library Size = [word count]^[block count]
Initial Library: Size = 12^7
• Blocks = 7
• Words per Block = 12
Scaling of Library Size with:
• # Blocks
• Words/Block
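The scaling rule can be checked with a few lines of arithmetic. The function name below is an assumption introduced for illustration; the parameter values (12 words per block, 7 blocks, then 14 blocks after pairing strands) are the ones used in the talk.

```python
import math


def library_size(words_per_block: int, blocks: int) -> int:
    """Number of distinct strands: one word chosen independently per block."""
    return words_per_block ** blocks


initial = library_size(12, 7)      # initial mix-and-split library
squared = library_size(12, 14)     # after combining pairs of strands

print(f"{initial:,}")
print(f"{squared:.3e}")
print(round(math.log10(squared), 2))  # the quantity plotted as Log(Diversity)
```

Adding a block multiplies the size by the word count, while adding a word to every block raises the base of the exponent, which is why the curves in the diversity plot fan out as the number of blocks grows.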
DNA Library Design:
Used extensive computer search for good DNA code words:
• Minimize melting temperature (Tm) difference between words so hybridization of multiple words proceeds simultaneously
• At least 3 base mismatches between word and complements in blocks
• Avoid frame shift binding errors
DNA Sequences Used for Library:
Block 1  Block 2  Block 3  Block 4  Block 5  Block 6  Block 7
AAACC    AATCC    AACCA    AACCT    AATCC    ACACA    AACCA
ACCAA    ACACT    ACATC    ACCTA    ACAAC    ACCAT    ACACT
ACTCT    ATCAC    ACCAT    ACTAC    ACCTT    ATCTC    ACTTC
ATCTC    CAAAC    ATTCC    ATACC    ATCCA    CACAA    ATCAC
CATAC    CCATA    CACTT    CAAAC    CAACT    CATTC    CATAC
CCTTA    CCTAT    CATAC    CCATT    CCATA    CCATT    CCAAA
CTACA    CTCTT    CCAAA    CTCAA    CTCAT    CTAAC    CTCTA
CTCAT    CTTCA    CTACT    CTTCT    CTTTC    CTTCA    CTTCT
TACCA    TACCT    TAACC    TATCC    TACTC    TAACC    TACTC
TCAAC    TCCAA    TCCTA    TCACA    TCCAA    TCCTA    TCCAT
TCCTT    TCTTC    TCTCT    TCCAT    TCTCT    TCTAC    TCTCA
TTTCC    TTACC    TTCAC    TTCTC    TTACC    TTCCT    TTACC
5'- CATCG[GATC]C [insert 7 words as above] AGATC[TCAC]ACCCTCCAC -3'
(Bam HI site, then the library region, then the Bgl II site)
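A short sketch of how a database element could be turned into its library region, and how the mismatch constraint could be checked. The table is transcribed from the library sequences listed above, with the column layout being an assumption about the original slide; the function names are hypothetical, and the distance check is run on Block 1 only.

```python
from itertools import combinations

# One row per word index 1..12, one column per block 1..7 (assumed layout).
ROWS = [
    "AAACC AATCC AACCA AACCT AATCC ACACA AACCA",
    "ACCAA ACACT ACATC ACCTA ACAAC ACCAT ACACT",
    "ACTCT ATCAC ACCAT ACTAC ACCTT ATCTC ACTTC",
    "ATCTC CAAAC ATTCC ATACC ATCCA CACAA ATCAC",
    "CATAC CCATA CACTT CAAAC CAACT CATTC CATAC",
    "CCTTA CCTAT CATAC CCATT CCATA CCATT CCAAA",
    "CTACA CTCTT CCAAA CTCAA CTCAT CTAAC CTCTA",
    "CTCAT CTTCA CTACT CTTCT CTTTC CTTCA CTTCT",
    "TACCA TACCT TAACC TATCC TACTC TAACC TACTC",
    "TCAAC TCCAA TCCTA TCACA TCCAA TCCTA TCCAT",
    "TCCTT TCTTC TCTCT TCCAT TCTCT TCTAC TCTCA",
    "TTTCC TTACC TTCAC TTCTC TTACC TTCCT TTACC",
]
BLOCKS = [list(column) for column in zip(*(row.split() for row in ROWS))]


def encode(digits):
    """Map 7 numbers in {1,...,12} to a 35-base library region."""
    if len(digits) != len(BLOCKS):
        raise ValueError("need one digit per block")
    if any(not 1 <= d <= 12 for d in digits):
        raise ValueError("digits must lie in 1..12")
    return "".join(BLOCKS[i][d - 1] for i, d in enumerate(digits))


def hamming(a, b):
    """Number of positions at which two equal-length words differ."""
    return sum(x != y for x, y in zip(a, b))


def min_pairwise_distance(words):
    """Smallest Hamming distance between any two distinct words."""
    return min(hamming(a, b) for a, b in combinations(words, 2))


print(encode([1, 2, 3, 4, 5, 6, 7]))
print(min_pairwise_distance(BLOCKS[0]))
```

Because the block position fixes which 12 words are legal, decoding only has to tell apart words within one block, which is why the mismatch constraint is stated per block.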
3D Fluid Bio-Technology using
DNA Attached to Beads





DNA solid supported to Beads
Can use fluorescence tags
Bead sizes: 3 to 100 microns
Bead material: plastic or polystyrene
-Readout Methods for Beads:
 Fluorescence activated cell sorter (FACS)
 Example: MoFlo cell sorter
 Fiber Optic Readout: Illuminata, Inc.
– 60,000 fibers each of 3.5 microns
– Etch ends of fibers and then add attachment chemistry to attach a bead
to each fiber.
-Combinatorial Libraries of Digital Tags appended to Natural DNA or RNA: (Lynx, Inc)
• Generate beads with combinatorial library of digital tags:
• Lynx Tags: Each tag is a string of 8 words chosen from an alphabet of 8 words of 4 bases each.
• They synthesized via 8 stages of resin splitting
• They used FACS readout
• Allowed differential analysis
Experiments
at Duke University Chemistry Laboratories
Tentagel Beads:
10 to 20 micron Tentagel Beads
Dr. Thom Labean Imaging the Tentagel Beads
Construction of a DNA library of size 12^7
Each element in initial database encodes a sequence of 7 numbers over {1,…,12}
• use a sequence of 7 consecutive 5-base DNA sequences.
[Figure: Strand layout showing the distal primer site, database region, and proximal primer site, attached to the resin bead; strand ends labeled 3' and 5'.]
Synthesis on 50 milligrams of TentaGel M NH2 Resin
• ~ 10^8 polystyrene microbeads of 10 micrometer diameter
• ~ 10^11 strands of DNA attached per bead:
• a total of ~ 10^19 strands of DNA attached
Experimental Construction of
DNA Data Base
Two Stage Experimental Synthesis:
• Using mix-and-split methods on plastic microbeads, we construct an initial DNA library of size:
12^7 = 35,831,808
• By combining pairs of initially synthesized library strands, we square the size of the initial library to size:
(12^7)^2 = 12^14 ≈ 1.28 x 10^15
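A minimal sketch of how combining pairs of strands could work at the sequence level. It assumes the flanking sequences from the library design (a Bam HI site on the 5' side and a Bgl II site on the 3' side of the library region), models only the top strand, and uses hypothetical function names. Bam HI cuts G^GATCC and Bgl II cuts A^GATCT, so both leave the same GATC overhang and the cut ends can be ligated to each other.

```python
BAMHI = "GGATCC"   # cut after the first G
BGLII = "AGATCT"   # cut after the first A
LEFT_FLANK = "CATCGGATCC"            # contains the Bam HI site
RIGHT_FLANK = "AGATCTCACACCCTCCAC"   # contains the Bgl II site


def make_strand(library_region):
    """Top strand of one library element with both flanks attached."""
    return LEFT_FLANK + library_region + RIGHT_FLANK


def cut(seq, site):
    """Split seq one base into the first occurrence of the recognition site."""
    position = seq.index(site) + 1
    return seq[:position], seq[position:]


def join(first, second):
    """Ligate the Bgl II-cut upstream part of first to the Bam HI-cut downstream part of second."""
    upstream, _ = cut(first, BGLII)
    _, downstream = cut(second, BAMHI)
    return upstream + downstream


# Two short hypothetical library regions (two 5-base words each).
x = make_strand("AAACCAATCC")
y = make_strand("TTTCCTTACC")
joined = join(x, y)
print(joined)
```

The junction reads AGATCC, which is neither a Bam HI nor a Bgl II site, so the joined strand keeps exactly one site of each kind at its ends. The library words contain no G, so no site can arise inside a library region.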
Construction of a DNA library of size 12^7
Each element in initial database encodes a sequence of 7 numbers over {1,…,12}
• use a sequence of 7 consecutive 5-base DNA sequences.
Used mix-and-split DNA synthesis on plastic microbeads:
1. Split
2. Synthesize
3. Mix
4. Split
5. Synthesize
6. Mix
7. Split
8. Synthesize
9. Mix
• Gives exponential growth in database size with number of steps:
- Each splitting step multiplies the number of distinct sequences by the number of words per block (12 here)
- Takes 7 steps of splitting and mixing to construct DNA database of size 12^7
- Limited by maximum number of beads.
• Use ABI automatic synthesizer with conventional phosphoramidite chemistry.
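The split, synthesize, mix cycle can be simulated on a small scale. The sketch below is an illustration only: it uses three blocks of three words (a subset of the library words), deals beads evenly into vessels, and ignores synthesis direction and chemistry; the function name and parameters are assumptions.

```python
import random


def mix_and_split(n_beads, blocks, seed=0):
    """Grow one strand per bead by repeated mix, split, and synthesize rounds."""
    rng = random.Random(seed)
    beads = [""] * n_beads                 # strand currently attached to each bead
    for words in blocks:
        rng.shuffle(beads)                 # mix: pool all beads together
        for i in range(n_beads):           # split: deal beads into len(words) vessels
            beads[i] += words[i % len(words)]   # synthesize: one word per vessel
    return beads


# Small illustrative subset: 3 blocks with 3 words each, so at most 27 strands.
BLOCKS = [
    ["AAACC", "ACCAA", "ACTCT"],
    ["AATCC", "ACACT", "ATCAC"],
    ["AACCA", "ACATC", "ACCAT"],
]
beads = mix_and_split(3000, BLOCKS)
distinct = set(beads)
print(len(distinct), "distinct strands on", len(beads), "beads")
```

Each bead carries a single sequence, so the number of beads bounds the number of distinct elements, which is the limit noted above.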
Construction of a DNA library of size 1.28 x 10^15
Each element in the initial database encodes a sequence of 7 numbers in {1,…,12}
• combine pairs of the initially synthesized library strands:
1. Extend annealed primer, giving double-stranded DNA with a BamHI site (GGATCC / CCTAGG) at one end and a Bgl II site (AGATCT / TCTAGA) at the other.
2. Divide into 2 halves.
3. BamHI cut on one half; Bgl II cut on the other half. Both cuts leave GATC overhangs.
4. Anneal; ligate. The ligated junction reads AGATCC / TCTAGG.
• Resulting DNA is a concatenation of two of the previously constructed strands
• Each element in squared database encodes sequence of 14 numbers in {1,…,12}
• Squares the size of the initial library from 12^7 to size:
• (12^7)^2 = 12^14 > 1.28 x 10^15
Experiments
at Duke LSRC Chemistry Laboratory
DNA
Synthesizer:
Prof. Michael Pirrung
DNA Synthesis for our Mix and Split Library Construction
Dr. Vipul Rana (Postdoc, Duke Chemistry Dept)
Loading Tentagel Beads into Synthesizer
Experiments of Associative Search in
DNA Database:
-Use cell sorter to separate out DNA on
attached beads with selected suffix sequence
-PCR to amplify results
-Optical Readout Via DNA Annealing Array
Experiments
at Duke University Laboratories
Fluorescence Activated Cell Sorter (FACS)
Used to Separate Tentagel Beads with a given DNA Sequence
Operated by Assistant Prof Thom Labean and Dr. Joel Ross
Test Queries for Small Database
Tag strands are written 5' to 3'; probe strands are written 3' to 5'.
Most common tag sequence/high probability words (1 copy in 130 sentences)
tag:
ATAC AACT AAAC TATC AATC CTAA
probe:
TATG TTGA TTTG ATAG TTAG GATT
Moderate sentence probability/constant moderate word probability (1 copy in 8,304 sentences)
tag:
CAAT TTAC ATCA CTTT ACAA TTTC
probe:
GTTA AATG TAGT GAAA TGTT AAAG
Moderate sentence probability/variable word probabilities (1 copy in 8,304 sentences)
tag:
ATAC CATA TCTA TATC TTCA TACT
probe:
TATG GTAT AGAT ATAG AAGT ATGA
Least common tag sequence/low probability words (1 copy in 531,441 sentences)
tag:
TTCA CATA CATT ACAT TTCA AAAC
probe:
AAGT GTAT GTAA TGTA AAGT TTTG
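Each probe above is the base-by-base Watson-Crick complement of its tag, written in the opposite orientation. A minimal sketch of that relationship, with a hypothetical function name:

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def probe_for(tag: str) -> str:
    """Complement each base; spaces between words are left unchanged."""
    return tag.translate(COMPLEMENT)


most_common = "ATAC AACT AAAC TATC AATC CTAA"
least_common = "TTCA CATA CATT ACAT TTCA AAAC"
print(probe_for(most_common))
print(probe_for(least_common))
```

The result is read 3' to 5' when the tag is read 5' to 3', so no reversal is applied here.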
Sorting Beads by FACS
[Figure: FACS plots with axes on a log scale from 10^0 to 10^4, each marking fluorescent (F) and non-fluorescent (NF) bead populations. Panels 1a and 1b: Control, 50:50. Panels 2a and 2b: Control, 1:10,000. Panels 3a and 3b: Probe 1 Query (1:~130).]
Experiments
at Duke University Laboratories
PCR Experiments (ongoing)
Dr. Thom Labean
[Figure: PCR cycle. 1. Anneal primers. 2. Extend annealed primers with DNA polymerase. 3. Melt H-bonding.]
PCR Amplification of a Selected Data Element:
requires repeated stages of annealing on 35-base primers (the prefix or suffix of each composed DNA word).
To ensure annealing stringency of PCR primers:
• primers are only 35 bases
• pairs of distinct DNA words in same block have ≥ 3 base mismatches
Readout via a DNA annealing array.
High Rate Input &
Output via DNA Chips
[Figure: DNA chip with DNA hybridization.]
Individual DNA chips give highly parallel input/output over 2D surfaces.
• Use for Input:
  - use of photosensitive DNA-on-a-chip technology:
  - 2D optical input is converted to DNA strands encoding the input data.
• Use for Output:
  - Via hybridization at the sites with fluorescent labeled DNA, the output can be read as a 2D image.
Scaling of Individual DNA chips:
• Each DNA chip can be optically addressed at up to 10^5 sites.
• Projected to be millions of sites in the immediate future.
Optical Readout Via DNA Annealing Array:
[Figure: DNA chip with DNA hybridization, showing a bead-bound database strand, a query strand, and a surface-bound read-out array.]
Query strand:
• binds to its complement on an element of the database
• is fluorescently labeled by primer extension with a fluorescent terminator nucleoside
• binds via a complementary region to a site on probe array
• is detected by fluorescent microscopy
Massively Parallel I/O Using
Arrays of DNA Chips
[Figure: DNA chip array.]
• I/O between:
  - conventional electronic media
  - and a "wet" database of DNA strands (in solution or on solid support).
• Proposed Solution: Large Arrays of DNA Chips:
  - A few thousand chips can be placed on a 2D array compact enough so all chips can be addressed by a single optical system.
  - Gives potential of parallel synthesis of DNA at 10^8 sites or more, up to many billions of sites.
• Massively parallel DNA input/output:
  - Has a potential for achieving a rate of I/O to conventional optical/electronic media on the order of gigabit rates or more.
Massively Parallel I/O using Arrays of DNA Chips
Technical Challenge:
Error rates due to optically addressed base synthesis.
• Most common error in optically addressed synthesis of DNA is a premature truncation and deletion in the growing strand.
  - Error rate in optically addressed DNA synthesis methods used for DNA chips is roughly 4% to 8% per base
  - Corresponds to an expected error in every 12 to 25 bases.
• Application of DNA chips for I/O in BMC limited by current error rates.
  - Each DNA strand synthesized may be quite long (over 25 bases per strand).
  - The majority of DNA strands are expected to contain at least one synthesis error.
• Commercial DNA chips (proprietary Affymetrix technology)
  - Synthesis error rates not known for each type of possible error.
  - Utilize only a fraction of the 10^5 optically addressable sites.
  - Current maximum: about 42,000 sites
  - Typical DNA chip: uses about 7,000 sites
  - Today: synthesis error rate seems not to be the dominant limiting factor.
  - Future: will impact scalability (addressable sites & strand length) of DNA chip technology.
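The figures quoted above follow from simple probability, under the assumption that errors occur independently at each base with the same rate p. The function names below are hypothetical.

```python
def expected_bases_per_error(p: float) -> float:
    """Average number of bases synthesized per error at per-base error rate p."""
    return 1.0 / p


def p_at_least_one_error(p: float, length: int) -> float:
    """Probability that a strand of the given length contains one or more errors."""
    return 1.0 - (1.0 - p) ** length


for rate in (0.04, 0.08):
    print(
        rate,
        expected_bases_per_error(rate),
        round(p_at_least_one_error(rate, 25), 3),
    )
```

Even at the lower 4% rate, well over half of all 25-base strands carry an error, which is the motivation for the error-correction methods that follow.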
DNA Synthesis Error Models
Synthesis errors with independent base deletions (causing base
bulges) will be first order approximated by an error model with a
uniform, independent probability p of base replacement
[Figure: Three short stretches of double-stranded DNA, labeled a, b, and c.]
Exact and Inexact Hybridization:
Short stretches of double-stranded DNA are depicted showing:
a) exact Watson-Crick (WC) complementary matching;
b) a mismatch (T-T) embedded within a WC match region; and
c) a WC match region surrounding a bulged base (T). The bulged base can be described as a deletion from the left-hand strand or an insertion into the right-hand strand.
Error-Correction Methods From Computer Science
Adapted to Biotechnology
Methods for repairing faulty oligonucleotides contained within surface-bound probe arrays.
[Figure: Original probe and error-free EC strand, with labels Prefix, Suffix, and Biased-Error Synthesis; strand ends labeled 3' and 5'.]
• Use error-correcting codes for design of error-free probes.
• Use "Error-Correction (EC)" DNA strands:
  - Specifically designed to bind both error-containing and error-free probes.
[Figure: Original error-containing probe array, then error-correction strands bound, then primer extension, then error-free probe array.]
Error-Correction of Synthesized DNA Strands
• Resulting in overhangs.
• Extra benefit: duplex probes containing single-stranded overhangs are less error-prone than simple single-stranded probes.
Synthesizing EC Strands
[Figure: Biased-error DNA synthesis turns an error-free DNA strand S into synthesized error-containing DNA strands.]
• Direct synthesis and purification:
  - small scale only.
• Biased-Error Chemical Synthesis:
  - *** Recommended (details in paper)
  - The relative simplicity of the biased-error chemical synthesis approach makes it the most appealing of methods for generating diverse prefixes for EC strands.
• Other Methods for Generating Diverse Prefix for EC Strands:
  - Mutagenesis via Polymerase Enzymes.
  - DNA Self Assembly.
Current Status DNA Search Project
Tasks:
 Computer Preprocessing of Image Database
o Image segmentation
o wavelet transform
o vector quantization
 Computer Simulation of DNA Search
 Experimental Synthesis of DNA Library
o Computer Search for Sequences Defining DNA Library
o Two Stage Synthesis of DNA Library

 Experimental Tests of Associative Search
o select and amplify chosen DNA library element (Status:Tested)
o readout results
Future Work
Other Applications to DNA Database Search:
Digital Tagged Natural DNA
[Figure: "Wet" database strand. Prefix: "digital tag" of DNA words encoding Boolean variables. Suffix: natural DNA.]
• DNA strands augmented with prefix "digital tag strands" consisting of a sequence of DNA words encoding Boolean values.
• Example:
  - The extended database might consist of natural DNA strands (e.g., from blood or other body tissues)
  - Appended "digital tag strands" consisting of DNA words encoding identifying information about each strand (such as social security number of the person whose DNA was sampled, cell type, the date, further medical data, etc.).
  - The "digital tag strands" may have been constructed by previous BMC processing.
Associative Search
With Boolean Conditionals

Combine:

Our methods for DNA associative search with

BMC methods for solving the SAT problem

(e.g., using surface chemistry techniques).

Vectors of the database are augmented with "digital tag vectors" consisting of a list of n' Boolean values,
encoding binary information about the vector.
An extended query consist of

Query vector to be matched with and

Boolean formula to be satisfied.


The extended query requires finding those database vectors that:

closely match the query vector and also

whose Boolean variables satisfy the queries Boolean formula.

Execute the extended query in two stages:

First execute the Boolean formula portion of the query as a SAT problem, using biomolecular computing
techniques.

Strands not encoding SAT solutions are deleted, and all the remaining DNA strands satisfy the Boolean
formula.

Then execute our associative search procedure on the remaining strands, to find the closest match to
the query vector that satisfies the query's Boolean formula.
NRO DII 2001 Proposal
"A PEDA-OP BIOCHEMICAL SYSTEM FOR PROCESSING DATA
BASE QUERIES WITH AFM IMAGE OUTPUT"
Summary:
We propose a biochemical system for:
• storing an image database,
• processing logical Boolean database queries on properties of these images, and
• output of these queries.
 The system is unique in its capabilities:
• to process, within a few minutes, complex logical queries
• in a database with at least 10^15 elements.
 The total mass of the storage scales linearly with database size :
• a pedabyte database requires less than 1/10 of a gram of DNA (in solution within
approximately 20 milliliters of water).
 The rendering of the images selected by queries is done at the
molecular scale by a self-assembly process:
• the output images are single molecules, imaged by an atomic force microscope.
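As a rough plausibility check on the mass claim, the arithmetic below uses the density figure from the talk's introduction (2.1 x 10^21 bases per gram, 2 bits per base). Treating the database as 10^15 bytes of payload is an assumption made here for illustration, and the constant names are hypothetical.

```python
# Density figure quoted in the talk's introduction: 2.1 x 10^21 bases per gram.
BASES_PER_GRAM = 21 * 10**20
# Four possible bases per position, so each base carries 2 bits.
BITS_PER_BASE = 2
# Assumption for this sketch: the database payload is 10^15 bytes.
DATABASE_BITS = 8 * 10**15

# Bases needed to hold one copy of the raw payload.
payload_bases = DATABASE_BITS // BITS_PER_BASE

# Bases available in the stated budget of 1/10 of a gram.
budget_bases = BASES_PER_GRAM // 10

# How many times the raw payload fits in the budget; this factor is what
# remains for redundancy and encoding overhead.
headroom = budget_bases // payload_bases
```

Under these assumptions a single raw copy uses only a small fraction of 1/10 of a gram, so the stated bound leaves a large margin for redundant copies and for the overhead of word-based encodings.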
DNA Database:
DNA strands encode Image with Boolean Vectors
[Figure: "wet" database strand. Prefix: encodes the image. Suffix: "digital tag" encoding Boolean variables.]
"Digital" DNA strands:
 Prefix: DNA strands with prefix consisting of a sequence of DNA
words encoding Boolean values.
 Suffix: consisting of DNA words encoding Boolean values giving
identifying information about each strand
• origin
• location
• date of image acquisition
• other image properties
The "digital tag strands" may have been constructed by previous
BMC processing.
The Image Database:
assume a database whose elements are vectors with:
 Prefix: encodes the pixels of an image,
 Suffix: encodes a list of n Boolean values defining properties of the image:
• the origin date
• placement of the image, or
• properties detected within the image, etc.
Logical Database Queries:
 The task is to process logical database queries on this database.
 A query is defined by a formula of Boolean logic.
 The query processing selects only those database elements whose Boolean variables
satisfy the query.
 The output is the set of selected images in the database.
Boolean Query Processing: Biochemical Steps

Initialization: execute operations that concatenate to each DNA strand in the database O(K log L) copies of that strand.
Boolean Query: AND of a list of K logical clauses;
each clause: OR of a list of literals (Boolean variables or their negations);
at least one literal of each clause must be satisfied.
Logical Query Processing: process each clause C in the formula in turn:
• selectively amplify DNA strands whose Boolean variables satisfy a literal of C.
Operations to Satisfy Clause C:
• add PCR primers encoding the literals of C and their complements.
• execute a series of primer-extension reactions that replicate only those DNA strands (or their complements) that encode a literal of C.
Repeat O(log L) PCR cycles:
amplifies only DNA strands whose Boolean variables satisfy a literal of C
After Processing All Clauses:
• output strands satisfying all the clauses vastly predominate over all other strands of the DNA database.
Exquisitely sensitive: need < 10 identical strands of DNA that satisfy query.
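The clause-by-clause enrichment can be illustrated with a small simulation. It is a sketch under idealized assumptions: every PCR cycle exactly doubles a strand that satisfies the current clause and leaves all other strands untouched, and the `cycles` parameter stands in for the O(log L) cycle count. The function and variable names are hypothetical.

```python
from typing import Dict, List, Tuple

Literal = Tuple[str, bool]   # (variable name, value the literal requires)
Clause = List[Literal]       # OR of literals


def satisfies(tag: Dict[str, bool], clause: Clause) -> bool:
    """A strand satisfies a clause if its tag matches at least one literal."""
    return any(tag[var] == value for var, value in clause)


def process_query(
    tags: List[Dict[str, bool]], clauses: List[Clause], cycles: int
) -> List[int]:
    """Return the idealized copy number of each strand after all clauses."""
    counts = [1] * len(tags)
    for clause in clauses:                 # process each clause in turn
        for i, tag in enumerate(tags):
            if satisfies(tag, clause):
                counts[i] *= 2 ** cycles   # perfect doubling per PCR cycle
    return counts


tags = [
    {"a": True, "b": False, "c": True},    # satisfies both clauses
    {"a": False, "b": False, "c": True},   # satisfies only the second clause
    {"a": False, "b": True, "c": False},   # satisfies only the first clause
]
# Query: (a OR b) AND c
clauses = [[("a", True), ("b", True)], [("c", True)]]
counts = process_query(tags, clauses, cycles=10)
```

A strand that satisfies all K clauses is amplified K times, while a strand that misses even one clause falls behind by a factor of 2^cycles, which is why satisfying strands predominate after the last clause.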
Unique Method for Output of the Selected Images:

Render the selected images as a
patterned 2D lattice at the molecular scale
- a few tens of Angstroms per pixel.

Scalable to extremely large images
- not diffraction limited

Can be Viewed by an Atomic Force Microscope
Self-Assembly of Patterned 2D Lattices:

Tiles (DNA nanostructures) self-assemble around each segment of a DNA strand encoding an
image pixel.

Each tile has a surface perturbation depending on pixel intensity.

The tiles then self-assemble into a 2D tiling lattice.
DNA Nanostructures
TAO tile:
3 double-stranded DNA helices with Holliday junctions
[Figure: sequence-level diagram of the TAO tile.]
2D DNA Self-Assembled Tilings:
Rendering Simple Banded Images
[Figure panels: B* tiles with loops; atomic force microscope image; bands generated by B* tiles with attached beads; 2D DNA self-assembled tiling.]
The Process of Rendering an Image:
Self Assembly of Tiles
around a DNA Strand Defining Pixels of an Image
DNA Self-Assembled Tiling Challenge Problem:
Rendering a 100 x 100 Image via a DNA Self-Assembled Molecule
Illustration of Portion Containing NRO Letters:
• the actual tiling would be of size at least 100 x 100 and would include a detailed
image with the NRO logo