DNA7

Experimental Construction of Very Large Scale DNA Databases with Associative Search Capability

John H. Reif (1), Thomas H. LaBean (1), Michael Pirrung (2), Vipul S. Rana (2), Bo Guo (1), Carl Kingsford (1), and Gene S. Wickham (1)
Departments of Computer Science (1) and Chemistry (2), Duke University

Goal: Construction of a Petabit DNA Database with Associative Search Capability
- Each database element is encoded by a DNA molecule.
- 10^15 database elements stored (with 10-fold redundancy).
- 10 milligrams of DNA holds the entire database.
- 100,000,000 times the storage density of conventional storage media!
- Massively parallel associative search via DNA annealing.
- Parallel I/O to digital media can be done via optically addressed DNA arrays.

Organization of Talk:
- Introduction to DNA & Overview of Biotechnology
- The Associative Search Problem & Relevance
- Overview of DNA Search Project
- Preliminary Pre-Processing of Image Data Base
- Computer Simulation of DNA Search
- Design of the DNA Library Coding Sequences
- Experimental Construction of the DNA Library
- Experiments of Associative Search in DNA Library
- High Rate Input/Output via DNA Chips
- Current Status of the DNA Search Project
- Future Work

Introduction to DNA & Overview of Biotechnology

Extremely compact DNA storage:
- 1 gram of DNA contains 2.1 x 10^21 DNA bases and can store approximately 4.2 x 10^21 bits (4.2 billion trillion bits).
- Potential data storage capacity of DNA: a factor of 4.2 x 10^12 more compact than conventional storage technologies.

Actual experiments: 10^15 database elements stored (10-fold redundancy)
- Uses only 10 milligrams of DNA: 10^8 times more compact than conventional storage media.
- When in solution in 10 milliliters of H2O: 10^5 times more compact than conventional storage media.

The 4 DNA bases form two sets of complementary pairs (A with T, C with G), known as Watson-Crick complements.

Recombinant DNA Operations:
- DNA annealing operation: two single-stranded DNA sections combine into a double-stranded DNA if the DNA bases of these sequences are complementary to each other.
- DNA ligation: two abutting single-stranded DNA sections are joined.
- Primer-extension operation: a DNA strand known as a primer anneals to another DNA strand S which has the complement of that primer as a subsequence; an enzyme known as polymerase then extends the primer to form the full sequence complementary to S.
- The Polymerase Chain Reaction (PCR): uses repeated primer extension to amplify only those DNA strands whose end flanking subsequences are defined by the primers.

Biotechnology Techniques Used:
- Recombinant DNA operations: cut & splice DNA in massively parallel fashion.
- PCR: uses DNA annealing to amplify very small quantities of DNA having a chosen sequence.
- DNA chips (DNA annealing arrays) provide:
  - surfaces for optically addressed parallel DNA synthesis
  - optical detection of DNA sequences via fluorescent labels
  - input and output of DNA databases to conventional digital storage media.
- Microbeads provide: surfaces for parallel DNA synthesis.

Associative Search

[Figure: associative matching on n-vectors. A database vector at distance < d from the query vector v matches the query; a database vector at distance > d does not.]

Database:
- An ordered list of n-vectors whose elements range over a finite range.
- Each vector of the database has a unique identifying index in the database.

Associative Search Query:
- Query vector v.
- Distance bound d.
- distance(u,v) = |u1-v1| + |u2-v2| + ... + |un-vn|
- Find distance-d near-matches: search the entire database for those vectors that are of distance at most d from the query vector.
- Closest match: find the index of a database vector of smallest distance from the query vector.

Associative Search in Image Database
- Preprocess using a procedure A, forming an attribute database: a list of low-level image attributes for each image or sub-image.
- Given an input image I: use A to determine its vector A(I) of image attributes.
- Associative search in the attribute database provides the closest match to A(I).
- This provides an index to the image in the image database whose attributes best match those of the input image I.

DNA Associative Search: massively parallel associative search in extremely large databases encoded as DNA strands.
- [Baum95]: proposed using known recombinant DNA methods for DNA ligation affinity separation.

Use of DNA words for vector elements:
- Each element of a vector of the database is encoded by a DNA word.
- Each n-vector v of the database is encoded by a sequence of n DNA words, followed by a DNA word for the identifying index of v.

Advantages:
- Ultra-compact DNA storage media.
- Supports highly parallel associative searches within the media. Possible methods use known recombinant DNA methods for detection of matches:
  - PCR amplification, which uses DNA annealing to amplify the frequency of those DNA strands that have a particular chosen sequence.
  - DNA ligation affinity separation.
- Scalable: below the maximum concentration, the number of recombinant DNA operations and the volume are independent of database size.

DNA Annealing as a Massively Parallel Associative Search Engine. Major Challenges:
(a) Experimental Construction of a Large Scale DNA Library Data Base
(b) Experimental Testing of Large Scale DNA Associative Search
(c) Refining the Associative Search to Exact Affinity Separation:
  - The query may not be an exact match or even a partial match with any data in the database.
  - DNA annealing affinity methods work best when annealing complementary sequences. They do not perform well for associative matching in the case of partial matches with scattered mismatches in the interior of vectors.
(d) Input and Output (I/O) to Conventional Media:
  - Goals: error-resiliency and optimal I/O rate for a given error rate.
(e) Extension to Include Boolean Conditionals:
  - Extend associative search queries to Boolean formula conditionals (with a bounded number of Boolean variables) by combining our methods for DNA associative search with BMC methods for solving the SAT problem.
  - Example: extended queries executed on natural DNA strands (from blood or other tissues), appended with DNA words encoding binary information about each strand (e.g., the social security number of the person whose DNA was sampled, cell type, the date, further medical data, etc.).

Our New Techniques for DNA Associative Search:

J.H. Reif and T.H. LaBean, Computationally Inspired Biotechnologies: Improved DNA Synthesis and Associative Search Using Error-Correcting Codes and Vector-Quantization, Sixth International Meeting on DNA Based Computers (DNA6), DIMACS Series in Discrete Mathematics and Theoretical Computer Science, Leiden, The Netherlands (June 2000), ed. A. Condon. Springer-Verlag volume in Lecture Notes in Computer Science (2000).
URL: http://www.cs.duke.edu/~reif/paper/SELFASSEMBLE/selfassemble.pdf

We use improved biotechnology techniques based on Error-Correction and VQ Coding.

New Techniques:
- The database may initially be in conventional (electronic, magnetic, or optical) media rather than in the form of DNA strands.
  Proposed Solution: Apply DNA chip technology improved by Error-Correction and VQ Coding methods for error-correction and compression.
- The query may not be an exact match or even a partial match with any data in the database, whereas DNA annealing affinity methods work best for exact complementary matches.
  Proposed Solution: Apply various VQ Coding methods for refining the associative search to exact matches.
- Extend associative search queries in DNA databases to include Boolean formula conditionals (with a bounded number of Boolean variables).
  Proposed Solution: Combine our methods for DNA associative search with known BMC methods for solving small SAT problems.
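As a purely conventional-software illustration of the extended query (a Boolean conditional combined with a closest-match search under the sum-of-absolute-differences distance defined earlier), the following minimal sketch may help. The record layout, the function names, and the use of a Python callable for the Boolean formula are assumptions made for illustration; they are not part of the DNA method itself.

```python
def l1_distance(u, v):
    """Sum of absolute coordinate differences between two n-vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def extended_query(records, query, formula):
    """Return the index of the closest record whose tags satisfy the formula.

    records: list of (vector, tags) pairs, where tags is a dict of Booleans
             (hypothetical layout chosen for this sketch).
    formula: callable taking the tags dict and returning True or False.
    """
    # Stage 1: keep only the records whose Boolean tags satisfy the formula.
    survivors = [(i, vec) for i, (vec, tags) in enumerate(records) if formula(tags)]
    if not survivors:
        return None
    # Stage 2: closest match among the survivors.
    return min(survivors, key=lambda item: l1_distance(item[1], query))[0]


records = [
    ((1, 2, 3), {"x": True, "y": False}),
    ((1, 2, 4), {"x": False, "y": True}),
    ((5, 5, 5), {"x": True, "y": True}),
]
result = extended_query(records, (1, 2, 4), lambda tags: tags["x"])
```

In this toy data, record 1 matches the query vector exactly but is excluded because its tag x is False, so the search falls back to the nearest surviving record.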
Relevance to Image Databases:
- To execute associative search in digital image databases of huge size (potentially thousands of terabytes).

Constructing DNA Databases & Executing Associative Search:
- Preprocess image data: we use image segmentation, wavelet transforms, and vector quantization (via C code which we developed).
- One-time conversion of the digital database to a DNA database:
  - Can use parallel optical synthesis on DNA arrays (via known biotechnology).
  - For experimental purposes, we instead artificially synthesized a huge DNA database.
- Parallel associative search in the DNA database: executed via PCR amplification and DNA annealing.

Overview of DNA Search Project

Goals of Laboratory Experiments:
- Artificially synthesize a random DNA database of huge size.
- Test associative search in the DNA database.
- Use 10-fold redundancy (10 identical DNA strands per database element).
- The distinct DNA for any chosen database element is amplified by PCR and detected.
- A total of only 10 milligrams of DNA is used.
- Can store 10^15 = 1,000 trillion data elements.
- Can give 10^8 times the storage density of conventional storage media!

Preliminary Pre-Processing of Image Data Base

Software Preprocessing of Image Database
Carl Kingsford, Graduate student, Princeton University

Initial Image Preprocessing on Conventional Computer

Image Segmentation:
- When the image database was created, the image was broken into tiles.
- Tiles are small subimages (typically 8 by 8 or 16 by 16 pixels) that were extracted from the image.
- For each pixel in the image, a tile is extracted with its upper-left corner at that pixel.

Wavelet Transformation:
- Tiles are wavelet transformed and stored in a file on the server (called the DAT file).

Vector Quantization (VQ) Transformation:
- The resulting tiles are then VQ encoded.
- The index is encoded using the search template.
- The VQ index as well as the position of the tile is encoded into DNA (as specified by the template).
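The segmentation step (one tile per pixel, with the pixel as the tile's upper-left corner) can be sketched as follows. This is only an illustration, not the project's C code: the function name is hypothetical, and the sketch assumes that tiles which would extend past the image border are skipped, a detail the slides do not specify.

```python
def extract_tiles(image, tile_size):
    """Map (row, col) of each upper-left corner to its tile_size x tile_size tile.

    image: list of equal-length rows of pixel values.
    Assumption: only tiles lying entirely inside the image are produced.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    tiles = {}
    for row in range(height - tile_size + 1):
        for col in range(width - tile_size + 1):
            tiles[(row, col)] = [
                image_row[col:col + tile_size]
                for image_row in image[row:row + tile_size]
            ]
    return tiles


# A 4 x 4 test image whose pixel value is 4 * row + col.
image = [[4 * r + c for c in range(4)] for r in range(4)]
tiles = extract_tiles(image, 2)
```

Because the tiles overlap at every pixel offset, the tile position has to be stored alongside the VQ index, which is why both are encoded into DNA.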
(After you click search, the selected tile is wavelet transformed and VQ encoded.)

Vector Quantization (VQ) Coding

[Figure: vector quantization. Each database vector in a cluster is mapped to the cluster's center point; the cluster radius is measured from the center point.]

- Partition the vectors of the database into clusters of vectors.
- Well-known algorithms (Jain and Dubes [JD88]) compute clusters that minimize the cluster radius.
- For each cluster:
  - The center vector is the average of all the vectors of the cluster.
  - Radius of cluster = maximum distance from any vector of the cluster to the center vector.
  - The cluster index uniquely identifies the cluster.
- Cluster size parameter m = average number of vectors in each cluster.
- The number of clusters is a fraction 1/m of the original number of vectors of the database.
- Used in computer science for compressing data (e.g., speech and images) within bounded error.
- Each vector is approximated by the center point of its cluster and coded by the cluster index.
- VQ coding induces errors tuned by the choice of parameter m.
- Data-rate/distortion is asymptotically optimal, assuming various statistical source models for the data (memoryless or finite-state stationary processes [Gray90]).

Applying VQ Coding Methods

To Associative Search (Refining the Associative Search to Exact Matches):
- DNA annealing affinity methods work best on complementary sequences.
- Yet we need to process an associative match query even if the query is not an exact match or even a partial match with any data in the database.
- We use VQ-coding clustering techniques: this reduces the associative search problem to finding just exact matches via complementary hybridization, which can be done very effectively by known DNA annealing methods (e.g., PCR).

To Increase DNA Chip I/O:
- Use VQ data clustering techniques to determine the clusters.
- Only the center points need to be transmitted (at 1/m the cost of transmitting the entire database).
- Each vector v of the database is represented by a DNA strand encoding an identification tag for v and an identification tag for the center point of the cluster containing v.

Reducing Associative Search with Given Match Distance d to the Problem of Exact Match

[Figure: using VQ for associative matching. The query vector and the database vectors are each mapped to a cluster center point; a match is found via the center point.]

"Possibility Vectors": for each cluster G of database vectors:
- The "possibility vectors" of the cluster are those vectors that are within distance d of the center point of G.
- The query vector v will be included among the "possibility vectors" of those clusters whose centers are of distance at most d from v.
- Vectors in these clusters are at most distance 2d from v, and they include all database vectors that are at most distance d from the query vector, as required.
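The cluster-selection step can be sketched in conventional software as below. This is a minimal illustration under stated assumptions: the cluster layout and function names are hypothetical, and the test "center within distance d of the query" is computed numerically here, whereas in the DNA setting it would be realized as an exact match of the query against precomputed possibility vectors. By the triangle inequality, a member of a selected cluster of radius r lies within d + r of the query, which gives the 2d bound when r is at most d.

```python
def l1_distance(u, v):
    """Sum of absolute coordinate differences between two n-vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def candidate_matches(query, clusters, d):
    """Return indices of database vectors in clusters whose center is within d of query.

    clusters: dict mapping cluster index -> (center, members), where members is a
              list of (vector_index, vector) pairs (hypothetical layout).
    """
    hits = []
    for center, members in clusters.values():
        if l1_distance(query, center) <= d:
            hits.extend(index for index, _ in members)
    return sorted(hits)


clusters = {
    0: ((0, 0), [(0, (0, 1)), (1, (1, 0))]),
    1: ((10, 10), [(2, (10, 9)), (3, (11, 10))]),
}
hits = candidate_matches((1, 1), clusters, 2)
```

Only one distance computation per cluster is needed, which mirrors the point that only the center points (1/m of the database) have to be handled.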
Software Simulation of DNA Search
Carl Kingsford, Graduate student, Princeton University

DNA Associative Search Simulation:
- When you click the "Search" button, the tile is converted into a representation of DNA.
- BIND is used to simulate hybridization between the search strand and each database strand.
- The product is a set of DNA strands from the database, called the result set.
- Each of the strands in the result set is decoded into the VQ index and the tile position. A white box is drawn on the image at the tile position to indicate that this tile was found.

Software Simulation of DNA Annealing
Xavier Berni, Master's graduate, Duke University

- Determine the search strand: the DNA strand that represents the tile you are looking for (created from the image you selected in the image selection box). This search strand is "dipped" into the DNA database that represents the image.
- Simulation of DNA annealing between the search strand and the database strands: determines the probability that a database strand is annealed to the search strand.
- BIND: software used for DNA annealing simulation.

Conditions used by the BIND simulation:
- Temperature: the temperature of the solution.
- Strand concentration: what percentage of the solution is this strand.
- Salt concentration: what percentage of the solution is salt.
- Mathematical annealing model: represents the binding energies between strands of DNA.
- Maximum mismatch distance.

Input to BIND: the conditions of the solution and two strands.
Output from BIND: the likelihood that these strands will anneal.

Design of the DNA Library Coding Sequences

DNA sequences of Hamming distance 1 from ACGT:
- First position: CCGT GCGT TCGT
- Second position: AAGT AGGT ATGT
- Third position: ACAT ACCT ACTT
- Fourth position: ACGA ACGC ACGG

[Figure: 2D projection of a local region in sequence space. Neighboring sequences are shown for a central tetramer (ACGT), with substitutions in the first position to the north, the second to the east, the third to the south, and the fourth to the west.]
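The Hamming-distance-1 neighborhood of a word can be enumerated mechanically: each of the 4 positions can be replaced by any of the 3 other bases, giving 12 neighbors for a tetramer. The short sketch below (function names chosen for illustration) reproduces that neighborhood for ACGT.

```python
BASES = "ACGT"


def hamming_distance(a, b):
    """Number of positions at which two equal-length words differ."""
    return sum(x != y for x, y in zip(a, b))


def hamming_neighbors(word):
    """All words obtained from `word` by substituting exactly one base."""
    return [
        word[:i] + base + word[i + 1:]
        for i in range(len(word))
        for base in BASES
        if base != word[i]
    ]


neighbors = hamming_neighbors("ACGT")
```

The same `hamming_distance` helper is the natural building block for checking a word-design constraint such as a minimum number of mismatches between words.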
DNA Word Encoding:
- Vectors of the database are encoded by single-stranded DNA sequences.
- DNA sequences use 4 bases, but we use a base-12 encoding: each block offers 12 distinct 5-base DNA words.
- A number is encoded by a DNA sequence made of consecutive blocks of 5 bases.

[Figure: block layout. Blocks 1, 2, 3, ..., 7, each offering Word x.1 through Word x.12.]

Redundant Encoding of each Database Element:
- Use approximately 10 identical single-stranded DNAs per element.

Word Design to Minimize DNA Annealing Mismatches:
- To discriminate exact matches, distinct DNA words in a block differ by at least 3 DNA bases.

Data Base Values:
- Each database element holds only a single bit value: the value is 1 <=> the element is in the library.
- Further values can easily be appended to a DNA strand using flanking sequences.

Simple Example of DNA Code Word Design (6 blocks, 4 words per block):

  Block 1: ATAC CAAT ACTA TTCA
  Block 2: AACT TCTT TTAC CATA
  Block 3: AAAC ATCA TCTA CATT
  Block 4: TATC CAAA CTTT ACAT
  Block 5: AATC ACAA CTAT TTCA
  Block 6: CTAA TACT TTTC AAAC

Flanking sequences (with Dra I and Ase I restriction sites): CACCACCTTTAAACCTCC (bead end) and CCACATTAATCCTCCACC.

Each DNA sequence uses one word from each block.

DNA Library Size:

[Figure: Library Diversity. Log(Diversity) versus Number of Blocks (5 to 17), with curves for 4, 6, 8, 10, and 12 words per block.]

- Library Size = [word count]^[block count]
- Initial Library: Size = 12^7 (Blocks = 7, Words per Block = 12)
- Library size scales with the number of blocks and the number of words per block.

DNA Library Design: we used an extensive computer search for good DNA code words:
- Minimize the melting temperature (Tm) difference between words, so that hybridization of multiple words proceeds simultaneously.
- At least 3 base mismatches between words and complements in blocks.
- Avoid frame-shift binding errors.

DNA Sequences Used for Library:

  Block 1: AAACC ACCAA ACTCT ATCTC CATAC CCTTA CTACA CTCAT TACCA TCAAC TCCTT TTTCC
  Block 2: AATCC ACACT ATCAC CAAAC CCATA CCTAT CTCTT CTTCA TACCT TCCAA TCTTC TTACC
  Block 3: AACCA ACATC ACCAT ATTCC CACTT CATAC CCAAA CTACT TAACC TCCTA TCTCT TTCAC
  Block 4: AACCT ACCTA ACTAC ATACC CAAAC CCATT CTCAA CTTCT TATCC TCACA TCCAT TTCTC
  Block 5: AATCC ACAAC ACCTT ATCCA CAACT CCATA CTCAT CTTTC TACTC TCCAA TCTCT TTACC
  Block 6: ACACA ACCAT ATCTC CACAA CATTC CCATT CTAAC CTTCA TAACC TCCTA TCTAC TTCCT
  Block 7: AACCA ACACT ACTTC ATCAC CATAC CCAAA CTCTA CTTCT TACTC TCCAT TCTCA TTACC

Full strand layout (Bam HI site, Library Region, Bgl II site):

  5' CATCG[GATC]C [-- insert 7 words as above --] AGATC[TCAC]ACCCTCCAC 3'

3D Fluid Bio-Technology using DNA Attached to Beads
- DNA is solid-supported on beads.
- Can use fluorescence tags.
- Bead sizes: 3 to 100 microns.
- Bead material: plastic or polystyrene.

Readout Methods for Beads:
- Fluorescence activated cell sorter (FACS). Example: MoFlo cell sorter.
- Fiber optic readout: Illuminata, Inc.
  - 60,000 fibers, each of 3.5 microns.
  - Etch the ends of the fibers and then add attachment chemistry to attach a bead to each fiber.

Combinatorial Libraries of Digital Tags Appended to Natural DNA or RNA (Lynx, Inc.):
- Generate beads with a combinatorial library of digital tags.
- Lynx tags: each tag is a string of 8 words chosen from an alphabet of 8 words of 4 bases each.
- They synthesized via 8 stages of resin splitting.
- They used FACS readout.
- This allowed differential analysis.

Experiments at Duke University Chemistry Laboratories

[Photos: 10 to 20 micron Tentagel beads; Dr. Thom LaBean imaging the Tentagel beads.]

Experimental Construction of DNA Data Base

Two Stage Experimental Synthesis:
- Using mix-and-split methods on plastic microbeads, we construct an initial DNA library of size 12^7 = 35,831,808.
- By combining pairs of initially synthesized library strands, we square the size of the initial library to (12^7)^2 = 12^14 = 1.28 x 10^15.

Construction of a DNA library of size 12^7
- Each element in the initial database encodes a sequence of 7 numbers over {1,...,12}, using a sequence of 7 consecutive 5-base DNA sequences.

[Figure: bead-bound strand with 3' and 5' ends marked, showing the Distal Primer Site, Database Region, and Proximal Primer Site, attached to the Resin Bead.]

- Synthesis on 50 milligrams of TentaGel M NH2 resin:
  - ~10^8 polystyrene microbeads of 10 micrometer diameter
  - ~10^11 strands of DNA attached per bead
  - a total of ~10^19 strands of DNA attached.

Mix-and-split DNA synthesis on plastic microbeads:
  1. Split
  2. Synthesize
  3. Mix
  4. Split
  5. Synthesize
  6. Mix
  7. Split
  8. Synthesize
  9. Mix
- Gives exponential growth in database size with the number of steps:
  - Each splitting step generates a factor of 12 more sequences (one per word in the block).
  - It takes 7 steps of splitting and mixing to construct a DNA database of size 12^7.
  - Limited by the maximum number of beads.
- Uses an ABI automatic synthesizer with conventional phosphoramidite chemistry.

Construction of a DNA library of size 1.28 x 10^15
- Each element in the initial database encodes a sequence of 7 numbers in {1,...,12}.
- Combine pairs of the initially synthesized library strands:
  1. Extend annealed primer.
  2. Divide into 2 halves.
  3. BamHI cut (one half).
  4. Bgl II cut (other half).
  5. Anneal; ligate.

[Figure: annealing and ligation. Each strand carries a BamHI site (GGATCC/CCTAGG) and a Bgl II site (AGATCT/TCTAGA). After the cuts, the compatible ends anneal and are ligated, leaving a hybrid junction (AGATCC/TCTAGG) between the two library regions.]

- The resulting DNA is a concatenation of two of the previously constructed strands.
- Each element in the squared database encodes a sequence of 14 numbers in {1,...,12}.
- Squares the size of the initial library from 12^7 to (12^7)^2 = 12^14 > 1.28 x 10^15.

Experiments at Duke LSRC Chemistry Laboratory

[Photos: DNA synthesizer, Prof. Michael Pirrung; DNA synthesis for our mix-and-split library construction, Dr. Vipul Rana (Postdoc, Duke Chemistry Dept); loading Tentagel beads into the synthesizer.]

Experiments of Associative Search in DNA Database:
- Use a cell sorter to separate out DNA on attached beads with a selected suffix sequence.
- PCR to amplify the results.
- Optical readout via a DNA annealing array.

Experiments at Duke University Laboratories

Fluorescent Activated Cell Sorter (FACS), used to separate Tentagel beads with a given DNA sequence. Operated by Assistant Prof Thom Labean and Dr.
Joel Ross

Test Queries for Small Database (tags written 5' to 3', probes 3' to 5'):

- Most common tag sequence / high probability words (1 copy in 130 sentences)
    tag:   ATAC AACT AAAC TATC AATC CTAA
    probe: TATG TTGA TTTG ATAG TTAG GATT
- Moderate sentence probability / constant moderate word probability (1 copy in 8,304 sentences)
    tag:   CAAT TTAC ATCA CTTT ACAA TTTC
    probe: GTTA AATG TAGT GAAA TGTT AAAG
- Moderate sentence probability / variable word probabilities (1 copy in 8,304 sentences)
    tag:   ATAC CATA TCTA TATC TTCA TACT
    probe: TATG GTAT AGAT ATAG AAGT ATGA
- Least common tag sequence / low probability words (1 copy in 531,441 sentences)
    tag:   TTCA CATA CATT ACAT TTCA AAAC
    probe: AAGT GTAT GTAA TGTA AAGT TTTG

[Figure: Sorting Beads by FACS. Panels 1a/1b: Control, 50:50; panels 2a/2b: Control, 1:10,000; panels 3a/3b: Probe 1 Query (1:~130). Regions are labeled F and NF; axes run from 10^0 to 10^4.]

Experiments at Duke University Laboratories: PCR Experiments (ongoing)

[Photo: Dr. Thom Labean.]

PCR cycle:
  1. Anneal primers.
  2. Extend annealed primers with DNA polymerase.
  3. Melt H-bonding.

PCR Amplification of a Selected Data Element:
- Requires repeated stages of annealing on 35-base primers (the prefix or suffix of each composed DNA word).
- To ensure annealing stringency of PCR primers:
  - primers are only 35 bases
  - pairs of distinct DNA words in the same block have at least 3 base mismatches.
- Readout via a DNA annealing array.

High Rate Input & Output via DNA Chips

Individual DNA chips give highly parallel input/output over 2D surfaces.
- Use for Input (photosensitive DNA-on-a-chip technology): 2D optical input is converted to DNA strands encoding the input data.
- Use for Output:
Via hybridization at the sites with fluorescent labeled DNA, the output can be read as a 2D image.

Scaling of Individual DNA Chips:
- Each DNA chip can be optically addressed at up to 10^5 sites.
- Projected to be millions of sites in the immediate future.

Optical Readout Via DNA Annealing Array:

The query strand:
- binds to its complement on an element of the database
- is fluorescently labeled by primer extension with a fluorescent terminator nucleoside
- binds via a complementary region to a site on the probe array
- is detected by fluorescence microscopy.

[Figure: Bead-bound Database Strand; Query Strand; Surface-bound Read-out Array.]

Massively Parallel I/O Using Arrays of DNA Chips

I/O between conventional electronic media and a "wet" database of DNA strands (in solution or on solid support).

Proposed Solution: Large Arrays of DNA Chips
- A few thousand chips can be placed on a 2D array compact enough that all chips can be addressed by a single optical system.
- Gives the potential for parallel synthesis of DNA at 10^8 sites or more, up to many billions of sites.
- Massively parallel DNA input/output: has the potential to achieve a rate of I/O to conventional optical/electronic media on the order of gigabit rates or more.

Technical Challenge: Error Rates Due to Optically Addressed Base Synthesis
- The most common error in optically addressed synthesis of DNA is a premature truncation and deletion in the growing strand.
- The error rate in optically addressed DNA synthesis methods used for DNA chips is roughly 4% to 8% per base. This corresponds to an expected error in every 12 to 25 base pairs.
- Application of DNA chips for I/O in BMC is limited by current error rates:
  - Each DNA strand synthesized may be quite long (over 25 bases per strand).
  - The majority of DNA strands are expected to contain at least one synthesis error.

Commercial DNA chips (proprietary Affymetrix technology):
- Synthesis error rates are not known for each type of possible error.
- Utilize only a fraction of the 10^5 optically addressable sites.
  - Current maximum: about 42,000 sites.
  - Typical DNA chip: uses about 7,000 sites.
- Today: the synthesis error rate currently seems not to be the dominant limiting factor.
- Future: it will impact the scalability (addressable sites & strand length) of DNA chip technology.

DNA Synthesis Error Models

Synthesis errors with independent base deletions (causing base bulges) will be approximated to first order by an error model with a uniform, independent probability p of base replacement.

[Figure: Exact and Inexact Hybridization. Short stretches of double-stranded DNA are depicted showing: a) exact Watson-Crick (WC) complementary matching; b) a mismatch (T-T) embedded within a WC match region; and c) a WC match region surrounding a bulged base (T). The bulged base can be described as a deletion from the left-hand strand or an insertion into the right-hand strand.]

Error-Correction Methods From Computer Science Adapted to Biotechnology

Methods for repairing faulty oligonucleotides contained within surface-bound probe arrays:
- Use error-correcting codes for the design of error-free probes.
- Use "Error-Correction (EC)" DNA strands, specifically designed to bind both error-containing and error-free probes.

[Figure: an original probe and an error-free EC strand (prefix and suffix regions) produced by biased-error synthesis.]

Error-Correction of Synthesized DNA Strands Resulting in Overhangs

[Figure: Original Error-Containing Probe Array -> Error-Correction Strands Bound -> Primer Extension -> Error-Free Probe Array.]

- Extra benefit: duplex probes containing single-stranded overhangs are less error-prone than simple single-stranded probes.

Synthesizing EC Strands

[Figure: Biased-Error DNA Synthesis. An error-free DNA strand S is converted by biased-error synthesis into a synthesized error-containing DNA strand.]

- Direct synthesis and purification: small scale only.
- Biased-Error Chemical Synthesis:
*** Recommended (details in paper). The relative simplicity of the biased-error chemical synthesis approach makes it the most appealing of the methods for generating diverse prefixes for EC strands.

Other Methods for Generating Diverse Prefixes for EC Strands:
- Mutagenesis via polymerase enzymes.
- DNA self-assembly.

Current Status of the DNA Search Project

Tasks:
- Computer preprocessing of the image database
  - image segmentation
  - wavelet transform
  - vector quantization
- Computer simulation of DNA search
- Experimental synthesis of the DNA library
  - computer search for sequences defining the DNA library
  - two-stage synthesis of the DNA library
- Experimental tests of associative search
  - select and amplify a chosen DNA library element (Status: Tested)
  - read out results

Future Work

Other Applications to DNA Database Search: Digital Tagged Natural DNA

[Figure: "Wet" Data Base Strand. Prefix: "Digital Tag", DNA words encoding Boolean variables. Suffix: natural DNA.]

- DNA strands are augmented with prefix "digital tag strands" consisting of a sequence of DNA words encoding Boolean values.
- Example: the extended database might consist of natural DNA strands (e.g., from blood or other body tissues) with appended "digital tag strands" consisting of DNA words encoding identifying information about each strand (such as the social security number of the person whose DNA was sampled, cell type, the date, further medical data, etc.).
- The "digital tag strands" may have been constructed by previous BMC processing.

Associative Search With Boolean Conditionals

- Combine our methods for DNA associative search with BMC methods for solving the SAT problem (e.g., using surface chemistry techniques).
- Vectors of the database are augmented with "digital tag vectors" consisting of a list of n' Boolean values, encoding binary information about the vector.
- An extended query consists of a query vector to be matched and a Boolean formula to be satisfied.
- The extended query requires finding those database vectors that closely match the query vector and whose Boolean variables also satisfy the query's Boolean formula.
- Execute the extended query in two stages:
  1. First execute the Boolean formula portion of the query as a SAT problem, using biomolecular computing techniques. Strands not encoding SAT solutions are deleted, so all the remaining DNA strands satisfy the Boolean formula.
  2. Then execute our associative search procedure on the remaining strands, to find the closest match to the query vector that satisfies the query's Boolean formula.

NRO DII 2001 Proposal

"A PEDA-OP BIOCHEMICAL SYSTEM FOR PROCESSING DATA BASE QUERIES WITH AFM IMAGE OUTPUT"

Summary: We propose a biochemical system for:
- storing an image database,
- processing logical Boolean database queries on properties of these images, and
- output of these queries.

The system is unique in its capabilities:
- to process, within a few minutes, complex logical queries
- in a database with at least 10^15 elements.

The total mass of the storage scales linearly with database size:
- a petabyte database requires less than 1/10 of a gram of DNA (in solution within approximately 20 milliliters of water).

The rendering of the images selected by queries is done at the molecular scale by a self-assembly process:
- the output images are single molecules, imaged by an atomic force microscope.

DNA Database: DNA Strands Encode Images with Boolean Vectors

[Figure: "Wet" Data Base Strand. Prefix: encodes the image. Suffix: "Digital Tag", encodes Boolean variables.]

"Digital" DNA strands:
- Prefix: a sequence of DNA words encoding the image.
- Suffix: DNA words encoding Boolean values that give identifying information about each strand:
  - origin location and date of image acquisition
  - other image properties.
- The "digital tag strands" may have been constructed by previous BMC processing.
The Image Database: assume a database whose elements are vectors with:
- Prefix: encodes the pixels of an image.
- Suffix: encodes a list of n Boolean values defining properties of the image:
  - the origin, date, and placement of the image, or
  - properties detected within the image, etc.

Logical Database Queries:
- The task is to process logical database queries on this database.
- A query is defined by a formula of Boolean logic.
- The query processing selects only those database elements whose Boolean variables satisfy the query.
- The output is the set of selected images in the database.

Boolean Query Processing: Biochemical Steps

- Initialization: execute operations that concatenate to each DNA strand in the database O(K log L) copies of the strand.
- Boolean query: an AND of a list of K logical clauses.
  - Each clause is an OR of a list of literals (Boolean variables or their negations); one literal needs to be satisfied.
- Logical query processing: process each clause C in the formula in turn, selectively amplifying the DNA strands whose Boolean variables satisfy a literal of C.
- Operations to satisfy clause C:
  - Add PCR primers encoding the literals of C and their complements.
  - Execute a series of primer-extension reactions that replicate only those DNA strands (or complements) that encode a literal of C.
  - Repeat O(log L) PCR cycles: this amplifies only the DNA strands whose Boolean variables satisfy a literal of C.
- After processing all clauses: the output strands satisfying all the clauses vastly predominate over all other strands of the DNA database.
- Exquisitely sensitive: needs < 10 identical strands of DNA that satisfy the query.

Unique Method for Output of the Selected Images:
- Render the selected images as a patterned 2D lattice at the molecular scale: a few tens of Angstroms per pixel.
- Scalable to extremely large images; not diffraction limited.
- Can be viewed by an atomic force microscope.

Self-Assembly of Patterned 2D Lattices:
- Tiles (DNA nanostructures) self-assemble around each segment of a DNA strand encoding an image pixel.
- Each tile has a surface perturbation depending on pixel intensity.
- The tiles then self-assemble into a 2D tiling lattice.

DNA Nanostructures

TAO tile: 3 double-stranded DNA helices with Holliday junctions.

[Figure: TAO tile strand diagram showing the base sequences of strands 1 to 4.]

2D DNA Self-Assembled Tilings: Rendering Simple Banded Images

[Figures: B* tiles with loops; atomic force microscope image of bands generated by B* tiles with attached beads; 2D DNA self-assembled tiling.]

The Process of Rendering an Image: Self-Assembly of Tiles around a DNA Strand Defining the Pixels of an Image

[Figure: DNA self-assembled tiling.]

Challenge Problem: Rendering a 100 x 100 Image via a DNA Self-Assembled Molecule

[Figure: illustration of a portion containing the NRO letters.]
- The actual tiling would be of size at least 100 x 100 and would include a detailed image with the NRO logo.
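Returning to the Boolean query processing steps described earlier (process each clause in turn, amplifying only strands that satisfy one of its literals), the effect on strand counts can be illustrated with a small simulation. This is an idealized sketch only: the data layout and names are hypothetical, and it assumes perfect doubling per PCR cycle with no amplification of non-matching strands, which real chemistry would only approximate.

```python
def process_query(strand_counts, clauses, cycles):
    """Simulate clause-by-clause selective amplification.

    strand_counts: dict mapping a tuple of Boolean values (one strand type)
                   to its copy count.
    clauses: list of clauses; each clause is a list of
             (variable_index, wanted_value) literals, joined by OR.
    cycles: PCR cycles per clause (idealized doubling per cycle).
    """
    counts = dict(strand_counts)
    for clause in clauses:
        for bits in counts:
            # A strand is amplified if it satisfies at least one literal.
            if any(bits[i] == wanted for i, wanted in clause):
                counts[bits] *= 2 ** cycles
    return counts


# Two Boolean variables, 10 copies of each of the 4 possible strand types.
initial = {(a, b): 10 for a in (True, False) for b in (True, False)}
# Query: (x0) AND (x1 OR NOT x0)
clauses = [[(0, True)], [(1, True), (0, False)]]
counts = process_query(initial, clauses, cycles=3)
```

Only the strand type satisfying every clause is amplified in every round, so its count grows by the amplification factor raised to the number of clauses, while every other type misses at least one round.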