Pattern Matching with DNA Computers (Shah, Niemier)
PATTERN MATCHING WITH DNA COMPUTERS – DRAFT
Shetu N. Shah, Michael T. Niemier
College of Computing, Georgia Institute of Technology, Atlanta, Georgia USA
14 September 2004
Introduction
We explore implementing a pattern matching application using DNA as the
computational medium. Given a string of input and a specified pattern, the application
should return the location of all matches of the pattern in the input. DNA will not simply
serve as the substrate; it will be used to represent the inputs and to drive the logic for the
application. While this implementation does not utilize a systolic approach, it shall
exploit the inherent parallelism in DNA.
The inputs to the application will be two single-stranded oligonucleotides (short strings of
nucleotides): one oligo will be the string we want to search, the other oligo—which is
shorter than the search string—will be the DNA string of the pattern we want to find. The
underlying procedure in our approach is the Sanger sequencing procedure (Sanger).
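For reference, the input/output behaviour expected of the application can be stated in a few lines of software. The following is a minimal Python sketch of the task only, not of the DNA procedure; the function name `find_all_matches` and the use of 0-based positions are illustrative assumptions.

```python
def find_all_matches(w: str, pattern: str) -> list:
    """Return the 0-based start index of every match, overlapping ones included."""
    if not pattern:
        raise ValueError("pattern must be non-empty")
    matches = []
    start = w.find(pattern)
    while start != -1:
        matches.append(start)
        # Advance by one base, not by len(pattern), so overlapping matches are kept.
        start = w.find(pattern, start + 1)
    return matches


print(find_all_matches("GCATGATCGG", "AT"))
```

Any DNA implementation should report the same set of positions as this baseline, up to the choice of indexing.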
Background
Since 1994 when Leonard Adleman first published his work on solving an instance of the
Hamiltonian Path Problem using custom-synthesized strands of DNA, many researchers
have been exploring ways to harness the computational power of DNA (Adleman).
Briefly, DNA (deoxyribonucleic acid) is composed of nucleotides, or bases, strung
together to form a chain. These four bases are adenine, thymine, guanine, and cytosine
(abbreviated as A, T, G, and C). These bases make strong but selective bonds: adenine
will only pair with thymine, and guanine will only pair with cytosine. Furthermore, DNA
computing is attractive because of its potential for massive parallelism. For example, if a
probe strand of DNA is introduced into a test tube containing other strands of DNA, the
probe will “test” itself with each of the other strands for a complement in a parallel
fashion. That is, the probe strand will be chemically attracted to its complement strand
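and will not have to be tested with each strand individually. Because all n strands in the
tube are tested at once, the hybridization step of the proposed approach takes O(1) time,
versus O(n) time for testing them one at a time.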
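The pairing rule and the probe test can be mimicked in software. The sketch below is an idealized model that assumes perfect, error-free Watson-Crick pairing; the helper names `reverse_complement`, `anneals`, and `probe_tube` are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(strand: str) -> str:
    """Pairing partner of `strand`, with both strands written 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(strand))


def anneals(probe: str, target: str) -> bool:
    """True if the probe is perfectly complementary to some stretch of target."""
    return reverse_complement(probe) in target


def probe_tube(probe: str, tube: list) -> list:
    # In software this loop performs one test per strand, O(n) in total.
    # In the test tube all strands are tested simultaneously.
    return [strand for strand in tube if anneals(probe, strand)]


print(probe_tube("GCTT", ["TTAAGCTT", "CCCC"]))
```

The explicit loop in `probe_tube` is exactly the sequential cost that hybridization in solution avoids.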
Let us define the following:
The DNA alphabet Σ_DNA = {A, T, G, C}.
The language L_DNA = {s | s ∈ Σ_DNA^+}.
The search string w ∈ L_DNA.
The pattern w_pattern ∈ L_DNA.
Tools
DNA computing has borrowed many of its techniques and procedures from the fields of
biochemistry and molecular biology. This section gives an overview of some of these tools,
which are discussed in more detail in (Maley).
Anneal. Also known as “hybridization”, annealing is when two complementary, single-stranded DNA molecules join to form a double strand of DNA when suspended in solution. This
occurs through the hydrogen bonds that arise when complementary base pairs are brought
into proximity (see Figure X).
Figure X: Annealing is when two complementary single strands of DNA join to form one double strand of
DNA.
Melt. The temperature of a solution is raised beyond the point where the longest double
strands of DNA are stable, and the weak hydrogen bonds are broken. The double-stranded DNA will separate into single-stranded DNA (see Figure X).
Figure X: Melting is when the hydrogen bonds in a double strand of DNA are broken, usually by heating
the DNA, to result in two single strands of DNA.
Ligate. This concatenation of DNA strands is most efficiently performed by allowing
single strands to anneal together and then using ligase to seal the covalent bonds between
the adjacent fragments.
Figure X: To connect strand y to strand x, a “glue strand” is used to bring the two strands into proximity
via annealing. Then a ligase enzyme is added to fuse the two strands together.
Polymerase extension. When a short strand is annealed to a longer strand, polymerase
enzymes can attach to the 3’ end of the shorter strand and extend it one base at a time,
building a sequence complementary to the longer strand (see Figure
X).
Figure X: Polymerase extension is when a polymerase enzyme attaches to the 3’ end of a short primer
sequence and allows the complement of the longer sequence to be constructed.
Amplify. Often an experiment starts with only a single strand of DNA, a very small
and fragile sample to have at hand. The Polymerase Chain Reaction (PCR) is a process
by which a single strand of DNA is replicated to create an exponentially large sample.
PCR is often used to amplify DNA so that it can be seen by the naked eye through
separation techniques, such as gel electrophoresis (see Figure X).
Figure X: PCR starts when a double-stranded DNA template is melted into two single strands. Primers
then anneal to both strands. Polymerase enzymes extend the 3’ end of the primers to create double-stranded
replicas of the template. This process is then repeated to exponentially increase the number of templates; it
can be repeated as long as enough primers remain to sustain the reaction.
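The exponential growth described for PCR can be made concrete with a small calculation. This is a sketch under the idealized assumption that a fixed fraction of templates is copied in every cycle; the function name `pcr_copies` and the `efficiency` parameter are illustrative, not part of any protocol cited here.

```python
def pcr_copies(initial_templates: int, cycles: int, efficiency: float = 1.0) -> float:
    """Expected number of templates after `cycles` rounds of PCR.

    efficiency = 1.0 models the ideal case in which every template is copied
    in every cycle; real reactions fall short of this.
    """
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must be between 0 and 1")
    return initial_templates * (1.0 + efficiency) ** cycles


print(pcr_copies(1, 30))
```

In the ideal case a single template becomes 2^30, or roughly a billion, copies after 30 cycles.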
Chain-Termination Sequencing Procedure
The chain-termination sequencing procedure, developed by Frederick Sanger et al. in
1977, determines nucleotide sequences by generating populations of DNA fragments that
all have one end in common and terminate at each possible position. The procedure uses
in vitro DNA synthesis in the presence of specific chain-terminators. Specifically,
2’,3’-dideoxyribonucleoside 5’-triphosphate (ddXTP), where the nucleoside (X) can be
A, T, G, or C, is the most commonly used chain-terminator (Snustad).
The normal DNA precursors are 2’-deoxyribonucleoside 5’-triphosphates (dXTPs), which
have a hydroxyl group (OH) at the 3’ position. This hydroxyl group is an absolute
requirement for chain elongation with DNA polymerase. The ddXTPs lack the 3’-OH;
thus, once a ddXTP is incorporated, chain elongation cannot continue, and the chain is
said to terminate (see Figure X).
Figure X: Comparison of the structures of the normal DNA precursor 2’-deoxyribonucleoside triphosphate
and the chain terminator 2’,3’-dideoxyribonucleoside triphosphate used in DNA sequencing reactions
(Snustad).
Using ddATP, ddTTP, ddCTP, and ddGTP as the chain-terminators in a DNA synthesis
reaction will result in a population containing every prefix of the complement to the
original strand, that is, every substring of the complement that starts opposite the first
nucleotide at the 3’ end of the original strand
(see Figure X). To obtain a suitably high probability that the population will contain
fragments terminating at each respective base, the ratio of dXTP:ddXTP in a given
reaction is approximately 100:1 (Sanger).
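The role of the 100:1 ratio can be illustrated with a simple probability model. The sketch below assumes that each incorporated base is a terminator with probability 1/(ratio + 1), independently of position and base type; this is a simplification of the real chemistry, and the function name `termination_probabilities` is hypothetical.

```python
def termination_probabilities(template_length: int, ratio: float = 100.0) -> list:
    """P(chain stops after exactly k bases) for k = 1 .. template_length.

    Assumes a terminator is picked with probability p = 1 / (ratio + 1) at
    every position, so the stopping position is geometrically distributed.
    """
    p = 1.0 / (ratio + 1.0)
    return [(1.0 - p) ** (k - 1) * p for k in range(1, template_length + 1)]


for k, prob in enumerate(termination_probabilities(10), start=1):
    print(k, round(prob, 5))
```

Under this model each fragment length is produced by a little under 1% of the chains, so a large population of template copies is needed for every length to be represented.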
Template:   3’ G C A T G A T C G G 5’

Fragments:  5’ C G T A C T A G C Cdd
            5’ C G T A C T A G Cdd
            5’ C G T A C T A Gdd
            5’ C G T A C T Add
            5’ C G T A C Tdd
            5’ C G T A Cdd
            5’ C G T Add
            5’ C G Tdd
            5’ C Gdd
            5’ Cdd
            + + + + + + + + + + +
Figure X: As the template strand (blue) is replicated using the chain-termination procedure, all possible
substrings (red) of the complement to the template that start with the first nucleotide at the 3’ end of the
template are produced with almost certain probability. When these substrings are separated using gel
electrophoresis, the shorter strands travel the farthest through the gel toward the positive charge.
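The fragment population shown in the figure can be reproduced with a few lines of Python. This is an idealized sketch that assumes every possible termination point is actually reached; the function name `chain_termination_fragments` and the `dd` suffix convention are illustrative.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def chain_termination_fragments(template_3_to_5: str) -> list:
    """All prefixes of the complement, each written 5' to 3' and ending in 'dd'.

    `template_3_to_5` is the template read from its 3' end, as in the figure.
    """
    complement = "".join(COMPLEMENT[base] for base in template_3_to_5)
    return [complement[:k] + "dd" for k in range(1, len(complement) + 1)]


for fragment in chain_termination_fragments("GCATGATCGG"):
    print(fragment)
```

The list is ordered from shortest to longest, which is also the order in which the bands are read from the positive end of the gel.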
After the population has been generated, the fragments are denatured (melted) from their
template strands and separated by gel electrophoresis (see Figure X).
Gel electrophoresis is a method of separating DNA molecules by size. An agarose or
acrylamide gel is used as the medium for this procedure; agarose gels are better sieves for
larger molecules (larger than a few hundred nucleotides) while acrylamide gels yield
better resolutions for separating smaller DNA molecules (Snustad). The DNA
populations are loaded into wells at one end of the gel. Since DNA molecules hold a
negative electric charge, a positive charge is applied to the opposite end of the gel to
attract the DNA. The structure of the gel is similar to a sieve. As the DNA molecules
travel toward the positive charge, the gel makes it more difficult for larger molecules to
pass and easier for smaller molecules to pass. Therefore, the closer the DNA molecule is
to the positive end of the gel, the shorter the oligonucleotide (Khalsa).
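The size-sorting effect of the gel can be modelled as grouping fragments by length. The sketch below ignores all physical detail (gel concentration, voltage, run time) and simply assumes that fragments of equal length form one band; the function name `run_gel` is hypothetical.

```python
from collections import Counter


def run_gel(fragments: list) -> list:
    """Return (length, copies) bands, ordered from the positive end outward."""
    bands = Counter(len(fragment) for fragment in fragments)
    # Shorter fragments travel farther, so the shortest length comes first.
    return sorted(bands.items())


print(run_gel(["CGT", "C", "CG", "C"]))
```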
The gel-separated fragments are then transferred onto a membrane via Southern blotting
(see Figure X), a technique developed by Edwin Southern (Sanger).
Figure X: Southern blot procedure used to transfer DNA strands, separated by gel electrophoresis, onto
nylon membranes (Snustad).
A nitrocellulose membrane or a positively charged nylon membrane is typically used.
Transfer is usually done by capillary action, although a vacuum blot apparatus may be
used instead. The vacuum blot apparatus works similarly to capillary action except the
vacuum sucks more of the transfer solution (usually SSC, a solution containing sodium
chloride and sodium citrate) through the gel and the membrane, so the transfer process
only takes about an hour instead of several hours with capillary action. Once the DNA
fragments are transferred onto the membrane, the membrane should be dried and exposed to
ultraviolet light. The UV light will create covalent bonds between the DNA and the membrane (Khalsa).
The membrane is now ready for probing with a radioactively labeled DNA strand
(hybridization).
Matching the Pattern
The probe used is w_pattern, which represents the pattern we wish to match, labeled with
radioactive 32P. The probe will anneal to the immobilized DNA on the membrane due to
the binding of complementary strands. The nonhybridized probe is then washed off the
membrane, and the membrane is exposed to X-ray film to detect any presence of
radioactivity from the probe.
Interpreting the Results
When the X-ray film is developed, the dark bands will show the positions of the DNA
sequences that have hybridized with the probe. The film should be read from bottom to
top (i.e., from the shortest fragment to the longest). If no dark bands are present, the
pattern was not found in the string.
If the pattern is found, the position of the first band will reveal the position of the pattern
because this will be the shortest substring that contains the complete pattern. Note,
however, that all substrings longer than this will also detect at least this one instance of
the pattern because the probe also will have annealed to all subsequent fragments. By
measuring the light intensities with a spectrophotometer, one can determine whether
multiple instances of the pattern were found on the same strand. Because each probe
emits radiation independent of other probes, it follows that the light intensity of a
fragment with exactly two matches should be twice that of a fragment with only one
match. We expect a graph of band intensity against fragment length to be a step function.
<figure of expected results of spectroscopy>
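The expected step function can be computed for any input. The sketch below assumes an idealized readout in which every bound probe contributes one unit of signal and every possible fragment length is present; the function name `band_intensities` is hypothetical, and the pattern is assumed to be non-empty.

```python
def band_intensities(w: str, pattern: str) -> list:
    """Relative signal for fragment lengths 1 .. len(w), one unit per bound probe.

    The fragment of length k covers positions 0 .. k-1 of w, so it binds one
    probe for every occurrence of the pattern that ends at or before k.
    """
    m = len(pattern)
    starts = [i for i in range(len(w) - m + 1) if w[i:i + m] == pattern]
    return [sum(1 for i in starts if i + m <= k) for k in range(1, len(w) + 1)]


print(band_intensities("GCATGATCGG", "AT"))
```

Each increase in the sequence marks the shortest fragment that contains one more complete instance of the pattern, so the step positions give the end positions of the matches.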
An Improvement
The aforementioned analysis, while feasible, can be rather tedious and complicated. A
spectrophotometer is expensive and may not be readily available. If we could somehow
guarantee that the probe would attach only to the 3’ end of the strand, then each band on
the X-ray film would indicate a distinct instance of the pattern. Even if multiple instances
of the pattern overlap in the string we want to search, each instance should be easily
identifiable by the naked eye. This modification would provide more clarity and would
eliminate the need for a spectrophotometer or other special tools.
To accomplish this, we must first dedicate one of the nucleotides as a “terminating
nucleotide” to mark the 3’ end of the strand. If the probe ends in the complement of this
terminating nucleotide, then the probe will only anneal to the 3’ end of the strand, as
desired. However, this means we must reserve T as our terminating nucleotide and restrict
the synthesized substrands to A, G, and C; the input strand, being their complement, is
then written with T, G, and C only. We keep G and C because a higher ratio of Gs and
Cs makes for more stable strands (reference needed). Note that the roles of A and T could
be interchanged and still produce the same results. The chain-terminators Add,
Gdd, and Cdd are now “upgraded” to ATdd, GTdd, and CTdd, respectively. The results of
PCR with Sanger’s chain-terminating procedure are all substrands of the complement of the
original input strand that start opposite the 3’ end of the original strand. Each of these
substrands will have an extra Tdd at the 3’ end (See Figure X).
Template:   3’ G C T G T C G G 5’

Fragments:  5’ C G A C A G C CTdd
            5’ C G A C A G CTdd
            5’ C G A C A GTdd
            5’ C G A C ATdd
            5’ C G A CTdd
            5’ C G ATdd
            5’ C GTdd
            5’ CTdd
            + + + + + + + + + +
Figure X: Results of gel electrophoresis with modified chain-terminators. Each substring now has a Tdd
marking the 3’ end of the strand. Note, however, that ∑ = {T, G, C} for the template, so that no T other
than the terminating Tdd appears in a substring.
We now create a probe w_pattern concatenated with A, the complement of our terminating
nucleotide T. Since each of the substrands has exactly one T from the chain-terminator,
the A from the probe will complement this T and cause the probe to anneal only at the 3’
ends of the substrands.
If these probes are radioactively labeled, we can detect all instances of our pattern by
exposing the membrane to X-ray film in the same manner as before. If no
bands appear on the film, then no instances of the pattern were found in the input
strand. If any bands do appear, the length of the respective substrand
reveals the position of the pattern on the strand.
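The band pattern expected from the modified procedure can be sketched as follows. The model assumes that the probe binds only where the pattern sits directly against the terminating Tdd, and that the input strand is written over {T, G, C}; the function name `band_lengths_with_terminator` is hypothetical.

```python
def band_lengths_with_terminator(w: str, pattern: str) -> list:
    """Fragment lengths (not counting the Tdd cap) at which a band appears.

    A band appears for the fragment covering positions 0 .. k-1 of w only if
    the pattern occupies the last len(pattern) positions of that stretch.
    """
    if "A" in w or "A" in pattern:
        raise ValueError("the modified procedure assumes an alphabet of T, G, C")
    m = len(pattern)
    return [k for k in range(m, len(w) + 1) if w[k - m:k] == pattern]


print(band_lengths_with_terminator("GCTGTCGG", "GT"))
```

Each reported length k corresponds to a match starting at position k - len(pattern), counted from 0, so overlapping instances yield separate bands rather than a brighter one.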
Size Estimates
Oligonucleotides can currently be chemically synthesized without much
error up to a length of 70-80 nucleotides (Ogihara). While it is possible to synthesize oligos 150
nucleotides long and longer, the yield of long oligos is extremely low because of imperfect
coupling, that is, the less-than-certain probability that each nucleotide will correctly attach.
Additional losses result from purification processes (HPLC, PAGE, or other), which are highly
recommended with long oligos (“Metabion”).
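The effect of coupling efficiency on yield can be estimated with a short calculation. The sketch below assumes the common simplification that an oligo of n nucleotides needs n - 1 coupling steps, each succeeding independently with the same probability; the 99% default and the function name `full_length_yield` are illustrative assumptions, not figures from the cited sources.

```python
def full_length_yield(length: int, coupling_efficiency: float = 0.99) -> float:
    """Fraction of strands that reach full length, before purification losses."""
    if length < 1:
        raise ValueError("length must be at least 1")
    return coupling_efficiency ** (length - 1)


for n in (80, 150):
    print(n, round(full_length_yield(n), 3))
```

Under the assumed 99% figure, roughly 45% of 80-nucleotide strands and roughly 22% of 150-nucleotide strands would reach full length; a lower efficiency widens the gap considerably.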
The maximum length that can be synthesized routinely and economically is
approximately 80 nucleotides long (“Metabion”). This may be far too short to represent a
search space of practical length. However, the past ten years have shown great advances
in DNA synthesis, and additional progress in this area may increase the length of usable,
synthesizable oligos.
… lab equipment / space needed to do the computation
Performance Estimates
PCR: 8-12 hours
Gel electrophoresis: 3-6 hours
Southern blotting: 1 to several hours (vacuum vs. capillary action)
Drying: 2 hours ?
Hybridization: ???
X-ray development: < 1 hour
Energy at each step?
Information density: 1 bit per cubic nanometer (Adleman)
Error rates?
Buildability
The oligo length limitations notwithstanding (see Size Estimates), we should be able to
implement this with current technology. The genetic tools used (e.g., gel electrophoresis,
Southern blot, hybridization) are affordable and readily available to the genetics and
biochemistry communities.
Outstanding Issues
By using the modified procedure, we must omit one of the nucleotides from Σ_DNA. This is
not ideal because the encoding of w and w_pattern must now be done with fewer characters.
Craig Venter of XXX is leading a team to synthesize pairs of additional nucleotides,
which exhibit the essential properties of the current set (reference). If successful, this
project would increase the size of Σ_DNA and allow w and w_pattern to be encoded on shorter
oligonucleotides.
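The cost of a smaller alphabet can be quantified as information per strand. The sketch below uses the standard upper bound of log2(k) bits per symbol for an alphabet of k symbols; the function name `bits_per_oligo` and the 80-nucleotide example length are illustrative.

```python
import math


def bits_per_oligo(alphabet_size: int, length: int) -> float:
    """Upper bound on the information carried by one strand, in bits."""
    if alphabet_size < 2 or length < 0:
        raise ValueError("need at least two symbols and a non-negative length")
    return length * math.log2(alphabet_size)


for k in (3, 4):
    print(k, round(bits_per_oligo(k, 80), 1))
```

With four nucleotides an 80-nucleotide oligo carries at most 160 bits; with three it carries about 127, so the same data needs a correspondingly longer strand.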
Hairpin sequences: the oligos should not have long sections on the same strand that are
complementary to each other. These sections may pair up with each other and create a hairpin
structure, which makes the strand unusable. However, since the DNA molecules will be
denatured, hairpins may be unable to form, so this may not be a problem.
G/C-rich sequences: sequences rich in Gs and Cs are more stable than those
with more As and Ts. Approximately 50% of the nucleotides should be Gs and Cs;
however, it is not yet clear whether this will affect the scope and scale of this application.
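The G/C guideline can be checked mechanically for any candidate strand. The sketch below computes the G/C fraction and, as a rough stability indicator, the Wallace rule of thumb (2 °C per A or T, 4 °C per G or C); that rule is only an approximation intended for short oligos, and it is used here as an illustrative assumption rather than as part of the proposed procedure.

```python
def gc_fraction(strand: str) -> float:
    """Fraction of bases in the strand that are G or C."""
    if not strand:
        raise ValueError("strand must be non-empty")
    return sum(base in "GC" for base in strand) / len(strand)


def wallace_tm(strand: str) -> int:
    """Rule-of-thumb melting temperature in degrees Celsius for a short oligo."""
    at = sum(base in "AT" for base in strand)
    gc = sum(base in "GC" for base in strand)
    return 2 * at + 4 * gc


print(gc_fraction("GCATGATCGG"), wallace_tm("GCATGATCGG"))
```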
References
Adleman, Leonard. “Molecular Computation of Solutions to Combinatorial Problems”
Science. (266) 11 Nov. 1994. pp. 1021-4.
Khalsa, Guruatma. “Mama Ji’s Molecular Kitchen.”
http://lsvl.la.asu.edu/resources/mamajis/index.html. Last viewed: 7 Apr. 2004.
Kusuma, Srinivasu. Telephone interview. 25 Apr. 2004.
Maley, Carlo C. “DNA Computing and Its Frontiers” Molecular Computing. Sienko, et
al., eds. MIT: Cambridge 2003. 153-89.
“Metabion”. http://www.metabion.com/faqs/FAQ00.html. Last viewed: 7 Apr. 2004.
Ogihara, Mitsunori and Animesh Ray. “Circuit Evaluation: Thoughts on a Killer
Application in DNA Computing” Computing with Bio-molecules: Theory and
Experiments. Gheorghe Paun, ed. Springer: Singapore; 1998, 111-26.
Sanger, F. et al. “DNA sequencing with chain-terminating inhibitors” Proc. Natl. Acad.
Sci. 74 (12), Dec. 1977. pp. 5463-7.
Snustad, D. Peter and Michael J. Simmons. Principles of Genetics. 3 ed. 2003.