Digital Information Preservation in DNA

Robust Chemical Preservation of Digital Information on DNA in Silica with
Error-Correcting Codes (Angewandte 2015)
Towards practical, high-capacity, low-maintenance information storage in
synthesized DNA (Nature 2013)
Next-Generation Digital Information Storage in DNA (Science 2012)
Mikk Puustusmaa 2015
Introduction
• Information, such as text printed on paper or images
projected onto microfilm, can survive for over 500 years.
However, the storage of digital information for time
frames exceeding 50 years is challenging.
• As digital information continues to accumulate, higher
density and longer-term storage solutions are necessary
• DNA has many potential advantages as a medium for
immutable, high-latency information storage
– At the theoretical maximum, DNA can encode two bits
per nucleotide (nt), or 455 exabytes (455 × 10¹² MB)
per gram of single-stranded DNA (the sketch below
checks this figure)
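
The 455-exabyte figure can be sanity-checked with simple arithmetic; a minimal sketch in Python, assuming an average single-stranded nucleotide mass of about 330 g/mol (an illustrative constant, not taken from the papers):

```python
AVOGADRO = 6.022e23   # molecules per mole
NT_MASS = 330.0       # approximate mass of one ssDNA nucleotide, g/mol (assumption)

nt_per_gram = AVOGADRO / NT_MASS        # ~1.8e21 nucleotides per gram
bits_per_gram = 2 * nt_per_gram         # 2 bits per nucleotide (theoretical max)
exabytes_per_gram = bits_per_gram / 8 / 1e18

print(f"{exabytes_per_gram:.0f} EB per gram")  # ~456 EB, matching the quoted 455 EB
```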
Advantages
• Unlike most digital storage media, DNA storage is not
restricted to a planar layer and is often readable despite
degradation in non-ideal conditions over millennia.
• Most recently, 300,000-year-old mitochondrial DNA from
bears and humans has been sequenced.
• DNA’s essential biological role provides access to
natural reading and writing enzymes and ensures that
DNA will remain a readable standard for the foreseeable
future.
Advantages
• The number of bases of synthesized DNA needed to
encode information grows linearly with the amount of
information to be stored, but we must also consider the
indexing information required to reconstruct full-length
files from short fragments.
• As indexing information grows only as the logarithm of
the number of fragments to be indexed, the total amount
of synthesized DNA required grows sub-linearly (see the
sketch below)
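
A quick sketch of how slowly the per-fragment index grows; the 96-bit payload is borrowed from the Science 2012 design described later, and the fragment counts are arbitrary:

```python
import math

PAYLOAD_BITS = 96  # per-fragment payload, as in the Science 2012 design

# The index needed to address n fragments grows only as ceil(log2(n)).
for n_fragments in (1_000, 100_000, 10_000_000):
    index_bits = math.ceil(math.log2(n_fragments))
    overhead = index_bits / (index_bits + PAYLOAD_BITS)
    print(f"{n_fragments:>10,} fragments -> {index_bits:2d} index bits "
          f"({overhead:.1%} of each fragment)")
```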
Problems
• Because synthesis and sequencing of very long DNA
strands remain technically challenging, data must be
stored on many short DNA segments
• Approaches using living vectors are less reliable,
scalable, and cost-efficient, owing to disadvantages
such as:
– constraints on the genomic elements and locations
that can be manipulated without affecting viability
– mutation, which causes the fidelity of stored and
decoded information to degrade over time
– possibly the requirement for carefully regulated
storage conditions
Science 2012
• They converted an HTML-coded draft of a book that included
53,426 words, 11 JPG images, and one JavaScript program
into a 5.27-megabit bitstream (≈659 kB)
• Individual bits were converted to A or C for 0 and G or T
for 1. Bases were chosen randomly while disallowing
homopolymer runs greater than three. Bitstream addresses
were 19 bits long and numbered consecutively, starting from
0000000000000000001.
Science 2012
• They encoded one bit per base (A or C for zero, G or T for
one) instead of two. This allows a message to be encoded
in many ways, so sequences that are difficult to read or
write, such as extreme GC content, repeats, or secondary
structure, can be avoided (a sketch follows below)
• Splitting the bit stream into addressed data blocks
eliminates the need for long DNA constructs that are
difficult to assemble at this scale
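
A minimal sketch of this one-bit-per-base encoding; the re-draw rule for breaking homopolymer runs is an illustration of the principle, not the authors' exact implementation:

```python
import random

def encode_bits(bits, max_run=3, seed=0):
    """Encode one bit per base: 0 -> A or C, 1 -> G or T.

    Bases are picked at random; if a pick would extend a homopolymer
    run beyond max_run, the other base of the pair is used instead.
    """
    rng = random.Random(seed)
    pair = {"0": "AC", "1": "GT"}
    seq, run = [], 0
    for bit in bits:
        base = rng.choice(pair[bit])
        if seq and base == seq[-1] and run >= max_run:
            # Swap to the other base encoding the same bit to break the run.
            base = pair[bit][0] if base == pair[bit][1] else pair[bit][1]
        run = run + 1 if seq and base == seq[-1] else 1
        seq.append(base)
    return "".join(seq)

# Decoding is deterministic regardless of the random choices: A/C -> 0, G/T -> 1.
decoded = "".join("0" if b in "AC" else "1" for b in encode_bits("0110100111"))
assert decoded == "0110100111"
```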
Synthesis
• They synthesized 54,898 oligonucleotides on Agilent's
Oligo Library Synthesis (OLS) microarray platform.
• To avoid cloning and sequence-verifying constructs,
they synthesized, stored, and sequenced many copies of
each individual oligo
• Each of the 54,898 oligonucleotides is 159 nt long:
– a 96-bit data block (96 nt),
– a 19-bit address specifying the location of the data
block in the bit stream (19 nt),
– flanking 22-nt common sequences for amplification
and sequencing (see the layout sketch below).
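
A sketch of the oligo layout; the primer sequences are placeholders, and the internal order of data block and address is an assumption for illustration:

```python
PRIMER_5 = "N" * 22   # placeholder for the real 22-nt common sequence
PRIMER_3 = "N" * 22   # placeholder for the real 22-nt common sequence

def make_oligo(data_96nt, address_19nt):
    """Assemble one oligo: common sequence + data + address + common sequence.

    The data-before-address order is assumed here; the flanking common
    sequences serve amplification and sequencing.
    """
    assert len(data_96nt) == 96 and len(address_19nt) == 19
    oligo = PRIMER_5 + data_96nt + address_19nt + PRIMER_3
    assert len(oligo) == 159          # 22 + 96 + 19 + 22 = 159 nt
    return oligo
```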
Sequencing
• They sequenced the amplified library by loading it on a
single lane of a HiSeq 2000 using paired-end 100-nt reads.
The lane yielded 346,151,426 paired reads, with 87.14% of
bases ≥ Q30 and a mean Q score of 34.16. Because a 115-bp
construct was sequenced with paired 100-bp reads, SeqPrep
(9) was used to combine overlapping reads into a single
contig.
• They joined overlapping paired-end 100-nt reads to
reduce the effect of sequencing error
• Errors in synthesis and sequencing are rarely coincident,
so each molecular copy corrects errors in the other copies
Results
• Then, using only reads that gave the expected 115-nt
length and perfect barcode sequences, they generated a
consensus at each base of each data block, at an
average of ~3,000-fold coverage (see the consensus
sketch below)
• All data blocks were recovered, with a total of 10 bit
errors out of 5.27 million; these were predominantly
located within homopolymer runs at the end of the oligo,
where there was only single-sequence coverage
• Future work could use compression, redundant
encodings, parity checks, and error correction to
improve density, error rate, and safety.
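
A simplified sketch of the per-base consensus step: because errors rarely hit the same position in different molecular copies, a per-column majority vote recovers the intended base. This is an illustration of the principle, not the authors' pipeline:

```python
from collections import Counter

def consensus(reads):
    """Majority-vote consensus across same-length reads of one data block."""
    assert len({len(r) for r in reads}) == 1, "reads must be pre-filtered to one length"
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*reads))

reads = ["ACGTAC", "ACGTAC", "ACCTAC", "ACGTAC"]  # third read has one error
print(consensus(reads))                           # ACGTAC, error voted away
```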
Nature 2013
They encoded computer files totalling 739
kilobytes of hard-disk storage, with an
estimated Shannon information of 5.2 × 10⁶ bits,
into a DNA code, synthesized this DNA,
sequenced it, and reconstructed the original files
with 100% accuracy
Nature 2013
• The five files comprised all 154 of Shakespeare’s sonnets
(ASCII text), a classic scientific paper (PDF format), a
medium-resolution colour photograph of the European
Bioinformatics Institute (JPEG 2000 format), a 26-s excerpt
from Martin Luther King’s 1963 ‘I have a dream’ speech
(MP3 format), and a Huffman code used in this study to
convert bytes to base-3 digits (ASCII text), giving a total
of 757,051 bytes, or a Shannon information of 5.2 × 10⁶ bits
• The bytes comprising each file were represented as single
DNA sequences with no homopolymers, which are associated
with higher error rates in existing high-throughput
sequencing technologies and led to errors in a recent
DNA-storage experiment
Nature 2013
Trit-to-nucleotide conversion: each base-3 digit is written
as one of the three nucleotides that differ from the
previously written nucleotide, so homopolymers are excluded
by construction:

previous nt | trit 0 | trit 1 | trit 2
A           |   C    |   G    |   T
C           |   G    |   T    |   A
G           |   T    |   A    |   C
T           |   A    |   C    |   G
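
A sketch of this rotating code in Python; the choice of the initial "previous" base is an assumption for illustration:

```python
def trits_to_dna(trits, prev="A"):
    """Convert base-3 digits to DNA using the rotating code above.

    Each trit selects one of the three bases that differ from the
    previously written base, so no homopolymer can ever be produced.
    """
    order = "ACGT"
    seq = []
    for t in trits:
        # The three candidates are the bases after `prev`, in cyclic order.
        candidates = [order[(order.index(prev) + k) % 4] for k in (1, 2, 3)]
        prev = candidates[t]
        seq.append(prev)
    return "".join(seq)

print(trits_to_dna([0, 2, 1, 0, 0]))  # CAGTA: no two adjacent bases are equal
```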
Nature 2013
• Each DNA sequence was split into overlapping segments,
generating fourfold redundancy, and alternate segments were
converted to their reverse complement. These measures
reduce the probability of systematic failure for any
particular string, which could lead to uncorrectable errors
and data loss (see the sketch after this list).
• Each segment was then augmented with indexing information
that permitted determination of the file from which it
originated and its location within that file, plus simple
parity-check error detection.
• In all, the five files were represented by a total of
153,335 strings of DNA, each comprising 117 nucleotides (nt).
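
A sketch of the fourfold-redundant splitting, assuming 100-nt segments offset by 25 nt (so each base is covered four times); indexing and parity bases are omitted:

```python
def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def split_fourfold(seq, seg_len=100, step=25):
    """Split into overlapping segments, reverse-complementing every
    second segment, as described in the bullet above."""
    segments = []
    for i, start in enumerate(range(0, len(seq) - seg_len + 1, step)):
        seg = seq[start:start + seg_len]
        segments.append(revcomp(seg) if i % 2 else seg)
    return segments

demo = split_fourfold("ACGT" * 100)   # a 400-nt toy sequence
print(len(demo), len(demo[0]))        # 13 segments of 100 nt each
```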
Synthesis
• They synthesized oligonucleotides (oligos) corresponding
to the designed DNA strings using an updated version of
Agilent Technologies’ OLS platform
• Errors occur only rarely (about 1 error per 500 bases)
and independently in the different copies of each string
• The DNA was kept in lyophilized form, which is expected
to have excellent long-term preservation characteristics
Sequencing
• Sequencing was performed in paired-end mode on the
Illumina HiSeq 2000
• Strings with uncertainties due to synthesis or
sequencing errors were discarded, and the remainder were
decoded using the reverse of the encoding procedure, with
the error-detection bases and the properties of the coding
scheme allowing further strings containing errors to be
discarded. Although many discarded strings will have
contained information that could have been recovered with
more sophisticated decoding, the high level of redundancy
and sequencing coverage rendered this unnecessary in their
experiment.
Results
• Four of the five resulting DNA sequences could be fully
decoded without intervention. The fifth, however,
contained two gaps, each a run of 25 bases, for which no
segment corresponding to the original DNA was detected.
Each gap was caused by the failure to sequence any oligo
representing any of four consecutive overlapping segments.
• Inspection of the neighbouring regions of the
reconstructed sequence allowed the authors to hypothesize
what the missing nucleotides should have been, and the 50
bases were inserted manually. This sequence could then
also be decoded, and inspection confirmed that the
original computer files had been reconstructed with 100%
accuracy.
Results
• This also suggests that the mean sequencing coverage of
1,308× was considerably in excess of that needed for
reliable decoding. Indeed, the data indicate that reducing
the coverage by a factor of 10 (or even more) would have
left the decoding characteristics unaltered, further
illustrating the robustness of the DNA-storage method.
Angewandte (2015)
• They translated 83 kB of information into 4,991 DNA
segments, each 158 nucleotides long, which were
encapsulated in silica
• They employed error-correcting codes to correct
storage-related errors.
• Accelerated aging experiments were performed to
measure DNA decay kinetics, which show that data can
be archived on DNA for millennia under a wide range of
conditions.
Angewandte (2015)
Error correction
• In classical data-storage devices, error-correcting codes
add redundancy that allows the correction of essentially
all errors occurring during usage. To account for the
specific requirements of storage on DNA, the existing
coding schemes had to be adapted: individual sequences are
indexed, and two independent error-correcting codes
(specifically Reed–Solomon codes) are used in a
concatenated fashion (a minimal illustration follows below)
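
To make the Reed–Solomon idea concrete, a minimal sketch using the third-party Python package reedsolo (pip install reedsolo); the parameters are illustrative, not the paper's actual inner/outer code settings:

```python
from reedsolo import RSCodec

rsc = RSCodec(10)              # 10 parity bytes: corrects up to 5 byte errors
payload = b"stored on DNA"
codeword = rsc.encode(payload)

corrupted = bytearray(codeword)
corrupted[0] ^= 0xFF           # corrupt two bytes, as decay or misreads might
corrupted[5] ^= 0xFF

# Recent reedsolo versions return (message, full codeword, errata positions);
# older versions return just the message.
result = rsc.decode(bytes(corrupted))
decoded = result[0] if isinstance(result, tuple) else result
assert bytes(decoded) == payload   # both errors corrected
```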
Angewandte (2015)
• To physically test the code, they stored the text of two
old documents: the Swiss Federal Charter of 1291 and the
English translation of the Method of Archimedes
• The (uncompressed) text is 83 kilobytes in total;
encoding it resulted in 4,991 sequences, each 117
nucleotides long, to which constant primers were added
(giving a total length of 158 nt)
• The sequences were synthesized on an electrochemical
microarray platform (CustomArray), prepared for
sequencing by a custom PCR (polymerase chain reaction)
method, and read on the Illumina MiSeq platform
Results
• When reading the sequences back, the inner code had to
correct an average of 0.7 nt errors per sequence, and the
outer code had to account for a loss of 0.3% of all
sequences and correct about 0.4% of them, resulting in a
complete and error-free recovery of the original
information.
Results
• To test whether DNA stored in the solid state is more
stable, they took the 4,991-element oligo pool and tested
the stability of three previously established dry-storage
procedures for DNA in accelerated aging experiments.
Results
• From the data shown in Figure 2, it is evident that DNA
preservation is best in the inorganic storage format
(DNA encapsulated in silica), which has the lowest local
water concentration. Because the DNA molecules are
separated from the environment by an inorganic layer, the
degree of preservation is not affected by the humidity of
the storage environment. This independence from humidity
is very important for guaranteeing long-term stability, as
a non-humid environment is hard to maintain
• In contrast, other stability-increasing factors such as
low temperature (e.g. permafrost) and the absence of light
can be maintained for extended periods of time without
energy input.
Results
• The original information could be recovered error-free
even after treating the DNA in silica at 70 °C for one
week. This is thermally equivalent to storing information
on DNA in central Europe for 2,000 years (see the
Arrhenius sketch below).
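
The thermal-equivalence claim follows from Arrhenius extrapolation; a sketch with an assumed activation energy of about 155 kJ/mol and an assumed central-European mean temperature of 9.4 °C (both illustrative; the paper fits its own values from the aging data):

```python
import math

EA = 155e3               # activation energy, J/mol (assumption)
R = 8.314                # gas constant, J/(mol*K)
T_STRESS = 273.15 + 70   # accelerated-aging temperature, K
T_USE = 273.15 + 9.4     # assumed central-European mean temperature, K

# Ratio of decay rates at the two temperatures (acceleration factor).
accel = math.exp(EA / R * (1 / T_USE - 1 / T_STRESS))
years = accel * 7 / 365.25   # one week at 70 °C, expressed in years at T_USE
print(f"acceleration factor {accel:.2e} -> ~{years:,.0f} years")  # ~2,000 years
```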
Price
• With negligible computational costs and optimized use of
the technologies, current costs were estimated at
$12,400/MB for information storage in DNA and $220/MB for
information decoding.
• With current technology and their encoding scheme (Nature
2013), DNA-based storage may be cost-effective for
archives of several megabytes with a 600–5,000-yr
horizon. A one-order-of-magnitude reduction in synthesis
costs reduces this to ~50–500 yr; with a
two-order-of-magnitude reduction, as can be expected in
less than a decade if current trends continue, DNA-based
storage would become cost-effective even for sub-50-year
archives (a toy break-even sketch follows below)
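
A toy break-even model of the horizon argument: DNA costs a lot to write but nothing to maintain, while conventional media must be periodically rewritten. Apart from the $12,400/MB synthesis figure quoted above, all numbers here are hypothetical:

```python
DNA_WRITE = 12_400    # $/MB, one-off synthesis cost (Nature 2013 estimate)
TAPE_COPY = 62        # $/MB per tape generation, hypothetical
MIGRATE_EVERY = 5     # years between tape migrations, hypothetical

def tape_cost(years):
    # Tape must be rewritten periodically, so its cost grows with the horizon.
    return TAPE_COPY * (1 + years // MIGRATE_EVERY)

for years in (500, 1_000, 5_000):
    winner = "DNA" if DNA_WRITE < tape_cost(years) else "tape"
    print(f"{years:>5} yr: tape ${tape_cost(years):,} vs DNA ${DNA_WRITE:,} -> {winner}")
# With these made-up tape numbers the crossover lands near 1,000 yr,
# inside the 600-5,000-yr band quoted above.
```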
Price
• DNA-based storage might already be economically viable
for long-horizon archives with a low expectation of
extensive access, such as government and historical
records
• An example in a scientific context is CERN’s CASTOR
system, which stores a total of 80 PB of Large Hadron
Collider data and grows at 15 PB yr⁻¹; only 10% is
maintained on disk. Archives of older data are needed for
potential future verification of events, but access rates
decrease considerably 2–3 years after collection. Further
examples are found in astronomy, medicine, and
interplanetary exploration
Conclusion
• Density, stability, and energy efficiency are all potential
advantages of DNA storage, although costs and times for
writing and reading are currently impractical for all but
century-scale archives
• DNA-based storage remains feasible at scales many orders
of magnitude greater than current global data volumes
• However, the costs of DNA synthesis and sequencing have
been dropping at exponential rates of 5- and 12-fold per
year, respectively, much faster than those of electronic
media at 1.6-fold per year
• DNA synthesis costs are dropping at a pace that should
make storing data on DNA cost-effective for sub-50-year
archiving within a decade.
Thank you for listening!