18 Media Selection and Migration

This best practice covers two topics: weighing the reliability of digital storage media against its cost, and the inevitable need to migrate existing data to different (usually newer-technology) storage. The choice of storage media is sometimes intertwined with the choice of digital file formats, but formats are dealt with in Best Practice 2.8.

Table of Contents
18.1 Media Selection
18.2 Digital Storage Media in Use
18.3 Handling & Storage of Optical Media
18.4 Risks and Reliability Considerations
18.5 Error Detection & Correction: Media validation / Fixity tests
18.6 Media Migration
18.7 References and Additional Sources

18.1 Media Selection

There are many types of data storage media, and selecting the best type(s) for a given situation can be complex. Factors such as total data size, rate of data growth, user access needs, desired length of retention, preservation needs, data value, and available budget can all affect the suitability of storage types for a given situation. Even in cases where we know a data set should be online, some of those factors will affect which online storage array(s) are used and what sort of backup or replication configuration is used.

• For questions about online storage needs, start by contacting IT Infrastructure & Software Development (ISD). Depending on the project needs and funding, storage might be provided to you by ISD, Grainger Library, or outsourced.
• For questions about experience with optical media, contact Digital Content Creation (DCC).
• For questions about the Library’s Fedora digital repository, currently in the planning stages, contact Tom Habing.

18.2 Digital Storage Media in Use

At the Library, several types of digital media are currently used to store digitized content. Each one presents tradeoffs with respect to reliability, expected longevity, ease of access and validation, and various costs (including purchase, maintenance, labor, energy, and space).
To keep this document manageable, it refers only to storage media in use by the UIUC Library.

• Magnetic disk drives
   o Online: Networked storage volumes (includes SAN and server-attached storage disk arrays).
      • Positive considerations: High density; highly accessible and available; high speed; automated validation is very feasible; duplication and remote replication are fast and relatively easy; most flexible option; many variations available.
      • Negative considerations: High cost; high energy use; most susceptible to data loss by human error.
   o Offline: Disk drives dismounted and stored unpowered.
      • Positive considerations: Inexpensive; very high density; no energy use while offline; fast data storage and retrieval while connected (temporarily online); can be connected to servers for multi-user access if needed.
      • Negative considerations: Should be duplicated to multiple physical disks; periodic verification is feasible but requires some labor; not all hard disks are designed for external storage; some hard disks still fail with power cycling.
• Magnetic tape, offline (various formats)
   o Summary: Not recommended for primary storage but can be effective for duplicate/backup copies.
   o Negative considerations: There are hundreds of tape formats, and they become effectively obsolete more quickly than the media deteriorates; often recorded using proprietary hardware and/or data encoding; moderate labor costs; periodic verification may be prohibitively costly; dependability is variable.
   o Positive considerations: Very low energy use; data density can be much higher than optical; can be cost-effective for some storage scenarios (most often backups of primary online storage); CITES has a highly reliable and scalable tape architecture, but does not (yet) provide archiving capabilities beyond one year.
• Optical discs
   o CD-R and DVD-R. Choose discs using phthalocyanine dye with gold reflective layer(s).
      • Negative considerations: High labor costs; poor accessibility; periodic verification is not cost-effective; low density by today’s standards.
      • Positive considerations: Low cost for equipment; very low energy use; write-once mode of operation.
   o CD-RW and DVD-RW. [Not recommended for data preservation]
• Externally-hosted repositories
   o Digital preservation systems (e.g. HathiTrust)
   o Other (e.g. Internet Archive)

18.3 Handling & Storage of Optical Media

To maximize the longevity and readability of optical media (CD-R, DVD-R, etc.), the following are recommended [1]:

• Select only discs manufactured with phthalocyanine dye and gold reflective layer(s).
• Record discs using moderate recording speeds. High-speed recording increases the likelihood they’ll be unreadable on many other systems.
• Immediately after recording them, re-read and verify the data against the original. Preferably (though less efficiently) this should be done on a different optical disc drive from a different manufacturer.
• Handle them only by the outer edge or center hole.
• Don’t purchase huge quantities of optical discs in advance of need. Their shelf life before recording is relatively short.
• Don’t apply labels or other items to the discs. Label them carefully using a CD-safe permanent marker.
• Store them upright in individual jewel cases.
• Store them in a cool, dry area with stable temperature and humidity and no direct UV light. NARA recommends 62-70 degrees F (+/- 2 degrees fluctuation) and 35-50% relative humidity (+/- 5% fluctuation). Other sources recommend different ranges for different optical media types [2].

A longer but similar list of these recommendations appears on page vi of NIST Special Publication 500-252 [3].

18.4 Risks and Reliability Considerations

Every digital storage medium is subject to partial and total data loss.
The causes of loss include human error, software or hardware malfunction, physical media deterioration, mechanical failure, damage from electromagnetic fields or environmental conditions, theft, disaster damage (fire, flood, earthquake, etc.), and eventual unreadability due to obsolescence and unavailability of hardware and software that can still read or interface with a given medium. These disparate causes require different solutions.

Best practice for increasing the reliability of digital storage media always involves one or more means of creating redundancy in the data. Redundancy significantly reduces the statistical likelihood of actual information loss even when an individual storage unit inevitably fails. Best practices also require methods of detecting data corruption in the media. Moreover, all highly reliable and disaster-resistant storage systems require that the data reside in at least two physical locations as geographically distant as feasible.

Even with high-quality CD-R and DVD-R media with a gold reflective layer, which DSD and DCC have used for some collections, experience has shown significant media failure rates both initially and upon later attempts to read the discs.

18.5 Error Detection & Correction: Media validation / Fixity tests

Even “best practice” RAID-protected storage volumes suffer from data loss that ordinarily goes undetected [4,5]. To address these issues, a few highly resilient file systems have been developed. Most of these are very expensive proprietary systems out of our reach, but the Library IT Infrastructure & Software Development (ISD) unit has begun working with Sun’s open source ZFS [6] in a new pair of storage systems for this additional security.

For any long-term digital preservation system, this type of silent data loss must be addressed at a level above the hardware, using software methods of recurring validation and recovery.
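
As an illustration of recurring software validation, the following is a minimal Python sketch, not a description of any system in use at the Library. It assumes the content lives in an ordinary directory tree and uses SHA-256 as the digest; the function names (sha256_of, build_manifest, verify) are hypothetical and chosen only for this example. A manifest of digests is recorded once, and later runs recompute each digest and report any file that is missing or no longer matches.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its current digest."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify(root: Path, manifest: dict[str, str]) -> list[str]:
    """Return relative paths that are missing or whose digest has changed."""
    failed = []
    for rel, expected in manifest.items():
        target = root / rel
        if not target.is_file() or sha256_of(target) != expected:
            failed.append(rel)
    return failed
```

In use, the manifest returned by build_manifest would be saved separately from the data it describes, and verify would be run on a schedule. Each path it returns is a candidate for restoration from another copy; choosing that copy and performing the restore are outside this sketch.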
In practice, such validation can be done by a digital preservation system running proactive fixity checks, by an advanced file system like ZFS, or by both. These systems all compute and store one or more checksums (e.g. CRC) or stronger digest hashes (e.g. MD5, SHA-256) of the files and file system metadata. Later, the files can be reread, the checksum or hash recomputed, and the result compared to the original. Any difference indicates data corruption on the media and should trigger restoring that data from another copy.

Note, however, that running such fixity checks is rarely feasible in offline storage scenarios because of high labor requirements. Together with more convenient access to stored material, this is a strong argument in favor of online or automated near-line storage systems, despite their typically higher cost and energy use.

18.6 Media Migration

Since all storage media eventually deteriorate and/or become obsolete and inefficient, long-term data storage requires periodic migrations to newer physical media. This involves re-selecting the most appropriate medium at that point in time, then copying all the desired data from the old medium to the new one and verifying its integrity. Once migration is complete, the old media may be retired or destroyed as appropriate, unless they still have some useful lifespan remaining and are intentionally being retained as an additional backup copy.

18.7 References and Additional Sources

NARA Technical Information Paper No. 12: “Digital-Imaging and Optical Digital Data Disk Storage Systems: Long-Term Access Strategies for Federal Agencies”. http://www.archives.gov/preservation/technical/imaging-storage-report.html

Optical Storage Technology Association (OSTA): Understanding CD-R and CD-RW Longevity. http://www.osta.org/technology/cdqa13.htm

[1] NARA Frequently Asked Questions (FAQs) about Optical Storage Media: Storing Temporary Records on CDs and DVDs.
http://www.archives.gov/records-mgmt/initiatives/temp-opmedia-faq.html
[2] NIST Special Publication 500-252, Care and Handling of CDs and DVDs: A Guide for Librarians and Archivists, pg. 16, table 3. http://www.itl.nist.gov/iad/894.05/docs/CDandDVDCareandHandlingGuide.pdf
[3] NIST Special Publication 500-252, pg. vi.
[4] Summary of CERN’s data storage reliability study. http://storagemojo.com/2007/09/19/cerns-data-corruption-research/
[5] Carnegie Mellon Univ. paper “Disk Failures in the Real World: What Does an MTTF of 1,000,000 Hours Mean to You?” http://www.usenix.org/events/fast07/tech/schroeder.html
[6] The presentation at http://hub.opensolaris.org/bin/download/Community+Group+zfs/docs/zfslast.pdf summarizes the many benefits of ZFS compared to traditional online storage systems, including how it provides end-to-end file integrity and recovery. The most relevant pages are 12-18, 21-23, and 41.