Terry Harrison – notes on seminal reading:
_________________________________________________________________
Preserving Digital Information: Final Report and Recommendations
Task Force on Archiving of Digital Information – Research Libraries Group
http://www.rlg.org/ArchTF/
_________________________________________________________________
This document takes a broad look at the goals and requirements of preserving digital objects from a nationwide (and possibly even larger) perspective. A distributed archive solution is recommended, for reasons of both cost and redundancy. A digital archive certification mechanism will be needed to ensure the capabilities of participating archives. – TH

Background: At the end of 1994 the Commission on Preservation and Access (CPA) and RLG created a Task Force on Archiving of Digital Information charged with investigating and recommending means to ensure "continued access indefinitely into the future of records stored in digital electronic form." In May 1996, the 21-member task force, co-chaired by Donald Waters and John Garrett, completed its final report. Both RLG and CPA have made this widely available. (In 1997, CPA merged with the Council on Library Resources to become CLIR, the Council on Library and Information Resources.) This is perhaps the first commissioned study on digital archiving.

64 pages
Status: currently reading
_________________________________________________________________
Thoughts:
- Status of emulation software / hardware
- Migration – to move from one technology to another
- Refresh – to periodically recopy (e.g. to a drive with
fewer hours on it)
- Digital archives are different from digital libraries
- Archives need a “certification” process (to assure competency)
- Archives need to be able to exercise an aggressive rescue function for digital information

Features of digital landscape
Stakeholders
- Hardware obsolescence – no new parts for old machines
- 1960 Census – only 2 UNIVAC type II-A machines were left in the world when the Census Bureau decided to attempt to refresh the data
- 1964 – the 1st email was not saved. Not sure which research group sent the first message: MIT, Carnegie Institute of Technology, or Cambridge University
- 1960s LUNR Project – Land Use and Natural Resources Inventory Project; computerized map of NY depicting patterns of land usage and identifying natural resources. In the 1980s the data could not be retrieved from the computer tapes, leaving only printouts and transparency overlays.
- 1985 – Committee on the Records of Government: “The United States is in danger of losing its memory”

“If we are effectively to preserve for future generations the portion of this rapidly expanding corpus of information in digital form that represents our cultural record, we need to understand the costs of doing so and we need to commit ourselves technically, legally, economically and organizationally to the full dimensions of the task. Failure to look for trusted means and methods of digital preservation will certainly exact a stiff, long-term cultural penalty.” PAGE 4

- Refreshing is good, but not a complete solution
- 2-5 year technical obsolescence cycle (shorter than media shelf life)
- Hardware- and software-dependent records may not be forward compatible
- Expensive for hardware and software to maintain backward compatibility
- Proprietary systems often don’t work with
competing products
- Emulators could help

Migration
- Defined as “periodic transfer of digital materials from one hardware/software configuration to another, or from one generation of computer technology to another”
- Migration may not yield perfect copies onto newer technologies
  o E.g. .psd to .jpeg is a lossy transition
- Forward migration of information to a new standard or application program is “time consuming, costly, and more complex than simply refreshing”

Legal and Organizational Issues
- Always complex
- “Bits know no borders”

Pseudo summary
“(The) greatest fear about the life of information in the digital future: namely that owners or custodians who can no longer bear the expense and difficulty (of moving digital information forward in the digital future) will deliberately or inadvertently, through a simple failure to act, destroy the objects without regard for future use.”

The Need for a Deep Infrastructure with “Various systematic supports”
- Many different aspects of the digital environment will have to be addressed in different ways – TH
- No one-stop solution – TH

Conceptual Framework
- A national system of digital archives envisioned
- Long-term storage and access goals (issues not always dealt with by digital libraries)
- Repository criteria:
  o Development of an archive certification standard suggested
  o “Aggressive fail-safe mechanism” to rescue information that is endangered at its current location.

Plan of Work
- Archival responsibility starts at creation (with the
creator)

“We can afford to continue and increase economic and social investments in digital information objects and in the responsibilities for them on the information superhighway if, and only if, we also create the archival means for the knowledge the objects and repositories contain to endure and redound to the benefit of future generations”

INFORMATION OBJECTS IN THE DIGITAL LANDSCAPE p.11
Integrity of digital information: determined by content, fixity, reference, provenance, and context; integrity is what gives the digital information object its value.

Content:
- Defined as preserving the unique bit configuration, with checksums to verify it. All well and good, but limiting, esp. where the bits are tied to particular hardware/software (think word-processing document).
- Bits versus use. Saving an image as a JPEG is lossy, but aids storage and use.
- Save bits? Save the idea conveyed? Save it so that it can be used?
- Answers will be different for different information

Fixity:
- An object’s integrity is lost if it is constantly changing (e.g. document revisions can obscure the original document).
- Watermarking of digital objects (e.g. to mark the canonical version)
- Snapshots in time (e.g. for databases) – sounds similar to the Internet Archive's Wayback Machine

Reference:
- “Must be able to locate it definitively and reliably over time…”
- URNs, URLs
- Must take into account provenance and context

Provenance:
- Tracing the path a digital object has taken from its origin
- Helpful in sorting out multiple versions, derived works, source of data (instrumentation), migration, transformations, authenticity

Context:
- How a digital object interacts with “the wider digital environment” p.18
- Technical context: hardware/software dependencies (e.g. disk may need a special drive, format may require a special application)
- Emulators are helpful in dealing with some of these issues (e.g. video game emulation)
- Medium – e.g. capturing the experience of using a CD-ROM

Stakeholder Interests p.19
- Those interested in making, appending, or using an object.
- Must be careful not to corrupt the actual object.

ARCHIVAL ROLES AND RESPONSIBILITIES p.21
- Best plan: distributed digital archives
- Must be stored and maintained in an accessible form p.22
- “Fail-safe mechanism” to protect records in danger of neglect, destruction, or abandonment (where one agency could rescue another’s endangered materials).
- Current and proposed copyright law doesn’t provide for an aggressive fail-safe mechanism
- Possibility: legally mandated depositories – creators bound to place a copy of their digital work in a certified digital archive in a standard archival format

Appraisal and Selection – an ongoing process; know where copies are to reduce redundant acquisition.
- "Which things do you keep?"
- Decisions on migration when it fundamentally alters the work

Accession – preparation of an object for archiving
- "Carefully packing them up"
- Metadata
- Deaccession should be announced to the public so rescue efforts can be made
- Access Control – Terms and Conditions (e.g. to meet copyright requirements)

Storage
- "Where to put them"
- TH – new cost metrics are making tape libraries no longer cost effective
- Online, near-line (e.g. robotic jukeboxes), off-line
- REDUNDANCY – cover your bum! – TH

Access
- "How to get to them.
How to protect them from unauthorized users"
- Prevent unauthorized use, protect intellectual property rights (facilitate transactions between rightsholders and users)

Systems Engineering
- "When to copy and how to do so"
- Help to determine when digital archives should migrate to new hardware and software p.27

MIGRATION STRATEGIES

Change Media
- E.g. text digital objects printed out and stored to microfilm (long-lasting, low operational barrier format)
- Migration may cause “flattening” of a non-standard object (e.g. can’t really make a microfilm version of a spreadsheet and maintain its internal computational capabilities)

Change Format
- Good to transform to a standard format, but this may be lossy (e.g. JPEG is a lossy image compression)

Incorporate Standards
- An archive's technological infrastructure should conform to widely adopted standards p.29

Build Migration Paths
- Work with industry/donors on backwards compatibility and migration paths.
- Support the development of industry standards with these issues in mind
- Vendors don’t often build migration paths between/with competitors

Using Processing Centers
- Develop centers that specialize in reformatting and migration of obsolete materials
- Emulation (both hardware and software) is valuable here.
- Development of a national laboratory for digital preservation (modeled after National Media Laboratory)

Managing Costs and Finances
- Storage costs declining
- Unknown rights-management costs
- Investment in systems engineering and infrastructure is critical for a distributed archive system

Cost Modeling p.31
- Hard to determine the costs of archiving the different kinds of digital objects
- Yale model
  o Used to compare a traditional paper-based archive versus its digital equivalent
  o Digital model most effective when resources are distributed

Financing
- Who will pay for all of this?
- Tax incentives and accounting rules favoring preservation?
- Digital information direct charging?
- APS (American Physical Society) and ACM (Association for Computing Machinery) are facing these issues as they create their own digital libraries

SUMMARY
- “First line of defense against loss of valuable digital information resides with the creator, providers and owners of digital information”
- Need for deep infrastructure to support a distributed digital archive
- Need a sufficient number of trusted organizations capable of storing, migrating, and providing access to the digital information
- Need a process of digital archive certification to develop trust
- “Certified archives must have the right and duty to pursue an aggressive rescue function … for valuable digital information in jeopardy of destruction, neglect or abandonment by its current custodian.”
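
TH-style aside (not from the report): the notes under Content mention checksums as a way to confirm that an object's unique bit configuration has been preserved, and the Refresh item describes periodic recopying. A minimal sketch of how the two could fit together follows. It assumes SHA-256 as the checksum, and the helper names sha256_of and refresh_copy are hypothetical, chosen only for illustration. It verifies bit-level identity only, so it says nothing about whether a migrated (reformatted) object is still usable or faithful.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def refresh_copy(source: Path, target: Path) -> str:
    """Recopy source to target (a 'refresh') and check the bits are unchanged.

    Hypothetical helper: raises ValueError if the checksum of the copy
    differs from the checksum of the original.
    """
    before = sha256_of(source)
    shutil.copyfile(source, target)
    after = sha256_of(target)
    if before != after:
        raise ValueError(f"checksum mismatch after copying to {target}")
    return after


def demo() -> bool:
    """Refresh a small sample file, then show that altering one byte is detected."""
    with tempfile.TemporaryDirectory() as tmp:
        original = Path(tmp) / "record.dat"
        original.write_bytes(b"sample archival record\n")  # made-up sample data
        copy = Path(tmp) / "record_copy.dat"
        checksum = refresh_copy(original, copy)
        # Simulate corruption of the copy: one character differs.
        copy.write_bytes(b"sample archival recorD\n")
        return checksum == sha256_of(original) and checksum != sha256_of(copy)
```

Calling demo() refreshes a small made-up file and then checks that a one-byte change to the copy produces a different digest. A format migration (e.g. .psd to .jpeg) would change the bits by design, so a check like this fits refreshing rather than migration, which matches the "all well and good, but limiting" remark under Content.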