“What I Learned This Summer”: A Week at SAA’s First Electronic Records Summer Camp
Daniel Linke
University Archivist and Curator of Public Policy Papers
December 14, 2007

Geisel Library at UCSD (Photo by Sara Muth)
University of California, San Diego
August 6-10, 2007

Yes, that Geisel (Photo by Sara Muth)

Eleanor Roosevelt Campus (Photo by Sara Muth)

Our accommodations in the Asante dormitory (Photo by Sara Muth)

My suitemates: Peter Johnson, Eric Paquette, and Dylan McDonald (Photo courtesy of Eric Paquette)

27 attendees from a variety of institutions (government, educational, and private repositories):
• UCSD, UC-Irvine, Harvard Business School, U. New Mexico, UT Arlington, Occidental College, UW-Madison
• AZ, CA, NC, and WA State Archives
• CIGNA, National Fire Protection Association, Ford, History Associates
• Sacramento Archives and Museum
• Marist Brothers of Canada

Terrace of the college commons where we took our meals (Photos by Sara Muth)

Fellow “campers”: Police Explorers Club (Photo by Sara Muth)

Our classroom was within the SDSC (Photo by Eric Paquette)

Our classroom (Photo by Chien-Yi Hou)

Some instructors standing at the back (Photo by Chien-Yi Hou)

SAA Summer School Instructors
• Mark Conrad (NARA): Preservation principles
• Mike Smorul (U Md): Preservation services
• Reagan Moore (SDSC): Data grids
• Arcot (Raja) Rajasekar (SDSC): Advanced data grids
• Richard Marciano (SDSC): Preservation applications
• Chien-Yi Hou (SDSC): Preservation applications

What the week consisted of (in format) (Photo by Chien-Yi Hou)

What the week consisted of in subjects covered
• Monday
  – Electronic Records 101 (Conrad)
  – Components of an Electronic Records Program (Conrad)
  – Infrastructure Independence (Moore)
  – mySRB Tutorial (Moore)
• Tuesday
  – Appraisal and Disposition (Conrad, Marciano, Chien-Yi)
  – Accessioning (Smorul, Marciano, Conrad)
• Wednesday
  – Arrangement (Marciano, Conrad, Moore)
  – Description (Marciano, Rajasekar, Chien-Yi, Moore)
• Thursday
  – Preservation (Moore, Smorul, Chien-Yi)
  – Access (Moore, Marciano)
• Friday
  – Scalability (Moore, Marciano)
  – Getting started (Conrad, Moore)

What are Electronic Records?
• Easy to Define
  – Any Record that Can Only be Accessed With a Computer
• Hard to Define
  – Many Records Don’t Have an Analog Equivalent
  – Often Difficult to Say Where the “Boundaries” of a Record Are

Where Do They Come From?
• Types of applications that can create electronic records
  – Word processing
  – Databases
  – Spreadsheets
  – Geographic Information Systems
  – E-mail
  – Any Computer Application Could Potentially be Used to Create Electronic Records

Unique Qualities: Faster than Rabbits
• They Multiply!
• PERMANENT Federal Electronic Records
  – 1 to 5% of the Total Produced
  – Next 15 Years: 350 Petabytes Produced (1 Petabyte = 1,000 TB)
  – Beyond the Current State of the Art
• Archivists can Identify the Wheat and Chaff
  – Resource Allocators are Taking Notice

Unique Qualities: Handle With Care
• They are Fragile!
  – Easily Deleted
  – Keeping the Contextual Information Linked to the Data is Difficult
    • Without this it is difficult to assert you have authentic records

Unique Qualities: Manipulation
• The Good: Organized or Used in Multiple Ways
  – Records can be more easily used.
    • Records that would be difficult to use in paper form can be used quite easily in electronic form.
• The Not So Good:
  – Records can be easily changed.

Unique Qualities: Native Habitat vs. Zoo
• Original Applications
  – Run Out of Room
  – Go Belly Up
• Moving the Records Out of Their Native Habitat can be Challenging
  – Where is the Boundary Between the Records and the Application?
  – How do You Maintain Essential Characteristics in a Zoo (aka Preservation Environment)?
  – The Formats Become Obsolete, Too!

COMPONENTS OF AN ELECTRONIC RECORDS PROGRAM
1. Policies and Mandates
2. Technical Infrastructure
3. Social Infrastructure

Technical Infrastructure
• Challenge: there are NO proven methods for the long-term retention of electronic records in many formats
  – Ongoing Empirical Research: but theory does not Make it So!

Storage Resource Broker (SRB)

Infrastructure Independence
(Diagram labels: Records, Preservation Environment, Evolving Technology, External World)
Preservation environment middleware insulates records from changes in the external world

Infrastructure Independence
• Use data grids to preserve records independently of the choice of technology
• Management of archives properties
• Map technology components to preservation principles
  – Capabilities that support preservation requirements
• Construct preservation environment from components
  – Archival engineering perspective
• Use infrastructure independence to enable use of new technology
  – View that new technology is an opportunity instead of a challenge

Preservation Standards
• Architectural Model
  – OAIS, Reference Model for an Open Archival Information System
    • Representation information for each record
    • Submission / Archival / Dissemination Information Package (SIP / AIP / DIP)
  – Data grid: Storage Resource Broker (SRB), integrated Rule-Oriented Data System (iRODS)
  – Digital Library: DSpace, Fedora
• Metadata
  – Dublin Core
  – LCDRG, NARA Life Cycle Data Requirements Guide
  – PREMIS, Preservation Metadata Implementation Strategies
• Metadata organization
  – MPEG-21, ISO/IEC TR 21000-1: MPEG-21 Multimedia Framework
  – METS, Metadata Encoding and Transmission Standard
  – OAIS, Reference Model for an Open Archival Information System
• Submission / Harvesting
  – Producer Archive Interface (NASA)
  – OAI-PMH, Open Archives Initiative - Protocol for Metadata Harvesting
• Data format
  – pdf, xml (330 formats retrievable on web crawls)
• Assessment criteria
  – RLG/NARA TRAC - Trustworthy Repositories Audit & Certification: Criteria and Checklist.
http://wiki.digitalrepositoryauditandcertification.org/pub/Main/ReferenceInputDocuments/trac.pdf

Using a Data Grid – in Abstract
(Diagram label: Data Grid)
• User asks for data from the data grid
• The data is found and returned
• Where & how details are hidden

Using a Data Grid - Details
(Diagram labels: ux-brk14, ux-brk12, Oracle DB, Metadata Catalog, two Storage Resource Broker Servers)
• User asks for data
• Data request goes to SRB Server
• Server looks up information in catalog
• Catalog tells which SRB server has data
• 1st server asks 2nd for data
• The data is found and returned

For more details, see: Moore, Reagan, “Building Preservation Environments with Data Grid Technology”, American Archivist, vol. 69, no. 1, pp. 139-158, July 2006

Appraisal of ER: Get There Early
• Records Need to be Appraised:
  – Early in Their Lifecycle
    • Fragile
    • Ephemeral
  – In Their Native Habitat
    • Functionality

Technical Appraisal
• For Permanent Records, Have to Conduct Technical Appraisal
  – Feasibility of Preserving the Records
  – Identify all of the Digital Objects
  – Essential Characteristics
• At Scale!
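The request flow listed on the “Using a Data Grid - Details” slide above can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the class names `MetadataCatalog` and `BrokerServer`, the server names, and the sample record are all hypothetical, and nothing here is the actual SRB API. It shows only the idea that the user asks any one server, and the catalog decides where the data really lives.

```python
class MetadataCatalog:
    """Hypothetical stand-in for the catalog: logical name -> holding server."""

    def __init__(self):
        self._holder = {}

    def register(self, logical_name, server_name):
        self._holder[logical_name] = server_name

    def locate(self, logical_name):
        return self._holder[logical_name]


class BrokerServer:
    """Hypothetical broker server that stores data and forwards requests."""

    def __init__(self, name, catalog, network):
        self.name = name
        self.catalog = catalog
        self.network = network  # shared dict: server name -> BrokerServer
        self.storage = {}
        network[name] = self

    def store(self, logical_name, data):
        # Keep the bytes locally and tell the catalog where they are.
        self.storage[logical_name] = data
        self.catalog.register(logical_name, self.name)

    def read_local(self, logical_name):
        return self.storage[logical_name]

    def get(self, logical_name):
        # Step 1: look up in the catalog which server holds the data.
        holder = self.catalog.locate(logical_name)
        # Step 2: serve it locally, or ask the holding server for it.
        if holder == self.name:
            return self.read_local(logical_name)
        return self.network[holder].read_local(logical_name)


catalog = MetadataCatalog()
network = {}
first = BrokerServer("server-1", catalog, network)
second = BrokerServer("server-2", catalog, network)

# The record is stored on the second server, but the user asks the first.
second.store("reports/annual-2006.txt", b"example record")
data = first.get("reports/annual-2006.txt")
print(data)
```

The user never needs to know that the record sits on the second server, which is the “where & how details are hidden” point from the abstract slide.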
Bootcamp continued…

Appraise this !@#$

Disposition

Arrangement

In Action…

In Action…

Tapping into Archival Knowledge

Electronic Records "Summer Camp"

The Website

The Website, cont’d

Formulating Appraisal Rules
• Retrieve root webpage: ‘http://water.usgs.gov/lookup/getgislist’
• For each entry:
  – Create a “matching entry” collection on the SRB
  – Add ‘entry description’ metadata to that collection
  – Create “Description” subcollection
    • Load web page
    • Load all “.gif” | “.jpg” | “.jpeg” files
    • Load all “.doc”
    • Load metadata file
  – Create “ArcINFO” subcollection
    • Load all “.e00” | “.clr” | “.asc” | “.nit” | “.dlg” | “.txt” files
  – Create “Shape” subcollection
    • Load all “.shp” files
  – Create “SDTS” subcollection
    • Load all “.sdts” files
  – Create “Others” subcollection
    • Load “.tfw” | “.rdb” | “.clr” | “.asc” | “.prj” files
  – DECOMPRESS & LOAD “.zip” | “.gz” | “.tgz” | “.tar” | “.tar.gz” files

E-FOIA Document Collections: Dept. of State

National Archives and Records Administration Transcontinental Persistent Archive Prototype
Federation of Five Independent Data Grids
(Diagram labels: NARA I MCAT, NARA II MCAT, Georgia Tech MCAT, U Md MCAT, SDSC MCAT)
Extensible environment; can federate with additional research and education sites. Each data grid uses different vendor products.

ACE – Basic Methodology
• Three-tiered Cryptographic Information
  – Integrity Token (IT)
    • 1 IT/object
    • ~1KB
  – Cryptographic Summary Information (CSI), k:1 with Integrity Tokens
    • 1 CSI/time window
    • 1 CSI / (n) objects
    • ~100MB/year
  – Witness, l:1 with Cryptographic Summary Information
    • 1 Witness/week
    • ~2-3KB/year
• Each tier is periodically audited separately according to policies set by managers.

End of the day (Photos by Sara Muth)

Club Asante (Photos by Sara Muth, top, and Eric Paquette, right)

Commemorative Corkscrew (Photo by Gary Spurr)

Acknowledgments
Slides with text are from the course instructors’ PowerPoint presentations: Conrad et al.
Photos as credited.
(Photo by Eric Paquette)
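Looking back at the “Formulating Appraisal Rules” slide, the file-routing part of those rules can be sketched as a small lookup in Python. This is an illustration under assumptions, not the instructors' implementation: the function name `classify` and the "DECOMPRESS" and "UNMATCHED" labels are hypothetical, the web page and metadata file loads are left out because they are not chosen by extension, and because “.clr” and “.asc” appear under both “ArcINFO” and “Others” on the slide, the sketch assumes the first matching rule wins.

```python
from pathlib import PurePosixPath

# Subcollection rules in slide order. Assumption: first match wins, so the
# ".clr" and ".asc" files listed twice on the slide land in "ArcINFO".
SUBCOLLECTION_RULES = [
    ("Description", {".gif", ".jpg", ".jpeg", ".doc"}),
    ("ArcINFO", {".e00", ".clr", ".asc", ".nit", ".dlg", ".txt"}),
    ("Shape", {".shp"}),
    ("SDTS", {".sdts"}),
    ("Others", {".tfw", ".rdb", ".clr", ".asc", ".prj"}),
]

# Compressed or bundled files are unpacked before their contents are loaded.
ARCHIVE_SUFFIXES = (".tar.gz", ".zip", ".gz", ".tgz", ".tar")


def classify(filename: str) -> str:
    """Return the subcollection for a file name, or a hypothetical marker."""
    name = filename.lower()
    if name.endswith(ARCHIVE_SUFFIXES):
        return "DECOMPRESS"
    extension = PurePosixPath(name).suffix
    for subcollection, extensions in SUBCOLLECTION_RULES:
        if extension in extensions:
            return subcollection
    return "UNMATCHED"


for example in ["basin.e00", "roads.shp", "cover.JPG", "grid.prj", "data.tar.gz"]:
    print(example, "->", classify(example))
```

Writing the rules as an ordered table keeps the archival decision (which files belong together) separate from the loading mechanics, so a rule can be changed without touching the code that applies it.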