Chronopolis: Preserving Our Digital Heritage
David Minor
UC San Diego
San Diego Supercomputer Center

What is Chronopolis?
• A digital preservation network developed by a national consortium, with initial funding from the Library of Congress / National Digital Information Infrastructure and Preservation Program (NDIIPP).
• Chronopolis partners are:
– San Diego Supercomputer Center (SDSC) and the UC San Diego (UCSD) Libraries
– University of Maryland Institute for Advanced Computer Studies (UMIACS)
– National Center for Atmospheric Research (NCAR) in Boulder, Colorado
UCSD/SDSC/UMIACS/NCAR http://chronopolis.sdsc.edu

Chronopolis Fast Facts
• Digital preservation environment using a data grid framework
• Designed to leverage capabilities at multiple institutions
• Emphasizes heterogeneous and redundant data storage systems
• Has a current storage capacity of 150 TB (50 TB at each of 3 nodes)
• Has geographically distributed copies of all data
• Includes detailed monitoring and monthly auditing of all data

Institutional Roles
• All partners provide:
– Storage, network support
– Complete copy of all data
– SRB support
• UCSD Libraries:
– Metadata expertise
• SDSC:
– Project management
– Finances, contracts, etc.
• UMIACS:
– Preservation tool development
– Storage technology testing
• NCAR:
– Data portal development

Data Providers
• California Digital Library
– 12 TB of data
– Crawls of political and government web sites
– ARC files, uniform size
– BagIt protocol for data transfer
• Inter-university Consortium for Political and Social Research (ICPSR)
– 10 TB of data
– 40+ years of social science research
– Millions of files
– Already using SRB
• North Carolina State University Libraries
– 6 TB of data
– State and local geospatial data
– BagIt protocol for data transfer
• Scripps Institution of Oceanography
– 1 TB of data
– 50 years of data from SIO research cruises
– Already using SRB

Core Chronopolis Tools
• Storage Resource Broker (SRB)
• BagIt
• SRB Replication Monitor
• Auditing Control Environment (ACE)
• Chronopolis Web Portal

Storage Resource Broker
• The underlying infrastructure of Chronopolis
• Each site is a separate zone with its own MCAT and management
• Data is replicated at each zone
• Will be moving to iRODS in the next few months

BagIt
BagIt is a hierarchical file packaging format for the exchange of generalized digital content.
• There is no software to install
• Consists of a base directory with a manifest file and a subdirectory with content
• Manifest file has a row for each content file with:
– Full path in content directory
– A checksum for the file
Holey Bags
• Have an additional ‘fetch.txt’ file in the base directory and an empty content directory
• URLs for each content file are listed in the fetch.txt file
• Can reduce transfer time by fetching content in parallel
http://www.digitalpreservation.gov/library/resources/tools/docs/bagitspec.pdf

SRB Replication Monitor
• Product of UMIACS
• A webapp that watches registered directories and ensures that copies exist at designated mirrors.
• The monitor stores enough information to know whether files have been added to or removed from the master site and when each file was last seen.
• Any action that the webapp takes on files is logged.
• The monitor does NOT do any type of integrity checking; this is the responsibility of other components (e.g., ACE).
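
The bookkeeping described for the SRB Replication Monitor can be illustrated with a small sketch. This is not the UMIACS webapp: the names `MonitorState` and `check_directory`, the in-memory file listings, and the log format are all hypothetical, chosen only to show how added files, removed files, last-seen times, and missing mirror copies might be tracked. In keeping with the slide, it performs no integrity checking.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class MonitorState:
    """What this hypothetical monitor remembers about one registered directory."""
    present: set[str] = field(default_factory=set)               # files seen at the last check
    last_seen: dict[str, datetime] = field(default_factory=dict)  # kept even after removal
    log: list[str] = field(default_factory=list)                  # every observation is logged


def check_directory(state, master_files, mirrors, now=None):
    """Compare the master listing with each mirror and record what changed.

    master_files: set of file paths currently at the master site.
    mirrors: mapping of mirror name -> set of file paths held there.
    Returns a mapping of mirror name -> sorted list of files still to be copied.
    Only file presence is compared; content is never hashed here.
    """
    now = now or datetime.now(timezone.utc)
    stamp = now.isoformat()

    # Files that appeared at or disappeared from the master since the last check.
    for path in sorted(master_files - state.present):
        state.log.append(f"{stamp} added {path}")
    for path in sorted(state.present - master_files):
        state.log.append(f"{stamp} removed {path}")

    # Refresh last-seen times; removed files keep their previous timestamp.
    for path in master_files:
        state.last_seen[path] = now
    state.present = set(master_files)

    # Work out which copies each designated mirror still lacks.
    missing = {}
    for name, held in mirrors.items():
        todo = sorted(master_files - held)
        if todo:
            state.log.append(f"{stamp} {name} missing {len(todo)} file(s)")
        missing[name] = todo
    return missing
```

A real monitor would obtain the listings from SRB rather than from Python sets and would trigger the copy itself; the sketch stops at reporting what needs to be replicated.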

Replication Process
[Diagram: Replication Monitor]

Auditing Control Environment (ACE)
• Product of UMIACS
• Software to protect the integrity of digital assets in the long term
• Underpinnings are based on rigorous cryptographic techniques
• Scalable, cost-effective, can interoperate with any archiving architecture

ACE – Overview
[Diagram labels: object, Client, Hash (obj), Integrity Token, ACE-IMS (Integrity Management Service), 3rd Party Auditor, ACE-AM (Audit Manager)]

ACE Audit
• Can audit millions of files and TBs of data
• Two types of audit:
– A file audit: checks files in registered directories against stored hashes to ensure files have not been corrupted
– Token audit: checks the stored hashes against a remote Integrity Management Server to ensure nobody has tampered with the stored hashes

ACE Audit
[Diagram layers: Object, Integrity Token, Cryptographic Summary Information, Witness]
1. Each digital object is audited locally using the integrity token, according to the policy set by the local manager.
2. The integrity management system periodically audits the integrity tokens according to its policies.
3. Cryptographic summaries are audited as necessary using the published witness values.

Web Portal
• Designed to give data providers an in-depth look at their holdings
• Shows where data is in all locations
• Unifies information from SRB, ACE and the Replication Monitor

Chronopolis Metadata
• Working with team from UCSD Libraries
• What technical metadata is system tracking?
• What descriptive metadata is present?
• What are the significant events?

[Diagram: labeled events across DP, Node 1, Node 2, Node 3, MCAT, the Replication Monitor and ACE, with data and manifest flows]
• ET-1 Service Level Agreement
• ET-2 Acquisition Transfer
• ET-3 Acquisition Validation
• ET-4 Acquisition Registration to SRB
• ET-5 Acquisition Registration into ACE
• ET-6 Inter-Node Inventory Check
• ET-7 Acquisition Replication
• ET-8 File Integrity Check

Future directions
• Update auditing procedures
• Updated portal
• Automation of collection ingest
• New collections and storage nodes
• Fully-fledged business model
• TRAC certification

http://chronopolis.sdsc.edu
minor@sdsc.edu
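
As a supplement to the two audit types listed on the ACE Audit slide, the sketch below shows the general idea of each. It is a simplified illustration, not ACE itself: SHA-256, the function names, and the use of a single digest over the whole hash table as a stand-in for an integrity token held by a remote service are all assumptions made for the example.

```python
import hashlib
from pathlib import Path


def sha256_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def register(directory):
    """Record a hash for every file under a registered directory."""
    root = Path(directory)
    return {
        p.relative_to(root).as_posix(): sha256_file(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def summarize(stored_hashes):
    """Digest of the whole hash table (assumed stand-in for an integrity token)."""
    digest = hashlib.sha256()
    for name in sorted(stored_hashes):
        digest.update(f"{name}\0{stored_hashes[name]}\n".encode("utf-8"))
    return digest.hexdigest()


def file_audit(directory, stored_hashes):
    """File audit: return names that are missing or no longer match their stored hash."""
    root = Path(directory)
    failed = []
    for name, expected in sorted(stored_hashes.items()):
        path = root / name
        if not path.is_file() or sha256_file(path) != expected:
            failed.append(name)
    return failed


def token_audit(stored_hashes, remote_summary):
    """Token audit: confirm the stored hashes still match the remotely held summary."""
    return summarize(stored_hashes) == remote_summary
```

In use, `register` and `summarize` would run once at ingest, with the summary handed to an independent party; `file_audit` then detects corrupted content, while `token_audit` detects changes to the stored hashes themselves.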