Trusted Datagrids: Library of Congress Projects with UCSD
Ardys Kozbial – UCSD Libraries
David Minor – SDSC

Building Trust in a 3rd Party Repository: A Pilot Project
David Minor, San Diego Supercomputer Center

How can the LC trust someone they can’t control?

Moving forward in the right direction requires more than fuzzy promises … it takes a combination of experts and tools.

Cyberinfrastructure
Cyberinfrastructure is the collection of ...
Resources: computers, data storage, networks, scientific instruments, experts, etc.
+
Glue: integrating software, systems, and organizations

“Effective cyberinfrastructure for the humanities and social sciences will allow scholars to focus their intellectual and scholarly energies on the issues that engage them, and to be effective users of new media and new technologies, rather than having to invent them.”
- ACLS Commission on Cyberinfrastructure for the Humanities & Social Sciences

“The mission of the San Diego Supercomputer Center (SDSC) is to empower communities in data-oriented research, education, and practice through the innovation and provision of Cyberinfrastructure”

SDSC ...
• Is one of the original NSF supercomputer centers
• Supports high performance computing systems
• Supports data applications for science, engineering, social sciences, cultural heritage institutions
• Has LARGE data capabilities
  • 3+ PB disk storage
  • 25+ PB tape storage

UCSD Libraries
• 3.5+ million volumes
• Digital Asset Management System (in development)
  • 250,000+ objects
  • 15+ TB
• Shared collections with UC
  • California Digital Library
  • Digital Preservation Repository
  • eScholarship repository

Partnerships and Collaborations

LC Pilot Project – Building Trust in a 3rd Party Repository
– Using test image collections/web crawls, ingest content to SDSC repository
– Allow access for content audit
– Track usage of content over time
– Deliver content back to LC at end of project

Library of Congress NDIIPP Chronopolis Program
– Build production-capable Chronopolis grid (50 TB x 3)
– Further define transmission packaging for archival communities
– Investigate best network transfer models for I2 and TeraGrid networks

California Digital Library (CDL) Mass Transit Program
– Enable UC System Libraries to transfer mass digitization collections at high speed across CENIC/I2
– Develop transmission packaging for CDL content

UCSD Libraries’ Digital Asset Management System
– RDF system with data managed in SRB at SDSC

SDSC DPI Group
Digital Preservation Initiatives Group
– Charged with developing and supporting digital preservation services within the Production Systems Division of SDSC.
– http://dpi.sdsc.edu
– Cross-Organizational Group
  • SDSC personnel/UCSD Libraries personnel
– Libraries
– Archives
– Technology
– Information Science

Cyberinfrastructure Trust

For example: we worked together to set up high-speed data replication services
• Achieved 200 Mb/s = 2 TB/day
• Highly reliable
• Checksums
• Internet2

Network setup involved …
• LC and SDSC staff working together
• Configurations on networks and computers
• Resolving different security environments
• Network monitoring

Lessons Learned
• Networking is hard!
• It’s not magic - there’s always a reason
• It highlights collaborative nature of work
• Can’t forget it once it’s set up

Trust Elements
• Have multi-institutional issues been solved?
• Does new infrastructure improve process?
• Has a long-term solution been found?
• Is solution useful for other organizations?

SDSC created a robust storage environment for this data
• Multiple replications …
• … at SDSC …
• … and geographically diverse locations

(a process with several characteristics)
• Needed to replicate structure exactly
• This had to be done for 5+ replications
• Complex environment had to be transparent
• Data had to be available for manipulation

The Storage Resource Broker provided replication services ... and extensive monitoring, logging and reporting functions (which led to many conversations)

Logging and monitoring procedures
• Scripts which compared the files within the system with a master list – checked changes on either side … fairly straightforward
But …
• What is the master list and who maintains it?
• Who decides what is a legitimate change?
• Do you want a dark archive or an active remote data center?

We tested a new Front-End … and explored an important issue
“Reliability” Versus “Accessibility”

Lessons Learned
• Always keep expectations aligned
• Duplication of structure is complicated
• Don’t confuse accessibility and reliability
• Communication highlights communication

Trust Elements
• Can remote data be accessed?
• Can remote data be verified?
• Can remote data be retrieved and re-used?
• Can ownership be clearly defined?

SDSC and LC explored a new approach to working with web archives
Parallel indexing and display system
• 50,000 ARC files
• 6 terabytes of data
• Looked “default” to the user
• Short processing time

Using default tools, our initial indexing rate was 1000 files per day…
This was over 6 weeks of constant computing to index the entire collection … more than our time budget.

We ran 18 parallel indexing instances – reduced processing to a week
We modified the Wayback source code to create a new access infrastructure

Lessons Learned
• Default setup isn’t always easiest
• Time is a wonderful motivator
• Sometimes you need to start over
• Experts are often interested in your work

Trust Elements
• Are the final results the same?
• Can the results be reached in a better way?
• Can a new organization bring new expertise?
• Can a new organization work with your partners?

Next steps … Chronopolis!

Chronopolis: A Partnership
Chronopolis is being developed by a national consortium led by SDSC and the UCSD Libraries.
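The pilot figures quoted above (200 Mb/s against 2 TB/day, and 50,000 ARC files at 1000 files per day) can be sanity-checked with a little arithmetic. The sketch below is illustrative only. It assumes decimal units (1 TB = 10^12 bytes), a link that sustains its rate around the clock, and perfectly linear scaling across indexing instances, none of which the slides state. The function names are hypothetical.

```python
# Back-of-the-envelope checks on the pilot numbers (illustrative sketch).
# Assumptions not stated in the slides: decimal units, a fully sustained
# transfer rate, and ideal linear speed-up across indexing instances.

SECONDS_PER_DAY = 24 * 60 * 60


def terabytes_per_day(megabits_per_second: float) -> float:
    """Convert a sustained rate in Mb/s to TB/day (1 TB = 10**12 bytes)."""
    bytes_per_second = megabits_per_second * 1_000_000 / 8
    return bytes_per_second * SECONDS_PER_DAY / 10**12


def indexing_days(total_files: int, files_per_day: float, instances: int = 1) -> float:
    """Days needed to index a collection, assuming ideal parallel scaling."""
    return total_files / (files_per_day * instances)


if __name__ == "__main__":
    print(f"200 Mb/s sustained: {terabytes_per_day(200):.2f} TB/day")
    print(f"1 instance:   {indexing_days(50_000, 1_000):.0f} days")
    print(f"18 instances: {indexing_days(50_000, 1_000, 18):.1f} days (ideal)")
```

Under these assumptions, 200 Mb/s comes to about 2.16 TB/day, in line with the 2 TB/day on the slide, and a single indexer needs 50 days, a little over seven weeks. Eighteen ideal instances would need just under three days; the slides report about a week, which suggests real-world overheads beyond this idealized model.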
Initial Chronopolis provider sites include:
• SDSC and UCSD Libraries at UC San Diego
• University of Maryland
• National Center for Atmospheric Research (NCAR) in Boulder, CO

Institutions and Roles - UCSD
SDSC
– Storage and networking services
– SRB support
– Transmission Packaging Modules
UCSD Libraries
– Metadata services (PREMIS)
– DIPs (Dissemination Information Packages)
– Other advanced data services as needed

Institutions and Roles - NCAR
National Center for Atmospheric Research
– Archives: complete copy of all data
– Storage and network support
– Network testing

Institutions and Roles - UMIACS
University of Maryland – Institute for Advanced Computer Studies
– Archives: complete copy of all data
– Advanced data services
  • PAWN: Producer – Archive Workflow Network in Support of Digital Preservation
  • ACE: Auditing Control Environment to Ensure the Long Term Integrity of Digital Archives
– Other advanced data services as needed

SDSC Chronopolis Program

Chronopolis Vocabulary
Partners – UCSD Libraries, National Center for Atmospheric Research, and University of Maryland Institute for Advanced Computer Studies all provide grid-enabled storage nodes for Chronopolis services.
Clients – ICPSR and CDL contribute content to the Chronopolis preservation network.
SRB – Storage Resource Broker – datagrid software.
iRODS – integrated Rule Oriented Data System – datagrid software.
ACE – Auditing Control Environment – part of the ADAPT project at UMD.
PAWN – Producer Archive Workflow Network – part of the ADAPT project at UMD.
INCA – user-level grid monitoring – executes periodic, automated, user-level testing of Grid software and services – grid middleware.
BagIt – transfer specification developed by CDL and the Library of Congress.
GridFTP – parallel transfer technology – moves large collections within a grid wide-area network.
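Several of the terms above (auditing in ACE, manifests in Bagit) and the pilot's master-list comparison scripts rest on the same basic operation: recompute a checksum for every file and compare it with a recorded value. The sketch below illustrates that idea only; it is not the ACE, SRB, or Bagit implementation. The manifest layout (one `checksum path` pair per line), the choice of SHA-256, and all function names are assumptions made for illustration.

```python
from __future__ import annotations

import hashlib
from pathlib import Path


def file_checksum(path: Path, algorithm: str = "sha256") -> str:
    """Hash a file in 1 MiB chunks so large files never sit fully in memory."""
    digest = hashlib.new(algorithm)
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def read_manifest(manifest_path: Path) -> dict[str, str]:
    """Parse lines of the assumed form '<checksum> <relative/path>'."""
    entries: dict[str, str] = {}
    for line in manifest_path.read_text(encoding="utf-8").splitlines():
        if line.strip():
            checksum, relative_path = line.split(maxsplit=1)
            entries[relative_path] = checksum
    return entries


def audit(root: Path, manifest: dict[str, str]) -> dict[str, list[str]]:
    """Compare files under root with the manifest and classify each path."""
    report: dict[str, list[str]] = {
        "ok": [], "changed": [], "missing": [], "unlisted": []
    }
    for relative_path, expected in sorted(manifest.items()):
        target = root / relative_path
        if not target.is_file():
            report["missing"].append(relative_path)
        elif file_checksum(target) == expected:
            report["ok"].append(relative_path)
        else:
            report["changed"].append(relative_path)
    # Files present on disk but absent from the master list.
    on_disk = {p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()}
    report["unlisted"] = sorted(on_disk - set(manifest))
    return report
```

Running `audit` on each replica against the same manifest gives a per-site report. The `changed` and `unlisted` lists are where the policy questions raised earlier come in: code can detect a difference, but only the partners can decide whether it is a legitimate change and who maintains the master list.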
Chronopolis: Inside
• Chron clients (CDL, ICPSR) are linked by the main staging grid, where data is verified for integrity and quarantined for security purposes.
• Collections are independently pulled into each system.
• Manifest layer provides added security for database management and data integrity validation.
Benefits
– 3 independently managed copies of the collection
– High availability
– High reliability
[Diagram labels: Chron Clients (CDL, ICPSR); Push; SDSC Staging Grid; Manifest Management; MCAT DB; Multiple Hash Verifications; Grid Brick Disks; Pull (x3); SDSC Core Center Archive; HPSS Tape; NCAR; UMD; Copy 1, Copy 2, Copy 3; MCAT]

SDSC Leveraged Infrastructure Serves Both HPC & Digital Preservation
• Archive
  – 25 PB capacity
  – Both HPSS & SAM-QFS
• Online disk
  – ~3 PB total
  – HPC parallel file systems
  – Collections
  – Databases
  – Access tools
Adapted from Richard Moore (SDSC)

Chronopolis Demonstration Project
Demonstration Project 2006-2007
– Demonstration collections ingested within Chronopolis
  • National Virtual Observatory (NVO) – 3 TB Hyperatlas Images (partial collection)
  • Library of Congress PG Image Collection – 600 GB Prokudin-Gorskii Image Collection
  • Inter-university Consortium for Political and Social Research (ICPSR) – 2 TB Web Accessible Data
  • NCAR Observational Data – 3 TB Observational Re-Analysis Data

NDIIPP Chronopolis Project
• Creating a 3-node federated data grid at SDSC, NCAR and UMD – up to 50 TB data from CDL and ICPSR
• Installing and testing a suite of monitoring tools using ACE, PAWN, INCA
• Creating appropriate Transmission Information Packages
• Generating PREMIS definitions for data
• Writing Best Practices documents for clients and partners

Chronopolis Grid Framework
[Diagram labels: Chronopolis Data 12-25 TB; Chronopolis Data 12 TB; CDL Server; ICPSR Server; UC Berkeley Network; ICPSR Network; SDSC Network; NCAR Network; UMD (Maryland) Network; SRB MCAT; SRB D-Broker; Sun 6140 62 TB; Sun SAM-QFS; Tape Silos; Apple Xsan]
Adapted from Bryan Banister (SDSC)

NDIIPP Chronopolis Clients-CDL
California Digital Library
– A part of UCOP, supports the University of California libraries
– Providing up to 25 TB of data: Web-At-Risk project
  • Five years of political and governmental websites
  • ARC files created from web crawls
  • Using BagIt transfer structure

Diagram of CDL Data Transfer
[Diagram labels: CDL Virtual Machine at UCB; Wget; Bagit; Wget files 1-10, 11-20; Parallel Wget Xfer; Bagit Manifest; Possible SRB/Bagit Module; SDSC Network; UMIACS Network; NCAR Network; File 1 … File n; Chron Staging; Chron Repository]
Adapted from Bryan Banister (SDSC)

NDIIPP Chronopolis Clients-ICPSR
Inter-university Consortium for Political and Social Research, University of Michigan
– Providing ~12 TB of data: wide variety of types
– Already working with SDSC using SRB

Diagram of ICPSR Transfer
[Diagram labels: ICPSR SRB Repository (UMich); Sput/Srsync files; Sput tar files; Parallel Sput/Srsync Xfer; Chron SRB MCAT; EMC SAN; SDSC Network; UMIACS Network; NCAR Network; File 1 … File n; Chron Staging; Chron Repository]
Adapted from Bryan Banister (SDSC)

Ongoing and Future Initiatives
• Migration of Chronopolis from SRB to iRODS
• Develop interoperability with community-based archival systems/standards
• TRAC compliance for SDSC Production Preservation Services/Chronopolis Consortium

Looking for Partnerships
• Repositories interested in moving large digital collections among heterogeneous repository systems.
• Fedora, DSpace or E-Prints sites interested in managed datagrid storage.
• Institutions interested in personnel swaps to conduct TRAC audit assessment compliance.
• Community needs for mass-scale data transmission and storage.
Chronopolis Credits
SDSC
– Fran Berman
– Richard Moore
– David Minor
– Chris Jordan
– Jim D’Aoust
– Robert McDonald
– Don Sutton
– Bryan Banister
– Phong Dinh
– Jay Dombrowski
– Emilio Valente
UCSD Libraries
– Brian Schottlaender
– Luc Declerck
– Ardys Kozbial
– Brad Westbrook
– Arwen Hutt
NCAR
– Don Middleton
– Michael Burek
– Linda McGinley
UMIACS
– Joseph JaJa
– Mike Smorul
– Mike McGann
Library of Congress
– Martha Anderson
– Lisa Hoppis
CACI
– Mike Ivey

http://chronopolis.sdsc.edu

Chronopolis is ...
• a geographically distributed preservation environment that supports long-term management and stewardship of digital collections
• implemented by developing and deploying a distributed data grid, and by supporting its human, policy, and technological infrastructure.
• technology forecasting and migration in support of long-term life-cycle management of the dedicated preservation environment.

Chronopolis focuses on ...
• Assessment of the needs of potential user communities and development of appropriate service models
• Development of Memoranda of Understanding (MOUs), Service Level Agreements (SLAs), etc. to formalize trust relationships and manage expectations
• Assessment and prototyping of best practices for bit preservation, authentication, metadata, etc.
• Development of cost and risk models for long-term preservation
• Development of appropriate success metrics to evaluate usefulness, reliability, and usability of infrastructure

The people of Chronopolis are ...

In conclusion …
Organizations need ways to validate trust in 3rd parties.
SDSC and the Library of Congress explored one way to do this … by working with Cyberinfrastructure … and demonstrating trust.
With a trusted relationship, many journeys become possible