Trusted Datagrids: Library of Congress Projects with UCSD

Ardys Kozbial – UCSD Libraries
David Minor - SDSC
Building Trust in a 3rd Party
Repository: A Pilot Project
David Minor
San Diego Supercomputer Center
How can the LC trust someone
they can’t control?
Moving forward in the right direction
requires more than fuzzy promises
… it takes a combination
of experts and tools.
Cyberinfrastructure
Cyberinfrastructure is the collection of ...
Resources
Computers, data storage, networks,
scientific instruments, experts, etc.
+ Glue
Integrating software,
systems, and organizations
“Effective cyberinfrastructure for the humanities
and social sciences will allow scholars to focus
their intellectual and scholarly energies on the
issues that engage them, and to be effective
users of new media and new technologies, rather
than having to invent them.”
- ACLS Commission on
Cyberinfrastructure for the
Humanities & Social Sciences
• “The mission of the San Diego Supercomputer
Center (SDSC) is to empower communities in
data-oriented research, education, and
practice through the innovation and provision
of Cyberinfrastructure”
SDSC ...
• Is one of the original NSF supercomputer centers
• Supports high performance computing systems
• Supports data applications for science,
engineering, social sciences, cultural heritage
institutions
• Has LARGE data capabilities
• 3+ PB Disk Storage
• 25+ PB Tape Storage
UCSD Libraries
• 3.5+ million volumes
• Digital Asset Management System (in
development)
• 250,000+ objects
• 15+ TB
• Shared collections with UC
• California Digital Library
• Digital Preservation Repository
• eScholarship repository
Partnerships and Collaborations
LC Pilot Project – Building Trust in a 3rd Party Repository
– Using test image collections/web crawls, ingest content to SDSC repository
– Allow access for content audit
– Track usage of content over time
– Deliver content back to LC at end of project
Library of Congress NDIIPP Chronopolis Program
– Build Production Capable Chronopolis Grid (50 TB x 3)
– Further define transmission packaging for archival communities
– Investigate best network transfer models for I2 and TeraGrid networks
California Digital Library (CDL) Mass Transit Program
– Enable UC System Libraries to transfer mass digitization collections at high speed across
CENIC/I2
– Develop transmission packaging for CDL content
UCSD Libraries’ Digital Asset Management System
– RDF System with data managed in SRB at SDSC
SDSC DPI Group
Digital Preservation Initiatives Group
– Charged with Developing and Supporting Digital
Preservation Services within the Production
Systems Division of SDSC.
– http://dpi.sdsc.edu
– Cross-Organizational Group
• SDSC Personnel/UCSD Libraries Personnel
– Libraries
– Archives
– Technology
– Information Science
Cyberinfrastructure
Trust
For Example:
We worked together to set up high-speed
data replication services
Internet2
Achieved 200 Mb/s = 2 TB/day
Highly reliable
Checksums
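The two figures on this slide can be cross-checked with a little arithmetic. The sketch below assumes decimal units (1 Mb = 10^6 bits, 1 TB = 10^12 bytes) and a link that sustains the full rate all day; neither assumption is stated in the slides, and the function name is made up for illustration.

```python
# Back-of-the-envelope check of the replication figures.
# Assumptions (not from the slides): decimal units and a link that
# sustains the full rate for 24 hours.
SECONDS_PER_DAY = 24 * 60 * 60


def terabytes_per_day(megabits_per_second: float) -> float:
    """Convert a sustained line rate in Mb/s to TB transferred per day."""
    bytes_per_second = megabits_per_second * 1_000_000 / 8
    return bytes_per_second * SECONDS_PER_DAY / 1e12


daily_tb = terabytes_per_day(200)
print(f"{daily_tb:.2f} TB/day")  # prints "2.16 TB/day"
```

Under those assumptions 200 Mb/s works out to a little over 2 TB/day, which matches the rounded figure quoted above.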
Network setup involved …
LC and SDSC staff working together
Configurations on networks and computers
Resolving different security environments
Network monitoring
Lessons Learned
Networking is hard!
It’s not magic - there’s always a reason
It highlights the collaborative nature of the work
Can’t forget it once it’s set up
Trust Elements
Have multi-institutional issues been solved?
Does new infrastructure improve process?
Has a long-term solution been found?
Is solution useful for other organizations?
SDSC created a robust storage
environment for this data
Multiple replications …
… at SDSC
… and geographically diverse locations
(a process with several characteristics)
Needed to replicate structure exactly
This had to be done for 5+ replications
Complex environment had to be transparent
Data had to be available for manipulation
The Storage Resource Broker provided
replication services ...
... and extensive monitoring,
logging and reporting functions
(which led to many conversations)
Logging and monitoring procedures
Scripts which compared the files within the
system with a master list
– checked changes on either side … fairly straightforward
What is the master list and who maintains it?
Who decides what is a legitimate change?
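A comparison of stored files against a master list could look roughly like the sketch below. It is not the project's actual script. It assumes the master list is a mapping from relative path to SHA-256 digest, and the names `sha256_of` and `compare_with_master` are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_with_master(root: Path, master: dict) -> dict:
    """Compare files under `root` with a master list of {relative path: digest}.

    The master-list format is an assumption made for this sketch.
    """
    actual = {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in root.rglob("*")
        if p.is_file()
    }
    return {
        "missing": sorted(set(master) - set(actual)),
        "unexpected": sorted(set(actual) - set(master)),
        "changed": sorted(
            name for name in set(master) & set(actual)
            if master[name] != actual[name]
        ),
    }
```

The report only says that a file differs from the master list. Whether the difference is a legitimate change is still the policy question raised above.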
But …
Do you want a dark archive or an active
remote data center?
We tested a new Front-End
… and explored an important issue
“Reliability” Versus “Accessibility”
Lessons Learned
Always keep expectations aligned
Duplication of structure is complicated
Don’t confuse accessibility and reliability
Communication highlights communication
Trust Elements
Can remote data be accessed?
Can remote data be verified?
Can remote data be retrieved and re-used?
Can ownership be clearly defined?
SDSC and LC explored a new
approach to working with web archives
Parallel indexing and display system
50,000 ARC files
6 Terabytes of data
Looked “default” to the user
Short processing time
Using default tools, our initial indexing
rate was 1000 files per day…
… over 6 weeks of constant computing
to index entire collection.
This was more than our time budget.
We ran 18 parallel indexing
instances – reduced processing to
a week
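The numbers above can be sanity-checked, and the sharding idea sketched, as follows. The constants come from the slides; everything else is an assumption. `index_arc_file` is a hypothetical stand-in and not the Wayback indexer, and threads are used only to keep the sketch portable, where the real work would more likely run as separate processes or machines.

```python
from concurrent.futures import ThreadPoolExecutor

TOTAL_FILES = 50_000    # ARC files in the collection (from the slides)
FILES_PER_DAY = 1_000   # initial single-instance indexing rate (from the slides)
INSTANCES = 18          # parallel indexing instances (from the slides)


def days_to_index(total_files, files_per_day, instances=1):
    """Ideal wall-clock days, assuming the rate scales linearly with instances."""
    return total_files / (files_per_day * instances)


def shard(items, count):
    """Split items round-robin into `count` roughly equal work lists."""
    return [items[i::count] for i in range(count)]


def index_arc_file(name):
    """Hypothetical placeholder for a real indexer; returns a fake index entry."""
    return f"indexed:{name}"


def index_shard(names):
    """Index every file in one shard, one after another."""
    return [index_arc_file(name) for name in names]


def index_in_parallel(names, instances=INSTANCES):
    """Index one shard per worker and merge the per-shard results."""
    with ThreadPoolExecutor(max_workers=instances) as pool:
        parts = list(pool.map(index_shard, shard(names, instances)))
    return [entry for part in parts for entry in part]


print(f"serial:   {days_to_index(TOTAL_FILES, FILES_PER_DAY):.0f} days")
print(f"parallel: {days_to_index(TOTAL_FILES, FILES_PER_DAY, INSTANCES):.1f} days")
```

The ideal linear-scaling estimate is a lower bound of about three days. The slides report roughly a week in practice.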
We modified the Wayback
source code to create a new
access infrastructure
Lessons Learned
Default setup isn’t always easiest
Time is a wonderful motivator
Sometimes you need to start over
Experts are often interested in your work
Trust Elements
Are the final results the same?
Can the results be reached in a better way?
Can a new organization bring new expertise?
Can a new organization work with your partners?
Next steps ….
Chronopolis!
Chronopolis: A Partnership
• Chronopolis is being developed by a
national consortium led by SDSC and
the UCSD Libraries.
• Initial Chronopolis provider sites
include:
– SDSC and UCSD Libraries
at UC San Diego
– University of Maryland
– National Center for Atmospheric
Research (NCAR) in Boulder, CO
Institutions and Roles - UCSD
SDSC
– Storage and networking services
– SRB support
– Transmission Packaging Modules
UCSD Libraries
– Metadata services (PREMIS)
– DIPs (Dissemination Information Packages)
– Other advanced data services as needed
Institutions and Roles - NCAR
National Center for Atmospheric Research
– Archives: Complete copy of all data
– Storage and network support
– Network testing
Institutions and Roles - UMIACS
University of Maryland – Institute for Advanced
Computer Studies
– Archives: Complete copy of all data
– Advanced data services
• PAWN: Producer – Archive Workflow Network
in Support of Digital Preservation
• ACE: Auditing Control Environment to Ensure
the Long Term Integrity of Digital Archives
– Other advanced data services as needed
SDSC Chronopolis Program
Chronopolis Vocabulary
Partners – UCSD Libraries, National Center for Atmospheric Research, University of
Maryland Institute for Advanced Computer Studies all provide grid enabled storage
nodes for Chronopolis services.
Clients – ICPSR, CDL– contribute content to the Chronopolis preservation network.
SRB – Storage Resource Broker – datagrid software.
iRODS – integrated Rule Oriented Data System – datagrid software.
ACE – Auditing Control Environment – part of the ADAPT project at UMD.
PAWN – Producer Archive Workflow Network – part of the ADAPT project at UMD.
INCA – user level grid monitoring - executes periodic, automated, user-level testing of
Grid software and services – grid middleware.
Bagit – Transfer specification developed by CDL and the Library of Congress.
GridFTP – parallel transfer technology - moves large collections within a grid wide-area network.
Chronopolis: Inside
Chron Clients: CDL, ICPSR
Linked by main staging grid where data is
verified for integrity, and quarantined for
security purposes.
Collections are independently pulled into
each system.
Manifest layer provides added security
for database management and data
integrity validation.
Benefits
– 3 independently managed copies of
the collection
– High availability
– High reliability
[Diagram: Chron clients push data to the SDSC Staging Grid (Manifest Management, MCAT DB, Multiple Hash Verifications). Copies 1-3 are pulled to the SDSC Core Center Archive (Copy 1), NCAR and UMD. Labeled storage: MCAT, HPSS Tape, Grid Brick Disks.]
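With three independently managed copies, one simple cross-site check is to compare per-site manifests and flag any file whose digests disagree or that is absent at some site. The sketch below is illustrative only. It is not how ACE or the MCAT performs verification, and the manifest shape and the name `audit_replicas` are assumptions.

```python
def audit_replicas(manifests):
    """Find files whose copies disagree across sites.

    `manifests` maps site name -> {relative path: digest}. This shape is an
    assumption for the sketch. A path missing at a site is reported as None.
    """
    all_paths = set()
    for manifest in manifests.values():
        all_paths.update(manifest)

    problems = {}
    for path in sorted(all_paths):
        seen = {site: manifest.get(path) for site, manifest in manifests.items()}
        if len(set(seen.values())) > 1:
            problems[path] = seen
    return problems


# Made-up digests for illustration only.
example = {
    "SDSC": {"img/0001.tif": "aa11", "img/0002.tif": "bb22"},
    "NCAR": {"img/0001.tif": "aa11", "img/0002.tif": "bb22"},
    "UMD": {"img/0001.tif": "aa11", "img/0002.tif": "ff99"},
}
for path, copies in audit_replicas(example).items():
    print(path, copies)
```

With three copies, a two-against-one disagreement also hints at which replica to repair. Acting on that hint is a policy decision rather than something the code settles.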
SDSC Leveraged Infrastructure
• Serves Both HPC & Digital Preservation
• Archive
– 25 PB capacity
– Both HPSS & SAM-QFS
• Online disk
– ~3PB total
– HPC parallel file systems
– Collections
– Databases
• Access Tools
Adapted from Richard Moore (SDSC)
Chronopolis Demonstration
Project
Demonstration Project 2006-2007
– Demonstration Collections Ingested
within Chronopolis
• National Virtual Observatory (NVO)
– 3 TB Hyperatlas Images (partial collection)
• Library of Congress PG Image Collection
– 600 GB Prokudin-Gorskii Image Collection
• Interuniversity Consortium for Political
and Social Research (ICPSR)
– 2TB Web Accessible Data
• NCAR Observational Data
– 3TB Observational Re-Analysis Data
NDIIPP Chronopolis Project
• Creating a 3-node federated data grid at SDSC, NCAR
and UMD – up to 50 TB data from CDL and ICPSR
• Installing and testing a suite of monitoring tools using
ACE, PAWN, INCA
• Creating Appropriate Transmission Information Packages
• Generating PREMIS definitions for data
• Writing Best Practices documents for clients and partners
Chronopolis Grid Framework
[Diagram: CDL servers on the UC Berkeley network (Chronopolis data, 12-25TB) and the ICPSR server on the ICPSR network (Chronopolis data, 12TB) connect to the SDSC, NCAR and Maryland (UMD) networks. Labeled components: SRB MCAT, SRB D-Broker, Sun 6140 (62TB), Sun SAM-QFS, Tape Silos, Apple Xsan.]
Adapted from Bryan Banister (SDSC)
NDIIPP Chronopolis Clients-CDL
California Digital Library
– A part of UCOP, supports the University
of California libraries
– Providing up to 25TB of data: Web-At-Risk
project
• Five years of political and governmental
websites
• ARC files created from web crawls
• Using Bagit Transfer Structure
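For readers unfamiliar with the BagIt layout, the sketch below writes a minimal bag: a `bagit.txt` declaration, a `data/` payload directory, and a checksum manifest. It follows the published BagIt specification in outline only. The version string, the SHA-256 algorithm and the helper name `write_minimal_bag` are assumptions, and the project would have used whatever draft of the specification was current at the time.

```python
import hashlib
from pathlib import Path


def write_minimal_bag(bag_dir: Path, payload: dict) -> None:
    """Write a minimal BagIt-style bag.

    `payload` maps flat file names to bytes. The version string and the
    choice of SHA-256 are assumptions made for this sketch.
    """
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True, exist_ok=True)

    manifest_lines = []
    for name, content in sorted(payload.items()):
        (data_dir / name).write_bytes(content)
        digest = hashlib.sha256(content).hexdigest()
        # Manifest line format: checksum, whitespace, path relative to the bag root.
        manifest_lines.append(f"{digest}  data/{name}")

    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    (bag_dir / "manifest-sha256.txt").write_text(
        "\n".join(manifest_lines) + "\n", encoding="utf-8"
    )
```

A receiver can recompute each payload checksum and compare it with the manifest before accepting the transfer.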
Diagram of CDL Data Transfer
[Diagram: Wget/Bagit content on a CDL virtual machine at UCB is fetched in batches (Wget files 1-10, 11-20) by a parallel Wget transfer over the SDSC network, with a Bagit manifest and a possible SRB/Bagit module. Files 1 to n pass through Chron Staging into the Chron Repository, with links to the UMIACS and NCAR networks.]
Adapted from Bryan Banister (SDSC)
NDIIPP Chronopolis Clients-ICPSR
Inter-University Consortium for Political and
Social Research, University of Michigan
– Providing ~12TB of data: Wide variety of types
– Already working with SDSC using SRB
Diagram of ICPSR Transfer
[Diagram: Sput/Srsync files (Sput tar files) from the ICPSR SRB repository at UMich move by parallel Sput/Srsync transfer over the SDSC network. Labeled components: Chron SRB MCAT, EMC SAN. Files 1 to n pass through Chron Staging into the Chron Repository, with links to the UMIACS and NCAR networks.]
Adapted from Bryan Banister (SDSC)
Ongoing and Future Initiatives
• Migration of Chronopolis from SRB to iRODS
• Develop Interoperability with Community
Based Archival Systems/Standards
• TRAC compliance for SDSC Production
Preservation Services/Chronopolis Consortium
Looking for Partnerships
• Repositories interested in moving large digital
collections among heterogeneous repository
systems.
• Fedora, DSpace or E-Prints sites interested in
managed datagrid storage.
• Institutions interested in personnel swaps to
conduct TRAC audit assessment compliance.
• Community Needs for Mass-Scale Data
Transmission and Storage.
Chronopolis Credits
SDSC
– Fran Berman
– Richard Moore
– David Minor
– Chris Jordan
– Jim D’Aoust
– Robert McDonald
– Don Sutton
– Bryan Banister
– Phong Dinh
– Jay Dombrowski
– Emilio Valente
UCSD Libraries
– Brian Schottlaender
– Luc Declerck
– Ardys Kozbial
– Brad Westbrook
– Arwen Hutt
NCAR
– Don Middleton
– Michael Burek
– Linda McGinley
UMIACS
– Joseph JaJa
– Mike Smorul
– Mike McGann
Library of Congress
– Martha Anderson
– Lisa Hoppis
CACI
– Mike Ivey
http://chronopolis.sdsc.edu
Chronopolis is ...
• a geographically distributed preservation environment that supports
long-term management and stewardship of digital collections
• implemented by developing and deploying a distributed data grid, and
by supporting its human, policy, and technological infrastructure.
• technology forecasting and migration in support of long-term life-cycle
management of the dedicated preservation environment.
Chronopolis focuses on ...
• Assessment of the needs of potential user communities and
development of appropriate service models
• Development of Memoranda of Understanding (MOUs), Service
Level Agreements (SLAs), etc. to formalize trust relationships and
manage expectations
• Assessment and prototyping of best practices for bit preservation,
authentication, metadata, etc.
• Development of cost and risk models for long-term preservation
• Development of appropriate success metrics to evaluate
usefulness, reliability, and usability of infrastructure
The people of Chronopolis are ...
In conclusion …
Organizations need ways to
validate trust in 3rd parties
SDSC and the Library of Congress
explored one way to do this …
by working with Cyberinfrastructure
… and demonstrating trust.
With a trusted relationship, many
journeys become possible