Storage Resource Broker
Data Grids and Data Management
Reagan W. Moore
moore@sdsc.edu
http://www.sdsc.edu/srb

Topics
• Data management evolution
  • Shared collections
  • Digital libraries
  • Persistent archives
• Building shared collections
  • Project level / national level / international
• Demonstration of shared collections
  • Access to collections at SDSC

Shared Collections
• Data grids support the creation of shared collections that may be distributed across multiple institutions, sites, and storage systems
• Digital libraries publish data and provide services for discovery and display
• Persistent archives preserve data, managing the migration to new technology

Storage Resource Broker Collections at SDSC (12/6/2005)
(The original slide tabulated, per collection, the number of users with ACLs, the number of files, and the amount of data stored; the totals came to approximately 93 million files. Only the collection names are reproduced here.)
• Data Grid
  • NSF/ITR - National Virtual Observatory
  • NSF - National Partnership for Advanced Computational Infrastructure
  • Static collections - Hayden planetarium
  • Pzone - public collections
  • NSF/NPACI - Biology and Environmental collections
  • NSF/NPACI - Joint Center for Structural Genomics
  • NSF - TeraGrid, ENZO cosmology simulations
  • NIH - Biomedical Informatics Research Network
• Digital Library
  • NSF/NPACI - Long Term Ecological Reserve
  • NSF/NPACI - Grid Portal
  • NIH - Alliance for Cell Signaling microarray data
  • NSF - National Science Digital Library, SIO Explorer collection
  • NSF/ITR - Southern California Earthquake Center
• Persistent Archive
  • NHPRC Persistent Archive Testbed (Kentucky, Ohio, Michigan, Minnesota)
  • UCSD Libraries archive
  • NARA - Research Prototype Persistent Archive
  • NSF - National Science Digital Library persistent archive

Generic Infrastructure
• Can a single system provide all of the features needed to implement each type of data management system, while supporting access across administrative domains and managing data stored in multiple types of storage systems?
• Answer: data grid technology

Shared Collections
• The purpose of the SRB data grid is to enable the creation of a collection that is shared between academic institutions
  • Register digital entities into the shared collection
  • Assign owner and access controls
  • Assign descriptive and provenance metadata
  • Manage state information
    • Audit trails, versions, replicas, backups, locks
    • Size, checksum, validation date, synchronization date, …
  • Manage interactions with storage systems
    • Unix file systems, Windows file systems, tape archives, …
  • Manage interactions with preferred access mechanisms
    • Web browser, Java, WSDL, C library, …

Federated Server Architecture
(Diagram: a read application presents a logical name or an attribute condition; peer-to-peer brokering routes the request between SRB servers and SRB agents; the MCAT provides logical-to-physical mapping, identification of replicas, and access & audit control; data are then accessed in parallel from storage resources R1 and R2 by spawned servers. A sketch of this name resolution appears after the NARA slide below.)

Generic Infrastructure
• Digital libraries now build upon data grids to manage distributed collections
  • DSpace digital library - MIT and Hewlett-Packard
  • Fedora digital library - Cornell University and University of Virginia
• Persistent archives build upon data grids to manage technology evolution
  • NARA research prototype persistent archive
  • California Digital Library - Digital Preservation Repository
  • NSF National Science Digital Library persistent archive

National Science Digital Library
• URLs for educational material for all grade levels are registered into a repository at Cornell
• SDSC crawls the URLs, registers the web pages into an SRB data grid, and builds a persistent archive
  • 750,000 URLs
  • 13 million web pages
  • About 3 TBs of data

Southern California Earthquake Center
• Intuitive user interface
  • Pull-down query menus
  • Graphical selection of source model
  • Clickable LA Basin map (Olsen)
  • Seismogram/history extraction (Olsen)
• Access to the SCEC digital library
  • Data stored in a data grid
  • Annotated by modelers
  • Standard naming convention
  • Automated extraction of selected data and metadata
  • Management of visualizations

SCEC Digital Library
(Diagram: select a receiver (lat/lon) and a scenario; the outputs drawn from the SCEC community library include the fault model, source model, time history, and seismograms.)

TeraShake Data Handling
• Simulate a 7.7 magnitude earthquake on the San Andreas fault
• 50 Terabytes in a simulation
• Move 10 Terabytes per day
• Post-processing of the wave field
  • Movies of seismic wave propagation
  • Seismogram formatting for interactive on-line analysis
  • Velocity magnitude
  • Displacement vector field
  • Cumulative peak maps
  • Statistics used in visualizations
• Register derived data products into the SCEC digital library

ROADNet Sensor Network Data Integration
(Frank Vernon - UCSD/SIO. Diagram: integration of real-time sensor streams - seismic, geophysics, humidity, wind speed, climate, ecological, wireless, oceanography - with annotations marking fire-start and rain-start events.)

NARA Persistent Archive
• Federation of three independent data grids
• Demonstrate a preservation environment
  • Authenticity
  • Integrity
  • Management of technology evolution
  • Mitigation of risk of data loss
    • Replication of data
    • Federation of catalogs
    • Management of preservation metadata
  • Scalability
    • Types of data collections
    • Size of data collections
• NARA MCAT - original data at NARA; data replicated to U Md and SDSC
• U Md MCAT - replicated copy at U Md for improved access, load balancing, and disaster recovery
• SDSC MCAT - active archive at SDSC, user access
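To make the federated-server flow and the catalog federation above more concrete, here is a minimal Python sketch of the lookups an MCAT-style catalog performs: resolving a logical name to its physical replicas, checking access controls, and choosing a replica to read. All names here (LogicalCatalog, register, resolve, the resource labels) are illustrative assumptions for this document, not the SRB or MCAT API.

```python
# Hypothetical sketch of MCAT-style logical-to-physical resolution.
# Not the SRB API; names and structures are illustrative only.

from dataclasses import dataclass, field

@dataclass
class Replica:
    resource: str        # logical resource name, e.g. "sdsc-sam-qfs"
    physical_path: str   # path understood by that storage system
    checksum: str = ""   # state information kept by the catalog

@dataclass
class CatalogEntry:
    owner: str
    acl: dict = field(default_factory=dict)      # user -> "read" / "write"
    metadata: dict = field(default_factory=dict) # descriptive / provenance
    replicas: list = field(default_factory=list)

class LogicalCatalog:
    """Toy metadata catalog: logical names decoupled from storage locations."""
    def __init__(self):
        self.entries = {}

    def register(self, logical_name, owner, resource, physical_path):
        entry = self.entries.setdefault(logical_name, CatalogEntry(owner=owner))
        entry.replicas.append(Replica(resource, physical_path))
        return entry

    def resolve(self, logical_name, user, preferred_resource=None):
        """Map a logical name to one readable physical replica."""
        entry = self.entries[logical_name]
        if user != entry.owner and entry.acl.get(user) not in ("read", "write"):
            raise PermissionError(f"{user} may not read {logical_name}")
        for replica in entry.replicas:
            if preferred_resource in (None, replica.resource):
                return replica
        return entry.replicas[0]   # fall back to any registered replica

# Usage: register two replicas of one logical file, grant access, resolve a read.
catalog = LogicalCatalog()
catalog.register("/home/collection/quake.dat", "moore", "sdsc-sam-qfs", "/sam1/q.dat")
catalog.register("/home/collection/quake.dat", "moore", "umd-unix", "/data/q.dat")
catalog.entries["/home/collection/quake.dat"].acl["schroeder"] = "read"
print(catalog.resolve("/home/collection/quake.dat", "schroeder", "umd-unix").physical_path)
```

The point of the sketch is the separation of concerns shown in the architecture diagram: applications only ever present logical names and access conditions, while the catalog decides which physical copy, on which resource, actually satisfies the request.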
Worldwide University Network Data Grid
• SDSC
• Manchester
• Southampton
• White Rose
• NCSA
• U. Bergen
• A functioning, general-purpose international data grid for academic collaborations
(Map: Manchester-SDSC mirror)

WUNGrid Collections
• BioSimGrid
  • Molecular structure collaborations
• White Rose Grid
  • Distributed Aircraft Maintenance Environment
• Medieval Studies
• Music Grid
• e-Print collections
  • DSpace
• Astronomy

BioSimGrid
• Kaihsu Tai, Stuart Murdock, Bing Wu, Muan Hong Ng, Steven Johnston, Hans Fangohr, Simon J. Cox, Paul Jeffreys, Jonathan W. Essex, Mark S. P. Sansom (2004) BioSimGrid: towards a worldwide repository for biomolecular simulations. Org. Biomol. Chem. 2:3219–3221. DOI: 10.1039/b411352g
• University of Oxford
  • Mark Sansom, Biochemistry
  • Paul Jeffreys, e-Science
  • Kaihsu Tai, Biochemistry
  • Bing Wu, Biochemistry / e-Science
• University of Southampton
  • Jonathan Essex, Chemistry
  • Simon Cox, e-Science
  • Stuart Murdock, Chemistry / e-Science
  • Muan Hong Ng, e-Science
  • Hans Fangohr, e-Science
  • Steven Johnston, e-Science
• Elsewhere
  • David Moss, Birkbeck, London
  • Adrian Mulholland, Bristol
  • Charles Laughton, Nottingham
  • Leo Caves, York

KEK Data Grid
• Japan
• Taiwan
• South Korea
• Australia
• Poland
• US
• A functioning, general-purpose international data grid for high-energy physics

BaBar High-Energy Physics
• Stanford Linear Accelerator
• Lyon, France
• Rome, Italy
• San Diego
• RAL, UK
• A functioning international data grid for high-energy physics
• Moved over 100 TBs of data

Astronomy Data Grid
• Chile
• Tucson, Arizona
• NCSA, Illinois
• A functioning international data grid for astronomy
• Moved over 400,000 images

International Institutions (2005)
Project: Institution
• Data management project: British Antarctic Survey, UK
• eMinerals: Cambridge e-Science Center, UK
• Sickkids Hospital in Toronto: Canada
• Welsh e-Science Centre: Cardiff University, UK
• Australian Partnership for Advanced Computing Data Grid: Victoria, Australia
• Trinity College High Performance Computing (HPC-Europa): Trinity College, Ireland
• Natural Environment Research Council: United Kingdom
• Center for Advanced Studies, Research, and Development: Italy
• LIACS (Leiden Institute of Computer Science): Leiden University, The Netherlands
• Physics Labs: University of Bristol, UK
• Monash E-Research Grid: Monash University, Australia
• Computational Modelling: University of Queensland, Australia
• School of Computing: University of Leeds, UK
• Dept. of Computer Science: University of Liverpool, UK
• Belfast e-Science Centre: Queen's University, UK
• Worldwide Universities Network: University of Manchester, UK
• Large Hadron Collider Computing Grid: University of Oxford, UK
• White Rose Grid: University of Sheffield, UK
• Protein structure prediction: Taiwan University, Taiwan
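Across mirrored collections like those above, users typically locate data by attribute conditions on catalog metadata rather than by site or physical path, which is the "logical name or attribute condition" access path shown in the federated server architecture. The sketch below is a hypothetical illustration of such an attribute query; the attribute names and the query function are assumptions, not the MCAT query interface.

```python
# Hypothetical attribute-condition query over descriptive catalog metadata.
# Illustrative only; the actual SRB/MCAT query language is not shown here.

collection = {
    "/scec/run42/seismogram_001": {"project": "SCEC", "magnitude": 7.7, "format": "SAC"},
    "/scec/run42/velocity_map":   {"project": "SCEC", "magnitude": 7.7, "format": "netCDF"},
    "/nvo/survey/image_993":      {"project": "NVO",  "band": "r",      "format": "FITS"},
}

def query(metadata_index, **conditions):
    """Return logical names whose descriptive metadata match all conditions."""
    return [name for name, attrs in metadata_index.items()
            if all(attrs.get(key) == value for key, value in conditions.items())]

# Find SCEC products from the magnitude-7.7 scenario, independent of storage location.
print(query(collection, project="SCEC", magnitude=7.7))
```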
Storage Resource Broker 3.3.1
(Architecture diagram, top to bottom:)
• Application - C library, Java; Unix shell; Linux I/O; NT browser; Kepler actors; C++ DLL / Python, Perl, Windows; DSpace, http, OpenDAP, Portlet, GridFTP, WSDL, Fedora, OAI-PMH
• Federation Management
• Consistency & Metadata Management / Authorization, Authentication, Audit
• Logical Name Space; Latency Management; Data Transport; Metadata Transport
• Database Abstraction - Databases: DB2, Oracle, Sybase, Postgres, mySQL, Informix
• Storage Repository Abstraction - Archives: Tape, Sam-QFS, DMF, HPSS, ADSM, UniTree, ADS; File systems: Unix, NT, Mac OSX; ORB; Databases: DB2, Oracle, Sybase, Postgres, mySQL, Informix

SRB Objectives
• Automate all aspects of data discovery, access, management, analysis, and preservation
• Security paramount
• Distributed data
• Provide distributed data support for
  • Data sharing - data grids
  • Data publication - digital libraries
  • Data preservation - persistent archives
  • Data collections - real-time sensor data

SRB Developers
• Reagan Moore - PI
• Michael Wan - SRB Architect
• Arcot Rajasekar - SRB Manager
• Wayne Schroeder - SRB Productization
• Charlie Cowart - inQ
• Lucas Gilbert - Jargon
• Bing Zhu - Perl, Python, Windows
• Antoine de Torcy - mySRB web browser
• Sheau-Yen Chen - SRB Administration
• George Kremenek - SRB Collections
• Arun Jagatheesan - Matrix workflow
• Marcio Faerman - SCEC application
• Sifang Lu - ROADnet application
• Richard Marciano - SALT persistent archives
• Contributors from UK e-Science, Academia Sinica, Ohio State University, Aerospace Corporation, …
• 75 FTE-years of support
• About 300,000 lines of C

History
• 1995 - DARPA Massive Data Analysis Systems
• 1997 - DARPA/USPTO Distributed Object Computation Testbed
• 1998 - NSF National Partnership for Advanced Computational Infrastructure
• 1998 - DOE Accelerated Strategic Computing Initiative data grid
• 1999 - NARA persistent archive
• 2000 - NASA Information Power Grid
• 2001 - NLM Digital Embryo digital library
• 2001 - DOE Particle Physics data grid
• 2001 - NSF Grid Physics Network data grid
• 2001 - NSF National Virtual Observatory data grid
• 2002 - NSF National Science Digital Library persistent archive
• 2003 - NSF Southern California Earthquake Center digital library
• 2003 - NIH Biomedical Informatics Research Network data grid
• 2003 - NSF Real-time Observatories, Applications, and Data management Network
• 2004 - NSF ITR, constraint-based data systems
• 2005 - LC Digital Preservation Lifecycle Management
• 2005 - LC National Digital Information Infrastructure and Preservation program

Development
• SRB 1.1.8 - December 15, 2000
  • Basic distributed data management system
  • Metadata Catalog
• SRB 2.0 - February 18, 2003
  • Parallel I/O support
  • Bulk operations
• SRB 3.0 - August 30, 2003
  • Federation of data grids
• SRB 3.4 - October 31, 2005
  • Feature requests (extensible schema)

Separation of Access Method from Storage Protocols
(Diagram: Access Method -> Access Operations -> Data Grid -> Storage Operations -> Storage Protocol -> Storage System)
• Map from the operations used by the access method to a standard set of operations used to interact with the storage system (a sketch of such a mapping follows the Data Grid Operations slide below)

Data Grid Operations
• File access
  • Open, close, read, write, seek, stat, synch, …
  • Audit, versions, pinning, checksums, synchronize, …
  • Parallel I/O and firewall interactions
  • Versions, backups, replicas
• Latency management
  • Bulk operations
    • Register, load, unload, delete, …
  • Remote procedures
    • HDFv5, data filtering, file parsing, replicate, aggregate
• Metadata management
  • SQL generation, schema extension, XML import and export, browsing, queries
• GGF, "Operations for Access, Management, and Transport at Remote Sites"
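As a rough illustration of the driver-level separation described above, where a standard set of storage operations is mapped onto each storage protocol, here is a hedged Python sketch. The interface and class names are hypothetical; the actual SRB storage drivers are C code and expose a much larger operation set (open, seek, stat, parallel I/O, and so on).

```python
# Hypothetical sketch of a storage-repository abstraction layer.
# The real SRB drivers are written in C; names here are illustrative only.

import os
from abc import ABC, abstractmethod

class StorageDriver(ABC):
    """Standard set of operations the data grid uses for every storage system."""
    @abstractmethod
    def create(self, path, data: bytes): ...
    @abstractmethod
    def read(self, path) -> bytes: ...
    @abstractmethod
    def delete(self, path): ...

class UnixFileSystemDriver(StorageDriver):
    """Maps the standard operations onto POSIX file calls."""
    def create(self, path, data):
        with open(path, "wb") as f:
            f.write(data)
    def read(self, path):
        with open(path, "rb") as f:
            return f.read()
    def delete(self, path):
        os.remove(path)

class InMemoryArchiveDriver(StorageDriver):
    """Stand-in for a tape archive; stores objects in a dictionary."""
    def __init__(self):
        self.objects = {}
    def create(self, path, data):
        self.objects[path] = data
    def read(self, path):
        return self.objects[path]
    def delete(self, path):
        del self.objects[path]

def copy_between_resources(src: StorageDriver, src_path, dst: StorageDriver, dst_path):
    """The grid moves data by composing the same operations on any two drivers."""
    dst.create(dst_path, src.read(src_path))

# Usage: stage an object from the "archive" onto a Unix file system.
archive = InMemoryArchiveDriver()
archive.create("/archive/quake.dat", b"wavefield bytes")
copy_between_resources(archive, "/archive/quake.dat", UnixFileSystemDriver(), "quake.dat")
```

The design choice the sketch mirrors is the one listed under "Examples of Extensibility" below: adding support for a new storage system (HPSS, GridFTP, an object ring buffer) means writing one new driver, while every access method above the abstraction layer keeps working unchanged.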
Examples of Extensibility
• Storage repository driver evolution
  • Initially supported the Unix file system
  • Added archival access - UniTree, HPSS
  • Added FTP/HTTP
  • Added database blob access
  • Added database table interface
  • Added Windows file system
  • Added project archives - Dcache, Castor, ADS
  • Added Object Ring Buffer, Datascope
  • Added GridFTP version 3.3
• Database management evolution
  • Postgres
  • DB2
  • Oracle
  • Informix
  • Sybase
  • mySQL (most difficult port - no locks, no views, limited SQL)

Examples of Extensibility
• The three fundamental APIs are the C library, shell commands, and Java
• Other access mechanisms are ported on top of these interfaces
• API evolution
  • Initial access through the C library and Unix shell commands
  • Added inQ Windows browser (C++ library)
  • Added mySRB web browser (C library and shell commands)
  • Added Java (Jargon)
  • Added Perl/Python load libraries (shell commands)
  • Added WSDL (Java)
  • Added OAI-PMH, OpenDAP, DSpace digital library (Java)
  • Added Kepler actors for dataflow access (Java)
  • Added GridFTP version 3.3 (C library)
  • Added Fedora

Logical Name Spaces
Data access methods (C library, Unix, web browser) interacting directly with a storage repository use:
• Storage location
• User name
• File name
• File context (creation date, …)
• Access constraints
Data access is directly between the application and the storage repository, using the names required by the local repository.

Logical Name Spaces
With a data grid between the data access methods (C library, Unix, web browser) and the storage repository, each repository name is mapped to a logical equivalent:
• Storage location -> logical resource name space
• User name -> logical user name space
• File name -> logical file name space
• File context (creation date, …) -> logical context (metadata)
• Access constraints -> control/consistency constraints
Data is organized as a shared collection.

Federation Between Data Grids
Data access methods (web browser, DSpace, OAI-PMH) span Data Collection A and Data Collection B, each managed by its own data grid; each grid maintains its own:
• Logical resource name space
• Logical user name space
• Logical file name space
• Logical context (metadata)
• Control/consistency constraints
Access controls and consistency constraints govern cross-registration of digital entities.

Types of Risk
• Media failure
  • Replicate data onto multiple media
• Vendor-specific systemic errors
  • Replicate data onto multiple vendor products
• Operational error
  • Replicate data onto a second administrative domain
• Natural disaster
  • Replicate data to a geographically remote site
• Malicious user
  • Replicate data to a deep archive

How Many Replicas?
• Three sites minimize risk
• Primary site
  • Supports interactive user access to data
• Secondary site
  • Supports interactive user access when the first site is down
  • Provides a second media copy, located at a remote site, using a different vendor product and independent administrative procedures
• Deep archive
  • Provides a third media copy and a staging environment for data ingestion; no user access
(A sketch of this three-site replication policy appears at the end of this document.)

Deep Archive
(Diagram: the deep archive sits behind a firewall with a staging zone; server-initiated pulls move data inward, and remote zones have no access to the deep archive. Data registered in zone Z1 is registered into zone Z2 as Z2:D2:U2, and from Z2 into the deep-archive zone Z3 as Z3:D3:U3.)

For More Information
Reagan W. Moore
San Diego Supercomputer Center
moore@sdsc.edu
http://www.sdsc.edu/srb/
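Closing sketch, as promised under "How Many Replicas?": a minimal, hypothetical Python model of the three-site replication policy (primary, secondary, deep archive) and of the rule that only user-facing sites serve reads. The site names, policy functions, and data layout are illustrative assumptions for this document, not SRB functionality.

```python
# Hypothetical model of the three-site replication policy described above.
# Sites, roles, and function names are illustrative, not SRB features.

SITES = [
    {"name": "primary",      "vendor": "A", "user_access": True},   # interactive access
    {"name": "secondary",    "vendor": "B", "user_access": True},   # remote site, different vendor
    {"name": "deep_archive", "vendor": "C", "user_access": False},  # third media copy, no user access
]

def replicate(logical_name, data, sites=SITES):
    """Place one copy of a digital entity at each of the three sites."""
    replicas = {}
    for site in sites:
        # A real grid would invoke the storage driver for each site here.
        replicas[site["name"]] = {"data": data, "vendor": site["vendor"]}
    return replicas

def readable_sites(sites=SITES):
    """Only sites that allow interactive user access serve read requests."""
    return [site["name"] for site in sites if site["user_access"]]

def read(logical_name, replicas, sites=SITES):
    """Serve a read from the primary site, falling back to the secondary."""
    for name in readable_sites(sites):
        if name in replicas:
            return replicas[name]["data"]
    raise RuntimeError("no user-accessible replica available")

# Usage: three media copies on three vendor products; reads never touch the deep archive.
copies = replicate("/collection/record-1912.xml", b"<record/>")
print(readable_sites())                              # ['primary', 'secondary']
print(read("/collection/record-1912.xml", copies))   # served from the primary copy
```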