Data Grids

Reagan W. Moore
San Diego Supercomputer Center
9500 Gilman Drive, La Jolla, CA 92093-0505
Phone: 858 534-5073
FAX: 858 534-5152
E-mail: moore@sdsc.edu
http://www.npaci.edu/DICE/

National Partnership for Advanced Computational Infrastructure
San Diego Supercomputer Center

Topics
• Data Grid Requirements
  – Data management
  – Automation
  – Latency hiding
• Current technology
  – Distributed collections / digital libraries / data grids
• State of the art systems
  – Virtual data grids / persistent archives
  – Emerging Standards

Data Management Environments
• Code development
  – Collaboration, check-out, versioning
• Run-time execution
  – High performance access, locking, latency hiding, automation, archival storage
• Publication
  – Discovery, consistency, persistent archives
• Are the capabilities required by all three environments compatible?

Data Requirements are Met by Collection Technology
• Provide three levels of abstraction for data, information, and knowledge management (bits, tagged attributes, relationships)
• Automate access through use of information discovery on logical collections that span storage systems
• Manage latency by streaming, caching, replication, aggregation, remote proxies, staging
• Provide a persistent environment by building a consistent environment over evolving technology

Current Technology
• Logical data collections
  – Storage Resource Broker / Metadata Catalog
• Abstract data management by building a data handling system that interoperates with storage systems (file systems, archives, databases)
• Abstract information management by building information catalog management that interoperates with information repositories (databases)
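The idea of a logical collection that spans storage systems, with attribute-based discovery and replication used to manage latency, can be illustrated with a small sketch. This is not the SRB or MCAT interface; the class names (`Replica`, `LogicalObject`, `LogicalCollection`), the example paths, and the latency figures are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """One physical copy of a data object (hypothetical record layout)."""
    storage_system: str   # e.g. "archive", "unix-fs", "database"
    physical_path: str    # made-up path, for illustration only
    latency_ms: float     # assumed access cost, not a measured value


@dataclass
class LogicalObject:
    """A logical name plus the attributes and replicas registered for it."""
    logical_name: str
    attributes: dict = field(default_factory=dict)
    replicas: list = field(default_factory=list)


class LogicalCollection:
    """Toy catalog: logical names are independent of where the bits live."""

    def __init__(self):
        self._objects = {}

    def register(self, logical_name, replica, **attributes):
        """Add a replica (and optional attributes) under a logical name."""
        obj = self._objects.setdefault(logical_name, LogicalObject(logical_name))
        obj.attributes.update(attributes)
        obj.replicas.append(replica)

    def discover(self, **query):
        """Attribute-based discovery: names whose attributes match every query term."""
        return sorted(
            name for name, obj in self._objects.items()
            if all(obj.attributes.get(k) == v for k, v in query.items())
        )

    def resolve(self, logical_name):
        """Choose the replica with the lowest assumed latency."""
        return min(self._objects[logical_name].replicas, key=lambda r: r.latency_ms)


catalog = LogicalCollection()
catalog.register("/sky/tile-001", Replica("archive", "/hpss/tiles/001.fits", 5000.0),
                 survey="2MASS", band="J")
catalog.register("/sky/tile-001", Replica("unix-fs", "/cache/tiles/001.fits", 20.0))
catalog.register("/sky/tile-002", Replica("archive", "/hpss/tiles/002.fits", 5000.0),
                 survey="2MASS", band="K")

matches = catalog.discover(survey="2MASS", band="J")
best = catalog.resolve(matches[0])
print(matches, best.storage_system, best.physical_path)
```

In this sketch the application asks for data by attribute and logical name only; which storage system answers the request is decided inside the catalog, which is the separation the slide describes.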
SDSC Storage Resource Broker & Meta-data Catalog
[Figure: SRB / MCAT architecture diagram. Recoverable labels:
  Application interfaces: C, C++, Linux I/O, Unix Shell, Java, NT, Prolog Predicate, Web Browsers
  Core: SRB, MCAT
  MCAT meta-data: Dublin Core; Resource, User; User Defined; Application Meta-data
  Archives: HPSS, ADSM, UniTree, DMF; HRM
  File Systems: Unix, NT, Mac OSX
  Databases: DB2, Oracle, Postgres
  Other: Third-party copy; Remote Proxies (DataCutter)]

Information Management Projects
• Digital Libraries
  – NSF Digital Library Initiative, Phase II - UCSB, Stanford
  – NLM Digital Embryo digital library - GMU
  – NPACI Digital Sky - Caltech 2MASS sky survey
  – California Digital Library - AMICO
  – NSF National SMETE Digital Library - UCAR / DLESE
• Grid Environments
  – NASA Information Power Grid - NASA Ames
  – DOE Data Visualization Corridor - LLNL
  – DOE Particle Physics Data Grid - Babar
  – NSF Grid Physics Network - U Fl
• Persistent Archives
  – NARA Persistent Archive
  – NHPRC - Scalable archives

Data Grids
• Data Grid - links multiple data collections
  – Separate name spaces
  – Separate administration domains
  – Heterogeneous database instances
• Stage data from collection into the data grid
[Figure: Database A and Database B linked through the data grid]
• The data grid is itself a collection that provides mechanisms to hide latency and provide a global namespace

State-of-the-art Data Management
• Provide knowledge management abstraction
  – Abstract the processes that create the derived data product (Virtual data grid)
  – Abstract the collection formation used to organize the derived data products (Persistent Archive)
• A persistent archive is a virtual data grid in which the derived data products are data collections
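One way to read "abstract the processes that create the derived data product" is that the catalog records how a product is made, so the product can be regenerated on demand instead of being stored permanently. The sketch below illustrates that reading only; `VirtualDataCatalog` and its methods are hypothetical names, and the sample data is invented.

```python
class VirtualDataCatalog:
    """Toy catalog that stores how a derived product is made, not only the product."""

    def __init__(self, raw_data):
        self._raw = dict(raw_data)   # logical name -> stored data
        self._recipes = {}           # derived name -> (function, input names)
        self._materialized = {}      # derived name -> cached result
        self.derivations_run = 0     # counts how often a recipe was executed

    def define(self, name, func, inputs):
        """Register the process that creates a derived data product."""
        self._recipes[name] = (func, tuple(inputs))

    def get(self, name):
        """Return stored data, a cached product, or derive the product on demand."""
        if name in self._raw:
            return self._raw[name]
        if name in self._materialized:
            return self._materialized[name]
        func, inputs = self._recipes[name]
        result = func(*(self.get(i) for i in inputs))
        self.derivations_run += 1
        self._materialized[name] = result
        return result

    def evict(self, name):
        """Drop a cached product; it stays retrievable because the recipe remains."""
        self._materialized.pop(name, None)


catalog = VirtualDataCatalog({"readings": [3, 1, 4, 1, 5]})
catalog.define("sorted_readings", sorted, ["readings"])
catalog.define("median", lambda xs: xs[len(xs) // 2], ["sorted_readings"])

first = catalog.get("median")    # derives sorted_readings, then median
catalog.evict("median")          # discard the cached product
second = catalog.get("median")   # regenerated from the recorded process
print(first, second, catalog.derivations_run)
```

Under the same reading, a persistent archive would register a recipe whose output is an entire collection, so the collection can be rebuilt when the underlying technology changes.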
Standards
• Object Management Group - OMG
  – Model Driven Architecture for platform independent models of services
    • Platform dependent models transform an abstract representation into CORBA, Java, C, ….
    • Builds upon Unified Modeling Language (UML)
    • Manages life cycle for software services
  – Common Warehouse Metamodel
    • Provides abstract representation for collections that can be used to migrate collections to alternate databases
    • Builds upon a subset of UML

Standards
• World Wide Web Consortium - W3C
  – Semantic Web for natural language queries to collections.
  – Builds upon the DARPA Agent Markup Language for services, and logic manipulation languages (DAML-L, OIL)
  – Uses Resource Description Framework and XML

Standards
• ISO
  – Topic maps manage relationships between concept spaces and collection attributes
  – Provide mechanisms to manage semantic interoperability
• Global Grid Forum
  – Provides authentication systems, data handling systems, execution environments

Knowledge Based Data Grids
[Figure: three-level diagram (Knowledge, Information, Data) spanning Ingest Services, Management, and Access Services. Recoverable labels:
  Knowledge: Relationships Between Concepts; Knowledge Repository for Rules; Knowledge or Topic-Based Query / Browse; Rules - KQL; XTM DTD
  Information: Attributes, Semantics; Information Repository; Attribute-based Query; SDLIP; XML DTD; (Model-based Access)
  Data: Fields, Containers, Folders; Storage (Replicas, Persistent IDs); Grids; Feature-based Query; MCAT/HDF; (Data Handling System - SRB)]

Data Intensive Computing Environment Group
Staff / Students - GSRA
• Reagan Moore
• Chaitan Baru
• Sheau Yen Chen
• Charles Cowart
• Amarnath Gupta
• George Kremenek
• Bertram Ludäscher
• Richard Marciano
• Arcot Rajasekar
• Abe Singer
• Michael Wan
• Ilya Zaslavsky
• Bing Zhu
• Martin Kuhl
• Liying Sui
• Yang Yu
• Valter Crescenzi

Students - Undergrad Interns
• Peter Shin
• Roman Olshanowsky
• Shabbar Tambawala
• Pratik Mukhopadhyay
• +/- NN

Further Information
http://www.npaci.edu/DICE