Applied CyberInfrastructure Concepts
ISTA 420/520 Fall 2014
Will Computers Crash Genomics? Science Vol 331 Feb 2011

Nirav Merchant (nirav@email.arizona.edu)
Bio Computing & iPlant Collaborative
Eric Lyons (ericlyons@email.arizona.edu)
Plant Sciences & iPlant Collaborative
University of Arizona

http://goo.gl/p4j3m or https://sites.google.com/site/appliedciconcepts/

Topic Coverage
Lifecycle Issues (example from MIT)
Why DM (Data Management)
iRODS Introduction
Scaling the Infrastructure for Data Management (Chapter 3 from FiMDA)
Group homework

Reality of data
“We are drowning in data, but starving of information” - Attribution unknown

Data Life Cycle
http://www.data-archive.ac.uk/create-manage/life-cycle

iRODS Background and Evolution
• integrated Rule-Oriented Data System (iRODS) http://www.irods.org
• Originated at SDSC, developed by the DICE (Data Intensive Cyber Environments) group
• Based on decade-long SRB development experience for managing distributed data
• Community-driven
• Most of the group migrated to UNC Chapel Hill in 2008-2009
  – The group is bi-coastal: DICE-UNC, DICE-UCSD
• First release of iRODS in 2009
• iRODS picked up where SRB left off

iRODS Background and Evolution
• Modular, extensible, customizable
• Open source (BSD license)
• Supported at UNC with complementary activities by DICE and RENCI, a research unit of UNC Chapel Hill
• https://github.com/irods/irods

iRODS
I. Data grid middleware
II. Data management infrastructure
III. A framework for procedural implementation of data management policy (policy-driven data management)
iRODS is all these.

iRODS Unified Virtual Collection
iRODS View of Distributed Data
User Client: user sees a single collection
My Data: disk, filesystem, site-specific storage, ...
My Data: tape, database, filesystem, ...
Partner’s Data: remote disk, tape, filesystem, site-specific storage, ...
• iRODS installs over heterogeneous data resources
• Users can share & manage distributed data as a single collection

iRODS as a Data Grid
• Sharing data across:
  – geographic and institutional boundaries
  – heterogeneous resources (hardware/software)
• Virtual (logical) collections of distributed data
• Global name spaces
  – data: files and collections
  – users: single sign on
  – storage: virtual resources
• Metadata catalogue (iCAT) manages mappings between logical and physical name spaces

A RENCI Data Grid
A complete data grid (zone) has one metadata catalogue (iCAT)
Diagram: iRODS servers at NCSU, Duke, iPlant, UNC-A, UNC-CH, and RENCI (Europa Center), with one Metadata Catalog (iCAT)
• Client asks for data – request goes to an iRODS server
• Server contacts the iCAT-enabled server
• Information (location, access rights, etc.) is retrieved from the iCAT
• Server containing data is signaled to send data to authorized client

TUCASI Infrastructure Project (TIP) Federated Data Grids
Independent data grids (zones), each with its own iCAT, can be federated
18 September 2012

Federation of Data Grids
• NASA
  – Disparate data collections: Satellite data, model data, remote sensing data
  – Manage the collections separately (technically and administratively) with separate data grids
  – Federate the data grids to give users an overall view onto NASA data
• Collaboration between consortia
  – DataNet Federation Consortium: 6 science domain partners, federating their data grids to share data, users
  – Users authenticate to home data grid, access federated data grids
• For geographically distributed replication, evolution in data life cycle

iPlant Data Store
Free Your Data
Different Users, Different Access Needs: One Data Store
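
Sketch: logical-to-physical lookup
The request flow on the RENCI data grid slide (client asks for data, the catalogue is consulted for location and access rights, the server holding the data sends it) can be illustrated with a toy model. This is not the iRODS API. The names `Catalog`, `CatalogEntry`, `Replica`, and `fetch`, the zone path, and the server names are all hypothetical, and the sketch assumes a single in-memory catalogue with a simple per-object reader list.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Replica:
    """Physical location of one copy of a data object (hypothetical model)."""
    server: str         # storage server that holds the bytes
    physical_path: str  # path on that server's local storage


@dataclass
class CatalogEntry:
    """Everything the toy catalogue knows about one logical path."""
    replicas: List[Replica] = field(default_factory=list)
    readers: Set[str] = field(default_factory=set)  # users allowed to read


class Catalog:
    """Toy stand-in for a metadata catalogue: logical name -> physical replicas."""

    def __init__(self) -> None:
        self._entries: Dict[str, CatalogEntry] = {}

    def register(self, logical_path: str, replica: Replica, readers: Set[str]) -> None:
        # One logical path may map to several physical replicas.
        entry = self._entries.setdefault(logical_path, CatalogEntry())
        entry.replicas.append(replica)
        entry.readers.update(readers)

    def lookup(self, logical_path: str, user: str) -> List[Replica]:
        # Return locations only if the object exists and the user may read it.
        entry = self._entries.get(logical_path)
        if entry is None:
            raise KeyError(f"no such logical path: {logical_path}")
        if user not in entry.readers:
            raise PermissionError(f"{user} may not read {logical_path}")
        return entry.replicas


def fetch(catalog: Catalog, logical_path: str, user: str) -> str:
    """Resolve a logical path and report which server would send the data."""
    replicas = catalog.lookup(logical_path, user)
    chosen = replicas[0]  # assumption: simply take the first registered replica
    return f"{chosen.server}:{chosen.physical_path}"


catalog = Catalog()
path = "/demoZone/home/alice/reads.fastq"  # hypothetical logical path
catalog.register(path, Replica("server-a", "/vault/alice/reads.fastq"), {"alice"})
catalog.register(path, Replica("server-b", "/tape/0042/reads.fastq"), {"bob"})
print(fetch(catalog, path, "bob"))  # prints "server-a:/vault/alice/reads.fastq"
```

The user only ever names the logical path. Which server and which physical path serve the request is decided by the catalogue lookup, which is the point of the logical and physical name space mapping.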