iRODS: Integrated Rule-Oriented Data System
Ray Idaszak
Director, Collaborative Environments
RENCI, University of North Carolina at Chapel Hill

iRODS
• Integrated Rule-Oriented Data System
  – What It Is
    • Origins, How it works, What’s different about it
  – Why It Is
    • Context, Role it serves
  – Where It’s Going (Today, Future)
    • Funding, Key efforts

iRODS Talk Outline
• Integrated Rule-Oriented Data System
  – What is the Integrated Rule-Oriented Data System?
    • Origins, Technology, How it works
  – Why It Is
    • Context, Role it serves
  – Where It’s Going (Today, Future)
    • Funding, Key efforts

What’s Different about iRODS?
• iRODS lets you manage your data with your rules and in your way, against a backdrop of federatable community data worldwide, via Policies

iRODS Background
• Integrated Rule-Oriented Data System
  – Open-source initiative that represents 12+ years of development and over $10M of NSF grant funding
  – Another $8M+ of funding pending (via NSF DataNet)
• Collaboration between
  – UNC Chapel Hill
    • Data Intensive Cyber Environments group (DICE)
  – RENCI
    • State-funded Cyberinfrastructure Institute at UNC Chapel Hill
  – San Diego Supercomputer Center (SDSC)

iRODS Data and Policy Virtualization
[Figure: A user client views and manages data through the Data Grid. The user sees a single “Virtual Collection” (/cuahsi/catalog, /cuahsi/modeling, /cuahsi/terrain) whose contents are held at RENCI, Utah State Univ, and SDSC.]
The iRODS Data Grid installs in a “layer” over storage systems, so you can view, manage, access, add, and share part or all of your data in a unified Collection.
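The “Virtual Collection” idea above can be illustrated with a small sketch. This is not iRODS code and does not use any iRODS API: the `Replica` class, the `CATALOG` table, the assignment of collections to sites, and the physical paths are all hypothetical, chosen only to show how one logical namespace might sit in a layer over several storage systems.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    site: str           # storage site that holds the bytes
    physical_path: str  # hypothetical on-disk location at that site


# Hypothetical catalog: logical collection -> where it physically lives.
# The site assignments and physical paths are invented for this sketch.
CATALOG = {
    "/cuahsi/catalog": Replica("RENCI", "/vault/a1/catalog"),
    "/cuahsi/modeling": Replica("Utah State Univ", "/srv/store/modeling"),
    "/cuahsi/terrain": Replica("SDSC", "/gpfs/proj/terrain"),
}


def list_collection(prefix: str) -> list[str]:
    """Return the logical paths under a prefix, as the user would see them."""
    prefix = prefix.rstrip("/") + "/"
    return sorted(p for p in CATALOG if p.startswith(prefix))


def resolve(logical_path: str) -> Replica:
    """Map a logical path to the site and physical path that store it."""
    for collection, replica in CATALOG.items():
        if logical_path == collection or logical_path.startswith(collection + "/"):
            rest = logical_path[len(collection):]
            return Replica(replica.site, replica.physical_path + rest)
    raise KeyError(f"no such logical path: {logical_path}")
```

The user only ever works with paths such as `/cuahsi/terrain/dem/tile1.tif`; a call like `resolve(...)` stands in for the lookup that hides which site actually stores the file.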
Using a Data Grid - Details
[Figure: iRODS Servers, each with a Rule Engine, at RENCI, SDSC, and USU; the iCAT Metadata Catalog.]
• User asks for data using logical properties (client-server)
• Data request goes to the 1st Server
• Server looks up information in the Catalog (applies rules)
• Catalog responds that the 3rd Server has the data
• 1st Server asks the 3rd Server, peer-to-peer, to serve up the data
• 3rd Server applies rules and serves the data

Using a Data Grid – NEAR FUTURE (DB Resource)
[Figure: iRODS Servers, each with a Rule Engine, at RENCI, SDSC, and USU; MySQL, PostgreSQL, and Oracle databases; the iCAT Metadata Catalog.]
• A user who is not running a SQL server locally makes a query
• Query goes to the 1st Server
• Server looks up information in the Catalog (applies rules)
• Catalog responds that the 3rd Server has the SQL database
• 1st Server sends the SQL query to the 3rd Server
• 3rd Server applies rules and serves the query result

Example Clients & Client Interfaces (i.e. iRODS is client agnostic)
• C library calls - Application level
• .NET - Windows client API
• Unix shell commands - Scripting languages
• Java I/O class library (JARGON) - Web services
• SAGA - Grid API
• Web browser (Java-python) - Web interface
• Windows browser - Windows interface
• WebDAV - iPhone interface
• Fedora digital library middleware - Digital library middleware
• Dspace digital library - Digital library services
• Parrot - Unification interface
• Kepler workflow - Grid workflow
• Fuse user-level file system - Unix file system
iDrop
• Drag and drop GUI
• User actions can be mapped to policies

iRODS Policies
• iRODS is described as a “Policy-based” data management system
• Policy definition: A proposed or adopted course of action
  – Ergo, iRODS associates a “course of action” with all data
• Pre- and Post- “Policy Enforcement Points” (PEP)
  – Pre: Course of action for data coming into iRODS
  – Post: Course of action for data going out of iRODS

iRODS Policies
• Retention, disposition, distribution, arrangement
• Authenticity, provenance, description
• Integrity, replication, synchronization
• Deletion, trash cans, versioning
• Archiving, staging, caching
• Authentication, authorization, redaction
• Access, approval, IRB, audit trails, report generation
• Assessment criteria, validation
• Derived data product generation, format parsing
• Federation

iRODS Rule Engine, Workflows
• iRODS has its own built-in, imperative, interpreted programming language called the Rule Engine
• The iRODS Rule Engine executes Microservices
• An iRODS “program” is called a Workflow
  – A Microservice is one “step” of an iRODS Workflow
  – iRODS Workflows are executed on the iRODS Server
  – Arbitrary external web services can be one “step” of an iRODS Workflow
    • Encapsulated as a microservice

iRODS Microservices
• Microservices are written in C and can provide anything that can be done in C, which is in part what makes iRODS so extensible. Typically they provide:
  – Standard operations, e.g. file or format conversion
  – Queries on the metadata catalog
  – Interaction with web services
  – Triggering of external HPC workflows
  – Remote and delayed execution control
• Microservices communicate through
  – Arguments, session variables, user space variables, etc.
Differentiating Workflows
• iRODS data grid workflows
  – Low-complexity: a small number of operations compared to the number of bytes in the file
  – Server-side workflows
  – Data sub-setting, filtering, metadata extraction
• Grid workflows
  – High-complexity: a large number of operations compared to the number of bytes in the file
  – Client-side workflows
  – Computer simulations, pixel re-projection

A few more iRODS notes…
• Authentication
  – GSI (PKI), Kerberos, Shibboleth, Challenge-response
• Authorization
  – Roles, user groups, resource groups, policy constraints, ACLs
• Transport
  – TCP/IP (parallel I/O streams), Reliable Blast UDP
• Metadata catalog
  – PostgreSQL, MySQL, Oracle
• Distributed rule engine
  – Scheduler, messaging system, execution engine, rule base

iRODS Talk Outline
• Integrated Rule-Oriented Data System
  – What is the Integrated Rule-Oriented Data System?
    • Origins, Technology, How it works
  – Why is there an Integrated Rule-Oriented Data System?
    • Context, Role it serves
  – Where It’s Going (Today, Future)
    • Funding, Key efforts

Entire Data Life Cycle: The iRODS Vision
Each data life cycle stage increases the value and usability of the original collection
• Project Collection: Private, Local Policy
  – Jeff gets data from a sensor
• Data Grid: Shared, Distribution Policy
  – Jeff shares data with colleagues
• Data Processing Pipeline: Analyzed, Service Policy
  – Together with colleagues, Jeff analyzes the data and produces results
• Digital Library: Published, Description Policy
  – Results are peer-reviewed and published
• Reference Collection: Preserved, Representation Policy
  – Jeff et al. hit the jackpot: the collection is now accepted as a reference collection for decades
• Federation: Sustained, Re-purposing Policy
  – The hydrology data grid grows in value to ecology and biology and is federated

iRODS Talk Outline
• Integrated Rule-Oriented Data System
  – What is the Integrated Rule-Oriented Data System?
    • Origins, Technology, How it works
  – Why is there an Integrated Rule-Oriented Data System?
    • Context, Role it serves
  – Where is iRODS going today and in the future?
    • Funding, Key efforts

iRODS: Future
• Pending 2011 NSF DataNet
  – DataNet Federation Consortium (DFC)
    • Includes CUAHSI!! (and several others)
• RENCI: Creating an “Enterprise” version of iRODS
  – http://iren-web.renci.org/irods-meeting/irods@renci2011UserMeeting-contribution.pdf

Summary
• iRODS fills an important niche
  – Differentiation: It is a Policy-driven distributed data management system that formally supports the entire Data Life Cycle
    • E.g. an iRODS Data Grid is a vehicle for fulfilling NSF’s Data Management Plan requirement at the community scale
  – Classification: Middleware
    • iRODS is not intended to be all-encompassing, but rather to work with other DataNets, Workflow Engines, systems like CUAHSI HIS, etc. in canvassing a National Cyberinfrastructure
  – I.e. it falls primarily in the “Data Services/Storage” portion of NSF’s Data Enabled Science description
• With iRODS, the community is still responsible for:
  – Schema, data formats, defining policies, defining web interfaces, building analysis and knowledge tools, etc.

iRODS Credits
Principal Investigators
Richard Marciano, Reagan Moore (PI), Arcot Rajasekar
Additional Contributors
William Sims Bainbridge, Leesa Brieger, Luis Carriço, Sheau-Yen Chen, Michael Conway, Jason Coposky, Vijay Dantuluri, Antoine de Torcy, Wei Ding, Kevin Gamiel, Lucas Gilbert, Nuno Guimarães, Chien-Yi Hou, Bernard J. (Jim) Jansen, Oleg Kapeljushnik, Mounia Lalmas, Christopher A.
Lee, Xia Lin, Gary Marchionini, Cathy Marshall, Jason Reilly, Meredith Ringel Morris, Stefan Rüger, Wayne Schroeder, Michael Stealey, Lisa Stilwell, Jaime Teevan, Paul Tooby, Michael Wan, Bing Zhu

iRODS Credits
Research Supported By
• NSF ITR 0427196, Constraint-Based Knowledge Systems for Grids, Digital Libraries, and Persistent Archives (2004–2007)
• NARA supplement to NSF SCI 0438741, Cyberinfrastructure; From Vision to Reality—Developing Scalable Data Management Infrastructure in a Data Grid-Enabled Digital
• NARA supplement to NSF SCI 0438741, Cyberinfrastructure; From Vision to Reality—Research Prototype Persistent Archive Extension (2006–2007)
• NSF SDCI 0721400, SDCI Data Improvement: Data Grids for Community Driven Applications (2007–2010)
• NSF/NARA OCI-0848296, NARA Transcontinental Persistent Archive Prototype (2008–2012)

iRODS Credits
For More Information
• http://www.irods.org
• http://diceresearch.org/
• http://dice.unc.edu/
• http://www.renci.org/news/releases/renci-teams-with-dice

Thank You.
http://www.renci.org