Grid Federated Data Management Overview
Grid Federated Data Management Projects: Data Replication
OGSA Early Adopters Workshop, 30th May 2002
Steve Glanville (deglanvs@us.ibm.com)
Inderpal Narang (narang@almaden.ibm.com)
Vijayshankar Raman (shankar@almaden.ibm.com)
Steven Beckhardt (consultant)

Replication Mission
• To optimize data location so as to achieve the best possible Quality of Service
• To achieve this through the extension of standard Grid services
• To render enhanced data replication services through OGSA, making them available through compliant Grid toolkits
• To further enhance this model to cope with heterogeneous platforms and data types

Overall Project
• To develop Grid-Federated Data Management (GFDM) services
• GFDM must handle multiple, heterogeneous data sources
• GFDM services are rendered through OGSA
• GFDM services may be a consumer of, and/or a provider for, other OGSA-rendered services

Data Replication Value Proposition
• Scalability
• Availability
• Consistency (QoS)
• Disaster Recovery
• Utility model for DB eUtilities (e.g. analytics eUtilities)
• Collaboration
• Dynamic data distribution is key in Grid computing
• Data replication services would be a core service to facilitate this
• Data replication includes databases, files, binaries, …

Data Replication Business Scenarios
The nomenclature used here describes a replication 'Service' as having multiple 'Subscriptions'; this model enables the following scenarios:
1. Replication of data sets from place to place, selectively or in their entirety, to optimize for the collaborative resource model in, say, life-sciences research.
2. Aggregation of multiple data sources to a single target, such as might occur to populate a data warehouse. This can be used to consolidate information for reporting purposes. A common QoS is assumed.
3. Replication of a single source to multiple targets, for example to make multiple local copies of centralized data.
4. As 2 and 3, with transformations for different targets, to support Unit-of-Measure (UoM) or language localizations. A site may need to convert to a single currency and UoM for headquarters consolidation purposes, and equally may need to budget 'out' from the HQ currency to local-currency systems at remote locations.
5. As 3, with transformations for different QoS, to support the needs of different target environments. Business functions such as sales analysis may be quite happy with a daily copy of sales data taken at the close of business each day; however, Disaster Recovery (DR) support might need near-real-time data replication to minimize data outage in the event of a catastrophic server or storage failure.

Data Replication: System Abstraction
Replication Control Service: SHEPERD (ScHEduler, PatrollEr and Restart Director)
Service interface for:
• Subscriptions (static and dynamic)
• QoS (latency, security, …)
• Notifications (subscriber, metadata, …)
• Billing, auditing, …
[Diagram: control flow runs from SHEPERD to the Capture and Apply Services; the data flow of ∆s (changed data) runs from the Source through the Capture Service (Capture Program: the data producer) to the Apply Service (Apply Program: the data consumer) at the Target, i.e. from producer services (Psvcs) to consumer services (Csvcs).]
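A minimal sketch, assuming nothing beyond the bullets above, of how the Capture, Apply and SHEPERD roles could be expressed as Python interfaces. Every class, method and field name here (Delta, CaptureService.capture_deltas, and so on) is an illustrative assumption; the deck does not define a concrete API.

```python
# Hypothetical interfaces for the three-service abstraction; all names are illustrative.
from dataclasses import dataclass
from typing import Iterator


@dataclass
class Delta:
    """One unit of changed data (the deck's ∆) flowing from source to target."""
    subscription_id: str
    sequence: int      # monotonically increasing; used for checkpoints and drop detection
    payload: bytes


class CaptureService:
    """Data producer at the source: streams ∆s, resuming from a checkpoint."""

    def capture_deltas(self, subscription_id: str, from_checkpoint: int) -> Iterator[Delta]:
        raise NotImplementedError


class ApplyService:
    """Data consumer at the target: applies ∆s and remembers what has been applied."""

    def apply_delta(self, delta: Delta) -> None:
        raise NotImplementedError


class ReplicationControlService:
    """SHEPERD: owns subscriptions, QoS, notifications, billing and auditing;
    initiates capture/apply pairs and drives crash recovery."""

    def subscribe(self, source: str, target: str, qos: dict, requestor: str) -> str:
        raise NotImplementedError

    def heartbeat(self, service_id: str, checkpoint: int) -> None:
        raise NotImplementedError
```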
Replication Subscription Catalog (OGSA)
Replication schema objects:
• Subscription pairing (source-target)
• Requestor privilege determination
• Object type (table, index, xls, txt, etc.)
• Object location
• Object security (encryption, keys, etc.)
• (Replica catalog, audit, billing)

Replication Subscription Schema (Policy, Life-cycle)
Replication schema objects:
• Subscription policy (QoS, etc.)
• Location resource quotas
• Transformation details (UoMs, currency, language)
• Replica trigger conditions (once, time-, tx- or size-based)
• Replica life-cycle (once, repeat, continuous)
• (Checkpoint restart data)

Critical Components to Support Replication
Control-Node
• SHEPERD code drop
• Catalog schema
• Transaction table for audit / billing support
Data-Node
• Capture-Producer code drop
• Transaction table for latency

Replication Architecture
[Diagram: the Capture Service at each Source streams ∆s through a disk buffer to Apply Services at one or more Targets. SHEPERD holds the subscriptions table (Src, Tgt, QoS, Requestor, …); it starts a subscription thread for each new subscription, initiates and monitors QoS for the capture/apply pairs, and receives heartbeats from the Capture and Apply Services.]

Normal Replication
[Diagram of the normal replication flow.]

Role of SHEPERD (ScHEduler / PatrollEr / Restart Director)
• Manage persistent state of processes
• Initiate / schedule / update subscriptions
• Handle crash recovery

Role of the Capture Service
• Stream the data (∆s) to be replicated, with checkpoints
• Flow control: eject lagging subscriptions
• Heartbeat to SHEPERD reporting the operational progress of ∆ capture (e.g. log-reader position)
  – speeds up crash restart
  – avoids persistent state at the source
• Transformations, filters
[Diagram: the Capture Service reads ∆s from the Source, streams them out per subscription, and exchanges subscription and heartbeat messages with SHEPERD.]

Role of the Apply Service
• Pick ∆s from the source channel and give them to the target
• Maintain a persistent list of applied ∆s
  – detect dropped ∆s
  – ensure one behavior for all 'drops'
• Near-real-time capability: avoids persistent queues and two-phase commit
• Heartbeat (operational)
• Enhanced operational services …
[Diagram: the Apply Service writes ∆s to the Target and records them in an on-disk transaction table; it heartbeats to SHEPERD.]
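The catalog and policy slides above list the information a subscription carries. As a purely illustrative data model (the real OGSA schema is not shown in the deck), a subscription record might combine those fields like this:

```python
# Hypothetical subscription record combining the catalog and policy slides.
# Field names are assumptions drawn from the bullets, not the actual schema.
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ReplicationSubscription:
    # Catalog: pairing, privileges, object description, security
    source: str                      # e.g. "db2://hq/sales" (illustrative)
    target: str                      # e.g. "db2://emea/sales_copy" (illustrative)
    requestor: str                   # used for privilege determination
    object_type: str                 # "table", "index", "xls", "txt", ...
    object_location: str
    object_security: dict = field(default_factory=dict)   # encryption, keys, ...

    # Policy: QoS, quotas, transformations, triggers, life-cycle
    qos: dict = field(default_factory=dict)                # latency, security, ...
    resource_quota_mb: Optional[int] = None
    transformations: list = field(default_factory=list)    # UoM, currency, language
    trigger: str = "time"            # "once", or time-, tx- or size-based
    life_cycle: str = "continuous"   # "once", "repeat", "continuous"
    checkpoint: int = 0              # checkpoint restart data
```

Under this model, business scenario 5 above is simply two subscriptions on the same source that differ only in qos, trigger and life_cycle: a daily close-of-business copy for sales analysis and a continuous near-real-time copy for disaster recovery.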
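A hedged sketch of the patroller half of SHEPERD's role, fleshing out the control service from the first sketch: it records heartbeats (which carry capture or apply progress) and re-initiates any process whose heartbeat has stopped, as described on the SHEPERD and crash-recovery slides. The timeout value and method names are assumptions.

```python
# Hypothetical SHEPERD patroller loop: watch heartbeats, re-initiate on failure.
import time


class Sheperd:
    HEARTBEAT_TIMEOUT = 30.0  # seconds without a heartbeat before a process is presumed dead

    def __init__(self):
        self.last_heartbeat = {}   # service_id -> (timestamp, checkpoint)

    def heartbeat(self, service_id, checkpoint):
        """Called by Capture/Apply; checkpoint is e.g. the log-reader position."""
        self.last_heartbeat[service_id] = (time.time(), checkpoint)

    def patrol(self):
        """Periodically scan heartbeats and drive crash recovery."""
        now = time.time()
        for service_id, (seen, checkpoint) in list(self.last_heartbeat.items()):
            if now - seen > self.HEARTBEAT_TIMEOUT:
                self.reinitiate(service_id, checkpoint)
                # Reset the timer so a restart in flight is not re-triggered.
                self.last_heartbeat[service_id] = (now, checkpoint)

    def reinitiate(self, service_id, checkpoint):
        """Restart the failed process from its last checkpoint; if the node itself is
        down, try another node that shares the data, else wait for the node to recover."""
        ...
```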
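The Capture Service slide above mentions checkpointed streaming, flow control by ejecting lagging subscriptions, and progress heartbeats. A minimal sketch of that behaviour, with the lag threshold, checkpoint interval and helper names as assumptions:

```python
# Hypothetical Capture Service worker: fan ∆s out per subscription, eject laggards,
# and report capture progress to SHEPERD via heartbeats. All names are illustrative.
import queue


class CaptureWorker:
    MAX_LAG = 10_000          # ∆s a subscriber may fall behind before being ejected
    CHECKPOINT_EVERY = 1_000  # report a checkpoint every N ∆s

    def __init__(self, sheperd):
        self.sheperd = sheperd
        self.subscribers = {}  # subscription_id -> bounded queue feeding an Apply Service

    def add_subscription(self, subscription_id):
        self.subscribers[subscription_id] = queue.Queue(maxsize=self.MAX_LAG)

    def publish(self, sequence, payload):
        """Fan a captured ∆ out to every subscription; eject subscribers that lag."""
        for sub_id, q in list(self.subscribers.items()):
            try:
                q.put_nowait((sequence, payload))
            except queue.Full:
                # Flow control: a lagging subscription is ejected rather than forcing
                # the capture stream to keep persistent state for it at the source.
                del self.subscribers[sub_id]
                self.sheperd.notify_ejected(sub_id, last_good_sequence=sequence)
        if sequence % self.CHECKPOINT_EVERY == 0:
            # The heartbeat carries capture progress, which speeds up crash restart.
            self.sheperd.heartbeat("capture", checkpoint=sequence)
```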
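Similarly for the Apply Service, a sketch of how a persistent list of applied ∆s can detect dropped or duplicate ∆s by sequence number, reusing the hypothetical Delta record from the earlier sketch. An SQLite table stands in for the on-disk transaction table; that choice and all names are assumptions.

```python
# Hypothetical Apply Service worker: durable record of applied ∆s plus drop detection.
import sqlite3


class ApplyWorker:
    def __init__(self, db_path="apply_transaction_table.db"):
        self.db = sqlite3.connect(db_path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS applied (subscription_id TEXT, sequence INTEGER, "
            "PRIMARY KEY (subscription_id, sequence))"
        )

    def last_applied(self, subscription_id):
        row = self.db.execute(
            "SELECT MAX(sequence) FROM applied WHERE subscription_id = ?",
            (subscription_id,),
        ).fetchone()
        return row[0] or 0

    def apply(self, delta):
        expected = self.last_applied(delta.subscription_id) + 1
        if delta.sequence > expected:
            # A ∆ was dropped in transit: one behavior for all drops; here we refuse
            # and let SHEPERD reset Capture to the last consistent checkpoint.
            raise RuntimeError(f"missing ∆ {expected}..{delta.sequence - 1}")
        if delta.sequence < expected:
            return  # already applied (e.g. after a Capture restart); ignore duplicates
        self.write_to_target(delta)                      # give the ∆ to the target
        with self.db:                                    # commit with the applied record
            self.db.execute(
                "INSERT INTO applied VALUES (?, ?)",
                (delta.subscription_id, delta.sequence),
            )

    def write_to_target(self, delta):
        ...
```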
Crash Recovery Example: Source
• Heartbeat termination indicates a Capture failure to SHEPERD
• SHEPERD tries to re-initiate the Capture process, supplying the checkpoint
  – If it cannot restart due to a node failure, it attempts a restart on another local shared-data node
  – If restart is not possible, it assumes the node is in recovery and periodically checks for a heartbeat
• Reads the disk buffer to establish the restart checkpoint
• Restarts the Capture process

Crash Recovery Example: Target
• Heartbeat termination indicates an Apply failure to SHEPERD
• SHEPERD tries to re-initiate the Apply process
  – If it cannot restart due to a node failure, it attempts a restart on another local shared-data node
  – If restart is not possible, it assumes the node is in recovery and periodically checks for a heartbeat
• SHEPERD restarts the Apply process
• SHEPERD resets the Capture process to the last consistent checkpoint

Data Grid Engagement Model
Work with a consortium of initial contributors (IBM, Oracle, …), UK eScience, the DBTF (DAIS Working Group), etc., and Globus to determine the organizational structure, methods and timescales that would permit the inclusion and support of non-Globus-developed code in Globus.
It is clear that many organizations, from IBM to Oracle to Globus to CERN, are thinking along the same lines about increasing the common infrastructure capabilities within Globus to enable the security and privileges associated with data replication. It is time to share our ideas in this area.
Hopes:
• Share initial designs with Globus, UK Grid and DBTF folks before the next GGF
• Meet at the next GGF, or shortly thereafter, to agree processes for code inclusion and support
• Integration into GDS, GDSF, DSR, GDTS (catalog schema) if it becomes the de facto mechanism for rendering GDS

Data Replication Milestones (Very Tentative)
• Agree the initial design by end 5/2002
• Complete the Technology Preview by end 7/2002
• Iterate to a Reference-level implementation by end 12/2002
• Present the Data Replication enhancements to GGF at the 10/2002 conference
• Define relationships with, and dependencies on, other enhanced OGSA services by 12/2002
• Validate the enhanced design's use of DB2 and Oracle Replication Services by 12/2002; SQL Server in phase 2 (2003)
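Finally, to tie the crash-recovery slides together, a hedged sketch of the target-side recovery sequence: restart Apply on the failed node or on another local shared-data node, otherwise poll for the node's heartbeat, and then reset Capture to the last consistent checkpoint. All function and method names are illustrative assumptions.

```python
# Hypothetical target-side crash-recovery flow driven by SHEPERD.
def recover_apply_failure(sheperd, subscription, failed_node, peer_nodes):
    # 1. Heartbeat termination has already told SHEPERD that Apply failed.
    checkpoint = sheperd.last_consistent_checkpoint(subscription)

    # 2. Try to re-initiate Apply on the failed node, else on a local node
    #    that shares the same data.
    for node in [failed_node] + list(peer_nodes):
        if sheperd.restart_apply(subscription, node, checkpoint):
            break
    else:
        # 3. No restart possible: assume the node is in recovery and poll for
        #    its heartbeat instead of abandoning the subscription.
        sheperd.poll_for_heartbeat(failed_node)
        return

    # 4. Reset Capture to the last consistent checkpoint so the restarted
    #    Apply sees every ∆ it has not yet durably applied.
    sheperd.reset_capture(subscription, checkpoint)
```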