Towards a loosely coupled and scalable component set for scheduling bulk data copying across different storage resources as fault tolerant batch jobs.
http://code.google.com/p/dtsproject/

David Meredith (1), Stephen Crouch (2), Peter Turner (3), Gerson Galang (4), Ming Jiang (5), Hung Nguyen (6)
(1) NGS, Science and Technology Facilities Council, Daresbury Labs, UK, david.meredith@stfc.ac.uk
(2) OMII-UK, School of Electronics and Comp Sci, University of Southampton, UK, s.crouch@omii.ac.uk
(3) University of Sydney, Sydney, Australia, p.turner@chem.usyd.edu.au
(4) Victorian eResearch Strategic Initiative (VeRSI), Victoria, Australia, gerson.galang@arcs.org.au
(5) NGS, Science and Technology Facilities Council, Daresbury Labs, UK, ming.jiang@stfc.ac.uk
(6) University of Sydney, Sydney, Australia, nguyen_h@chem.usyd.edu.au

Australia (DataMINX); United Kingdom

Overview / Aims
• An open-source project developing a set of loosely coupled components for efficiently brokering data copies between a wide range of (potentially incompatible) storage resources as schedulable, fault-tolerant batch jobs (ftp, gridftp, srb, irods, sftp, file, webdav, srm?).
• To scale from small embedded deployments to large distributed deployments through an expandable ‘worker-node pool’ controlled through message orientated middleware (MOM, JMS).
• To maximize data access and transfer efficiency through the strategic placement and subscription of worker-nodes at or between particular data sources/sinks.
• To be inherently asynchronous and side-step the bandwidth, concurrency and scalability concerns for clients in networks with limited capability relative to the direct connectivity between the source and sink.
• Aims to address geographical-topological deployment concerns by allowing service hosting to be either centralized (as part of a shared service), or confined to a single institution or domain.
• Adoption of established design patterns and open source components, coupled with a proposal for an open standards based messaging protocol.
• Employs a single port-type, document-centric model, with service semantics defined solely by the message model.

DTS Features / Intentions 1
1. Encourage a common messaging model
We are engaging with OGF in the definition of an open standard describing a bulk data copy activity with subsequent control and event messages. The aim is to provide a key foundation in addressing the challenges of data management. Ideally standards based: OGF engagement (DMI, JSDL), plus communications with Globus, Unicore and GridSAM developers (a longer term perspective).
2. Platform independence
Includes the worker agent that manages a bulk data copy activity, the message broker, the message channel adapters that enable the different transports and protocols, and Commons VFS.
3. Adopts well recognized Enterprise Integration Patterns
Described in Hohpe and Woolf (2003): Competing Consumers, Service Activator, Selective Consumer, Polling Consumer, Message Driven Consumer, Transport Channel Adapter, Header Based Router.
http://www.enterpriseintegrationpatterns.com

DTS Features / Intentions 2
4. Value in the correct framework choice: deploy out-of-the-box features in remoting, scaling and batching.
• Spring Batch: one of the few open source batch processing frameworks currently available (purportedly the only?). It provides many functions that are essential in batch processing.
• Spring Integration: supports the Enterprise Integration Patterns identified by Hohpe and Woolf. Importantly, it provides a set of inbound and outbound message channel adapters for different integration options, both polling and message driven (e.g. JMS subscription, file/directory polling, RMI, WS, email).
• Message broker (e.g. Apache ActiveMQ or any JMS 1.1 compliant MOM broker).
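The Competing Consumers pattern listed above is what allows the ‘worker-node pool’ to be expanded: several workers pull from one shared queue and each message is taken by exactly one of them. The following is a minimal sketch of that idea only, not DTS code. It assumes an in-memory Python `queue.Queue` as a stand-in for a JMS queue on a broker, and the names `run_worker_pool`, `worker-N` and `copy-job-N` are hypothetical.

```python
import queue
import threading


def run_worker_pool(jobs, worker_count):
    """Drain one shared queue with several competing workers.

    Returns a list of (job, worker_name) pairs, one per processed job.
    The in-memory queue is an assumption standing in for a broker queue.
    """
    job_queue = queue.Queue()
    for job in jobs:
        job_queue.put(job)

    processed = []
    lock = threading.Lock()

    def worker(name):
        while True:
            try:
                # Each message is handed to exactly one consumer.
                job = job_queue.get_nowait()
            except queue.Empty:
                return  # queue drained, this worker stops
            with lock:
                processed.append((job, name))

    threads = [
        threading.Thread(target=worker, args=(f"worker-{i}",))
        for i in range(worker_count)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return processed


if __name__ == "__main__":
    pairs = run_worker_pool([f"copy-job-{n}" for n in range(20)], worker_count=4)
    print(len(pairs), "jobs processed")
```

The workers never coordinate with each other; adding capacity in this sketch means raising `worker_count`, which mirrors starting further worker instances against the same queue.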
Buffering data via an intermediary when copying between incompatible resources / protocols
• Client provides a single interface to different (potentially incompatible) storage resources, e.g. SRB, GsiFTP, FTP, SFTP, iRODS, file, WebDAV.
• Client brokers between storage resources when third-party transfer is not available.
[Figure: a client (e.g. Portal/Hermes) placed between SRB/FTP and SFTP/GSIFTP resources. Labels: Get and Put, or Mem buffer / Bit pipe; File operations (list, upload, download, delete, rename); Authentication tokens (un/pw, x509?).]

Client-Side Intermediary
Benefits
1. Auth tokens only in memory on one computer.
2. Self contained and interactive.
3. Extensible for new and emerging resources/protocols.
Challenges
1. Software is required that is capable of enacting a data copying activity between a variety of sources and sinks (bit pipe via byte streams or combined get/put).
2. The client must be constantly available throughout the duration of the transfer.
3. Buffering of large quantities of data introduces bandwidth and concurrency concerns for clients residing on networks with limited capability (e.g. wireless connectivity) relative to the direct connectivity between the source and sink.

DTS – Remotely Placed Worker Agents
Aim: Strategically place intermediary software agent(s) (e.g. at different institutions, within a network, at a local source/sink) and remotely invoke an appropriate agent using a message router, with a ‘Bulk Data Copy Activity’ executed as a fault tolerant batch process.
Best practice: process data as close to where it resides as possible.

3 Core DTS Components:
• Batch/Worker Agent. Software that will manage a bulk data copy activity. This is a batch operation: automated processing of large volumes of information that is most efficiently processed without user interaction (fire + forget).
• Common Message format that describes a data copying activity with subsequent control and event messages. It carries the following so that the recipient worker can access the data on behalf of the user:
  • Lists data sources and sinks.
  • Transfer requirements.
  • User credentials.
• Message Broker/Router for routing of messages to appropriate workers and scaling via the Competing Consumer pattern.

DTS Architecture (Simplified)
Broker between remote sources and sinks
[Figure: Clients place a data copy activity message on a Queue Channel; DTS workers consume it and copy data between Source and Sink (Get/Put or Bit pipe). Clients obtain the list of data URLs and credentials from a meta-data system or data catalogue (ICAT), OR perform lightweight file operations directly against the source/sink (list, delete, rename). Authentication tokens (un/pw, myproxy details).]

DTS Architecture (Simplified)
Broker between local source and remote sink (and vice-versa)
Message Bus is a combination of a messaging infrastructure, a common data model and command set to allow different systems to communicate through a shared set of interfaces (our message channels). http://www.enterpriseintegrationpatterns.com
[Figure: Clients; Facility Queue; Facility / Department Y Source/Sink; Facility / Department X Source/Sink; Home Lab.]

Deployment Strategies
Small – Local or embedded worker agent
Med – Single worker pool
Large – Multiple worker pools and message router
[Figure: three deployment diagrams. Legend: S = Submit message (bulk copy activity document), C = Control message, e = Event message.]
1) Lightweight local worker deployment. The worker agent is invoked by a script or is integrated into an existing application. Diagram labels: Client, P (Service Activator), WN, Source, Sink.
2) Distributed deployment with a single worker pool. Diagram labels: P, Worker pool, JobQb, ControlQ, ReplyQ, DMQ, Source, Sink.
3) Distributed deployment with multiple worker pools. Diagram labels: P, HTTPS, Router, Worker pool A, Worker pool B, JobQ, JobQa, JobQb, AJobQ, BJobQ, ControlQ, AControlQ, BControlQ, ReplyQ, DMQ.

Core Component
Message Router / Broker
Schedule and route messages to strategically placed worker agents. Scale with multiple agents using the Competing Consumer pattern.

Scaling
How can the architecture scale for increasing loads?
• Scale Out: Competing Consumer Pattern. To scale horizontally (or scale out) means to add more nodes to a system.
• Scale Up: Multi-process Service Activator. To scale vertically (or scale up) means to add resources and/or processes to a single node in a system.

Scale Out – Competing Consumer Pattern
• The only requirement is that the JMS client and consumer must be able to access the broker.
• This provides location independence, which enables scaling and clustering of services, since multiple workers can be configured to pull messages from the same queue.
• If the service becomes overburdened and falls behind in its processing, all that is needed is to start up a few more worker instances to listen to the queue.
• Consumers do not have to coordinate with each other, which improves resilience, since workers can be added and removed without affecting each other.
[Figure: JMS client (Producer) -> Broker (Queue) -> Worker (Consumer). Label: Queue depth ok.]
Basic architecture is repeatable: use multiple brokers and queues as required (e.g. broker clusters, master/slave brokers etc).

Message Routing
How can the appropriate remote worker(s) be invoked?
• How to invoke worker(s) that reside at the data source and/or sink?
• How to invoke worker(s) installed at my institution or within a specific network?
• How to target a specific worker?
1. Multiple Destinations
2. Message Selectors
3. Hybrid Approach

Message Routing: Multiple Destinations
Multiple static/administered queues can be configured on one broker in order to partition workers into different groupings.
Main Advantages: Queue depth is directly related to load. Therefore load balancing can be performed effectively since queues are not polluted with [...]. DTS should add new queues for different groupings (e.g. project queues, separate queues for different facilities).
Main Disadvantages: Changes are required on the broker to cater for new worker groupings (configuration of new administered queues).
This does not provide a high level of decoupling between message producer and consumer since changes are required to the broker.
[Figure: JMS clients send to Request Qa, Request Qb and Request Qc on one broker. Worker groups: Group A (Facility A), Group B (Project B), Group C (Institution C).]
In DTS, multiple destinations are used to partition static queue consumer cluster groups, e.g. a Request Q per facility, beam-line, project, institution etc.

Message Routing: Message Selectors
Message Selectors: workers can be ‘Selective Consumers’ and clients can be ‘Specifying Producers’. A message selector is an expression based on SQL92 conditional syntax, e.g.
Facility='FacilityX' AND BeamLine='ProteinMX' AND WorkerAccessKey='abcdefadsf_guuid'
• Filtering is performed by the broker: it delivers only those messages that match the selective consumer’s criteria.
• Importantly, workers can therefore decide which messages to process depending on their own selector statements.
• Main benefit is that this approach is extensible: it provides a higher level of decoupling between message producer and receiver, since clients and workers can be easily added without change to the broker.
• Selectors are optional; this pattern can also be combined with the multiple destination approach to route messages as required (hybrid approach).
Selectors can be used to perform fine-grained routing and route messages however you require, e.g.
• Route to the first available worker in a particular group that specifies a common/shared selector value, e.g. a common ‘groupID’ AND/OR ‘networkID’ AND/OR ‘facilityGroup’ AND/OR ‘domain’ AND/OR ‘GB limit’ etc. (SQL).
• Route to a specific worker using a unique and opaque client identifier/access key, e.g. GUUID (this is ok since the broker performs filtering, so different workers don’t see each other’s selectors). The specifying producer would need to persist this value between server re-starts/different sessions.
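The selector behaviour described above (the broker filters, and each worker only ever receives messages matching its own selector) can be illustrated with a small sketch. This is a simplification under stated assumptions and not how a JMS broker is implemented: real selectors are SQL92-style strings evaluated by the broker, whereas here a selector is reduced to a dictionary of header values that must all be equal. The function names `matches` and `deliver` and the worker names are hypothetical.

```python
def matches(selector, headers):
    """True when every header named in the selector has the required value.

    Assumption: equality-only conjunction, a small subset of what an
    SQL92-style selector expression can say.
    """
    return all(headers.get(key) == value for key, value in selector.items())


def deliver(messages, consumers):
    """Broker-side filtering sketch.

    messages:  list of (headers, body) tuples
    consumers: dict of consumer name -> selector dict
    Each message goes to the first consumer whose selector matches.
    Messages that match no selector are collected under the key None,
    standing in for messages that remain on the queue.
    """
    delivered = {name: [] for name in consumers}
    delivered[None] = []
    for headers, body in messages:
        target = next(
            (name for name, sel in consumers.items() if matches(sel, headers)),
            None,
        )
        delivered[target].append(body)
    return delivered


if __name__ == "__main__":
    consumers = {
        # Group selector shared by every worker at one beam-line.
        "workerA": {"Facility": "FacilityX", "BeamLine": "ProteinMX"},
        # Opaque access key that targets one specific worker.
        "workerB": {"WorkerAccessKey": "abcdefadsf_guuid"},
    }
    messages = [
        ({"Facility": "FacilityX", "BeamLine": "ProteinMX"}, "job1"),
        ({"WorkerAccessKey": "abcdefadsf_guuid"}, "job2"),
        ({"Facility": "FacilityY"}, "job3"),
    ]
    print(deliver(messages, consumers))
```

Because the matching is done in `deliver` rather than in the consumers, no worker sees another worker's selector values, which is the property the access-key routing above relies on.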
[Figure: Specifying Producers send messages with selection values to the Request Q; Selective Consumers receive the messages that match their selectors.]

Message Routing: Hybrid Approach
The best approach is to combine the message filtering approach with the multi-destination approach to suit your service instance requirements. The approaches are not mutually exclusive and can be used together provided both patterns are catered for in your system.
[Figure: Request Qa and Request Qb.]

Request Response (Client Worker Conversation)
1. ReplyTo header
2. Application ID exchange with message filtering
3. Temporary queues

Request Response (Conversation)
The request message contains a Return Address that indicates where to send the reply.
1. The Return Address is added to the message header.
2. The consumer does not need to know in advance where to send the reply; it can simply read the address from the request.
[Figure: Specifying Producers (Clients) send on the Request Channel to Selective Consumers (Workers), which reply on Reply Channel 1 or Reply Channel 2.]
Variations of this pattern depending on client requirements:
a) Further expand the message filtering approach to exchange client and worker Application IDs. The client can also selectively consume response messages using its own client ID, which it adds to the request header.
b) Temporary queue created by the client (lasts only for the duration of the client session).

Request Response (Conversation) using Filtering
[Figure: exchange of client and worker Application IDs so that the recipient worker and client can converse, in three numbered steps 1) to 3).
DTS Clients: NGS Portal (an app bound to facilityA), GridSAM (an app bound to facilityB), DTS Client1.
DTS Workers: Q Consumer Cluster ‘facilityA’, Q Consumer Cluster ‘facilityB’.
Queues: JobSubmitQ, InvokeClientQ, InvokeWorkerQ.
Client side: MDP Producer Pool connected to JobSubmitQ; MDP Selective Consumer Pool on ClientID = DTSClient1; MDP Producer Pool connected to InvokeWorkerQ.
Worker side: MDP Selective Consumer Pool on WorkerGroupID = facilityA; MDP Producer Pool connected to InvokeClientQ; MDP Selective Consumer Pool on WorkerID = workerA.
JMS message headers on the job request: MessageID = guuidA, WorkerGroupID = facilityA, ClientID = DTSClient1.
JMS message headers on the subsequent conversation messages: CorrelationID = guuidA, WorkerID = workerA, ClientID = DTSClient1.]

Request Response (Conversation) using Filtering
• Each JMS client (worker and client) has a unique instance/application ID (clientID, workerID).
1. A client sends a job request and adds its own ClientID to the headers (in conjunction with the other headers used in message selection, e.g. MessageID and WorkerGroupID).
2. A worker picks up the message and responds to an administered response queue (not a dynamic queue) via the ReplyTo header, returning its own WorkerID and forwarding the given ClientID in the message header.
3. The client receives messages from the response queue and filters on ClientID.
4. The client can now converse with the recipient worker, since both have their respective IDs and can correlate messages on the original message ID using CorrelationID.
• This approach requires only a limited number of administered queues, e.g. JobSubmitQ, InvokeClientQ, InvokeWorkerQ.
• Main benefit is that this approach is extensible: it provides a higher level of decoupling between message producer and receiver, since clients and workers can be easily added without change to the broker.
• Can also combine this approach with multiple channels as required (hybrid approach).

Core Component
Batch / Worker Agent
Enacts the Bulk Data Copy Activity as a fault tolerant batch job for copying between sources and sinks. Scopes, checkpoints and restarts.

Batch / Worker Agent
• Role is to enact the data copy activity according to the activity document, report status events and respond to control messages.
• Copy activity is a batch processing task (automated processing of large volumes of information that is most efficiently processed without user interaction).
• DTS worker based on Spring Batch and Commons VFS (a contract driven approach facilitates different implementations, e.g. scripts / shelling out to a command line client).
• Spring Batch provides a framework for functions that are essential in batch processing, e.g. split/monitor/merge, logging/tracing, tx management, processing statistics, job pause and restart, skip, retry, check-pointing. A Spring Batch implementation deals with breaking apart the business logic and sharing it efficiently between parallel processes or processors as step-jobs.
http://static.springsource.org/springbatch/index.html

Core Component
Message Model
Bulk Data Copy Activity Document. Control Messages (stop, start, cancel). Event Messages (faults, status, instance attributes).

Message Model Requirements
Document Message
• Bulk Data Copy Activity description.
• Captures all information required to connect to each source and sink URI and subsequently enact the activity.
• Transfer requirements, e.g. URI properties, file selectors (reg-expression), scheduling (batch-window), retry count, source/sink alternatives, checksums?, sequential ordering?, DAG?
• Serialized user credentials.
• Probably adopt/extend the Data End Point Reference (DEPR) construct from DMI: a specialized form of WS-Address element which does not mandate any particular URL/transport scheme; multiple <DataLocations/>.
Control Messages
• Interact with a state/lifecycle model (e.g. stop, resume, cancel).
Event Messages
• Standard fault types and status updates.
Information Model
• To advertise the service capabilities / properties / supported protocols.

Existing/In-Scope Specifications
Related Specifications
1. Job Submission Description Language (JSDL)
• An activity description language for generic compute applications.
2. OGSA Data Movement Interface (DMI)
• Low level schema for defining the transfer of bytes between a single source and sink.
3. JSDL HPC File Staging Profile (HPCFS)
• Designed to address file staging, not bulk copying.
4. OGSA Basic Execution Service (BES)
• Defines a basic framework for defining and interacting with generic compute activities: JSDL + extensible state and information models.
• None of these fully captures our requirements (this is not a criticism of these specs; they are designed to address their existing use-cases, which only partially overlap with the requirements for a bulk data copy activity).
Proprietary
• Condor Stork, based on Condor Class-Ads.
• gLite JDL (again based on Class-Ads).
• Not sure if Globus has/intends a similar definition in its new developments (e.g. SaaS); anyone?

JSDL Data Staging 1 and the HPC File Staging Profile
<jsdl:DataStaging>
  <jsdl:FileName>fileA</jsdl:FileName>
  <jsdl:CreationFlag>overwrite</jsdl:CreationFlag>
  <jsdl:DeleteOnTermination>true</jsdl:DeleteOnTermination>
  <jsdl:Source>
    <jsdl:URI>gsiftp://griddata1.dl.ac.uk:2811/myhome/fileA</jsdl:URI>
  </jsdl:Source>
  <jsdl:Target>
    <jsdl:URI>ftp://ngs.oerc.ox.ac.uk:2811/myhome/fileA</jsdl:URI>
  </jsdl:Target>
  <Credentials> … </Credentials>
</jsdl:DataStaging>
This example defines both the source and target within the same <DataStaging/> element, which is permitted in JSDL. However, the HPC File Staging Profile (Wasson et al. 2008), which is an extension to JSDL, limits the use of credentials to a single credential definition within a data staging element. Often, different credentials will be required for the source and the target.
JSDL Data Staging 2
<jsdl:DataStaging>
  <jsdl:FileName>fileA</jsdl:FileName>
  <jsdl:FilesystemName>DL_HOME</jsdl:FilesystemName>
  <jsdl:CreationFlag>overwrite</jsdl:CreationFlag>
  <jsdl:DeleteOnTermination>true</jsdl:DeleteOnTermination>
  <jsdl:Source>
    <jsdl:URI>gsiftp://griddata1.dl.ac.uk:2811/myhome/fileA</jsdl:URI>
  </jsdl:Source>
  <Credentials> … </Credentials>
</jsdl:DataStaging>
<jsdl:DataStaging>
  <jsdl:FileName>fileA</jsdl:FileName>
  <jsdl:FilesystemName>NGS_HOME</jsdl:FilesystemName>
  <jsdl:CreationFlag>overwrite</jsdl:CreationFlag>
  <jsdl:Target>
    <jsdl:URI>ftp://ngs.oerc.ox.ac.uk:2811/myhome/fileA</jsdl:URI>
  </jsdl:Target>
  <Credentials> … </Credentials>
</jsdl:DataStaging>
Coupled staging elements: a source data staging element for fileA and a corresponding target element for staging out of the same file. By specifying that the input file is deleted after the job has executed, this example simulates the effect of a data copy from one location to another through the staging host.
• No multiple data locations (alternative sources and sinks).
• More elements required (e.g. transfer requirements, file selectors, URI properties).
• Intended for compute and data staging, not really bulk data copying.

OGSA DMI
The OGSA Data Movement Interface (DMI) (Antonioletti et al. 2008) defines a number of XML constructs for describing and interacting with a data transfer activity. The data source and destination are each described separately with a Data End Point Reference (DEPR), which is a specialized form of WS-Address element (Box et al. 2004). In contrast to the JSDL data staging model, a DEPR facilitates the definition of one or more <Data/> elements within a <DataLocations/> element. This is used to define alternative locations for the data source and/or sink. In doing this, an implementation is then free to select between its supported protocols and retry different source/sink combinations from the available list.
This improves resilience and the likelihood of performing a successful data transfer by matching protocols supported by the service.

DEPR Example
<dmi:SourceDataEPR>
  <wsa:Address>http://www.ogf.org/ogsa/2007/08/addressing/none</wsa:Address>
  <wsa:Metadata>
    <dmi:DataLocations>
      <dmi:Data ProtocolUri="http://www.ogf.org/ogsadmi/2006/03/im/protocol/gridftp-v20"
                DataUrl="gsiftp://example.org/name/of/the/dir/">
        <dmi:Credentials><wsse:UsernameToken/></dmi:Credentials>
        <other stuff/>
      </dmi:Data>
      <dmi:Data ProtocolUri="urn:my-project:srm"
                DataUrl="srm://example.org/name/of/the/dir/">
        <dmi:Credentials><wsse:UsernameToken/></dmi:Credentials>
        <other stuff/>
      </dmi:Data>
    </dmi:DataLocations>
  </wsa:Metadata>
</dmi:SourceDataEPR>
<dmi:SinkDataEPR>
  . . . Similar to above but for the sink . . .
</dmi:SinkDataEPR>
Defines alternative locations for the data source and/or sink.

DMI cont..
There are some limitations:
• DMI is intended to describe only a single data transfer operation between one source and one sink. To do several transfers, multiple invocations of a DMI service factory would be required to create multiple DMI service instances.
• We require a single (atomic) message packet that wraps multiple transfers and whose delivery can be transacted, e.g. through a message router.
• Some of the existing constructs require extension / slight modification.
Therefore: a DMI v2 strawman proposal at OGF to canvass some new extensions and to propose a new bulk-copy doc that builds on DMI.

Bulk Data Copy Doc and JSDL Integration ?
<jsdl:JobDefinition>
  <jsdl:JobDescription>
    <jsdl:JobIdentification ... />
    <jsdl:Application>
      <!-- Option a) Embed BulkDataCopy document -->
      <other:BulkDataCopy ... />
      <!-- If Basic Profile compliance is important -->
      <jsdl-hpcpa:HPCProfileApplication>
        <jsdl-hpcpa:Executable>/usr/bin/datacopyagent.sh</jsdl-hpcpa:Executable>
        <jsdl-hpcpa:Argument>‘myBulkDataCopyDoc.xml’</jsdl-hpcpa:Argument>
        ...
      </jsdl-hpcpa:HPCProfileApplication>
    </jsdl:Application>
    <jsdl:Resources>
      <!-- Option b) Stage-in BulkDataCopy document -->
      <jsdl:DataStaging>
        <jsdl:FileName>myBulkDataCopyDoc.xml</jsdl:FileName>
        ...
      </jsdl:DataStaging>
    </jsdl:Resources>
  </jsdl:JobDescription>
</jsdl:JobDefinition>
Possible(?) options for integrating the proposed <BulkDataCopy/> document within JSDL: a) nesting within the <jsdl:Application/> element, or b) staging-in of a <BulkDataCopy/> document as input for the named executable. Why not?
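Returning to the DEPR construct from the OGSA DMI slides: the idea that an implementation selects between its supported protocols and retries different source/sink combinations can be sketched as follows. This is an illustrative sketch under assumptions, not DMI or DTS code. The function `copy_with_alternatives`, the short protocol labels and the injected `transfer` callable are hypothetical, and the sink URL in the usage example is invented for illustration.

```python
def copy_with_alternatives(sources, sinks, supported, transfer):
    """Try source/sink combinations until one transfer succeeds.

    sources, sinks: lists of (protocol, url) alternatives, in preference order
    supported:      set of protocol labels this worker can handle
    transfer:       callable(source_url, sink_url) that raises OSError on failure
    Returns the (source_url, sink_url) pair that worked.
    Raises RuntimeError when every usable combination has failed.
    """
    # Keep only the alternatives whose protocol this worker supports.
    usable_sources = [url for proto, url in sources if proto in supported]
    usable_sinks = [url for proto, url in sinks if proto in supported]

    errors = []
    for src in usable_sources:
        for dst in usable_sinks:
            try:
                transfer(src, dst)
                return src, dst
            except OSError as exc:
                # Record the failure and fall through to the next combination.
                errors.append((src, dst, str(exc)))
    raise RuntimeError(f"no source/sink combination succeeded: {errors}")


if __name__ == "__main__":
    def demo_transfer(src, dst):
        # Hypothetical stand-in: pretend the GridFTP endpoint is down.
        if src.startswith("gsiftp://"):
            raise OSError("connection refused")

    sources = [
        ("gridftp", "gsiftp://example.org/name/of/the/dir/"),
        ("srm", "srm://example.org/name/of/the/dir/"),
    ]
    sinks = [("ftp", "ftp://sink.example.org/dir/")]
    print(copy_with_alternatives(sources, sinks, {"gridftp", "srm", "ftp"}, demo_transfer))
```

Passing the transfer function in as a parameter keeps the retry logic independent of any particular protocol library, in the spirit of the contract driven worker described earlier.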