Resource Management

Reading: “A Resource Management Architecture for Metacomputing Systems”

What is Resource Management?
- Mechanisms for locating and allocating computational resources
  - Authentication
  - Process creation
  - Remote job submission
  - Scheduling
- Other resources that can be managed:
  - Memory
  - Disk
  - Networks

Resource Management Issues for Grid Computing
- Site autonomy
  - Resources owned by different organizations, in different administrative domains
  - Local policies for use, scheduling, security
- Heterogeneous substrate
  - Different local resource management systems
- Policy extensibility
  - Local sites need ability to customize their resource management policies

More Issues for Grid Computing
- Co-allocation
  - May need resources at several sites
  - Mechanism for allocating multiple resources, initiating computation, monitoring and managing
- On-line control
  - Adapt application requirements to resource availability

Specifying Resource and Job Requirements
- Resource requirements:
  - Machine type
  - Number of nodes
  - Memory
  - Network
- Job or scheduler parameters:
  - Directory
  - Executable
  - Arguments
  - Environment
  - Maximum time required

Resource and Job Specification
- Globus: Resource Specification Language (RSL)
    &(executable=myprog)
     (|(&(count=5)(memory>=64))
       (&(count=10)(memory>=32)))
- Condor: Classified ads
  - Resource owners advertise abilities and constraints
  - Applications advertise resource requests
  - Matchmaking: match offers & requests

Components of Globus Resource Management Architecture
- Resource specification using RSL
- Resource brokers: translate resource requirements into specifications
- Co-allocators: break down requests for multiple sites
- Local resource managers: apply local, site-specific resource management policies
- Information about available compute resources and their characteristics

Resource Specification Language
- Common notation for exchange of information between components
- API provided for manipulating RSL

RSL Syntax
- Elementary form: parenthesis clauses
    (attribute op value [ value … ] )
- Operators
  - Supported: <, <=, =, >=, >, !=
- Some supported attributes:
  - executable, arguments, environment, stdin, stdout, stderr, resourceManagerContact, resourceManagerName
- Unknown attributes are passed through
  - May be handled by subsequent tools

Constraints: “&”
- For example:
    & (count>=5) (count<=10)
      (max_time=240) (memory>=64)
      (executable=myprog)
- “Create 5-10 instances of myprog, each on a machine with at least 64 MB memory that is available to me for 4 hours”

Multirequest: “+”
- A multirequest allows us to specify multiple resource needs, for example
    + (& (count=5)(memory>=64) (executable=p1))
      (&(network=atm) (executable=p2))
  - Execute 5 instances of p1 on a machine with at least 64M of memory
  - Execute p2 on a machine with an ATM connection
- Multirequests are central to co-allocation

Resource Broker
- Takes high-level RSL specification
- Transforms into concrete specifications through “specialization” process
  - Locate resources that meet requirements
- Multiple brokers may service single request
- Application-specific brokers translate application requirements
- Output: complete specification of locations of resources; given to co-allocator

Examples of Resource Brokers
- Nimrod-G
  - Automates creation and management of large parametric experiments
  - Run application under wide range of input conditions and aggregate results
  - Queries MDS to find resources
  - Generates number of independent jobs
  - GRAM allocates jobs to computational nodes
  - Higher-level broker: allows user to specify time and cost constraints

Examples of Resource Brokers
- AppLeS
  - Application Level Scheduler
  - Map large number of independent tasks to dynamically varying pool of available computers
  - Use GRAM to locate resources and initiate and manage computation

Resource co-allocators
- May request resources at multiple sites
  - Two or more computers and networks
- Break multi-request into components
- Pass each component to resource manager
- Provide means for monitoring job status or terminating job
- Complex:
  - Two or more resource managers
  - Global state like
    availability of resources difficult to determine

Different co-allocation services
1. Require all resources to be available before job proceeds; fail globally if failure occurs at any resource
2. Allocate at least N out of M resources and return
3. Return immediately, but gradually return more resources as they become available
- Each useful for some class of applications

Concurrent Allocation
- If advance reservations are available:
  - Obtain list of available time slots from each participating resource manager and choose time slot
- Without reservations:
  - Optimistically allocate resources
  - Hope desired set will be available at future time
  - Use information service (MDS) to determine current availability of resources
  - Construct RSL request that is likely to succeed
  - If allocation fails, all started jobs must be terminated

Disadvantages of Concurrent Allocation Scheme
- Computational resources wasted while waiting for all requested resources to become available
- Application must be altered to perform barrier to synchronize startup across components
- Detecting failure of a resource is difficult, e.g. in queue-based local resource managers

Local Resource Managers
- Implemented with Globus Resource Allocation Manager (GRAM)
1. Processing RSL specifications representing resource requests
   - Deny request
   - Create one or more processes (jobs) that satisfy request
2. Enable remote monitoring and management of jobs
3. Periodically update MDS information service with current availability and capabilities of resources

GRAM (cont.)
- Interface between grid environment and entity that can create processes
  - E.g., Parallel scheduler or Condor pool
- GRAM may schedule resource itself
- More commonly, maps resource specification into a request to a local resource allocation mechanism
  - E.g., Condor, LoadLeveler, LSF
- Co-exists with local mechanisms

GRAM (cont.)
- GRAM API has functions for:
  - Submitting a job request: produces globally unique job handle
  - Canceling a job request
  - Asking when job request is expected to run
- Upon submission, can request that progress be signaled asynchronously to callback URL

GRAM Scheduling Model
- Jobs are either:
  - Pending: resources have not yet been allocated to the job
  - Active: resources allocated, job running
  - Done: when all processes have terminated and resources have been deallocated
  - Failed: job terminates due to:
    - explicit termination
    - error in request format
    - failure in resource management system
    - denial of access to resource

GRAM Components
- Gatekeeper
  - Responds to a request:
    1. Performs mutual authentication of user and resource
    2. Determines local user name for remote user
    3. Starts a job manager that executes as local user and handles request

GRAM Components (cont.)
- Job manager
  - Creates processes requested by user
    - Submits resource allocation requests to underlying resource management system (or does fork)
  - Monitors state of created processes
    - Notifies callback contact of state transitions
  - Implements control operations like termination

GRAM Components (cont.)
- GRAM reporter
  - Responsible for storing into MDS (information service) info about:
    - Scheduler structure
      - Support reservations?
      - Number of queues
    - Scheduler state
      - Currently active jobs
      - Expected wait time in queue
      - Total number of nodes and available nodes

Resource Management Architecture
[Figure: architecture diagram. Labels recovered: Application, Broker, RSL, RSL specialization, Queries & Info, Information Service, Ground RSL, Co-allocator, Simple ground RSL, Local resource managers (GRAM, GRAM, GRAM), LSF, EASY-LL, NQE]

Job Submission Interfaces
- Globus Toolkit includes several command line programs for job submission
  - globus-job-run: Interactive jobs
  - globus-job-submit: Batch/offline jobs
  - globusrun: Flexible scripting infrastructure
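
Sketch: GRAM job states as a state machine

The GRAM scheduling model above names four job states (Pending, Active, Done, Failed) and says the job manager notifies a callback contact of state transitions. The Python sketch below is one way to picture that, not the GRAM API: the `Job` class, the `ALLOWED` table, and the handle `"job-1"` are hypothetical names introduced for illustration. The slides list the states but give no explicit transition table, so the transitions used here are an assumption read off the state descriptions.

```python
from enum import Enum


class JobState(Enum):
    PENDING = "pending"  # resources have not yet been allocated
    ACTIVE = "active"    # resources allocated, job running
    DONE = "done"        # all processes terminated, resources deallocated
    FAILED = "failed"    # explicit termination, bad request, RM failure, access denied


# Assumed transition table (not given in the slides): a job can fail while
# pending or active; Done and Failed are treated as terminal states.
ALLOWED = {
    JobState.PENDING: {JobState.ACTIVE, JobState.FAILED},
    JobState.ACTIVE: {JobState.DONE, JobState.FAILED},
    JobState.DONE: set(),
    JobState.FAILED: set(),
}


class Job:
    """Hypothetical job record; illustrates the model, not the real GRAM API."""

    def __init__(self, handle, callback=None):
        self.handle = handle            # stands in for the globally unique job handle
        self.state = JobState.PENDING   # every job starts out pending
        self.callback = callback        # stands in for the callback contact

    def transition(self, new_state):
        """Move to new_state and notify the callback, or raise if not allowed."""
        if new_state not in ALLOWED[self.state]:
            raise ValueError(
                f"illegal transition {self.state.name} -> {new_state.name}"
            )
        old_state = self.state
        self.state = new_state
        if self.callback is not None:
            self.callback(self.handle, old_state, new_state)


# Usage: record every notification the callback receives.
events = []
job = Job("job-1", callback=lambda h, old, new: events.append((h, old.name, new.name)))
job.transition(JobState.ACTIVE)
job.transition(JobState.DONE)
```

After the two transitions the job is in a terminal state, so any further call to `transition` raises `ValueError`. Keeping the table separate from the `Job` class makes it easy to swap in a different set of transitions if the real system allows others.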