University Dortmund
EuroPar 2004
Pisa, Italy
Ramin Yahyapour
CEI, University of Dortmund
Background on Resource Management and Scheduling
Job scheduling and resource management on HPC resources
Transition to Grid RM and Grid Scheduling
Current state-of-the-art
Future of GRMS
Requirements, outlook for upcoming GRMS
2
We all know what “the Grid” is…
one of the many definitions:
“Resource sharing & coordinated problem solving in dynamic, multi-institutional virtual organizations” (Ian Foster)
however, the actual scope of “the Grid” is still quite controversial
Many people consider High Performance Computing (HPC) as the main Grid application.
today’s Grids are mostly Computational Grids or Data Grids with HPC resources as building blocks
thus, Grid resource management is much related to resource management on HPC resources (our starting point).
we will return to a broader Grid scope and its implications later
3
HPC resources are usually parallel computers or large scale clusters
The local resource management system (RMS) for such a resource includes:
configuration management
monitoring of machine state
job management
There is no standard for this resource management.
Several different proprietary solutions are in use.
Examples for job management systems:
PBS, LSF, NQS, LoadLeveler, Condor
4
[Diagram: a central management server (control service, job master) provides resource and job monitoring and management services; each compute/processing node runs its own resource/job monitor.]
5
A job is a computational task
that requires processing capabilities (e.g. 64 nodes) and
is subject to constraints (e.g. a specific other job must finish before the start of this job)
The job information is provided by the user:
resource requirements: CPU architecture, number of nodes and their speed, memory size per CPU, software libraries, licenses, I/O capabilities
job description
additional constraints and preferences
The format of job description is not standardized, but usually very similar
6
Simple job script (the whole job file is a shell script; information for the RMS is given as comments; the actual job is started within the script):
#!/bin/csh
# resource limits: allocate needed nodes
#PBS -l nodes=1
#
# resource limits: amount of memory and CPU time ([[h:]m:]s)
#PBS -l mem=256mb
#PBS -l cput=2:00:00
# path/filename for standard output
#PBS -o master:/mypath/myjob.out
./my-task
7
The user “submits” the job to the RMS, e.g. by issuing “qsub jobscript.pbs”
The user can control the job
qsub: submit
qstat: poll status information
qdel: cancel job
It is the task of the resource management system to start a job on the required resources
Current system state is taken into account
8
[Diagram: “qsub jobscript” sends the job to the management server; its scheduler dispatches the job for execution on the processing nodes, each running a job & resource monitor.]
9
Time sharing:
The local scheduler starts multiple processes per physical CPU with the goal of increasing resource utilization.
multi-tasking
The scheduler may also suspend jobs to keep the system load under control
preemption
Space sharing:
The job uses the requested resources exclusively; no other job is allocated to the same set of CPUs.
The job has to be queued until sufficient resources are free.
10
Batch Jobs vs interactive jobs
batch jobs are queued until execution
interactive jobs need immediate resource allocation
Parallel vs. sequential jobs
a job requires several processing nodes in parallel
the majority of HPC installations are used to run batch jobs in space-sharing mode!
a job is not influenced by other co-allocated jobs
the assigned processors, node memory, caches etc. are exclusively available for a single job.
overhead for context switches is minimized
important aspects for parallel applications
11
A job is preempted by interrupting its current execution
the job might be put on hold on a CPU set and later resumed; the job stays resident on those nodes (consuming memory)
alternatively, a checkpoint is written and the job is migrated to another resource where it is restarted later
Preemption can be useful to reallocate resources due to new job submissions (e.g. with higher priority) or if a job is running longer than expected.
12
A job is assigned to resources through a scheduling process
responsible for identifying available resources
matching job requirements to resources
making decision about job ordering and priorities
HPC resources are typically subject to high utilization; therefore, resources are not immediately available and jobs are queued for future execution
time until execution is often quite long (many production systems have an average delay until execution of >1h)
jobs may run for a long time (several hours, days or weeks)
13
Minimizing the Average Weighted Response Time (AWRT):

  AWRT = ( Σ_{j ∈ Jobs} w_j · (t_j − r_j) ) / ( Σ_{j ∈ Jobs} w_j )

r_j: submission time of a job
t_j: completion time of a job
w_j: weight/priority of a job
Maximize machine utilization / minimize idle time
these are conflicting objectives; the objective criterion is usually static for an installation and implicitly given by the scheduling algorithm
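The AWRT formula above can be checked with a few lines of Python (an illustrative sketch; the job tuples and weights below are made-up values):

```python
# Average Weighted Response Time: sum of w_j * (t_j - r_j) over sum of w_j,
# with r_j = submission time, t_j = completion time, w_j = weight/priority.
def awrt(jobs):
    """jobs: list of (r, t, w) tuples."""
    total_weight = sum(w for (r, t, w) in jobs)
    return sum(w * (t - r) for (r, t, w) in jobs) / total_weight

# two example jobs: response times 10 and 25, weights 1 and 2
print(awrt([(0, 10, 1.0), (5, 30, 2.0)]))  # (1*10 + 2*25) / 3 = 20.0
```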
14
A user job enters a job queue, the scheduler (its strategy) decides on start time and resource allocation of the job.
[Diagram: the Grid user's job description enters the local job queue; the scheduler builds the schedule, and the job execution management starts the job on the HPC machine, whose nodes each run a node job management.]
15
Well known and very simple: First-Come First-Serve
Jobs are started in order of submission
Ad-hoc scheduling when resources become free again
no advance scheduling
Advantage:
simple to implement
easy to understand and fair for the users
(job queue represents execution order)
does not require a priori knowledge about job lengths
Problems:
performance can degrade severely; overall utilization of a machine can suffer if highly parallel jobs occur, that is, if a significant share of nodes is requested for a single job.
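The FCFS strategy above can be sketched in a few lines of Python (a deliberately simplified model: all jobs are assumed already queued in submission order, the node counts and runtimes are made up, and each job is assumed to fit on the machine):

```python
# First-Come First-Serve with space sharing: jobs start strictly in
# submission order; a job waits until enough nodes are free, and no
# later job may overtake it.
def fcfs(jobs, total_nodes):
    """jobs: list of (nodes, runtime) in submission order.
    Returns the start time of each job."""
    running = []                       # (finish_time, nodes) of running jobs
    free, now, starts = total_nodes, 0, []
    for nodes, runtime in jobs:
        while free < nodes:            # wait for enough free nodes
            running.sort()
            finish, n = running.pop(0) # next job to finish releases nodes
            now = max(now, finish)
            free += n
        starts.append(now)
        running.append((now + runtime, nodes))
        free -= nodes
    return starts

# 4-node machine: job 2 needs the whole machine, so the small job 3
# must wait behind it even though nodes are idle early on.
print(fcfs([(2, 10), (4, 5), (1, 1)], total_nodes=4))  # → [0, 10, 15]
```

Note how job 3 (1 node, 1 time unit) starts only at time 15 although two nodes were idle from time 0 to 10 — exactly the utilization problem the slide describes.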
16
[Diagram: FCFS in action: the scheduler takes jobs 1, 2, 3, 4, … from the job queue in submission order and places them in the schedule of the compute resource's processing nodes over time.]
17
Improvement over FCFS
A job can be started before an earlier submitted job if it does not delay the first job in the queue
may still cause delay of other jobs further down the queue
Some fairness is still maintained
Advantage:
utilization is improved
Information about the job execution length is needed
sometimes difficult to provide; user estimations are not necessarily accurate
Jobs are usually terminated after exceeding their allocated execution time; otherwise users might deliberately underestimate the job length to get an earlier job start time
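The backfilling rule described above can be sketched as follows (a simplified snapshot at one scheduling instant; the queue contents, free node count and the head job's reserved start time are made-up values, and nodes freed before the reservation are ignored):

```python
# EASY-style backfilling: a later job may start immediately if it fits
# in the currently free nodes AND finishes before the first queued job's
# reserved start time, so the head job is never delayed.
def backfill_order(queue, free_nodes, head_reservation):
    """queue: list of (nodes, est_runtime); queue[0] is the head job.
    Returns indices of jobs that may be backfilled right now."""
    started = []
    for i, (nodes, runtime) in enumerate(queue[1:], start=1):
        if nodes <= free_nodes and runtime <= head_reservation:
            started.append(i)
            free_nodes -= nodes   # these nodes are now busy
    return started

# the head job (8 nodes) waits for a reservation at t=10; 3 nodes are free:
# job 1 (2 nodes, 5 units) fits, job 2 needs too many nodes, job 3 runs too long
print(backfill_order([(8, 20), (2, 5), (4, 3), (1, 12)],
                     free_nodes=3, head_reservation=10))  # → [1]
```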
18
[Diagram: backfilling example: Job 3 is started before Job 2, as it does not delay it; the scheduler fills a gap in the schedule of the processing nodes.]
19
However, if a job finishes earlier than expected, the backfilling causes delays that otherwise would not occur
need for accurate job length information (difficult to obtain)
[Diagram: a job finishes earlier than estimated; the previously backfilled job now delays the waiting jobs in the schedule.]
20
After the scheduling process, the RMS is responsible for the job execution:
sets up the execution environment for a job,
starts a job,
monitors job state, and
cleans up after execution (copying output files etc.), and
notifies the user (e.g. by sending email)
21
Parallel job scheduling algorithms are well studied; performance is usually acceptable
Real implementations often have additional requirements rather than a need for more complex theoretical algorithms:
Prioritization of jobs, users, or groups while maintaining fairness
Partitioning of machines
e.g.: interactive and development partition vs. production batch partitions
Combination of different queue characteristics
For instance, the Maui Scheduler is often deployed as it is quite flexible in terms of prioritization, backfilling, fairness etc.
22
Current state of the art
More resource types come into play:
Resources are any kind of entity, service or capability to perform a specific task
processing nodes, memory, storage, networks, experimental devices, instruments
data, software, licenses, people
The task/job/activity can also be of a broader meaning
a job may involve different resources and consists of several activities in a workflow with according dependencies
The resources are distributed and may belong to different administrative domains
HPC is still the key application for Grids. Consequently, the main resources in a Grid are the previously considered HPC machines with their local RMS
24
Several security-related issues have to be considered: authentication, authorization, accounting
who has access to a certain resource?
what information can be exposed to whom?
There is lack of global information:
what resources are when available for an activity?
The resources are quite heterogeneous:
different RMS in use
individual access and usage paradigms
administrative policies have to be considered
25
Source: Ian Foster
26
Grid Resource Management System consists of :
Local resource management system (Resource Layer)
Basic resource management unit
Provide a standard interface for using remote resources e.g. GRAM, etc.
Global resource management system (Collective Layer)
Coordinates all local resource management systems within multiple or distributed Virtual Organizations (VOs)
Provides high-level functionalities to efficiently use all resources
Job Submission
Resource Discovery and Selection
Scheduling
Co-allocation
Job Monitoring, etc.
e.g. Meta-scheduler, Resource Broker, etc.
27
Grid layered architecture (analogous to the Internet stack: Application / Transport / Internet / Link):
Application
Collective: “coordination of several resources”: infrastructure services, application services
Resource: “add resource”: negotiate access, control access and utilization
Connectivity: “communication with internal resource functions and services”
Fabric: “control local execution”
Source: Ian Foster
28
[Diagram: Grid middleware: the user/application contacts a resource broker and other higher-level services, which build on core Grid infrastructure services (information, monitoring and security services). Each resource's local resource management system (e.g. PBS, LSF, …) is accessed through a Grid resource manager.]
29
Globus Toolkit
common source for Grid middleware
GT2
GT3 – Web/GridService-based
GT4 – WSRF-based
GRAM is responsible for providing a service for a given job specification that can:
Create an environment for a job
Stage files to/from the environment
Submit a job to a local scheduler
Monitor a job
Send job state change notifications
Stream a job’s stdout/err during execution
30
Job is described in the resource specification language
Discover a Job Service for execution
Job Manager in Globus 2.x (GT2)
Master Managed Job Factory Service (MMJFS) in Globus 3.x (GT3)
Alternatively, choose a Grid Scheduler for job distribution
Grid scheduler selects a job service and forwards job to it
A Grid scheduler is not part of Globus
The Job Service prepares job for submission to local scheduling system
If necessary, file stage-in is performed
e.g. using the GASS service
The job is submitted to the local scheduling system
If necessary, file stage-out is performed after job finishes.
31
[Diagram: the user/application hands an RSL description to the resource broker, which queries the MDS information service and refines it into specialized RSL; resource allocation is then performed via GRAM on each resource's local RMS (PBS, LSF, …).]
32
Grid jobs are described in the resource specification language (RSL)
RSL Version 1 is used in GT2
It has an LDAP filter-like syntax that supports boolean expressions:
Example:
& (executable = a.out)
(directory = /home/nobody )
(arguments = arg1 "arg 2")
(count = 1)
33
Job states: pending, stage-in, active, suspended, stage-out, done, failed
34
With the transition to Web/Grid Services, job management is handled by
the Master Managed Job Factory Service (MMJFS)
the Managed Job Factory Service (MJFS)
the Managed Job Service (MJS)
The client contacts the MMJFS
MMJFS informs the MJFS to create a MJS for the job
The MJS takes care of managing the job actions.
interact with local scheduler
file staging
store job status
35
[Diagram: the user/application contacts the Master Managed Job Factory Service, which has a Managed Job Service created for the job; the MJS interacts with the local scheduler, the Resource Information Provider Service, and the File Streaming Factory Service and File Streaming Services.]
Globus as a toolkit does not perform scheduling and automatic resource selection
36
[Diagram: numbered job workflow through the services: (1) the client contacts the job monitoring and job submission services, (2-3) the job is handed to the scheduling service, (4-5) the resource selection service consults the resource information service with its resource preference and information providers, (6) the resource reservation service reserves resources, (7-8) the job manager service (MJS) submits to the local resource manager (PBS), and (9) status flows back via the local resource monitoring service (RIPS).]
37
Source: Jin-Soo Kim
The version 2 of RSL is XML-based
Two namespaces are used:
rsl: for basic types as int, string, path, url
gram: for the elements of a job
*GNS = “http://www.globus.org/namespaces“
<?xml version="1.0" encoding="UTF-8"?>
<rsl:rsl xmlns:rsl="GNS/2003/04/rsl" xmlns:gram="GNS/2003/04/rsl/gram" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="
GNS/2003/04/rsl
./schema/base/gram/rsl.xsd
GNS/2003/04/rsl/gram
./schema/base/gram/gram_rsl.xsd">
<gram:job>
<gram:executable><rsl:path>
<rsl:stringElement value="/bin/a.out"/>
</rsl:path></gram:executable>
</gram:job>
</rsl:rsl>
38
<count> (type = rsl:integerType)
Number of processes to run (default is 1)
<hostCount> (type = rsl:integerType)
On SMP multicomputers, number of nodes to distribute the “count” processes across
count/hostCount = number of processes per host
<queue> (type = rsl:stringType)
Queue into which to submit job
<maxWallTime> (type = rsl:longType)
Maximum wall clock runtime in minutes
<maxCpuTime> (type = rsl:longType)
Maximum CPU runtime in minutes
<maxTime> (type = rsl:longType)
Only applies if the above are not used
Maximum wall clock or CPU runtime (scheduler's choice) in minutes
39
globus-job-run: interactive jobs
globus-job-submit: batch jobs
globusrun: takes RSL as input
40
A simple job submission requiring 2 nodes:
globus-job-run -np 2 -s myprog arg1 arg2
A multirequest specifies multiple resources for a job:
globus-job-run -dumprsl -: host1 /bin/uname -a \
                        -: host2 /bin/uname -a
+ ( &(resourceManagerContact="host1")
    (subjobStartType=strict-barrier) (label="subjob 0")
    (executable="/bin/uname") (arguments="-a") )
  ( &(resourceManagerContact="host2")
    (subjobStartType=strict-barrier) (label="subjob 1")
    (executable="/bin/uname") (arguments="-a") )
41
The full flexibility of RSL is available through the command line tool globusrun
Support for file staging: executable and stdin/stdout
Example:
globusrun -o -r hpc1.acme.com/jobmanager-pbs '&(executable=$(HOME)/a.out) (jobtype=single) (queue=time-shared)'
42
The deliverables of the GGF Working Group JSDL:
A specification for an abstract standard Job Submission Description Language (JSDL) that is independent of language bindings, including:
the JSDL feature set and attribute semantics,
the definition of the relationships between attributes and the ranges of attribute values.
A normative XML Schema corresponding to the JSDL specification.
A document of translation tables to and from the scheduling languages of a set of popular batch systems, for both the job requirement and resource description attributes of those languages that are relevant to JSDL.
43
The job attribute categories will include:
Job Identity Attributes
ID, owner, group, project, type, etc.
Job Resource Attributes
hardware, software, including applications, Web and Grid Services, etc.
Job Environment Attributes
environment variables, argument lists, etc.
Job Data Attributes
databases, files, data formats, and staging, replication, caching, and disk requirements, etc.
Job Scheduling Attributes
start and end times, duration, immediate dependencies etc.
Job Security Attributes
authentication, authorisation, data encryption, etc.
44
Problem: resource management systems differ across each component
[Diagram: pipeline from the application (task definition) via submit and task analysis (scheduler) to staging and task execution on the executing host.]

LSF
Interface format: API plus batch utilities via “LSF scripts”
Execution environment: User: local disk exported; System: remote initialized (option)
Platform mix: Unix, Windows

Grid Engine
Interface format: GDI API interface plus command line interface
Execution environment: System: remote initialized, with SGE local variables exported
Platform mix: Unix only

PBS
Interface format: API (script option); batch utilities via “PBS scripts”
Execution environment: System: remote initialized, with PBS local variables exported
Platform mix: Unix only

Source: Hrabri Rajic
45
GGF Working Group “Distributed Resource Management Application API” (DRMAA)
From the charter:
Develop an API specification for the submission and control of jobs to one or more Distributed Resource Management (DRM) systems.
The scope of this specification is all the high-level functionality which is necessary for an application to consign a job to a DRM system, including common operations on jobs like termination or suspension.
The objective is to facilitate the direct interfacing of applications to today's DRM systems by application builders, portal builders, and Independent Software Vendors (ISVs).
46
Source: Hrabri Rajic
The remote job can be in the following states:
system hold
user hold
system and user hold simultaneously
queued
active
system suspended
user suspended
system and user suspended simultaneously
running
finished (un)successfully
47
Condor-G is a Condor system enhanced to manage Globus jobs.
It provides two main features
Globus Universe: interface for submitting, queuing and monitoring jobs that use Globus resources
GlideIn: system for efficient execution of jobs on remote Globus resources
Condor-G runs as a “personal Condor” system
daemons run as non-privileged user processes
each user runs her/his Condor-G
48
Globus is used to run the Condor daemons on Grid resources
Condor daemons run as a Globus-managed job
the GRAM service starts the daemons rather than the Condor jobs
When the resources run these GlideIn jobs, they join the personal Condor pool
These daemons can be used to launch a job from Condor-G to a Globus resource
Jobs are submitted as Condor jobs and are matched and run on the Grid resources
the daemons receive jobs from the user's Condor queue
this combines the benefits of Globus and Condor
49
[Figure: glide-ins. Source: ANL/USC ISI]
50
DAGMan allows you to specify the dependencies between your Condor-G jobs, so it can manage them automatically for you.
(e.g., “Don’t run job “B” until job “A” has completed successfully.”)
A DAG is defined by a .dag file, listing each of its nodes and their dependencies:
# diamond.dag
Job A a.sub
Job B b.sub
Job C c.sub
Job D d.sub
Parent A Child B C
Parent B C Child D
[Diagram: the diamond DAG: Job A at the top, Jobs B and C in the middle, Job D at the bottom.]
each node will run the Condor-G job specified by its accompanying Condor submit file
Source: Miron Livny
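The dependency handling DAGMan performs on such a file can be modeled in a few lines of Python (an illustrative sketch of the scheduling rule, not DAGMan's actual implementation):

```python
from collections import defaultdict

# A node becomes runnable once all of its parents have completed.
def run_order(edges, nodes):
    """edges: list of (parent, child). Returns one valid execution order."""
    parents = defaultdict(set)
    for p, c in edges:
        parents[c].add(p)
    done, order = set(), []
    while len(done) < len(nodes):
        ready = [n for n in nodes if n not in done and parents[n] <= done]
        if not ready:
            raise ValueError("cycle in DAG")
        for n in sorted(ready):   # deterministic order for the example
            order.append(n)
            done.add(n)
    return order

# the diamond DAG from diamond.dag above: B and C wait for A, D waits for both
print(run_order([("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")],
                ["A", "B", "C", "D"]))  # → ['A', 'B', 'C', 'D']
```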
51
How to select resources in the Grid?
Resource-level scheduler
low-level scheduler, local scheduler, local resource manager
scheduler close to the resource, controlling a supercomputer, cluster, or network of workstations, on the same local area network
Examples: Open PBS, PBS Pro, LSF, SGE
Enterprise-level scheduler
Scheduling across multiple local schedulers belonging to the same organization
Examples: PBS Pro peer scheduling, LSF Multicluster
Grid-level scheduler
also known as super-scheduler, broker, community scheduler
Discovers resources that can meet a job’s requirements
Schedules across lower level schedulers
53
Discovers & selects the appropriate resource(s) for a job
If selected resources are under the control of several local schedulers, a meta-scheduling action is performed
Architecture:
Centralized: all lower-level schedulers are under the control of a single Grid scheduler
not realistic in global Grids
Distributed: lower level schedulers are under the control of several grid scheduler components; a local scheduler may receive jobs from several components of the grid scheduler
54
[Diagram: the Grid user submits to the Grid scheduler, which dispatches jobs to the local schedulers, job queues and schedules of machines 1, 2 and 3.]
55
GGF document: “10 Actions of Super Scheduling” (GFD-I.4)
Phase One – Resource Discovery
1. Authorization Filtering
2. Application Definition
3. Min. Requirement Filtering
Phase Two – System Selection
4. Information Gathering
5. System Selection
Phase Three – Job Execution
6. Advance Reservation
7. Job Submission
8. Preparation Tasks
9. Monitoring Progress
10. Job Completion
11. Clean-up Tasks
Source: Jennifer Schopf
56
A Grid scheduler answers the question: to which local resource manager(s) should this job be submitted?
resources may dynamically join and leave a computational grid
not all currently unused resources are available to grid jobs:
resource owner policies such as “maximum number of grid jobs allowed”
it is hard to predict how long jobs will wait in a queue
57
Most systems do not provide advance information about future job execution
user information is not accurate, as mentioned before; newly arriving jobs may surpass current queue entries due to higher priority
Grid scheduler might consider current queue situation, however this does not give reliable information for future executions:
A job may wait long in a short queue while it would have been executed earlier on another system.
Available information:
Grid information service gives the state of the resources and possibly authorization information
Prediction heuristics: estimate job’s wait time for a given resource, based on the current state and the job’s requirements.
58
Distribute jobs in order to balance load across resources
not suitable for large scale grids with different providers
Data affinity: run job on the resource where data is located
Use heuristics to estimate job execution time.
Best-fit: select the set of resources with the smallest capabilities and capacities that can meet job’s requirements
Quality of service of a resource or its local resource management system:
what features does the local RMS have?
can they be controlled from the Grid scheduler?
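The best-fit rule from the list above can be sketched as follows (illustrative only; the cluster names and node counts are made up):

```python
# Best-fit resource selection: pick the resource with the smallest
# capacity that still meets the job's requirements, leaving larger
# resources free for larger jobs.
def best_fit(resources, required_nodes):
    """resources: dict name -> free_nodes. Returns the name of the
    feasible resource with the fewest free nodes, or None."""
    feasible = {name: n for name, n in resources.items()
                if n >= required_nodes}
    if not feasible:
        return None          # no resource can host the job right now
    return min(feasible, key=feasible.get)

# a 12-node job fits on clusterA and clusterB; clusterB wastes less capacity
print(best_fit({"clusterA": 128, "clusterB": 16, "clusterC": 8}, 12))  # → clusterB
```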
59
Working Group in the Global Grid Forum to
“define the attributes of a lower-level scheduling instance that can be exploited by a higher-level scheduling instance.”
The following attributes have been defined:
Attributes of allocation properties
Revocation of an allocation
The local scheduler reserves the right to withdraw a given allocation.
Guaranteed completion time of allocation,
A deadline for job completion is provided by the local scheduler.
Guaranteed Number of Attempts to Complete a Job
A local scheduler will retry a given job task, e.g. useful for data transfer actions.
Allocations run-to-completion
A job is not preempted if it has been started
Exclusive Allocations
A job has exclusive access to the given resources; e.g. no time-sharing is performed
Malleable Allocations
The given resource set may change during runtime; e.g. a computational job may gain or lose processors (moldable job)
60
Attributes of available information
Access to tentative schedule
The local scheduler exposes its schedule of future allocations
Option: Only the projected start time of a specified allocation is available
Option: Only partial information on the current schedule is available
Exclusive control
The local scheduler is exclusively in charge of the resources; no other jobs can appear on the resources
Event Notification
The local scheduler provides an event subscription service
Attributes for manipulating allocation execution
Preemption
Checkpointing
Migration
Restart
61
Attributes for requesting resources
Allocation Offers
The local system provides an interface to request offers for an allocation
Allocation Cost or Objective Information
The local scheduler can provide cost or objective information
Advance Reservation
Allocations can be reserved in advance
Requirement for Providing Maximum Allocation Length in Advance
The higher-level scheduler must provide a maximum job execution length
Deallocation Policy
A policy applies for the allocation that must be met to stay valid
Remote Co-Scheduling
A schedule can be generated by a higher-level instance and imposed on the local scheduler
Consideration of Job Dependencies
The local scheduler can deal with dependency information of jobs; e.g. for workflows
62
An open-source implementation of an OGSA-based metascheduler for VOs
Supports the emerging WS-Agreement specification
Supports GT GRAM
Fills in gaps in the existing resource management picture
Contributed by Platform to the Globus Toolkit
Extensible, Open Source framework for implementing metaschedulers
Provides basic protocols and interfaces to help resources work together in heterogeneous environments
63
RIPS = Resource Information Provider Service
Source: Chris Smith
[Diagram: CSF architecture: Platform LSF users and Globus Toolkit users reach the meta-scheduler (job service, reservation service, queuing service) in a Grid service hosting environment; a global information service aggregates RIPS data; jobs reach SGE and PBS via GRAM, and Platform LSF via a metascheduler plugin / RM adapter.]
64
Source: Chris Smith
[Diagram: inside the Global Information Service: an index service (registry, service data aggregator, SD providers) answers resource and cluster information requests; a manager loads and stores resource, job and reservation information from the RIPS instances into a database.]
65
[Diagram: users of VO “A” and VO “B” submit to their community schedulers, which schedule across resources from organizations 1, 2 and 3.]
Source: Chris Smith
66
Job Service:
creates, monitors and controls compute jobs
Reservation Service:
guarantees resources are available for running a job
Queuing Service:
provides a service where administrators can customize and define scheduling policies at the VO level and/or at the different resource manager level
Defines an API for plug-in schedulers
RM Adaptor Service:
provides a Grid service interface that bridges the Grid service protocol and resource managers (LSF, PBS, SGE, Condor and other RMs)
67
MMJFS = Master Managed Job Factory Service
MJS = Managed Job Service
Blue indicates a Grid Service hosted in a GT3 container
[Diagram: an index service connects three sites, each running MMJFS, an MJS for the local scheduler, and RIPS: LSF at site A (MMJFS on node1), PBS at site B (MMJFS on node2), SGE at site C (MMJFS on node3); jobs are run via managed-job-globusrun.]
68
Source: Chris Smith
[Diagram: the virtual organization's queuing, job and reservation services register with an index service and dispatch to the sites: an RM adapter with RIPS for LSF at site A, MMJFS/MJS with RIPS for PBS at site B, and SGE with RIPS at site C.]
69
Source: Chris Smith
In CSF, Job Service instances are “submitted” to the Queue Service for dispatch to a resource manager.
The Queue Service provides a plug in API for extending the scheduling algorithms provided by default with CSF.
The Queue Service:
loads and validates configuration information
loads all configured scheduler plugins
calls the plugin API functions:
schedInit() after loading the plugin successfully
schedOrder() when a new job is submitted
schedMatch() during the scheduling cycle
schedPost() before the scheduling cycle ends, and after scheduling decisions are sent to the job service instances
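The shape of such a plugin can be sketched in Python (a hedged illustration only: the actual CSF plugin interface is defined by CSF itself, and the job/resource dictionaries used here are made up):

```python
# Illustrative queue-service scheduler plugin with the four callbacks
# named above: init, ordering on submission, matching per cycle, post.
class SchedulerPlugin:
    def sched_init(self, config):
        """Called once after the plugin is loaded successfully."""
        self.config = config

    def sched_order(self, queue, job):
        """Called when a new job is submitted; here: simple FCFS order."""
        queue.append(job)
        return queue

    def sched_match(self, job, resources):
        """Called during the scheduling cycle; match job to resources."""
        return [r for r in resources if r["free_nodes"] >= job["nodes"]]

    def sched_post(self, decisions):
        """Called before the cycle ends, after decisions are sent out."""
        return decisions
```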
70
[Diagram: the GridLab GRMS broker: a job receiver feeds the jobs queue; the GRMS scheduler cooperates with an authorization system, information services, data management, adaptive resource discovery, a file transfer unit, an execution unit, SLA negotiation, a workflow manager, resource reservation, a prediction unit, monitoring and an application manager, on top of Globus and other local resources (managers).]
Source: Jarek Nabrzyski
71
Reliable and predictable delivery of a service
Quality of service for a job service:
Reliable job submission: two-phase commit
Predictable start and end time of the job
Advance reservation assures start time and throughput
Fault tolerance/recovery:
migrate the job to another resource before the fault occurs: the job continues
migrate after the fault: the job is restarted
rerun the job on the same resource after repair
Allocate multiple resources for a job
72
It is often requested that several resources are used for a single job.
that is, a scheduler has to assure that all resources are available when needed.
in parallel (e.g. visualization and processing)
with time dependencies (e.g. a workflow)
The task is especially difficult if the resources belong to different administrative domains.
The actual allocation time must be known for co-allocation,
or the different local resource management systems must synchronize with each other (wait for the availability of all resources)
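The first option, knowing the allocation time in advance, amounts to finding a common free slot across all resources, which can be illustrated with a small sketch (the free-window lists and duration below are made-up values):

```python
# Co-allocation sketch: find the earliest start time at which ALL
# resources are simultaneously free for the requested duration.
def earliest_common_slot(windows_per_resource, duration):
    """windows_per_resource: list (one per resource) of lists of
    (start, end) free windows, all in the same time units.
    Returns the earliest common start time, or None."""
    candidates = sorted(s for ws in windows_per_resource for (s, e) in ws)
    for t in candidates:  # only window starts can be earliest answers
        if all(any(s <= t and t + duration <= e for (s, e) in ws)
               for ws in windows_per_resource):
            return t
    return None

# resource 1 is free in [0,4] and [6,12]; resource 2 in [2,9]:
# a 3-unit job fits on both only from t=6
print(earliest_common_slot([[(0, 4), (6, 12)], [(2, 9)]], duration=3))  # → 6
```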
73
Example: multi-site job execution
[Diagram: the Grid scheduler splits a multi-site job across the schedulers, job queues and schedules of machines 1, 2 and 3.]
A job uses several resources at different sites in parallel.
Network communication is an issue.
74
Co-allocation and other applications require a priori information about precise resource availability
With the concept of advance reservation, the resource provider guarantees a specified resource allocation.
includes a two- or three-phase commit for agreeing on the reservation
Implementations:
GARA/DUROC/SNAP provide interfaces for Globus to create advance reservations
implementations for network QoS are available
setup of dedicated bandwidth between endpoints
75
The interaction between local scheduling and higher-level Grid scheduling is currently a one-way communication
current local schedulers are not optimized for Grid-use
limited information available about future job execution
a site is usually selected by a Grid scheduler and the job enters the remote queue.
The decision about job placement is inefficient.
The actual job execution time is usually not known.
Co-allocation is a problem, as many systems do not provide advance reservation
76
[Diagram: where should the Grid scheduler put the Grid user's job? Machine 1 has 15 jobs running and 20 queued, machine 2 has 5 running and 2 queued, machine 3 has 40 running and 80 queued.]
77
Decision making is difficult for the Grid scheduler
limited information about local schedulers is available
available information may not be reliable
Possible information:
queue length, running jobs
detailed information about the queued jobs
execution length, process requirements,…
tentative schedule about future job executions
This information is often technically not provided by the local scheduler
In addition, this information may be subject to privacy concerns!
78
Consider a workflow with 3 short steps (e.g. 1 minute each) that depend on each other
Assume available machines with an average queue length of 1 hour.
The Grid scheduler can only submit the subsequent step if the previous job step is finished.
Result:
The completion time of the workflow may be larger than 3 hours
(compared to 3 minutes of execution time)
Current Grids are suitable for simple jobs, but still quite inefficient in handling more complex applications
Need for better coordination of higher- and lower-level scheduling!
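The arithmetic behind this example can be written out in a couple of lines (using the slide's assumed averages: a 60-minute queue wait and 1-minute run per step):

```python
# Each workflow step must be submitted only after the previous one
# finishes, so every step pays the full queue wait again.
steps, queue_wait, run_time = 3, 60, 1          # minutes
completion = steps * (queue_wait + run_time)    # sequential submission
print(f"{completion} min total for {steps * run_time} min of computation")
# → 183 min total for 3 min of computation
```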
79
Outlook on future Grid Resource Management and Scheduling
Assume a data-intensive simulation that should be visualized and steered during runtime!
[Diagram: a remote center reads and generates TB of data; WAN transfer to the compute resources; LAN/WAN transfer to the visualization.]
81
A specified architecture with
48 processing nodes,
1 GB of available memory, and
a specified licensed software package
for 1 hour between 8am and 6pm of the following day; the time must be known in advance.
A specific visualization device during program execution
Minimum bandwidth between the VR device and the main computer during program execution
Input: a specified data set from a data repository
Cost of at most 4 €; preference of cheaper job execution over an earlier execution
actually a pretty “simple” example (no complex workflows)
82
Expected output of a Grid scheduler: a time/resource plan in which data access and storing, data transfers over networks 1-3, loading and providing data, parallel computation on two computers (with network communication for the computation), software usage, data storage, and network communication for visualization in the VR cave are all co-scheduled.
Reservations are necessary!
83
[Figure: layered Grid scheduling architecture. The Grid user/application submits to a Grid scheduler, which talks to the Grid middleware on top of each local RMS and schedules compute resources, data resources, network resources, and other resources/services. Supporting services: information services, monitoring services, security services, accounting/billing, other Grid services]
84
Relevant to Grid scheduling
Information Service
Job/Workflow Description
Requirement Description
Resource Discovery
Reservation
Monitoring/Notification
Job Execution
Security
Accounting/Billing
Data Management
Local RMS
85
Services are used to abstract all resources and functionalities.
Concept of OGSI and WSRF
using Web Services, SOAP, XML to implement the services
The OGSI idea of Grid Services is implemented in GT3
transition to WSRF with GT4
Core services for building a Grid are discussed in the Open Grid Services Architecture (OGSA)
86
[Figure: virtual integration architecture. Users and applications in problem domain X use application & integration technology for that domain, on top of a generic virtual service access and integration layer. OGSA contributes job submission, brokering, workflow, registry, banking, data transport, resource usage, authorisation, transformation, structured data integration and structured data access. OGSI is the interface to the Grid infrastructure; Web Services provide the basic functionality over distributed compute, data & storage resources holding structured data]
[Figure: OGSA service families.
Data services: data catalog, data provision, data integration, data access
Context services: virtual organization, policy + agreement
Status + monitoring services: problem determination, event information, logging service
Job & workflow management: application content manager, workflow manager, workload manager, broker, execution planning service, job manager, job service
Resource management services: reservation service, service container, deploy + configuration service, provisioning
Security services: authentication, authorization, delegation, firewall transition
Infrastructure services: WS-RF (OGSI), WS management]
88
[Figure: scheduling between "demand" and "supply".
Demand side, workload management framework: user/job proxies, policies, environment management, job factory, dependency management; workload optimizing framework: queuing services, workload optimization, workload post-balancing, workload models (history/prediction), workload orchestration, admission control (workload), SLA management (workload).
In between, the scheduling/optimizing framework: workload optimization, optimal resource-workload mapping, resource selection, selection context (e.g. VO).
Supply side, resource management framework: CMM, reservation, information provider, resource provisioning, resource factory, resource allocation (or binding); resource optimizing framework: capacity management, resource placement, admission control (resources), quality of service (resources).
Arrows mark primary and meta-interactions; each box represents one or more OGSA services]
89
Functional Requirements:
Cooperation between different resource providers
Interaction with local resource management systems
Support for reservations and service level agreements
Orchestration of coordinated resources allocations
Automatic handling of accounting and billing
Distributed Monitoring
Failure Transparency
90
[Figure: interacting Grid scheduling services. The Scheduling Service queries for resources and works with a Reservation Service, Job Supervisor Service, Accounting and Billing Service, Data Management Service and Network Management Service. An Information Service maintains static & scheduled/forecasted information about resources, data and the network, fed by a Compute Manager (management system for compute/storage/visualization resources), a Data Manager (data resources) and a Network Manager (management system for network resources)]
Scheduling-relevant interfaces of these basic blocks are still to be defined!
91
Relevant for Grid Scheduling:
Access to static and dynamic information
Dynamic information includes data about planned or forecasted future events
e.g. existing reservations, scheduled tasks, future availabilities
need for anonymous and limited information (privacy concerns)
Information about all resource types
including e.g. data and network
future reservation, data transfers etc.
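A minimal sketch of how a scheduler could use such forecasted information, assuming a hypothetical reservation table exposed by the information service:

```python
# Sketch (hypothetical data model): an information service exposing
# scheduled/forecasted future events, so a Grid scheduler can ask
# "is the resource free in [start, end)?" instead of only "is it free now?".

reservations = [
    # (start, end) in hours from now: existing reservations/scheduled tasks
    (0, 2),
    (3, 5),
]

def is_available(start, end, booked=reservations):
    """True if [start, end) does not overlap any existing reservation."""
    return all(end <= s or start >= e for (s, e) in booked)

print(is_available(2, 3))   # True: fits in the gap between reservations
print(is_available(4, 6))   # False: overlaps the (3, 5) reservation
```

Privacy concerns could be addressed by exposing only such yes/no availability answers rather than the full reservation list.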
92
Information about the job specifics (what is the job) and job requirements (what is required for the job) including data access and creation
Need for common workflow description
e.g. a DAG formulation
include static and dynamic dependencies
need for the ability to extract workflow information to schedule a whole workflow in advance
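A DAG formulation of a workflow might look like the following sketch (the task names are made up for illustration; no common description format is standardized):

```python
# A minimal DAG workflow description: each task maps to its predecessors.
# Extracting the whole graph up front lets a Grid scheduler plan all steps
# in advance instead of submitting them one by one.

workflow = {
    "fetch_input":   [],               # no predecessors
    "simulate":      ["fetch_input"],
    "visualize":     ["simulate"],
    "store_results": ["simulate"],
}

def topological_order(dag):
    """Return the tasks in an order respecting all (static) dependencies.
    Assumes the graph is acyclic."""
    order, done = [], set()
    while len(order) < len(dag):
        for task, deps in dag.items():
            if task not in done and all(d in done for d in deps):
                order.append(task)
                done.add(task)
    return order

print(topological_order(workflow))
```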
93
Interaction between scheduling instances, and between resource/agreement providers and agreement initiators (higher-level schedulers)
access to tentative information is necessary
negotiations might take very long
individual scheduling objectives must be considered
probably market-oriented and economic scheduling is needed
Need for combining agreements from different providers
coordinate complex resource requests or workflows
Maintain different negotiations at the same time
probably several levels of negotiation, agreement commitment and reservation
94
Interaction with budget information
Charging for allocations, reservations; preliminary allocation of budgets
Concepts for reliable authorization of Grid schedulers to spend money on behalf of the user
Refunding in case of resource/SLA failure, re-scheduling etc.
Reliable monitoring and accounting
required for tracing whether a party fulfilled an agreement
95
Monitoring of
resource conditions
agreements
schedules
program execution
SLA conformance
workflows
…
Monitoring must be reliable as it is part of accountability
failure or fulfillment by a service/resource provider must be clearly identifiable
96
Grids ultimately require coordinated scheduling services.
Support for different scheduling instances
different local management systems
different scheduling algorithms/strategies
For arbitrary resources
not only computing resources, also data, storage, network, software etc.
Support for co-allocation and reservation
necessary for coordinated grid usage
(see data, network, software, storage)
Different scheduling objectives
cost, quality, other
97
Using a brokerage/trading strategy:
Higher-level scheduling:
submit Grid job description
discover resources (considering individual user policies)
query for allocation offers
Lower-level scheduling:
analyze query (considering individual owner policies)
generate allocation offer
Higher-level scheduling:
collect offers
select offers
coordinate allocations
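The flow above can be sketched in a few lines (the provider data, prices and the selection rule are illustrative assumptions, not a real protocol):

```python
# Sketch of a brokerage/trading round: providers generate offers under
# their own policies, the higher-level scheduler collects and selects.

job = {"nodes": 64, "max_price": 4.0}

providers = [
    {"name": "site-A", "free_nodes": 128, "price": 3.5, "start": 10},
    {"name": "site-B", "free_nodes": 32,  "price": 1.0, "start": 5},
    {"name": "site-C", "free_nodes": 256, "price": 2.0, "start": 30},
]

def generate_offer(provider, job):
    """Lower-level scheduling: a provider offers only if its policy allows."""
    if provider["free_nodes"] >= job["nodes"] and provider["price"] <= job["max_price"]:
        return {"site": provider["name"], "price": provider["price"],
                "start": provider["start"]}
    return None

def select_offer(offers):
    """Higher-level scheduling: pick the cheapest offer (one user policy)."""
    return min(offers, key=lambda o: o["price"])

offers = []
for p in providers:
    offer = generate_offer(p, job)
    if offer:
        offers.append(offer)

best = select_offer(offers)
print(best["site"])  # site-C: cheapest feasible offer
```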
98
Multi-level scheduling must support different RM systems and strategies.
Provider can enforce individual policies in generating resource offers.
Users receive resource allocations optimized for their individual objectives.
Different higher-level scheduling strategies can be applied.
Multiple levels of scheduling instances are possible
Support for fault-tolerant and load-balanced services.
99
Multilevel Grid scheduling architecture
Lower level local scheduling instance
Implementation of owner policies
Higher level Grid scheduling instance
Resource selection and coordination
(Static) Interface definition between both instances
Different types of resources
Different local scheduling systems with different properties
Different owner policies
(Dynamic) Communication between both instances
Resource discovery
Job monitoring
100
The mapping of jobs to resources can be abstracted using the concept of Service Level Agreements (SLAs) (Czajkowski, Foster, Kesselman & Tuecke)
SLA: Contract negotiated between
resource provider, e.g. local scheduler
resource consumer, e.g., grid scheduler, application
SLAs provide a uniform approach for the client to
specify resource and QoS requirements, while
hiding from the client details about the resources,
such as queue names and current workload
101
Resource SLA (RSLA)
A promise of resource availability
Client must utilize promise in subsequent SLAs
Advance Reservation is an RSLA
Task SLA (TSLA)
A promise to perform a task
Complex task requirements
May reference an RSLA
Binding SLA (BSLA)
Binds a resource capability to a TSLA
May reference an RSLA (i.e. a reservation)
May be created lazily to provision the task
Allows complex resource arrangements
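The three SLA types can be modeled roughly as data classes (the field names are simplified assumptions, not the actual agreement schema):

```python
# Sketch of the RSLA/TSLA/BSLA model from Czajkowski et al.
from dataclasses import dataclass
from typing import Optional

@dataclass
class RSLA:                        # Resource SLA: a promise of availability
    resource: str
    start: float                   # e.g. hours from now
    duration: float

@dataclass
class TSLA:                        # Task SLA: a promise to perform a task
    task: str
    requirements: dict
    rsla: Optional[RSLA] = None    # may reference a reservation

@dataclass
class BSLA:                        # Binding SLA: binds a capability to a TSLA
    tsla: TSLA
    rsla: RSLA                     # the reservation being bound

reservation = RSLA("cluster-1", start=8.0, duration=1.0)
task = TSLA("simulation", {"nodes": 48}, rsla=reservation)
binding = BSLA(task, reservation)  # may be created lazily, after submission
print(binding.rsla.resource)       # cluster-1
```

A TSLA without an RSLA reference corresponds to plain batch submission: the task will run, but nothing is promised about when.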
102
The client negotiates a TSLA for the task with the Grid Scheduler
A TSLA that refers to an RSLA assures that the job gets the reserved resources at a specified time
TSLA without an RSLA tells little about when the resources will be available to the job
103
For an existing TSLA, the Grid Scheduler may obtain additional RSLAs
An RSLA is negotiated by the Grid Scheduler with the Resource
A BSLA binds this RSLA to the corresponding TSLA
useful when the resource is not needed for the whole duration of the task, or
is not known completely (e.g., the time at which a resource will be needed) before submitting the task
104
[Figure: two users/applications request TSLA 1 and TSLA 2 from the Grid Scheduler, which negotiates RSLA 1 and RSLA 2 with Grid Resource Managers in front of the resources]
The Grid Scheduler receives requests for two agreements.
It negotiates RSLA1 and RSLA2 with the resources, and in parallel negotiates the corresponding TSLA1 and TSLA2 with the agreement initiators.
105
Goal: defining Web-Service-based protocols for negotiation and agreement management
WS-Agreement protocol
106
Grid Scheduling Methods:
Support for individual scheduling objectives and policies
Multi-criteria scheduling models
Economic scheduling methods for Grids
Architectural requirements:
Generic job description
Negotiation interface between higher- and lower-level scheduler
Economic management services
Workflow management
Integration of data and network management
107
Current approach:
Extension of job scheduling for parallel computers.
Resource discovery and load-distribution to a remote resource
Usually batch job scheduling model on remote machine
But actually required for Grid scheduling is:
Co-allocation and coordination of different resource allocations for a Grid job
Instantaneous ad-hoc allocation is not always suitable
This complex task involves:
Cooperation between different resource providers
Interaction with local resource management systems
Support for reservations and service level agreements
Orchestration of coordinated resources allocations
108
Local computing typically has:
A given scheduling objective, such as minimization of response time
Use of batch queuing strategies
Simple scheduling algorithms: FCFS, Backfilling
Grid Computing requires:
Individual scheduling objective
better resources
faster execution
cheaper execution
More complex objective functions apply for individual Grid jobs!
109
Local computing typically has:
Single scheduling objective for the whole system:
e.g. minimization of average weighted response time or high utilization/job throughput
In Grid Computing :
Individual policies must be considered:
access policy, priority policy, accounting policy, and other
More complex objective functions apply for individual resource allocations !
User and owner policies/objectives may be subject to privacy considerations!
110
Cost model
Use of a resource
Reservation of a resource
Individual scheduling objective functions
User and owner objective functions
Formulation of an objective function
Integration of the function in a scheduling algorithm
Resource selection
The scheduling instances act as broker
Collection and evaluation of resource offers
111
In contrast to local computing, there is no general scheduling objective anymore
minimizing response time
minimizing cost
tradeoff between quality, cost, response-time etc.
Cost and different service quality come into play
the user will introduce individual objectives
the Grid can be seen as a market where resources are competing alternatives
Similarly, the resource provider has individual scheduling policies
Problem:
the different policies and objectives must be integrated in the scheduling process
different objectives require different scheduling strategies
part of the policies may not be suitable for public exposure
(e.g. different pricing or quality for certain user groups)
112
Due to the mentioned requirements, it is not to be expected that a single scheduling algorithm or strategy is suitable for all problems in Grids.
Therefore, there is need for an infrastructure that
allows the integration of different scheduling algorithms
the individual objectives and policies can be included
resource control stays at the participating service providers
Transition into a market-oriented Grid scheduling model
113
Market-oriented approaches are a suitable way to implement the interaction of different scheduling layers
agents in the Grid market can implement different policies and strategies
negotiations and agreements link the different strategies together
participating sites stay autonomous
Need for suitable scheduling algorithms and strategies for creating and selecting offers
need for creating Pareto-optimal scheduling solutions
Performance relies highly on the available information
negotiation can be a hard task if many potential providers are available.
114
Several possibilities for market models:
auctions of resources/services
auctions of jobs
Offer-request mechanisms support:
inclusion of different cost models, price determination
individual objective/utility functions for optimization goals
Market-oriented algorithms are considered:
robust
flexible in case of errors
simple to adapt
markets can have unforeseeable dynamics
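A toy version of an auction of jobs, one of the market models listed above (the bids and the reverse-auction rule are illustrative assumptions):

```python
# Sketch of a reverse auction: providers bid for executing a job,
# and the job goes to the cheapest bidder within the user's budget.

bids = {
    # provider -> price asked for executing the job
    "site-A": 3.5,
    "site-B": 2.0,
    "site-C": 2.8,
}

def reverse_auction(bids, budget):
    """Return (winner, price), or (None, None) if no bid fits the budget."""
    winner, price = min(bids.items(), key=lambda kv: kv[1])
    return (winner, price) if price <= budget else (None, None)

print(reverse_auction(bids, budget=4.0))  # ('site-B', 2.0)
print(reverse_auction(bids, budget=1.0))  # (None, None)
```

Auctions of resources work analogously with the roles reversed: users bid for a resource, and the provider's pricing policy picks the winner.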
115
[Figure: a schedule over time t on resources R1–R8, with jobs t1–t3 already placed and a new job t4 to be allocated]
116
[Figure: Offer 1 — an allocation for the job at the end of the current schedule, on resources R1–R8]
117
[Figure: Offer 2 — an alternative allocation relative to the end of the current schedule, on resources R1–R8]
118
[Figure: Offer 3 — an allocation that extends the schedule beyond its current end, on resources R1–R8]
119
Evaluation with utility functions
A utility function is a mathematical representation of a user’s preference
The utility function may be complex and contain several different criteria
Example using response time (or delay time) and price:
maximize U = −(w1 · delayTime + w2 · price), with user-chosen weights w1, w2
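Evaluating offers with such a utility function can be sketched as follows (the offers and weights are illustrative assumptions):

```python
# Sketch: evaluating allocation offers with a two-criteria utility
# (response time/delay and price). Higher utility is better.

offers = [
    {"name": "A1", "delay": 1.0, "price": 4.0},
    {"name": "A2", "delay": 3.0, "price": 2.0},
    {"name": "A3", "delay": 6.0, "price": 1.0},
]

def utility(offer, w_delay=1.0, w_price=1.0):
    """Penalize both waiting and cost with user-chosen weights."""
    return -(w_delay * offer["delay"] + w_price * offer["price"])

# A time-critical user weights delay heavily; a thrifty user weights price.
fast = max(offers, key=lambda o: utility(o, w_delay=10.0, w_price=1.0))
cheap = max(offers, key=lambda o: utility(o, w_delay=0.1, w_price=1.0))
print(fast["name"], cheap["name"])  # A1 A3
```

The same set of offers thus yields different winners depending on the user's objective, which is exactly why no single Grid-wide objective function can be assumed.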
120
[Figure: allocation offers A1, A2, A3 plotted by price and latency; which offer yields the highest utility depends on the user and provider objectives]
121
Following are some slides illustrating the scheduling process in the NWIRE project.
NWIRE models a Grid infrastructure without central services
A P2P scheduling model is employed where Grid Scheduler/Managers that reside on each local site are queried for offers.
A Grid scheduler asks other known schedulers.
Similar to P2P systems, no Grid manager knows the whole Grid but keeps a list of known other Grid managers
Those schedulers can forward the request to other schedulers
The scope and depth of the query can be parameterized
Individual objective functions are used to evaluate the utility for a user and the resource provider.
The scheduler selects the allocation offer with the highest utility value
The whole model represents a Grid market and auction system.
122
Real Workload Data of Grids is not yet available.
However, workloads of single HPC installations are available.
Example of evaluation results:
considered: traces from Cornell Theory Center (430 nodes IBM RS/6000 SP)
4 workloads adapted from the real traces, 10000 jobs each
Assumed Synthetic Grid machine configurations:
Name  | Configuration        | Largest machine
m64   | 4*64 + 6*32 + 8*8    | 64
m128  | 4*128                | 128
m256  | 2*256                | 256
m384  | 1*384 + 1*64 + 4*16  | 384
m512  | 1*512                | 512
123
Scheduling Process
1) Client sends a request
[Figure: the application sends a request (requirements, job attributes, objective function) to a Grid Manager, which sits in front of several Resource Managers]
124
Example Request
KEY “Hops“ {VALUE “HOPS“ {2}}
KEY “MaxOfferNumber“ {VALUE “MaxOfferNumber“ {10}}
KEY “MinMulti-Site“ {VALUE “MinMulti-Site“ {10}}
...
KEY “Utility“ {
ELEMENT 1 {
CONDITION{((OperatingSystem EQ “Linux“)&&( NumberOfProcessors >= 8))}
VALUE “UtilityValue“ {-EndTime}
VALUE “RunTime“ {5}
...
}
ELEMENT 2 {
CONDITION{ ... }
VALUE “UtilityValue“ {-JobCost}
...
}
}
125
Scheduling Process
2) Query for other offers
3) Limitations of direction and depth
[Figure: the application's request (requirements, job attributes, objective function) reaches a Grid Manager, which forwards it to other Grid Managers and to its Resource Managers]
The scheduler requests offers from other Grid Managers/Schedulers.
Search depth and scope can be parameterized.
126
Scheduling Process
4) Allocation; maximization of the objectives
[Figure: offers flow back from the Resource Managers and the queried Grid Managers to the initiating Grid Manager, carrying attributes and the objective function]
Returned offers are collected.
The allocation is selected according to the utility value.
127
Scheduling Process
5) User and Resource Managers are informed about allocations
[Figure: the Grid Manager enters Allocation_1, Allocation_2 and Allocation_3 into the schedule and notifies the Resource Managers]
The user can directly be informed about the schedule for his execution.
The Grid Manager may re-schedule his allocations to optimize the result further.
128
Offer Creation
[Figure: animation over several slides — offers for jobs A–D are created step by step in a schedule over time on resources 1–7]
129
AWRT for Economic Scheduling Compared to Conservative Backfilling
[Figure: bar chart of the change in average completion time of the market-economic vs. the conventional scheduler (scale -15,000 to 10,000) for the resource configurations M128, M256 and M384; the labels mark where the conventional algorithms and where the market-oriented approach perform better]
137
Changes in Utilization of market-economic vs. conventional schedulers
[Figure: bar chart of the change in utilization (scale -3.00 to 3.00) for the resource configurations M128, M256 and M384; the labels mark where the market-oriented approach and where the conventional schedulers perform better]
138
AWRT for different Machine Utility Functions
[Figure: bar chart of AWRT (0 to 300,000) for the machine utility functions MF 1 through MF 6]
139
Economic models provide results in the range of conventional algorithms in terms of AWRT
Economic methods leave much more flexibility in defining the desired resources
Problems of site autonomy, heterogeneous resources and individual owner and user preferences are addressed
Advantage of reservation of resources ahead of time
Further analysis needed:
Usage of other utility functions
140
Data is a key resource in the Grid
remote job execution will in most cases include the consideration of data
Data is a resource that can be managed
replication, caching, pre-fetching, post-fetching
The requirements on network information and QoS features will increase
scheduling and planning of data and resources requires information about available network bandwidth
how long does a reliable data transfer take?
is sufficient bandwidth available between two resources (e.g. between a visualization device and the compute resource)?
Both resource types, as well as others, will need suitable management and scheduling features
negotiation and agreement protocols already provide an infrastructure
141
Most new resource types can be included via individual lower-level resource management systems.
Additional considerations for
Data management
Select resources according to data availability
But data can be moved if necessary!
Network management
Consider advance reservation of bandwidth or SLA
Network resources usually depend on the selection of other resources!
Problem: no general model for network SLAs.
Coordinate data transfers and storage allocation
142
Access to information about the location of data sets
Information about transfer costs
Scheduling of data transfers and data availability
optimize data transfers with regard to available network bandwidth and storage space
Coordination with network or other resources
Similarities with general Grid scheduling:
access to similar services
similar tasks to execute
interaction necessary
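One such task, replica selection by estimated transfer time, can be sketched as follows (the sites, size and bandwidths are made-up values, not real services):

```python
# Sketch: choose the replica with the shortest estimated transfer time,
# using (assumed) information about bandwidth between sites. A real
# scheduler would query information services instead of this table.

DATASET_GB = 100
replicas = {
    # replica site -> available bandwidth to the compute site, in GB/s
    "repo-EU": 0.5,
    "repo-US": 0.1,
}

def transfer_time(size_gb, bandwidth_gbps):
    """How long a (reliable) transfer takes, ignoring protocol overheads."""
    return size_gb / bandwidth_gbps

best = min(replicas, key=lambda site: transfer_time(DATASET_GB, replicas[site]))
print(best, transfer_time(DATASET_GB, replicas[best]))  # repo-EU 200.0
```

The same estimate feeds back into scheduling: if the transfer takes 200 seconds, the compute reservation should not start earlier.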
143
[Figure: data Grid scheduling. The application hands the job to a planner responsible for data location, replica selection, and selection of compute and storage nodes. The planner consults a Metadata Service (location based on data attributes), a Replica Location Service (location of one or more physical replicas) and Information Services (state of grid resources, performance measurements and predictions), all under security and policy. An executor initiates data transfers and computations via data movement and data access on compute and storage resources]
Source: Carl Kesselman
144
[Figure: layered data management architecture — the Replica Location Service underpins the Reliable Replication Service, Replica Consistency Management Services, Metadata Service, Reliable Data Transfer Service and GridFTP]
The Replica Location Service is one component in a layered data management architecture
It provides a simple, distributed registry of mappings
Consistency management is provided by higher-level services
Source: Carl Kesselman
145
Scheduling Service:
1. receives job description
2. queries Information Service for static resource information
3. prioritizes and pre-selects resources
4. queries for dynamic information about resource availability
5. queries Data and Network Management Services
6. generates schedule for job
7. reserves allocation if possible, otherwise selects another allocation
8. delegates job monitoring to Job Supervisor
Example:
40 resources of the requested type are found.
12 resources are selected.
8 resources are available.
Network and data dependencies are detected.
Utility function is evaluated.
The 6th tried allocation is confirmed.
Job Supervisor and the Data/Network Management services monitor and initiate the allocation.
Data/network is provided and the job is started.
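The steps above can be sketched as a loop (the resource data, pre-selection rule and reservation outcomes are simulated assumptions):

```python
# Sketch of the scheduling-service steps: discover, pre-select, check
# availability, then try reservations until one is confirmed.
import random

random.seed(0)  # deterministic simulation

def discover(n=40):
    """Step 2: static information — resources of the requested type."""
    return [f"res-{i}" for i in range(n)]

def preselect(resources, k=12):
    """Step 3: prioritize and pre-select a subset."""
    return resources[:k]

def available(resource):
    """Step 4: dynamic availability (simulated here)."""
    return random.random() < 0.7

def try_reserve(resource):
    """Step 7: a reservation may fail; then another allocation is tried."""
    return random.random() < 0.5

candidates = [r for r in preselect(discover()) if available(r)]
confirmed = next((r for r in candidates if try_reserve(r)), None)
print(confirmed is not None)
```

Steps 5, 6 and 8 (data/network queries, schedule generation, delegation to the Job Supervisor) would slot in around this core selection loop.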
146
Expected output of a Grid scheduler:
[Figure: the resulting Gantt chart over time for the resources Data, Network 1, Computer 1, Network 2, Computer 2, Software License, Storage, Network 3 and VR-Cave, with the coordinated activities: data access, data transfers, loading/storing/providing data, parallel computation, communication for computation and visualization, software usage, data storage, and visualization]
147
Reconsidering a schedule with already made agreements may be a good idea from time to time
because resource situation may have changed, or workload situation has changed
Optimization of the schedule can only work within the bounds of the made agreements and reservations
given guarantees must be observed
The schedulers can try to maximize the utility values of the overall schedule
a Grid scheduler may negotiate with other resource providers in order to get better agreements; it may cancel previous agreements
a local scheduler may optimize the local allocations to improve the schedule.
148
Core service infrastructure
OGSI/WSRF
OGSA
GGF hosts several groups in the area of Grid scheduling and resource management.
Examples:
WG Scheduling Attributes (finished)
WG Distributed Resource Management Application API (active)
WG Grid Resource Allocation Agreement Protocol (active)
WG Grid Economic Services Architecture (active)
RG Grid Scheduling Architecture (active)
149
Resource management and scheduling is a key service in a Next Generation Grid.
In a large Grid the user cannot handle this task.
Nor is the orchestration of resources a provider task.
System integration is complex but vital.
The local systems must be enabled to interact with the Grid.
Providing sufficient information, expose services for negotiation
Basic research is still required in this area.
No ready-to-implement solution is available.
New concepts are necessary.
Current efforts provide the basic Grid infrastructure. Higher-level services as Grid scheduling are still lacking.
Future RMS systems will provide extensible negotiation interfaces
Grid scheduling will include coordination of different resources
150
Commercial interest in the Grid may have a stronger focus on non-HPC-related aspects
eCommerce-Solutions
Dynamically migrating Application Server Load
Inter-operation between companies
Supply-Chain Management: requesting distributed resources under certain constraints
Enterprise Application Integration (EAI)
dynamically discover and use available services
The problems tackled by Grid RMS are common to many application fields.
Grid solutions may become a key technology infrastructure, or solutions from other areas (e.g. the Web Services community) may surpass Grid efforts.
151
Book: “Grid Resource Management: State of the Art and Future Trends”, co-editors Jarek Nabrzyski, Jennifer M. Schopf, and Jan Weglarz, Kluwer Publishing, 2004
PBS, PBS Pro: www.openpbs.org and www.pbspro.com
LSF, CSF: www.platform.com
Globus: www.globus.org
Global Grid Forum: www.ggf.org, see SRM area
152
Contact: ramin.yahyapour@udo.edu
Computer Engineering Institute
University of Dortmund
44221 Dortmund, Germany
http://www-ds.e-technik.uni-dortmund.de
153