High Performance Systems Group
University of Warwick, UK
• Grid research at Warwick.
• Grid workload management and related middleware services
• Performance modelling & prediction
• Workflow
• OGSA Integration
• e-Science project to develop performance-aware middleware services.
• Focus on the performance behaviour of existing grid technologies, and the performance effects of mapping workload to particular grid resources.
• Activities include:
– Studying the performance behaviour of the MDS in Globus
– Refining existing performance modelling tools
– Developing scheduling and request-routing mechanisms
– Developing workload management tools that orchestrate workflows end-to-end.
• The usual WLM constraints apply:
– Make efficient use of resources
– Provide guaranteed qualities of service
• However, there are specific grid issues:
– Grids span administrative domains
– Large-scale
– Dynamic (in resources, users & configuration)
• The environment dictates a distributed approach with clear separations of concern.
• GWLM must typically identify resource(s) that are suitable for a particular application.
– General users are unlikely to want to make such a choice “by hand”
– If there is a choice, how is one system selected over another?
– Some decisions are straightforward
• Accept or reject based on architecture, configuration, etc.
– Others are more complex
• No clear favourite exists.
• Currently executing tasks obscure the selection.
– A partial solution can be obtained by understanding the anticipated application performance (sketched below)
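A minimal sketch of the two-stage selection described above, assuming hypothetical names (Resource, predict_runtime - none of these come from actual grid middleware): a hard accept/reject filter on architecture and configuration, followed by a ranking on anticipated performance where no clear favourite exists.

```python
from dataclasses import dataclass

@dataclass
class Resource:
    name: str
    arch: str          # e.g. "x86_64"
    free_nodes: int

def predict_runtime(resource: Resource, task_size: int) -> float:
    """Hypothetical stand-in for a performance-model evaluation."""
    return task_size / max(resource.free_nodes, 1)

def select_resource(resources, required_arch, task_size):
    # Stage 1: straightforward accept/reject on architecture & configuration.
    candidates = [r for r in resources
                  if r.arch == required_arch and r.free_nodes > 0]
    # Stage 2: no clear favourite, so rank by anticipated performance.
    return min(candidates, key=lambda r: predict_runtime(r, task_size),
               default=None)

pool = [Resource("clusterA", "x86_64", 8), Resource("clusterB", "sparc", 16)]
print(select_resource(pool, "x86_64", 1000).name)   # -> clusterA
```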
• Performance models (at a reasonable level of fidelity) can be effective at driving resource management decisions.
• It is beneficial (from a WLM standpoint) if there is some separation between architecture and application parameters.
• A suitable tool, PACE, aims to predict:
– execution time
– communication usage
– data & resource requirements
• Allows a “best guess” of how a task will behave with respect to its performance characteristics (see the sketch below)
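PACE models are written in their own performance scripting language; purely as an illustration of the separation between application and architecture parameters, a Python sketch (with invented parameter names, not PACE's) might look like this:

```python
# Illustrative only: the application model (operation counts) is kept
# separate from the hardware model (per-operation costs), so either side
# can be swapped independently.
app_model = {"flops": 2.4e9, "msgs": 150, "bytes_sent": 6.0e6}

hw_models = {
    "cluster1": {"flop_rate": 1.2e9, "latency": 50e-6, "bandwidth": 100e6},
    "cluster2": {"flop_rate": 3.0e9, "latency": 10e-6, "bandwidth": 1e9},
}

def predict(app, hw):
    compute = app["flops"] / hw["flop_rate"]
    comm = app["msgs"] * hw["latency"] + app["bytes_sent"] / hw["bandwidth"]
    return compute + comm   # predicted execution time in seconds

for name, hw in hw_models.items():
    print(f"{name}: {predict(app_model, hw):.2f} s")
```

Swapping in a different hardware model re-targets the same application model at a different resource, which is what makes this style of model applicable to heterogeneous grids.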
[Diagram: an application being mapped onto candidate resources (Resource 1, Resource 2, Resource 3)]
[Diagram: the PACE toolset - source code analysis and an object editor feed a performance library; models written in the performance scripting language are compiled for evaluation against Resources 1-3]
[Diagram: as above, with each resource's hardware model expanded into CPU, communication and cache components]
• Expected time predicted by a modelling tool for a given workload on a given resource.
– Fast evaluation produces run-times for one or more system scenarios.
• Pre-execution evaluation
– Scalability analysis (sizing an application appropriately; see the sketch below)
• Modular components for heterogeneous resources
– Not limited to a single “performance point”
– Applicable to Grids
• Can drive workload management decisions & scheduling systems.
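As an illustration of the pre-execution scalability analysis mentioned above, a toy Amdahl-style model (all constants invented, not taken from PACE) can be swept across processor counts to size an application against a deadline:

```python
# A minimal scalability sketch: evaluate a (hypothetical) model over
# processor counts to "size" an application against a deadline.
def model_runtime(p, serial=10.0, parallel=600.0, per_msg=0.05):
    # Amdahl-style toy model: fixed serial part, divisible parallel part,
    # plus a communication term that grows with processor count.
    return serial + parallel / p + per_msg * p

deadline = 80.0
for p in (1, 2, 4, 8, 16, 32, 64):
    t = model_runtime(p)
    print(f"{p:3d} procs -> {t:6.1f} s {'OK' if t <= deadline else ''}")
```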
• QoS is inherently user-focussed (are my jobs being completed within my deadline?).
• From the system perspective, it may be more appropriate to reject a task rather than to accept it and then fail to meet an agreement (see the sketch below).
• Performance prediction can reveal “workable” solutions to workload management - it allows service levels to be specified with some confidence that the workload will execute to the user’s requirement.
• “Deliver what was promised”
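A minimal sketch of that accept-or-reject decision, assuming the models supply a predicted runtime and queue delay (the admit function and its parameters are hypothetical):

```python
import time

def admit(task_deadline: float, predicted_runtime: float,
          queue_delay: float) -> bool:
    """Accept only if the predicted completion time fits the deadline."""
    predicted_completion = time.time() + queue_delay + predicted_runtime
    return predicted_completion <= task_deadline

# Reject rather than accept-and-fail: an honest "no" up front beats a
# broken service-level agreement later.
deadline = time.time() + 3600          # user wants the job inside an hour
print(admit(deadline, predicted_runtime=1800, queue_delay=600))   # True
print(admit(deadline, predicted_runtime=3300, queue_delay=600))   # False
```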
[Diagram, built up across three slides: applications with their service policies (mA1-mA3) are mapped through a cluster interface onto hosts, with QoS reporting back to the user]
• A workload management system has been developed at Warwick that makes significant use of performance models.
• It is a layered architecture that divides the problem into multi-domain, local-domain and resource tiers.
• TITAN does not schedule itself - it is a grid “enhancement service” that can control cluster RMs such as Condor or the Globus run manager.
• Tasks are presented to brokers that communicate within a pre-arranged structure (typically a single-rooted hierarchy).
• Each broker is tied to a local scheduler that acts on its behalf - accepting and publishing tasks to other brokers.
• Tasks are “offered” to multiple brokers, which respond with an execution-time statement obtained by performance model evaluation. The task is then routed appropriately (see the sketch below).
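A toy sketch of that offer/response pattern (the Offer type and the broker callables are invented, not TITAN's actual protocol): the task is offered to several brokers, each replies with a model-derived completion estimate, and the task is routed to the best respondent.

```python
from dataclasses import dataclass

@dataclass
class Offer:
    broker: str
    predicted_end: float   # execution-time statement from model evaluation

def collect_offers(task, brokers):
    # Each broker evaluates the task against its local performance models.
    return [Offer(name, evaluate(task)) for name, evaluate in brokers.items()]

def route(task, brokers):
    offers = collect_offers(task, brokers)
    best = min(offers, key=lambda o: o.predicted_end)
    return best.broker

brokers = {
    "warwick-a": lambda t: 120.0 + t["size"] * 0.5,   # toy cost models
    "warwick-b": lambda t: 40.0 + t["size"] * 0.9,
}
print(route({"size": 100}, brokers))   # -> warwick-b, the earliest finisher
```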
• TITAN currently executes tasks using Condor.
• Condor is a high-throughput scheduler
– Historically used in “cycle-stealing” mode
– Increasingly used with Globus in dedicated mode
– A submit file is produced (sketched below)
– Creates ClassAds for Condor
– Task passed to Condor
• TITAN monitors the Condor queue
– Feeds back the performance of finished tasks
– Notifies the scheduler of changes in the resources
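A hedged sketch of the submit path: TITAN's actual submit-file generation is not shown in the slides, but a minimal Condor description produced and handed to condor_submit might look like this.

```python
import subprocess
import textwrap

def submit_to_condor(executable, args, tag):
    # Produce a minimal submit file (a real TITAN-generated description
    # would also carry requirements/rank ClassAd expressions).
    submit = textwrap.dedent(f"""\
        executable = {executable}
        arguments  = {args}
        output     = {tag}.out
        error      = {tag}.err
        log        = {tag}.log
        queue
    """)
    path = f"{tag}.sub"
    with open(path, "w") as f:
        f.write(submit)
    # Hand the task to Condor; the queue can then be watched via condor_q.
    subprocess.run(["condor_submit", path], check=True)

# submit_to_condor("/bin/sleep", "60", "demo")   # requires a Condor pool
```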
[Figure: schedule trace, 70 mins end-to-end]
[Figure: schedule trace, 36 mins end-to-end]
• Recent work has been focussed on managing flows of related work presented to the brokers.
• This fits with the trend in grid computing to develop small functional components linked together to form an application (Web Services).
• This architecture is beneficial from the performance modelling standpoint - small tasks are run frequently and are hidden from the user.
• Each component can be modelled / timed prior to user execution and parameterised appropriately.
• TITAN considers workflow descriptions at both the multi-domain and local-domain level
– Multi: break the workflow into sub-flows based on an estimation of communication dependencies.
– Local: honour local flow - performance models are used to guide interleaving.
• Workflow is described in a simple XML document (Apache ANT-like syntax) to provide a list of dependent actions (a hypothetical example follows below).
• We are currently looking at combining performance models with different workflow description techniques.
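The slides do not give the exact schema; a hypothetical ANT-like flow document, and a parse into a dependency list, might look like this:

```python
import xml.etree.ElementTree as ET

# Hypothetical ANT-like flow description (the actual TITAN schema is not
# shown in the slides): tasks with "depends" attributes form a DAG.
doc = """
<flow name="demo">
  <task name="extract"/>
  <task name="filter"  depends="extract"/>
  <task name="render"  depends="filter"/>
</flow>
"""

root = ET.fromstring(doc)
deps = {}
for t in root.findall("task"):
    d = t.get("depends")
    deps[t.get("name")] = d.split(",") if d else []
print(deps)   # {'extract': [], 'filter': ['extract'], 'render': ['filter']}
```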
• The TITAN local scheduler uses a genetic algorithm to perform task placement.
• This has been adapted to factor workflow descriptions when building schedule solutions.
• The GA is able to “interleave” different sub-flows and utilise the idle time between neighbouring processes (see the fitness sketch below).
• The local scheduler is able to commit to a deadline for an entire flow using performance models. This can be used in partnership with the brokers.
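The TITAN GA itself is not reproduced here; a minimal fitness sketch (all constants invented) shows the idea of scoring candidate schedules on makespan plus deadline misses, so that solutions which interleave sub-flows into idle gaps are rewarded:

```python
# Minimal fitness sketch (not the TITAN GA itself): a candidate schedule is
# scored on makespan plus a penalty per missed deadline, so solutions that
# interleave sub-flows into idle time score better.
def fitness(schedule, deadlines, miss_penalty=1000.0):
    # schedule: {task: (start, end)} under some processor mapping
    makespan = max(end for _, end in schedule.values())
    misses = sum(1 for t, (_, end) in schedule.items() if end > deadlines[t])
    return makespan + miss_penalty * misses   # lower is fitter

sched = {"a": (0, 30), "b": (30, 70), "c": (10, 25)}   # "c" fills a gap
print(fitness(sched, {"a": 40, "b": 80, "c": 30}))      # 70.0, no misses
```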
• Two simple flows of heterogeneous tasks (with different performance behaviours) are submitted along with a number of single tasks.
• Using the performance models, TITAN determines processor mappings that meet the task deadlines.
• Flows are interleaved with other jobs, whilst maintaining task deadlines.
• We are currently comparing mapping flows using TITAN against Condor DAG-based equivalents.
• Initial results mirror the single-task results (i.e. a reduction in makespan and an improvement in the ability to meet deadlines).
• To support grid middleware integration, TITAN is now being developed in line with the OGSA (GT3 containers).
• This includes the development of a performance service that is able to provide predictive data on-demand using SOAP (sketched below).
• TITAN functions as a self-contained Grid service able to offer scheduling functions to local and remote users.
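The real service runs over SOAP inside a GT3 container; as a shape-of-the-interface illustration only, here is a stand-in that serves predictions on demand over plain HTTP (the predict hook and URL scheme are invented):

```python
# Illustrative only: the actual service is SOAP in a GT3 container.
# This stand-in shows the shape of an on-demand performance query.
from http.server import BaseHTTPRequestHandler, HTTPServer

def predict(task_size: int) -> float:            # hypothetical model hook
    return 5.0 + 0.01 * task_size

class PredictionHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        size = int(self.path.strip("/") or 0)    # e.g. GET /1000
        body = f"<prediction seconds='{predict(size)}'/>".encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/xml")
        self.end_headers()
        self.wfile.write(body)

# HTTPServer(("", 8080), PredictionHandler).serve_forever()
```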
• Performance-aware services make a valuable contribution to grid middleware
– make good use of available resources
– steer tasks to appropriate architectures
– address SLA issues
• Can commit to single tasks and to flows of tasks - this is crucial as the impact of failure is greater.
• Work is currently focused on multi-domain workflow management (dividing the sub-flows) and developing the OGSA integration.