Performance-based Middleware for Grid Computing

Dr Stephen Jarvis
High Performance Systems Group
University of Warwick, UK
Context
• Funded by / collaborating with
– UK e-Science Core Programme
– IBM (Watson, Hursley)
– NASA (Ames), NEC Europe, Los Alamos National Laboratory, MIT
• Aims
– Integrate established performance and scheduling tools with emerging grid middleware
– Test on scientific and business case studies

Performance-managed Grid Middleware
Do we need performance-managed Grid middleware?
• User perspective
– Large, complex scientific applications
– Grid provides a number of run options
– Real-time results/guarantees important
– Budget
• Resource provider's perspective
– Scheduling of tasks
– Make best use of resources / profit
– Provide QoS
Performance Services
• Intra-domain
– Lab- / department-based
– Shared resources under local administration
• Multi-domain
– Campus- / country-based
– Wide-area resource and task management
– Cross domain
Performance tools
• Performance prediction tools
• Aim to predict
– Execution time
– Communication usage
– Data and resource requirements
• Provides a best guess as to how an application will execute on a given resource (a minimal sketch of such an output follows)
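The output of such a tool can be pictured as a small record of the predicted quantities. The names below are hypothetical and serve only to illustrate the three items above; they are not any particular tool's API.

from dataclasses import dataclass

@dataclass
class PerformancePrediction:
    """Hypothetical record of what a prediction tool estimates for one run."""
    execution_time_s: float    # predicted wall-clock time on the target resource
    comm_volume_bytes: int     # predicted communication usage
    memory_bytes: int          # predicted data / resource requirement

# A best guess for one application/resource pairing
guess = PerformancePrediction(execution_time_s=42.5,
                              comm_volume_bytes=3_200_000,
                              memory_bytes=512_000_000)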
PACE
[Figure: PACE toolkit structure. The User supplies an Application, which is characterised as an Application Model; a Resource is characterised as a Resource Model. Model parameters and the resource configuration are fed to an Evaluation Engine, which produces the performance prediction.]
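As a toy illustration of the evaluation step, the sketch below combines an application model (operation counts as functions of the run parameters) with a resource model (per-operation costs) to estimate execution time. The model format and the numbers are assumptions for illustration only; PACE uses its own layered model language, not Python dictionaries.

# Minimal, illustrative evaluation engine: cost each operation the application
# model says will be performed, using the resource model's per-operation costs.
# All names and constants here are hypothetical.

def predict_execution_time(app_model, resource_model, params):
    """Estimate run time by combining operation counts with per-operation costs."""
    total = 0.0
    for op, count_fn in app_model.items():        # e.g. 'flop', 'mem_access', 'msg_send'
        count = count_fn(params)                  # counts depend on problem size, processor count, ...
        total += count * resource_model[op]       # seconds per operation on this resource
    return total

# Toy application model: an N x N stencil kernel split over P processors
app_model = {
    "flop":       lambda p: 5 * p["N"] ** 2 / p["P"],
    "mem_access": lambda p: 2 * p["N"] ** 2 / p["P"],
    "msg_send":   lambda p: 4 * p["N"],           # halo exchange messages
}
resource_model = {"flop": 2e-9, "mem_access": 8e-9, "msg_send": 5e-6}   # hypothetical costs

print(predict_execution_time(app_model, resource_model, {"N": 4096, "P": 16}))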
Why is prediction useful?
• Scaling properties on single architectures
• Compare performance over different architectures
• Re-order tasks according to deadlines (see the sketch after this slide)
• Give priority to favoured users
• Maximise resource usage
[Figure: execution time (sec) of the benchmark codes sweep3d, improc, closure, fft, jacobi, memsort and cpi on an SGI Origin2000, plotted against the number of processing elements]
Allows runtime scenarios to be explored before deployment
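One of the uses above, re-ordering tasks according to deadlines, can be sketched with predicted run times as follows. The task fields and the slack-based rule are assumptions for illustration, not the project's actual policy.

# Illustrative re-ordering: run the task with the least slack (deadline minus
# predicted run time) first, and let favoured users win ties.
# Field names, numbers and the rule itself are hypothetical.

tasks = [
    {"name": "sweep3d", "predicted_s": 300, "deadline_s": 900,  "favoured": False},
    {"name": "fft",     "predicted_s": 120, "deadline_s": 200,  "favoured": True},
    {"name": "jacobi",  "predicted_s": 450, "deadline_s": 2000, "favoured": False},
]

def priority(task):
    slack = task["deadline_s"] - task["predicted_s"]   # less slack => more urgent
    return (slack, not task["favoured"])               # favoured users first on equal slack

for t in sorted(tasks, key=priority):
    print(t["name"])        # fft, sweep3d, jacobi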
1. Intra-Domain Co-Scheduling
• Augment Condor with additional performance information
• Handle predictive and non-predictive tasks
• Use predictive data for system improvement
– Time to complete tasks / utilisation of resources
– QoS – ability to meet deadlines
• Scheduler driver, or co-scheduler (called Titan); see the sketch below
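Titan arranges its schedule queue with a genetic algorithm (the GA box in the diagram on the next slide). A minimal sketch of the kind of fitness such a GA could minimise, built from the criteria above (time to complete, utilisation, deadlines) and predicted run times, is shown below; the weights and the greedy placement model are assumptions for illustration, not Titan's actual algorithm.

# Illustrative fitness for a GA co-scheduler: score a candidate task ordering by
# makespan, host idle time and missed deadlines, using predicted run times.
# The placement model, weights and data are hypothetical.

def fitness(ordering, predicted_s, deadline_s, n_hosts, w=(1.0, 0.5, 10.0)):
    host_free = [0.0] * n_hosts                               # time each host becomes free
    missed = 0
    for task in ordering:
        h = min(range(n_hosts), key=lambda i: host_free[i])   # earliest-free host
        finish = host_free[h] + predicted_s[task]
        if finish > deadline_s[task]:
            missed += 1
        host_free[h] = finish
    makespan = max(host_free)
    idle = sum(makespan - t for t in host_free)               # wasted host time
    return w[0] * makespan + w[1] * idle + w[2] * missed      # lower is better

predicted_s = {"A": 30.0, "B": 50.0, "C": 20.0, "D": 40.0}
deadline_s  = {"A": 60.0, "B": 90.0, "C": 45.0, "D": 120.0}
print(fitness(["C", "A", "B", "D"], predicted_s, deadline_s, n_hosts=2))   # 70.0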
Intra-Domain Co-Scheduling
• Non-predictive tasks
• Tasks with prediction data (see the sketch after this slide)
[Figure: Titan architecture. Requests from users or other domain schedulers arrive through a portal; a PACE pre-execution engine supplies predictions; Titan holds the schedule queue and GA; a cluster connector passes work via ClassAds to the Condor matchmaker and on to the Condor resources.]
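The pre-execution engine is where the two bullet cases above diverge: a task that arrives with prediction data is evaluated for Titan's queue, while a non-predictive task falls back to a coarser estimate. The sketch below is a hypothetical illustration of that split, not the component's actual code.

# Hypothetical illustration of the predictive / non-predictive split made before
# tasks reach Titan's schedule queue.

def pre_execute(task, evaluate_model, default_estimate_s=600.0):
    """Attach the run-time estimate that the schedule queue will work with."""
    if task.get("pace_model") is not None:
        task["predicted_s"] = evaluate_model(task["pace_model"])   # task carries prediction data
        task["predictive"] = True
    else:
        task["predicted_s"] = default_estimate_s                   # fall back to a coarse estimate
        task["predictive"] = False
    return task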
Intra-Domain Deployment
Without co-scheduler
Time to complete = 70.08m
With co-scheduler
Time to complete = 35.19m
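With the co-scheduler the same workload finishes in roughly half the time: (70.08 − 35.19) / 70.08 is a reduction of about 50% in the time to complete.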
2. Multi-Domain Management
• Publish intra-domain performance data through Globus Information Services
• Augment the service with an agent system
– One agent per domain / VO
• When a task is submitted
– Agents query the IS, and negotiate to discover the best domain to run the task (see the sketch below)
• Scheme is tested on a 256-node experimental Grid
– 16 resource domains; 6 architecture types
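A minimal sketch of the negotiation step, under the assumption that each per-domain agent turns the performance data it finds in the information service into a quoted completion time (queue wait plus predicted run time) and the task goes to the best quote. The names and quote structure are illustrative, not the project's actual agent protocol.

# Illustrative domain selection: pick the domain whose agent quotes the
# earliest estimated completion time for the submitted task.
# Domain names and quote values are hypothetical.

def choose_domain(quotes):
    """quotes: domain name -> (queue_wait_s, predicted_run_s) for the submitted task."""
    totals = {d: wait + run for d, (wait, run) in quotes.items()}
    best = min(totals, key=totals.get)
    return best, totals[best]

quotes = {"domain-03": (120.0, 300.0), "domain-07": (10.0, 450.0), "domain-11": (0.0, 700.0)}
print(choose_domain(quotes))   # -> ('domain-03', 420.0)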
Multi-Domain Management
[Figure: task execution timelines across the resource domains, with time on the horizontal axis]
Without agent-based performance management: time to complete = 2752s
With agent-based performance management: time to complete = 467s, an improvement of 83%
QoS: Ability to Meet Deadline
[Figure: ability to meet deadlines with the performance service active vs. inactive]
Resource usage
[Figure: resource usage with the performance service active vs. inactive]
Project Status
• Software
– GT2 and GT3 implementations
– Handles workflows as well as discrete jobs (demonstration available)
– Predictive methods developed for business and scientific applications
• Output
– Presented at GGFs, NeSC workshops, IPDPS, HPDC, Supercomputing, Cluster Computing, CCGrid, Euro-Par, …
– 23 journal and conference papers
– GT3 software release at All-Hands
• See www.dcs.warwick.ac.uk/~hpsg