Service Level Agreement Based Scheduling Heuristics Rizos Sakellariou, Djamila Ouelhadj

Motivation – is this a good state of affairs?
• Scheduling jobs onto (high-performance)
compute resources is traditionally queue
based (and has been since time immemorial)
• Two basic levels of service are provided:
– “Run this when it gets to the head of the queue”
(in other words, whenever!)
– “Run this at a precise time” (advance
reservation)
Even sophisticated systems, such as Condor,
are still queue-based…
Scheduling workflows
• DAG scheduling heuristics do exist
…but…
• In a queue based system:
– To maintain dependences, each component is
scheduled after all parents have finished: the
penalty is that each component pays the cost
of the batch queue latency!
– Assurances about the start time and completion
time of each component are desirable!
The best that one can aim for at the moment
is advance reservation: too restrictive!
Advance Reservation
• Setting times precisely is not what the user
really wants. Often users are only interested
in the bounds (e.g., latest end time). This
information is not captured, nor used!
• Doesn’t fit well into the batch processing
model.
– Utilisation (hence income) decreases rapidly as
the number of AR jobs increases (gaps can’t be
effectively plugged – checkpointing and/or
suspend/resume costs!)
But it’s not only about workflows…
Renegotiation of resources:
• A long-term goal of the Reality Grid project
• Experiments may need to be extended in time
(at short notice) – (the discovery of the century is
around the corner)
• Resources may need to be changed – in which
case checkpointing/restart is needed: state may
be in the order of 1TB!
Could it also be about expanding the user base?
A novel approach to scheduling?
• There is no queue; jobs do not have a priority
• The schedule is based on satisfying constraints.
• These constraints are expressed in a Service
Level Agreement: a contract between users and
brokers; brokers and local schedulers, etc…
What to optimise for? (objective function)
• Resource utilisation (income)
• If someone comes with lots of cash the
scheduler may want to break some smaller
agreements (money rules?) – reliability?
[Figure: architecture. Users agree a MetaSLA with a Super Scheduler (Super Scheduler 1, 2, 3). Super Schedulers agree subSLAs with Local Schedulers (Local Scheduler 1, 2, 3, …, n), which manage the compute resources (e.g., Cluster 1) and keep a Resource Record. Jobs to finish "anytime" require no guarantee.]
Key components
• Users: they negotiate and agree an SLA with
a broker (or superscheduler)
• Brokers: based on SLAs agreed with users,
they negotiate and agree SLAs with local
schedulers (and possibly other brokers)
• Local Schedulers: they schedule the work
that corresponds to an SLA agreed with a
broker.
• Two types (?) of SLA:
– Meta-SLA (between user and broker)
– Sub-SLA (between broker and local scheduler)
Issues
• Definition of SLAs
– Resources, start/finish time, how long, cost,
guarantee, penalty for failure
– Meta-SLAs are negotiated first, sub-SLAs come
later
• Negotiation Protocols
– Based on availability (needs behaviour model)
• Scheduling
– Jobs onto resources (local)
• Renegotiation
• Economy (selfish entities…), metrics
The Research Challenges
• AI Planning & Scheduling
– Fuzzy logic
– Multicriteria scheduling
– AI constraint satisfaction
• Scheduling for the Grid
– SLAs
– Negotiation
– Scheduling heuristics
– Economic considerations
SLA Contents
[Figure: field layout of the metaSLA and the subSLA. Fields shown include:
– Book keeping info: name, ID
– Info: client, owner, remote machine
– Resource: list of resources (H/W, S/W, response time); number of nodes; hardware (arch, mem, disk, CPU, b/w); software (name and version, OS)
– Time: estimated response time (task execution time); time period (date, start time, end time); deadline (execution results by this time); resource reservation time
– Guarantee level
– Cost: budget constraint (max cost specified by the user); payment for task execution
– Execution host: preference for a specific machine; compute node definition]
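The fields above could be held in a small record type. The following is a minimal sketch only: the class names, field names, units and the `is_consistent` check are illustrative assumptions, not the project's actual SLA schema.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    # Hypothetical subset of the hardware/software fields named on the slide.
    nodes: int
    arch: str
    mem_gb: float
    disk_gb: float
    os: str


@dataclass
class SLA:
    # Hypothetical field names; times are hours on an arbitrary clock.
    sla_id: str
    client: str
    resource: Resource
    start_time: float
    end_time: float
    deadline: float         # execution results are needed by this time
    guarantee_level: float  # assumed to be a probability in [0, 1]
    budget: float           # max cost specified by the user

    def is_consistent(self) -> bool:
        """Basic sanity check before an SLA is offered for negotiation."""
        return (self.start_time < self.end_time <= self.deadline
                and self.budget >= 0)


sla = SLA("sla-001", "alice",
          Resource(nodes=4, arch="x86_64", mem_gb=16.0, disk_gb=100.0, os="Linux"),
          start_time=9.0, end_time=13.0, deadline=14.0,
          guarantee_level=0.95, budget=50.0)
```

A broker could reject a request early when `is_consistent()` is false, before any availability check takes place.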
Negotiation
• Meta-SLA
– User requests an SLA
– Based on a (high-level) view of availability, the
broker suggests an SLA
– If the user accepts, a contract is in place
• Sub-SLAs
– Broker has agreed a meta-SLA
– Usage of resources needs to be agreed –
sub-SLA is requested
– Bids are made, based on availability
– Sub-SLA is agreed
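The sub-SLA steps above (bids based on availability, then agreement) can be sketched as follows. This is an illustration under stated assumptions: the `Bid` type, the pricing rule and the "cheapest feasible bid wins" policy are hypothetical and are not prescribed by the slides.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Bid:
    scheduler: str
    finish_time: float
    cost: float


def make_bid(scheduler: str, free_at: float, runtime: float,
             price_per_hour: float) -> Bid:
    """A local scheduler bids according to when its resources become free."""
    return Bid(scheduler, free_at + runtime, runtime * price_per_hour)


def agree_sub_sla(bids: List[Bid], deadline: float,
                  budget: float) -> Optional[Bid]:
    """Broker side: keep bids that meet the meta-SLA deadline and budget,
    then pick the cheapest (ties broken by earliest finish)."""
    feasible = [b for b in bids
                if b.finish_time <= deadline and b.cost <= budget]
    return min(feasible, key=lambda b: (b.cost, b.finish_time), default=None)


bids = [
    make_bid("LS1", free_at=2.0, runtime=4.0, price_per_hour=3.0),
    make_bid("LS2", free_at=0.0, runtime=5.0, price_per_hour=2.0),
    make_bid("LS3", free_at=6.0, runtime=3.0, price_per_hour=1.0),
]
winner = agree_sub_sla(bids, deadline=8.0, budget=15.0)
```

When no bid is feasible the function returns `None`, which is the point at which a broker would have to renegotiate the meta-SLA or decline it.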
[Figure: sequence diagram of SLA negotiation and job execution between the Job Client, Super Scheduler 1 (SS1) and Local Scheduler 1 (LS1). Messages and actions shown include:
• Job Client to SS1: submit job execution request.
• SS1: authorize client, assign an SLA id, parse request, check resource availability, cost calculation, create and store metaSLA.
• Optional action, when the Super Scheduler is unable to check locally: request for execution host info; LS1 checks local resource availability, verifies resources and responds.
• SS1 to Job Client: send a metaSLA with the id; metaSLA negotiation takes place; agree metaSLA.
• SS1 to LS1: create and store subSLA(s), submit subSLA(s); LS1 parses the subSLA(s), verifies, makes the reservation and sets the deadline; agree subSLA(s).
• Report of LS state info; SS1 updates its storage of state info about LS.
• Task initiation notification (email); LS1 initiates task execution.
• During the task execution period: status information request, answered with a job mini-report.
• Completion report; update LS; update storage of state info about LS and the SLA store.]
Local Scheduling
The scheduling problem is defined as the allocation of a set of
independent SLAs, $S = \{SLA_1, \dots, SLA_s\}$, to a network of hosts
$H = \{H_1, \dots, H_n\}$.
$E_{ij}$ is the expected execution time of $SLA_i$ on host $H_j$.
The earliest possible start time $ST_i$ of $SLA_i$ on a selection of hosts is
the latest free time of all the selected hosts.
The expected completion time of $SLA_i$ on host $H_j$ is $C_i = ST_i + E_{ij}$.
The makespan is defined as:
$C_{max} = \max_{1 \le i \le s} C_i$
The objective is to minimise the makespan:
$\min C_{max}$
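The definitions above can be checked on a small instance. The sketch below assumes, for simplicity, that each SLA runs on a single host (the slides allow a selection of hosts) and that every host is free at time 0; the function name and the example matrix are made up for illustration.

```python
def makespan(assignment, exec_time, n_hosts):
    """Compute C_max for SLAs executed in the given order.

    assignment: list of (sla_index, host_index) pairs in execution order.
    exec_time[i][j]: expected execution time E_ij of SLA i on host j.
    """
    free_at = [0.0] * n_hosts          # time at which each host becomes free
    completion = {}
    for i, j in assignment:
        start = free_at[j]             # ST_i: the host's latest free time
        completion[i] = start + exec_time[i][j]   # C_i = ST_i + E_ij
        free_at[j] = completion[i]
    return max(completion.values()), completion


# Three SLAs, two hosts (illustrative values).
E = [[3, 5],
     [2, 4],
     [4, 1]]
cmax, completion = makespan([(0, 0), (1, 0), (2, 1)], E, n_hosts=2)
```

Here SLA 0 and SLA 1 run back to back on host 0 and finish at times 3 and 5, SLA 2 finishes on host 1 at time 1, so the makespan is 5.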
EPSRC e-Science Meeting 2005
Tabu Search for Local Scheduling
• To solve the problem we propose to investigate the use of
advanced search techniques: tabu search, genetic algorithms,
simulated annealing, etc.
• Tabu search is a high-level iterative procedure that uses
memory structures, and exploration strategies based on the
information stored in them, to search beyond local optima. The
search starts from a feasible solution and iteratively moves from
the current solution to its best neighbouring solution, even if
that move worsens the objective function value (Glover, 1997).
Tabu search for local scheduling:
• Initial solution: built with FCFS, Min-min, Max-min, Sufferage or
backfilling.
• The solution is improved using two moves: SLA-swap
and SLA-transfer.
• An SLA-swap move exchanges two SLAs assigned to different
processors.
• An SLA-transfer move shifts an SLA to another processor.
• Composite neighbourhood: both moves are considered together.
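A minimal sketch of a tabu search over SLA-transfer and SLA-swap moves, minimising the makespan, might look as follows. The tabu rule (an SLA may not return to a host it recently left), the aspiration criterion, the parameter values and the all-on-one-host starting solution are assumptions chosen for brevity; the slides do not specify them.

```python
from collections import deque
from itertools import combinations


def makespan(assign, E, n_hosts):
    """assign[i] is the host of SLA i; E[i][h] its execution time there."""
    load = [0.0] * n_hosts
    for i, h in enumerate(assign):
        load[h] += E[i][h]
    return max(load)


def neighbours(assign, n_hosts):
    """Composite neighbourhood: every SLA-transfer and every SLA-swap move.
    Yields (new_assignment, vacated), where vacated lists (sla, old_host)."""
    for i, h in enumerate(assign):                      # SLA-transfer
        for g in range(n_hosts):
            if g != h:
                new = list(assign)
                new[i] = g
                yield new, ((i, h),)
    for i, k in combinations(range(len(assign)), 2):    # SLA-swap
        if assign[i] != assign[k]:
            new = list(assign)
            new[i], new[k] = assign[k], assign[i]
            yield new, ((i, assign[i]), (k, assign[k]))


def tabu_search(E, n_hosts, start, iterations=50, tenure=5):
    current = list(start)
    best, best_cost = list(current), makespan(current, E, n_hosts)
    tabu = deque(maxlen=tenure)     # the most recently vacated (sla, host) pairs
    for _ in range(iterations):
        candidates = []
        for new, vacated in neighbours(current, n_hosts):
            cost = makespan(new, E, n_hosts)
            # Tabu if the move puts an SLA back on a host it recently left.
            is_tabu = any((i, new[i]) in tabu for i, _ in vacated)
            if not is_tabu or cost < best_cost:         # aspiration criterion
                candidates.append((cost, new, vacated))
        if not candidates:
            break
        # Take the best admissible neighbour, even if it is worse than current.
        cost, current, vacated = min(candidates, key=lambda c: c[0])
        tabu.extend(vacated)
        if cost < best_cost:
            best, best_cost = list(current), cost
    return best, best_cost


E = [[3, 5], [2, 4], [4, 1], [6, 6]]    # four SLAs, two hosts (illustrative)
best, best_cost = tabu_search(E, n_hosts=2, start=[0, 0, 0, 0])
```

On this instance the search moves SLA 3 and then SLA 2 to the second host, reaching a makespan of 7, which is optimal here because SLA 3 alone takes 6 units on either host. In practice the starting solution would come from one of the heuristics listed above rather than from a single-host assignment.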
Other objective functions for Local Scheduling
• Other objective functions: minimising maximum lateness,
minimising cost to the user, maximising profit (to the supplier),
maximising personal / general utility, maximising resource
utilisation, etc.
Fuzzy Scheduling
• Uncertainty handling using fuzzy models: fuzzy due
dates, fuzzy execution time.
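[Figure: two membership functions, each ranging from 0 to 1 and plotted against x. μ(P) describes the fuzzy execution time $\tilde{P}$, with breakpoints p1, p2 and p3. μ(D) describes the fuzzy due date $\tilde{D}$, with breakpoints d1 and d2.]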
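As an illustration of how such membership functions are commonly modelled, the sketch below assumes a triangular shape for the fuzzy execution time (zero at p1 and p3, peaking at p2) and a due-date satisfaction that is 1 up to d1 and falls linearly to 0 at d2. These shapes are assumptions; only the breakpoints are recoverable from the slide.

```python
def mu_exec_time(x, p1, p2, p3):
    """Triangular membership: 0 outside (p1, p3), 1 at x == p2.
    Assumes p1 < p2 < p3."""
    if x <= p1 or x >= p3:
        return 0.0
    if x <= p2:
        return (x - p1) / (p2 - p1)
    return (p3 - x) / (p3 - p2)


def mu_due_date(x, d1, d2):
    """Degree of satisfaction with completion time x: fully satisfied up to
    d1, decreasing linearly to 0 at d2. Assumes d1 < d2."""
    if x <= d1:
        return 1.0
    if x >= d2:
        return 0.0
    return (d2 - x) / (d2 - d1)


# Execution time "about 4 hours, between 2 and 8"; due date 10, tolerated to 14.
half_way_up = mu_exec_time(3, 2, 4, 8)
slightly_late = mu_due_date(12, 10, 14)
```

Both example values evaluate to 0.5: an execution time of 3 is halfway up the rising edge, and a completion at time 12 is halfway between the preferred and the latest tolerated due date.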
Fuzzy objective function
The objective is to minimise the maximum fuzzy
completion time:
$\min \tilde{C}_{max} = \max_{i = 1, \dots, n} \tilde{C}_i$
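One way to evaluate this objective, assuming triangular fuzzy numbers written as (low, mode, high), is sketched below. Addition of triangular numbers is component-wise; the component-wise maximum is a widely used approximation in fuzzy scheduling rather than the exact fuzzy maximum, and its use here is an assumption, as are the function names and data.

```python
def fuzzy_add(a, b):
    """Sum of two triangular fuzzy numbers (component-wise)."""
    return tuple(x + y for x, y in zip(a, b))


def fuzzy_max(a, b):
    """Approximate maximum of two triangular fuzzy numbers (component-wise)."""
    return tuple(max(x, y) for x, y in zip(a, b))


def fuzzy_makespan(queues):
    """queues: one list per host of triangular execution times, run back to
    back from time 0. Returns the fuzzy maximum completion time."""
    cmax = (0.0, 0.0, 0.0)
    for jobs in queues:
        finish = (0.0, 0.0, 0.0)
        for p in jobs:
            finish = fuzzy_add(finish, p)   # fuzzy completion time on this host
        cmax = fuzzy_max(cmax, finish)
    return cmax


queues = [
    [(2, 3, 5), (1, 2, 3)],   # host 0: two SLAs
    [(4, 4.5, 6)],            # host 1: one SLA
]
result = fuzzy_makespan(queues)
```

Host 0 finishes at (3, 5, 8) and host 1 at (4, 4.5, 6), so the approximate fuzzy makespan is (4, 5, 8). A search procedure would still need a ranking rule to compare two such triples.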
Re-negotiation in the Presence of Uncertainties
Dynamic nature of Grid computing:
Resources may fail, high-priority jobs may be submitted,
new resources can be added, etc.
When real-time events leave LS agents no longer able to
execute their SLAs, the SS agents re-negotiate the failed
SLAs at the local and global levels of the Grid in order to
find alternative LS agents to execute them.
Renegotiation in the Presence of Uncertainties
[Figure: animated diagram of re-negotiation. The User is connected to super schedulers SS1, SS2, …, SSn, each managing a cluster of local schedulers (Local Sched11 … Local Schedm1; Local Sched12 … Local Schedw2). Arrows mark meta-SLA re-negotiation and sub-SLA re-negotiation. The captions read, in order:]
• SS1 detects the resource failure.
• SS1 re-negotiates the sub-SLAs in failure to find alternative Local Schedulers locally, within the same cluster, by initiating a sub-SLA negotiation session with the suitable Local Schedulers.
• If it cannot manage to do so, SS1 re-negotiates the meta-SLAs with the neighbouring SSs by initiating a meta-SLA negotiation session.
• SS2 re-negotiates the sub-SLAs in failure to find alternative Local Schedulers.
• SS2 locates LS22 to execute the job in failure.
• At the end of task execution, LS22 sends a final report, including the output file details, to the user.
• In case SS1 cannot find alternative Local Schedulers at the local and global levels, SS1 sends an alert message to the user to say that the meta-SLA cannot be fulfilled.
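The escalation order described for re-negotiation (same cluster first, then neighbouring super schedulers, then an alert to the user) can be sketched as below. The data layout, the function name and the "enough free nodes" suitability test are hypothetical simplifications; a real system would run a negotiation session at each step.

```python
def renegotiate(required_nodes, failed_ls, home_ss, clusters):
    """Find an alternative local scheduler for an SLA in failure.

    clusters: {ss_name: {ls_name: free_nodes}} (hypothetical state store).
    Returns (ss_name, ls_name), or None when the meta-SLA cannot be
    fulfilled and the user has to be alerted.
    """
    # Local level first (sub-SLA re-negotiation in the home cluster),
    # then the neighbouring super schedulers (meta-SLA re-negotiation).
    order = [home_ss] + [ss for ss in clusters if ss != home_ss]
    for ss in order:
        for ls, free_nodes in clusters[ss].items():
            if ls != failed_ls and free_nodes >= required_nodes:
                return ss, ls
    return None


clusters = {
    "SS1": {"LS11": 8, "LS12": 2},
    "SS2": {"LS21": 4, "LS22": 16},
}
local_fix = renegotiate(2, "LS11", "SS1", clusters)
global_fix = renegotiate(8, "LS11", "SS1", clusters)
no_fix = renegotiate(32, "LS11", "SS1", clusters)
```

With this data a 2-node job stays in the home cluster on LS12, an 8-node job has to move to LS22 under SS2, and a 32-node job cannot be placed at all.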
Methodology
• Simulation based approach
– Need to evaluate different approaches for agreeing
SLAs (e.g., conservative vs overbooking),
generating bids, pricing/penalties, scheduling, …
– Need to model user behaviour with SLAs
• Evaluation metrics:
– Resource utilisation, jobs completed / SLAs broken
• Difficult to do a fair comparison with a batch-queuing system!
– If job waiting time was the issue, it would translate
to comparing FCFS with soft real-time scheduling!
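The evaluation metrics named above could be computed from a simulation trace along the following lines. The trace format (one `(start, end, deadline)` triple per completed single-host job) and the function name are assumptions made for this sketch.

```python
def evaluate(jobs, n_hosts, horizon):
    """Return (resource utilisation, number of SLAs broken).

    jobs: list of (start, end, deadline) for completed jobs, one host each.
    horizon: length of the simulated time window.
    """
    busy = sum(end - start for start, end, _ in jobs)
    broken = sum(1 for _, end, deadline in jobs if end > deadline)
    return busy / (n_hosts * horizon), broken


trace = [(0, 4, 5), (0, 6, 5), (4, 8, 10)]
utilisation, broken = evaluate(trace, n_hosts=2, horizon=10)
```

In this trace the hosts are busy for 14 of the 20 available host-hours, and one job finishes after its deadline.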
Conclusions
• SLAs have the potential to change the
way that jobs are assigned to compute
resources.
• Increased flexibility appears to be the main
advantage
• Long-term risk: batch systems have shown
a remarkable resistance to change!
http://www.gridscheduling.org
The people
• Manchester:
– Viktor Yarmolenko
– Rizos Sakellariou
– Jon MacLaren (now at Louisiana State
University)
• Nottingham:
– Djamila Ouelhadj
– Jon Garibaldi