Workflow Scheduling Optimisation: The case for revisiting DAG scheduling

Rizos Sakellariou and Henan Zhao
University of Manchester
(Figure, courtesy of Ewa Deelman: the workflow lifecycle. A scientific analysis is constructed as a workflow template; selecting the input data turns it into a workflow instance; mapping (scheduling) the workflow onto the available resources gives an executable workflow; executing the workflow produces the tasks to be run on Grid resources; workflow evolution feeds back into the cycle.)
Slide Courtesy: Ewa Deelman, deelman@isi.edu, www.isi.edu/~deelman, pegasus.isi.edu
Execution Environment
(Figure, courtesy of Ewa Deelman: the Pegasus/DAGMan execution environment. Pegasus and DAGMan run on a local submit host with a Condor queue; jobs are submitted through Condor-G/Condor-C and GRAM to remote resources managed by local schedulers such as PBS, LSF and Condor, and data are moved to storage over GridFTP and HTTP.)
Slide Courtesy: Ewa Deelman, deelman@isi.edu, www.isi.edu/~deelman, pegasus.isi.edu
In this talk, optimisation relates to performance
What affects performance?
• Aim: to minimise the execution time of the
workflow
• How?
– Exploit task parallelism
• But, even if there is enough parallelism, can the
environment guarantee that this parallelism can be
exploited to improve performance?
– No!
• Why?
– There is interference from the batch job schedulers that
are traditionally used to submit jobs to HPC resources!
Example
• The uncertainty of batch schedulers means that any workflow
enactment engine must wait for components to complete before
beginning to schedule dependent components.
This execution model fails to hide the latencies resulting from the length of job queues: these determine the execution time of the workflows.
• Furthermore, it is not clear if parallelism will be fully exploited;
e.g., if the three tasks above that can be executed in parallel are
submitted to 3 different queues of different length, there is no
guarantee that they will execute in parallel – job queues rule!
Then… try to get rid of the evil job queues!
• Advance reservation of resources has been proposed to
make jobs run at a precise time.
• However, resources would be wasted if they are reserved
for the whole execution of the workflow.
• Can we automatically make advance reservations
for individual tasks?
Assuming that there is no job queue…
…what affects performance?
• The structure of the workflow
– number of parallel tasks;
– how long these tasks take to execute;
• The number of the resources
– typically, much smaller than the parallelism available.
• In addition:
– there are communication costs
– there is heterogeneity
– estimating computation+communication is not trivial.
What does all this imply for mapping?
• An order by which tasks will be
executed needs to be established
(e.g., red, yellow, or blue first?)
• Resources need to be chosen for
each task (some resources are fast,
some are not so fast!)
• The cost of moving data between
resources should not outweigh the
benefits of parallelism.
Does the order matter?
• If task 6 on the right takes comparatively longer to run, we’d like to execute task 2 just after task 0 finishes and before tasks 1, 3, 4, 5.
(Figure: a DAG with tasks 0–9.)
Follow the critical path! Is this new? Not really…
Modelling the problem…
• A workflow is a Directed Acyclic Graph (DAG)
• Scheduling DAGs onto resources is well studied in
the context of homogeneous systems – less so, in the
context of heterogeneous systems (mostly without
taking into account any uncertainty).
• Needless to say, this is an NP-complete problem.
• Are workflows really a general type of DAGs or a
subclass? We don’t really know… (some are clearly
not DAGs – only DAGs considered here…)
Our approach…
• Revisit the DAG scheduling problem for
heterogeneous systems…
• Start with simple static scenarios…
– Even this problem is not well understood, despite
the fact that there have been perhaps more than 30
heuristics published… (check the Heterogeneous
Computing Workshop proceedings for a start…)
• Try to build on as we obtain a good
understanding of each step!
Outline
1. Static DAG scheduling onto heterogeneous
systems (i.e., we know computation &
communication a priori)
2. Introduce uncertainty in computation times.
3. Handle multiple DAGs at the same time.
4. Use the knowledge accumulated above to
reserve slots for tasks onto resources.
Based on…
[1] Rizos Sakellariou, Henan Zhao. A Hybrid Heuristic for DAG Scheduling
on Heterogeneous Systems. Proceedings of the 13th IEEE Heterogeneous
Computing Workshop (HCW’04) (in conjunction with IPDPS 2004), Santa
Fe, April 2004, IEEE Computer Society Press, 2004.
[2] Rizos Sakellariou, Henan Zhao. A low-cost rescheduling policy for efficient
mapping of workflows on grid systems. Scientific Programming, 12(4),
December 2004, pp. 253-262.
[3] Henan Zhao, Rizos Sakellariou. Scheduling Multiple DAGs onto
Heterogeneous Systems. Proceedings of the 15th Heterogeneous Computing
Workshop (HCW'06) (in conjunction with IPDPS 2006), Rhodes, Apr. 2006,
IEEE Computer Society Press.
[4] Henan Zhao, Rizos Sakellariou. Advance Reservation Policies for
Workflows. Proceedings of the 12th Workshop on Job Scheduling Strategies
for Parallel Processing, 2006.
How to schedule? Our model…
A DAG, 10 tasks, 3 machines
(assume we know execution times, communication costs)
(Figure: a DAG with tasks 0–9; edge labels give the communication costs, including a cost of 1000 on the edge from task 4 to task 8.)
Task    M1    M2    M3
0       37    39    27
1       30    20    24
2       21    21    28
3       35    38    31
4       27    24    29
5       29    37    20
6       22    24    30
7       37    26    37
8       35    31    26
9       33    37    21
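For concreteness, the model can be written down in a few lines. The following is a minimal sketch (Python), using the execution-time table above; the DAG's edges are only partly recoverable from the figure, so only the expensive edge between tasks 4 and 8 is reproduced, purely as an illustration.

```python
# Sketch of the model on this slide: 10 tasks, 3 machines, known execution and
# communication costs.  Execution times come from the table above; only the
# 1000-unit edge between tasks 4 and 8 is kept from the figure (illustrative).

exec_time = {            # task -> (time on M1, time on M2, time on M3)
    0: (37, 39, 27), 1: (30, 20, 24), 2: (21, 21, 28), 3: (35, 38, 31),
    4: (27, 24, 29), 5: (29, 37, 20), 6: (22, 24, 30), 7: (37, 26, 37),
    8: (35, 31, 26), 9: (33, 37, 21),
}
comm_cost = {(4, 8): 1000}   # data-transfer cost, paid only if 4 and 8 run on different machines

# The "simple idea" of the next slide: pick the fastest machine for each task.
fastest = {t: min(range(3), key=lambda m: times[m]) for t, times in exec_time.items()}
print(fastest)   # machine index (0 = M1, 1 = M2, 2 = M3) per task
```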
A simple idea…
Assign nodes to the fastest machine!
(Figure: the resulting schedule.)
Makespan is > 1000! Communication between nodes 4 and 8 takes way too long!!!
Heuristics that take into account the whole structure of the DAG are needed…
Still, if we consider the whole DAG…
HEFT – a minor change leads to different schedules (~15%):
(Two Gantt charts comparing the schedules produced by the two variants of the rank function.)
Makespan: 143
Makespan: 164
H. Zhao, R. Sakellariou. An experimental study of the rank function of HEFT. Proceedings of EuroPar’03.
Hmm…
• This was a rather well
defined problem…
• This was just a small
change in the algorithm…
• What about different
heuristics?
• What about more generic
problems?
DAG scheduling: A Hybrid Heuristic
Trying to find out why there were such differences in the outcome of HEFT, we observed problems with the order… To address those problems we came up with a Hybrid Heuristic… it worked quite well!
Phases:
1. Rank (list scheduling)
2. Create groups of independent tasks
3. Schedule independent tasks
• Can be carried out using any scheduling algorithm for independent tasks, e.g. MinMin, MaxMin, …
• A novel heuristic (Balanced Minimum Completion Time)
R.Sakellariou, H.Zhao. A Hybrid Heuristic for DAG Scheduling on
Heterogeneous Systems. Proceedings of the IEEE Heterogeneous
Computing Workshop (HCW 04) , 2004.
An Example
(Figure: an example DAG with tasks 0–9; edge labels give the number of data units communicated between tasks.)
Node    M0    M1    M2
0       17    19    21
1       22    27    23
2       15    15     9
3        4     8     9
4       17    14    20
5       30    27    18
6       17    16    15
7       49    49    46
8       25    22    16
9       23    27    19
Machines    Time for a data unit
M0 – M1     0.9
M1 – M2     1.0
M0 – M2     1.4
An Example
Phase 1: Rank the nodes

Node    Weight    Rank
0       19        149.93
1       24        120.66
2       13         85.6
3        7         84.13
4       17        112.93
5       25         95.39
6       16         58.06
7       16         85.66
8       21         57.93
9       23         23.0

Mean + Upward Ranking Scheme
The order is {0, 1, 4, 5, 7, 2, 3, 6, 8, 9}
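The ranking used here is the familiar mean-based upward rank (as in HEFT): a task's rank is its mean execution time plus the largest sum of mean communication cost and rank over its successors. A minimal sketch follows, with a small hypothetical DAG (the slide's full edge list is not reproduced here).

```python
# Sketch of Phase 1: mean-based upward ranking (HEFT-style list scheduling).
# rank(i) = w(i) + max over successors j of ( c(i, j) + rank(j) ); exit tasks: rank = w(i).
# 'succ', 'w' and 'c' are assumed inputs; the values below are placeholders.
from functools import lru_cache

succ = {0: [1, 2], 1: [3], 2: [3], 3: []}                       # hypothetical DAG
w = {0: 19.0, 1: 24.0, 2: 13.0, 3: 23.0}                        # mean execution times
c = {(0, 1): 14.0, (0, 2): 18.0, (1, 3): 22.0, (2, 3): 13.0}    # mean communication costs

@lru_cache(maxsize=None)
def rank(i):
    if not succ[i]:                              # exit task
        return w[i]
    return w[i] + max(c[(i, j)] + rank(j) for j in succ[i])

order = sorted(succ, key=rank, reverse=True)     # decreasing rank = scheduling order
print([(i, round(rank(i), 2)) for i in order])
```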
An Example
Phase 1: Rank the nodes
Phase 2: Create groups of independent tasks
The order is {0, 1, 4, 5, 7, 2, 3, 6, 8, 9}
(Figure: the DAG with the tasks of each group drawn at the same level.)

Group    Tasks
0        {0}
1        {1, 4, 5}
2        {7, 2, 3}
3        {6, 8}
4        {9}
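Phase 2 simply walks the rank-ordered list and closes the current group whenever the next task depends on something already in it. A sketch follows, with placeholder edges chosen so that it reproduces the grouping shown above (the slide's actual edge list is not fully recoverable).

```python
# Sketch of Phase 2: split the rank-ordered list into groups of mutually
# independent tasks.  A new group starts whenever the next task has a
# predecessor inside the group being built (checking direct parents is enough,
# because parents always precede their children in rank order).
pred = {0: [], 1: [0], 4: [0], 5: [0], 7: [1], 2: [4], 3: [5],
        6: [2], 8: [7], 9: [6, 8]}               # placeholder edges, not the slide's DAG
order = [0, 1, 4, 5, 7, 2, 3, 6, 8, 9]           # from Phase 1

groups, current = [], []
for t in order:
    if any(p in current for p in pred[t]):       # t depends on the current group
        groups.append(current)
        current = []
    current.append(t)
groups.append(current)
print(groups)    # [[0], [1, 4, 5], [7, 2, 3], [6, 8], [9]]
```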
Balanced Minimum Completion Time
Algorithm (BMCT)
Step I:
Assign each task to the machine that gives the
fastest execution time.
Step II:
Find the machine M with the maximal finish
time. Move a task from M to another
machine, if it minimizes the overall makespan.
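A rough sketch of the two BMCT steps for a single group of independent tasks follows. To keep it short it ignores data-arrival constraints from predecessors and only tracks per-machine ready times, so it is an approximation of the algorithm rather than the paper's implementation; the execution times are taken from the tables above, while the ready times are illustrative.

```python
# Sketch of BMCT for one group of independent tasks.
# exec_time[t][m]: execution time of task t on machine m; ready[m]: time machine m is free.
def bmct_group(tasks, exec_time, ready):
    machines = list(ready)
    # Step I: assign each task to its fastest machine
    assign = {t: min(machines, key=lambda m: exec_time[t][m]) for t in tasks}

    def finish_times(assignment):
        ft = dict(ready)
        for t in tasks:
            ft[assignment[t]] += exec_time[t][assignment[t]]
        return ft

    # Step II: keep moving tasks off the machine with the maximal finish time
    # while such a move reduces the overall makespan
    improved = True
    while improved:
        improved = False
        ft = finish_times(assign)
        worst = max(machines, key=lambda m: ft[m])
        for t in (t for t in tasks if assign[t] == worst):
            for m in machines:
                if m == worst:
                    continue
                trial = dict(assign)
                trial[t] = m
                if max(finish_times(trial).values()) < max(ft.values()):
                    assign, improved = trial, True
                    break
            if improved:
                break
    return assign

# Group 1 = {1, 4, 5}, execution times from the table above; ready times illustrative.
print(bmct_group([1, 4, 5],
                 {1: {0: 22, 1: 27, 2: 23}, 4: {0: 17, 1: 14, 2: 20}, 5: {0: 30, 1: 27, 2: 18}},
                 {0: 17, 1: 17, 2: 17}))
```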
An Example (1)
Phase 3: Schedule Independent Tasks in Group 0
Balanced Minimum Completion Time (BMCT)
Initially assign each task in the group to the machine giving the fastest time.
No movement for the entry task.
(Gantt chart: the partial schedule on M0, M1 and M2.)
An Example (2)
Phase 3: Schedule Independent Tasks in Group 1
Initially assign each task in the group to the machine giving the fastest time.
(Gantt chart: tasks 1, 4 and 5 added to the schedule.)
An Example (3)
Phase 3: Schedule Independent Tasks in Group 1
Initially assign each task in the group to the machine giving the fastest time.
M2 is the machine with the Maximal Finish Time (70).
(Gantt chart: the partial schedule after the initial assignment.)
An Example (4)
Phase 3: Schedule Independent Tasks in Group 1
Task 5 moves to M0 since it can achieve an earlier overall finish time.
Now M0 is the machine with the Maximal Finish Time (69).
(Gantt chart: the partial schedule after the move.)
An Example (5)
Phase 3: Schedule Independent Tasks in Group 1
Task 1 moves to M2 since it can achieve an earlier overall finish time.
Now M2 is the machine with the Maximal Finish Time (59).
No task can be moved from M2, the movement stops. Schedule next group.
(Gantt chart: the partial schedule after the move.)
An Example (6)
Phase 3: Schedule Independent Tasks in Group 2
Initially assign each task in this group to the machine giving the fastest time.
(Gantt chart: tasks 7, 2 and 3 added to the schedule.)
An Example (7)
Phase 3: Schedule Independent Tasks in Group 2
Task 2 moves to M1 since it can achieve an earlier overall finish time.
M2 is the machine with the Maximal Finish Time.
No movement from M2. Schedule next group.
(Gantt chart: the partial schedule after the move.)
An Example (8)
Phase 3: Schedule Independent Tasks in Group 3
Initially assign each task in this group to the machine giving the fastest time.
(Gantt chart: tasks 6 and 8 added to the schedule.)
An Example (9)
Phase 3: Schedule Independent Tasks in Group 3
Task 6 moves to M0 since it can achieve an earlier overall finish time.
M2 is the machine with the Maximal Finish Time.
(Gantt chart: the partial schedule after the move.)
An Example (10)
Phase 3: Schedule Independent Tasks in Group 3
Task 8 moves to M1 since it can achieve an earlier overall finish time.
M1 is the machine with the Maximal Finish Time.
No movement from M1. Schedule next group.
(Gantt chart: the partial schedule after the move.)
An Example (11)
Phase 3: Schedule Independent Tasks in Group 4
Initially assign each task in this group to the machine giving the fastest time.
No movement for the exit task.
(Gantt chart: task 9 added to complete the schedule.)
The Final Schedule
(Gantt chart: the complete schedule of tasks 0–9 on M0, M1 and M2.)
Experiments
DAG Scheduling
– Algorithms
  • Hybrid.BMCT (i.e. the algorithm as presented), and
  • Hybrid.MinMin (i.e. MinMin instead of BMCT)
– Applications
  • Randomly generated graphs
  • Laplace
  • FFT
  • Fork-join graphs
– Heterogeneity setting (following an approach by Siegel et al.)
  • Consistent
  • Partially-consistent
  • Inconsistent
Hybrid Heuristic Comparison
(Chart: NSL of Hyb.BMCT, Hyb.MinMin, FCP, DLS, CPOP, HEFT and LMT.)
Random DAGs, 25-100 tasks with inconsistent heterogeneity
Average improvement ~= 25%
Hmm…
• Yes, but, so far, you have
used static task execution
times… in practice such
times are difficult to specify
exactly…
• There is an answer for runtime deviations: adjust at runtime…
• But:
don’t we need to understand
the static case first?
Characterise the Schedule
• Spare time indicates the maximum time that a node,
i, may delay without affecting the start time of an
immediate successor, j.
– A node i with an immediate successor j on the DAG:
spare(i,j) = Start_Time(j) – Data_Arrival_Time(i,j)
– A node i with an immediate successor j on the same
machine: spare(i,j) = Start_Time(j) – Finish_Time(i)
– The minimum of the above: MinSpare for a task.
R.Sakellariou, H.Zhao. A low-cost rescheduling policy for efficient
mapping of workflows on grid systems. Scientific Programming,
12(4), December 2004, pp. 253-262.
Example
DAT(4,7)=40.5, ST(7)=45.5, hence, spare(4,7) = 5
FT(3)=28, ST(5)=29.5, hence, spare(3,5) = 1.5
DAT: Data_Arrival_Time, ST: Start_Time, FT: Finish_Time
Characterise the schedule (cont.)
• Slack indicates the maximum time that a node, i,
may delay without affecting the overall
makespan.
– Slack(i)=min(slack(j)+spare(i,j)), for all successor
nodes j (both on the DAG and the machine)
• The idea: keep track of the values of the slack
and/or the spare time and reschedule only when
the delay exceeds slack…
R.Sakellariou, H.Zhao. A low-cost rescheduling policy for efficient
mapping of workflows on grid systems. Scientific Programming,
12(4), December 2004, pp. 253-262.
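Both metrics are straightforward to compute from a static schedule. The sketch below illustrates them; the names and sample values are illustrative rather than the paper's code, and the two calls at the end reproduce the spare-time numbers from the earlier example.

```python
# Sketch of the spare-time and slack metrics defined above.
# start/finish: scheduled start and finish times; dat[(i, j)]: Data_Arrival_Time for
# a DAG edge i -> j; succ[i] lists both DAG successors and the next task on i's machine.
def spare(i, j, start, finish, dat):
    if (i, j) in dat:                     # j is a DAG successor of i
        return start[j] - dat[(i, j)]
    return start[j] - finish[i]           # j simply follows i on the same machine

def slack(i, succ, start, finish, dat, makespan, memo=None):
    memo = {} if memo is None else memo
    if i not in memo:
        if not succ[i]:                   # exit task
            memo[i] = makespan - finish[i]
        else:
            memo[i] = min(slack(j, succ, start, finish, dat, makespan, memo)
                          + spare(i, j, start, finish, dat) for j in succ[i])
    return memo[i]

# The numbers from the earlier example:
start, finish, dat = {5: 29.5, 7: 45.5}, {3: 28.0}, {(4, 7): 40.5}
print(spare(4, 7, start, finish, dat))    # 5.0  (ST(7) - DAT(4,7))
print(spare(3, 5, start, finish, dat))    # 1.5  (ST(5) - FT(3))
```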
Lessons Learned…
(simulation and deviations of up to 100%)
• Heuristics that perform better statically perform better under uncertainties.
• By using the metrics on spare time, one can provide guarantees for the maximum deviation from the static estimate. Then, we can minimise the number of times we reschedule while still achieving good results.
• Could lead to orders of magnitude improvement with respect to workflow execution using DAGMan (would depend on the workflow; only partly true with Montage…)
Challenges still unanswered…
• What are the representative DAGs (workflows) in
the context of Grid computing?
• Extensive evaluation / analysis (theoretical too) is
needed. Not clear what is the best makespan we can
get (because it is not easy to find the critical path)
• What are the uncertainties involved? How good are
the estimates that we can obtain for the execution
time / communication cost? Performance prediction
is hard…
• How ‘heterogeneous’ are our Grid resources, really?
Moving on… to multiple DAGs
• It is rather idealistic to assume that we have exclusive usage of resources…
• In practice, we may have multiple DAGs
competing for resources at the same time…
Henan Zhao, Rizos Sakellariou. Scheduling Multiple
DAGs onto Heterogeneous Systems. Proceedings of the
15th Heterogeneous Computing Workshop (HCW'06) (in
conjunction with IPDPS 2006), Rhodes, Apr. 2006, IEEE
Computer Society Press.
Scheduling Multiple DAGs:
Approaches
• Approach 1: Schedule one DAG after the other
with existing DAG scheduling algorithms
– Low resource utilization & long overall makespan
• Approach 2: Still one after the other, but do
some backfilling and fill the gaps
– Which DAG to schedule first? The one with longest
makespan or the one with shortest makespan?
• Approach 3: Merge all DAGs into a single,
composite DAG. Much better than Approach 1 or 2.
Example:
Two DAGs to be scheduled together
(Figure: DAG A, with tasks A1–A5, and DAG B, with tasks B1–B7.)
Composition Techniques
• C1: Common Entry and Common Exit Node
(Figure: DAGs A and B joined under a common entry node and above a common exit node.)
Composition Techniques
• C2: Level-Based Ordering
(Figure: the tasks of DAGs A and B merged level by level into one composite DAG.)
Composition Techniques
• C3: Alternate between DAGs… (round
robin between DAGs)…
• Easy!
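One way to read C3: take the (ranked) task lists of the individual DAGs and interleave them round-robin into a single ordering, while the composite DAG keeps each DAG's own edges. A minimal sketch, with task names used purely as labels:

```python
# Sketch of C3: round-robin interleaving of the tasks of several DAGs into one
# composite ordering (each DAG keeps its own precedence edges).
from itertools import zip_longest

def round_robin_merge(dag_orders):
    merged = []
    for batch in zip_longest(*dag_orders):           # one task from each DAG per round
        merged.extend(t for t in batch if t is not None)
    return merged

print(round_robin_merge([["A1", "A2", "A3", "A4", "A5"],
                         ["B1", "B2", "B3", "B4", "B5", "B6", "B7"]]))
# ['A1', 'B1', 'A2', 'B2', 'A3', 'B3', 'A4', 'B4', 'A5', 'B5', 'B6', 'B7']
```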
Composition Techniques
• C4: Ranking-Based Composition (compute a weight for each node and merge accordingly)
(Figure: DAGs A and B annotated with a rank for each node.)
DAG A ranks: A1 50, A2 42, A3 36, A4 20, A5 6
DAG B ranks: B1 200, B2 152, B3 122, B4 140, B5 45, B6 63, B7 13
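A plausible reading of C4, not necessarily the paper's exact procedure: compute a weight for every node within its own DAG (e.g. its upward rank) and merge all nodes into one ordering by decreasing weight; since upward ranks decrease along edges, precedence within each DAG is preserved. The rank values below are loosely based on the slide's example and are only illustrative.

```python
# Sketch of C4: ranking-based composition.  Each node carries a weight computed
# within its own DAG (e.g. its upward rank); nodes from all DAGs are merged into
# a single ordering by decreasing weight.  Rank values are illustrative.
ranks = {
    "A1": 50, "A2": 42, "A3": 36, "A4": 20, "A5": 6,
    "B1": 200, "B2": 152, "B3": 122, "B4": 140, "B5": 45, "B6": 63, "B7": 13,
}
merged_order = sorted(ranks, key=ranks.get, reverse=True)
print(merged_order)
# ['B1', 'B2', 'B4', 'B3', 'B6', 'A1', 'B5', 'A2', 'A3', 'A4', 'B7', 'A5']
```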
But, is makespan optimisation a good
objective when scheduling multiple
DAGs?
Mission: Fairness
In multiple DAGs:
• Users perspective: I want my DAG to complete
execution as soon as possible.
• System perspective: I would like to keep as many
users as possible happy; I would like to increase
resource utilisation.
Let’s be fair to users!
(The system may want to take into account different
levels of quality of service agreed with each user)
Slowdown
• Slowdown: what is the delay that a DAG
would experience as a result of sharing the
resources with other DAGs (as opposed to
having the resources on its own).
Average slowdown for all DAGs: the mean of the individual DAGs’ slowdown values.
Unfairness
• Unfairness indicates, for all DAGs, how
different the slowdown of each DAG is from
the average slowdown (over all DAGs).
• The higher the difference, the higher the
unfairness!
Scheduling for Fairness
• Key idea: at each step (that is, every time a
task is to be scheduled), select the most
affected DAG (that is the DAG with the
highest slowdown value) to schedule.
What is the most affected DAG at any given
point in time?
Fairness Scheduling Policies
• F1: Based on latest Finish Time
– calculates the slowdown value only at the time
the last task that was scheduled for this DAG
finishes.
• F2: Based on Current Time
– re-calculates the slowdown value for every DAG after any task finishes. For tasks that are still running, a proportional share of their time is taken into account when the calculation is carried out.
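Both policies boil down to the same selection rule and differ only in when the slowdown is evaluated. The sketch below uses the common ratio definition of slowdown (shared finish time over exclusive finish time); names and numbers are illustrative, not taken from the paper.

```python
# Sketch of the fairness-driven selection step: always pick the DAG with the
# highest current slowdown (the "most affected" DAG).  F1 evaluates the slowdown
# at the finish time of the DAG's last scheduled task; F2 re-evaluates it whenever
# any task finishes.  Values below are illustrative.
def slowdown(own_makespan, current_finish):
    return current_finish / own_makespan          # 1.0 means no delay from sharing

def pick_most_affected(dags, own_makespan, current_finish):
    return max(dags, key=lambda d: slowdown(own_makespan[d], current_finish[d]))

own = {"A": 100.0, "B": 80.0}      # makespan of each DAG when it has the resources alone
now = {"A": 130.0, "B": 120.0}     # current (estimated) finish time in the shared schedule
print(pick_most_affected(["A", "B"], own, now))   # 'B' (slowdown 1.5 vs 1.3)

# Unfairness, as defined two slides back: total deviation of each DAG's slowdown
# from the average slowdown.
avg = sum(slowdown(own[d], now[d]) for d in own) / len(own)
print(sum(abs(slowdown(own[d], now[d]) - avg) for d in own))   # 0.2
```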
Lessons Learned… Open questions…
• It is possible to achieve reasonably good fairness
without affecting makespan.
• An algorithm with good behaviour in the static
case appears to make things easier in terms of
achieving fairness…
• What is fairness?
• What is the behaviour when run-time changes occur?
• What about different notions of Quality of Service (SLAs, etc.)?
Finally…
• How to automate advance reservations at
the task level for a workflow, when the user
has specified a deadline constraint only for
the whole workflow?
Henan Zhao, Rizos Sakellariou.
Advance Reservation Policies for Workflows.
Proceedings of the 12th Workshop on Job Scheduling
Strategies for Parallel Processing, 2006.
The Schedule
(Gantt chart, shown on the left: the final schedule from the earlier example, tasks 0–9 on M0, M1 and M2.)
The schedule on
the left can be
used to plan
reservations.
However, if one
task fails to finish
within its slot, the
remaining tasks
have to be renegotiated.
What we are looking for is…
The Idea
• 1. Obtain an initial assignment using any DAG
scheduling algorithm (HEFT, HBMCT, …).
• 2. Repeat
• I. Compute the Application Spare Time (= user specified
deadline – DAG finish time).
• II. Distribute the Application Spare Time among the tasks.
• 3. Until the Application Spare Time is below a
threshold.
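A very rough sketch of the loop above follows, closest in spirit to the recursive spare-time allocation strategy on the next slides. The even split, the slot-stretching rule and all the values are assumptions made for illustration only; the paper studies several more careful variants.

```python
# Sketch of the reservation-planning loop: start from an initial schedule (e.g.
# HEFT/HBMCT), then repeatedly compute the Application Spare Time (deadline minus
# DAG finish time) and distribute it among the task slots until it drops below a
# threshold.  The even split used here is an illustrative choice only.
def plan_reservations(slots, deadline, threshold=1.0):
    """slots: {task: [start, end]} from an initial DAG schedule."""
    while True:
        makespan = max(end for _, end in slots.values())
        app_spare = deadline - makespan               # Application Spare Time
        if app_spare < threshold:
            return slots
        share = app_spare / len(slots)                # even share per task (assumption)
        shift = 0.0
        for t in sorted(slots, key=lambda t: slots[t][0]):   # in start-time order
            slots[t][0] += shift                      # push the slot later...
            shift += share
            slots[t][1] += shift                      # ...and lengthen it by its share

# Illustrative slots for three tasks of a chain, with a user deadline of 150:
print(plan_reservations({0: [0.0, 17.0], 1: [19.0, 41.0], 9: [100.0, 123.0]}, 150.0))
# {0: [0.0, 26.0], 1: [28.0, 59.0], 9: [118.0, 150.0]}
```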
Spare Time
• The Spare Time indicates the maximum time that
a node may delay, without affecting the start time
of any of its immediate successor nodes.
– A node i with an immediate successor j on the DAG:
spare(i,j) = Start_Time(j) – Data_Arrival_Time(i,j)
– A node i with an immediate successor j on the same
machine: spare(i,j) = Start_Time(j) – Finish_Time(i)
• The minimum of the above for all immediate
successors is the Spare Time of a task.
• Distributing the Application Spare Time needs to
take care of the inherently present spare time!
Two main strategies
• Recursive spare time allocation:
– The Application Spare Time is divided among
all the tasks.
– This is a repetitive process until the Application
Spare Time is below a threshold.
• Critical Path based allocation:
– Divide the Application Spare Time among the
tasks in the critical path.
– Balance the Spare Time of all the other tasks.
• (a total of 6 variants have been studied)
An Example
Critical Path based allocation
Finally…
Findings…
• Advance reservation of resources for workflows can be
automatically converted to reservations at the task level,
thus improving resource utilization.
• If the deadline set for the DAG is such that there is
enough spare time, then we can reserve resources for each
individual task so that deviations of the same order, for
each task, can be afforded without any problems.
• Advance reservation is known to harm resource
utilization. But this study indicated that if the user is
prepared to pay for full usage when only 60% of the slot
is used, there is no loss for the machine owner.
…which leads to pricing!
• R.Sakellariou, H.Zhao, E.Tsiakkouri, M.Dikaiakos. “Scheduling
workflows under budget constraints”. To appear as a Chapter in a
book with selected papers from the 1st CoreGrid Integration
Workshop.
• The idea:
– Given a specific budget, what is the best
schedule you can obtain for your workflow?
• Multicriteria optimisation is hard!
• Our approach:
– Start from a good solution for one objective, and try to
meet the other!
– It works! How well… difficult to tell!
To summarize…
• Understanding the basic static scenarios and having
robust solutions for those scenarios helps the extension
to more complex cases…
• Pretty much everything here is addressed by heuristics.
Their evaluation requires extensive experimentation: Still:
– No agreement about what DAGs (workflows) look like.
– No agreement about how heterogeneous resources are.
• The problems addressed here are perhaps more related to
what is supposed to be core CS…
• But… we may be talking about lots of work for only
incremental improvements… 10-15%…
• Who cares in Computer Science about
performance improvements in the order of
10-15%???
• (yet, if Gordon Brown were to increase our taxes by 10-15%, everyone would be so unhappy)…
• Oh well…
Thank you!