Workflow Scheduling Optimisation: The case for revisiting DAG scheduling
Rizos Sakellariou and Henan Zhao, University of Manchester

Scientific Analysis
• Construct the Analysis → Workflow Template (workflow evolution)
• Select the Input Data → Workflow Instance
• Map the Workflow onto Available Resources (scheduling) → Executable Workflow
• Execute the Workflow
Slide courtesy: Ewa Deelman, deelman@isi.edu, www.isi.edu/~deelman, pegasus.isi.edu

Execution Environment
[Figure: tasks to be executed are submitted from a local submit host (Condor queue, DAGMan, Pegasus, Condor-G/Condor-C) via GRAM to Grid resources managed by local batch schedulers (PBS, LSF, Condor), with data moved to storage via GridFTP and HTTP.]
Slide courtesy: Ewa Deelman, deelman@isi.edu, www.isi.edu/~deelman, pegasus.isi.edu

In this talk, optimisation relates to performance.

What affects performance?
• Aim: to minimise the execution time of the workflow.
• How? Exploit task parallelism.
• But, even if there is enough parallelism, can the environment guarantee that this parallelism can be exploited to improve performance? No!
• Why? There is interference from the batch job schedulers that are traditionally used to submit jobs to HPC resources!

Example
• The uncertainty of batch schedulers means that any workflow enactment engine must wait for components to complete before beginning to schedule dependent components. This execution model fails to hide the latencies resulting from the length of job queues: these determine the execution time of the workflow.
• Furthermore, it is not clear that parallelism will be fully exploited; e.g., if three tasks that can be executed in parallel are submitted to three different queues of different lengths, there is no guarantee that they will execute in parallel – job queues rule!

Then… try to get rid of the evil job queues!
• Advance reservation of resources has been proposed to make jobs run at a precise time.
• However, resources would be wasted if they were reserved for the whole execution of the workflow.
• Can we automatically make advance reservations for individual tasks?

Assuming that there is no job queue… what affects performance?
• The structure of the workflow:
  – the number of parallel tasks;
  – how long these tasks take to execute.
• The number of resources – typically, much smaller than the parallelism available.
• In addition:
  – there are communication costs;
  – there is heterogeneity;
  – estimating computation + communication is not trivial.

What does all this imply for mapping?
• An order in which tasks will be executed needs to be established (e.g., red, yellow, or blue first?).
• Resources need to be chosen for each task (some resources are fast, some are not so fast!).
• The cost of moving data between resources should not outweigh the benefits of parallelism.

Does the order matter?
• [Figure: a 10-task DAG, tasks 0–9.] If task 6 takes comparatively longer to run, we would like to execute task 2 just after task 0 finishes and before tasks 1, 3, 4 and 5.
• Follow the critical path! Is this new? Not really…

Modelling the problem…
• A workflow is a Directed Acyclic Graph (DAG).
• Scheduling DAGs onto resources is well studied in the context of homogeneous systems – less so in the context of heterogeneous systems (and mostly without taking any uncertainty into account).
• Needless to say, this is an NP-complete problem.
• Are workflows really a general type of DAG or a subclass? We don't really know… (some are clearly not DAGs – only DAGs are considered here…)
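To make the "follow the critical path" intuition concrete, here is a minimal Python sketch of computing the length of the critical path of a DAG whose nodes carry computation costs and whose edges carry communication costs. The DAG, the costs and all names are illustrative assumptions, not taken from the talk.

from functools import lru_cache

succ = {0: [1, 2, 3], 1: [4], 2: [4], 3: [4], 4: []}  # task -> successor tasks
comp = {0: 10, 1: 5, 2: 20, 3: 5, 4: 8}               # estimated computation cost per task
comm = {(0, 1): 2, (0, 2): 3, (0, 3): 2,              # estimated communication cost per edge
        (1, 4): 4, (2, 4): 1, (3, 4): 4}

@lru_cache(maxsize=None)
def cp_length(task):
    """Length of the longest (critical) path starting at `task`."""
    return comp[task] + max((comm[(task, s)] + cp_length(s) for s in succ[task]),
                            default=0)

print(cp_length(0))  # 42: the chain 0 -> 2 -> 4 dominates, so schedule task 2 early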
Our approach…
• Revisit the DAG scheduling problem for heterogeneous systems…
• Start with simple static scenarios…
  – Even this problem is not well understood, despite the fact that perhaps more than 30 heuristics have been published… (check the Heterogeneous Computing Workshop proceedings for a start…)
• Try to build on this as we obtain a good understanding of each step!

Outline
1. Static DAG scheduling onto heterogeneous systems (i.e., we know computation & communication a priori).
2. Introduce uncertainty in computation times.
3. Handle multiple DAGs at the same time.
4. Use the knowledge accumulated above to reserve slots for tasks onto resources.

Based on…
[1] Rizos Sakellariou, Henan Zhao. A Hybrid Heuristic for DAG Scheduling on Heterogeneous Systems. Proceedings of the 13th IEEE Heterogeneous Computing Workshop (HCW'04) (in conjunction with IPDPS 2004), Santa Fe, April 2004. IEEE Computer Society Press, 2004.
[2] Rizos Sakellariou, Henan Zhao. A low-cost rescheduling policy for efficient mapping of workflows on grid systems. Scientific Programming, 12(4), December 2004, pp. 253-262.
[3] Henan Zhao, Rizos Sakellariou. Scheduling Multiple DAGs onto Heterogeneous Systems. Proceedings of the 15th Heterogeneous Computing Workshop (HCW'06) (in conjunction with IPDPS 2006), Rhodes, April 2006. IEEE Computer Society Press.
[4] Henan Zhao, Rizos Sakellariou. Advance Reservation Policies for Workflows. Proceedings of the 12th Workshop on Job Scheduling Strategies for Parallel Processing, 2006.

How to schedule? Our model…
A DAG with 10 tasks and 3 machines; we assume we know execution times and communication costs. [Figure: a 10-task DAG with communication costs on the edges; the edge between tasks 4 and 8 has cost 1000.] Execution times:

Task  M1  M2  M3
0     37  39  27
1     30  20  24
2     21  21  28
3     35  38  31
4     27  24  29
5     29  37  20
6     22  24  30
7     37  26  37
8     35  31  26
9     33  37  21

A simple idea…
• Assign nodes to the fastest machine!
• Makespan is > 1000! Communication between nodes 4 and 8 takes way too long!!!
• Heuristics that take into account the whole structure of the DAG are needed…

Still, if we consider the whole DAG…
HEFT: a minor change leads to different schedules (~15%).
[Figure: two HEFT Gantt charts for the same DAG on three machines (time axis 0–170), produced by different rank functions; makespans 143 and 164.]
H. Zhao, R. Sakellariou. An experimental study of the rank function of HEFT. Proceedings of Euro-Par 2003.

Hmm…
• This was a rather well defined problem…
• This was just a small change in the algorithm…
• What about different heuristics?
• What about more generic problems?

DAG scheduling: A Hybrid Heuristic
Trying to find out why there were such differences in the outcome of HEFT, we observed problems with the order… To address those problems we came up with a Hybrid Heuristic… It worked quite well! Phases:
1. Rank (list scheduling).
2. Create groups of independent tasks (a sketch follows below).
3. Schedule independent tasks:
  • can be carried out using any scheduling algorithm for independent tasks, e.g. MinMin, MaxMin, …
  • a novel heuristic: Balanced Minimum Completion Time (BMCT).
R. Sakellariou, H. Zhao. A Hybrid Heuristic for DAG Scheduling on Heterogeneous Systems. Proceedings of the IEEE Heterogeneous Computing Workshop (HCW'04), 2004.
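The grouping step (Phase 2) is simple enough to sketch. Below is a hedged Python sketch, assuming tasks arrive in descending upward-rank order (a topological order) and that `preds` maps each task to its immediate predecessors; a new group starts whenever a task depends on a task already placed in the current group. The `preds` DAG here is invented, not the one from the example that follows, although for the same order it happens to produce the same groups.

def group_independent(order, preds):
    """Split a rank-ordered task list into groups of mutually independent tasks."""
    groups, current = [], set()
    for t in order:
        if preds[t] & current:          # t has a parent in the current group
            groups.append(current)      # close the group and start a new one
            current = set()
        current.add(t)
    groups.append(current)
    return groups

preds = {0: set(), 1: {0}, 2: {0}, 3: {0}, 4: {0}, 5: {0},
         6: {2, 3}, 7: {1}, 8: {4, 5}, 9: {6, 7, 8}}
order = [0, 1, 4, 5, 7, 2, 3, 6, 8, 9]  # descending-rank order
print(group_independent(order, preds))  # [{0}, {1, 4, 5}, {2, 3, 7}, {8, 6}, {9}]

Since the order is topological and groups are contiguous segments of it, checking only immediate predecessors within the current group is enough: any transitive dependence inside a segment must pass through a task that also lies inside that segment.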
An Example
[Figure: a 10-task DAG, tasks 0–9, with communication weights on the edges.] Execution times and communication costs:

Node  M0  M1  M2
0     17  19  21
1     22  27  23
2     15  15   9
3      4   8   9
4     17  14  20
5     30  27  18
6     17  16  15
7     49  49  46
8     25  22  16
9     23  27  19

Time for a data unit between machines: M0–M1: 0.9, M1–M2: 1.0, M0–M2: 1.4.

An Example – Phase 1: Rank the nodes (mean + upward ranking scheme)

Node  Weight  Rank
0     19      149.93
1     24      120.66
2     13      85.6
3      7      84.13
4     17      112.93
5     25      95.39
6     16      58.06
7     16      85.66
8     21      57.93
9     23      23.0

The order is {0, 1, 4, 5, 7, 2, 3, 6, 8, 9}.

An Example – Phase 2: Create groups of independent tasks
The order {0, 1, 4, 5, 7, 2, 3, 6, 8, 9} gives:

Group  Tasks
0      {0}
1      {1, 4, 5}
2      {7, 2, 3}
3      {6, 8}
4      {9}

Balanced Minimum Completion Time (BMCT) Algorithm
• Step I: Assign each task to the machine that gives the fastest execution time.
• Step II: Find the machine M with the maximal finish time. Move a task from M to another machine, if this minimises the overall makespan.
(A sketch follows below.)

An Example (1) – Phase 3: Schedule independent tasks in Group 0
• Initially assign each task in the group to the machine giving the fastest time.
• No movement for the entry task.

An Example (2) – Phase 3: Schedule independent tasks in Group 1
• Initially assign each task in the group (1, 4, 5) to the machine giving the fastest time.

An Example (3)
• M2 is the machine with the maximal finish time (70).

An Example (4)
• Task 5 moves to M0 since this achieves an earlier overall finish time.
• Now M0 is the machine with the maximal finish time (69).

An Example (5)
• Task 1 moves to M2 since this achieves an earlier overall finish time.
• Now M2 is the machine with the maximal finish time (59).
• No task can be moved from M2; the movement stops. Schedule the next group.
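Before moving to the next group, here is a hedged Python sketch of the BMCT step the example has just walked through. It deliberately ignores communication and data-arrival times, which the full algorithm in [1] does take into account; `exec_time[t][m]` and `ready[m]` (the time at which machine m becomes free) are assumed inputs.

def bmct_group(tasks, machines, exec_time, ready):
    # Step I: assign each task to the machine with the fastest execution time.
    placement = {t: min(machines, key=lambda m: exec_time[t][m]) for t in tasks}

    def finish_times(pl):
        ft = dict(ready)                # time at which each machine becomes free
        for t, m in pl.items():
            ft[m] += exec_time[t][m]
        return ft

    # Step II: while it helps, move a task off the machine that finishes
    # last to whichever machine lowers the overall makespan the most.
    while True:
        ft = finish_times(placement)
        makespan = max(ft.values())
        busiest = max(machines, key=lambda m: ft[m])
        best = None
        for t in [t for t, m in placement.items() if m == busiest]:
            for m2 in machines:
                trial = {**placement, t: m2}
                span = max(finish_times(trial).values())
                if span < makespan:
                    makespan, best = span, trial
        if best is None:
            return placement            # no move improves the makespan
        placement = best

Each iteration strictly decreases the makespan, so the loop terminates.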
An Example (6) – Phase 3: Schedule independent tasks in Group 2
• Initially assign each task in this group (7, 2, 3) to the machine giving the fastest time.

An Example (7)
• Task 2 moves to M1 since this achieves an earlier overall finish time.
• M2 is the machine with the maximal finish time; no movement from M2. Schedule the next group.

An Example (8) – Phase 3: Schedule independent tasks in Group 3
• Initially assign each task in this group (6, 8) to the machine giving the fastest time.

An Example (9)
• Task 6 moves to M0 since this achieves an earlier overall finish time.
• M2 is the machine with the maximal finish time.

An Example (10)
• Task 8 moves to M1 since this achieves an earlier overall finish time.
• M1 is the machine with the maximal finish time; no movement from M1. Schedule the next group.

An Example (11) – Phase 3: Schedule independent tasks in Group 4
• Initially assign each task in this group to the machine giving the fastest time.
• No movement for the exit task (9).

The Final Schedule
[Figure: the final Gantt chart on machines M0–M2, finishing at around 140.]

Experiments
• DAG scheduling algorithms:
  – Hybrid.BMCT (i.e., the algorithm as presented), and
  – Hybrid.MinMin (i.e., MinMin instead of BMCT).
• Applications:
  – randomly generated graphs;
  – Laplace;
  – FFT;
  – fork-join graphs.
• Heterogeneity setting (following an approach by Siegel et al.):
  – consistent;
  – partially consistent;
  – inconsistent.

Hybrid Heuristic Comparison
[Figure: normalised schedule length (NSL) of Hyb.BMCT, Hyb.MinMin, FCP, DLS, CPOP, HEFT and LMT on random DAGs of 25-100 tasks with inconsistent heterogeneity.] Average improvement ~25%.

Hmm…
• Yes, but, so far, you have used static task execution times… In practice, such times are difficult to specify exactly…
• There is an answer for runtime deviations: adjust at runtime…
• But: don't we need to understand the static case first?

Characterise the Schedule
• Spare time indicates the maximum time that a node i may delay without affecting the start time of an immediate successor j:
  – for a node i with an immediate successor j on the DAG: spare(i,j) = Start_Time(j) – Data_Arrival_Time(i,j);
  – for a node i with an immediate successor j on the same machine: spare(i,j) = Start_Time(j) – Finish_Time(i);
  – the minimum of the above is MinSpare for a task.
R. Sakellariou, H. Zhao. A low-cost rescheduling policy for efficient mapping of workflows on grid systems. Scientific Programming, 12(4), December 2004, pp. 253-262.

Example
• DAT(4,7) = 40.5, ST(7) = 45.5; hence spare(4,7) = 5.
• FT(3) = 28, ST(5) = 29.5; hence spare(3,5) = 1.5.
(DAT: Data_Arrival_Time, ST: Start_Time, FT: Finish_Time)

Characterise the Schedule (cont.)
• Slack indicates the maximum time that a node i may delay without affecting the overall makespan:
  – slack(i) = min(slack(j) + spare(i,j)), over all successor nodes j (both on the DAG and on the machine).
• The idea: keep track of the values of the slack and/or the spare time and reschedule only when the delay exceeds the slack… (a sketch follows below)
R. Sakellariou, H. Zhao. A low-cost rescheduling policy for efficient mapping of workflows on grid systems. Scientific Programming, 12(4), December 2004, pp. 253-262.
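A hedged Python sketch of computing these slack values over an existing static schedule, following the definitions above. The inputs (the task set, DAG and machine successor lists, start/finish/data-arrival times) are assumed to be produced by the scheduler; all names are illustrative.

def slack_values(tasks, dag_succ, mach_succ, ST, FT, DAT):
    """slack[i]: maximum delay of task i that leaves the makespan unchanged."""
    slack = {}
    # Visit tasks in decreasing start-time order, so that every successor
    # (on the DAG or on the same machine) is processed before task i itself.
    for i in sorted(tasks, key=lambda t: ST[t], reverse=True):
        spares = [slack[j] + ST[j] - DAT[(i, j)] for j in dag_succ[i]] + \
                 [slack[j] + ST[j] - FT[i] for j in mach_succ[i]]
        slack[i] = min(spares, default=0.0)  # no successors: delaying i delays the makespan
    return slack

Rescheduling is then triggered only when a task's observed delay exceeds its slack.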
Lessons Learned… (simulation, with deviations of up to 100%)
• Heuristics that perform better statically also perform better under uncertainty.
• By using the metrics on spare time, one can provide guarantees for the maximum deviation from the static estimate. Then, we can minimise the number of times we reschedule while still achieving good results.
• This could lead to orders-of-magnitude improvement with respect to workflow execution using DAGMan (it would depend on the workflow; only partly true with Montage…).

Challenges still unanswered…
• What are the representative DAGs (workflows) in the context of Grid computing?
• Extensive evaluation / analysis (theoretical too) is needed. It is not clear what the best makespan we can get is (because it is not easy to find the critical path).
• What are the uncertainties involved? How good are the estimates that we can obtain for the execution time / communication cost? Performance prediction is hard…
• How 'heterogeneous' are our Grid resources, really?

Moving on… to multiple DAGs
• It is rather idealistic to assume that we have exclusive use of the resources…
• In practice, we may have multiple DAGs competing for resources at the same time…
Henan Zhao, Rizos Sakellariou. Scheduling Multiple DAGs onto Heterogeneous Systems. Proceedings of the 15th Heterogeneous Computing Workshop (HCW'06) (in conjunction with IPDPS 2006), Rhodes, April 2006. IEEE Computer Society Press.

Scheduling Multiple DAGs: Approaches
• Approach 1: Schedule one DAG after the other with existing DAG scheduling algorithms.
  – Low resource utilisation & long overall makespan.
• Approach 2: Still one after the other, but do some backfilling and fill the gaps.
  – Which DAG to schedule first? The one with the longest makespan or the one with the shortest makespan?
• Approach 3: Merge all DAGs into a single, composite DAG. Much better than Approach 1 or 2.

Example: Two DAGs to be scheduled together
[Figure: DAG A with tasks A1–A5 and DAG B with tasks B1–B7.]

Composition Techniques
• C1: Common entry and common exit node. [Figure: the two DAGs joined under a common entry node and above a common exit node.]
• C2: Level-based ordering. [Figure: tasks of the two DAGs interleaved level by level.]
• C3: Alternate between DAGs (round robin between DAGs)… Easy!
• C4: Ranking-based composition (compute a weight for each node and merge accordingly). [Figure: the two example DAGs annotated with node weights (e.g. A1: 50, A2: 42, A3: 36, A4: 20, A5: 6) and the ranks derived from them.]

But, is makespan optimisation a good objective when scheduling multiple DAGs?

Mission: Fairness
In multiple DAGs:
• The user's perspective: I want my DAG to complete execution as soon as possible.
• The system's perspective: I would like to keep as many users as possible happy; I would like to increase resource utilisation.
Let's be fair to users! (The system may want to take into account different levels of quality of service agreed with each user.)

Slowdown
• Slowdown: the delay that a DAG would experience as a result of sharing the resources with other DAGs (as opposed to having the resources on its own).
• The average slowdown is then taken over all DAGs.

Unfairness
• Unfairness indicates, for all DAGs, how different the slowdown of each DAG is from the average slowdown (over all DAGs).
• The higher the difference, the higher the unfairness! (A sketch of both metrics follows below.)

Scheduling for Fairness
• Key idea: at each step (that is, every time a task is to be scheduled), select the most affected DAG (that is, the DAG with the highest slowdown value) to schedule.
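The slides state the two metrics in words only; here is one natural formalisation as a hedged Python sketch (the exact definitions in [3] may differ in detail). `shared_finish[d]` is the finish time of DAG d when it shares the resources; `own_makespan[d]` is its makespan with the resources to itself.

def slowdown(shared_finish, own_makespan):
    """Per-DAG slowdown: 1.0 means no delay from sharing."""
    return {d: shared_finish[d] / own_makespan[d] for d in shared_finish}

def unfairness(shared_finish, own_makespan):
    """Total deviation of each DAG's slowdown from the average slowdown."""
    s = slowdown(shared_finish, own_makespan)
    avg = sum(s.values()) / len(s)
    return sum(abs(v - avg) for v in s.values())  # 0.0 = perfectly fair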
What is the most affected DAG at any given point in time?

Fairness Scheduling Policies
• F1: Based on latest finish time – calculates the slowdown value only at the time the last task that was scheduled for this DAG finishes.
• F2: Based on current time – re-calculates the slowdown value for every DAG after any task finishes. For tasks still running, a proportion of their time is taken into account when the calculation is carried out.

Lessons Learned… Open Questions…
• It is possible to achieve reasonably good fairness without affecting the makespan.
• An algorithm with good behaviour in the static case appears to make things easier in terms of achieving fairness…
• What is fairness?
• What is the behaviour when run-time changes occur?
• What about different notions of Quality of Service (SLAs, etc.)?

Finally…
• How to automate advance reservations at the task level for a workflow, when the user has specified a deadline constraint only for the whole workflow?
Henan Zhao, Rizos Sakellariou. Advance Reservation Policies for Workflows. Proceedings of the 12th Workshop on Job Scheduling Strategies for Parallel Processing, 2006.

The Schedule
[Figure: the Gantt chart of the earlier final schedule on machines M0–M2.] The schedule on the left can be used to plan reservations. However, if one task fails to finish within its slot, the remaining tasks have to be renegotiated. What we are looking for is…

The Idea
1. Obtain an initial assignment using any DAG scheduling algorithm (HEFT, HBMCT, …).
2. Repeat:
   I. Compute the Application Spare Time (= user-specified deadline – DAG finish time).
   II. Distribute the Application Spare Time among the tasks.
3. Until the Application Spare Time is below a threshold.
(A sketch of this loop follows below.)

Spare Time
• The spare time indicates the maximum time that a node may delay without affecting the start time of any of its immediate successor nodes:
  – for a node i with an immediate successor j on the DAG: spare(i,j) = Start_Time(j) – Data_Arrival_Time(i,j);
  – for a node i with an immediate successor j on the same machine: spare(i,j) = Start_Time(j) – Finish_Time(i).
• The minimum of the above over all immediate successors is the spare time of a task.
• Distributing the Application Spare Time needs to take care of the inherently present spare time!

Two main strategies
• Recursive spare time allocation:
  – the Application Spare Time is divided among all the tasks;
  – this is a repetitive process until the Application Spare Time is below a threshold.
• Critical-path-based allocation:
  – divide the Application Spare Time among the tasks on the critical path;
  – balance the spare time of all the other tasks.
• (A total of 6 variants have been studied.)

An Example
[Figure: critical-path-based allocation applied to the example schedule.]

Finally… Findings…
• Advance reservation of resources for workflows can be automatically converted into reservations at the task level, thus improving resource utilisation.
• If the deadline set for the DAG is such that there is enough spare time, then we can reserve resources for each individual task so that deviations of the same order, for each task, can be afforded without any problems.
• Advance reservation is known to harm resource utilisation. But this study indicated that if the user is prepared to pay for full usage when only 60% of the slot is used, there is no loss for the machine owner.
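As flagged under "The Idea" above, a minimal Python sketch of the recursive spare-time allocation loop, under a strong simplifying assumption: the tasks form a single chain on one resource and the spare time is split equally, which is only one of the six variants. The real policies in [4] also respect dependences across machines and the inherently present spare time.

def allocate_spare_time(order, dur, deadline, threshold=1.0):
    """order: tasks in execution order (assumed: one chain on one resource);
    dur[t]: initially reserved slot length of task t (from a DAG schedule)."""
    while True:
        app_spare = deadline - sum(dur.values())  # Application Spare Time
        if app_spare < threshold:
            break
        share = app_spare / len(order)            # equal split (one variant only)
        for t in order:
            dur[t] += share                       # widen every slot
    # Turn the widened durations into concrete back-to-back reservations.
    slots, clock = {}, 0.0
    for t in order:
        slots[t] = (clock, clock + dur[t])
        clock += dur[t]
    return slots

On a single chain the loop converges after one pass; with real DAGs the distribution interacts with each task's inherent spare time, hence the repeat-until structure.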
…which leads to pricing!
• R. Sakellariou, H. Zhao, E. Tsiakkouri, M. Dikaiakos. "Scheduling workflows under budget constraints". To appear as a chapter in a book with selected papers from the 1st CoreGRID Integration Workshop.
• The idea: given a specific budget, what is the best schedule you can obtain for your workflow?
• Multicriteria optimisation is hard!
• Our approach: start from a good solution for one objective, and try to meet the other! It works! How well… difficult to tell!

To summarise…
• Understanding the basic static scenarios and having robust solutions for those scenarios helps the extension to more complex cases…
• Pretty much everything here is addressed by heuristics. Their evaluation requires extensive experimentation. Still:
  – no agreement about what DAGs (workflows) look like;
  – no agreement about how heterogeneous resources are.
• The problems addressed here are perhaps more related to what is supposed to be core CS…
• But… we may be talking about lots of work for only incremental improvements… 10-15%…
• Who cares in Computer Science about performance improvements in the order of 10-15%???
• (Yet, if Gordon Brown were to increase our taxes by 10-15%, everyone would be so unhappy…)
• Oh well…

Thank you!