Scheduling Bag-of-Tasks Applications to Optimize Computation Time and Cost

Anastasia Grekioti, Natalia V. Shakhlevich
School of Computing, University of Leeds, Leeds LS2 9JT, U.K.

Abstract

Bag-of-tasks applications consist of independent tasks that can be performed in parallel. Although such problems are well known in classical scheduling theory, the distinctive feature of Grid and cloud applications is the importance of the cost factor: in addition to the traditional scheduling criterion of minimizing computation time, in Grids and clouds it is also important to minimize the cost of using resources. We study the structural properties of the time/cost model and explore how existing scheduling techniques can be extended to handle the additional cost criterion. Due to the dynamic nature of distributed systems, one of the major requirements for scheduling algorithms is speed. The heuristics we propose are fast and, as we show in our experiments, they compare favourably with the existing scheduling algorithms for distributed systems.

Keywords: scheduling, bag-of-tasks applications, heuristics

1 Introduction

An important feature of the Grid and cloud infrastructure is the need to deliver the required Quality of Service to its users. Quality of Service is the ability of an application to obtain some level of assurance that users' requirements can be satisfied. It can be viewed as an agreement between a user and a resource provider to perform an application within a guaranteed time frame at a pre-specified cost. As a rule, the higher the cost paid by the user, the faster resources can be allocated and the shorter the execution time a resource provider can guarantee. While time optimization is the traditional criterion in scheduling theory, cost optimization is a new criterion arising in Grid and cloud applications. In fact, the cost and time factors are at the centre of resource provision on a pay-as-you-go basis.

The problem we study arises in the context of Bag-of-Tasks (BoT) applications. It can be seen as the problem of scheduling a set of independent tasks on multiple processors to optimize computation time and cost. Formally, a BoT application consists of n tasks {1, 2, . . . , n} with given processing requirements τj, j = 1, . . . , n, which have to be executed on m uniform processors Pk, k = 1, . . . , m, with speed and cost characteristics sk and ck. For a BoT application a deadline D is given, so that all tasks of the application should be completed by time D. Each processor may handle only one task at a time. If task j is assigned to processor Pk, then it should be completed without preemption; the time required for its completion is τj/sk and the cost of using processor Pk is ck × τj/sk, or equivalently c̄k τj, where ck is the cost of using processor Pk per second, while c̄k = ck/sk is the cost of using processor Pk per million instructions. To distinguish between the two types of cost parameters, we refer to the given cost values ck as absolute costs and to the adjusted values c̄k = ck/sk as relative costs.

If 0-1 variables xkj represent the allocation of tasks j, 1 ≤ j ≤ n, to processors Pk, k = 1, . . . , m, then the makespan Cmax of the schedule and its cost K are found as

\[
C_{\max} = \max_{k=1,\dots,m} \sum_{j=1}^{n} \frac{\tau_j}{s_k}\, x_{kj},
\qquad
K = \sum_{k=1}^{m} \bar{c}_k \sum_{j=1}^{n} \tau_j x_{kj}.
\]

The problem consists in allocating tasks to processors so that the execution of the whole application is completed before the deadline D and the cost K is minimum.
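To make the two formulas concrete, the following sketch evaluates a given allocation (Python is used here purely for illustration; the function name evaluate, the data layout and the example numbers are ours and are not part of the model).

```python
# Sketch: evaluating a task-to-processor allocation of a BoT application.
# tau[j]   - processing requirement of task j (millions of instructions)
# s[k]     - speed of processor P_k, c[k] - absolute cost of P_k per second
# alloc[j] = k means task j is assigned to processor P_k
# (illustrative names; not taken from the paper)

def evaluate(tau, s, c, alloc):
    m = len(s)
    cbar = [c[k] / s[k] for k in range(m)]           # relative costs
    load = [0.0] * m                                  # total requirement per processor
    for j, k in enumerate(alloc):
        load[k] += tau[j]
    cmax = max(load[k] / s[k] for k in range(m))      # makespan C_max
    cost = sum(cbar[k] * load[k] for k in range(m))   # total cost K
    return cmax, cost

# Hypothetical example: 3 tasks, 2 processors, tasks 1 and 2 on P1, task 3 on P2.
print(evaluate(tau=[6, 4, 3], s=[2, 1], c=[3, 1], alloc=[0, 0, 1]))   # -> (5.0, 18.0)
```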
Adopting the standard three-field scheduling notation, we denote the problem by Q|Cmax ≤ D|K, where Q in the first field classifies the processors as uniform, having different speeds, Cmax ≤ D in the second field specifies that the BoT application should be completed by time D, and K in the third field specifies the cost minimization criterion. Note that a successful cost-optimization algorithm for problem Q|Cmax ≤ D|K is also useful for solving the bicriteria version of our problem, Q||(Cmax, K). Such an algorithm, if applied repeatedly to a series of deadline-constrained problems with different values of the deadline D, produces a time/cost trade-off which can be used by a decision maker to optimize the two objectives, computation time Cmax and computation cost K.

In this study, we explore the structural properties of the time/cost model. Based on these properties, we identify the most appropriate scheduling techniques developed for makespan minimization and adapt them for cost optimization. In our computational experiments we show that our algorithms compare favourably with those known in the area of distributed computing that take into account processor costs, namely [1, 2, 7, 11].

2 Heuristics

Scheduling algorithms for distributed systems are often developed without recognizing the link with related results from scheduling theory. In particular, the existing time/cost optimization algorithms do not make use of such scheduling approaches as the Longest Processing Time (LPT) rule and the First Fit Decreasing (FFD) strategy. Both approaches have been a subject of intensive research in the scheduling literature and they are recognized as the most successful approximation algorithms for solving problem Q|Cmax ≤ D|−, the version of our problem which does not take into account processor costs; see, e.g., [3, 4, 8]. Both algorithms, LPT and FFD, prioritize the tasks in accordance with their processing requirements, giving preference to the longest tasks first. In what follows we assume that the tasks are numbered in the LPT order, i.e., in non-increasing order of their processing requirements:

τ1 ≥ τ2 ≥ · · · ≥ τn.    (1)

We adapt the LPT and FFD strategies to operate on a selected subset of processors, taking into account the deadline D.

Algorithm 'Deadline Constrained LPT'

1. Select unscheduled tasks one by one from the LPT list (1).
2. Assign the current task to the processor, selected from the given subset of processors, on which the task has the earliest completion time.
3. If the allocation of a task violates the deadline D, keep the task unscheduled.

The FFD algorithm, initially formulated in the context of bin packing, operates with a fixed ordering of processors (usually the order of their numbering) and with a list of tasks in the LPT order (1). It differs from 'Deadline Constrained LPT' in Step 2, which for FFD takes the form:

2. Assign the current task to the first processor on the list for which deadline D is not violated.
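A possible implementation of 'Deadline Constrained LPT' on a given subset of processors is sketched below (Python, for illustration only; identifiers such as deadline_constrained_lpt, procs and finish are ours). Replacing the earliest-completion-time choice by "the first processor in a fixed order for which D is not violated" yields the FFD variant described above.

```python
# Sketch of 'Deadline Constrained LPT' on a chosen subset of processors.
# tau: processing requirements, assumed already sorted in LPT order (non-increasing);
# s: processor speeds; D: deadline; procs: indices of the selected processors.
# Returns the assignment (task index -> processor index or None) and processor loads in time units.

def deadline_constrained_lpt(tau, s, D, procs):
    finish = {k: 0.0 for k in procs}     # current completion time on each selected processor
    alloc = [None] * len(tau)
    for j, t in enumerate(tau):          # tasks in LPT order
        # processor of the subset on which task j would complete earliest
        best = min(procs, key=lambda k: finish[k] + t / s[k])
        if finish[best] + t / s[best] <= D:
            finish[best] += t / s[best]
            alloc[j] = best
        # otherwise the task stays unscheduled
    return alloc, finish

# Hypothetical example: four tasks in LPT order, deadline 6, two processors selected.
print(deadline_constrained_lpt(tau=[6, 5, 4, 3], s=[2, 1, 1], D=6.0, procs=[0, 1]))
```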
The above two rules are not tailored to handle processor costs. In order to adjust them for solving the cost minimization problem Q|Cmax ≤ D|K, we exploit the properties of the divisible load version of that problem, in which preemption is allowed and any task can be performed by several processors simultaneously. As shown in [9], there exists an optimal solution to the divisible load problem in which processors are prioritized in accordance with their relative costs. If processors are renumbered so that

c̄1 ≤ c̄2 ≤ · · · ≤ c̄m,    (2)

then in an optimal solution to the divisible load problem, the processors {P1, P2, . . . , Pu−1} with the smallest relative costs c̄i are fully occupied until time D, one processor Pu with a higher cost can be partly occupied, while the remaining, most expensive processors {Pu+1, . . . , Pm} are free. We call such a schedule a box-schedule, emphasizing the equal load of the cheapest processors.

In order to solve the non-preemptive version of the problem Q|Cmax ≤ D|K, which corresponds to scheduling a BoT application, we aim to replicate the shape of the optimal box-schedule as closely as possible, achieving the maximum load of the cheapest processors in the first place. The required adjustment of the FFD algorithm is easily achieved if processors are renumbered by (2). We call the resulting algorithm 'FFD Cost' (for 'FFD with Processors Prioritized by Cost').

For adjusting the LPT strategy, notice that it naturally achieves a balanced processor allocation, which in our case is needed only for a subset of the cheapest processors. There are several ways of identifying that subset. The two algorithms presented below are based on two different principles of processor selection: in the first algorithm, we group processors with equal relative cost and apply the LPT strategy to each group; in the second algorithm, we consider one group of the r cheapest processors {P1, . . . , Pr}, loading them as much as possible by the LPT strategy, while the remaining processors {Pr+1, . . . , Pm} are loaded one by one with the unscheduled tasks. The solution of the smallest cost is selected among those constructed for different values of r, 1 ≤ r ≤ m.

Algorithm 'LPT Groups' (LPT with Groups of Equivalent Processors)

1. Form groups G1, . . . , Gg of processors, putting in one group processors of the same relative cost c̄i. Keep the group numbering consistent with the processor numbering (2), so that the groups with the smallest indices contain the cheapest processors.
2. Consider the groups one by one in the order of their numbering. For each group Gi, apply 'Deadline Constrained LPT' until no more tasks can be allocated to that group.

Algorithm 'LPT One Group' (LPT with One Group of Cheapest Processors)

1. For r = 1, . . . , m,
   (a) Select the r cheapest processors and apply to them 'Deadline Constrained LPT'.
   (b) For the unscheduled tasks and the remaining processors {Pr+1, . . . , Pm}, apply 'FFD with Processors Prioritized by Cost'.
2. Select the value of r which delivers a feasible schedule of minimum cost and output that schedule.

Since the LPT and FFD strategies have comparable performance in practice, we introduce a counterpart of the latter algorithm with the FFD strategy applied to the r cheapest processors. In addition, we use the result from [8] which suggests that the performance of FFD can be improved if the processors in the group are ordered from slowest to fastest.

Algorithm 'FFD One Group' (FFD with One Group of Cheapest Processors)

1. For r = 1, . . . , m,
   (a) Select the r cheapest processors and renumber them in non-decreasing order of their speeds. For the selected processors, apply classical FFD with the processors considered in the order of their numbering.
   (b) For the unscheduled tasks and the remaining processors {Pr+1, . . . , Pm}, apply 'FFD with Processors Prioritized by Cost'.
2. Select the value of r which delivers a feasible schedule of minimum cost and output that schedule.

The above heuristics are designed as modifications of time-optimization algorithms.
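The sketch below illustrates how 'LPT One Group' can combine the two building blocks described above: deadline-constrained LPT on the r cheapest processors followed by 'FFD Cost' on the remaining ones (Python, for illustration only; all identifiers are ours, processors are assumed pre-sorted by relative cost (2) and tasks by LPT order (1), and ties are broken in the simplest way).

```python
# Sketch of 'LPT One Group': for every r, fill the r cheapest processors by
# deadline-constrained LPT, place the leftovers by first fit decreasing on the
# remaining (cheap-first) processors, and keep the cheapest feasible schedule.

def lpt_one_group(tau, s, cbar, D):
    n, m = len(tau), len(s)

    def lpt(procs, alloc, finish):                 # deadline-constrained LPT on a subset
        for j in range(n):
            if alloc[j] is not None:
                continue
            best = min(procs, key=lambda k: finish[k] + tau[j] / s[k])
            if finish[best] + tau[j] / s[best] <= D:
                finish[best] += tau[j] / s[best]
                alloc[j] = best

    def ffd(procs, alloc, finish):                 # 'FFD Cost' on a subset, cheap-first order
        for j in range(n):
            if alloc[j] is not None:
                continue
            for k in procs:
                if finish[k] + tau[j] / s[k] <= D:
                    finish[k] += tau[j] / s[k]
                    alloc[j] = k
                    break

    best_cost, best_alloc = float("inf"), None
    for r in range(1, m + 1):
        alloc, finish = [None] * n, [0.0] * m
        lpt(list(range(r)), alloc, finish)          # the r cheapest processors
        ffd(list(range(r, m)), alloc, finish)       # the remaining processors
        if None not in alloc:                       # deadline-feasible schedule found
            cost = sum(cbar[alloc[j]] * tau[j] for j in range(n))
            if cost < best_cost:
                best_cost, best_alloc = cost, alloc
    return best_cost, best_alloc
```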
Below we propose one more heuristic, based on the cost-optimization algorithm from [10]. It should be noted that, unlike most scheduling algorithms, which are designed for cost functions depending on task completion times, the algorithm from [10] addresses the model with processor costs. The approximation algorithm developed in [10] is aimed at solving a more general problem with unrelated parallel machines. This is achieved by imposing restrictions on the allocation of long tasks. Depending on the value of a selected parameter t, a task is allowed to be assigned to a processor only if the associated processing time of the task on that processor does not exceed t. We adopt this strategy in the algorithm presented below.

Algorithm 'FFD Long Tasks Restr.' (FFD with Restrictive Allocation of Long Tasks)

1. For each combination of k ∈ {1, . . . , m} and ℓ ∈ {1, . . . , n}, repeat (a)-(c):
   (a) Define the parameter t = τℓ/sk, which is used to classify the allocation of task j to processor i as "small" or "large" depending on whether τj/si ≤ t or not.
   (b) Consider the processors one by one in the order of their numbering (2). Repeat for each processor Pi: allocate to Pi as many "small" unscheduled tasks from the LPT list (1) as possible, without violating the deadline D.
   (c) If all tasks are allocated, calculate the cost Kt of the generated schedule; otherwise set Kt = ∞.
2. Set K = min{ Kt : t ∈ {τℓ/sk} }.
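A compact sketch of this procedure is given below (Python, for illustration only; identifiers are ours). Iterating over the set of distinct threshold values t = τℓ/sk is equivalent to iterating over all combinations of k and ℓ in Step 1.

```python
# Sketch of 'FFD Long Tasks Restr.': for each candidate threshold t, forbid "large"
# allocations (tau[j]/s[i] > t) and fill the processors one by one, in the cheap-first
# order (2), with as many remaining LPT-ordered tasks as fit before the deadline D.
# Processors are assumed pre-sorted by relative cost, tasks by LPT order (1).

def ffd_long_tasks_restricted(tau, s, cbar, D):
    n, m = len(tau), len(s)
    best_cost = float("inf")
    for t in {tau[l] / s[k] for l in range(n) for k in range(m)}:   # candidate thresholds
        alloc, finish = [None] * n, [0.0] * m
        for i in range(m):                         # processors in cheap-first order
            for j in range(n):                     # tasks in LPT order
                if alloc[j] is None and tau[j] / s[i] <= t \
                        and finish[i] + tau[j] / s[i] <= D:
                    finish[i] += tau[j] / s[i]
                    alloc[j] = i
        if None not in alloc:                      # schedule is deadline feasible
            cost = sum(cbar[alloc[j]] * tau[j] for j in range(n))
            best_cost = min(best_cost, cost)
    return best_cost
```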
To conclude, we observe that the proposed heuristics are fast (their time complexity ranges from O(mn) to O(m²n²), assuming n ≥ m) and are therefore suitable for practical applications. In the next section we evaluate the performance of our heuristics empirically, comparing them with the known cost-optimization algorithms.

3 Computational Experiments

We have performed extensive computational experiments in order to evaluate the performance of our algorithms and to compare them with the published ones. Fig. 1 represents the situation we have observed in most experiments.

[Figure 1: Algorithm Performance]

The experiments are based on two data sets for processors, denoted by PrI and PrII, and two types of data sets for BoT applications consisting of n tasks, n ∈ {25, 50}. Combining the data sets for processors with those for tasks, we generate four classes of instances, denoted by PrI-25, PrI-50, PrII-25 and PrII-50. It should be noted that we have also explored the performance of the algorithms on larger instances. While the overall behaviour is similar, the differences in the performance of the algorithms become less noticeable. This observation is in agreement with the theoretical result on the asymptotic optimality of the algorithms we propose. A recent study [5] of the workloads of over fifteen grids between 2003 and 2010 reports that the average number of tasks per BoT is between 2 and 70, while in most grids the average is between 5 and 20. In order to analyse the distinctive features of the algorithms, in what follows we focus on instances with n ∈ {25, 50}.

Data set PrI for processors uses the characteristics of m = 11 processors from [1]; data set PrII uses the characteristics of m = 8 processors from [6], which correspond to US-West Amazon EC2. The data sets for tasks are generated in the same fashion for n = 25 and n = 50. For each value of n, 100 data sets have been created; in each data set the processing requirements of the tasks are random integers sampled from a discrete uniform distribution, as in [1, 2], so that τj ∈ [1 000, 10 000] for j = 1, . . . , n.

The generated processors' data sets and tasks' data sets are then combined to produce the four classes of instances, each class containing 100 tasks' data sets. Using the data sets for processors and tasks, the instances are produced by considering each combination of data sets and introducing a series of deadlines D for each such combination. The range of deadlines for each combination is defined empirically, making sure that the smallest deadline is feasible (i.e., there exists a feasible schedule for it) and that the largest deadline does not exceed the completion time of a single processor to which all tasks are allocated. Since the deadline ranges differ, the actual number of instances for each combination of processors' and tasks' data sets varies greatly: there are 1579 instances in class PrI-25, 3093 instances in PrI-50, 861 instances in PrII-25 and 1645 instances in PrII-50.

The comparison is done with the following algorithms known in the context of Grid scheduling:

- 'DBC-cA' for the version of the 'Deadline and Budget Constrained (DBC) cost optimization' algorithm by Buyya & Murshed [1] which uses processors' absolute costs,
- 'DBC-cR' for the version of the previous algorithm which uses processors' relative costs,
- 'DBC-ctA' for the version of the 'Deadline and Budget Constrained (DBC) cost-time optimization' algorithm by Buyya et al. [2] which uses processors' absolute costs,
- 'DBC-ctR' for the version of the previous algorithm which uses processors' relative costs,
- 'HRED' for the algorithm 'Highest Rank Earliest Deadline' by Kumar et al. [7],
- 'SFTCO' for the algorithm 'Stage Focused Time-Cost Optimization' by Sonmez & Gursoy [11].

We have implemented all algorithms in Free Pascal 2.6.2 and tested them on a computer with an Intel Core 2 Duo E8400 processor running at 3.00 GHz with 6 GB of memory. All heuristics are sufficiently fast to be applied in practice. Among the heuristics we propose, the fastest two are 'LPT Groups' and 'FFD Cost', which solve instances with 50 tasks in about 0.02 s; the actual running time of heuristics 'LPT One Group' and 'FFD One Group' does not exceed 0.41 s for all instances, while 'FFD Long Tasks Restr.' has the highest time complexity and takes up to 4.28 s on large instances. As far as the existing algorithms are concerned, the five heuristics 'DBC-cA', 'DBC-ctA', 'DBC-cR', 'DBC-ctR' and 'SFTCO' are comparable with our two fastest heuristics, producing results in 0.01-0.02 s; the running time of 'HRED' is 1-1.45 s.

The summary of the results is presented in Tables 1-3. In each table, the best results for each class of instances are highlighted in bold. In Table 1 we present the percentage of instances for which each algorithm produces no-worse solutions (of a lower or equal cost) and the percentage of instances for which each algorithm produces strictly better solutions (of a strictly lower cost) in comparison with every other algorithm; the latter percentage is given in parentheses. For each class of instances, the best performance among the known algorithms and among the new algorithms is shown in bold. Notice that the best results are achieved by algorithms 'LPT One Group' and 'FFD Long Tasks Restr.'. While Table 1 indicates that the algorithms we propose perform better in general, the actual entries in the table do not reflect the extent to which the new algorithms are superior to the known ones. A general comparison of the two groups of algorithms, the known and the new ones, is given in Table 2.
In the first column of Table 2 we list the four classes of instances considered in the experiments. In the second column we provide the percentage of instances in each class for which the best solution (in terms of the strictly lower cost) is found by one of the existing algorithms. In the third column we provide the percentage of instances in each class for which the best solution is found by one of the new algorithms. The last column specifies the percentage of instances for which the best solutions found by the existing algorithms and by the new ones have the same cost. It is easy to see that the algorithms we propose outperform the existing ones, producing strictly lower cost schedules in more than 85% of instances.

Special attention should be given to instances with tight deadlines. Our experiments show that the new algorithms fail to produce feasible schedules meeting a given deadline in less than 0.46% of instances in all classes, while for the existing algorithms the percentage of failures ranges between 0.43% and 100%.

Table 1: Percentage of instances for which algorithms produce no-worse solutions (strictly better solutions in parentheses)

Grid Scheduling Algorithms     PrI-25        PrI-50        PrII-25        PrII-50
DBC-ctA [2]                    2.44 (0)      1.6 (0)       0 (0)          0 (0)
DBC-ctR [2]                    2.44 (0)      1.6 (0)       6.86 (0)       5.2 (0)
DBC-cA [1]                     4.72 (0)      3.74 (0)      0 (0)          0 (0)
DBC-cR [1]                     4.72 (0)      3.74 (0)      7.32 (0)       5.78 (0.06)
HRED [7]                       1.29 (0.07)   0.71 (0.05)   6.71 (0)       4.45 (0)
SFTCO [11]                     0 (0)         0 (0)         0 (0)          0 (0)

Proposed Algorithms            PrI-25        PrI-50        PrII-25        PrII-50
FFD One Group                  18.9 (5.41)   18.16 (4.62)  12.73 (2.36)   9.99 (0.92)
LPT One Group                  23.01 (2.88)  26.25 (4)     14.63 (2.29)   13.75 (2.37)
LPT Groups                     11.99 (1.06)  12.42 (0.85)  9.68 (0)       8.26 (0.06)
FFD Cost                       13.49 (0)     13.54 (0)     10.37 (0)      9.07 (0)
FFD Long Tasks Restr.          17.01 (3.5)   18.25 (4.71)  31.71 (19.59)  43.5 (33.68)

Table 2: Percentage of strictly lower cost solutions and ties for existing and proposed algorithms

          Lowest Cost (Existing)   Lowest Cost (Proposed)   Ties
PrI-25    2.91                     85.12                    11.97
PrI-50    4.75                     88.78                    6.47
PrII-25   9.76                     88.50                    1.74
PrII-50   4.98                     94.41                    0.61

Our next table (Table 3) evaluates the quality of the solutions produced by the various algorithms. It is based on the values of the relative deviation θ = 100 × (K − LB)/LB, in percentage terms, of the cost K of heuristic solutions from the lower bound value LB. For each instance, the LB-value is found as the cost of the box-schedule optimal for the relaxed model with preemption allowed. The maximum and the average deviations are then calculated over all instances in each class, reflecting the worst-case and the average performance of each algorithm. The best results for each class of instances are highlighted in bold for the known and the new algorithms. It should be noted that the entries in the last two columns for algorithm SFTCO are empty, as that algorithm failed to produce deadline-feasible solutions for all instances of classes PrII-25 and PrII-50. The algorithms we propose have an average percentage deviation from the lower bound between 0.03% and 1.09%, while the same figures for the known algorithms are between 0.58% and 162%. The worst-case performance of the algorithms, measured by the maximum percentage deviation from the lower bound, is between 0.31% and 5.74% for our algorithms and between 4.16% and 287.4% for the known ones. Thus in all scenarios our algorithms outperform the known Grid scheduling heuristics.
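The lower bound LB and the deviation θ used in Table 3 follow directly from the structure of the box-schedule for the preemptive relaxation: the cheapest processors, in the ordering (2), are filled up to the deadline D and at most one further processor is partly loaded. A minimal sketch is given below (Python, for illustration only; identifiers are ours and processors are assumed pre-sorted by relative cost).

```python
# Sketch of the lower bound LB: cost of the optimal box-schedule for the
# preemptive (divisible load) relaxation of Q|Cmax <= D|K.

def box_schedule_lower_bound(tau, s, cbar, D):
    remaining = sum(tau)                    # total work, in millions of instructions
    cost = 0.0
    for k in range(len(s)):                 # processors from cheapest relative cost upwards
        work = min(remaining, s[k] * D)     # work that fits on P_k before the deadline
        cost += cbar[k] * work
        remaining -= work
        if remaining <= 0:
            return cost
    return None                             # even the preemptive relaxation is infeasible

def relative_deviation(K, LB):
    return 100.0 * (K - LB) / LB            # deviation theta, in per cent
```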
Table 3: The maximum relative deviation θ (in percentage) of the cost K of heuristic solutions from LB (the average relative deviation in parentheses)

Grid Scheduling Algorithms     PrI-25         PrI-50        PrII-25         PrII-50
DBC-ctA [2]                    17.55 (4.33)   7.27 (1.56)   81.56 (39.47)   79.92 (35.28)
DBC-ctR [2]                    17.55 (4.40)   7.27 (1.65)   9.81 (2.16)     4.90 (0.86)
DBC-cA [1]                     13.80 (2.56)   6.44 (0.95)   68.51 (29.26)   66.56 (24.99)
DBC-cR [1]                     14.58 (2.68)   6.44 (1.11)   9.57 (1.4)      4.16 (0.58)
HRED [7]                       19.77 (9.19)   9.49 (4.41)   15.58 (5.43)    7.19 (2.59)
SFTCO [11]                     287.4 (162)    275.4 (137)   -(-)*           -(-)*

Proposed Algorithms            PrI-25         PrI-50        PrII-25         PrII-50
FFD One Group                  5.23 (0.75)    1.04 (0.25)   2.11 (0.41)     0.99 (0.14)
LPT One Group                  4.32 (0.73)    1.04 (0.21)   1.79 (0.29)     0.99 (0.09)
LPT Groups                     4.99 (1.09)    2.22 (0.48)   2.9 (0.51)      0.99 (0.19)
FFD Cost                       5.74 (0.89)    1.29 (0.31)   2.11 (0.46)     0.99 (0.17)
FFD Long Tasks Restr.          3.6 (0.78)     1.04 (0.26)   1.35 (0.13)     0.31 (0.03)

* No feasible solutions were found by algorithm SFTCO for any instance of classes PrII-25 and PrII-50.

One more observation can be made in relation to the asymptotic optimality of our algorithms. Table 3 supports the fact that as the number of tasks increases from 25 to 50, the solutions found become closer to the lower bounds. The proof of the asymptotic optimality of the algorithms is omitted in the current paper.

The performance of the known algorithms can be summarized as follows. The best among those algorithms are the DBC algorithms from [1] and [2]. Algorithms DBC-cA and DBC-ctA load the processors giving preference to those with the lowest absolute cost ck, while DBC-cR and DBC-ctR give preference to the processors with the lowest relative cost c̄k, approximating the box-schedule introduced in Section 2. Due to the optimality of the latter schedule for the preemptive version of the problem [9], it is expected that DBC-cR and DBC-ctR should outperform DBC-cA and DBC-ctA. That behaviour is clearly observed in our experiments with processors' data set PrII; experiments with data set PrI are less indicative, since in PrI the orderings of processors based on ck and c̄k are very similar. The algorithms from [1] and [2] consider the tasks without taking into account their processing requirements, while all our algorithms make use of the LPT task order. The algorithm from [7] performs task allocation in a fashion similar to the SPT order rather than LPT, which might explain its poor performance. Finally, the algorithm from [11] prioritizes the tasks in the LPT order, but the authors use the unusual ratio si/c̄i = si²/ci to prioritize the processors; the meaning of that ratio and its justification are not provided.

Analysing the performance of our algorithms, we conclude that the best performance is achieved by either 'LPT One Group' or 'FFD Long Tasks Restr.'. In the instances with tight deadlines, algorithm 'LPT One Group' outperforms the others as it is more successful in finding deadline-feasible schedules. The third best algorithm is 'FFD One Group'. Finally, the slightly worse performance of algorithms 'LPT Groups' and 'FFD Cost' is compensated by their faster running times.

4 Conclusions

In this study we propose new cost-optimization algorithms for scheduling Bag-of-Tasks applications and analyse them empirically. Our algorithms are aimed at replicating the shape of a box-schedule, which is optimal for a relaxed version of the problem. This is achieved by combining the most successful classical scheduling algorithms and adjusting them to handle the cost factor.
In comparison with the existing algorithms, the new algorithms find strictly lower cost schedules in more than 85% of the instances; they are fast, easy to implement and can be embodied in a Grid broker.

Acknowledgements

This research was supported by the EPSRC funded project EP/G054304/1 "Quality of Service Provision for Grid applications via Intelligent Scheduling".

References

[1] Buyya, R., Murshed, M.: GridSim: a Toolkit for the Modeling and Simulation of Distributed Resource Management and Scheduling for Grid Computing, Concurr. Comput. Pract. Exper. 14, 1175-1220 (2002)

[2] Buyya, R., Murshed, M., Abramson, D., Venugopal, S.: Scheduling Parameter Sweep Applications on Global Grids: a Deadline and Budget Constrained Cost-Time Optimization Algorithm, Softw. Pract. Exper. 35, 491-512 (2005)

[3] Coffman, E. G., Garey, M. R., Johnson, D. S.: Application of Bin-Packing to Multiprocessor Scheduling, SIAM J. Comput. 7, 1-17 (1978)

[4] Gonzalez, T., Ibarra, O. H., Sahni, S.: Bounds for LPT Schedules on Uniform Processors, SIAM J. Comput. 6, 155-166 (1977)

[5] Iosup, A., Epema, D.: Grid Computing Workloads, IEEE Internet Comput. 15, 19-26 (2011)

[6] Javadi, B., Thulasiram, R., Buyya, R.: Statistical Modeling of Spot Instance Prices in Public Cloud Environments, 4th IEEE International Conference on Utility and Cloud Computing, Melbourne, Australia, 219-228 (2011)

[7] Kumar, S., Dutta, K., Mookerjee, V.: Maximizing Business Value by Optimal Assignment of Jobs to Resources in Grid Computing, Eur. J. Oper. Res. 194, 856-872 (2009)

[8] Kunde, M., Steppat, H.: First Fit Decreasing Scheduling on Uniform Multiprocessors, Discrete Appl. Math. 10, 165-177 (1985)

[9] Shakhlevich, N.V., Djemame, K.: Mathematical Models for Time/Cost Optimization in Grid Scheduling, Report 2008.04, School of Computing, University of Leeds (2008), http://www.engineering.leeds.ac.uk/computing/research/publications/reports/2008+/2008_04.pdf

[10] Shmoys, D. B., Tardos, É.: An Approximation Algorithm for the Generalized Assignment Problem, Math. Program. 62, 461-474 (1993)

[11] Sonmez, O. O., Gursoy, A.: A Novel Economic-Based Scheduling Heuristic for Computational Grids, Int. J. High Perform. Comput. Appl. 21, 21-29 (2007)