J. Parallel Distrib. Comput. 73 (2013) 1306–1322

A DAG scheduling scheme on heterogeneous computing systems using double molecular structure-based chemical reaction optimization

Yuming Xu a, Kenli Li a,*, Ligang He b, Tung Khac Truong a,c

a College of Information Science and Engineering, Hunan University, Changsha, 410082, China
b Department of Computer Science, University of Warwick, United Kingdom
c Faculty of Information Technology, Industry University of Hochiminh City, Hochiminh, Vietnam

* Corresponding author. E-mail addresses: xxl1205@163.com (Y. Xu), lkl@hnu.edu.cn (K. Li), liganghe@dcs.warwick.ac.uk (L. He), tung147@gmail.com (T.K. Truong).

Highlights

• Applying CRO to solve DAG scheduling problems in heterogeneous computing systems.
• Developing DMSCRO by adapting the conventional CRO framework.
• Designing a new solution encoding method for DAG scheduling.
• Designing new operations for performing elementary chemical reactions in this work.
• Conducting experiments to verify the effectiveness and efficiency of the DMSCRO.

Article history: Received 12 May 2012; Received in revised form 10 May 2013; Accepted 30 May 2013; Available online 6 June 2013.

Keywords: NP-hard problem; Chemical reaction optimization; Meta-heuristic approaches; Task scheduling; Makespan

Abstract

A new meta-heuristic method, called Chemical Reaction Optimization (CRO), has been proposed very recently. The method encodes solutions as molecules and mimics the interactions of molecules in chemical reactions to search for the optimal solutions. The CRO method has demonstrated its capability in solving NP-hard optimization problems. In this paper, the CRO scheme is used to formulate the scheduling of Directed Acyclic Graph (DAG) jobs in heterogeneous computing systems, and a Double Molecular Structure-based Chemical Reaction Optimization (DMSCRO) method is developed. There are two molecular structures in DMSCRO: one is used to encode the execution order of the tasks in a DAG job, and the other to encode the task-to-computing-node mapping. The DMSCRO method also designs four elementary chemical reaction operations and a fitness function suitable for the scenario of DAG scheduling. In this paper, we have also conducted simulation experiments to verify the effectiveness and efficiency of DMSCRO over a large set of randomly generated graphs and graphs of real-world problems.

© 2013 Elsevier Inc. All rights reserved.

1. Introduction

A job consisting of a group of tasks with precedence constraints is often modeled as a Directed Acyclic Graph (DAG). When scheduling a DAG job, the main objective is to optimize its makespan, which is defined as the duration between the time when the first task in the DAG starts execution and the time when the last task finishes execution. This problem has been well studied for many decades. Achieving this objective has been proved to be an NP-complete problem [9], which means that the time needed to find the optimal solution increases exponentially as the problem size increases. Therefore, two schools of scheduling methods, heuristic scheduling and meta-heuristic scheduling, have been proposed to find sub-optimal solutions with lower time overhead. Heuristic scheduling algorithms exploit heuristics to identify a good solution.
An important class of heuristic scheduling is list scheduling. List scheduling maintains an ordered list of the tasks in a DAG job according to some greedy heuristics. The tasks are selected in the specified order for mapping to the computing nodes that allow the earliest start times. Heuristic scheduling algorithms can find solutions with low time complexity, since the attempted solutions are narrowed down by greedy heuristics to a very small portion of the entire solution space. However, the quality of the solutions obtained by these algorithms depends heavily on the effectiveness of the heuristics, and greedy heuristics are unlikely to produce consistent results on a wide range of problems, especially when the complexity of the DAG scheduling problem becomes high.

Meta-heuristic scheduling (or guided-random-search-based) techniques work by guiding the search for solutions in a solution space. Although meta-heuristic scheduling typically takes a longer time, it can achieve good performance consistently for a wide range of scheduling scenarios. Well-known examples of meta-heuristic scheduling techniques include Genetic Algorithms (GA), Particle Swarm Optimization (PSO), Ant Colony Optimization (ACO), Simulated Annealing (SA) and Tabu Search (TS).

Very recently, a new meta-heuristic method, called Chemical Reaction Optimization (CRO), has been proposed. The method encodes solutions as molecules and mimics the interactions of molecules in chemical reactions to search for the optimal solutions. The CRO method has demonstrated its capability in solving NP-hard optimization problems. In this paper, we apply the CRO framework to schedule DAG jobs on heterogeneous computing systems. Scheduling DAG jobs on heterogeneous systems involves making decisions about the execution order of tasks and the task-to-computing-node mapping. This paper adapts the conventional CRO framework and proposes a Double Molecular Structure-based CRO (DMSCRO) scheme to formulate the scheduling of DAG jobs. In DMSCRO, one molecular structure is used to encode the execution order of the tasks in a DAG job, while the other molecular structure is used to encode the task-to-computing-node mapping. DMSCRO also designs the necessary elementary chemical reaction operations and a fitness function suitable for the scenario of DAG scheduling.

According to the No-Free-Lunch Theorem in the area of meta-heuristics [34], all meta-heuristic methods that search for optimal solutions are the same in performance when averaged over all possible objective functions. In theory, as long as an effective meta-heuristic method runs for long enough, it will gradually approach the optimal solution. We have conducted experiments over a large set of randomly generated graphs, and also over graphs abstracted from two well-known real applications: Gaussian elimination and a molecular dynamics application. The experimental results show that the proposed DMSCRO achieves better performance than the heuristic algorithms and similar performance to GA in the literature in terms of makespan. We will show in Section 5 that DMSCRO combines the advantages of GA and Simulated Annealing (SA), and therefore may have better performance in terms of search efficiency. The experimental results presented in Section 6.4 indeed demonstrate that DMSCRO is able to find good solutions faster than GA.
The three major contributions of this work are summarized below:

• Applying the Chemical Reaction Optimization framework to solve DAG scheduling problems in heterogeneous computing systems.
• Developing DMSCRO by adapting the conventional CRO framework and designing a new solution encoding method, new operations for performing elementary chemical reactions and a new fitness function suitable for the scheduling scenarios considered in this work.
• Conducting simulation experiments to verify the effectiveness and efficiency of the proposed DMSCRO. Our experimental results show that (1) DMSCRO is able to achieve a similar makespan to GA, but it finds good solutions faster than GA by 26.5% on average (by 58.9% in the best case), and (2) DMSCRO consistently achieves smaller makespans than two heuristic scheduling algorithms (HEFT_B and HEFT_T) which have been shown in the literature to outperform other heuristic scheduling algorithms. Compared with HEFT_B and HEFT_T, the makespan reduction achieved by DMSCRO is 10% and 12.8% on average, respectively.

The remainder of this paper is organized as follows. In Section 2, the related work about scheduling algorithms on heterogeneous systems is presented. Section 3 presents the background knowledge of CRO. Section 4 describes the system and workload model. In Section 5, the DMSCRO scheme is presented for DAG scheduling, aiming to minimize the makespan on heterogeneous computing systems. Section 6 compares the performance of the proposed scheme with the existing heuristic algorithms. Finally, Section 7 concludes the paper.

2. Related work

In this section, we discuss the related work on heuristic scheduling, meta-heuristic (or guided-random-search-based) scheduling and job-shop scheduling.

2.1. Heuristic scheduling

Heuristic methods usually provide good solutions for restricted instances of a task scheduling problem. They can find a near-optimal solution in polynomial time. A heuristic method searches a path in the solution space and ignores other possible paths [17,39]. There are three typical types of heuristic-based algorithms for the DAG scheduling problem: list scheduling [17,29], cluster scheduling [1,4], and duplication-based scheduling [30,2]. List scheduling is the most popular among these three when it comes to scheduling DAG jobs. A list scheduling algorithm is typically divided into two phases. In the first phase, each task is assigned a priority and then added to a list of waiting tasks in order of decreasing priority according to some criteria. In the second phase, the task with the highest priority is selected and assigned to the most suitable computing node. The main objective of scheduling a DAG job is to minimize its makespan. Most list scheduling algorithms use two attributes, b-level and t-level, to assign priorities to tasks. The b-level of a task is the length of the longest path from the exit task of the DAG job to the task, while the t-level of a task is the length of the longest path from the entry task to the task. The major difference among list scheduling algorithms is the way in which the b-level and the t-level are used and the way the computing node is selected. Since this work investigates DAG scheduling in heterogeneous parallel systems, this subsection mainly discusses the related work on heuristic DAG scheduling in heterogeneous parallel systems. Modified Critical Path (MCP) scheduling [36] uses the tasks' latest start times to assign the scheduling priority.
The latest start times of the tasks on the critical path of the DAG are actually their t-levels. The Earliest Time First (ETF) scheduling algorithm [13] uses the tasks' earliest start times to assign the scheduling priority. Both ETF and MCP allocate a task to the processor that minimizes the task's start time. Dynamic-level scheduling (DLS) [26] calculates the difference between the b-level of task Ti and Ti's earliest start time on processor Pj. The difference is called the dynamic level of the (Ti, Pj) pair. Each time, DLS selects the (task, processor) pair with the highest dynamic level. The mapping heuristic (MH) [7] uses the so-called static b-level to assign the scheduling priority. The static b-level is the b-level without considering the communication costs between tasks. A task is then allocated to the processor that gives the earliest start time. Levelized-Min Time (LMT) [14] assigns the scheduling priority in two phases. It first groups the tasks into different levels according to the DAG topology, and then within the same level the task with the biggest computation cost has the highest priority. A task is allocated to the processor that minimizes the sum of the task's computation cost and the total communication costs with the tasks in the previous level.

Ref. [29] also proposes two heuristic DAG scheduling algorithms for heterogeneous systems. One algorithm uses the tasks' b-levels to determine the priority (i.e., scheduling order) of the tasks, and is called HEFT_B in this paper. The bigger the b-level, the higher the priority. In HEFT_B, a task is allocated to the processor that gives the earliest start time. The other algorithm uses the sum of the t-level and the b-level to determine the tasks' priority, and is called HEFT_T in this paper. HEFT_T tries to allocate all tasks on the critical path to one processor, and allocates a task not on the critical path to the processor that minimizes the task's start time. Extensive experiments have been conducted in [29], and the results demonstrate that HEFT_B and HEFT_T outperform (in terms of makespan) other representative heuristic algorithms in heterogeneous systems, such as DLS, MH and LMT.

The Job Shop Scheduling Problem (JSSP) [38,21,10,11,22,25] originates from industry and investigates how to allocate work operations to machines so as to maximize productivity (e.g., by minimizing the makespan of the work flow). JSSP can be abstracted as DAG scheduling. Therefore, the heuristic methods developed for DAG scheduling can be applied to JSSP in industry.

Clustering algorithms [1,4] are another type of heuristic algorithm. Clustering algorithms assume that an unlimited number of computing nodes is available for task execution. The clustering algorithms use as many computing nodes as possible in order to reduce the makespan of the schedule. If the number of computing nodes used by a schedule is more than the number of computing nodes available, a mapping process is required to merge the tasks in a candidate schedule onto the available computing nodes. The duplication-based scheduling heuristic [30,2] attempts to reduce communication delays by executing the key tasks on more than one computing node. Duplication-based scheduling essentially aims to further improve the performance of list scheduling. The disadvantage of this method is that the same task may have to run several times on different computing nodes and therefore cause more resource consumption.
2.2. Meta-heuristic scheduling

Meta-heuristic algorithms are also called guided-random-search-based algorithms. Contrary to heuristic-based algorithms, meta-heuristic algorithms incorporate a combinatorial process when searching for solutions. Meta-heuristic algorithms typically require sufficient sampling of candidate solutions in the search space and have shown robust performance on a variety of scheduling problems. Many meta-heuristic algorithms have been proposed to successfully solve the task scheduling problem.

GA [12,27,33,6] is to date the most popular technique for meta-heuristic scheduling. [12] proposed to apply the GA technique to DAG scheduling. In [12], a scheduling solution is encoded as multiple one-dimensional strings, each string representing an ordered list of tasks scheduled to a processor. In the crossover operation, a crossover point is randomly selected in each string of two parent solutions, and the head portion of one parent merges with the tail portion of the other parent. Mutation is applied by randomly exchanging two tasks in two strings. The makespan is used as the fitness function to judge the quality of a scheduling solution. There are other works applying GA to scheduling, such as [27,28,23]. In those works, however, GA is developed to tackle special features of the applications or of the computing systems. For example, [23] applies GA to address the dynamic features of Grid environments; [27] tackles security considerations in Grids; [28] designs a GA method to schedule a DAG in which each task is a parallel task. Our work is concerned with conventional heterogeneous computing systems, in which a task in a DAG is a serial task. Therefore, the workload and system model in [12] is the closest to the model in this paper. Other meta-heuristic scheduling methods include PSO [3,19], ACO [31,8], SA [15], and TS [24,35]. Some research has also been conducted on using agent intelligence to tackle task processing on parallel computing systems [32].

Chemical Reaction Optimization (CRO) was proposed very recently [18,37,5,32,20]. It mimics the interactions of molecules in chemical reactions. In CRO, solutions are encoded, and then sequences of operations, which loosely simulate molecular behaviors in reactions, are performed over these solutions. Gradually, good solutions emerge. CRO has already shown its power in solving problems like the Quadratic Assignment Problem (QAP), the Resource-Constrained Project Scheduling Problem (RCPSP), the Channel Assignment Problem (CAP) [18,32] and Task Scheduling in Grid Computing (TSGC) [38]. To the best of our knowledge, CRO has not yet been applied to DAG scheduling in the literature.

3. Background of CRO

CRO mimics the process of a chemical reaction where molecules undergo a sequence of reactions with each other or with the environment in a closed container. A molecule has a unique structure of atoms, which represents a solution of the optimization problem. Potential energy (PE) and kinetic energy (KE) are two key properties attached to a molecular structure. The former corresponds to the fitness value of the solution (the fitness of a solution is judged by the PE of the molecule), while the latter is used to control the acceptance of new solutions with worse fitness. Suppose ω and f are a molecular structure (solution) and the fitness function, respectively; then PE_ω = f(ω).
During the reactions, a molecule with structure ω attempts to change to another structure ω′ if PE_{ω′} ≤ PE_ω, or PE_{ω′} ≤ PE_ω + KE_ω. KE is a non-negative number and it helps the molecule escape from local optima. A central energy buffer is also implemented in CRO. The energy stored in the buffer can be regarded as the energy in the environment (other than the molecules) in the closed container. Energy may also flow between the molecules and the energy buffer. CRO requires conservation of energy. The total amount of energy in the system is determined by the PE and the initial KE of all molecules in the initial population, plus the initial energy stored in the buffer. Conservation of energy means that this total energy remains constant during the process of CRO. During the CRO process, the following four types of elementary reactions may happen.

• On-wall ineffective collision: this reaction is a uni-molecular reaction, whose reactant involves only one molecule. When a molecule ω collides onto the wall of the closed container, it is allowed to change to another molecule ω′ if Inequality (1) on their energy values holds:

PE_ω + KE_ω ≥ PE_{ω′}.   (1)

After the collision, the KE energy is re-distributed: a certain portion of the KE of the new molecule is withdrawn to the central energy buffer (i.e., the environment). The KE of the new molecule is calculated by Eq. (2), where a is a number randomly selected from the range [KELossRate, 1]. KELossRate is the loss rate of the KE energy, a system parameter set during the initialization stage of CRO:

KE_{ω′} = (PE_ω − PE_{ω′} + KE_ω) × a.   (2)

• Decomposition: this reaction is also a uni-molecular reaction. A molecule ω can decompose into two new molecules, ω1′ and ω2′, if Inequality (3) holds, where buffer denotes the energy stored in the central buffer, which represents the energy interactions between molecules and the environment:

PE_ω + KE_ω + buffer ≥ PE_{ω1′} + PE_{ω2′}.   (3)

Let Edec = (PE_ω + KE_ω) − (PE_{ω1′} + PE_{ω2′}). Then, after decomposing, the KE energies of ω1′ and ω2′ are calculated by Eqs. (4) and (5), where δ1, δ2, δ3, δ4 are random numbers generated in [0, 1]:

KE_{ω1′} ← (Edec + buffer) × δ1 × δ2   (4)
KE_{ω2′} ← (Edec + buffer − KE_{ω1′}) × δ3 × δ4.   (5)

The energy in the buffer is updated by Eq. (6):

buffer ← Edec + buffer − (KE_{ω1′} + KE_{ω2′}).   (6)

• Inter-molecular ineffective collision: this reaction is an inter-molecular reaction, whose reactants involve two molecules. When two molecules, ω1 and ω2, collide into each other, they can change to two new molecules, ω1′ and ω2′, if Inequality (7) holds:

PE_{ω1} + PE_{ω2} + KE_{ω1} + KE_{ω2} ≥ PE_{ω1′} + PE_{ω2′}.   (7)

Einter denotes the spare energy after the inter-molecular collision, which is calculated by Eq. (8). The KEs of the new molecules share this spare energy; therefore, the KEs of ω1′ and ω2′ are calculated by Eqs. (9) and (10), respectively, where δ1 is a random number generated from [0, 1]:

Einter = (PE_{ω1} + PE_{ω2} + KE_{ω1} + KE_{ω2}) − (PE_{ω1′} + PE_{ω2′})   (8)
KE_{ω1′} ← Einter × δ1   (9)
KE_{ω2′} ← Einter × (1 − δ1).   (10)

• Synthesis: this reaction is also an inter-molecular reaction. When two molecules, ω1 and ω2, collide into each other, they can be combined to generate a new molecule, ω′, if Inequality (11) holds:

PE_{ω1} + PE_{ω2} + KE_{ω1} + KE_{ω2} ≥ PE_{ω′}.   (11)

The KE energy of ω′ is calculated by Eq. (12):

KE_{ω′} = PE_{ω1} + KE_{ω1} + PE_{ω2} + KE_{ω2} − PE_{ω′}.   (12)
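To make the energy bookkeeping concrete, the following is a minimal Python sketch of the acceptance tests and KE updates for an on-wall collision and a synthesis, following Eqs. (1)–(2) and (11)–(12). The function names and the KE_LOSS_RATE value are illustrative choices, not part of the original formulation; the same pattern extends to decomposition (Eqs. (3)–(6)) and the inter-molecular collision (Eqs. (7)–(10)).

```python
import random

KE_LOSS_RATE = 0.2  # KELossRate; an assumed value, set at CRO initialization

def on_wall_energy(pe_old, ke_old, pe_new, buffer_energy):
    """On-wall ineffective collision, Eqs. (1)-(2).
    Returns (accepted, ke_new, updated_buffer)."""
    if pe_old + ke_old >= pe_new:                       # Inequality (1)
        a = random.uniform(KE_LOSS_RATE, 1.0)
        surplus = pe_old - pe_new + ke_old
        ke_new = surplus * a                            # Eq. (2)
        buffer_energy += surplus * (1.0 - a)            # withdrawn KE goes to the buffer
        return True, ke_new, buffer_energy
    return False, ke_old, buffer_energy

def synthesis_energy(pe1, ke1, pe2, ke2, pe_new):
    """Synthesis, Eqs. (11)-(12). Returns (accepted, ke_new)."""
    total = pe1 + pe2 + ke1 + ke2
    if total >= pe_new:                                 # Inequality (11)
        return True, total - pe_new                     # Eq. (12)
    return False, 0.0
```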
Table 1
Definitions of notations.

T: The set of n weighted tasks in the application
Ti: The ith task in the application
P: The set of m heterogeneous computing nodes
Pk: The kth computing node in the system
Tentry: The starting subtask without any predecessors
Texit: The final subtask with no successors
succ(Ti): The set of the immediate successors of the subtask Ti
pred(Ti): The set of the immediate predecessors of the subtask Ti
Wd(Ti): The computational data of subtask Ti
S(Ti, Pk): The estimated execution speed of subtask Ti on computing node Pk
W(Ti, Pk): The computational cost of subtask Ti on the computing node Pk
W̄(Ti): The average computational cost of subtask Ti
Cd(Ti, Tj): The amount of communication between subtask Ti and subtask Tj
Cs(Pk): The communication startup cost of computing node Pk
B: The two-dimensional matrix of communication bandwidths between computing nodes
B(Pk, Pl): The communication bandwidth between computing nodes Pk and Pl
C(Ti, Tj): The communication cost from the subtask Ti (scheduled on Pk) to the subtask Tj (scheduled on Pl)
C̄(Ti, Tj): The average communication cost of the edge(Ti, Tj)
EST(Ti, Pk): The earliest start time of the subtask Ti on the computing node Pk
EFT(Ti, Pk): The earliest finish time of the subtask Ti on the computing node Pk
Tavail(Pk): The earliest time at which the computing node Pk is ready for task execution
Rankb(Ti): The upward ranking of subtask Ti
Rankt(Ti): The downward ranking of subtask Ti
CCR: The communication to computation ratio, i.e., the ratio of the average communication cost to the average computation cost

The typical execution flow of CRO is as follows. CRO is first initialized to set some system parameters, such as PopSize (the size of the population of molecules), KELossRate, InitialKE (the initial energy associated with molecules), buffer (the initial energy in the energy buffer), MoleColl (used later in the process to determine whether to perform a uni-molecular or an inter-molecular operation), etc. Then the process enters a loop. In each iteration, the process first decides for each molecule whether to perform a uni-molecular operation or an inter-molecular operation, according to a certain probability. This is decided in the following way. A random number, b, is generated in the interval [0, 1]. If b is bigger than the value of MoleColl set in the initialization stage, a uni-molecular operation is performed; otherwise, an inter-molecular operation takes place. If it is a uni-molecular operation, CRO uses a parameter α to guide the further selection between on-wall collision and decomposition. α represents the maximum number of collisions allowed with no improved solution being found by a molecule. When a molecule undergoes a collision, the parameter NumHit is updated to record the total number of collisions which the molecule has carried out. If a molecule has undergone a number of collisions larger than α, a decomposition is triggered. Similarly, if the process decides to perform an inter-molecular operation, CRO uses a parameter β to further decide whether an inter-molecular collision or a synthesis operation should be performed. β specifies the least amount of KE which a molecule should possess. For two molecules ω1 and ω2, synthesis is triggered when both KE_{ω1} and KE_{ω2} are less than β.
Otherwise, an inter-molecular ineffective collision occurs. The iteration repeats until the stopping criterion is satisfied (e.g., the best solution does not change for a certain number of consecutive iterations). After the searching loop stops, the best solution is the molecule with the lowest PE.

4. Models

This section discusses the system, application and task scheduling model assumed in this work. The definitions of the notations can be found in Table 1.

4.1. System model

In this paper, the target system consists of a set P of m heterogeneous computing nodes that are fully interconnected with a high-speed network. Each subtask in a DAG task can only be executed on one heterogeneous computing node. The communication time between two dependent subtasks should be taken into account if they are assigned to different heterogeneous computing nodes. We also assume a static computing system model in which the dependence relations and the execution times of subtasks are known a priori and do not change over the course of the scheduling and subtask execution. In addition, all heterogeneous computing nodes are fully available for computation in the time slots they are assigned to.

4.2. Application model

In this work, an application is represented by a DAG, with the graph vertexes representing subtasks and the edges between vertexes representing execution precedence between subtasks. If there is an edge from Ti to Tj in the DAG graph, then Ti is called the predecessor of Tj, and Tj is the successor of Ti. pred(Ti) and succ(Ti) denote the sets of predecessors and successors of task Ti, respectively. There is an entry subtask and an exit subtask in a DAG. The entry subtask Tentry is the starting subtask of the application without any predecessors, while the exit subtask Texit is the final subtask with no successors. A weight is associated with each vertex and edge. The vertex weight, denoted as Wd(Ti), represents the amount of data to be processed in the subtask Ti, while the edge weight, denoted as Cd(Ti, Tj), represents the amount of communication between subtask Ti and subtask Tj. The DAG topology of an exemplar application model and system model is shown in Fig. 1(a) and (b), respectively.

Fig. 1. A simple DAG task model containing 8 subtasks (a) and a fully connected parallel system with 3 heterogeneous computing nodes (b).

The heterogeneity model of a computing system can be divided into two categories: the Fixed Heterogeneity Model (FHM) and the Mixed Heterogeneity Model (MHM). In an FHM computing system, a computing node executes the tasks at the same speed, regardless of the type of the tasks. On the contrary, in an MHM computing system, how fast a computing node executes a task depends on how well the heterogeneous computing node architecture matches the task requirements and features. The work in this paper assumes MHM computing systems. The execution speeds of the computing nodes in the heterogeneous computing system are represented by a two-dimensional matrix, S, in which an element S(Ti, Pk) represents the speed at which computing node Pk executes subtask Ti. The computation cost of subtask Ti running on computing node Pk, denoted as W(Ti, Pk), can be calculated by Eq. (13). Assume the DAG has the topology as in Fig. 1(a) and there are 3 heterogeneous computing nodes in the computing system as in Fig. 1(b).
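To fix ideas, the following is a minimal Python sketch of how the workload and system model of Sections 4.1 and 4.2 can be represented. The field names are illustrative, and the tiny two-task example DAG is hypothetical (it is not the graph of Fig. 1(a)).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class DagApp:
    """A DAG application on a heterogeneous system (cf. Table 1)."""
    tasks: List[int]                                 # subtask ids T
    succ: Dict[int, List[int]]                       # succ(Ti)
    pred: Dict[int, List[int]]                       # pred(Ti)
    Wd: Dict[int, float]                             # computational data of Ti
    Cd: Dict[Tuple[int, int], float]                 # data on edge (Ti, Tj)
    S: Dict[Tuple[int, int], float]                  # speed of Ti on node Pk
    nodes: List[int] = field(default_factory=list)   # computing nodes P

    def W(self, ti: int, pk: int) -> float:
        """Eq. (13): computation cost of Ti on Pk."""
        return self.Wd[ti] / self.S[(ti, pk)]

    def W_avg(self, ti: int) -> float:
        """Eq. (14): average computation cost of Ti over all nodes."""
        return sum(self.W(ti, pk) for pk in self.nodes) / len(self.nodes)

# A hypothetical two-task example: T0 -> T1 on two nodes.
app = DagApp(
    tasks=[0, 1], succ={0: [1], 1: []}, pred={0: [], 1: [0]},
    Wd={0: 10.0, 1: 20.0}, Cd={(0, 1): 5.0},
    S={(0, 0): 1.0, (0, 1): 0.8, (1, 0): 1.25, (1, 1): 1.0},
    nodes=[0, 1])
print(app.W(1, 0), app.W_avg(1))   # 16.0 and 18.0
```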
An example of the computing node heterogeneity and the computation costs of the subtasks is shown in Table 2. Note that there are two numbers in each task node in Fig. 1(a): the number at the top is the task id and the one at the bottom is the average computation cost as calculated in Table 2.

W(Ti, Pk) = Wd(Ti) / S(Ti, Pk).   (13)

The average computation cost of subtask Ti, denoted as W̄(Ti), can be calculated by Eq. (14):

W̄(Ti) = (Σ_{k=1}^{m} W(Ti, Pk)) / m.   (14)

The communication bandwidths between heterogeneous computing nodes are represented by a two-dimensional matrix B. The communication startup costs of the computing nodes are represented by an array, Cs, in which the element Cs(Pk) is the startup cost of computing node Pk. The communication cost C(Ti, Tj) of edge(Ti, Tj), which is the time spent in transferring data from subtask Ti (scheduled on Pk) to subtask Tj (scheduled on Pl), can be calculated by Eq. (15). When Ti and Tj are scheduled on the same computing node, the communication cost is regarded as 0:

C(Ti, Tj) = Cs(Pk) + Cd(Ti, Tj) / B(Pk, Pl).   (15)

C̄(Ti, Tj) is the average communication cost of the edge(Ti, Tj), which is defined as shown in Eq. (16):

C̄(Ti, Tj) = (Σ_{Ti on Pk, Tj on Pl} C(Ti, Tj)) / N   (16)

where N is the number of processor pairs (Pk, Pl) over which the communication cost of the edge is averaged.

In this paper, a communication cost is only required when two subtasks are assigned to different heterogeneous computing nodes. In other words, the communication cost can be ignored when the subtasks are assigned to the same computing node. It is assumed that the inter-computing-node communications are performed at the same speed (i.e., with the same bandwidths) on all links, and we assume B(Pk, Pl) = 1 and Cs(Pk) = 0 to simplify our DAG task scheduling model. The earliest start time of the subtask Ti on computing node Pk is denoted as EST(Ti, Pk), which can be calculated using Eq. (17), where Pm denotes the computing node on which the predecessor Tj is scheduled:

EST(Ti, Pk) =
  0,                                                       if Ti = Tentry;
  max_{Tj ∈ pred(Ti)} EFT(Tj, Pm),                          if Pk = Pm;
  max_{Tj ∈ pred(Ti)} (EFT(Tj, Pm) + C(Tj, Ti)),            if Pk ≠ Pm.   (17)

The earliest finish time of the subtask Ti on computing node Pk is denoted as EFT(Ti, Pk), which can be calculated using Eq. (18):

EFT(Ti, Pk) = EST(Ti, Pk) + W(Ti, Pk).   (18)

EST(Ti) and EFT(Ti) denote the earliest start time (EST) and the earliest finish time (EFT) of a subtask Ti over all heterogeneous computing nodes, which can be calculated by Eqs. (19) and (20), respectively. Tavail(Pk) denotes the earliest time at which the computing node Pk is ready for task execution:

EST(Ti) = max_{1≤k≤m} (EFT(Ti, Pk), Tavail(Pk))   (19)
EFT(Ti) = min_{1≤k≤m} EFT(Ti, Pk).   (20)

In this study, the task scheduling problem is the process of mapping a set T of n subtasks in a DAG to a set P of m computing nodes, aiming at minimizing the makespan. The subtask with the highest priority is selected for computing node allocation, and a computing node which can ensure the earliest finish time is selected to run the subtask. In many task scheduling algorithms, a subtask which has a higher upward rank is more important and is therefore preferred for computing node allocation over other subtasks. Intuitively, the upward rank of a subtask reflects the average remaining cost to finish all subtasks after that subtask starts up.
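A small Python sketch of Eqs. (15), (17) and (18) follows. It assumes each predecessor's finish time and node assignment are already known (as they are when tasks are processed in a precedence-respecting order), and the helper and parameter names are our own rather than the paper's. Processing all tasks this way and taking the largest finish time yields the makespan used later as the fitness value.

```python
def comm_cost(ti, tj, pk, pl, Cd, B, Cs):
    """Eq. (15): communication cost of edge (Ti, Tj); zero when Pk == Pl."""
    if pk == pl:
        return 0.0
    return Cs[pk] + Cd[(ti, tj)] / B[(pk, pl)]

def earliest_finish(ti, pk, pred, node_of, finish, W, Cd, B, Cs, node_ready):
    """EST via Eq. (17) (plus the node-ready time Tavail(Pk)), then EFT via Eq. (18).
    pred: predecessor lists; node_of/finish: placement and EFT of already scheduled
    tasks; node_ready: earliest time node Pk is free."""
    if not pred[ti]:                                    # Ti = Tentry
        start = node_ready[pk]
    else:
        data_ready = max(finish[tj] + comm_cost(tj, ti, node_of[tj], pk, Cd, B, Cs)
                         for tj in pred[ti])
        start = max(data_ready, node_ready[pk])
    return start + W[(ti, pk)]                          # Eq. (18)
```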
The upward-ranking of subtask Ti, denoted as Rankb(Ti), can be computed by Eq. (21):

Rankb(Ti) = W̄(Ti) + max_{Tj ∈ succ(Ti)} (C̄(Ti, Tj) + Rankb(Tj)).   (21)

The downward-ranking of subtask Ti, denoted as Rankt(Ti), can be calculated by Eq. (22):

Rankt(Ti) = max_{Tj ∈ pred(Ti)} (W̄(Tj) + C̄(Tj, Ti) + Rankt(Tj))   (22)

where succ(Ti) is the set of the immediate successors of the subtask Ti and pred(Ti) is the set of the predecessors of the subtask Ti. Table 3 shows the upward-ranking and downward-ranking of each subtask in the DAG in Fig. 1(a). Note that both computation and communication costs are the costs averaged over all vertexes and links.

The communication to computation ratio (CCR) can be used to indicate whether a task graph is communication-intensive or computation-intensive. For a given task graph, it is computed as the average communication cost divided by the average computation cost on a target computing system, as formulated in Eq. (23):

CCR = (Σ_{edge(Ti,Tj) ∈ E} C̄(Ti, Tj)) / (Σ_{Ti ∈ T} W̄(Ti)).   (23)

Figs. 2–4 show three exemplar solutions of scheduling the DAG in Fig. 1(a) to the computing system in Fig. 1(b), using the HEFT_B algorithm, the HEFT_T algorithm and the DMSCRO task scheduling algorithm, respectively. The task scheduling priority queue in HEFT_B is generated by upward-ranking [29,35].

Table 2
Computing node heterogeneity and computation costs.

Ti    Speed                  Cost                Ave. cost
      P0     P1     P2       P0    P1    P2      W̄(Ti)
T0    1.00   0.85   1.22     11    13     9      11.00
T1    1.20   0.80   1.09     10    15    11      12.00
T2    1.33   1.00   0.86      9    12    14      11.67
T3    1.18   0.81   1.30     11    16    10      12.33
T4    1.00   1.37   0.79     15    11    19      15.00
T5    0.75   1.00   1.79     12     9     5       8.67
T6    1.30   0.93   1.00     10    14    13      12.33
T7    1.09   0.80   1.20     11    15    10      12.00

Table 3
Task priorities.

Ti    Rankb     Rankt
0     101.33      0.00
1      66.67     22.00
2      63.33     28.00
3      73.00     25.00
4      79.33     22.00
5      41.67     56.33
6      37.33     64.00
7      12.00     89.33

Fig. 2. A schedule for the simple DAG task graph in Fig. 1(a) on 3 computing nodes in Fig. 1(b) using the HEFT_B task scheduling algorithm. The makespan of the schedule is 69.

Fig. 3. A schedule for the simple DAG task graph in Fig. 1(a) on 3 computing nodes in Fig. 1(b) using the HEFT_T task scheduling algorithm. The makespan of the schedule is 80.

Fig. 4. A schedule for the simple DAG task graph in Fig. 1(a) on 3 computing nodes in Fig. 1(b) using the DMSCRO task scheduling algorithm. The makespan of the schedule is 66.
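As an illustration of Eqs. (21)–(23), the sketch below computes the upward and downward ranks over a task list that is assumed to be given in topological order; W_avg and C_avg stand for the averaged costs W̄(Ti) and C̄(Ti, Tj), and the function names are our own.

```python
def ranks(topo_order, succ, pred, W_avg, C_avg):
    """Upward rank (Eq. (21)) and downward rank (Eq. (22))."""
    rank_b, rank_t = {}, {}
    for ti in reversed(topo_order):                        # exit task first
        rank_b[ti] = W_avg[ti] + max(
            (C_avg[(ti, tj)] + rank_b[tj] for tj in succ[ti]), default=0.0)
    for ti in topo_order:                                  # entry task first
        rank_t[ti] = max(
            (W_avg[tj] + C_avg[(tj, ti)] + rank_t[tj] for tj in pred[ti]), default=0.0)
    return rank_b, rank_t

def ccr(edges, tasks, W_avg, C_avg):
    """Eq. (23): total average communication cost over total average computation cost."""
    return sum(C_avg[e] for e in edges) / sum(W_avg[t] for t in tasks)
```

For example, with the costs of Table 2, the exit task's upward rank equals its average computation cost (Rankb(T7) = 12.00), matching Table 3.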
5. Design of DMSCRO

The concepts in DAG scheduling can be mapped to those in CRO. DAG scheduling involves making decisions about (1) the scheduling order of the tasks (i.e., the order of the tasks in the waiting queue of the scheduler) and (2) resource allocation (i.e., which computing node is used to run each task). The quality of a scheduling solution is determined by its makespan: the shorter the makespan, the better the scheduling solution. In CRO, there are the concepts of molecule, atoms, molecular structure and the energy of a molecule. A molecular structure represents the positions of the atoms in a molecule, and a molecule has a unique molecular structure. In CRO, the molecules react through chemical reactions, aiming to transform to molecular structures with lower energy (i.e., with more stable states).

A scheduling solution (scheduling order and resource allocation) in DAG scheduling corresponds to a molecule in CRO. A scheduling solution is encoded as a one-dimensional integer array in this paper. The first half of the integer array represents the scheduling order and the second half represents the resource allocation. The integer array corresponds to a molecule, and an element in the array corresponds to an atom in a molecule. The order of the elements in the array represents the molecular structure. This paper designs operations on the encoded scheduling solutions (integer arrays) that change the order of the elements in the arrays. These designed operations correspond to the chemical reactions that change the molecular structures. Different orders of the elements in the arrays represent different scheduling solutions, and we can calculate from an integer array the corresponding makespan of the scheduling solution. The makespan of a scheduling solution corresponds to the energy of a molecule. In this section, we first present the encoding of scheduling solutions and the fitness function used in DMSCRO, and then present the design of the four elementary chemical reaction operations in DMSCRO. Finally, we outline the execution flow of the DMSCRO scheme and discuss a few important properties of DMSCRO.

5.1. Molecular structure and fitness function

In this paper, each big molecule ω, which consists of two small molecules ϕ and ψ, represents a solution of the task scheduling problem. The molecular structure is encoded with each integer in the permutation representing a subtask Ti or a computing node Pj. The two molecular structures can be denoted as the integer queues shown in Fig. 5, where ϕ = {T1, T2, ..., Ti, ..., Tn} represents the queue of subtasks. An atom Ti in the molecule ϕ represents subtask i in the DAG, and the atoms are linked according to their execution priority order. Moreover, the priority order of the subtasks in the molecule ϕ should be a valid topological order, where the start vertex is placed at the beginning of the molecule and the exit vertex is placed at the end. In order for a queue of integers to be feasible, all the subtasks in a DAG should be scheduled and the schedule should satisfy the precedence relations. We take advantage of an upward-ranking heuristic, widely used by traditional list scheduling approaches, to estimate the priority of each subtask, as shown in Eq. (21). ψ = {P0, P1, ..., Pk, ..., Pm} represents the queue of heterogeneous computing nodes, and the atom at the corresponding position in molecule ψ is the computing node assigned to execute the subtask at that position of ϕ.

Fig. 5. Illustration of the big molecule which consists of two small molecules.

The initial solution generator is used to generate the initial solutions for DMSCRO to manipulate. The part ϕ of the first big molecule ω is generated by upward-ranking, as defined in Eq. (21). The remaining big molecules in the initial population are generated by random perturbations of the first big molecule ω. A detailed description is given in Algorithm 1.
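A minimal Python sketch of this double-structure encoding is given below; the attribute names and the bookkeeping fields are illustrative (num_hit/min_hit correspond to the NumHit counters used later in Algorithm 7), and this is not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Molecule:
    """One DMSCRO solution: phi is a topologically valid subtask order,
    and psi[i] is the computing node assigned to the subtask at position i."""
    phi: List[int]                  # small molecule: subtask scheduling order
    psi: List[int]                  # small molecule: task-to-node mapping
    pe: float = float("inf")        # potential energy = makespan of the encoded schedule
    ke: float = 0.0                 # kinetic energy
    num_hit: int = 0                # total collisions undergone
    min_hit: int = 0                # collision count when the best PE was found
```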
Potential energy (PE) is defined as the objective function (fitness function) value of the corresponding solution represented by ω. In this paper, the overall schedule length of the entire set of subtasks, namely the makespan, is the largest finish time among all tasks, which is equivalent to the actual finish time of the exit vertex (subtask) Texit. For the task scheduling problem, the goal is to obtain a task assignment that minimizes the makespan while ensuring that the precedence constraints of the tasks are not violated. Hence, the fitness function value is defined as

Fitness(ω) = PE_ω = makespan = EFT(Texit).   (24)

Algorithm 2 presents how to calculate the value of the fitness function.

Algorithm 1 IniMolecule
Input: The first molecule ω;
Output: The initial solutions;
1: MoleN ← 1;
2: while MoleN < PopSize do
3:   For each atom Ti in molecule ϕ, find the first successor succ(i) from i to the end;
4:   For each atom Tj, j ∈ (i, succ(i)), find the first predecessor pred(j) from succ(i) back to the beginning of molecule ϕ;
5:   if pred(j) < i then
6:     Interchange the positions of atom Ti and atom Tj in molecule ϕ;
7:   end if
8:   Randomly change each atom Pk in molecule ψ;
9:   Generate a new molecule ω′;
10:  MoleN ← MoleN + 1;
11: end while

Algorithm 2 Calculating the fitness value of a molecule
Input: The molecule ω of the task scheduling queue;
Output: The makespan of ω (the task scheduling queue);
1: Add all subtasks (atoms) Ti of the molecule ω to ScheduleQ according to their priority;
2: while ScheduleQ ≠ ∅ do
3:   Select an atom Ti from ScheduleQ whose indegree = 0;
4:   Compute EFT(Ti, Pk) using the scheduling policy;
5:   for all succ(Ti) of the atom Ti do
6:     indegree(succ(Ti)) = indegree(succ(Ti)) − 1;
7:   end for
8:   Remove atom Ti from ScheduleQ;
9: end while
10: return Fitness(ω) = PE_ω = makespan = EFT(Texit);
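To complement Algorithm 2, here is a compact Python sketch of evaluating a molecule's PE: it walks the task-order molecule ϕ, assigns each subtask to the node given by ψ, and returns the largest finish time, i.e., EFT(Texit). It reuses the simplifying assumptions B(Pk, Pl) = 1 and Cs(Pk) = 0 from Section 4.2 and is our own illustration rather than the authors' implementation.

```python
def fitness(phi, psi, pred, W, Cd):
    """PE of a molecule = makespan of the schedule it encodes (Eq. (24)).
    phi: subtask ids in a precedence-respecting order; psi[i]: node of phi[i];
    W[(t, p)]: computation cost; Cd[(u, t)]: communication data (bandwidth = 1)."""
    node_of, finish, node_ready = {}, {}, {}
    for pos, t in enumerate(phi):
        p = psi[pos]
        data_ready = max(
            (finish[u] + (0.0 if node_of[u] == p else Cd[(u, t)]) for u in pred[t]),
            default=0.0)
        start = max(data_ready, node_ready.get(p, 0.0))
        finish[t] = start + W[(t, p)]
        node_of[t] = p
        node_ready[p] = finish[t]
    # makespan = finish time of the exit subtask (the largest finish time)
    return max(finish.values())
```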
5.2. Elementary chemical reaction operations

This subsection presents the four elementary chemical reaction operations designed in DMSCRO: on-wall collision, decomposition, inter-molecular collision and synthesis.

5.2.1. On-wall ineffective collision

In this paper, the operation OnWall(ω) is used to generate a new solution ω′ from a given big molecule ω. The following steps are performed: (1) the operation randomly chooses an atom (subtask) Ti in the small molecule ϕ, and then finds the first predecessor of Ti, say Tj, scanning from the selected position towards the beginning of molecule ϕ. (2) A random number k ∈ [j + 1, i − 1] is generated, the atom Ti is stored in a temporary variable Temp, and then, starting from position i − 1, the operation shifts each atom one place to the right until position k is reached. (3) The operation moves the atom in Temp to position k. The rest of the atoms in ϕ′ are the same as those in ϕ. For the small molecule ψ, the operation adjusts the atoms at the same positions as in ϕ and generates a molecule ψ′. In the end, the operation generates a new molecule ω′ consisting of ϕ′ and ψ′. The operation is illustrated in Figs. 6 and 7, and its detailed steps are presented in Algorithm 3.

Fig. 6. Illustration of the molecular structure change for on-wall ineffective collision.

Fig. 7. Illustration of the subtask-to-computing-node mapping for on-wall ineffective collision.

Algorithm 3 OnWall(ω) function
Input: Molecule ω;
Output: New molecule ω′;
1: Choose randomly an atom Ti in the small molecule ϕ;
2: Find the first predecessor Tj of Ti from i to the beginning of molecule ϕ;
3: Generate a random number k ∈ [j + 1, i − 1];
4: Temp ← Ti;
5: for m = i − 1 downto k do
6:   Tm ← Tm−1;
7: end for
8: Tk ← Temp;
9: Generate a new small molecule ϕ′;
10: Generate δ ∈ [0, 1];
11: if δ < 0.5 then
12:   Temp ← Pi;
13:   for m = i − 1 downto k do
14:     Pm ← Pm−1;
15:   end for
16:   Pk ← Temp;
17:   Generate a new small molecule ψ′;
18: end if
19: Generate a new molecule ω′ consisting of ϕ′ and ψ′;
20: return the new molecule ω′;

5.2.2. Decomposition

In this paper, the operation Decomp(ω) is used to generate two new solutions, ω1′ and ω2′, from a given solution ω. This operation first uses the steps of the on-wall collision to generate ω1′. Then the operation generates the other new solution ω2′ in a similar fashion; the only difference is that in Step 1, the position of the first successor of Ti is used as the starting position instead of randomly selecting a position. Note that, because of decomposition (and synthesis, discussed later), the number of solutions in the CRO process may vary during the reactions, which is a feature of CRO that differs from genetic algorithms. The operation is illustrated in Figs. 8 and 9, and its detailed steps are presented in Algorithm 4.

Fig. 8. Illustration of the molecular structure change for decomposition.

Fig. 9. Illustration of the task-to-computing-node mapping for decomposition.

Algorithm 4 Decomp(ω) function
Input: Molecule ω;
Output: Two new molecules ω1′ and ω2′;
1: Choose randomly an atom Ti in the small molecule ϕ;
2: Find the first predecessor Tj of Ti from i to the beginning;
3: Temp ← Ti;
4: for k = i − 1 downto j + 1 do
5:   Tk ← Tk−1;
6: end for
7: Tj+1 ← Temp;
8: Generate a new small molecule ϕ′;
9: Change some atoms accordingly in the molecule ψ and generate a new small molecule ψ′;
10: Generate a new molecule ω1′ consisting of ϕ′ and ψ′;
11: Choose randomly an atom Ti in the small molecule ϕ;
12: Find the first predecessor Tj of Ti from i to the beginning;
13: Temp ← Ti;
14: for k = i − 1 downto j + 1 do
15:   Tk ← Tk−1;
16: end for
17: Tj+1 ← Temp;
18: Generate a new small molecule ϕ′;
19: Change some atoms accordingly in the molecule ψ and generate a new small molecule ψ′;
20: Generate a new molecule ω2′ consisting of ϕ′ and ψ′;
21: return the new molecules ω1′ and ω2′;
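The precedence-preserving shift at the heart of OnWall (and reused by Decomp) can be sketched in Python as follows; the move relocates the chosen subtask to a random position after its nearest already-placed predecessor, so the new order stays a valid topological order. This is an illustration of Algorithm 3 under our own encoding assumptions, not the authors' code.

```python
import random

def on_wall(phi, psi, pred):
    """One on-wall ineffective collision (cf. Algorithm 3) on copies of phi/psi."""
    phi, psi = phi[:], psi[:]
    i = random.randrange(1, len(phi))                    # randomly chosen atom Ti
    preds = pred[phi[i]]
    # position j of the nearest predecessor of Ti to the left (-1 if none)
    j = max((p for p in range(i) if phi[p] in preds), default=-1)
    if i - j > 1:                                        # room to move Ti earlier
        k = random.randrange(j + 1, i)                   # new position k in [j+1, i-1]
        phi.insert(k, phi.pop(i))                        # shift k..i-1 right, put Ti at k
        if random.random() < 0.5:                        # optionally shift the mapping too
            psi.insert(k, psi.pop(i))
    return phi, psi
```

Decomposition applies this kind of move twice to produce ω1′ and ω2′, which is why the population size can grow during the search.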
5.2.3. Inter-molecular ineffective collision

In this paper, the operation Intermole(ω1, ω2) is used to generate two new solutions, ω1′ and ω2′, from the given solutions ω1 and ω2. The new solutions are generated in the following steps. (1) An integer i is randomly selected between 1 and n. (2) ϕ1 and ϕ2 are cut at position i into left and right segments. (3) The left segments of ϕ1′ and ϕ2′ are inherited from the left segments of ϕ1 and ϕ2, respectively. (4) The right segment of ϕ1′ is filled with the subtasks of ϕ2 that do not appear in the left segment of ϕ1′ (and likewise for ϕ2′ with the subtasks of ϕ1). The atoms at the corresponding positions in ψ1 and ψ2 are also adjusted in the same fashion to generate ψ1′ and ψ2′. As a result, the operation generates ω1′ and ω2′ from ω1 and ω2. The operation is illustrated in Figs. 10 and 11, and its detailed steps are outlined in Algorithm 5.

Fig. 10. Illustration of the molecular structure change for inter-molecular ineffective collision.

Fig. 11. Illustration of the task-to-computing-node mapping for inter-molecular ineffective collision.

Algorithm 5 Intermole(ω1, ω2) function
Input: Two molecules ω1 and ω2;
Output: Two new molecules ω1′ and ω2′;
1: Choose randomly a cut-off point i ∈ [1, n];
2: Cut the small molecules ϕ1 and ϕ2 into left and right segments;
3: Cut the small molecules ψ1 and ψ2 into left and right segments in the same way;
4: Inherit the left segment of ϕ1′ from the left segment of ϕ1;
5: Copy the subtasks in ϕ2 that do not appear in the left segment of ϕ1′ to the right segment of ϕ1′;
6: Inherit the left segment of ψ1′ from the left segment of ψ1;
7: Copy the corresponding atoms of ψ2 to the right segment of ψ1′;
8: Generate a new molecule ω1′ which consists of ϕ1′ and ψ1′;
9: Inherit the left segment of ϕ2′ from the left segment of ϕ2;
10: Copy the subtasks in ϕ1 that do not appear in the left segment of ϕ2′ to the right segment of ϕ2′;
11: Inherit the left segment of ψ2′ from the left segment of ψ2;
12: Copy the corresponding atoms of ψ1 to the right segment of ψ2′;
13: Generate a new molecule ω2′ which consists of ϕ2′ and ψ2′;
14: return the new molecules ω1′ and ω2′;

5.2.4. Synthesis

In this paper, the operation Synth(ω1, ω2) is used to generate a new solution ω′ from two existing solutions ω1 and ω2. The new solution is generated in the following steps. (1) An integer i between 1 and n is randomly selected. (2) ϕ1 and ϕ2 are cut at position i into left and right segments. (3) The left segment of ϕ′ is inherited from the corresponding segment of ϕ1, and each task in the right segment of ϕ′ comes from ϕ2 and does not appear in the left segment of ϕ′. The atoms at the corresponding positions in ψ1 and ψ2 are then adjusted in the same fashion to generate ψ′. The operation is illustrated in Figs. 12 and 13, and its detailed steps are outlined in Algorithm 6.

Fig. 12. Illustration of the molecular structure change for synthesis.

Fig. 13. Illustration of the task-to-computing-node mapping for synthesis.

Algorithm 6 Synth(ω1, ω2) function
Input: Two molecules ω1 and ω2;
Output: The new molecule ω′;
1: Generate ρ ∈ [0, 1];
2: if ρ < 0.5 then
3:   The left part of ω′ is inherited from the original small molecule ϕ1;
4:   The right part of ω′ is copied from the other original small molecule ψ2;
5:   Generate a new molecule ω′ which consists of ϕ1 and ψ2;
6: else
7:   The left part of ω′ is inherited from the original small molecule ϕ2;
8:   The right part of ω′ is copied from the other original small molecule ψ1;
9:   Generate a new molecule ω′ which consists of ϕ2 and ψ1;
10: end if
11: return the new molecule ω′;
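The order-preserving cut-and-fill used by Intermole, and the structure swap used by Synth, can be sketched as below. Because each parent order is topologically valid and the tail keeps the other parent's relative order, the children remain valid topological orders. The handling of ψ (left part from one parent, right part from the other) is one plausible reading of "adjusted in the same fashion" and is flagged here as an assumption.

```python
import random

def cross_order(phi_a, phi_b, cut):
    """Keep phi_a[:cut]; append phi_b's subtasks not already kept, in phi_b's order."""
    head = phi_a[:cut]
    kept = set(head)
    return head + [t for t in phi_b if t not in kept]

def intermole(phi1, psi1, phi2, psi2):
    """Inter-molecular ineffective collision (cf. Algorithm 5): two children."""
    cut = random.randrange(1, len(phi1))
    child1 = (cross_order(phi1, phi2, cut), psi1[:cut] + psi2[cut:])
    child2 = (cross_order(phi2, phi1, cut), psi2[:cut] + psi1[cut:])
    return child1, child2

def synth(phi1, psi1, phi2, psi2):
    """Synthesis (cf. Algorithm 6): pair one parent's order with the other's mapping."""
    if random.random() < 0.5:
        return phi1[:], psi2[:]
    return phi2[:], psi1[:]
```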
5.3. The outline and analysis of DMSCRO

The entire algorithm of using DMSCRO to schedule a DAG job is outlined in Algorithm 7. In the algorithm, DMSCRO first initializes the process (Steps 1–3). Then the process enters a loop (Steps 4–37). In each iteration, one of the elementary chemical reaction operations is performed to generate new molecules, and the PE of the newly generated molecules is calculated. The iteration repeats until the stopping criteria are met. The stopping criteria may be set based on different parameters, such as the maximum amount of CPU time used, the maximum number of iterations performed, an objective function value below a predefined threshold being obtained, or the maximum number of iterations performed without further performance improvement. In the implementation used for the experiments in this paper, the stopping criterion is that there is no makespan improvement after 10 000 consecutive iterations of the search loop. After the search loop stops, the molecule with the lowest PE contains the best DAG scheduling solution (Step 38).

Algorithm 7 The outline of DMSCRO
Input: The DAG job;
Output: The DAG scheduling solution;
1: Initialize PopSize, KELossRate, MoleColl, InitialKE, α and β;
2: Call Algorithm 1 to generate the initial population, Pop;
3: Call Algorithm 2 to calculate the PE of each molecule in Pop;
4: while the stopping criteria are not met do
5:   Generate b ∈ [0, 1];
6:   if b > MoleColl then
7:     Select a molecule ω from Pop randomly;
8:     if (NumHit_ω − MinHit_ω) > α then
9:       Call Algorithm 4 to generate new molecules ω1′ and ω2′;
10:      Call Algorithm 2 to calculate PE_{ω1′} and PE_{ω2′};
11:      if Inequality (3) holds then
12:        Remove ω from Pop;
13:        Add ω1′ and ω2′ to Pop;
14:      end if
15:    else
16:      Call Algorithm 3 to generate a new molecule ω′;
17:      Call Algorithm 2 to calculate PE_{ω′};
18:      Remove ω from Pop;
19:      Add ω′ to Pop;
20:    end if
21:  else
22:    Select two molecules ω1 and ω2 from Pop randomly;
23:    if (KE_{ω1} < β) and (KE_{ω2} < β) then
24:      Call Algorithm 6 to generate a new molecule ω′;
25:      Call Algorithm 2 to calculate PE_{ω′};
26:      if Inequality (11) holds then
27:        Remove ω1 and ω2 from Pop;
28:        Add ω′ to Pop;
29:      end if
30:    else
31:      Call Algorithm 5 to generate two new molecules ω1′ and ω2′;
32:      Call Algorithm 2 to calculate PE_{ω1′} and PE_{ω2′};
33:      Remove ω1 and ω2 from Pop;
34:      Add ω1′ and ω2′ to Pop;
35:    end if
36:  end if
37: end while
38: return the molecule with the lowest PE in Pop;
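The control flow of Algorithm 7 can be summarized by the Python skeleton below. It assumes the operator callables return fully evaluated Molecule objects (with pe, ke, num_hit and min_hit maintained elsewhere) and omits the buffer bookkeeping of Eqs. (2)–(6); it is a sketch of the decision logic only, not a complete implementation.

```python
import random

def dmscro_loop(pop, mole_coll, alpha, beta, buffer_energy,
                onwall, decomp, intermole, synth, stop):
    """Skeleton of the DMSCRO search loop (cf. Algorithm 7, Steps 4-38)."""
    while not stop(pop):
        if random.random() > mole_coll:                       # uni-molecular branch
            w = random.choice(pop)
            if w.num_hit - w.min_hit > alpha:                  # too long without improvement
                w1, w2 = decomp(w)                             # decomposition
                if w.pe + w.ke + buffer_energy >= w1.pe + w2.pe:   # Inequality (3)
                    pop.remove(w)
                    pop.extend([w1, w2])
            else:
                pop.remove(w)
                pop.append(onwall(w))                          # on-wall collision
        else:                                                  # inter-molecular branch
            w1, w2 = random.sample(pop, 2)
            if w1.ke < beta and w2.ke < beta:
                w_new = synth(w1, w2)                          # synthesis
                if w1.pe + w2.pe + w1.ke + w2.ke >= w_new.pe:  # Inequality (11)
                    pop.remove(w1)
                    pop.remove(w2)
                    pop.append(w_new)
            else:
                n1, n2 = intermole(w1, w2)                     # ineffective collision
                pop.remove(w1)
                pop.remove(w2)
                pop.extend([n1, n2])
    return min(pop, key=lambda m: m.pe)                        # Step 38: lowest PE
```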
The time complexity of DMSCRO is analyzed as follows. According to Algorithm 7, the time is mainly spent in running the search loop (Steps 4–37) of the proposed DMSCRO. In each iteration of the loop, the algorithm needs to (1) perform one of the four elementary operations and (2) evaluate the fitness function (i.e., the makespan) of each newly generated molecule. In the on-wall collision, the main steps are (1) finding the first predecessor of a randomly selected task in the encoding array, and (2) adjusting the order of the subtasks between the selected subtask and its predecessor. The time complexity of both steps is O(n). Therefore, the time complexity of the on-wall collision is O(2 × n). In decomposition, the operations involved in the on-wall collision are performed twice; therefore, the time complexity of decomposition is O(4 × n). In the synthesis operation, the main steps are (1) copying the head portion of molecule 1 to a new molecule, and (2) copying the subtasks that are in molecule 2 but not in the head portion of molecule 1 to the tail portion of the new molecule. The time complexity of the first step is O(n). The time complexity of the second step is O(n²), because for each subtask in the subtask queue ϕ of molecule 2, the step needs to check whether the subtask is already in the head portion of molecule 1. So the time complexity of the synthesis operation is O(n + n²). In the inter-molecular collision, similar steps are performed, but two new molecules are generated, so the time complexity of the inter-molecular collision is O(2 × (n + n²)). As discussed above, the most expensive operation is the inter-molecular collision. The time complexity of evaluating the fitness function of a molecule (i.e., calculating the makespan of a DAG schedule) is O(e × m), where e is the number of edges in the DAG and m is the number of heterogeneous computing nodes. Up to two molecules (by inter-molecular collision or decomposition) may be generated in an iteration. Therefore, in an iteration, the worst-case time complexity is O(2 × (n + n²) + 2 × e × m), and the time complexity of DMSCRO is O(iters × 2 × (n + n² + e × m)), where iters is the number of iterations performed by DMSCRO.

The space complexity of DMSCRO can be analyzed as follows. In DMSCRO, each molecule needs an array of size (n + n) to store it. Moreover, for each molecule, we use an extra molecule space to store the molecular structure that produces the minimum makespan during the evolution of the molecule. Therefore, the space complexity caused by a molecule is O(4 × n). There are PopSize molecules in the initial population. Therefore, the space complexity of DMSCRO is O(PopSize × 4 × n). Note that, although the number of molecules may change during the evolution, the setting of the decomposition and synthesis probabilities in DMSCRO dictates that the average number of molecules statistically remains PopSize during the evolution.

It is very difficult to theoretically prove the optimality of the CRO (as well as the DMSCRO) scheme. However, by analyzing the chemical reaction operations designed in DMSCRO and the operational environment of DMSCRO, it can be shown to some extent that DMSCRO enjoys the advantages of both GA and SA. The inter-molecular collision designed in DMSCRO exchanges partial structures of two different molecules, which acts like the crossover operation in GA, while the on-wall collision designed in DMSCRO randomly changes the molecular structures, which has an effect similar to the mutation operation in GA. On the other hand, the energy conservation requirement in DMSCRO is able to guide the search for the optimal solution in a similar way as the Metropolis algorithm of SA guides the evolution of the solutions in SA. In addition to the similarity to GA and SA, DMSCRO has two additional operations: decomposition and synthesis. These two operations may change the number of solutions in a population, which gives DMSCRO more opportunities to jump out of local optima and explore wider areas of the solution space. This benefit enables DMSCRO to find good solutions faster than GA, which is demonstrated by the experimental results in Section 6.4. It is also worth emphasizing that uni-molecular (on-wall) and inter-molecular collisions play different roles in the DMSCRO process. As discussed above, the inter-molecular collision in DMSCRO bears similarity to the crossover operation in genetic algorithms. If only this type of collision were performed, the search would be likely to get locked in a local optimum.
On the other hand, the uni-molecular collision in DMSCRO maintains the diversity of molecules by randomly mutating the molecular structures, which can help the search jump out of local optima and reach wider areas of the solution space. Both types of collisions are important for the search in DMSCRO.

6. Simulation and results

To illustrate the power of the DMSCRO-based DAG scheduling algorithm, we compare this algorithm with the previously proposed heuristics (HEFT_B and HEFT_T) [29] and also with a well-known meta-heuristic algorithm, the Genetic Algorithm (GA), presented in [12]. The makespan performance obtained by GA is used as the baseline performance; the GA used in the simulation experiments is therefore labeled BGA in the figures. The reason why we select HEFT_B and HEFT_T as the representatives of heuristic scheduling is that HEFT_B and HEFT_T outperform other heuristic scheduling algorithms in heterogeneous parallel systems, as discussed near the end of Section 2.1. The reasons why we compare DMSCRO with the GA presented in [12] are threefold. First, GA is to date the most popular meta-heuristic technique for scheduling. Second, DMSCRO absorbs the strengths of GA and SA (Simulated Annealing); therefore, we want to compare DMSCRO with GA to see the advantages of DMSCRO over GA. We do not compare DMSCRO with SA because SA performs operations on a single solution to generate new solutions, instead of on a population of solutions as in DMSCRO and GA. Indeed, the underlying principles and philosophies of DMSCRO and SA differ a lot, and a meta-heuristic algorithm operating on a population of solutions is typically able to find good solutions faster than one operating on a single solution. Finally, as we have discussed in the first paragraph of Section 2.2, the GA method presented in [12] has the closest workload and system model to that in our paper.

In the experiments, we consider two extensive sets of graphs as the workload: real-world application graphs and randomly generated application graphs. The performance has been evaluated in terms of makespan. Each makespan value plotted in the graphs is the result averaged over a number of independent runs, which we call the average makespan. For the real-world applications, the average makespan is obtained from 10 independent runs, while for the random graphs the average makespan is obtained from running 100 different random DAG graphs. DMSCRO is programmed in C# (the outline of the program is shown in Algorithm 7). A DAG task in the program is represented by a class, whose members include an array of subtasks, the matrix of the speed at which each subtask runs on each computing node, and the matrix of communication data between every pair of subtasks. A subtask in the program is also represented by a class, whose members include an array of predecessors of the subtask, an array of successors of the subtask, the indegree of the subtask, the outdegree of the subtask and the computational data of the subtask. The simulations are performed on the same PC with an Intel Core 2 Duo E6700 @ 2.66 GHz CPU and 2 GB RAM.

As discussed in Section 4.2, this paper adopts the MHM heterogeneity model, and the heterogeneity is represented by a two-dimensional matrix, S. In the experiments, the heterogeneity is set in the following way. We set a parameter h (0 < h < 1).
The DMSCRO parameters are set as follows: PopSize = 10, KELossRate = 0.2, MoleColl = 0.2, InitialKE = 1000, α = 500, β = 10, and buffer = 200. These values are derived from the literature [18].

6.1. Real-world application graphs

The first test set uses the task graphs of two real-world problems, Gaussian elimination [36] and a molecular dynamics code [16], to evaluate the performance of DMSCRO.

6.1.1. Gaussian elimination

Gaussian elimination is used to determine the solution of a system of linear equations. It systematically applies elementary row operations to a set of linear equations in order to convert it to upper triangular form. Once the coefficient matrix is in upper triangular form, back substitution is used to find the solution. The DAG for the Gaussian elimination algorithm with a matrix size of 7 is shown in Fig. 14. The total number of tasks in the graph is 27, and the largest number of tasks at the same level is 6.

Fig. 14. Gaussian elimination for a matrix of size 7.

This graph has been used to evaluate DMSCRO in the simulation. Since the structure of the graph is fixed, only the number of heterogeneous computing nodes and the communication-to-computation ratio (CCR) were varied; the CCR values were 0.1, 0.2, 1, 2, and 5 in the experiments. Since in Gaussian elimination the same operation is executed on every computing node and the same information is communicated between computing nodes, it is assumed that all tasks have the same computation cost and all communication links have the same communication cost.

Fig. 15. Average makespan for Gaussian elimination.
Fig. 16. Average makespan for Gaussian elimination; the number of computing nodes is 8.

Fig. 15 shows the makespan of DMSCRO, BGA, HEFT_B and HEFT_T under an increasing number of computing nodes. The average makespan decreases as the number of computing nodes increases, which is to be expected. It can also be seen that as the number of computing nodes increases, the advantage of DMSCRO and BGA over HEFT_B and HEFT_T diminishes. This may be because when more computing nodes are used to run the same set of tasks, less intelligent scheduling is needed to achieve good performance. The reason why DMSCRO and BGA typically outperform HEFT_B and HEFT_T is that DMSCRO and BGA search a wide area of the solution space for the optimal schedule, while HEFT_B and HEFT_T narrow the search down to a very small portion of the solution space by means of their heuristics. Therefore, DMSCRO and BGA are more likely to obtain better solutions than HEFT_B and HEFT_T.

The experiments show that DMSCRO and BGA achieve very similar performance. The fundamental reason is that both BGA and DMSCRO are meta-heuristic methods. According to the No-Free-Lunch Theorem in the area of meta-heuristics, all well-designed meta-heuristic methods that search for optimal solutions are the same in performance when averaged over all possible objective functions; in theory, as long as a well-designed meta-heuristic method runs for long enough, it will gradually approach the optimal solution.
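For readers who wish to reproduce this workload, the sketch below builds a Gaussian-elimination task graph of the kind used in Section 6.1.1. The exact edge structure (a pivot task per step feeding its update tasks, which in turn feed the next step) is our assumption based on the standard construction, not taken verbatim from the paper; for a matrix of size 7 it yields 27 tasks, matching the figure above.

```csharp
using System;
using System.Collections.Generic;

// One common way to build the Gaussian-elimination DAG: pivot task T[k,k] followed by
// update tasks T[k,j] (j > k) for each elimination step k. Illustrative sketch only.
static class GaussianDag
{
    public static List<(int from, int to)> Build(int m, out int taskCount)
    {
        var id = new Dictionary<(int k, int j), int>();
        int next = 0;
        for (int k = 1; k < m; k++)
        {
            id[(k, k)] = next++;                       // pivot task of step k
            for (int j = k + 1; j <= m; j++)
                id[(k, j)] = next++;                   // update task for column j
        }
        var edges = new List<(int, int)>();
        for (int k = 1; k < m; k++)
        {
            for (int j = k + 1; j <= m; j++)
                edges.Add((id[(k, k)], id[(k, j)]));   // pivot -> its update tasks
            if (k + 1 < m)
            {
                edges.Add((id[(k, k + 1)], id[(k + 1, k + 1)]));   // feeds the next pivot
                for (int j = k + 2; j <= m; j++)
                    edges.Add((id[(k, j)], id[(k + 1, j)]));       // feeds next step's updates
            }
        }
        taskCount = next;                              // m*(m+1)/2 - 1, i.e. 27 for m = 7
        return edges;
    }
}
```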
The BGA used in the experiments is taken from the literature, where it has been shown to be well designed. Therefore, the fact that DMSCRO achieves very similar performance to BGA indicates that the DMSCRO developed in our work is also well designed. A closer observation of the figure shows that DMSCRO slightly outperforms BGA in some cases. As discussed in the last paragraph of Section 5, all meta-heuristic methods that search for optimal solutions are the same in performance when averaged over all possible objective functions, and in theory each of them will gradually approach the optimal solution if it runs for long enough. The reason why DMSCRO performs better than BGA in some cases is simply that, when the stopping criterion set in the experiments (the makespan staying unchanged for 10 000 consecutive iterations of the search loop) is satisfied, the solution obtained by DMSCRO is better than that obtained by BGA. This also shows that DMSCRO is more efficient than BGA in searching for good solutions. More detailed experimental results in this respect are presented in Section 6.4.

Fig. 16 shows the performance of the four algorithms under increasing CCR. It can be observed that the average makespan increases rapidly with the increase of CCR. This may be because when CCR increases, the application becomes more communication-intensive, and consequently the heterogeneous computing nodes stay idle for longer. Fig. 16 also shows that DMSCRO and BGA outperform HEFT_B and HEFT_T and that the advantage becomes more prominent as CCR grows. These results suggest that the heuristic algorithms, HEFT_B and HEFT_T, perform less effectively for communication-intensive applications, while DMSCRO and BGA can deliver more consistent performance over a wide range of scheduling scenarios.

6.1.2. Molecular dynamics code

Fig. 17 represents the DAG of a molecular dynamics code as given in [16]. Again, since the graph has a fixed structure, the only parameters that could be varied were the number of heterogeneous computing nodes and the CCR values. The CCR values used in our experiments are 0.1, 0.2, 1, 2, and 5. Figs. 18 and 19 show the average makespan of DMSCRO, BGA, HEFT_B and HEFT_T under different numbers of heterogeneous computing nodes and different CCR values, respectively. Fig. 18 shows the decrease in average makespan with the increasing number of heterogeneous computing nodes, and Fig. 19 shows that the average makespan increases with increasing CCR.

6.2. Randomly generated application graphs

In these experiments, we used randomly generated task graphs to evaluate the performance. To generate the random graphs, we implemented a random graph generator which allows the user to produce a variety of random graphs with different characteristics. The input parameters of the generator are the CCR, the number of instructions in a subtask (representing the computational data), the number of successors of a subtask (representing the degree of parallelism) and the number of subtasks in a graph. We have generated a large set of random task graphs with different characteristics and scheduled these task graphs on a heterogeneous computing system.
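A minimal sketch of such a generator is given below. It is not the authors' implementation: the way edges are drawn (only from lower-indexed to higher-indexed subtasks, which keeps the graph acyclic) and the enforcement of the CCR by globally scaling communication volumes are our assumptions, and all names are illustrative.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Sketch of a random DAG generator taking the inputs listed above:
// number of subtasks, computational data per subtask, number of successors, and CCR.
class RandomDagGenerator
{
    readonly Random rng = new Random();

    public (double[] comp, double[,] comm, List<int>[] succ) Generate(
        int subtasks, int maxSuccessors, double ccr,
        double minComp = 1, double maxComp = 80)
    {
        var comp = new double[subtasks];
        var succ = new List<int>[subtasks];
        var comm = new double[subtasks, subtasks];

        for (int i = 0; i < subtasks; i++)
        {
            comp[i] = minComp + rng.NextDouble() * (maxComp - minComp);  // computational data W(Ti)
            succ[i] = new List<int>();
        }

        // Add edges only from lower-indexed to higher-indexed subtasks, so the graph stays acyclic.
        for (int i = 0; i < subtasks - 1; i++)
        {
            int outDeg = 1 + rng.Next(maxSuccessors);                    // e.g. 1..4 successors
            for (int s = 0; s < outDeg; s++)
            {
                int j = i + 1 + rng.Next(subtasks - i - 1);
                if (!succ[i].Contains(j))
                {
                    succ[i].Add(j);
                    comm[i, j] = rng.NextDouble();                       // provisional communication volume
                }
            }
        }

        // Scale communication so that (total communication) / (total computation) = CCR.
        double totalComp = comp.Sum();
        double totalComm = 0;
        foreach (double c in comm) totalComm += c;
        double scale = totalComm > 0 ? ccr * totalComp / totalComm : 0;
        for (int i = 0; i < subtasks; i++)
            for (int j = 0; j < subtasks; j++)
                comm[i, j] *= scale;

        return (comp, comm, succ);
    }
}
```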
The following parameter values are used in the simulation experiments, unless otherwise stated. The number of subtasks in a DAG is randomly selected from 10 to 200. The number of successors that a subtask can have is a random number between 1 and 4. The number of computing nodes is 32. One of the initial scheduling solutions is the schedule generated by HEFT_B, while the other initial solutions are randomly generated. The computational costs of the DAG subtasks are generated as follows. The computational data of each subtask in the DAG (i.e., W(Ti)) is randomly selected from a range, which is [1, 80] unless otherwise stated. The computational cost of subtask Ti on computing node Pk is then calculated using Eq. (13).

Fig. 17. A molecular dynamics code.
Fig. 18. Average makespan for the molecular dynamics code.
Fig. 19. Average makespan for the molecular dynamics code; the number of computing nodes is 16.
Fig. 20. Average makespan for different subtask numbers, CCR = 10; the number of computing nodes is 32.

We have evaluated the performance of the algorithms under different parameters, including the number of subtasks, the number of heterogeneous computing nodes and the CCR value. DMSCRO is compared with the other algorithms in terms of makespan, and each value plotted in the figures is the result averaged over 100 different random DAG graphs. As shown in Fig. 20, DMSCRO always outperforms HEFT_B, HEFT_T and BGA as the number of subtasks in a DAG increases. The reasons are the same as those given for Fig. 15.

Fig. 21 shows the average makespan under increasing values of CCR. It can be observed that the average makespan increases rapidly with the increase of CCR. This may be because when CCR increases, the application becomes more communication-intensive, and consequently the computing nodes stay idle for longer.

Fig. 21. Average makespan of DMSCRO under different values of CCR, subtasks = 100.
Fig. 22. Average makespan of the four algorithms under different computing node numbers and low communication costs, subtasks = 100.
Fig. 23. Average makespan of the four algorithms under different computing node numbers and high communication costs, subtasks = 100.

Figs. 22 and 23 both compare the average makespan of the four algorithms under an increasing number of heterogeneous computing nodes. Fig. 22 considers a workload with a low value of CCR (computation-intensive), while Fig. 23 considers a high value of CCR (communication-intensive). As can be seen from these two figures, DMSCRO outperforms the heuristic algorithms in all cases. Moreover, by comparing the two figures, it can be seen that when the CCR is low, the makespan decreases quickly (i.e., the performance improves quickly) as the number of heterogeneous computing nodes increases, whereas when the CCR is high, the makespan decreases at a much lower rate. This is because when CCR is high, the makespan tends to be high, as explained for Fig. 21, and therefore it is difficult to reduce the makespan even if the number of heterogeneous computing nodes is increased.

6.3. Impact of heterogeneity on makespan

We have conducted experiments to demonstrate the impact of heterogeneity on makespan. In the experiments, we change the value of h so that the level of heterogeneity increases from 1 to 4.
Note that when the level of heterogeneity is 1, the computing system is homogeneous, and that the average computing node speed remains unchanged as we change the value of h and therefore the level of heterogeneity. The experimental results are plotted in Fig. 24, where the level of heterogeneity on the x-axis is calculated as (1 + h%)/(1 − h%). As can be seen from this figure, the makespan decreases in all cases as the level of heterogeneity increases. These results can be explained as follows. When the level of heterogeneity increases, some computing nodes become increasingly powerful, and the subtasks assigned to them run faster. The DMSCRO algorithm is able to exploit this opportunity and intelligently allocate a suitable number of subtasks to those powerful computing nodes; consequently, the makespan decreases.

Fig. 24. The impact of heterogeneity on makespan.

A closer observation of Fig. 24 shows that when the average number of tasks in a DAG is small, the makespan almost tails off once the level of heterogeneity increases beyond a certain value. For example, in the case of the average number of tasks being 50, the makespan remains almost unchanged when the level of heterogeneity is greater than 2.8. This is because when the average number of tasks is small, the workload is also small and it is easier to reach the top performance; although increasing the level of heterogeneity can improve the makespan, further improvement becomes increasingly difficult as the performance approaches this ceiling.

6.4. Convergence trace of DMSCRO

The experiments in the previous subsections show the final makespan achieved by DMSCRO and GA after the stopping criteria are satisfied. These results show that DMSCRO can obtain similar makespan performance to GA and that in some cases the final makespan obtained by DMSCRO is even better than that obtained by GA when they stop searching. In this section, we conduct experiments to show how the makespan changes as DMSCRO and GA progress during the search, and we then compare the convergence traces of the two algorithms. These experiments help further reveal the differences between DMSCRO and GA, and also help explain why the final makespan of DMSCRO is sometimes better than that of GA.

Fig. 25. The convergence trace for Gaussian elimination; averaged over 10 independent runs, CCR = 1, 8 processors.

Figs. 25 and 26 show the convergence traces when processing Gaussian elimination and the molecular dynamics code, respectively. Figs. 27–31 plot the convergence traces for processing the sets of randomly generated DAGs, with the sets containing DAGs of 10, 20, 50, 100 and 200 subtasks, respectively. It can be observed from these figures that the makespan decreases quickly as both DMSCRO and GA progress and that the decreasing trends tail off when the algorithms run for long enough. These figures also show that although the final makespans achieved by the two algorithms are almost the same in most cases, their convergence traces are rather different: in all cases, DMSCRO converges faster than GA. Quantitatively, our records show that DMSCRO converges faster than GA by 26.5% on average (and by 58.9% in the best case). In these experiments, the algorithms stop the search when the performance stabilizes (i.e., the makespan remains unchanged) for a preset number of consecutive iterations of the search loop (10 000 iterations in the experiments).
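The stagnation-based stopping rule and the recording of the convergence trace can be sketched as follows. This is an illustrative helper of ours (class and member names are not from the paper), called once per iteration of the search loop with the best makespan found so far.

```csharp
using System;
using System.Collections.Generic;

// Sketch of the stopping rule used in the experiments: stop once the best makespan has
// remained unchanged for a preset number of consecutive iterations (10 000 here).
class ConvergenceTracker
{
    readonly int stagnationLimit;
    double bestMakespan = double.MaxValue;
    int unchangedIterations;

    public List<double> Trace { get; } = new List<double>();   // best makespan after each iteration

    public ConvergenceTracker(int stagnationLimit = 10000) => this.stagnationLimit = stagnationLimit;

    // Returns true when the search should stop.
    public bool Update(double makespan)
    {
        if (makespan < bestMakespan)
        {
            bestMakespan = makespan;
            unchangedIterations = 0;
        }
        else
        {
            unchangedIterations++;
        }
        Trace.Add(bestMakespan);
        return unchangedIterations >= stagnationLimit;
    }
}
```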
In reality, the stopping criterion could instead be that the algorithm stops when its total running time reaches a preset value (e.g., 60 s). In that case, the fact that DMSCRO converges faster than GA means that the makespan obtained by DMSCRO could be considerably better than that obtained by GA when the algorithms stop. The reason for this can be explained by the analysis presented in the last paragraph of Section 5.3, i.e., DMSCRO enjoys the advantages of both GA and SA.

Fig. 26. The convergence trace for the molecular dynamics code; averaged over 10 independent runs, CCR = 1, 16 processors.
Fig. 27. The convergence trace for the randomly generated DAGs with each containing 10 subtasks.
Fig. 28. The convergence trace for the randomly generated DAGs with each containing 20 subtasks.
Fig. 29. The convergence trace for the randomly generated DAGs with each containing 50 subtasks.
Fig. 30. The convergence trace for the randomly generated DAGs with each containing 100 subtasks.
Fig. 31. The convergence trace for the randomly generated DAGs with each containing 200 subtasks.

7. Conclusions

In this paper, we developed DMSCRO for DAG scheduling on heterogeneous computing systems. The algorithm incorporates two molecular structures: one evolves to generate the priority queueing of the subtasks in a DAG, and the other evolves to generate the task-to-computing-node mapping. Four elementary chemical reaction operations are designed in DMSCRO; they take into account the precedence relations among the subtasks and guarantee that any newly generated priority queueing complies with those precedence relations. As a result, the DMSCRO algorithm can cover a much larger search space than heuristic scheduling approaches. The experiments show that DMSCRO outperforms HEFT_B and HEFT_T and can achieve a higher speedup of task executions.

In future work, we plan to extend DMSCRO by investigating the following three issues. First, we will study the impact of important parameters on the initial solution generator. Second, we plan to use MPI to parallelize the running of DMSCRO, so as to further reduce the time needed to find good solutions. Finally, the quadratic relationship between voltage and energy has made dynamic voltage scaling (DVS) one of the most powerful techniques for reducing system power demands by lowering the supply voltage and operating frequency; we will apply the DVS technique to our DMSCRO task scheduling algorithm to realize a bi-objective optimization that minimizes both makespan and energy.

Acknowledgments

This research was partially funded by the Key Program of the National Natural Science Foundation of China (Grant No. 61133005), the National Natural Science Foundation of China (Grant Nos. 61070057, 61173045), the Key Projects in the National Science & Technology Pillar Program, the Cultivation Fund of the Key Scientific and Technical Innovation Project (2012BAH09B02), and the Ph.D. Programs Foundation of the Ministry of Education of China (20100161110019). The project was also supported by the National Science Foundation for Distinguished Young Scholars of Hunan (12JJ1011), the Science and Technology Plan Projects of Hunan Province (Grant No. 2013GK3082) and the Research Project Grant of the Leverhulme Trust (Grant No. RPG-101).

References
[1] A. Amini, T.Y. Wah, M. Saybani, S. Yazdi, A study of density-grid based clustering algorithms on data streams, in: 2011 Eighth International Conference on Fuzzy Systems and Knowledge Discovery, FSKD, Vol. 3, 2011, pp. 1652–1656. http://dx.doi.org/10.1109/FSKD.2011.6019867.
[2] R. Bajaj, D. Agrawal, Improving scheduling of tasks in a heterogeneous environment, IEEE Transactions on Parallel and Distributed Systems 15 (2) (2004) 107–118. http://dx.doi.org/10.1109/TPDS.2004.1264795.
[3] T. Chen, B. Zhang, X. Hao, Y. Dai, Task scheduling in grid based on particle swarm optimization, in: The Fifth International Symposium on Parallel and Distributed Computing, ISPDC'06, 2006, pp. 238–245. http://dx.doi.org/10.1109/ISPDC.2006.46.
[4] H. Cheng, A high efficient task scheduling algorithm based on heterogeneous multi-core processor, in: 2010 2nd International Workshop on Database Technology and Applications, DBTA, 2010, pp. 1–4. http://dx.doi.org/10.1109/DBTA.2010.5659041.
[5] P. Choudhury, R. Kumar, P. Chakrabarti, Hybrid scheduling of dynamic task graphs with selective duplication for multiprocessors under memory and time constraints, IEEE Transactions on Parallel and Distributed Systems 19 (7) (2008) 967–980. http://dx.doi.org/10.1109/TPDS.2007.70784.
[6] L. He, D. Zou, Z. Zhang, C. Chen, H. Jin, S. Jarvis, Developing resource consolidation frameworks for moldable virtual machines in clouds, Future Generation Computer Systems (2012). http://dx.doi.org/10.1016/j.future.2012.05.015.
[7] H. El-Rewin, T. Lewis, Scheduling parallel program tasks onto arbitrary target machines, Parallel and Distributed Computing 9 (1990) 138–153.
[8] F. Ferrandi, P. Lanzi, C. Pilato, D. Sciuto, A. Tumeo, Ant colony heuristic for mapping and scheduling tasks and communications on heterogeneous embedded systems, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 29 (6) (2010) 911–924. http://dx.doi.org/10.1109/TCAD.2010.2048354.
[9] M.R. Garey, D.S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, W. H. Freeman, New York, 1979.
[10] H.-W. Ge, L. Sun, Y.-C. Liang, F. Qian, An effective PSO and AIS-based hybrid intelligent algorithm for job-shop scheduling, IEEE Transactions on Systems, Man, and Cybernetics, Part A: Systems and Humans 38 (2) (2008) 358–368. http://dx.doi.org/10.1109/TSMCA.2007.914753.
[11] N.B. Ho, J.C. Tay, Solving multiple-objective flexible job shop problems by evolution and local search, IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews 38 (5) (2008) 674–685. http://dx.doi.org/10.1109/TSMCC.2008.923888.
[12] E. Hou, N. Ansari, H. Ren, A genetic algorithm for multiprocessor scheduling, IEEE Transactions on Parallel and Distributed Systems 5 (2) (1994) 113–120. http://dx.doi.org/10.1109/71.265940.
[13] J. Hwang, Y. Chow, F. Anger, C. Lee, Scheduling precedence graphs in systems with interprocessor communication times, SIAM Journal on Computing 18 (2) (1989) 244–257. http://dx.doi.org/10.1137/0218016.
[14] M. Iverson, F. Ozguner, G. Follen, Parallelizing existing applications in a distributed heterogeneous environment, in: 1995 IEEE International Conference on Heterogeneous Computing Workshop, 1995, pp. 93–100.
[15] M. Kashani, M. Jahanshahi, Using simulated annealing for task scheduling in distributed systems, in: International Conference on Computational Intelligence, Modelling and Simulation, CSSim'09, 2009, pp. 265–269. http://dx.doi.org/10.1109/CSSim.2009.36.
[16] S. Kim, J. Browne, A general approach to mapping of parallel computation upon multiprocessor architectures, in: Proceedings of the International Conference on Parallel Processing, 1988, pp. 1–8.
[17] Y.-K. Kwok, I. Ahmad, Static scheduling algorithms for allocating directed task graphs to multiprocessors, ACM Computing Surveys (CSUR) 31 (4) (1999) 406–471. http://dx.doi.org/10.1145/344588.344618.
[18] A. Lam, V. Li, Chemical-reaction-inspired metaheuristic for optimization, IEEE Transactions on Evolutionary Computation 14 (3) (2010) 381–399. http://dx.doi.org/10.1109/TEVC.2009.2033580.
[19] H. Li, L. Wang, J. Liu, Task scheduling of computational grid based on particle swarm algorithm, in: 2010 Third International Joint Conference on Computational Science and Optimization, CSO, Vol. 2, 2010, pp. 332–336. http://dx.doi.org/10.1109/CSO.2010.34.
[20] K. Li, Z. Zhang, Y. Xu, B. Gao, L. He, Chemical reaction optimization for heterogeneous computing environments, in: 2012 IEEE 10th International Symposium on Parallel and Distributed Processing with Applications, ISPA, 2012, pp. 17–23. http://dx.doi.org/10.1109/ISPA.2012.11.
[21] F.-T. Lin, Fuzzy job-shop scheduling based on ranking level (λ, 1) interval-valued fuzzy numbers, IEEE Transactions on Fuzzy Systems 10 (4) (2002) 510–522. http://dx.doi.org/10.1109/TFUZZ.2002.800659.
[22] B. Liu, L. Wang, Y.-H. Jin, An effective PSO-based memetic algorithm for flow shop scheduling, IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 37 (1) (2007) 18–27. http://dx.doi.org/10.1109/TSMCB.2006.883272.
[23] F. Pop, C. Dobre, V. Cristea, Genetic algorithm for DAG scheduling in grid environments, in: IEEE 5th International Conference on Intelligent Computer Communication and Processing, ICCP 2009, August 2009, pp. 299–305. http://dx.doi.org/10.1109/ICCP.2009.5284747.
[24] R. Shanmugapriya, S. Padmavathi, S. Shalinie, Contention awareness in task scheduling using Tabu search, in: IEEE International Advance Computing Conference, IACC 2009, 2009, pp. 272–277. http://dx.doi.org/10.1109/IADCC.2009.4809020.
[25] L. Shi, Y. Pan, An efficient search method for job-shop scheduling problems, IEEE Transactions on Automation Science and Engineering 2 (1) (2005) 73–77. http://dx.doi.org/10.1109/TASE.2004.829418.
[26] G. Sih, E. Lee, A compile-time scheduling heuristic for interconnection-constrained heterogeneous processor architectures, IEEE Transactions on Parallel and Distributed Systems 4 (2) (1993) 175–187. http://dx.doi.org/10.1109/71.207593.
[27] S. Song, K. Hwang, Y.-K. Kwok, Risk-resilient heuristics and genetic algorithms for security-assured grid job scheduling, IEEE Transactions on Computers 55 (6) (2006) 703–719. http://dx.doi.org/10.1109/TC.2006.89.
[28] D.P. Spooner, J. Cao, S.A. Jarvis, L. He, G.R. Nudd, Performance-aware workflow management for grid computing, The Computer Journal 48 (3) (2005) 347–357. http://dx.doi.org/10.1093/comjnl/bxh090.
[29] H. Topcuoglu, S. Hariri, M.-Y. Wu, Performance-effective and low-complexity task scheduling for heterogeneous computing, IEEE Transactions on Parallel and Distributed Systems 13 (3) (2002) 260–274. http://dx.doi.org/10.1109/71.993206.
[30] T. Tsuchiya, T. Osada, T. Kikuno, A new heuristic algorithm based on GAs for multiprocessor scheduling with task duplication, in: 1997 3rd International Conference on Algorithms and Architectures for Parallel Processing, ICAPP 97, 1997, pp. 295–308. http://dx.doi.org/10.1109/ICAPP.1997.651499.
[31] A. Tumeo, C. Pilato, F. Ferrandi, D. Sciuto, P. Lanzi, Ant colony optimization for mapping and scheduling in heterogeneous multiprocessor systems, in: International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation, SAMOS 2008, 2008, pp. 142–149. http://dx.doi.org/10.1109/ICSAMOS.2008.4664857.
[32] B. Varghese, G.T. McKee, V. Alexandrov, Can agent intelligence be used to achieve fault tolerant parallel computing systems? Parallel Processing Letters 21 (4) (2011) 379–396. http://dx.doi.org/10.1142/S012962641100028X.
[33] J. Wang, Q. Duan, Y. Jiang, X. Zhu, A new algorithm for grid independent task schedule: genetic simulated annealing, in: World Automation Congress, WAC 2010, 2010, pp. 165–171.
[34] D.H. Wolpert, W.G. Macready, No free lunch theorems for optimization, IEEE Transactions on Evolutionary Computation 1 (1) (1997) 67–82. http://dx.doi.org/10.1109/4235.585893.
[35] Y.W. Wong, R. Goh, S.-H. Kuo, M. Low, A Tabu search for the heterogeneous DAG scheduling problem, in: 2009 15th International Conference on Parallel and Distributed Systems, ICPADS, 2009, pp. 663–670. http://dx.doi.org/10.1109/ICPADS.2009.127.
[36] M.-Y. Wu, D. Gajski, Hypertool: a programming aid for message-passing systems, IEEE Transactions on Parallel and Distributed Systems 1 (3) (1990) 330–343. http://dx.doi.org/10.1109/71.80160.
[37] J. Xu, A. Lam, V. Li, Chemical reaction optimization for the grid scheduling problem, in: 2010 IEEE International Conference on Communications, ICC, 2010, pp. 1–5. http://dx.doi.org/10.1109/ICC.2010.5502406.
[38] J. Xu, A. Lam, V. Li, Chemical reaction optimization for task scheduling in grid computing, IEEE Transactions on Parallel and Distributed Systems 22 (10) (2011) 1624–1631. http://dx.doi.org/10.1109/TPDS.2011.35.
[39] L. He, S.A. Jarvis, D.P. Spooner, H. Jiang, D.N. Dillenberger, G.R. Nudd, Allocating non-real-time and soft real-time jobs in multiclusters, IEEE Transactions on Parallel and Distributed Systems 17 (2) (2006) 99–112. http://doi.ieeecomputersociety.org/10.1109/TPDS.2006.18.

Yuming Xu received the master's degree from Hunan University, China, in 2009. He is currently working toward the Ph.D. degree at Hunan University, China. His research interests include modeling and scheduling for distributed computing systems, parallel algorithms, and Grid and Cloud computing.

Kenli Li received the Ph.D. in computer science from Huazhong University of Science and Technology, China, in 2003, and the M.Sc. in mathematics from Central South University, China, in 2000. He was a visiting scholar at the University of Illinois at Urbana-Champaign from 2004 to 2005. He is now a professor of Computer Science and Technology at Hunan University, associate director of the National Supercomputing Center in Changsha, and a senior member of the CCF. His major research interests include parallel computing, Grid and Cloud computing, and DNA computing. He has published more than 70 papers in international conferences and journals, such as IEEE TC, JPDC, PC, ICPP, and CCGrid.

Ligang He received the Bachelor's and Master's degrees from the Huazhong University of Science and Technology, Wuhan, China, and received the Ph.D.
degree in Computer Science from the University of Warwick, UK. He was also a post-doctoral researcher at the University of Cambridge, UK. In 2006, he joined the Department of Computer Science at the University of Warwick as an Assistant Professor, and then became an Associate Professor. His areas of interest are parallel and distributed computing, Grid computing and Cloud computing. He has published more than 50 papers in international conferences and journals, such as IEEE TPDS, IPDPS, Cluster, CCGrid, and MASCOTS. He has also served as a member of the program committee for many international conferences and as a reviewer for a number of international journals, including IEEE TPDS, IEEE TC and IEEE TASE. He is a member of the IEEE.

Tung Khac Truong received the B.S. in Mathematics from Hue College, Hue University, Vietnam, in 2001, and the M.S. in computer science from Hue University, Vietnam, in 2007. He is currently working toward the Ph.D. degree in computer science at the School of Computer and Communication, Hunan University. His research interests are soft computing and parallel computing.