
J. Parallel Distrib. Comput. 73 (2013) 1306–1322
A DAG scheduling scheme on heterogeneous computing systems using double molecular structure-based chemical reaction optimization
Yuming Xu a, Kenli Li a,∗, Ligang He b, Tung Khac Truong a,c
a College of Information Science and Engineering, Hunan University, Changsha, 410082, China
b Department of Computer Science, University of Warwick, United Kingdom
c Faculty of Information Technology, Industry University of Hochiminh City, Hochiminh, Vietnam
∗ Corresponding author. E-mail addresses: xxl1205@163.com (Y. Xu), lkl@hnu.edu.cn (K. Li), liganghe@dcs.warwick.ac.uk (L. He), tung147@gmail.com (T.K. Truong).
Highlights

• Applying CRO to solve DAG scheduling problems in heterogeneous computing systems.
• Developing DMSCRO by adapting the conventional CRO framework.
• Designing a new solution encoding method for DAG scheduling.
• Designing new operations for performing elementary chemical reactions in this work.
• Conducting experiments to verify the effectiveness and efficiency of the DMSCRO.
Article info
Article history:
Received 12 May 2012
Received in revised form
10 May 2013
Accepted 30 May 2013
Available online 6 June 2013
Keywords:
NP-hard problem
Chemical reaction optimization
Meta-heuristic approaches
Task scheduling
Makespan
Abstract

A new meta-heuristic method, called Chemical Reaction Optimization (CRO), has been proposed very recently. The method encodes solutions as molecules and mimics the interactions of molecules in chemical reactions to search for optimal solutions. The CRO method has demonstrated its capability in solving NP-hard optimization problems. In this paper, the CRO scheme is used to formulate the scheduling of Directed Acyclic Graph (DAG) jobs in heterogeneous computing systems, and a Double Molecular Structure-based Chemical Reaction Optimization (DMSCRO) method is developed. There are two molecular structures in DMSCRO: one is used to encode the execution order of the tasks in a DAG job, and the other to encode the task-to-computing-node mapping. The DMSCRO method also designs four elementary chemical reaction operations and a fitness function suitable for the scenario of DAG scheduling. We have also conducted simulation experiments to verify the effectiveness and efficiency of DMSCRO over a large set of randomly generated graphs and graphs for real-world problems.
© 2013 Elsevier Inc. All rights reserved.
1. Introduction
A job consisting of a group of tasks with precedence constraints
is often modeled as a Directed Acyclic Graph (DAG). When
scheduling a DAG job, the main objective is to optimize its
makespan, which is defined as the duration between the time when
the first task in the DAG starts execution and the time when the
last task finishes execution. This problem has been well-studied for
many decades. Achieving this objective has been proved to be an NP-complete problem [9], which means that the time needed to find
the optimal solution increases exponentially as the problem size
increases. Therefore, two schools of scheduling methods, heuristic scheduling and meta-heuristic scheduling, have been proposed to find sub-optimal solutions with lower time overhead.
Heuristic scheduling algorithms exploit the heuristics to
identify a good solution. An important class of heuristic scheduling
is list scheduling. List scheduling maintains an ordered list of
tasks in a DAG job according to some greedy heuristics. The tasks
are selected in the specified order for mapping to the computing
nodes which allow the earliest start times. Heuristic scheduling
algorithms can find solutions with low time complexity, since the
attempted solutions are narrowed down by greedy heuristics to
a very small portion of the entire solution space. However, the
quality of the solutions obtained by these algorithms is heavily
dependent on the effectiveness of the heuristics, and it is not likely
for the greedy heuristics to produce consistent results on a wide
range of problems, especially when the complexity of the DAG
scheduling problem becomes high.
Meta-heuristic scheduling (or guided-random-search-based) techniques work by guiding the search for solutions in a solution space. Although meta-heuristic scheduling typically takes a longer time, it can achieve consistently good performance for
a wide range of scheduling scenarios. Well-known examples of
meta-heuristic scheduling techniques include Genetic Algorithms
(GA), Particle Swarm Optimization (PSO), Ant Colony Optimization
(ACO), Simulated Annealing (SA) and Tabu Search (TS), etc. Very
recently, a new meta-heuristic method, called Chemical Reaction
Optimization (CRO), has been proposed. The method encodes
solutions as molecules and mimics the interactions of molecules
in chemical reactions to search for optimal solutions. The CRO
method has demonstrated its capability in solving NP-hard
optimization problems.
In this paper, we integrate the CRO framework to schedule DAG
jobs on heterogeneous computing systems. Scheduling DAG jobs
on heterogeneous systems involves making decisions about the
execution order of tasks and task-to-computing-node mapping.
This paper adapts the conventional CRO framework and proposes
a Double Molecular Structure-based CRO (DMSCRO) scheme to
formulate the scheduling of DAG jobs. In DMSCRO, one molecular structure is used to encode the execution order of the tasks in a DAG job, while the other is used to encode the task-to-computing-node mapping. DMSCRO also designs the necessary
elementary chemical reaction operations and the fitness function
suitable for the scenario of DAG scheduling.
According to the No-Free-Lunch Theorem in the area of meta-heuristics [34], all meta-heuristic methods that search for optimal solutions deliver the same performance when averaged over all possible objective functions. In theory, as long as an effective meta-heuristic method runs for long enough, it will gradually approach the optimal solution. We have conducted experiments over a large set of randomly generated graphs, and also over graphs abstracted from two well-known real applications: Gaussian elimination and a molecular dynamics application. The experimental results show that the proposed DMSCRO achieves better performance than the heuristic algorithms, and performance similar to the GA in the literature in terms of makespan. We will show in Section 5 that DMSCRO combines the advantages of GA and Simulated Annealing (SA), and may therefore have better performance in terms of searching efficiency. The experimental results presented in Section 6.4 indeed demonstrate that DMSCRO is able to find good solutions faster than GA.
The three major contributions of this work are summarized
below:
• Applying the Chemical Reaction Optimization framework to
solve DAG scheduling problems in heterogeneous computing
systems.
• Developing DMSCRO by adapting the conventional CRO framework and designing a new solution encoding method, new operations for performing elementary chemical reactions and a
new fitness function suitable for the scheduling scenarios considered in this work.
• Conducting simulation experiments to verify the effectiveness
and efficiency of the proposed DMSCRO. Our experimental
results show that (1) DMSCRO is able to achieve a similar makespan to GA, but it finds good solutions faster than GA by 26.5% on average (by 58.9% in the best case), and (2) DMSCRO consistently achieves smaller makespans than two heuristic scheduling algorithms (HEFT_B and HEFT_T) which have been shown in the literature to outperform other heuristic scheduling
algorithms. Compared with HEFT_B and HEFT_T, the makespan
reduction achieved by DMSCRO is 10% and 12.8% on average,
respectively.
The remainder of this paper is organized as follows: In Section 2,
the related work about scheduling algorithms on heterogeneous
systems is presented. Section 3 presents the background knowledge of CRO. Section 4 describes the system and workload model.
In Section 5, the DMSCRO scheme is presented for DAG scheduling, aiming to minimize the makespan on heterogeneous computing systems. Section 6 compares the performance of the proposed
scheme with the existing heuristic algorithms. Finally, Section 7
concludes the paper.
2. Related work
In this section, we discuss the related work on heuristic
scheduling, meta-heuristic (or guided-random-search-based)
scheduling and job-shop scheduling.
2.1. Heuristic scheduling
Heuristic methods usually provide good solutions for restricted instances of a task scheduling problem, and can typically find a near-optimal solution in polynomial time. A heuristic method searches one path in the solution space and ignores other possible paths [17,39].
There are three typical types of heuristic-based algorithms
for the DAG scheduling problem: list scheduling [17,29], cluster
scheduling [1,4], and duplication-based scheduling [30,2].
List scheduling is the most popular among these three when it comes to scheduling DAG jobs. A list scheduling algorithm is
typically divided into two phases. In the first phase, each task
is assigned a priority, and then added to a list of waiting tasks
in order of decreasing priority according to some criteria. In the
second phase, the task with the highest priority is selected and
assigned to the most suitable computing node. The main objective
of scheduling a DAG job is to minimize its makespan. Most list
scheduling algorithms use two attributes, b-level and t-level, to
assign priority for tasks. b-level of a task is the length of the longest
path from the exit task of the DAG job to the task, while t-level of a
task is the length of the longest path from the entry task to the task.
The major difference among different list scheduling algorithms
is the means by which the b-level and the t-level are used, and
the computing node is selected. Since this work investigates the
DAG scheduling in heterogeneous parallel systems, this subsection
mainly discusses the related work in heuristic DAG scheduling in
heterogeneous parallel systems.
Modified Critical Path (MCP) scheduling [36] uses the tasks’
latest start time to assign the scheduling priority. The latest start
times of the tasks on the critical path of the DAG are actually their
t-levels.
Earliest Time First (ETF) scheduling algorithm [13] uses the tasks’
earliest start times to assign the scheduling priority. Both ETF and
MCP allocate a task to a processor that minimizes the task’s start
time.
Dynamic-level scheduling (DLS) [26] calculates the difference
between the b-level of task Ti and Ti ’s earliest start time on
processor Pj . The difference is called the dynamic level of the (Ti , Pj )
pair. Each time, DLS selects the (task, processor) pair with the
highest dynamic level.
Mapping heuristic (MH) [7] uses the so called static b-level to
assign the scheduling priority. Static b-level is the b-level without
considering the communication costs between tasks. A task is then
allocated to a processor that gives the earliest start time.
Levelized-Min Time (LMT) [14] assigns the scheduling priority in
two phases. It first groups the tasks into different levels according to the DAG topology, and then, within the same level, the task with the biggest
computation cost has the highest priority. A task is allocated to a
processor that minimizes the sum of the task’s computation and
the total communication costs with the tasks in the previous level.
Ref. [29] also proposes two heuristic DAG scheduling algorithms
for heterogeneous systems. One algorithm uses the tasks’ b-level
to determine the priority (i.e., scheduling order) of the tasks,
which is called HEFT_B in this paper. The bigger the b-level, the higher the priority. In HEFT_B, a task is allocated to a processor that gives the earliest start time. The other algorithm uses the sum of t-level and b-level to determine the tasks' priority, and is called HEFT_T in this paper. HEFT_T tries to allocate all tasks on the
critical path to one processor. HEFT_T allocates a task not on
the critical path to the processor that minimizes the task’s start
time. Extensive experiments have been conducted in [29] and
the results demonstrate that HEFT_B and HEFT_T outperform (in
terms of makespan) other representative heuristic algorithms in
heterogeneous systems, such as DLS, MH and LMT.
Job Shop Scheduling Problem (JSSP) [38,21,10,11,22,25] originates from industry, investigating how to allocate work operations
to machines so as to maximize the productivity (e.g., by minimizing the makespan of the work flow). JSSP can be abstracted as DAG
scheduling. Therefore, the heuristic methods developed for DAG
scheduling can be applied to JSSP in industry.
Clustering algorithms [1,4] are another type of heuristic algorithm. Clustering algorithms assume that there is an unlimited number of computing nodes available for task execution, and will use as many computing nodes as possible in order to reduce the makespan of the schedule. If the number of computing nodes used by a schedule exceeds the number of computing nodes available, a mapping process is required to merge the tasks in a candidate schedule onto the available computing nodes.
The duplication-based scheduling heuristic [30,2] attempts to
reduce communication delays by executing the key tasks on
more than one computing node. Duplication-based scheduling
essentially aims to further improve the performance of list
scheduling. The disadvantage of this method is that the same task
may have to run several times on different computing nodes and
therefore cause more resource consumption.
2.2. Meta-heuristic scheduling
Meta-heuristic algorithms are also called guided-random-search-based algorithms. Contrary to heuristic-based algorithms, the meta-heuristic algorithms incorporate a combinatorial process
when searching for solutions. The meta-heuristic algorithms
typically require sufficient sampling of candidate solutions in the
search space and have shown robust performance on a variety of
scheduling problems. Many meta-heuristic algorithms have been
proposed to successfully solve the task scheduling problem.
GA [12,27,33,6] is to date the most popular technique for meta-heuristic scheduling. [12] proposed to apply the GA technique to DAG scheduling. In [12], a scheduling solution is encoded as multiple one-dimensional strings. Each string represents an ordered
list of tasks scheduled to a processor. In the crossover operation, a
crossover point is randomly selected in each string of two parent
solutions, and the head portion of one parent merges with the
tail portion of the other parent. Mutation is applied by randomly
exchanging two tasks in two strings. Makespan is used as the
fitness function to judge the quality of the scheduling solution.
There are also other works applying GA to scheduling, such as [27,28,23]. In those works, however, GA is developed to tackle the
special features in the applications or in the computing systems.
For example, [23] applies GA to address the dynamic feature in
Grid environments; [27] tackles the security considerations in
Grids; [28] designs a GA method to schedule a DAG in which
each task is a parallel task. Our work is concerned with
conventional heterogeneous computing systems, and a task in a
DAG is a serial task. Therefore, the workload and system model
in [12] is the closest one to the model in this paper.
Other meta-heuristic scheduling methods include PSO [3,19],
ACO [31,8], SA [15], and TS [24,35]. Some research works have also
been developed to use agent intelligence to tackle task processing
on parallel computing systems [32].
Chemical Reaction Optimization (CRO) was proposed very recently [18,37,5,32,20]. It mimics the interactions of molecules
in chemical reactions. In CRO, solutions are encoded, and then
sequences of operations, which loosely simulate the molecule
behaviors in reaction, are performed over these solutions. Gradually, good solutions will emerge. CRO has already shown its power
in solving problems like Quadratic Assignment Problem (QAP),
Resource-Constrained Project Scheduling Problem (RCPSP), Channel
Assignment Problem (CAP) [18,32], and Task Scheduling in Grid Computing (TSGC) [38]. To the best of our knowledge, CRO has not yet been
applied to DAG scheduling in the literature.
3. Background of CRO
CRO mimics the process of a chemical reaction where molecules
undergo a sequence of reactions between each other or with
the environment in a closed container. A molecule has a unique
structure of atoms, which represents a solution of the optimization
problem. Potential energy (PE) and Kinetic energy (KE) are two
key properties attached to a molecule structure. The former
corresponds to the fitness value of the solution and the fitness of
a solution is judged by the PE energy of the molecule, while the
latter is used to control the acceptance of new solutions with worse
fitness. Suppose ω and f are a molecular structure (solution) and the fitness function, respectively; then PEω = f(ω). During the reactions, a molecular structure ω attempts to change to another structure ω′ if PEω′ ≤ PEω, or PEω′ ≤ PEω + KEω. KE is a non-negative number and it helps the molecule escape from local
optima. A central energy buffer is also implemented in CRO. The
energy stored in the buffer can be regarded as the energy in the
environment (other than the molecules) in the closed container.
The energy may also flow between the molecules and the energy
buffer. CRO requires conservation of energy. The total amount of
energy in the system is determined by PE and the initial KE of all
molecules in the initial population, and the initial energy stored
in the buffer. Conservation of energy means that the total energy
remains constant during the process of CRO.
During the CRO process, the following four types of elementary
reactions are likely to happen.
• On-wall ineffective collision: this reaction is a uni-molecule reaction, whose reactant involves only one molecule. When a molecule ω collides onto the wall of the closed container, it is allowed to change to another molecule ω′ if inequality (1), which relates their energy values, holds:

PEω + KEω ≥ PEω′.  (1)

After the collision, the KE energy is re-distributed: a certain portion of the KE of the new molecule is withdrawn to the central energy buffer (i.e., the environment). The KE energy of the new molecule is calculated by Eq. (2), where a is a number randomly selected from the range [KE_LossRate, 1], and KE_LossRate is the loss rate of the KE energy, a system parameter set during the initialization stage of CRO (a code sketch of this energy bookkeeping follows this list):

KEω′ = (PEω − PEω′ + KEω) × a.  (2)
• Decomposition: this reaction is also a uni-molecule reaction. A molecule ω can decompose into two new molecules, ω1′ and ω2′, if inequality (3) holds, where buffer denotes the energy stored in the central buffer, which represents the energy interactions between molecules and the environment:

PEω + KEω + buffer ≥ PEω1′ + PEω2′.  (3)

Let Edec = (PEω + KEω) − (PEω1′ + PEω2′). After decomposing, the KE energies of ω1′ and ω2′ are calculated by Eqs. (4) and (5), where δ1, δ2, δ3, δ4 are random numbers generated in [0, 1]:

KEω1′ ← (Edec + buffer) × δ1 × δ2  (4)
KEω2′ ← (Edec + buffer − KEω1′) × δ3 × δ4.  (5)

The energy in the buffer is updated by Eq. (6):

buffer ← Edec + buffer − (KEω1′ + KEω2′).  (6)
• Inter-molecular ineffective collision: this reaction is an inter-molecule reaction, whose reactants involve two molecules. When two molecules, ω1 and ω2, collide into each other, they can change to two new molecules, ω1′ and ω2′, if inequality (7) holds:

PEω1 + PEω2 + KEω1 + KEω2 ≥ PEω1′ + PEω2′.  (7)

Einter denotes the spare energy after the inter-molecule collision, which can be calculated by Eq. (8). The KEs of the new molecules share the spare energy; therefore, the KEs of ω1′ and ω2′ are calculated by Eqs. (9) and (10), respectively, where δ1 is a random number generated from [0, 1]:

Einter = (PEω1 + PEω2 + KEω1 + KEω2) − (PEω1′ + PEω2′)  (8)
KEω1′ ← Einter × δ1  (9)
KEω2′ ← Einter × (1 − δ1).  (10)
• Synthesis: this reaction is also an inter-molecule reaction. When two molecules, ω1 and ω2, collide into each other, they can be combined to generate a new molecule, ω′, if inequality (11) holds:

PEω1 + PEω2 + KEω1 + KEω2 ≥ PEω′.  (11)
Table 1
Definitions of notations.

T: The set of n weighted tasks in the application
Ti: The ith task in the application
P: The set of m heterogeneous computing nodes
Pk: The kth computing node in the system
Tentry: The starting subtask without any predecessors
Texit: The final subtask with no successors
succ(Ti): The set of the immediate successors of the subtask Ti
pred(Ti): The set of the immediate predecessors of the subtask Ti
Wd(Ti): The computational data of subtask Ti
S(Ti, Pk): The estimated execution speed of subtask Ti on computing node Pk
W(Ti, Pk): The computational cost of task Ti on the computing node Pk
W̄(Ti): The average computational cost of subtask Ti
Cd(Ti, Tj): The amount of communication between subtask Ti and subtask Tj
Cs(Pk): The communication startup cost of computing node Pk
B: The two-dimensional matrix of communication bandwidths between computing nodes
B(Pk, Pl): The communication bandwidth between computing nodes Pk and Pl
C(Ti, Tj): The communication cost from the subtask Ti (scheduled on Pk) to the subtask Tj (scheduled on Pl)
C̄(Ti, Tj): The average communication cost of the edge(Ti, Tj)
EST(Ti, Pk): The earliest start time of the subtask Ti on the computing node Pk
EFT(Ti, Pk): The earliest finish time of the subtask Ti on the computing node Pk
Tavail(Pk): The earliest time at which the computing node Pk is ready for task execution
Rankb(Ti): The upward ranking of subtask Ti
Rankt(Ti): The downward ranking of subtask Ti
CCR: The communication to computation ratio, i.e., the ratio of the average communication cost to the average computation cost
The KE energy of ω′ is calculated by Eq. (12):

KEω′ = PEω1 + KEω1 + PEω2 + KEω2 − PEω′.  (12)
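To make the energy rules concrete, the following is a minimal Python sketch of the on-wall bookkeeping of Eqs. (1) and (2). The function name, the float-based representation and the KE_LOSS_RATE constant are our own illustrative choices, not the paper's code.

import random

KE_LOSS_RATE = 0.2  # assumed value for KE_LossRate (Section 6 suggests 0.2)

def on_wall_energy_update(pe_old, ke_old, pe_new, buffer):
    """Apply the acceptance test of Eq. (1) and the KE split of Eq. (2).

    Returns (accepted, ke_new, buffer): if Eq. (1) holds, the new structure
    keeps a random share a in [KE_LossRate, 1] of the surplus energy as its
    KE, and the remainder is withdrawn to the central energy buffer.
    """
    if pe_old + ke_old < pe_new:        # Eq. (1) violated: keep old structure
        return False, ke_old, buffer
    surplus = pe_old - pe_new + ke_old
    a = random.uniform(KE_LOSS_RATE, 1.0)
    ke_new = surplus * a                # Eq. (2)
    buffer += surplus * (1.0 - a)       # withdrawn portion of the KE
    return True, ke_new, buffer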
The typical execution flow of CRO is as follows. CRO is first initialized to set some system parameters, such as PopSize (the size of the population of molecules), KE_LossRate, InitialKE (the initial energy associated with molecules), buffer (the initial energy in the energy buffer), and MoleColl (used later in the process to determine whether to perform a uni-molecular or an inter-molecular operation). Then the process enters a loop. In each iteration, the process first decides, based on a certain probability, whether to perform a uni-molecular or an inter-molecular operation. It is decided in the following way. A random number, b, is generated in the interval [0, 1]. If b is bigger than the value of MoleColl set in the initialization stage, a uni-molecular operation will be performed; otherwise, an inter-molecular operation will take place. If it is a uni-molecular operation, CRO uses a parameter α to guide the further selection between on-wall collision and decomposition. α represents the maximum number of collisions allowed with no improved solution being found by a molecule. When a molecule undergoes a collision, the parameter NumHit is updated to record the total number of collisions which the molecule has carried out. If a molecule has undergone a number of collisions larger than α, a decomposition is triggered. Similarly, if the process decides to perform an inter-molecular operation, CRO uses a parameter β to further decide whether an inter-molecular collision or a synthesis operation should be performed. β specifies the least amount of KE which a molecule should possess. For two molecules ω1 and ω2, synthesis is triggered when both KEω1 and KEω2 are less than β; otherwise, an inter-molecular ineffective collision occurs. The iteration repeats until the stopping criterion is satisfied (e.g., the best solution does not change for a certain number of consecutive iterations). After the search loop stops, the best solution is the molecule with the lowest PE.
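The flow described above can be condensed into a short sketch. This is our own rendering rather than the paper's implementation: the four reaction operators are passed in as functions that update the population in place, and each molecule is assumed to expose pe, ke, num_hit and min_hit attributes.

import random

def cro_search(pop, on_wall, decomp, intermole, synth,
               mole_coll=0.2, alpha=500, beta=10, max_stall=10_000):
    """Skeleton of the CRO loop: choose uni- vs inter-molecular operations
    by MoleColl, refine the choice with alpha/beta, stop after max_stall
    iterations without improvement, and return the lowest-PE molecule."""
    best = min(m.pe for m in pop)
    stall = 0
    while stall < max_stall:
        if random.random() > mole_coll:        # uni-molecular branch
            w = random.choice(pop)
            if w.num_hit - w.min_hit > alpha:  # too long with no improvement
                decomp(w, pop)                 # decomposition
            else:
                on_wall(w, pop)                # on-wall ineffective collision
        else:                                  # inter-molecular branch
            w1, w2 = random.sample(pop, 2)
            if w1.ke < beta and w2.ke < beta:  # both molecules low on KE
                synth(w1, w2, pop)             # synthesis
            else:
                intermole(w1, w2, pop)         # inter-molecular collision
        cur = min(m.pe for m in pop)
        stall = 0 if cur < best else stall + 1
        best = min(best, cur)
    return min(pop, key=lambda m: m.pe)        # best solution: lowest PE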
4. Models

This section discusses the system, application and task scheduling model assumed in this work. The definition of the notations can be found in Table 1.
4.1. System model
In this paper, the target system consists of a set P of m heterogeneous computing nodes that are fully interconnected with a high-speed network. Each subtask in a DAG task can only be executed
on one heterogeneous computing node. The communication time
between two dependent subtasks should be taken into account if
they are assigned to different heterogeneous computing nodes.
We also assume a static computing system model in which
the dependent relations and the execution times of subtasks are
known a priori and do not change over the course of the scheduling
and subtask execution. In addition, all heterogeneous computing
nodes are fully available to the computation on the time slots they
are assigned to.
4.2. Application model
In this work, an application is represented by a DAG, with the
graph vertexes representing subtasks and edges between vertexes
representing execution precedence between subtasks. If there is an
edge from Ti to Tj in the DAG graph, then Ti is called the predecessor
of Tj , and Tj is the successor of Ti . pred(Ti ) and succ(Ti ) denote
the set of predecessors and successors of task Ti , respectively.
There is an entry subtask and an exit subtask in a DAG. The entry
subtask Tentry is the starting subtask of the application without
any predecessors, while the exit subtask Texit is the final subtask
with no successors. A weight is associated with each vertex and
edge. The vertex weight, denoted as Wd (Ti ), represents the amount
Fig. 1. (a) A simple DAG task model containing 8 subtasks. (b) A fully connected parallel system with 3 heterogeneous computing nodes.
of data to be processed in the subtask Ti , while the edge weight,
denoted as Cd (Ti , Tj ), represents the amount of communications
between subtask Ti and subtask Tj . The DAG topology of an
exemplar application model and system model is shown in Fig. 1(a)
and (b), respectively.
The heterogeneity model of a computing system can be divided
into two categories: Fix Heterogeneity Model (FHM) and Mixed
Heterogeneity Model (MHM). In a FHM computing system, a
computing node executes the tasks at the same speed, regardless
of the type of the tasks. On the contrary, in a MHM computing
system, how fast a computing node executes a task depends on
how well the heterogeneous computing node architecture matches
the task requirements and features. The work in this paper assumes
MHM computing systems. The execution speeds of the computing nodes in the heterogeneous computing system are represented by a two-dimensional matrix, S, in which an element S(Ti, Pk) represents the speed at which computing node Pk executes subtask Ti. The computation cost of subtask Ti running on computing node Pk, denoted as W(Ti, Pk), can be calculated by Eq. (13). Assume the DAG has the topology in Fig. 1(a) and there are 3 heterogeneous computing nodes in the computing system as in Fig. 1(b). An example of the computing node heterogeneity and the computation costs of the subtasks is shown in Table 2. Note that there are two numbers in each node of the DAG in Fig. 1(a): the number at the top is the task id and the one at the bottom is the average computation cost as calculated in Table 2:

W(Ti, Pk) = Wd(Ti) / S(Ti, Pk).  (13)
The average computation cost of subtask Ti, denoted as W̄(Ti), can be calculated by Eq. (14):

W̄(Ti) = (1/m) × Σ_{k=1..m} W(Ti, Pk).  (14)
The communication bandwidths between heterogeneous computing nodes are represented by a two-dimensional matrix B. The communication startup costs of the computing nodes are represented by an array, Cs, in which the element Cs(Pk) is the startup cost of computing node Pk. The communication cost C(Ti, Tj) of edge(Ti, Tj), which is the time spent in transferring data from subtask Ti (scheduled on Pk) to subtask Tj (scheduled on Pl), can be calculated by Eq. (15). When Ti and Tj are scheduled on the same computing node, the communication cost is regarded as 0:

C(Ti, Tj) = Cs(Pk) + Cd(Ti, Tj) / B(Pk, Pl).  (15)

C̄(Ti, Tj) is the average communication cost of edge(Ti, Tj), defined as shown in Eq. (16), where N is the number of (Pk, Pl) node pairs over which the communication between subtasks Ti and Tj is averaged:

C̄(Ti, Tj) = (1/N) × Σ_{Ti on Pk, Tj on Pl} C(Ti, Tj).  (16)

In this paper, a communication cost is only required when two subtasks are assigned to different heterogeneous computing nodes. In other words, the communication cost can be ignored when the subtasks are assigned to the same computing node. It is assumed that the inter-computing-node communications are performed at the same speed (i.e., with the same bandwidths) on all links, and we assume B(Pk, Pl) = 1 and Cs(Pk) = 0 to simplify our DAG task scheduling model.
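As a worked illustration of Eqs. (13)-(15) under the simplifying assumptions just stated (B(Pk, Pl) = 1 and Cs(Pk) = 0), the cost model fits in a few lines of Python. The dict/list layout and names below are ours, not the paper's.

def computation_cost(Wd, S, i, k):
    """W(Ti, Pk) = Wd(Ti) / S(Ti, Pk)  -- Eq. (13)."""
    return Wd[i] / S[i][k]

def average_computation_cost(Wd, S, i, m):
    """W(Ti, Pk) averaged over the m computing nodes -- Eq. (14)."""
    return sum(computation_cost(Wd, S, i, k) for k in range(m)) / m

def communication_cost(Cd, i, j, k, l, Cs=0.0, B=1.0):
    """C(Ti, Tj) for Ti on Pk and Tj on Pl -- Eq. (15); zero on the same node."""
    return 0.0 if k == l else Cs + Cd[i][j] / B

For example, with Wd(T0) = 11 and S(T0, P1) = 0.85 as in Table 2, computation_cost gives 11 / 0.85 ≈ 13, the W(T0, P1) entry of the table.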
The earliest start time of the subtask Ti on computing node Pk is denoted as EST(Ti, Pk), which can be calculated using Eq. (17), where Pm is the computing node running the predecessor Tj:

EST(Ti, Pk) =
  0,  if Ti = Tentry;
  max_{Tj ∈ pred(Ti)} EFT(Tj, Pm),  if Pk = Pm;
  max_{Tj ∈ pred(Ti)} (EFT(Tj, Pm) + C(Tj, Ti)),  if Pk ≠ Pm.  (17)

EFT(Ti, Pk) = EST(Ti, Pk) + W(Ti, Pk).  (18)

The earliest finish time of the task Ti on computing node Pk is denoted as EFT(Ti, Pk), which can be calculated using Eq. (18). EST(Ti) and EFT(Ti) denote the earliest start time (EST) and the earliest finish time (EFT) of a subtask Ti over all heterogeneous computing nodes, which can be calculated by Eqs. (19) and (20), respectively. Tavail(Pk) denotes the earliest time at which the computing node Pk is ready for task execution:

EST(Ti) = max_{1≤k≤m} (EFT(Ti, Pk), Tavail(Pk))  (19)
EFT(Ti) = min_{1≤k≤m} EFT(Ti, Pk).  (20)
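A compact rendering of Eqs. (17) and (18) follows. It assumes the predecessors of Ti have already been scheduled so that their EFT values are cached; all structure names (pred, assign, eft_cache, comm) are illustrative rather than the paper's API.

def earliest_finish(i, k, W, pred, assign, eft_cache, comm, entry=0):
    """EFT(Ti, Pk) = EST(Ti, Pk) + W(Ti, Pk) -- Eqs. (17) and (18).

    The entry task can start at time 0; any other task waits for every
    predecessor's finish time, plus C(Tj, Ti) when Tj ran on another node.
    """
    if i == entry:
        est = 0.0
    else:
        est = max(eft_cache[j] + (comm(j, i) if assign[j] != k else 0.0)
                  for j in pred[i])
    return est + W[i][k]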
In this study, the task scheduling problem is the process of mapping a set T of n subtasks in a DAG to a set P of m computing nodes, aiming at minimizing the makespan. The subtask with the highest priority is selected for computing node allocation, and a computing node which can ensure the earliest finish time is selected to run the subtask. In many task scheduling algorithms, a subtask which has a higher upward rank is more important and is therefore preferred for computing node allocation over other subtasks. Intuitively, the upward rank of a subtask reflects the average remaining cost to finish all subtasks after that subtask starts up. The upward-ranking of subtask Ti, denoted as Rankb(Ti), can be computed by Eq. (21):

Rankb(Ti) = W̄(Ti) + max_{Tj ∈ succ(Ti)} (C̄(Ti, Tj) + Rankb(Tj)).  (21)

The downward-ranking of subtask Ti, denoted as Rankt(Ti), can be calculated by Eq. (22):

Rankt(Ti) = max_{Tj ∈ pred(Ti)} (W̄(Tj) + C̄(Tj, Ti) + Rankt(Tj)).  (22)

where succ(Ti) is the set of the immediate successors of the subtask Ti and pred(Ti) is the set of the predecessors of the subtask Ti.

Table 3 shows the upward-ranking and downward-ranking of each subtask in the DAG in Fig. 1(a). Note that both computation and communication costs are the costs averaged over all vertexes and links. The communication to computation ratio (CCR) can be used to indicate whether a task graph is communication-intensive or computation-intensive. For a given task graph, it is computed by the average communication cost divided by the average computation cost on a target computing system. The computation can be formulated as in Eq. (23):

CCR = Σ_{edge(Ti,Tj) ∈ E} C̄(Ti, Tj) / Σ_{Ti ∈ T} W̄(Ti).  (23)

Table 2
Computing node heterogeneity and computation costs.

Ti    Speed                  Cost               Ave. cost
      P0     P1     P2       P0    P1    P2     W̄(Ti)
T0    1.00   0.85   1.22     11    13     9     11.00
T1    1.20   0.80   1.09     10    15    11     12.00
T2    1.33   1.00   0.86      9    12    14     11.67
T3    1.18   0.81   1.30     11    16    10     12.33
T4    1.00   1.37   0.79     15    11    19     15.00
T5    0.75   1.00   1.79     12     9     5      8.67
T6    1.30   0.93   1.00     10    14    13     12.33
T7    1.09   0.80   1.20     11    15    10     12.00

Table 3
Task priorities.

Ti    Rankb     Rankt
0     101.33     0.00
1      66.67    22.00
2      63.33    28.00
3      73.00    25.00
4      79.33    22.00
5      41.67    56.33
6      37.33    64.00
7      12.00    89.33

Fig. 2. A schedule for the simple DAG task graph in Fig. 1(a) on 3 computing nodes in Fig. 1(b) using the HEFT_B task scheduling algorithm. The makespan of the schedule is 69.

Fig. 3. A schedule for the simple DAG task graph in Fig. 1(a) on 3 computing nodes in Fig. 1(b) using the HEFT_T task scheduling algorithm. The makespan of the schedule is 80.

Fig. 4. A schedule for the simple DAG task graph in Fig. 1(a) on 3 computing nodes in Fig. 1(b) using the DMSCRO task scheduling algorithm. The makespan of the schedule is 66.

Figs. 2–4 show three exemplar solutions of scheduling the DAG in Fig. 1(a) to the computing system in Fig. 1(b), using the HEFT_B algorithm, the HEFT_T algorithm and the DMSCRO task scheduling algorithm, respectively. The task scheduling priority queue in HEFT_B is generated by upward-ranking [29,35].
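As an illustration of Eq. (21), the upward ranks can be computed by a memoised recursion over the DAG. This sketch uses our own data layout (succ as adjacency lists, averaged costs in dicts keyed by task or edge), not the paper's code; the Rankb column of Table 3 provides a check for the DAG of Fig. 1(a).

from functools import lru_cache

def upward_ranks(succ, w_avg, c_avg, exit_task):
    """Rank_b(Ti) = W_avg(Ti) + max over Tj in succ(Ti) of
    (C_avg(Ti, Tj) + Rank_b(Tj)), with Rank_b(Texit) = W_avg(Texit)
    as the base case (Eq. (21))."""
    @lru_cache(maxsize=None)
    def rank(i):
        if i == exit_task:
            return w_avg[i]
        return w_avg[i] + max(c_avg[i, j] + rank(j) for j in succ[i])
    return {i: rank(i) for i in w_avg}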
5. Design of DMSCRO
The concepts in DAG scheduling can be mapped to those in CRO.
DAG scheduling involves making decisions about (1) scheduling
order of tasks (i.e., the order of the tasks in the waiting queue of
the scheduler) and (2) resource allocation (i.e., which computing
node is used to run the task). The quality of a scheduling solution
is determined by its makespan. The shorter the makespan, the better the
scheduling solution. In CRO, there are the concepts of molecule,
atoms, molecular structure and energy of a molecule. A molecular
structure represents the positions of atoms in a molecule. A
molecule has a unique molecular structure. In CRO, the molecules
react through chemical reactions, aiming to transform to the
molecular structures with lower energy (i.e., with more stable
states). A scheduling solution (scheduling order and resource
allocation) in DAG scheduling corresponds to a molecule in
CRO. A scheduling solution is encoded as a one dimensional
integer array in this paper. The first half of the integer array
represents the scheduling order and the second half represents the
resource allocation. The integer array corresponds to a molecule.
An element in the array corresponds to an atom in a molecule.
Fig. 5. Illustration of the big molecule which consists of two small molecules.
The order of the elements in the array represents the molecular
structure. This paper designs the operations on the encoded
scheduling solutions (integer arrays) to change the order of the
elements in the arrays. These designed operations correspond
to the chemical reactions that change the molecular structures.
Different orders of the elements in the arrays represent different
scheduling solutions, and we can calculate from an integer array
the corresponding makespan of the scheduling solution. The
makespan of a scheduling solution corresponds to the energy of a
molecule.
In this section, we first present the encoding of scheduling
solutions and the fitness function used in the DMSCRO, and then
present the design of four elementary chemical reaction operations
in DMSCRO. Finally, we outline the execution flow of the DMSCRO
scheme and discuss a few important properties in DMSCRO.
5.1. Molecular structure and fitness function
In this paper, each big molecule ω, which consists of two small molecules ϕ and ψ, represents a solution of the task scheduling problem. The molecular structure is encoded as a permutation of integers, with each integer representing a subtask Ti or a computing node Pj.

The two molecular structures can be denoted as the integer queues shown in Fig. 5, where ϕ = {T1, T2, ..., Ti, ..., Tn} represents the queue of subtasks. An atom Ti in the molecule ϕ represents subtask i in the DAG, and the atoms are linked according to their execution priority order. Moreover, the priority order of the subtasks in the molecule ϕ should be a valid topological order, in which the start vertex is placed at the beginning of the molecule and the exit vertex at the end. For a queue of integers to be feasible, all the subtasks in the DAG should be scheduled and the schedule should satisfy the precedence relations. We take advantage of the upward-ranking heuristic commonly used by traditional list scheduling approaches to estimate the priority of each subtask, as shown in Eq. (21). ψ = {P0, P1, ..., Pk, ..., Pm} represents the set of heterogeneous computing nodes, and the atom at the corresponding position in molecule ψ is the computing node Pj assigned to execute the subtask Ti.

The initial solution generator is used to generate the initial solutions for DMSCRO to manipulate. The first part, ϕ, of the first big molecule ω is generated by upward-ranking (Rankb). The rest of the big molecules in the initial population are generated by random perturbations of the first big molecule ω. A detailed description is given in Algorithm 1.
Potential energy (PE) is defined as the objective (fitness) function value of the corresponding solution represented by ω. In this paper, the overall schedule length of the entire set of subtasks, namely the makespan, is the largest finish time among all tasks, which is equivalent to the actual finish time of the exit vertex (subtask) Texit. For the task scheduling problem, the goal is to obtain a task assignment that minimizes the makespan while ensuring that the precedence of the tasks is not violated. Hence, the fitness function value is defined as

Fitness(ω) = PEω = makespan = EFT(Texit).  (24)

Algorithm 2 presents how to calculate the value of the fitness function.
Algorithm 1 IniMolecule
Input: The first molecule ω;
Output: The initial solutions;
1: MoleN ← 1;
2: while MoleN < PopSize do
3:   For each atom Ti in molecule ϕ, find the first successor succ(i) from i to the end;
4:   For each atom Tj, j ∈ (i, succ(i)), find the first predecessor pred(j) from succ(i) back to the beginning of molecule ϕ;
5:   if pred(j) < i then
6:     Interchange the positions of atom Ti and atom Tj in molecule ϕ;
7:   end if
8:   For each atom Pk in molecule ψ, change it randomly;
9:   Generate a new molecule ω′;
10:  MoleN ← MoleN + 1;
11: end while
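The random perturbation of Algorithm 1 can be rendered roughly as follows. The sketch swaps a pair of adjacent subtasks only when no precedence edge joins them, which is exactly the condition under which an adjacent swap preserves a valid topological order, and it re-draws one node assignment at random; the data layout is our own.

import random

def perturb(phi, psi, pred, num_nodes):
    """One perturbation step: an order-preserving adjacent swap in the
    subtask molecule phi plus a random re-assignment in psi."""
    phi, psi = phi[:], psi[:]                  # work on copies
    i = random.randrange(1, len(phi) - 2)      # never move entry/exit atoms
    if phi[i] not in pred[phi[i + 1]]:         # no edge phi[i] -> phi[i+1]
        phi[i], phi[i + 1] = phi[i + 1], phi[i]
    k = random.randrange(len(psi))
    psi[k] = random.randrange(num_nodes)       # random computing node
    return phi, psi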
Algorithm 2 Calculating the Fitness value of a molecule
Input: The molecule ω of the task scheduling queue;
Output: The makespan of ω (the task scheduling queue);
1: Add all subtasks (atoms) Ti of the molecule ω to ScheduleQ according to their priority;
2: while ScheduleQ ≠ ∅ do
3:   Select an atom Ti from ScheduleQ whose indegree = 0;
4:   Compute EFT(Ti, Pk) using the scheduling policy;
5:   for all succ(Ti) of the atom Ti do
6:     indegree(succ(Ti)) ← indegree(succ(Ti)) − 1;
7:   end for
8:   Remove atom Ti from ScheduleQ;
9: end while
10: return Fitness(ω) = PEω = makespan = EFT(Texit);
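Algorithm 2 amounts to one pass over ϕ in topological order. Below is a minimal sketch under the same illustrative layout as before (W indexed by task and node, comm(j, t) returning the communication cost); node_ready plays the role of Tavail(Pk).

def makespan(phi, psi, W, pred, comm, exit_task):
    """Fitness(omega) = PE = makespan = EFT(Texit)  -- Eq. (24)."""
    node_ready = {}                    # Tavail(Pk) for every node used so far
    eft = {}                           # cached EFT values, filled in phi order
    assign = dict(zip(phi, psi))       # subtask -> computing node
    for t in phi:                      # phi is a valid topological order
        k = assign[t]
        est = node_ready.get(k, 0.0)   # node availability
        for j in pred[t]:              # data-ready time per Eq. (17)
            est = max(est, eft[j] + (comm(j, t) if assign[j] != k else 0.0))
        eft[t] = est + W[t][k]         # Eq. (18)
        node_ready[k] = eft[t]
    return eft[exit_task]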
5.2. Elementary chemical reaction operations
This subsection presents four elementary chemical reaction operations designed in DMSCRO, including on-wall collision, decomposition, inter-molecular collision and synthesis.
5.2.1. On-wall ineffective collision
In this paper, the operation OnWall(ω) is used to generate a new solution ω′ from a given big molecule ω. The following steps are performed in the operation: (1) the operation randomly chooses an atom (subtask) Ti in the small molecule ϕ, and then finds the first predecessor of Ti, say Tj, scanning from the selected position to the beginning of molecule ϕ. (2) A random number k ∈ [j + 1, i − 1] is generated, the atom Ti is stored in a temporary variable Temp, and then, starting from position i − 1, the operation shifts each atom one place to the right until position k. (3) The operation moves the atom in Temp to position k. The rest of the atoms in ϕ′ are the same as those in ϕ. For the small molecule ψ, the operation adjusts the atoms at the same positions as in ϕ and generates a molecule ψ′. In the end, the operation generates a new molecule ω′ consisting of ϕ′ and ψ′. The operation is illustrated in Figs. 6 and 7, and its detailed execution is presented in Algorithm 3.
Fig. 6. Illustration of the molecular structure change for on-wall ineffective collision.

Fig. 7. Illustration of the subtask-to-computing-node mapping for on-wall ineffective collision.
Algorithm 3 OnWall(ω) function
Input: Molecule ω;
Output: New molecule ω′;
1: Choose randomly an atom Ti in the small molecule ϕ;
2: Find the first predecessor Tj of Ti, scanning from i to the beginning of molecule ϕ;
3: Generate a random number k ∈ [j + 1, i − 1];
4: Temp ← Ti;
5: for m = i downto k + 1 do
6:   Tm ← Tm−1;
7: end for
8: Tk ← Temp;
9: Generate a new small molecule ϕ′;
10: Generate δ ∈ [0, 1];
11: if δ < 0.5 then
12:   Temp ← Pi;
13:   for m = i downto k + 1 do
14:     Pm ← Pm−1;
15:   end for
16:   Pk ← Temp;
17:   Generate a new small molecule ψ′;
18: end if
19: Generate a new molecule ω′ consisting of ϕ′ and ψ′;
20: return the new molecule ω′;
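Operationally, steps (1)-(3) of OnWall are a 'remove and re-insert to the left' move on ϕ. A sketch under our own list-based layout is given below; the companion adjustment of ψ is analogous and omitted. Every predecessor of the moved subtask stays to the left of the insertion point, so the result remains a valid topological order.

import random

def on_wall_move(phi, pred):
    """Pick a random subtask Ti, find its nearest predecessor Tj to the
    left, and re-insert Ti at a random position k in [j + 1, i - 1]
    (cf. Algorithm 3)."""
    phi = phi[:]
    i = random.randrange(1, len(phi) - 1)      # keep entry/exit fixed
    j = max((p for p in range(i) if phi[p] in pred[phi[i]]), default=0)
    if j + 1 > i - 1:                          # no legal slot to move to
        return phi
    k = random.randint(j + 1, i - 1)
    phi.insert(k, phi.pop(i))                  # shift right, place at k
    return phi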
5.2.2. Decomposition

In this paper, the operation Decomp(ω) is used to generate two new solutions ω1′ and ω2′ from a given solution ω. The operation first uses the steps of the on-wall collision to generate ω1′, and then generates the other new solution ω2′ in a similar fashion. The only difference is that, in Step 1, the position of the first successor of Ti is used as the starting position instead of a randomly selected position. Note that, because of decomposition (and synthesis, discussed later), the number of solutions in the CRO process may vary during the reactions, which is a feature of CRO that differs from Genetic Algorithms. These operations are illustrated in Figs. 8 and 9, and the detailed execution is presented in Algorithm 4.
Fig. 8. Illustration of the molecular structure change for decomposition.

Fig. 9. Illustration of the task-to-computing-node mapping for decomposition.
Algorithm 4 Decomp(ω) function
Input: Molecule ω;
Output: Two new molecules ω1′ and ω2′;
1: Choose randomly an atom Ti in the small molecule ϕ;
2: Find the first predecessor Tj of Ti, scanning from i to the beginning;
3: Temp ← Ti;
4: for k = i downto j + 2 do
5:   Tk ← Tk−1;
6: end for
7: Tj+1 ← Temp;
8: Generate a new small molecule ϕ′;
9: Change some atoms accordingly in the molecule ψ and generate a new small molecule ψ′;
10: Generate a new molecule ω1′ consisting of ϕ′ and ψ′;
11: Choose randomly an atom Ti in the small molecule ϕ;
12: Find the first predecessor Tj of Ti, scanning from i to the beginning;
13: Temp ← Ti;
14: for k = i downto j + 2 do
15:   Tk ← Tk−1;
16: end for
17: Tj+1 ← Temp;
18: Generate a new small molecule ϕ′;
19: Change some atoms accordingly in the molecule ψ and generate a new small molecule ψ′;
20: Generate a new molecule ω2′ consisting of ϕ′ and ψ′;
21: return the new molecules ω1′ and ω2′;
5.2.3. Inter-molecular ineffective collision

In this paper, the operation Intermole(ω1, ω2) is used to generate two new solutions ω1′ and ω2′ from the given solutions ω1 and ω2. The new solutions are generated in the following steps. (1) An integer i is randomly selected between 1 and n. (2) ϕ1 and ϕ2 are cut off at position i into left and right segments. (3) The left segments of ϕ1′ and ϕ2′ are inherited from the left segments of ϕ1 and ϕ2, respectively. (4) Each subtask in the right segment of ϕ1′ comes from the subtasks in ϕ2 that do not appear in the left segment of ϕ1′, and vice versa for ϕ2′. The atoms at the corresponding positions in ψ1 and ψ2 are adjusted in the same fashion to generate ψ1′ and ψ2′. As a result, the operation generates ω1′ and ω2′ from ω1 and ω2. These operations are illustrated in Figs. 10 and 11, and the detailed execution is outlined in Algorithm 5.

Fig. 10. Illustration of molecular structure change for inter-molecular ineffective collision.

Fig. 11. Illustration of the task-to-computing-node mapping for inter-molecular ineffective collision.
Algorithm 5 Intermole(ω1, ω2) function
Input: Two molecules ω1 and ω2;
Output: Two new molecules ω1′ and ω2′;
1: Choose randomly a cut-off point i ∈ [1, n];
2: Cut the small molecules ϕ1 and ϕ2 into left and right segments;
3: Cut the small molecules ψ1 and ψ2 into left and right segments in the same way;
4: Inherit the left segment of ϕ1′ from the left segment of ϕ1;
5: Copy the subtasks in ϕ2 that do not appear in the left segment of ϕ1′ to the right segment of ϕ1′;
6: Inherit the left segment of ψ1′ from the left segment of ψ1;
7: Copy the subtasks in ψ2 that do not appear in the left segment of ψ1′ to the right segment of ψ1′;
8: Generate a new molecule ω1′ which consists of ϕ1′ and ψ1′;
9: Inherit the left segment of ϕ2′ from the left segment of ϕ2;
10: Copy the subtasks in ϕ1 that do not appear in the left segment of ϕ2′ to the right segment of ϕ2′;
11: Inherit the left segment of ψ2′ from the left segment of ψ2;
12: Copy the subtasks in ψ1 that do not appear in the left segment of ψ2′ to the right segment of ψ2′;
13: Generate a new molecule ω2′ which consists of ϕ2′ and ψ2′;
14: return the new molecules ω1′ and ω2′;
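The order-molecule part of Intermole is a single-point recombination. A sketch of steps (1)-(4) in our own notation is given below; the ψ molecules are recombined in the same position-wise manner. Because the prefix of a valid topological order already contains all predecessors of its members, each child is again a valid topological order.

import random

def intermole_orders(phi1, phi2):
    """Cut both parents at a random point i; each child keeps one parent's
    prefix and appends the missing subtasks in the other parent's order."""
    i = random.randrange(1, len(phi1))
    def child(keep, fill):
        head = keep[:i]
        tail = [t for t in fill if t not in head]  # O(n^2), cf. Section 5.3
        return head + tail
    return child(phi1, phi2), child(phi2, phi1)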
5.2.4. Synthesis

In this paper, the operation Synth(ω1, ω2) is used to generate a new solution ω′ from two existing solutions ω1 and ω2. The new solution is generated in the following steps. (1) An integer i between 1 and n is randomly selected. (2) ϕ1 and ϕ2 are cut off at position i into left and right segments. (3) The left segment of ϕ′ is inherited from the corresponding segment of ϕ1, and each task in the right segment of ϕ′ comes from the tasks of ϕ2 that do not appear in the left segment of ϕ′. The atoms at the corresponding positions in ψ1 and ψ2 are then adjusted in the same fashion to generate ψ′. These operations are illustrated in Figs. 12 and 13, and the detailed execution is outlined in Algorithm 6.

Fig. 12. Illustration of molecular structure change for synthesis.

Fig. 13. Illustration of the task-to-computing-node mapping for synthesis.
Algorithm 6 Synth(ω1, ω2) function
Input: Two molecules ω1 and ω2;
Output: The new molecule ω′;
1: Generate ρ ∈ [0, 1];
2: if ρ < 0.5 then
3:   The left segment of ω′ is inherited from the corresponding segment of the original small molecule ϕ1;
4:   The right segment of ω′ is copied from the other original small molecule ψ2;
5:   Generate a new molecule ω′ which consists of ϕ1 and ψ2;
6: else
7:   The left segment of ω′ is inherited from the corresponding segment of the original small molecule ϕ2;
8:   The right segment of ω′ is copied from the other original small molecule ψ1;
9:   Generate a new molecule ω′ which consists of ϕ2 and ψ1;
10: end if
11: return a new molecule ω′;
Y. Xu et al. / J. Parallel Distrib. Comput. 73 (2013) 1306–1322
5.3. The outline and analysis of DMSCRO
The entire algorithm of using DMSCRO to schedule a DAG job is
outlined in Algorithm 7. In the algorithm, DMSCRO first initializes the process (Steps 1–3 in the algorithm). Then the process enters a loop (Steps 4–37). In each iteration, one of the elementary chemical reaction operations is performed to generate new molecules. The PE
of newly generated molecules will be calculated. The iteration
repeats until the stopping criteria are met. The stopping criteria
may be set based on different parameters, such as the maximum
amount of CPU time used, the maximum number of iterations
performed, an objective function value less than a predefined
threshold obtained, and the maximum number of iterations
performed without further performance improvement. In the
implementation of the experiments in this paper, the stopping criterion is that there is no makespan improvement over 10,000 consecutive iterations of the search loop. After the
searching loop stops, the molecule with the lowest PE contains the
best DAG scheduling solution (Step 38).
Algorithm 7 The outline of DMSCRO
Input: The DAG job;
Output: The DAG scheduling solution;
1: Initialize PopSize, KELossRate, MoleColl, InitialKE, α and β;
2: Call Algorithm 1 to generate the initial population, Pop;
3: Call Algorithm 2 to calculate the PE of each molecule in Pop;
4: while the stopping criteria are not met do
5:   Generate b ∈ [0, 1];
6:   if b > MoleColl then
7:     Select a molecule ω from Pop randomly;
8:     if (NumHitω − MinHitω) > α then
9:       Call Algorithm 4 to generate new molecules ω1′ and ω2′;
10:      Call Algorithm 2 to calculate PEω1′ and PEω2′;
11:      if Inequality (3) holds then
12:        Remove ω from Pop;
13:        Add ω1′ and ω2′ to Pop;
14:      end if
15:    else
16:      Call Algorithm 3 to generate a new molecule ω′;
17:      Call Algorithm 2 to calculate PEω′;
18:      Remove ω from Pop;
19:      Add ω′ to Pop;
20:    end if
21:  else
22:    Select two molecules ω1 and ω2 from Pop randomly;
23:    if (KEω1 < β) and (KEω2 < β) then
24:      Call Algorithm 6 to generate a new molecule ω′;
25:      Call Algorithm 2 to calculate PEω′;
26:      if Inequality (11) holds then
27:        Remove ω1 and ω2 from Pop;
28:        Add ω′ to Pop;
29:      end if
30:    else
31:      Call Algorithm 5 to generate two new molecules ω1′ and ω2′;
32:      Call Algorithm 2 to calculate PEω1′ and PEω2′;
33:      Remove ω1 and ω2 from Pop;
34:      Add ω1′ and ω2′ to Pop;
35:    end if
36:  end if
37: end while
38: return the molecule with the lowest PE in Pop;
The time complexity of DMSCRO is analyzed as follows.
According to Algorithm 7, the time is mainly spent in running
the searching loop (Step 4–37) in the proposed DMSCRO. In each
iteration of the loop, the algorithm needs to (1) perform one of the
four elementary operations and (2) evaluate the fitness function
of solutions (i.e., makespan) for each new molecule generated.
In the on-wall collision, the main steps are (1) finding the first
predecessor of a randomly selected task in the encoding array,
and (2) adjusting the order of the subtasks between the selected
subtask and its predecessor. The time complexity of both steps
is O(n). Therefore, the time complexity of on-wall collision is
O(2 × n). In decomposition, the operations involved in the on-wall collision are performed twice. Therefore, the time complexity
of decomposition is O(4 × n). In the synthesis operation, the
main steps are (1) copying the head portion of molecule 1 to
a new module, (2) copying the subtasks that are in molecule 2,
but are not in the head portion of molecule 1 to the tail portion
of the new module. The time complexity of the first step is
O(n). The time complexity of the second step is O(n²), because for each subtask in the subtask queue ϕ of molecule 2, the step needs to check whether the subtask is already in the head portion of molecule 1. So the time complexity of the synthesis operation is O(n + n²). In the inter-molecule collision, similar steps are performed, but two new molecules are generated. So the time complexity of the inter-molecule collision is O(2 × (n + n²)). As
discussed above, the most expensive operation is inter-molecule
collision. The time complexity of evaluating the fitness function
of a molecule (i.e., calculating the makespan of a DAG schedule) is
O(e × m), where e is the number of edges in the DAG and m is the
number of heterogeneous computing nodes. Two molecules (by
inter-molecule collision or decomposition) may be generated in an
iteration. Therefore, in an iteration, the worst case time complexity
is O(2 × (n + n²) + 2 × e × m). Therefore, the time complexity of DMSCRO is O(iters × 2 × (n + n² + e × m)), where iters is the
number of iterations performed by DMSCRO.
The space complexity of the DMSCRO can be analyzed as
follows. In DMSCRO, for each molecule we need an array of
size (n + n) to store it. Moreover, for each molecule, we use
an extra molecule space to store the molecule structure that
produces the minimum makespan during the evolution of the
molecule. Therefore, the space complexity caused by a molecule
is O(4 × n). There are PopSize molecules in the initial population.
Therefore, the space complexity of DMSCRO is O(PopSize × 4 × n).
Note that although the number of molecules may change during
the evolution, the setting of the decomposition and synthesis
probability in DMSCRO dictates that the average number of
molecules is still PopSize statistically during the evolution.
It is very difficult to theoretically prove the optimality of
the CRO (as well as DMSCRO) scheme. However, by analyzing
the chemical reaction operations designed in DMSCRO and the
operational environment of DMSCRO, it can be shown to some
extent that DMSCRO enjoys the advantages of both GA and SA.
The Inter-molecular collision designed in DMSCRO exchanges
the partial structure of two different molecules, which acts like
the crossover operation in GA, while the operation of on-wall
collision designed in DMSCRO randomly changes the molecular
structures, which has an effect similar to that of the mutation operation in GA. On the other hand, the energy conservation requirement in DMSCRO is able to guide the search for the optimal solution in a similar way as the Metropolis Algorithm of SA guides the
evolution of the solutions in SA. In addition to the similarity to
GA and SA, DMSCRO has two additional operations: decomposition
and synthesis. These two operations may change the number of solutions in a population, which gives DMSCRO more opportunities to jump out of a local optimum and explore wider areas in the
solution space. This benefit enables DMSCRO to find good solutions
faster than GA, which is demonstrated by the experimental results
in Section 6.4.
It is also worth emphasizing that uni-molecular (on-wall) and
inter-molecular collisions play different roles in the DMSCRO
process. As discussed above, the inter-molecular collision in
DMSCRO bears similarity with the crossover operation in Genetic
algorithms. If only this type of collision is performed, the search is likely to be locked in a local optimum. On the other hand,
uni-molecular collision in the DMSCRO maintains the diversity of
molecules by randomly mutating the molecular structures, which
can help the searching jump out of the local optimum and reach
wider areas in the solution space. Both types of collisions are
important for the searching in DMSCRO.
6. Simulation and results
To illustrate the power of the DMSCRO-based DAG scheduling algorithm, we compare this algorithm with the previously proposed heuristics (HEFT_B and HEFT_T) [29] and also with a
well known meta-heuristic algorithm, Genetic Algorithm (GA),
presented in [12]. The makespan performance obtained by GA is
used as the baseline performance. The GA used in the simulation
experiments is therefore labeled as BGA in the figures.
The reason why we select HEFT_B and HEFT_T as the representatives of heuristic scheduling is that they outperform other heuristic scheduling algorithms in heterogeneous parallel systems, as discussed in the fourth-to-last paragraph of Section 2.1. The reasons why we compare DMSCRO with the GA presented in [12] are threefold. First, GA is to
date the most popular meta-heuristic technique for scheduling.
Second, DMSCRO absorbs the strengths of GA and SA (Simulated
Annealing). Therefore, we want to compare DMSCRO with GA
to see the advantages of DMSCRO over GA. We do not compare
DMSCRO with SA because SA performs operations on a single
solution to generate new solutions, instead of on a population of
solutions as in DMSCRO and GA. Indeed, the underlying principles
and philosophies between DMSCRO and SA differ a lot. Typically,
a meta-heuristic algorithm operating on a population of solutions
is able to find good solutions faster than that operating on a
single solution. Finally, as we have discussed in the first paragraph
of Section 2.2, the GA method presented in [12] has the closest
workload and system model to that in our paper.
In the experiments, we consider two extensive sets of graphs as the workload: real-world application graphs and randomly generated application graphs. The performance is evaluated in terms of makespan. Each makespan value plotted in the graphs is the result averaged over a number of independent runs, which we call the average makespan. For the real-world applications, the average makespan is obtained from 10 independent runs, while for the random graphs it is obtained from runs over 100 different random DAG graphs.
DMSCRO is programmed in C♯ (the outline of the program is shown in Algorithm 7). A DAG task in the program is represented by a class whose members include an array of subtasks, the matrix of the speed at which each subtask runs on each computing node, and the matrix of communication data between every pair of subtasks. A subtask in the program is also represented by a class, whose members include an array of the subtask's predecessors, an array of its successors, its indegree, its outdegree and its computational data. The simulations are performed on the same PC with an Intel Core 2 Duo E6700 @ 2.66 GHz CPU and 2 GB RAM.
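The two classes described above can be sketched as follows (member names are illustrative and chosen to mirror the description; this is not the authors' actual C♯ source):

class Subtask
{
    public int[] Predecessors;        // indices of predecessor subtasks
    public int[] Successors;          // indices of successor subtasks
    public int InDegree;              // number of predecessors
    public int OutDegree;             // number of successors
    public double ComputationalData;  // W(Ti)
}

class DagTask
{
    public Subtask[] Subtasks;        // the subtasks of the DAG
    public double[,] Speed;           // Speed[i, k]: speed of subtask i on node k
    public double[,] Communication;   // Communication[i, j]: data between i and j
}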
As discussed in Section 4.2, this paper adopts the MHM
heterogeneity model and the heterogeneity is represented by a
two-dimensional matrix, S. In the experiments, the heterogeneity
is set in the following way. We set a parameter h (0 < h < 1). The value of each element S(Ti, Pk) of S is randomly selected from the range [1 − h%, 1 + h%]. In doing so, the speed of a computing node differs for different subtasks, which complies with the assumption of the MHM model. The largest possible ratio of the best computing node speed to the worst computing node speed for each task is (1 + h%)/(1 − h%), which is used to represent the level of heterogeneity. Unless otherwise stated, h is set such that the level of heterogeneity is 2.
Fig. 14. Gaussian elimination for a matrix of size 7.
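A sketch of this setting (with h written as a fraction rather than a percentage; the helper name is illustrative):

using System;

static class Heterogeneity
{
    // Each element S(Ti, Pk) is drawn uniformly from [1 - h, 1 + h], so a
    // node's speed differs from subtask to subtask, as the MHM model assumes.
    public static double[,] BuildSpeedMatrix(int numSubtasks, int numNodes,
                                             double h, Random rand)
    {
        var s = new double[numSubtasks, numNodes];
        for (int i = 0; i < numSubtasks; i++)
            for (int k = 0; k < numNodes; k++)
                s[i, k] = (1 - h) + 2 * h * rand.NextDouble();
        return s;
    }
}

For the default level of heterogeneity of 2, solving (1 + h)/(1 − h) = 2 gives h = 1/3, i.e., speeds drawn from roughly [0.67, 1.33].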
The parameters are set as follows, unless otherwise stated: PopSize = 10, KELossRate = 0.2, MoleColl = 0.2, InitialKE = 1000, α = 500, β = 10, and buffer = 200. These values follow the settings suggested in the literature [18].
6.1. Real world application graphs
The first test set uses the task graphs of two real-world problems, Gaussian elimination [36] and a molecular dynamics code [16], to evaluate the performance of DMSCRO.
6.1.1. Gaussian elimination
Gaussian elimination is used to determine the solution of a system of linear equations. It systematically applies elementary row operations to a set of linear equations in order to convert it to upper triangular form; once the coefficient matrix is in upper triangular form, back substitution is used to find the solution. The DAG for the Gaussian elimination algorithm with a matrix size of 7 is shown in Fig. 14. The total number of tasks in the graph is 27, and the largest number of tasks at the same level is 6. This graph has been used to evaluate DMSCRO in the simulation. Since the structure of the graph is fixed, only the number of heterogeneous computing nodes and the communication-to-computation ratio (CCR) were varied. The CCR values were 0.1, 0.2, 1, 2, and 5 in the experiments. Since in Gaussian elimination the same operation is executed on every computing node and the same information is communicated between heterogeneous computing nodes, it is assumed that all tasks have the same computation cost and all communication links have the same communication cost.
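For reference, the structure of this DAG can be generated programmatically. The sketch below assumes the standard layout of the Gaussian elimination task graph used in [29,36], where each step k consists of one pivot task followed by (m − k) update tasks, so m = 7 yields 6 + (6 + 5 + ... + 1) = 27 tasks, matching Fig. 14; the edge rules are our reading of that structure, not code from this paper:

using System.Collections.Generic;

static class GaussDag
{
    // Returns the precedence edges of the Gaussian elimination DAG for an
    // m x m matrix; tasks are numbered step by step, pivot first.
    public static List<(int from, int to)> Edges(int m)
    {
        var id = new Dictionary<(int k, int j), int>();
        int next = 0;
        for (int k = 1; k < m; k++)
        {
            id[(k, k)] = next++;              // pivot task of step k
            for (int j = k + 1; j <= m; j++)
                id[(k, j)] = next++;          // update tasks of step k
        }
        var edges = new List<(int, int)>();
        for (int k = 1; k < m; k++)
            for (int j = k + 1; j <= m; j++)
            {
                edges.Add((id[(k, k)], id[(k, j)]));   // pivot enables update
                if (k + 1 < m)                          // feed the next step
                    edges.Add(j == k + 1
                        ? (id[(k, j)], id[(k + 1, k + 1)])
                        : (id[(k, j)], id[(k + 1, j)]));
            }
        return edges;
    }
}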
Fig. 15 shows the makespan of DMSCRO, BGA, HEFT_B and HEFT_T under an increasing number of computing nodes. The average makespan decreases as the number of computing nodes increases, which is to be expected. It can also be seen that, as the number of computing nodes increases, the advantage of DMSCRO and BGA over HEFT_B and HEFT_T diminishes. This may be because when more computing nodes are available to run the same scale of tasks, less intelligent scheduling algorithms suffice to achieve good performance.
Fig. 15. Average makespan for Gaussian elimination.
Fig. 16. Average makespan for Gaussian elimination; the number of computing nodes is 8.
DMSCRO and BGA typically outperform HEFT_B and HEFT_T because DMSCRO and BGA search a wide area of the solution space for the optimal schedule, while HEFT_B and HEFT_T narrow the search down to a very small portion of the solution space by means of their heuristics. Therefore, DMSCRO and BGA are more likely to obtain better solutions than HEFT_B and HEFT_T.
The experiments show that DMSCRO and BGA achieve very similar performance. The fundamental reason is that both BGA and DMSCRO are meta-heuristic methods. According to the No-Free-Lunch theorem in the area of meta-heuristics [34], all well-designed meta-heuristic methods that search for optimal solutions perform the same when averaged over all possible objective functions, and, in theory, as long as a well-designed meta-heuristic method runs for long enough, it will gradually approach the optimal solution. The BGA used in the experiments is taken from the literature, which has shown it to be well designed. The fact that DMSCRO delivers very similar performance to BGA therefore indicates that the DMSCRO developed in our work is also well designed.
A close observation of the figure shows that DMSCRO slightly outperforms BGA in some cases. As discussed in the last paragraph of Section 5, all meta-heuristic methods that search for optimal solutions perform the same when averaged over all possible objective functions, and, as long as such a method runs for long enough, it will in theory gradually approach the optimal solution. The reason why DMSCRO performs better than BGA in some cases is only that, when the stopping criterion set in the experiments is satisfied (namely, the makespan stays unchanged for 10 000 consecutive iterations of the search loop), the performance obtained by DMSCRO is better than that obtained by BGA. This also shows that DMSCRO is more efficient than BGA in searching for good solutions. More detailed experimental results on this aspect are presented in Section 6.4.
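The stopping rule used in the experiments can be sketched as follows (runOneIteration stands for one pass of the DMSCRO or GA search loop and is supplied by the caller; the helper name is illustrative):

using System;

static class StoppingRule
{
    // Runs the search until the best makespan has remained unchanged for
    // 'limit' consecutive iterations (10 000 in the experiments).
    public static double SearchUntilStable(Func<double> runOneIteration,
                                           int limit = 10000)
    {
        double best = double.MaxValue;
        int unchanged = 0;
        while (unchanged < limit)
        {
            double makespan = runOneIteration();
            if (makespan < best) { best = makespan; unchanged = 0; }
            else unchanged++;
        }
        return best;
    }
}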
Fig. 16 shows the performance of the four algorithms under increasing CCR. It can be observed that the average makespan increases rapidly with CCR. This may be because when CCR increases, the application becomes more communication-intensive, and consequently the heterogeneous computing nodes stay idle for longer. Fig. 16 shows that DMSCRO and BGA outperform HEFT_B and HEFT_T and that the advantage becomes more prominent as CCR grows. These results suggest that the heuristic algorithms, HEFT_B and HEFT_T, perform less effectively for communication-intensive applications, while DMSCRO and BGA deliver more consistent performance over a wide range of scheduling scenarios.
6.1.2. Molecular dynamics code
Fig. 17 represents the DAG of a molecular dynamics code as given in [16]. Again, since the graph has a fixed structure, the only parameters that could be varied were the number of heterogeneous computing nodes and the CCR values. The CCR values used in our experiments are 0.1, 0.2, 1, 2, and 5.
Figs. 18 and 19 show the average makespan of DMSCRO, BGA, HEFT_B and HEFT_T under different numbers of heterogeneous computing nodes and different CCR values, respectively. Fig. 18 shows the decrease in average makespan with an increasing number of heterogeneous computing nodes. Fig. 19 plots the average makespan for different CCR values; the average makespan increases with increasing CCR.
6.2. Randomly generated application graphs
In these experiments, we used randomly generated task graphs to evaluate the performance. To generate them, we implemented a random graph generator that can produce a variety of random graphs with different characteristics. The input parameters of the generator are the CCR,
the number of instructions in a subtask (representing the
computational data), the number of successors of a subtask
(representing the degree of parallelism) and the number of
subtasks in a graph. We have generated a large set of random
task graphs with different characteristics and scheduled these task
graphs on a heterogeneous computing system.
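A sketch of such a generator (parameter names are illustrative; drawing successors only from higher-numbered subtasks guarantees that the generated graph is acyclic, and the communication data would subsequently be scaled to match the target CCR):

using System;
using System.Collections.Generic;

static class RandomDag
{
    // Generates the computational data of each subtask and the precedence
    // edges of a random DAG.
    public static (int[] work, List<(int from, int to)> edges) Generate(
        int numSubtasks, int maxSuccessors, int maxWork, Random rand)
    {
        var work = new int[numSubtasks];
        var edges = new List<(int, int)>();
        for (int i = 0; i < numSubtasks; i++)
        {
            work[i] = 1 + rand.Next(maxWork);        // W(Ti), e.g. in [1, 80]
            if (i == numSubtasks - 1) continue;      // the sink has no successors
            int succ = 1 + rand.Next(maxSuccessors); // 1..maxSuccessors successors
            for (int s = 0; s < succ; s++)
                edges.Add((i, i + 1 + rand.Next(numSubtasks - i - 1)));
        }
        return (work, edges);
    }
}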
The following are the values of parameters used in the
simulation experiments, unless otherwise stated. The number of
subtasks generated in a DAG is randomly selected from 10 to 200.
The number of successors that a subtask can have is a random
number between 1 and 4. The number of computing nodes is 32.
Fig. 17. A molecular dynamics code.
Fig. 18. Average makespan for the molecular dynamics code.
Fig. 19. Average makespan for the molecular dynamics code; the number of computing nodes is 16.
Fig. 20. Average makespan of different subtask numbers, CCR = 10; the number of computing nodes is 32.
One of the initial scheduling solutions is set to the solution generated by HEFT_B, while the other initial solutions are randomly generated.
The computational costs of the DAG subtasks are generated as follows. The computational data of each subtask in the DAG (i.e., W(Ti)) is randomly selected from a range; unless otherwise stated, the range is [1, 80]. The computational cost of subtask Ti on computing node Pk is then calculated using Eq. (13).
We have evaluated the performance of the algorithms under different parameters, including the number of subtasks, the number of heterogeneous computing nodes and the CCR value. DMSCRO is compared with the other algorithms in terms of makespan. Each value plotted in the graphs is the result averaged over 100 different random DAG graphs.
As shown in Fig. 20, DMSCRO always outperforms HEFT_B, HEFT_T and BGA as the number of subtasks in a DAG increases. The reasons are the same as those explained for Fig. 15.
Fig. 21 shows the average makespan under increasing values of CCR. It can be observed that the average makespan increases rapidly with CCR. This may be because when CCR increases, the application becomes more communication-intensive, and consequently the computing nodes stay idle for longer.
Both Figs. 22 and 23 compare the average makespan of the four algorithms under an increasing number of heterogeneous computing nodes. The difference is that Fig. 22 considers a workload with a low value of CCR (computation-intensive workload), while Fig. 23 considers a high value of CCR (communication-intensive workload). As can be seen from these two figures, DMSCRO outperforms the heuristic algorithms in all cases. Moreover, comparing the two figures shows that when the CCR is low, the makespan decreases quickly (i.e., the performance improves quickly) as the number of heterogeneous computing nodes increases, whereas when the CCR is high, the makespan decreases at a much lower rate. This is because when CCR is high, the makespan tends to be high, as explained for Fig. 21, and therefore it is difficult to reduce the makespan even if the number of heterogeneous computing nodes is increased.
Fig. 21. Average makespan of DMSCRO under different values of CCR; subtasks = 100.
Fig. 22. Average makespan of four algorithms under different computing node numbers and low communication costs; subtasks = 100.
Fig. 23. Average makespan of four algorithms under different computing node numbers and high communication costs; subtasks = 100.
6.3. Impact of heterogeneity on makespan
We have conducted experiments to demonstrate the impact of heterogeneity on makespan. In these experiments, we change the value of h so that the level of heterogeneity increases from 1 to 4. Note that when the level of heterogeneity is 1 the computing system is homogeneous, and that the average computing node speed remains unchanged as we change the value of h and therefore the level of heterogeneity.
The experimental results are plotted in Fig. 24, where the level of heterogeneity on the x-axis is calculated as (1 + h%)/(1 − h%); equivalently, a target level L corresponds to h% = (L − 1)/(L + 1), so L = 4 requires h% = 60%. As can be
seen from this figure, the makespan decreases in all cases as the
level of heterogeneity increases. These results can be explained
as follows. When the level of heterogeneity increases, some
heterogeneous computing nodes become increasingly powerful, and therefore the subtasks can run faster on those heterogeneous
computing nodes. The DMSCRO algorithm is able to make use of
this opportunity and intelligently allocate a suitable number of
subtasks to those heterogeneous computing nodes. Consequently,
the makespan decreases. A closer observation of Fig. 24 shows that when the average number of tasks in a DAG is small, the makespan almost tails off once the level of heterogeneity increases beyond a certain value. For example, in the case of the average number of tasks being 50, the makespan remains almost unchanged when the level of heterogeneity exceeds 2.8. This is because when the average number of tasks is small, the workload is also small and the top performance is therefore easier to reach. Although increasing the level of heterogeneity can improve the makespan, the improvement becomes increasingly difficult as the performance approaches its upper bound.
Fig. 24. The impact of heterogeneity on makespan.
6.4. Convergence trace of DMSCRO
The experiments in the previous subsections show the final
makespan achieved by DMSCRO and GA after the stopping criteria
are satisfied. These results show that DMSCRO can obtain similar makespan performance to GA and that in some cases the
final makespan obtained by DMSCRO is even better than that by
GA when they stop searching. In this section, we conduct experiments to show how the makespan changes as DMSCRO and GA progress during the search, and we then compare the convergence traces of the two algorithms. These experiments help further reveal the differences between DMSCRO and GA and help explain why the final makespan performance of DMSCRO is sometimes better than that of GA.
Fig. 25. The convergence trace for Gaussian elimination; averaged over 10 independent runs, CCR = 1, processors = 8.
Figs. 25 and 26 show the convergence traces when processing Gaussian elimination and the molecular dynamics code, respectively. Figs. 27–31 plot the convergence traces for the sets of randomly generated DAGs, with the sets containing DAGs of 10, 20, 50, 100 and 200 subtasks, respectively. It can be observed from these figures that the makespan decreases quickly as both DMSCRO and GA progress, and that the decreasing trends tail off when the algorithms run for long enough. These figures also show that, although the final makespans achieved by the two algorithms are almost the same in most cases, their convergence traces are rather different. In all cases, DMSCRO converges faster than GA. Quantitatively, our records show that DMSCRO converges faster than GA by 26.5% on average (and by 58.9% in the best case).
In these experiments, the algorithms stop the search when the performance stabilizes (i.e., the makespan remains unchanged) for a preset number of consecutive iterations of the search loop (10 000 iterations in the experiments). In practice, the stopping criterion could instead be that the algorithm stops when its total running time reaches a preset value (e.g., 60 s). In this case, the fact that DMSCRO converges faster than GA means that the makespan obtained by DMSCRO could be much better than that obtained by GA when the algorithms stop. The reason can be explained by the analysis presented in the last paragraph of Section 5.3: DMSCRO enjoys the advantages of both GA and SA.
Fig. 26. The convergence trace for the molecular dynamics code; averaged over 10 independent runs, CCR = 1, processors = 16.
Fig. 27. The convergence trace for the randomly generated DAGs, with each containing 10 subtasks.
7. Conclusions
In this paper, we developed DMSCRO for DAG scheduling on heterogeneous computing systems. The algorithm incorporates two molecular structures: one evolves to generate the priority queueing of the subtasks in a DAG, and the other to generate task-to-computing-node mappings. Four elementary chemical reaction operations are designed in DMSCRO; they take into account the precedence relations of the subtasks and guarantee that the newly generated priority queueing complies with those precedence relations. As a result, the DMSCRO algorithm can cover a much larger search space than heuristic scheduling approaches. The experiments show that DMSCRO outperforms HEFT_B and HEFT_T and can achieve a higher speedup of task executions.
Fig. 28. The convergence trace for the randomly generated DAGs, with each containing 20 subtasks.
In future work, we plan to extend DMSCRO by investigating the following three intriguing issues. First, we will study the impact of important parameters on the initial solution generator. Second, we plan to use MPI to parallelize the running of DMSCRO, so as to further reduce the time needed to find good solutions. Finally, the quadratic relationship between voltage and energy has made dynamic voltage scaling (DVS) one of the most powerful techniques for reducing system power demands by lowering the supply voltage and operating frequency. We will apply the DVS technique to our DMSCRO task scheduling algorithm to realize a bi-objective optimization that minimizes both makespan and energy.
Fig. 29. The convergence trace for the randomly generated DAGs, with each containing 50 subtasks.
Fig. 30. The convergence trace for the randomly generated DAGs, with each containing 100 subtasks.
Fig. 31. The convergence trace for the randomly generated DAGs, with each containing 200 subtasks.
Acknowledgments
This research was partially funded by the Key Program
of National Natural Science Foundation of China (Grant No.
61133005), the National Natural Science Foundation of China
(Grant Nos. 61070057, 61173045), Key Projects in the National
Science & Technology Pillar Program, the Cultivation Fund of the
Key Scientific and Technical Innovation Project (2012BAH09B02),
and the Ph.D. Programs Foundation of Ministry of Education
of China (20100161110019). The project was supported by the
National Science Foundation for Distinguished Young Scholars of
Hunan (12JJ1011), the Science and Technology Plan Projects in
Hunan Province (Grant No. 2013GK3082) and the Research Project
Grant of the Leverhulme Trust (Grant No. RPG-101).
References
[1] A. Amini, T.Y. Wah, M. Saybani, S. Yazdi, A study of density-grid based
clustering algorithms on data streams, in: 2011 Eighth International
Conference on Fuzzy Systems and Knowledge Discovery, FSKD, Vol. 3, 2011,
pp. 1652–1656. http://dx.doi.org/10.1109/FSKD.2011.6019867.
[2] R. Bajaj, D. Agrawal, Improving scheduling of tasks in a heterogeneous
environment, IEEE Transactions on Parallel and Distributed Systems 15 (2)
(2004) 107–118. http://dx.doi.org/10.1109/TPDS.2004.1264795.
[3] T. Chen, B. Zhang, X. Hao, Y. Dai, Task scheduling in grid based on
particle swarm optimization, in: The Fifth International Symposium on
Parallel and Distributed Computing, 2006, ISPDC’06, 2006, pp. 238–245.
http://dx.doi.org/10.1109/ISPDC.2006.46.
[4] H. Cheng, A high efficient task scheduling algorithm based on heterogeneous multi-core processor, in: 2010 2nd International Workshop on Database Technology and Applications, DBTA, 2010, pp. 1–4.
http://dx.doi.org/10.1109/DBTA.2010.5659041.
[5] P. Choudhury, R. Kumar, P. Chakrabarti, Hybrid scheduling of dynamic task
graphs with selective duplication for multiprocessors under memory and
time constraints, IEEE Transactions on Parallel and Distributed Systems 19 (7)
(2008) 967–980. http://dx.doi.org/10.1109/TPDS.2007.70784.
[6] L. He, D. Zou, Z. Zhang, C. Chen, H. Jin, S. Jarvis, Developing resource consolidation frameworks for moldable virtual machines in clouds, Future Generation
Computer Systems (2012) http://dx.doi.org/10.1016/j.future.2012.05.015.
[7] H. El-Rewini, T. Lewis, Scheduling parallel program tasks onto arbitrary target machines, Journal of Parallel and Distributed Computing 9 (1990) 138–153.
[8] F. Ferrandi, P. Lanzi, C. Pilato, D. Sciuto, A. Tumeo, Ant colony heuristic for mapping and scheduling tasks and communications on heterogeneous embedded systems, IEEE Transactions on Computer-Aided
Design of Integrated Circuits and Systems 29 (6) (2010) 911–924.
http://dx.doi.org/10.1109/TCAD.2010.2048354.
[9] M.R. Garey, D.S. Johnson, Computers and Intractability: A Guide to the Theory
of NP-Completeness, W. H. Freeman, New York, 1979.
[10] H.-W. Ge, L. Sun, Y.-C. Liang, F. Qian, An effective PSO and AIS-based hybrid
intelligent algorithm for job-shop scheduling, IEEE Transactions on Systems,
Man, and Cybernetics Part A: Systems and Humans 38 (2) (2008) 358–368.
http://dx.doi.org/10.1109/TSMCA.2007.914753.
[11] N.B. Ho, J.C. Tay, Solving multiple-objective flexible job shop problems
by evolution and local search, IEEE Transactions on Systems, Man, and
Cybernetics, Part C: Applications and Reviews 38 (5) (2008) 674–685.
http://dx.doi.org/10.1109/TSMCC.2008.923888.
[12] E. Hou, N. Ansari, H. Ren, A genetic algorithm for multiprocessor scheduling,
IEEE Transactions on Parallel and Distributed Systems 5 (2) (1994) 113–120.
http://dx.doi.org/10.1109/71.265940.
[13] J. Hwang, Y. Chow, F. Anger, C. Lee, Scheduling precedence graphs in systems
with interprocessor communication times, SIAM Journal on Computing 18 (2)
(1989) 244–257. http://dx.doi.org/10.1137/0218016.
[14] M. Iverson, F. Ozguner, G. Follen, Parallelizing existing applications in
a distributed heterogeneous environment, in: 1995 IEEE International
Conference on Heterogeneous Computing Workshop, 1995, pp. 93–100.
[15] M. Kashani, M. Jahanshahi, Using simulated annealing for task scheduling
in distributed systems, in: International Conference on Computational
Intelligence, Modelling and Simulation, 2009, CSSim’09, 2009, pp. 265–269.
http://dx.doi.org/10.1109/CSSim.2009.36.
[16] S. Kim, J. Browne, A general approach to mapping of parallel computation upon
multiprocessor architectures, in: Proceedings of the International Conference
on Parallel Processing, 1988, pp. 1–8.
[17] Y.-K. Kwok, I. Ahmad, Static scheduling algorithms for allocating directed task graphs to multiprocessors, ACM Computing Surveys (CSUR)
31 (4) (1999) 406–471. http://dx.doi.org/10.1145/344588.344618.
[18] A. Lam, V. Li, Chemical-reaction-inspired metaheuristic for optimization,
IEEE Transactions on Evolutionary Computation 14 (3) (2010) 381–399.
http://dx.doi.org/10.1109/TEVC.2009.2033580.
[19] H. Li, L. Wang, J. Liu, Task scheduling of computational grid based on
particle swarm algorithm, in: 2010 Third International Joint Conference on
Computational Science and Optimization, CSO, Vol. 2, 2010, pp. 332–336.
http://dx.doi.org/10.1109/CSO.2010.34.
[20] K. Li, Z. Zhang, Y. Xu, B. Gao, L. He, Chemical reaction optimization for
heterogeneous computing environments, in: 2012 IEEE 10th International
Symposium on Parallel and Distributed Processing with Applications, ISPA,
2012, pp. 17–23. http://dx.doi.org/10.1109/ISPA.2012.11.
[21] F.-T. Lin, Fuzzy job-shop scheduling based on ranking level (λ, 1) interval-valued fuzzy numbers, IEEE Transactions on Fuzzy Systems 10 (4) (2002) 510–522. http://dx.doi.org/10.1109/TFUZZ.2002.800659.
[22] B. Liu, L. Wang, Y.-H. Jin, An effective PSO-based memetic algorithm for flow
shop scheduling, IEEE Transactions on Systems, Man, and Cybernetics, Part B:
Cybernetics 37 (1) (2007) 18–27.
http://dx.doi.org/10.1109/TSMCB.2006.883272.
[23] F. Pop, C. Dobre, V. Cristea, Genetic algorithm for dag scheduling in grid
environments, in: IEEE 5th International Conference on Intelligent Computer
Communication and Processing, 2009, ICCP 2009, August, pp. 299–305.
http://dx.doi.org/10.1109/ICCP.2009.5284747.
[24] R. Shanmugapriya, S. Padmavathi, S. Shalinie, Contention awareness in task
scheduling using Tabu search, in: IEEE International Advance Computing
Conference, 2009, IACC 2009, 2009, pp. 272–277.
http://dx.doi.org/10.1109/IADCC.2009.4809020.
[25] L. Shi, Y. Pan, An efficient search method for job-shop scheduling problems,
IEEE Transactions on Automation Science and Engineering 2 (1) (2005) 73–77.
http://dx.doi.org/10.1109/TASE.2004.829418.
[26] G. Sih, E. Lee, A compile-time scheduling heuristic for interconnectionconstrained heterogeneous processor architectures, IEEE Transactions on
Parallel and Distributed Systems 4 (2) (1993) 175–187.
http://dx.doi.org/10.1109/71.207593.
[27] S. Song, K. Hwang, Y.-K. Kwok, Risk-resilient heuristics and genetic algorithms
for security-assured grid job scheduling, IEEE Transactions on Computers 55
(6) (2006) 703–719. http://dx.doi.org/10.1109/TC.2006.89.
[28] D.P. Spooner, J. Cao, S.A. Jarvis, L. He, G.R. Nudd, Performance-aware workflow
management for grid computing, The Computer Journal 48 (3) (2005)
347–357. http://dx.doi.org/10.1093/comjnl/bxh090.
[29] H. Topcuoglu, S. Hariri, M.-Y. Wu, Performance-effective and lowcomplexity task scheduling for heterogeneous computing, IEEE Transactions on Parallel and Distributed Systems 13 (3) (2002) 260–274.
http://dx.doi.org/10.1109/71.993206.
[30] T. Tsuchiya, T. Osada, T. Kikuno, A new heuristic algorithm based on gas for
multiprocessor scheduling with task duplication, in: 1997 3rd International
Conference on Algorithms and Architectures for Parallel Processing, 1997,
ICAPP 97, 1997, pp. 295–308. http://dx.doi.org/10.1109/ICAPP.1997.651499.
[31] A. Tumeo, C. Pilato, F. Ferrandi, D. Sciuto, P. Lanzi, Ant colony optimization for mapping and scheduling in heterogeneous multiprocessor systems,
in: International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation, 2008, SAMOS 2008, 2008, pp. 142–149.
http://dx.doi.org/10.1109/ICSAMOS.2008.4664857.
[32] B. Varghese, G.T. McKee, V. Alexandrov, Can agent intelligence be used to achieve fault tolerant parallel computing systems? Parallel Processing Letters 21 (04) (2011) 379–396. http://dx.doi.org/10.1142/S012962641100028X.
[33] J. Wang, Q. Duan, Y. Jiang, X. Zhu, A new algorithm for grid independent task
schedule: genetic simulated annealing, in: World Automation Congress, WAC,
2010, 2010, pp. 165–171.
[34] D.H. Wolpert, W.G. Macready, No free lunch theorems for optimization, IEEE Transactions on Evolutionary Computation 1 (1) (1997) 67–82.
http://dx.doi.org/10.1109/4235.585893.
[35] Y.W. Wong, R. Goh, S.-H. Kuo, M. Low, A Tabu search for the heterogeneous dag scheduling problem, in: 2009 15th International Conference on Parallel and Distributed Systems, ICPADS, 2009, pp. 663–670.
http://dx.doi.org/10.1109/ICPADS.2009.127.
[36] M.-Y. Wu, D. Gajski, Hypertool: a programming aid for message-passing
systems, IEEE Transactions on Parallel and Distributed Systems 1 (3) (1990)
330–343. http://dx.doi.org/10.1109/71.80160.
[37] J. Xu, A. Lam, V. Li, Chemical reaction optimization for the grid scheduling
problem, in: 2010 IEEE International Conference on Communications, ICC,
2010, pp. 1–5. http://dx.doi.org/10.1109/ICC.2010.5502406.
[38] J. Xu, A. Lam, V. Li, Chemical reaction optimization for task scheduling in
grid computing, IEEE Transactions on Parallel and Distributed Systems 22 (10)
(2011) 1624–1631. http://dx.doi.org/10.1109/TPDS.2011.35.
[39] L. He, S.A. Jarvis, D.P. Spooner, H. Jiang, D.N. Dillenberger, G.R. Nudd,
Allocating non-real-time and soft real-time jobs in multiclusters, IEEE
Transactions on Parallel and Distributed Systems 17 (2) (2006) 99–112.
http://doi.ieeecomputersociety.org/10.1109/TPDS.2006.18.
Yuming Xu received the master's degree from Hunan University, China, in 2009. He is currently working toward the Ph.D. degree at Hunan University. His research interests include modeling and scheduling for distributed computing systems, parallel algorithms, and Grid and Cloud computing.
Kenli Li received the Ph.D. in computer science from Huazhong University of Science and Technology, China, in 2003, and the M.Sc. in mathematics from Central South University, China, in 2000. He was a visiting scholar at the University of Illinois at Urbana–Champaign from 2004 to 2005. He is now a professor of Computer Science and Technology at Hunan University, associate director of the National Supercomputing Center in Changsha, and a senior member of the CCF. His major research interests include parallel computing, Grid and Cloud computing, and DNA computing. He has published more than 70 papers in international conferences and journals, such as IEEE TC, JPDC, PC, ICPP, and CCGrid.
Ligang He received the Bachelor’s and Master’s degrees
from the Huazhong University of Science and Technology,
Wuhan, China, and received the Ph.D. degree in Computer
Science from the University of Warwick, UK. He was also a
Post-doctoral researcher at the University of Cambridge,
UK. In 2006, he joined the Department of Computer
Science at the University of Warwick as an Assistant
Professor, and then became an Associate Professor. His
areas of interest are parallel and distributed computing,
Grid computing and Cloud computing. He has published
more than 50 papers in international conferences and
journals, such as IEEE TPDS, IPDPS, Cluster, CCGrid, and MASCOTS. He has also served as a program committee member for many international conferences and as a reviewer for a number of international journals, including IEEE TPDS, IEEE TC, and IEEE TASE. He is a member of the IEEE.
Tung Khac Truong received the B.S. in Mathematics from Hue College, Hue University, Vietnam, in 2001, and the M.S. in computer science from Hue University, Vietnam, in 2007. He is currently working toward the Ph.D. degree in computer science at the School of Computer and Communication, Hunan University. His research interests are soft computing and parallel computing.