Task Partitioning Strategy with Minimum Time (TPSMT) in Distributed Parallel Computing Environment

International Journal of Application or Innovation in Engineering & Management (IJAIEM)
Volume 2, Issue 7, July 2013, ISSN 2319-4847
Rafiqul Zaman Khan (1) and Javed Ali (2)
(1) Associate Professor, Department of Computer Science, Aligarh Muslim University, Aligarh
(2) Research Scholar, Department of Computer Science, Aligarh Muslim University, Aligarh
ABSTRACT
High performance in heterogeneous parallel computing may be achieved by an efficient scheduling algorithm. Consideration of communication overhead and the scalability of the problems are important parameters for scheduling algorithms. Scheduling of DAG-based problems on a heterogeneous distributed computing environment is NP-complete, and these problems are subject to precedence constraints. In this paper we propose the Task Partitioning Strategy with Minimum Time (TPSMT) algorithm, which shows better performance than HEFT and CPFD in terms of communication cost and scalability. On the basis of the simulation results, TPSMT also achieves a better Normalized Schedule Length (NSL) than the conventional algorithms.
Keywords: Normalized Schedule Length, Communication Cost, High Performance Parallel Computing.
1. INTRODUCTION
Efficient task partitioning is necessary to achieve remarkable performance in multiprocessing systems. When the number of processors is large, scheduling is NP-complete. The complexity of an algorithm depends upon its execution time [1]. The objective function of any task partitioning strategy must consider the relevant computing factors, yet many proposed parallel computing algorithms consider only a few factors concurrently. Task partitioning strategies are categorized as preemptive versus non-preemptive, adaptive versus non-adaptive, dynamic versus static, and distributed versus centralized [2, 3]. In static task partitioning strategies the runtime overhead is negligible because the nature of every task is known prior to execution. If all performance factors are known in advance of task execution, static scheduling shows better results [4]; in practice, however, it is impossible to estimate all communication factors prior to execution. In round robin systems every processor of the computing system executes the same number of tasks. Dynamic scheduling balances the load by shifting scheduling decisions from the compiler to the processors at run time [6]. The performance of the system is directly affected by the task partitioning strategy [6]. Centralized and distributed scheduling show different behavior in terms of overheads and algorithmic complexity. In distributed scheduling, scheduling information is spread among the storage devices and the related computing units. Source-initiated scheduling offers tasks from highly loaded processors to lightly loaded processors [7], while in sink-initiated scheduling the lightly loaded CPUs request tasks from the highly loaded processors [5]; in symmetric scheduling, load may be transferred in either direction. In centralized strategies, one or more processing units are used to share the global information. Contention to access this global information is the major drawback of centralized scheduling algorithms, since it reduces the effective utilization of existing resources; the effect may be mitigated by hierarchical scheduling systems. A further serious drawback of centralized architectures is that some processors used in centralized scheduling cannot participate in the execution of tasks: they serve only as storage devices.
Dynamic scheduling plays an important role in the management of resources due to its run-time interaction with the underlying architecture [8]. In a large number of application areas, such as real-time computing and query processing, dynamic scheduling outperforms static scheduling.
2. Performance Metrics:
2.1 Schedule Length Ratio (SLR): the ratio of the parallel time (makespan) to the sum of the weights of the critical path tasks on the fastest processors:

SLR = makespan / Σ over ni ∈ CP of ( min over Pj of w(ni, Pj) )

2.2 Speedup: the ratio of the sequential execution time to the parallel execution time:

Speedup = sequential execution time / parallel execution time

2.3 Efficiency: the ratio of the speedup value to the number of processors used to schedule the graph. The efficiency of a scheduling algorithm is computed as:

Efficiency = Speedup / p, where p is the number of processors used.
2.4 NSL (Normalized Schedule Length): if the completion time of an algorithm is taken as the makespan of that algorithm, then the NSL of the scheduling algorithm is defined as follows:

NSL = makespan / Σ over ni ∈ CP of w(ni)
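These ratios translate directly into code. Below is a minimal sketch of the four metrics of this section; the function names and the example values (makespan, serial time, processor count, critical path bound) are illustrative assumptions, not figures from the paper.

def speedup(serial_time, makespan):
    # Speedup = sequential execution time / parallel execution time.
    return serial_time / makespan

def efficiency(serial_time, makespan, num_processors):
    # Efficiency = Speedup / number of processors used.
    return speedup(serial_time, makespan) / num_processors

def slr(makespan, cp_min_cost):
    # SLR = makespan / sum of the minimum task weights along the critical path.
    return makespan / cp_min_cost

def nsl(makespan, cp_cost):
    # NSL = makespan / sum of the critical path computation costs.
    return makespan / cp_cost

# Example: makespan 120 s, serial time 600 s, 8 processors, CP bound 100 s.
print(speedup(600, 120))        # 5.0
print(efficiency(600, 120, 8))  # 0.625
print(nsl(120, 100))            # 1.2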
3. Thread in parallel processing algorithms
A program in execution is known as a process; lightweight processes are called threads. The commonly used term "thread" is taken from "thread of control", which is simply a sequence of statements in a program. In shared memory architectures a single process may have multiple threads of control, and shared memory programs typically use dynamic threads: a master thread controls a collection of worker threads. The master thread forks worker threads, the workers complete their assigned requests, and on termination they join the master thread again. This pattern makes efficient use of system resources, because threads hold resources only while they are in the running state, which reduces the idle time of the participating processors, maximizes throughput, and helps minimize the communication overhead. Static threads, by contrast, are generated by a setup that is known in advance and run until the work is completed.
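As a concrete illustration of this fork/join pattern, the sketch below uses Python's standard threading and queue modules; the workload and the worker count are invented for the example.

import threading
import queue

work = queue.Queue()
for item in range(20):            # invented workload for the example
    work.put(item)

def worker():
    # Each worker drains the shared queue and terminates when it is empty,
    # so resources are held only while the thread has work to do.
    while True:
        try:
            item = work.get_nowait()
        except queue.Empty:
            return
        print("processed item", item)

# Fork: the master thread creates the collection of worker threads.
workers = [threading.Thread(target=worker) for _ in range(4)]
for t in workers:
    t.start()

# Join: the workers rejoin the master thread after termination.
for t in workers:
    t.join()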
3.1 Thread Benefits: Multithreaded programming has the following benefits:
• If some threads are damaged or blocked in a multithreaded application, the program can continue running. The application can keep interacting with the user even while another thread is busy, which increases responsiveness.
• In traditional parallel computing environments it is very costly to allocate memory during the execution of a process. In multithreading, all threads of a process share its allocated resources, so creation and context switching of threads is economical: creating and allocating a process is much more time consuming than doing so for a thread, and a thread therefore generates very low overhead in comparison with a process. On Solaris 2 systems, creating a process is about thirty times slower than creating a thread, and context switching is about five times slower for processes than for threads. Code sharing allows an application to have several different threads of activity within the same address space, so threads can share resources effectively and efficiently.
• Threads of a process may run simultaneously on different processors. Kernel-level thread parallelization increases the utilization of the processors; kernel-level threads are supported directly by the operating system without user-level thread intervention, a form of concurrency that traditional multiprogramming does not provide within a single process.
4. DAG
Any parallel program may be represented by a DAG (Directed Acyclic Graph) G = (V, E), where V is the set of nodes and E is the set of directed edges. The execution of parallel tasks requires more than one computing unit concurrently [13]. DAG scheduling has been investigated on sets of homogeneous and heterogeneous processors; in heterogeneous systems the processors have different speeds and execution capabilities. The computing units are connected by networks of different topologies (ring, star, mesh, hypercube, etc.), which may be partially or fully connected. Minimization of the execution time is the main objective of DAG scheduling [3, 11]: it allocates the tasks to the participating processors while preserving the precedence constraints. The schedule length of a schedule (program) is the overall finish time of the program [16]; this schedule length is also termed the mean flow time [5, 16] or the earliest finish time in many proposed algorithms.
A set of instructions which is executed sequentially on the same processor without preemption is known as a node of the DAG. The computational cost of a node ni is denoted by w(ni). Every edge (ni, nj) carries the corresponding communication message between the nodes while preserving the precedence constraints, and the communication cost between the nodes is denoted by c(ni, nj). A sink node is called a child node and a source node is called a parent node of the DAG. A node without a child is known as an exit node, while a node without a parent is called an entry node. The precedence constraints mean that a child node cannot be executed before the execution of its parent nodes. If two nodes are assigned to the same processor, the communication cost between them is assumed to be zero.
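These definitions map onto a small data structure. The following sketch is one possible encoding (all names are illustrative assumptions): nodes carry their computation cost w(ni), edges carry their communication cost c(ni, nj), and an edge costs zero when both endpoints are mapped to the same processor.

from dataclasses import dataclass, field

@dataclass
class DAG:
    w: dict = field(default_factory=dict)      # node -> computation cost w(ni)
    c: dict = field(default_factory=dict)      # (u, v) -> communication cost c(ni, nj)
    succ: dict = field(default_factory=dict)   # node -> list of child nodes

    def add_node(self, n, cost):
        self.w[n] = cost
        self.succ.setdefault(n, [])

    def add_edge(self, u, v, cost):
        self.c[(u, v)] = cost
        self.succ.setdefault(u, []).append(v)
        self.succ.setdefault(v, [])

    def comm_cost(self, u, v, proc_of):
        # Communication cost is assumed zero on the same processor.
        return 0 if proc_of[u] == proc_of[v] else self.c[(u, v)]

    def parents(self, v):
        return [u for u, kids in self.succ.items() if v in kids]

# Entry nodes have no parents; exit nodes have no children.
g = DAG()
g.add_node("n1", 3); g.add_node("n2", 5)
g.add_edge("n1", "n2", 4)
print(g.comm_cost("n1", "n2", {"n1": "P1", "n2": "P1"}))  # 0 (same processor)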
The loop structure of a program cannot be explicitly modeled in a DAG. Loop-separation techniques [6, 17] divide a loop into many tasks, so that all iterations of the loop can be started concurrently with the help of the DAG; this model can be applied efficiently to large data-flow problems [17]. In numerical applications such as Fast Fourier Transformation (FFT) or Gaussian elimination, the loop bounds are known at compilation time, so many iterations of the loop are encapsulated in a task and denoted by a single node in the DAG, as sketched below. Node weights and edge weights are calculated from the memory access operations and message passing primitives [18]. The granularity of the tasks is initially decided by the programmer, and the final granularity of the DAG is refined by the scheduling algorithms [11, 14].
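The encapsulation of loop iterations into DAG nodes can be sketched as follows; the chunk size and the per-iteration cost model are invented for the example.

def encapsulate_loop(num_iterations, chunk_size, iter_cost):
    # Group consecutive loop iterations into tasks; each task becomes one
    # DAG node. The loop bound must be known at compile time for this
    # static split (as in FFT or Gaussian elimination).
    tasks = []
    for tid, start in enumerate(range(0, num_iterations, chunk_size)):
        end = min(start + chunk_size, num_iterations)
        # Node weight: the cost of the iterations encapsulated in this task.
        tasks.append(("task%d" % tid, start, end - 1, (end - start) * iter_cost))
    return tasks

# 1000 iterations in chunks of 250 -> 4 independent DAG nodes started concurrently.
for t in encapsulate_loop(1000, 250, 0.01):
    print(t)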
Problems with conditional branches may also be analyzed with a DAG. In such scheduling, each branch is associated with some non-zero probability under the precedence constraints. Towsley [19] used a two-step method to generate the schedule. In the first step, he
tries to minimize the amount of indeterminism, while in the second step a reduced DAG is generated using pre-processor methods. The problem of conditional branches has also been investigated [15] by merging schedules to produce a unified schedule.
Better utilization of the computing resources may be achieved by preemptive systems [12]. Non-preemptive homogeneous systems use multistep algorithms, genetic algorithms, and graph-decomposition-based algorithms, while duplication-based and heuristic algorithms achieve better efficiency for non-preemptive systems. The scheduling cost increases as the number of processors grows for the execution of a large number of tasks, and this limitation may create serious problems in computations with a very large number of processors.
An ideal schedule assumes zero inter-process communication, efficient connectivity, and availability of the processors. Practically, however, efficient connectivity of parallel processors and contention-free communication cannot be achieved by distributed memory multicomputers, networks of workstations, or symmetric multiprocessors. Parallel processing scheduling therefore remains an eminent and open research area, and no clear and consistent set of assumptions has yet been stated by researchers.
5. Related Work
HEFT and CPFD are the algorithms compared against the proposed algorithm.
5.1 Critical Path Fast Duplication (CPFD) Algorithm: The final schedule length of any algorithm is determined by the critical path of the implemented schedule, so for a minimum schedule length the critical path value of the schedule must be minimum. Due to the precedence constraints, the finish time of the CP nodes is determined by the execution time of their parent nodes; therefore the scheduling of the parents of the critical path nodes must be determined prior to the scheduling of the CP nodes themselves.
In CPFD, the nodes of the DAG are partitioned into three categories:
• Critical Path Nodes (CPN) of the DAG;
• In-Branch Nodes (IBN) of the DAG;
• Out-Branch Nodes (OBN) of the DAG.
In-Branch Nodes are those from which a path leads to a critical path node. Out-Branch Nodes belong neither to the in-branch set nor to the critical path. After partitioning the problem, priorities are assigned to the critical path nodes, and according to these priorities the total execution time is calculated by CPFD.
The start time of a CPN can be minimized by its IBNs, so the IBNs take priority over the OBNs. The critical path of the DAG is constructed by the following procedure:
1. assign the first node (the entry node) as a CPN; go to the next node, and suppose nx is the next CPN;
repeat
2. if all predecessors of nx are in the queue then insert nx in the queue;
3. increment the queue position;
else
4. let ny be the predecessor of nx with the maximum b-level value which is not yet in the queue;
5. break ties by choosing the predecessor with the minimum t-level value;
6. if all the predecessors of ny are in the queue then
7. include all parents of ny with higher communication costs in the queue;
8. repeat these steps until all predecessors of nx are in the queue;
9. endif
10. advance nx to the next CPN;
11. append all OBNs to the queue in decreasing order of b-level;
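The b-level and t-level used in steps 4, 5, and 11 can be computed by two traversals of the DAG: the b-level of a node is the length of the longest path from the node to an exit node (including its own weight), while the t-level is the length of the longest path from an entry node to the node (excluding its own weight). A minimal sketch, reusing the w/c/succ encoding assumed earlier:

from functools import lru_cache

def levels(w, c, succ):
    # w: node -> computation cost; c: (u, v) -> communication cost;
    # succ: node -> list of children. Every node is assumed to be a key of w.
    pred = {n: [] for n in w}
    for u, kids in succ.items():
        for v in kids:
            pred[v].append(u)

    @lru_cache(maxsize=None)
    def b_level(n):
        # Longest path from n to an exit node, including w[n].
        kids = succ.get(n, [])
        if not kids:
            return w[n]
        return w[n] + max(c[(n, v)] + b_level(v) for v in kids)

    @lru_cache(maxsize=None)
    def t_level(n):
        # Longest path from an entry node to n, excluding w[n].
        if not pred[n]:
            return 0
        return max(t_level(u) + w[u] + c[(u, n)] for u in pred[n])

    return {n: b_level(n) for n in w}, {n: t_level(n) for n in w}

# Example on a two-node chain n1 -> n2:
b, t = levels({"n1": 3, "n2": 5}, {("n1", "n2"): 4}, {"n1": ["n2"], "n2": []})
print(b)  # {'n1': 12, 'n2': 5}
print(t)  # {'n1': 0, 'n2': 7}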
The sequence generated by the above steps preserves the precedence constraints: the CPNs are arranged in logical order so that each predecessor comes in front of its successors, and all nodes of the critical path are arranged in sequence on the basis of their static level [14]. For the DAG of Fig. 1, the ordering is generated as follows: the first node of the DAG, being the entry node, is the first node of the critical path. The second node has only one parent, so it takes the second position in the sequence. The following nodes are appended to the critical path sequence in decreasing order of b-level; where two nodes have the same b-level, the node with the smaller t-level is considered first. The remaining node is the last node of the critical path.
5.1.1 CPFD algorithm: Determine the critical path and partition the DAG into the three categories CPN, IBN, and OBN. Let the key node be the entry CPN.
1. let P be the set of processors;
2. for every processor in P do
a. determine the key node's start time on P;
b. let m be an unscheduled parent node of the key node;
c. assign m to the processor which is idle; if the assignment succeeds, accept it as a new start time for the inserted node;
d. if the assignment cannot be done, return control to examine the other processors;
3. allot the ready node to the processor which gives the earliest start time;
4. perform all required duplications;
5. repeat the procedure from step 2 for each CPN upon the set of processors;
6. until all CPN, IBN, and OBN nodes are assigned.
The complexity of the CPFD algorithm is O(v^4).
Table 1: Notations used in the paper

S.N.  Symbol            Description
1     CPN               Critical Path Node of the DAG
2     IBN               In-Branch Node of the DAG
3     OBN               Out-Branch Node of the DAG
4     AvgComp(ti)       Average computation cost of task ti
5     AvgComm(ti, tj)   Average communication cost between tasks ti and tj
6     P                 Group of processors
7     ti                ith task
8     ei                ith edge
6. Proposed Algorithm
The proposed algorithm follows a two-step approach to designing the scheduling algorithm. In the first step, priorities are assigned to the nodes of the DAG; in the second step, the processor giving the earliest start time is selected for each high-priority task. The performance of the task partitioning strategy depends upon the selected scheduling technique: if a task has the shortest NSL at the current step, then higher priorities must be assigned to the remaining tasks.
6.1 Task Assignment phase: In heterogeneous computing environments, a task shows different execution times on different processors. This problem may be resolved by inserting averaged parameters into the computational costs of the tasks. The average computational capability of the system processors may be computed as follows:
AvgComp(ti) = ( Σ over Pj ∈ P of w(ti, Pj) ) / |P|     (1)

The average communication cost between tasks assigned to different processors of the heterogeneous system is computed as follows:

AvgComm(ti, tj) = L + data(ti, tj) / B     (2)

where L is the average communication latency and B the average transfer rate of the network. The Earliest Completion Time of ti on Pj can be calculated as follows:

ECT(ti, Pj) = min over tk ∈ pred(ti) of ( w(tk) + AvgComm(tk, ti) )     (3)

The Earliest Finish Time of ti on Pj can be calculated as follows:

EFT(ti, Pj) = ECT(ti, Pj) + w(ti, Pj)     (4)
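Since equations (2) and (3) are only partially legible in the source, the sketch below should be read as an interpretation rather than the authors' exact formulas; the latency/bandwidth parameters and the predecessor-based form of ECT are assumptions.

def avg_comp(task_costs):
    # Eq. (1): average computation cost of a task over all processors.
    return sum(task_costs) / len(task_costs)

def avg_comm(data_size, avg_latency, avg_bandwidth):
    # Eq. (2), assumed HEFT-style form: average latency plus transfer time.
    return avg_latency + data_size / avg_bandwidth

def ect(pred_costs):
    # Eq. (3) as reconstructed: minimum over predecessors tk of
    # (w(tk) + AvgComm(tk, ti)); pred_costs is a list of (w, comm) pairs.
    return min(w + comm for w, comm in pred_costs)

def eft(ect_value, w_on_proc):
    # Eq. (4): earliest finish time = ECT plus the task's weight on the processor.
    return ect_value + w_on_proc

# Illustration: a task costing (4, 6, 8) time units on three processors.
print(avg_comp((4, 6, 8)))      # 6.0
print(avg_comm(100, 1.0, 50))   # 3.0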
In homogeneous computing systems, the critical path parameter is used to assign priorities to the tasks. A critical path is the collection of nodes for which the sum of computational costs is maximal, and it determines a lower bound for the task partitioning strategy; an efficient scheduling strategy must therefore assign priorities to the critical path nodes. When two tasks are assigned to the same processor, the communication cost between them is zero. The HEFT algorithm addresses this by assigning ranks to the tasks, while CPFD is a static task partitioning strategy and uses the critical path method to assign tasks to processors.
In the proposed algorithm, a computational cost factor is inserted to compute the priority of the tasks. This factor considers the average execution and computation time along with the communication and computational capabilities of the processors.
6.2 Pseudo Code for the proposed Algorithm:
1. assign wgt(ti) ← AvgComp(ti)
2. assign wgt(ei) ← AvgComm(ti, tj)
3. arrange the tasks in descending order of rank
4. while some tasks are not scheduled do
begin
select the first task ti from the priority queue
for every processor Pj in P do
begin
compute ECT(ti, Pj);
end;
assign task ti to the processor which minimizes EFT;
end;
5. repeat steps 1 to 4 until ranks are assigned to all tasks.
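Putting the pieces together, the following is a minimal list-scheduling sketch in the spirit of the pseudocode above, not the authors' implementation: the upward-rank priority, the data layout, and the assumption of positive costs are all choices made for illustration.

from functools import lru_cache

def tpsmt_like_schedule(tasks, succ, comm, w, num_procs):
    # tasks: task ids; succ: task -> list of children;
    # comm: (u, v) -> average communication cost; w: (task, proc) -> execution time.
    tasks = list(tasks)
    pred = {t: [] for t in tasks}
    for u in tasks:
        for v in succ.get(u, []):
            pred[v].append(u)

    avg_w = {t: sum(w[(t, p)] for p in range(num_procs)) / num_procs for t in tasks}

    @lru_cache(maxsize=None)
    def rank(t):
        # Upward rank: average weight plus the costliest path to an exit node.
        # With positive weights, descending rank order is a valid topological order.
        kids = succ.get(t, [])
        if not kids:
            return avg_w[t]
        return avg_w[t] + max(comm[(t, v)] + rank(v) for v in kids)

    proc_free = [0.0] * num_procs      # time at which each processor becomes idle
    eft, where = {}, {}
    for t in sorted(tasks, key=rank, reverse=True):
        best_finish, best_proc = None, None
        for p in range(num_procs):
            # Data-ready time on p: communication cost is zero when the
            # predecessor was assigned to the same processor.
            ready = max((eft[u] + (0.0 if where[u] == p else comm[(u, t)])
                         for u in pred[t]), default=0.0)
            finish = max(ready, proc_free[p]) + w[(t, p)]
            if best_finish is None or finish < best_finish:
                best_finish, best_proc = finish, p
        eft[t], where[t] = best_finish, best_proc
        proc_free[best_proc] = best_finish
    return eft, where

The makespan of the resulting schedule is max(eft.values()), from which the metrics of Section 2 (speedup, efficiency, SLR, NSL) follow directly.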
7. Simulation Results
The proposed algorithm is analyzed on the DAG shown in Fig. 1. The tasks associated with each other have different sizes and communication costs.
Figure 1: DAG with Communication Cost
Figure 2: Average SLR against CCR values
Figure 3: Average Speedup against different Number of Processors
Figure 4: Average Efficiency against different Number of Processors
8. Computational Analysis
The average efficiency produced by TPSMT, HEFT, and CPFD for {12, 24, 36, 48, 60, 72} processors is represented in Fig. 4. The plotted graph shows that the average efficiency of TPSMT is greater than that of HEFT and CPFD by 53.09% and 63.20% respectively, while the average efficiency of HEFT is 61.92% greater than that of CPFD. These results show that the performance of TPSMT is better than both compared algorithms; a higher efficiency corresponds to a smaller execution time.
Figure 3 depicts the average speedup obtained by the TPSMT, HEFT, and CPFD algorithms for an increasing number of processors (20, 50, 80, 110, 140, 170, 200, 230, 250). The graph makes clear that the average speedup of all three scheduling algorithms increases as the number of tasks increases. The proposed algorithm gains average speedup over HEFT and CPFD by 17.36% and 56.79% respectively, so for real-world practical problems the TPSMT algorithm outperforms the HEFT and CPFD algorithms.
The average NSL value of the three algorithms is plotted against the CCR value in Figure 2. The plotted graph shows that the performance of the compared algorithms is consistent: when the CCR value increases from the small to the medium range, the average NSL value increases slightly, and when the CCR becomes large, the average NSL value increases significantly. The average reduction in the NSL value of the proposed algorithm with respect to HEFT and CPFD is 8.08% and 28.40% respectively, and the average NSL value of HEFT is 25.30% less than that of CPFD. At the largest value, CCR = 10, the average NSL of TPSMT is 7.69% and 26.92% less than that of HEFT and CPFD respectively. These results indicate that when the communication cost is large, the scheduling of parallel applications becomes more complex.
9. Conclusion and Future Work
A new list-scheduling-based strategy, the Task Partitioning Strategy with Minimum Time (TPSMT), is proposed for heterogeneous computing environments. The algorithm contributes an average communication and computation cost factor for assigning priorities to the critical path nodes. When a DAG problem is scheduled on processors of different capabilities, the proposed algorithm allows for more than one critical path.
The performance of the proposed TPSMT algorithm is compared with the best existing algorithms, HEFT and CPFD, of which one is the best-known dynamic algorithm and the other (CPFD) is a static algorithm. TPSMT significantly outperforms the compared algorithms in terms of NSL and average speedup. The average SLR value of the proposed algorithm is reduced by 10.14% and 39.20% in comparison to HEFT and CPFD respectively over the CCR range 2-10, which shows that TPSMT is also efficient for applications with high communication costs.
As future work, the proposed algorithm can be extended to virtually connected heterogeneous computing environments, and processor utilization factors may be incorporated into the TPSMT algorithm.
References:
[1.] Amoura, Bampis, E., Konig, J.-C. “Scheduling Algorithms for Parallel Gaussian Elimination with Communication
Costs”. IEEE Transactions on Parallel and Distributed Systems 9(7), pp.679–686, 1998.
[2.] Kwok, Y., Ahmad, I.“Static Scheduling Algorithms for Allocating Directed Task Graphs to Multiprocessors”. ACM
Computing Surveys 31(4), pp.406–471, 1999.
[3.] Kim, J.-K., et al. “Dynamically Mapping Tasks with Priorities and Multiple Deadlines in a Heterogeneous
Environment”. Journal of Parallel and Distributed Computing 67, pp.154–169, 2007.
[4.] Barbosa, J., Morais, C., Nobrega, R., Monteiro, A.P. “Static Scheduling of Dependent Parallel Tasks on
Heterogeneous Clusters”. In: HeteroPar, IEEE Computer Society, Los Alamitos, pp. 1-8, 2005.
[5.] Bruno, J., Coffman, E. G., and Sethi, R.“Scheduling Independent Tasks to Reduce Mean Finishing Time”. Commun.
ACM 17, pp.380–390,1974.
[6.] Beck, M., Pingali, K., and Nicolau, A.“Static Scheduling for Dynamic Dataflow Machines”. J. Parallel Distrib.
Comput. 10, pp.275–291,1990.
[7.] Coffman, E. G. and Graham, R. L.“Optimal Scheduling for Two-Processor Systems”. Acta Inf. 1, pp.200–213, 1972.
[8.] Wood, D. A. and Hill, M. D. “Cost-Effective Parallel Computing”. IEEE Computer, pp. 67–71, 1995.
[9.] Gonzalez, M. J., Jr. “Deterministic processor scheduling”. ACM Comput. Surv. 9, pp.173–204,1977.
[10.] Haluk T., Salim H., Min Y. W.,“Performance-Effective and Low-Complexity Task Scheduling for Heterogeneous
Computing”.IEEE transactions on parallel and Distributed systems, Vol. 13, No. 3, pp. 262-279, 2002.
[11.] Kwok, Y.-K. and Ahmad, I.“Efficient Scheduling of Arbitrary Task Graphs to Multiprocessors Using A Parallel
Genetic Algorithm”. J. Parallel Distrib. Comput. 47, pp.55–78, 1997.
[12.] Rayward-Smith, V. J. “The Complexity of Preemptive Scheduling with Interprocessor Communication
Delays”. Inf. Process. Lett. 25, pp.120–128, 1987.
[13.] Wang, Q. and Cheng, K. H.“List Scheduling of Parallel Tasks”. Inf. Process. Lett. 37, pp.289–295,1991.
[14.] Yang, T. and Gerasoulis, A. “PYRROS: Static Task Scheduling and Code Generation for Message Passing
Multiprocessors”. In Proceedings of the 1992 International Conference on Supercomputing (ICS ’92, Washington,
DC), K. Kennedy and C. D. Polychronopoulos, Eds. ACM Press, New York, NY, pp.428–437, 1992.
[15.] El-Rewini, H., Lewis, T.G. “Scheduling Parallel Programs onto Arbitrary Target Machines”. Journal of Parallel and
Distributed Computing 9(2), pp.138–153, 1990.
[16.] Leung, J. Y.-T. and Young, G. H. “Minimizing schedule length subject to minimum flow time”. SIAM J. Comput.
18, pp.310–333, 1989.
[17.] Lee, B., Hurson, A. R., and Feng, T. Y. “A Vertically Layered Allocation Scheme for Data Flow Systems”. J.
Parallel Distrib. Comput. 11, pp.172–190, 1991.
[18.] Jiang, H., Bhuyan, L. N., and Ghosal, D. “Approximate Analysis of Multiprocessing Task Graphs”. In Proceedings
of the International Conference on Parallel Processing, pp.230–238,1990.
[19.] Towsley, D. “Allocating Programs Containing Branches and Loops within a Multiple Processor System”. IEEE
Trans. Softw. Eng. SE-12, pp.1018–1024,1986.
AUTHOR BIOGRAPHIES:
Dr. Rafiqul Zaman Khan is presently working as an Associate Professor in the Department of Computer Science at Aligarh Muslim University, Aligarh, India. He received his B.Sc. degree from M.J.P. Rohilkhand University, Bareilly, his M.Sc. and M.C.A. from A.M.U., and his Ph.D. (Computer Science) from Jamia Hamdard University. He has 18 years of teaching experience at various reputed international and national universities, viz. King Fahd University of Petroleum & Minerals (KFUPM), K.S.A., Ittihad University, U.A.E., Pune University, Jamia Hamdard University, and A.M.U., Aligarh. He worked as Head of the Department of Computer Science at Poona College, University of Pune, and as Chairman of the Department of Computer Science, A.M.U., Aligarh. His research interests include parallel and distributed computing, gesture recognition, expert systems, and artificial intelligence. Presently four students are pursuing their Ph.D. under his supervision, and he has published about 35 research papers in international journals and conferences.
Some reputed journals in which his articles have recently been published are the International Journal of Computer Applications (ISSN: 0975-8887), U.S.A.; the Journal of Computer and Information Science (ISSN: 1913-8989), Canada; the International Journal of Human Computer Interaction (ISSN: 2180-1347), Malaysia; and the Malaysian Journal of Computer Science (ISSN: 0127-9084), Malaysia. He is a member of the advisory board of the International Journal of Emerging Technology and Advanced Engineering (IJETAE) and of the editorial boards of the International Journal of Advances in Engineering & Technology (IJAET), the International Journal of Computer Science Engineering and Technology (IJCSET), the International Journal in Foundations of Computer Science & Technology (IJFCST), and the Journal of Information Technology and Organizations (JITO).
Javed Ali is a research scholar in the Department of Computer Science, Aligarh Muslim University, Aligarh. His research interests include parallel computing in distributed systems. He did his B.Sc. (Hons) in mathematics and M.C.A. at Aligarh Muslim University, Aligarh. He has published seven international research papers in reputed journals, and he received a state-level scientist award from the government of India.