International Journal of Application or Innovation in Engineering & Management (IJAIEM) Web Site: www.ijaiem.org Email: editor@ijaiem.org, editorijaiem@gmail.com Volume 2, Issue 7, July 2013 ISSN 2319 - 4847

Task Partitioning Strategy with Minimum Time (TPSMT) in Distributed Parallel Computing Environment

1 Rafiqul Zaman Khan, 2 Javed Ali
1 Associate Professor, Department of Computer Science, Aligarh Muslim University, Aligarh.
2 Research Scholar, Department of Computer Science, Aligarh Muslim University, Aligarh.

ABSTRACT
High performance in heterogeneous parallel computing may be achieved by an efficient scheduling algorithm. Communication overhead and the scalability of the problem are important parameters for scheduling algorithms. Scheduling DAG-based problems on a heterogeneous distributed computing environment is NP-complete, and such problems must be solved subject to precedence constraints. In this paper we propose the Task Partitioning Strategy with Minimum Time (TPSMT) algorithm, which shows better performance than HEFT and CPFD in terms of communication cost and scalability. On the basis of the evaluated results, it achieves a better Normalized Schedule Length (NSL) than the conventional algorithms.

Keywords: Normalized Schedule Length, Communication Cost, High Performance Parallel Computing.

1. INTRODUCTION
Efficient task partitioning is necessary to achieve remarkable performance in multiprocessing systems. When the number of processors is large, scheduling is NP-complete. The complexity of an algorithm depends upon its execution time [1]. The objective function of any task partitioning strategy must consider the relevant computing factors, yet many proposed parallel computing algorithms consider only a few of these factors concurrently. Task partitioning strategies are categorized as preemptive versus non-preemptive, adaptive versus non-adaptive, dynamic versus static, and distributed versus centralized [2, 3].
In static task partitioning strategies, runtime overhead is negligible because the nature of every task is known prior to execution. If all performance factors are known before the tasks execute, static scheduling shows better results [4]; in practice, however, it is impossible to estimate all communication factors prior to execution. In round-robin systems, every processor of the computing system executes the same number of tasks. Dynamic scheduling balances the load by moving scheduling decisions from the compiler to the processors [6]. The performance of the system is directly affected by the task partitioning strategy [6]. Centralized and distributed scheduling show different behavior in terms of overhead and algorithmic complexity. In distributed scheduling, the scheduling information is spread among the storage devices and the related computing units. Source-initiated scheduling offers tasks from highly loaded processors to lightly loaded processors [7], while in sink-initiated scheduling, tasks are transferred from lightly loaded processors to highly loaded ones [5]. In symmetric scheduling, load may be transferred in either direction. In centralized strategies, one or more processing units are used to share the global information. Contention to access this global information is the major drawback of centralized scheduling algorithms, as it tends to starve the existing resources; this effect may be reduced by hierarchical scheduling systems. A further serious drawback of centralized architectures is that some processors used in centralized scheduling cannot participate in the execution of tasks: they are used only as storage devices. Dynamic scheduling plays an important role in the management of resources due to its runtime interaction with the underlying architecture [8].
In a large number of application areas, such as real-time computing and query processing, dynamic scheduling outperforms static scheduling.

2. Performance Metrics
2.1 Schedule Length Ratio (SLR): The ratio of the parallel time to the sum of the weights of the critical path tasks on the fastest processor.
2.2 Speedup: The ratio of the sequential execution time to the parallel execution time.
2.3 Efficiency: The ratio of the speedup value to the number of processors used to schedule the graph. The efficiency of a scheduling algorithm is computed as:

Efficiency = Speedup / Number of Processors

2.4 NSL (Normalized Schedule Length): If the completion time of an algorithm is taken as the makespan of that algorithm, then the NSL of a scheduling algorithm is defined as:

NSL = Makespan / (Sum of the computational costs of the critical path nodes)

3. Threads in parallel processing algorithms
A program in execution is known as a process; lightweight processes are called threads. The commonly used term "thread" is taken from "thread of control", which is simply a sequence of statements in a program. In shared memory architectures, a single process may have multiple threads of control. Shared memory programs use dynamic threads, in which a master thread controls a collection of worker threads: the master forks worker threads, these threads complete their assigned requests, and after termination they join the master thread again. This makes efficient use of system resources, because threads consume resources only while in the running state, which reduces the idle time of the participating processors. These facilities of threads maximize throughput and help minimize communication overhead. Static threads, in contrast, are generated by a setup known in advance and run until the work is completed.
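The four metrics of Section 2 can be written directly as small helper functions. This is an illustrative sketch: the function names and the example timing values are my own, not taken from the paper.

```python
def speedup(sequential_time, parallel_time):
    # Section 2.2: ratio of sequential to parallel execution time.
    return sequential_time / parallel_time

def efficiency(speedup_value, num_processors):
    # Section 2.3: ratio of speedup to the number of processors used.
    return speedup_value / num_processors

def slr(parallel_time, cp_weights_on_fastest):
    # Section 2.1: parallel time over the summed weights of the
    # critical-path tasks on the fastest processor.
    return parallel_time / sum(cp_weights_on_fastest)

def nsl(makespan, cp_weights):
    # Section 2.4: makespan over the critical-path lower bound.
    return makespan / sum(cp_weights)

# Made-up example: a 120 s sequential run finishing in 30 s on 8 processors.
s = speedup(120.0, 30.0)   # 4.0
e = efficiency(s, 8)       # 0.5
```

Note that efficiency is bounded above by 1.0: a value near 1.0 means the extra processors are being used almost perfectly.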
3.1 Thread Benefits: Multithreaded programming has the following benefits. If some threads are damaged or blocked in a multithreaded application, the program can continue running. It facilitates user interaction even while a process is busy in another thread, so it increases responsiveness to users. Allocating memory during the execution of a process is very costly in traditional parallel computing environments; in multithreading, all threads of a process share its allocated resources, so creation and context switching of the threads of a process are economical. Because creating and allocating a process is much more time-consuming than doing so for a thread, threads generate very low overhead in comparison with processes: in Solaris 2 systems, creating a process is about thirty times slower than creating a thread, and context switching is about five times slower for processes than for threads. The code-sharing facility of threads allows an application to have several different threads of activity within the same address space; in this way threads may share resources effectively and efficiently. The threads of a process may run simultaneously on different processors, and kernel-level thread parallelization increases processor utilization. Kernel-level threads are supported directly by the operating system without user-thread intervention, whereas in traditional multiprogramming such kernel-level units of execution cannot be generated by the operating system.

4. DAG
Any parallel program may be represented by a DAG (Directed Acyclic Graph). In this model, a parallel program is represented by G = (V, E), where V is the set of nodes and E is the set of directed edges. The execution of parallel tasks requires more than one computing unit concurrently [13]. DAG scheduling has been investigated on both homogeneous and heterogeneous environments; in heterogeneous systems, the processors have different speeds and execution capabilities.
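The fork/join master-worker pattern described in Section 3 can be sketched in a few lines. This is an illustrative example only (the worker's "request" is a stand-in squaring operation, and all names are invented); it shows the master forking workers, the workers draining a shared queue, and the workers joining the master on completion.

```python
import threading
import queue

def worker(tasks, results):
    # Each worker repeatedly takes a request from the shared queue,
    # services it, and exits when it sees the sentinel value None.
    while True:
        item = tasks.get()
        if item is None:          # sentinel: no more work
            tasks.task_done()
            break
        results.put(item * item)  # stand-in for a real request
        tasks.task_done()

def master(num_workers=4, num_tasks=16):
    tasks, results = queue.Queue(), queue.Queue()
    threads = [threading.Thread(target=worker, args=(tasks, results))
               for _ in range(num_workers)]
    for t in threads:
        t.start()                 # "fork" the worker threads
    for i in range(num_tasks):
        tasks.put(i)
    for _ in threads:
        tasks.put(None)           # one sentinel per worker
    for t in threads:
        t.join()                  # workers rejoin the master
    return sorted(results.queue)  # collect results in a stable order

out = master(2, 4)
```

In CPython the threads here interleave rather than run truly in parallel on CPU-bound work, but the fork/join structure is the same one the section describes.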
The computing environments are connected by networks of different topologies. These topologies (ring, star, mesh, hypercube, etc.) may be partially or fully connected. Minimization of the execution time is the main objective of DAG scheduling [3, 11]: it allocates the tasks to the participating processors while preserving the precedence constraints. The schedule length of a schedule (program) is the overall finish time of the program [16]; this schedule length is also termed the mean flow time [5, 16] or the earliest finish time in many proposed algorithms. A set of instructions executed sequentially on the same processor without preemption is known as a node of the DAG. The computational cost of a node n_i is denoted by w(n_i). Every edge (n_i, n_j) carries the corresponding communication message between the nodes while preserving the precedence constraints, and the communication cost between the nodes is denoted by c(n_i, n_j). The sink node of an edge is called the child node and the source node is called the parent node. A node without a child node is known as an exit node, while a node without a parent node is called an entry node. The precedence constraints mean that a child node cannot be executed before the execution of its parent nodes. If two nodes are assigned to the same processor, the communication cost between them is assumed to be zero. The loop structure of a program cannot be explicitly modeled in a DAG; loop-separation techniques [6, 17] divide a loop into many tasks, and all iterations of the loop are started concurrently with the help of the DAG. This model can be used for large dataflow problems to apply the scheduling techniques efficiently [17]. In Fast Fourier Transformation (FFT) or Gaussian Elimination (numerical applications) the loop bounds are known at compile time, so many iterations of the loop are encapsulated in a task and denoted by a node of the DAG. Message passing primitives and memory access operations are the means of calculating node weights and edge weights [18]. The granularity of the tasks is also decided by the programmer.
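The DAG model above maps naturally onto two dictionaries: node weights w(n_i) and edge costs c(n_i, n_j). The node names and cost values below are invented for illustration; the zero-cost rule for co-located nodes is the one stated in the text.

```python
# w[n]: computational cost of node n; c[(u, v)]: communication cost of
# the directed edge (u, v). Example values only.
w = {"n1": 2, "n2": 3, "n3": 3, "n4": 4}
c = {("n1", "n2"): 4, ("n1", "n3"): 1, ("n2", "n4"): 1, ("n3", "n4"): 5}

def parents(node):
    # Parent (source) nodes of all edges entering `node`.
    return [u for (u, v) in c if v == node]

def comm_cost(u, v, proc_u, proc_v):
    # Communication cost is zero when both nodes are assigned to the
    # same processor, as stated above.
    return 0 if proc_u == proc_v else c[(u, v)]

entry_nodes = [n for n in w if not parents(n)]                  # no parent
exit_nodes = [n for n in w if not any(u == n for (u, v) in c)]  # no child
```

Here n1 is the entry node and n4 the exit node, and `comm_cost("n1", "n2", "p0", "p0")` is 0 while across processors it is the full edge weight 4.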
The final granularity of the DAG is refined by the scheduling algorithms [11, 14]. The problems of conditional branches may also be analyzed with a DAG: each branch is associated with some non-zero probability subject to precedence constraints. [19] used a two-step method to generate a schedule: the first step tries to minimize the amount of indeterminism, while in the second step a reduced DAG is generated using pre-processor methods. The problem of conditional branches has also been investigated [15] by merging the schedules to produce a unified schedule. Good utilization of the computing resources may be achieved by preemptive systems [12]. Non-preemptive homogeneous systems use multistep algorithms, genetic algorithms and graph-decomposition based algorithms, while duplication-based and heuristic algorithms achieve better efficiency for non-preemptive systems. Scheduling cost increases as the number of processors grows for the execution of a large number of tasks, and this limitation may generate serious problems in computations with very large numbers of processors. An ideal schedule assumes zero inter-process communication, efficient connectivity and availability of the processors; in practice, however, efficient connectivity of parallel processors and contention-free communication cannot be achieved by distributed memory multicomputers, networks of workstations or symmetric multiprocessors. Parallel processing scheduling remains an eminent and open research area, so no clear and consistent set of assumptions has yet been stated by researchers.

5. Related Work
HEFT and CPFD are the algorithms compared with the proposed algorithm.
5.1 Critical Path Fast Duplication (CPFD) Algorithm: The final schedule length of any algorithm is determined by the critical path (CP) of the implemented schedule, so for a minimum schedule length the critical path value of the schedule must be minimal. Due to the precedence constraint limitations, the finish time of the CP nodes is determined by the execution time of their parent nodes; therefore the scheduling cost of the parents of the critical path nodes must be determined prior to the scheduling cost of the CP nodes. In CPFD, the nodes of the DAG are partitioned into three categories: Critical Path Nodes (CPN), In-Branch Nodes (IBN) and Out-Branch Nodes (OBN). In-Branch Nodes are those from which a path leads to a critical path node. Out-Branch Nodes belong neither to the in-branch set nor to the critical path. After partitioning the problem, priorities are assigned to the critical path nodes, and according to these priorities the total execution time is calculated by CPFD. The start time of a CPN can be minimized through its IBNs, so the IBNs take priority over the OBNs. The critical path of the DAG is constructed by the following procedure:

1. assign the entry CPN as the first node of the queue; go to the next CPN, say n_x; repeat
2. if all predecessors of n_x are in the queue then insert n_x in the queue;
3. increment the queue value; else
4. let n_y be the predecessor of n_x having the maximum b-level value which is not in the queue;
5. a tie is broken by choosing the predecessor with the minimum t-level value;
6. if all the predecessors of n_y are in the queue then
7. include all parents of n_y in the queue, those with higher communication costs first;
8. repeat all steps until all predecessors of n_x are in the queue;
9. endif
10. move to the next CPN;
11. append all OBNs to the queue in decreasing order of b-level;

The sequence generated by the above steps preserves the precedence constraints: the nodes are arranged in logical order so that every predecessor comes before its successors. All nodes of the critical path are arranged in a sequence, ordered on the basis of static level [14].
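The b-level and t-level values used by the procedure above follow the usual definitions: the t-level of a node is the length of the longest path from an entry node to that node (excluding the node's own weight), and the b-level is the length of the longest path from the node to an exit node (including its own weight). A minimal sketch on an invented four-node DAG:

```python
# Example DAG: node weights w and edge communication costs c
# (made-up values, not the paper's Figure 1).
w = {"n1": 2, "n2": 3, "n3": 3, "n4": 4}
c = {("n1", "n2"): 4, ("n1", "n3"): 1, ("n2", "n4"): 1, ("n3", "n4"): 5}

def t_level(n):
    # Longest entry-to-n path, excluding w[n] itself.
    preds = [u for (u, v) in c if v == n]
    if not preds:
        return 0
    return max(t_level(u) + w[u] + c[(u, n)] for u in preds)

def b_level(n):
    # Longest n-to-exit path, including w[n] itself.
    succs = [v for (u, v) in c if u == n]
    if not succs:
        return w[n]
    return w[n] + max(c[(n, v)] + b_level(v) for v in succs)

# Ordering as in the procedure: highest b-level first,
# ties broken by the smaller t-level.
order = sorted(w, key=lambda n: (-b_level(n), t_level(n)))
```

On this example the ordering comes out as n1, n3, n2, n4: the heavy n3-to-n4 edge pulls n3 ahead of n2 even though their node weights are equal.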
According to the above steps, the ordering of the nodes is generated as follows. The first node in the sequence is the entry node of the DAG, which is also the first critical path node. The second critical path node has only one parent, so it takes the second position in the sequence. The in-branch parents of each subsequent critical path node are appended before that node is inserted into the sequence; where two candidate nodes have the same b-level, the one with the smaller t-level is considered first. The exit node of the critical path is placed last.

5.1.1 CPFD algorithm: Determine the critical path and partition the DAG into the three categories CPN, IBN and OBN. Let the key node be the entry node of the critical path.

1. Let P be the set of processors;
2. for every processor in P do
   a. Determine the key node's start time on the processor;
   b. Let m be an unscheduled node on the critical path;
   c. Assign m to the processor which is idle. If the assignment is done, accept it as the new start time for the inserted node;
   d. If the assignment cannot be done, return control to examine other processors;
3. Allot to the ready node the processor which gives the earliest start time;
4. Perform all required duplications;
5. repeat the procedure from step 2 to step 5 for each node upon the set of processors;
6. until all nodes are assigned.

The complexity of the CPFD algorithm is O(v^4).

Table 1: Notations used in the paper
S.N.  Symbol       Description
1     CPN          Critical Path Node of DAG
2     IBN          In-Branch Node of DAG
3     OBN          Out-Branch Node of DAG
4     w(t_i)       average computational cost of task t_i
5     c(t_i, t_j)  communication cost between tasks t_i and t_j
6     P            group of processors
7     t_i          ith task
8     e_i          ith edge

6. Proposed Algorithm
The proposed algorithm follows a two-step approach to designing the scheduling algorithm. In the first step, priorities are assigned to the nodes of the DAG; in the second step, the processors giving the earliest start time are selected for the tasks with the highest priorities.
The performance of the task partitioning strategy depends upon the selected scheduling technique. If a task has a shorter NSL at the current step, then higher priorities must be assigned to the remaining tasks.

6.1 Task Assignment phase: In heterogeneous computing environments, a task shows different execution times on different processors. This problem may be resolved by inserting some parameters into the computational costs of the tasks. The average computational cost of task t_i over the system processors may be computed as follows:

AvgComp(t_i) = ( sum over p in P of w(t_i, p) ) / |P|    (1)

The average communication cost between tasks t_i and t_j across the different processors of the heterogeneous system is computed analogously:

AvgComm(t_i, t_j) = average of c(t_i, t_j) over all pairs of distinct processors    (2)

The Earliest Completion Time of t_i on processor p_j can be calculated as follows:

ECT(t_i, p_j) = min( w(t_k) + c(t_k, t_i) ) over the parents t_k of t_i    (3)

The Earliest Finish Time of t_i on p_j can be calculated as follows:

EFT(t_i, p_j) = ECT(t_i, p_j) + w(t_i, p_j)    (4)

The critical path parameter is used to assign priorities to the tasks in homogeneous computing systems. A critical path is a collection of nodes for which the sum of the computational costs is optimal; this CP is used to determine the lower bound of the task partitioning strategy, so there must be an efficient scheduling strategy to assign priorities to the critical path nodes. When two tasks are assigned to the same processor, the communication cost between them is zero. The HEFT algorithm resolves this problem by assigning ranks to the tasks. CPFD is a static task partitioning strategy, so it uses the critical path method to assign the tasks to the processors. In the proposed algorithm, a computational cost factor is inserted to compute the priority of the tasks; this factor considers the average execution and computation time along with the communication and computational capabilities.
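Equations (1)-(4) can be sketched as follows. This is a hedged reading, not the paper's implementation: the cost tables W and C are invented example values, the helper names are my own, and the ready-time rule in `ect` (a parent's message is free when parent and child share a processor) is one plausible interpretation of Eq. (3) consistent with the zero-cost rule stated above.

```python
# W[t][p]: execution time of task t on processor p; C[(u, v)]: edge cost.
W = {"t1": {"p1": 4, "p2": 6}, "t2": {"p1": 3, "p2": 5}}
C = {("t1", "t2"): 2}

def avg_comp(t):
    # Eq. (1): mean execution time of task t over all processors.
    return sum(W[t].values()) / len(W[t])

def ect(t, p, finish, placed):
    # Reading of Eq. (3): t is ready on p once every parent has finished
    # and its message (free if co-located on p) has arrived.
    preds = [u for (u, v) in C if v == t]
    if not preds:
        return 0
    return max(finish[u] + (0 if placed[u] == p else C[(u, t)])
               for u in preds)

def eft(t, p, finish, placed):
    # Eq. (4): earliest finish = ECT plus the execution cost of t on p.
    return ect(t, p, finish, placed) + W[t][p]
```

With t1 finished at time 4 on p1, t2 is ready on p1 immediately (co-located, no message cost) but only at time 6 on p2, after the cross-processor message.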
6.2 Pseudo Code for proposed Algorithm:
1. assign wgt(t_i) ← AvgComp(t_i)
2. assign wgt(e_i) ← AvgComm(t_i, t_j)
3. arrange the tasks in descending order of rank
4. while some tasks are not scheduled do
   begin
     select-first(t_i) from the priority queue
     for every processor p_j in P do
     begin
       compute ECT(t_i, p_j);
     end;
     assign task t_i to the processor which minimizes EFT;
   end;
5. repeat steps 1 to 4 until ranks are assigned to all tasks.

7. Simulation Results
The analysis of the proposed algorithm is carried out on the DAG in Fig. (1). The tasks associated with each other have different sizes and communication costs.

Figure 1: DAG with Communication Cost
Figure 2: Average SLR against CCR values
Figure 3: Average Speedup against different Number of Processors
Figure 4: Average Efficiency against different Number of Processors

8. Computational Analysis
The average efficiency produced by TPSMT, HEFT and CPFD for {12, 24, 36, 48, 60, 72} processors is represented in Fig. (4). The plotted graph shows that the average efficiency of TPSMT is greater than that of HEFT and CPFD by 53.09% and 63.20% respectively, while the average efficiency of HEFT is 61.92% greater than that of CPFD. These results show that the performance of TPSMT is better than both compared algorithms; a large efficiency corresponds to a minimal execution time. The speedup values obtained by the three algorithms are plotted against an increasing number of processors.
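The list-scheduling loop of the pseudocode in Section 6.2 can be made runnable in a few lines. This is a sketch under my own assumptions, not the paper's code: the three-task DAG, the two-processor cost table, and the rank order passed in are all invented example data, and insertion into idle gaps is ignored for brevity (each processor simply becomes free when its last task finishes).

```python
# W[t][p]: execution time of task t on processor p; C[(u, v)]: edge cost.
W = {"t1": {"p1": 3, "p2": 4}, "t2": {"p1": 2, "p2": 6},
     "t3": {"p1": 5, "p2": 2}}
C = {("t1", "t2"): 2, ("t1", "t3"): 1}

def schedule(tasks_in_rank_order):
    free = {"p1": 0, "p2": 0}   # time at which each processor becomes idle
    finish, placed = {}, {}
    for t in tasks_in_rank_order:
        best = None
        for p in free:
            # Ready time on p: all parents finished, messages arrived
            # (a co-located parent's message costs nothing).
            ready = max([finish[u] + (0 if placed[u] == p else C[(u, t)])
                         for (u, v) in C if v == t] or [0])
            f = max(ready, free[p]) + W[t][p]
            if best is None or f < best[0]:
                best = (f, p)   # keep the processor minimizing EFT
        f, p = best
        finish[t], placed[t], free[p] = f, p, f
    return finish, placed

finish, placed = schedule(["t1", "t2", "t3"])
```

On this example t1 and t2 land on p1, while t3 goes to p2: paying the small t1-to-t3 message cost there beats queueing behind t2, giving a makespan of 6 instead of 10.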
Figure (3) depicts the average speedup obtained by the TPSMT, HEFT and CPFD algorithms for a range of processors (20, 50, 80, 110, 140, 170, 200, 230, 250). From the graph it is clear that the average speedup of all three scheduling algorithms increases as the number of tasks increases. The proposed algorithm gains a higher average speedup than HEFT and CPFD by 17.36% and 56.79% respectively. For real-world practical problems, the TPSMT algorithm outperforms the HEFT and CPFD algorithms. The CCR values for the three algorithms are plotted against the average NSL value in Figure (2). The plotted graph shows that the performance of the compared algorithms is consistent. We can conclude that if the CCR value increases from the small to the medium range, the average NSL value increases slightly; when the CCR becomes large, the average NSL value increases significantly. The average reduction in the NSL value of the proposed algorithm with respect to HEFT and CPFD is 8.08% and 28.40% respectively, and the average NSL value of HEFT is 25.30% less than that of the CPFD algorithm. At the largest value, CCR = 10, the average NSL value of TPSMT is 7.69% and 26.92% less than that of HEFT and CPFD respectively. These results indicate that when the communication cost is large, the scheduling of parallel applications is complex.

9. Conclusion and Future Work
A new list-scheduling based strategy called the Task Partitioning Strategy with Minimum Time (TPSMT) is proposed for heterogeneous computing environments. This algorithm contributes an average communication and computation cost factor to assign priorities to the critical path nodes. If we schedule the DAG problem on processors of different capabilities, then according to the proposed algorithm there may be more than one critical path. The performance of the proposed TPSMT algorithm is compared with the best existing algorithms, HEFT and CPFD; of these, one is the best known dynamic algorithm and the other (CPFD) is a static algorithm.
TPSMT significantly outperforms the compared algorithms in terms of NSL and average speedup values. The average SLR value of the proposed algorithm is reduced by 10.14% and 39.20% in comparison with HEFT and CPFD respectively over the CCR range 2-10. This shows that TPSMT is also efficient for applications with high communication costs. As future directions, the proposed algorithm may be extended to virtually connected heterogeneous computing environments, and processor utilization factors may be inserted into the TPSMT algorithm.

References:
[1.] Amoura, A. K., Bampis, E., Konig, J.-C. "Scheduling Algorithms for Parallel Gaussian Elimination with Communication Costs". IEEE Transactions on Parallel and Distributed Systems 9(7), pp. 679-686, 1998.
[2.] Kwok, Y., Ahmad, I. "Static Scheduling Algorithms for Allocating Directed Task Graphs to Multiprocessors". ACM Computing Surveys 31(4), pp. 406-471, 1999.
[3.] Kim, J.-K., et al. "Dynamically Mapping Tasks with Priorities and Multiple Deadlines in a Heterogeneous Environment". Journal of Parallel and Distributed Computing 67, pp. 154-169, 2007.
[4.] Barbosa, J., Morais, C., Nobrega, R., Monteiro, A. P. "Static Scheduling of Dependent Parallel Tasks on Heterogeneous Clusters". In: HeteroPar, IEEE Computer Society, Los Alamitos, pp. 1-8, 2005.
[5.] Bruno, J., Coffman, E. G., and Sethi, R. "Scheduling Independent Tasks to Reduce Mean Finishing Time". Commun. ACM 17, pp. 380-390, 1974.
[6.] Beck, M., Pingali, K., and Nicolau, A. "Static Scheduling for Dynamic Dataflow Machines". J. Parallel Distrib. Comput. 10, pp. 275-291, 1990.
[7.] Coffman, E. G. and Graham, R. L. "Optimal Scheduling for Two-Processor Systems". Acta Inf. 1, pp. 200-213, 1972.
[8.] David A. W. and Mark, D. H.
"Cost-Effective Parallel Computing". IEEE Computer, pp. 67-71, 1995.
[9.] Gonzalez, M. J., Jr. "Deterministic Processor Scheduling". ACM Comput. Surv. 9, pp. 173-204, 1977.
[10.] Haluk T., Salim H., Min Y. W. "Performance-Effective and Low-Complexity Task Scheduling for Heterogeneous Computing". IEEE Transactions on Parallel and Distributed Systems, Vol. 13, No. 3, pp. 262-279, 2002.
[11.] Kwok, Y.-K. and Ahmad, I. "Efficient Scheduling of Arbitrary Task Graphs to Multiprocessors Using a Parallel Genetic Algorithm". J. Parallel Distrib. Comput. 47, pp. 55-78, 1997.
[12.] Rayward-Smith, V. J. "The Complexity of Preemptive Scheduling with Interprocessor Communication Delays". Inf. Process. Lett. 25, pp. 120-128, 1987.
[13.] Wang, Q. and Cheng, K. H. "List Scheduling of Parallel Tasks". Inf. Process. Lett. 37, pp. 289-295, 1991.
[14.] Yang, T. and Gerasoulis, A. "PYRROS: Static Task Scheduling and Code Generation for Message Passing Multiprocessors". In Proceedings of the 1992 International Conference on Supercomputing (ICS '92), Washington, DC, K. Kennedy and C. D. Polychronopoulos, Eds., ACM Press, New York, NY, pp. 428-437, 1992.
[15.] H. El-Rewini, T. G. Lewis. "Scheduling Parallel Programs onto Arbitrary Target Machines". Journal of Parallel and Distributed Computing 9(2), pp. 138-153, 1990.
[16.] Leung, J. Y.-T. and Young, G. H. "Minimizing Schedule Length Subject to Minimum Flow Time". SIAM J. Comput. 18, pp. 310-333, 1989.
[17.] Lee, B., Hurson, A. R., and Feng, T. Y. "A Vertically Layered Allocation Scheme for Data Flow Systems". J. Parallel Distrib. Comput. 11, pp. 172-190, 1991.
[18.] Jiang, H., Bhuyan, L. N., and Ghosal, D. "Approximate Analysis of Multiprocessing Task Graphs". In Proceedings of the International Conference on Parallel Processing, pp. 230-238, 1990.
[19.] Towsley, D. "Allocating Programs Containing Branches and Loops within a Multiple Processor System". IEEE Trans. Softw. Eng. SE-12, pp. 1018-1024, 1986.

BIOGRAPHY OF AUTHORS:
Dr.
Rafiqul Zaman Khan is presently working as an Associate Professor in the Department of Computer Science at Aligarh Muslim University, Aligarh, India. He received his B.Sc. degree from M.J.P. Rohilkhand University, Bareilly, his M.Sc. and M.C.A. from A.M.U., and his Ph.D. (Computer Science) from Jamia Hamdard University. He has 18 years of teaching experience at various reputed international and national universities, viz. King Fahd University of Petroleum & Minerals (KFUPM), K.S.A., Ittihad University, U.A.E., Pune University, Jamia Hamdard University and AMU, Aligarh. He worked as Head of the Department of Computer Science at Poona College, University of Pune, and also as Chairman of the Department of Computer Science, AMU, Aligarh. His research interests include Parallel & Distributed Computing, Gesture Recognition, Expert Systems and Artificial Intelligence. Presently 04 students are doing their PhD under his supervision. He has published about 35 research papers in international journals and conferences. Some journals of repute in which his articles have recently been published are the International Journal of Computer Applications (ISSN: 0975-8887), U.S.A., the Journal of Computer and Information Science (ISSN: 1913-8989), Canada, the International Journal of Human Computer Interaction (ISSN: 2180-1347), Malaysia, and the Malaysian Journal of Computer Science (ISSN: 0127-9084), Malaysia. He is a member of the Advisory Board of the International Journal of Emerging Technology and Advanced Engineering (IJETAE) and of the Editorial Boards of the International Journal of Advances in Engineering & Technology (IJAET), the International Journal of Computer Science Engineering and Technology (IJCSET), the International Journal in Foundations of Computer Science & Technology (IJFCST) and the Journal of Information Technology and Organizations (JITO).

Javed Ali is a research scholar in the Department of Computer Science, Aligarh Muslim University, Aligarh. His research interests include parallel computing in distributed systems.
He did his B.Sc. (Hons) in Mathematics and M.C.A. from Aligarh Muslim University, Aligarh. He has published seven international research papers in reputed journals. He received a state-level scientist award from the Government of India.