PATH Algorithm for Adaptive Load Balancing on a Grid

A. K. Aggarwal, Robert Kent and Jun Wei
School of Computer Science, University of Windsor
Windsor, Ontario, CANADA, N9B 3P4

Abstract: The resources on a grid form a dynamic set. The compute resources of a grid resource-service provider may be distributed over a wide geographical area. If the resource-service provider is to use these resources effectively, the characteristics of the communication network must be known in addition to the characteristics of the compute nodes and the applications. Network bandwidth is an important network resource. As high performance applications move to grids of clusters connected through the Internet, the measurement of network bandwidth becomes a critical and challenging issue. We have developed a new method, the PATH (PAcket-Train with Header) algorithm, for the measurement of bandwidth. It is a sender-based algorithm that has been designed for ease of use in a grid environment. This paper presents the use of a load balancing algorithm in which PATH has been embedded. Our tests show that such an adaptive load balancing algorithm can improve the use of resources considerably.

Keywords: Bottleneck link bandwidth, single packet, packet pair, packet train, PATH, SMSP, SDSP, grid and resource allocation, load balancing.

1. Introduction
A load balancing algorithm for a heterogeneous distributed computing environment attempts to improve the throughput and the response time of a parallel or distributed application by ensuring optimum utilization of the available resources. On a grid, the load balancing algorithm should respond quickly to the changing workload and environmental conditions without causing much overhead. Hence it should customize its strategy according to the prevailing conditions.

If a grid consisting of many "producer" clusters is to provide service to any "consumer" client on the Internet, the client may not have any software other than the standard TCP/IP stack of its operating system. For providing service to such a client, when the volume of data required to be sent to the client is large, the data may have to be conditioned in a format appropriate for the bandwidth of the path.

If a high performance application is being solved on a geographically dispersed group of machines connected through the Internet, a load balancing algorithm may require information about the bandwidth of the paths to these machines. This paper presents the use of an algorithm, titled PATH, which can be embedded in a load-balancing algorithm for such an application, for measuring the bottleneck bandwidth between any two machines. The information obtained through PATH may also be used by a web server for conditioning the data appropriately when a large amount of data has to be sent to a "consumer" client. In a grid, new machines may become available for use and some available machines may have to be surrendered. A grid scheduler, therefore, may have to measure the bandwidth of the paths to new machines 'on the fly'. PATH is a sender-based method, specifically designed for providing relatively accurate data for such applications. PATH has been developed as part of a project for developing tool-kits for use in a multi-cluster grid, which is to support clients located anywhere on the Internet.

In the context of network performance characteristics, the term 'bandwidth' specifies the amount of data in bits that a network can transfer per second. For a multi-hop path, the capacity of one of the links of the path is the amount of data per second that the link can carry when there is no competing traffic. If there are H hops in the path and Ci is the capacity of link i, then the bottleneck link bandwidth [12] is

    C = min_{i = 0, ..., H} Ci        (1)  [19]
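To make equation (1) concrete, the short Python sketch below computes the bottleneck bandwidth of a path from a list of per-link capacities. The function name bottleneck_bandwidth and the capacity values are illustrative only; they are not taken from the paper.

    # Minimal sketch of equation (1): the bottleneck bandwidth of a path is the
    # smallest link capacity along it. The capacities below are hypothetical.

    def bottleneck_bandwidth(capacities_mbps):
        """Return the bottleneck bandwidth (Mbps) and the index of the bottleneck link."""
        bottleneck = min(capacities_mbps)
        return bottleneck, capacities_mbps.index(bottleneck)

    if __name__ == "__main__":
        # Hypothetical capacities C_0 ... C_H of the links of a path, in Mbps.
        link_capacities = [622.0, 155.0, 1.544, 100.0, 622.0]
        c, k = bottleneck_bandwidth(link_capacities)
        print(f"Bottleneck bandwidth {c} Mbps at link index {k}")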
There are many methods for measuring the bandwidth; however, not every available method is equally appropriate for use in every type of grid environment. The Network Weather Service [20] and the netlogger [21] require the loading of middleware on every machine of the grid. Hence these systems may not be directly usable for the type of grid applications being worked out in the project.

2. Background
Grid Application Toolkits are required to provide specialized services for distributed computing on the Internet so as to make it possible for application developers to use the grid easily. If the tool-kits are to help in meeting the challenges of grid computing, it is necessary to design and construct software which can sense the existing network environment and compute a usage model based on reasonable statistical considerations. Such sensing software can be effectively used only when it is integrated with a dynamic distributed environment.

Several research workers have developed distributed computing environments [1 to 5], which provide different types of facilities to the developer of distributed applications. WebDedip [6, 7], developed under a research project at Gujarat University, Ahmedabad, India, is an environment that was developed with ease of use as the core theme, to support HPC users such as physicists, mathematicians and civil engineers. Initially, the performance issues were not addressed. Later, a hybrid application-centric load balancer [8], consisting of six algorithms, a heuristic and a normalization process, was developed for optimizing the performance.

WebDedip has proved to be highly useful and reliable for developing applications in a distributed environment; it has features such as fault-tolerance, user-friendliness and near-optimal use of resources. A user has only to provide information about the dependencies and resources required by the processes in the application. Thereafter, WebDedip is able to manage the distributed system completely for all the processes in the application. It has built-in mechanisms for guaranteed data transfer, for process control and recovery, and for resolving problems and reporting them to the user and the System Manager. WebDedip provides a web-based user interface through a standard browser. However, WebDedip requires a dedicated network.

Since the grid is being developed, by common usage, as an IP-based network, the application tool has to be made more network-aware, so that it may be able to adapt process scheduling to the continuously varying environment. This is made possible by providing network parameters through a software module, and its associated protocol, which continuously monitors the status of the network. WebDedip is now being redesigned to work jointly with such a module so that an effective tool for use in a grid environment can be created.
Recruitment of free nodes for distributed computing is also required in a grid system. Other nodes may be available in a time-scheduled mode for recruitment. As information regarding the mode of the nodes and the state of the network is made available, the scheduler may be able to allocate the resources optimally. A network-aware WebDedip has the potential to be useful for higher-level applications envisaged under Globus or other similar grid projects, as common standards for interfacing such systems are developed by the grid community. A sender-based algorithm which is accurate and not resource-hungry would be suitable for embedding in the load balancing algorithm of an environment like WebDedip; this would enable an effective use of grid resources. PATH has been specifically developed for such an application. In this paper, we show, through simulation, that the task of optimum utilization of compute nodes on a grid cannot be achieved without the use of such an algorithm.

3. The Existing Algorithms
By general consensus, packet dispersion techniques are the preferred methods for the measurement of bandwidth. A number of research workers [9 to 14] have worked on developing new algorithms based on packet dispersion. A packet dispersion algorithm injects carefully spaced datagrams into the network and observes the differences between the dispersions of the probe packets. The packet dispersion techniques estimate the bandwidth of a link or of the end-to-end path by measuring this dispersion. Such methods are generally classified into single-packet methods, packet-pair methods and methods which require a train of packets to be sent for estimating the bandwidth. For use on a grid, the criteria of practicability and performance have to be considered.

Practicability: Most of the packet-pair algorithms [9 to 11] for estimating the link bandwidth need software to be installed on the receiver side. This may add another layer to the middleware required for a grid environment. Single-packet and sender-based packet-pair algorithms are easy to implement. Based on the TCP/IP protocol, they do not need the installation of any special software on the receiver side to measure the parameters of the network. However, the accuracy of measurement of the existing sender-based methods is much poorer than that of receiver-based algorithms.

Performance: To get better estimates of bandwidth, methods based on single-packet algorithms send out a large number of probe packets, which may consume a large part of the available bandwidth of the network. This affects the network performance adversely.

PAcket Train with Header (PATH) has been developed for use with grid applications. It is based on the previous work of the BProbe [12], packet-tailgating [10], Pathrate [12] and Cartouche [11] algorithms. PATH is sender-based and it consumes a relatively small amount of bandwidth for the same high accuracy in measurement that is available from some of the receiver-based algorithms.

4. The PATH Algorithm
4.1 Algorithm Description
The PATH algorithm is implemented in two steps.

First Step: We use a simple single-packet algorithm (SMSP) to check the network structure and to locate the bottleneck link Lk. Compared with the standard single-packet algorithm (SDSP) [13], the SMSP algorithm does not have to measure the bandwidth of each link of the whole network. It only checks the position of the minimal bandwidth, which is the bandwidth of the bottleneck link. Although SMSP can give an estimate of the bottleneck link bandwidth, this value may not be accurate. It is only used as a reference value for the second step of the PATH algorithm. Let the links from the source to the destination be numbered 1 to n, and let Lk (lying between routers Rk-1 and Rk) be the bottleneck link. The main purpose of the SMSP algorithm is to estimate the value of k. This greatly reduces the volume of traffic compared to the traffic that the SDSP algorithm generates.
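The paper does not give pseudo-code for SMSP, so the following Python sketch is only an illustration of the single-packet idea behind this first step: probe each router with TTL-limited packets of two sizes and treat the link with the largest size-dependent per-hop delay as the bottleneck. The probing here is simulated by a stub function (probe_rtt) driven by hypothetical link capacities, rather than by real UDP probes and ICMP replies.

    # Illustrative sketch of the SMSP idea (not the authors' implementation):
    # time probes of two sizes that expire at router R_j, take the per-hop growth
    # of the size-dependent delay as the serialization delay of link L_j, and pick
    # the link with the largest such delay as the bottleneck L_k.

    HYPOTHETICAL_CAPACITIES_MBPS = [100.0, 45.0, 1.544, 10.0, 100.0]  # links L_1 .. L_n

    def probe_rtt(ttl, packet_size_bytes):
        """Stub: simulated round-trip time (s) of one probe that expires at router R_ttl."""
        rtt = 0.0
        for capacity_mbps in HYPOTHETICAL_CAPACITIES_MBPS[:ttl]:
            rtt += 2 * 0.001                                        # fixed per-hop delay
            rtt += (packet_size_bytes * 8) / (capacity_mbps * 1e6)  # serialization delay
        return rtt

    def locate_bottleneck(n_links, small=64, large=1500):
        """Estimate k, the index of the bottleneck link, from size-dependent delays."""
        # size-dependent part of the cumulative delay up to each router R_j
        cumulative = [probe_rtt(j, large) - probe_rtt(j, small) for j in range(n_links + 1)]
        # serialization-delay difference contributed by each individual link L_j
        per_link = [cumulative[j] - cumulative[j - 1] for j in range(1, n_links + 1)]
        k = per_link.index(max(per_link)) + 1                       # links are numbered 1..n
        rough_mbps = ((large - small) * 8) / (per_link[k - 1] * 1e6)
        return k, rough_mbps

    if __name__ == "__main__":
        k, rough = locate_bottleneck(len(HYPOTHETICAL_CAPACITIES_MBPS))
        print(f"Estimated bottleneck link: L{k}, rough capacity {rough:.3f} Mbps")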
Second Step: We use a packet train with a header probe to measure the bandwidth of the link Lk. The source sends out a header packet H and a packet train T1, T2, ..., Tn. Both the header and the packet-train packets are UDP packets. All the packets Ti of the packet train are of the same size; Sh, the size of the header packet H, is much larger than St, the size of Ti. Each packet Ti contains only 8 bytes, used for identifying the packet. We denote the time-to-live (TTL) of a packet by tj if the packet expires after reaching router Rj. The TTL of all the packet-train packets Ti is tk, so the Ti packets stop at router Rk, which responds to the source with ICMP time-exceeded packets. These ICMP packets include the 8 bytes of original data from the UDP packets, which identify the packet. We can check these identifications to take into account the loss of packets and the out-of-order receipt of packets. The estimated interval Δt between successive ICMP replies is computed through the use of a probability density function [15]. The bottleneck bandwidth β can then be calculated from

    β = St / Δt        (2)

where St is the size of a packet in the packet train. In both steps, a number of packets are sent until the result converges; this takes care of the problem of the loss of some of the packets.
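The paper states that the interval Δt is estimated through a probability density function [15] but does not spell out the computation. The sketch below shows one hedged way this could be done: build a histogram of the inter-arrival times of the ICMP time-exceeded replies, take the densest bin as the estimate of Δt, and apply equation (2). The sample intervals and the 100-byte train-packet size are hypothetical.

    # Hedged sketch of the second step's estimate (equation 2): histogram the
    # inter-arrival times of the ICMP replies, take the densest bin as dt, and
    # compute beta = St / dt. The sample intervals below are hypothetical; a real
    # run would take them from the timestamps of the ICMP replies, matched by the
    # 8-byte packet identifiers.

    import statistics

    def estimate_bottleneck_bandwidth(intervals_s, packet_size_bytes, bins=20):
        """Estimate the bottleneck bandwidth (bits/s) from reply inter-arrival times."""
        lo, hi = min(intervals_s), max(intervals_s)
        width = (hi - lo) / bins or 1e-9
        # crude density estimate: count samples per bin and pick the densest bin
        counts = [0] * bins
        for dt in intervals_s:
            counts[min(int((dt - lo) / width), bins - 1)] += 1
        densest = counts.index(max(counts))
        # representative interval: mean of the samples falling in the densest bin
        in_bin = [dt for dt in intervals_s
                  if densest * width <= dt - lo < (densest + 1) * width
                  or (densest == bins - 1 and dt == hi)]
        dt_estimate = statistics.mean(in_bin)
        return (packet_size_bytes * 8) / dt_estimate

    if __name__ == "__main__":
        # Hypothetical inter-arrival times (s) of the ICMP replies for a 100-byte St.
        sample_intervals = [0.00052, 0.00051, 0.00090, 0.00052, 0.00053, 0.00051]
        beta = estimate_bottleneck_bandwidth(sample_intervals, packet_size_bytes=100)
        print(f"Estimated bottleneck bandwidth: {beta / 1e6:.2f} Mbps")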
The algorithm described in this paper has the limitation that it can be used only if the router at which the probe packets expire responds with an ICMP time-exceeded message. The routers on Internet paths are designed to respond with ICMP error messages. Moreover, ICMP rate limiting will not cause an error, because the final result is obtained on the basis of a large number of readings. The results will be distorted only if the particular router leads to a single-link path to a web server which is under a continuous denial-of-service attack. Under such an abnormal condition, however, no method of measurement will be able to provide correct results.

4.2 Cross Traffic
The PATH algorithm works better than the standard sender-based algorithms under cross traffic. We consider three cases to discuss the effect of cross traffic on the accuracy of the measurements by PATH as compared to that of other available algorithms. Cross traffic can exist in the links between the source computer and router Rk-1, or between router Rk and the destination computer.

Case 1, cross traffic in the links between router Rk and the destination computer: Because all the PATH packets stop at the router Rk, cross traffic in the links between router Rk and the destination computer will not affect the measurement of the PATH algorithm. The other sender-based algorithms, however, need to send packets to the destination computer. Cross traffic in the links between router Rk and the destination computer will distort the spacing among their probes, and this may adversely affect the accuracy of their bandwidth estimates.

Case 2, cross traffic in the links between the source computer and router Rk-1: In this case, the large header of the PATH probe train can make the packets T1, T2, ..., Tn stay together before they reach router Rk-1. If Sh is large enough, H will temporarily block the link Lj. If the packet-train packets T1, T2, ..., Tn reach the router Rj-1 during the transmission of the header packet H on the link Lj, they will queue in the router Rj-1 and wait until the header packet is received by Rj. H will, therefore, be immediately followed by the packet train. The link Lj may cause some packet dispersion, but the packets will wait in the queue of Rj and the intervals among the packet-train packets will become zero again. So if the header is large enough, it can keep the packet train together until the packets reach the bottleneck link.

Case 3, cross traffic in the link Lj: If the cross traffic does not go on to link Lj+1, it will not be in the same queue of router Rj. So the intervals between the packet-train packets will become zero again, and this kind of cross traffic will not affect the result of the PATH algorithm. It will, however, affect the spacing among the probes in the other sender-based algorithms. So under cross traffic, PATH may generate more accurate estimates of bandwidth than other sender-based methods.

4.3 Test Results for PATH
PATH and other bandwidth measurement algorithms have been tested exhaustively using NS-2 [16] and the BAMA simulator [18]. The test results have been reported in [17], where PATH has been compared with both sender-based and receiver-based algorithms. It is found that without cross traffic a receiver-based algorithm may be as accurate as PATH in estimating the bottleneck bandwidth, but in the presence of cross traffic PATH has a higher accuracy. Thus the tests have shown that even though PATH is a sender-based algorithm, it has an accuracy level similar to that of one of the best receiver-based algorithms. It has also been found that, during the First Step of Section 4.1, the overhead introduced by PATH is of the order of 10% to 14% of the overhead introduced by the SDSP algorithm.

Load Balancing with PATH: PATH is embedded in a load balancing algorithm for use by a resource provider on a grid. If this algorithm is applied to a widely dispersed set of compute nodes, it should be able to take the communication delays into account, in addition to considering the characteristics of the resources and the applications. If measurable communication delays are present in the system and a load balancing algorithm does not take these delays into account while allocating processes to the various compute nodes, the resources may not be used effectively.
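To illustrate how such delay-aware allocation might look, the following Python sketch selects, for a single task, the compute node with the smallest estimated completion time, where the data-transfer time is derived from a PATH-style bottleneck-bandwidth estimate of the path to that node. The Node class, the node list and the task parameters are hypothetical and do not represent the authors' load balancer.

    # Hedged sketch of bandwidth-aware node selection (not the authors' load
    # balancer): for each candidate node, estimate completion time as
    #   time the node becomes free + data size / bottleneck bandwidth + compute time,
    # and pick the node with the smallest estimate. All figures are hypothetical.

    from dataclasses import dataclass

    @dataclass
    class Node:
        name: str
        free_at_s: float        # time (s) at which the node becomes available
        bottleneck_mbps: float  # PATH-style estimate of path bandwidth to the node
        speed_factor: float     # relative compute speed (1.0 = reference node)

    def estimated_completion_s(node, data_mb, compute_s_on_reference):
        transfer_s = (data_mb * 8) / node.bottleneck_mbps
        compute_s = compute_s_on_reference / node.speed_factor
        return node.free_at_s + transfer_s + compute_s

    def pick_node(nodes, data_mb, compute_s_on_reference):
        return min(nodes, key=lambda n: estimated_completion_s(n, data_mb, compute_s_on_reference))

    if __name__ == "__main__":
        nodes = [
            Node("site1-n1", free_at_s=0.0,  bottleneck_mbps=1.544, speed_factor=1.0),
            Node("site7-n2", free_at_s=30.0, bottleneck_mbps=155.0, speed_factor=1.2),
            Node("site9-n3", free_at_s=5.0,  bottleneck_mbps=622.0, speed_factor=0.8),
        ]
        # A task needing 150 MB of input data and 60 s of compute on the reference node.
        best = pick_node(nodes, data_mb=150.0, compute_s_on_reference=60.0)
        print("Allocate to:", best.name)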
For testing the usefulness of PATH to a load balancing algorithm, we use a simulator with a variable number of compute nodes. The tests were conducted with 2 to 45 compute nodes. The applications chosen were of the three types specified in [8], where they are used for a hybrid load balancing strategy. Two of these applications have inter-dependencies with a depth of 5 and require a large amount of data to be transferred from one process to the next; hence they are suitable for studying the effect of bandwidth limitation on load balancing systems. The third application consists of a single task. The tests have been conducted for the medium load and heavy load conditions, as defined in [8]. We have assumed that the loads arrive at random during a fixed period of time (Figure 1). The network between Colby College, Waterville, Maine and the San Diego Supercomputing Center was named by Downey [13] as the SDSC dataset. We have used the SDSC dataset [13, 14] for testing.

The simulated network is distributed over 15 sites. Each site is assumed to have a maximum of 3 compute nodes. The links in the network have bandwidths varying from 1.544 Mbps to 622 Mbps. Since the processes in applications of types 1 and 2 in [8] require the transfer of large amounts of data, varying from 5 MB to 150 MB, running them on the SDSC dataset introduces large delays on some of the links. Thus this type of test brings out the effect of network characteristics on load balancing very effectively.

5. Test Results
The tests were conducted to obtain the Session Completion Time and the Percentage CPU Utilization for both high and medium loads. The case where the network delays are not considered while allocating the nodes is called the non-optimized case. In the non-optimized case the communication delays are present, but the algorithm allocates the first available node to every task that is ready to be processed; the delays in transferring the results from the parent processes of an inter-dependent set of tasks of a job are, however, still incurred. When the network delays are taken into account by the load balancing algorithm while allocating the nodes, it is called the optimized case. In this case, the node is selected such that the overall processing time for the job is reduced.

Figure 1 shows the arrival profile of the applications. Figures 2 and 3 show the response for the case where the first site has the 1st, 16th and 31st compute nodes, the second site on the SDSC network has nodes 2, 17 and 32, and the 15th site has the 15th, 30th and 45th nodes. The internal delays at a site are considered to be negligible.

6. Discussion on Test Results
The results clearly show that the session completion time reduces considerably when the network bandwidth is considered while allocating applications in a network of geographically distributed workstations. When the delays are present but are not taken into account for allocating the nodes, the use of a load balancing algorithm does not produce consistent results. In fact, our tests show that a load balancing algorithm which does not consider communication delays may not, under the conditions of the tests, be effective: the graph is erratic and the session completion time varies widely.

The tests have been done for the SDSC dataset [13] and for the application types defined in [8]. Such a system may lead to high delays in transferring data from the parent processes to the child process. Hence the results are valid for compute nodes located at large distances on the Internet and for cases where a large amount of data has to be communicated among the processes of a job. However, the results may have no validity for clusters connected through the high-speed dedicated network of an intragrid.

For medium load, when the number of nodes becomes high, a nearly stable value of the normalized session completion time is obtained for both the optimized and the non-optimized case. It is found that the session completion time improves from a normalized value of about 6.5 to a value of about 1.8. An equally good improvement is observed for any case of more than 6 nodes. However, the values for the non-optimized case vary widely, since the nodes are allocated 'blindly' without taking the network delays into consideration. When the load becomes high, the non-optimized case does not stabilize and the session completion time fluctuates widely.
However, the graphs show that the improvement in the normalized session completion time varies from 51.4% to 75.3% with respect to the non-optimized values when the number of nodes is more than 6. For fewer than 6 nodes, for both the medium and the high load case, the nodes remain rather busy; even in these cases, however, the graphs show a significant improvement in the session completion time.

CPU utilization improves considerably when the network delays are taken into consideration for the allocation of nodes to tasks. Attempts at load balancing without taking the delays into consideration again do not show consistency. As expected, the utilization falls when the number of compute nodes increases. The results show that the use of bandwidth measurement algorithms like PATH as part of the load balancing algorithm is necessary if load balancing of the compute resources on a grid is to be attempted.

7. Conclusions and Future Work
This paper embeds a new bottleneck link-bandwidth estimation algorithm, PATH, in a load balancing algorithm. Our tests show that load balancing on a grid is useful only when communication delays are taken into account. Most sender-based algorithms do not work well when the number of nodes is greater than 10. The performance of the PATH algorithm should be better because only the path between the sender and the bottleneck link affects the estimate.

The experiments used to verify the PATH algorithm are based on simulators (BAMA and the Network Simulator). The next stage of testing may be on a test-bed. We are working on integrating PATH with the application-centric load balancing algorithm and WebDedip; the integrated system may then be tested on a large test-bed. By collaborating with a large ISP, production tests could be the final method of tuning the system. We are also working on making the simulator used for obtaining the results in this paper more general, so that other load balancing algorithms and various priority schemes can be used in the simulation.

References:
[1]. G. E. Fagg, K. Moore, J. J. Dongarra and A. Geist, "Scalable Networked Information Processing Environment (SNIPE)", Proc. of Supercomputing 97, San Jose, CA, November 1997.
[2]. A. S. Grimshaw, W. A. Wulf and the Legion Team, "The Legion Vision of the World Wide Virtual Computer", Communications of the ACM, 40(1), 1997.
[3]. M. Beck, J. Dongarra, G. Fagg, A. Geist, P. Gray, J. Kohl, M. Migliardi, K. Moore, T. Moore, P. Papadopoulos, S. Scott and V. Sunderam, "HARNESS: a Next Generation Distributed Virtual Machine", Future Generation Computer Systems, Vol. 15, 1999, pp. 571-582.
[4]. J. Kohl and G. Geist, XPVM 1.0 Users' Guide, Technical Report ORNL/TM-12981, Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, TN, April 1995.
[5]. J. Devaney, R. Lipman, M. Lo, W. Mitchell, M. Edwards and C. Clark, The Parallel Applications Development Environment (PADE), User's Manual, PADE Major Release 1.4, November 21, 1995.
[6]. A. K. Aggarwal and H. S. Bhatt, "A Generalized Environment for Distributed Image Processing", Proceedings of the 15th Annual HPCS, Kluwer, 2002, pp. 209-223.
[7]. H. S. Bhatt and A. K. Aggarwal, "Web enabled client-server model for development environment of distributed image processing", Proceedings of the International Conference on Metacomputing, GRID 2000, pp. 135-145.
[8]. H. S. Bhatt, B. K. Singh and A. K. Aggarwal, "Application centric load balancing based on hybrid model", International Conference on Advanced Computing and Communication (ADCOM-2001), organized by IEEE and ACS India, December 2001, pp. 231-238.
[9]. V. Paxson, "Measurements and Analysis of End-to-End Internet Dynamics", PhD thesis, Computer Science Division, University of California, Berkeley, April 1997.
[10]. K. Lai and M. Baker, "Nettimer: A Tool for Measuring Bottleneck Link Bandwidth", in Proceedings of the USENIX Symposium on Internet Technologies and Systems, March 2001.
[11]. K. Harfoush, A. Bestavros and J. Byers, "Measuring Bottleneck Bandwidth of Targeted Path Segments", Technical Report BUCS-TR-2001-016, Boston University, Computer Science Department, July 2001.
[12]. C. Dovrolis, P. Ramanathan and D. Moore, "What do packet dispersion techniques measure?", in Proceedings of IEEE INFOCOM, 2001.
[13]. A. B. Downey, "Using Pathchar to Estimate Internet Link Characteristics", in Proceedings of ACM SIGCOMM '99, pp. 241-250.
[14]. V. Jacobson, "Pathchar - a tool to infer characteristics of Internet paths", MSRI talk, 1997.
[15]. D. Scott, "On optimal and data-based histograms", Biometrika, 66:605-610, 1979.
[16]. Network Simulator (NS), version 2 (website): http://www.isi.edu/nsnam/ns/.
[17]. A. Aggarwal and J. Wei, "PATH Algorithm for Bottleneck Bandwidth Measurement", WSEAS Transactions on Information Science and Applications, July 2004, pp. 346-354.
[18]. A. K. Aggarwal and J. Wei, "BAMA (BAndwidth Measurement Algorithms) Simulator", UoWSCS Technical Report #04-003 (website): http://www.uwindsor.ca/HPGCGroup/.
[19]. J. Bolliger and T. Gross, "A Framework-Based Approach to the Development of Network-Aware Applications", IEEE Transactions on Software Engineering (Special Issue on Mobility and Network-Aware Computing), 24(5), May 1998, pp. 376-390.
[20]. (Website) http://www.fp.globus.org/retreat00/presentations/nws_wolski/sld001.htm
[21]. (Website) http://www.fp.globus.org/retreat00/presentations/w_tierney.smaller.pdf

Figure 1: Application Arrival Profile
Figure 2: Normalized Session Completion Time for nodes 1, 2, 3 in one area
Figure 3: Percentage CPU Utilization for nodes 1, 2, 3 in one area