PATH Algorithm for Adaptive Load Balancing on a Grid
A.K. Aggarwal, Robert Kent and Jun Wei
School of Computer Science
University of Windsor
Windsor, Ontario
CANADA, N9B 3P4
Abstract: The resources on a grid form a dynamic set. The compute resources of a grid resource-service provider may be distributed over a wide geographical area. If the resource-service provider is to use these resources effectively, the characteristics of the communication network must be known in addition to the characteristics of the compute nodes and the applications. Network bandwidth is an important resource of a network. As high performance applications move to grids of clusters connected through the Internet, the measurement of network bandwidth becomes a critical and challenging issue. We have developed a new method, the PATH (PAcket Train with Header) algorithm, for the measurement of bandwidth. It is a sender-based algorithm that has been designed for ease of use in a grid environment. This paper presents the use of a load balancing algorithm in which PATH has been embedded. Our tests show that such an adaptive load balancing algorithm can improve the use of resources to a large extent.
Keywords: Bottleneck link bandwidth, single packet, packet pair, packet train, PATH, SMSP, SDSP, grid and resource
allocation, load balancing.
1. Introduction

A load balancing algorithm for a heterogeneous distributed computing environment attempts to improve the throughput and the response time of a parallel or distributed application by ensuring optimum utilization of available resources. On a grid, the load balancing algorithm should respond quickly to the changing workload and environmental conditions without causing much overhead. Hence it should adapt its strategy to the prevailing conditions.

If a grid consisting of many “producer” clusters is to provide service to any “consumer” client on the Internet, the client may not have any software other than the standard TCP/IP stack on its operating system. For providing service to such a client, when the volume of data required to be sent to the client is large, the data may be conditioned in a format appropriate for the bandwidth of the path [19].

If a high performance application is being solved on a geographically dispersed group of machines connected through the Internet, a load balancing algorithm may require information about the bandwidth of the paths to these machines. This paper presents the use of an algorithm, titled PATH, which can be embedded in a load balancing algorithm for such an application, for measuring the bottleneck bandwidth between any two machines. The information obtained through PATH may also be used by a web server for conditioning the data appropriately when large amounts of data have to be sent to a “consumer” client. In a grid, new machines may become available for use and some available machines may have to be surrendered. A grid scheduler, therefore, may have to measure the bandwidth of the paths to new machines ‘on the fly’. PATH is a sender-based method, specifically designed for providing relatively accurate data for such applications. PATH has been developed as a part of a project for developing tool-kits for use in a multi-cluster grid, which is to support clients located anywhere on the Internet.

In the context of network performance characteristics, the term ‘bandwidth’ specifies the amount of data in bits that a network can transfer per second. For a multi-hop path, the capacity of one of the links of the path is the amount of data per second that the link can carry when there is no competing traffic. If there are H hops in the path and Ci is the capacity of link i, then the bottleneck link bandwidth [12] is

C = min Ci, i = 0, …, H    (1)

There are many methods for measuring the bandwidth; however, not every available method is equally appropriate for use in every type of grid environment. The Network Weather Service [20] and NetLogger [21] require the loading of middleware on every machine of the grid. Hence these systems may not be usable directly for the type of grid applications being worked out in the project.
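As an illustration of Equation (1) (our own sketch, not part of the original paper), the following fragment computes the bottleneck bandwidth of a path from a list of per-link capacities; the function name and the example capacities are illustrative.

```python
def bottleneck_bandwidth(link_capacities_bps):
    """Equation (1): the end-to-end bottleneck bandwidth C is the minimum
    of the individual link capacities Ci along the path."""
    return min(link_capacities_bps)

# Example: a path whose links run at 100 Mbps, 1.544 Mbps and 622 Mbps
# is limited by the 1.544 Mbps (T1) link.
print(bottleneck_bandwidth([100e6, 1.544e6, 622e6]))  # 1544000.0
```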
2. Background
Grid Application Toolkits are required to provide
specialized services for distributed computing on the
Internet so as to make it possible for application
developers to use the grid easily.
If the tool-kits are to help in meeting the
challenges of grid computing, it is necessary to design
and construct software which can sense the existing
network environment and compute a usage model
based on reasonable statistical considerations. Such
sensing software can be effectively used only when it
is integrated with a dynamic distributed environment.
Several research workers have developed
distributed computing environments [1 to 5], which
provide different types of facilities to the developer of
distributed applications. WebDedip [6, 7], developed
under a research project at Gujarat University,
Ahmedabad, India, is the environment that was
developed with ease of use as the core theme to
support HPC users like physicists, mathematicians,
civil engineers, etc. Initially, the performance issues
were not addressed. Later, a hybrid application-centric
load balancer [8], consisting of 6 algorithms, a
heuristic and a normalization process, was developed
for optimizing the performance.
WebDedip has proved to be highly useful and
reliable for developing applications in a distributed
environment; it has features like fault-tolerance, user-friendliness and near-optimal use of resources. A user
has only to provide information about the dependencies
and resources required by the processes in the
application. Thereafter, WebDedip is able to manage
completely the distributed system for all the processes
in the application. It has built-in mechanisms for
guaranteed data transfer and process control and
recovery, resolving problems and reporting them to the
user and the System Manager. WebDedip provides a
web-based user-interface through a standard browser.
However, WebDedip requires a dedicated network.
Since the grid is being developed, by common
usage, as an IP-based network, the application tool has
to be made more network-aware, so that it may be able
to adapt process scheduling to the continuously varying
environment. This is made possible by providing
network parameters through a software module, and its
associated protocol, which continuously monitors the
status of the network. WebDedip is now being
redesigned to work jointly with such a module so that
an effective tool for use in a grid environment can be
created. Recruitment of free nodes for distributed
computing is also required in a grid system. Other
nodes may be available in a time-scheduled mode for
recruitment. As information regarding the mode of the
nodes and the state of the network is made available,
the scheduler may be able to allocate the resources
optimally.
A network-aware WebDedip has the potential to be
useful for higher-level applications envisaged under
Globus or other similar grid projects, as common
standards for interfacing such systems are developed
by the grid community. A sender-based algorithm which is accurate and not resource-hungry would be a suitable algorithm to embed in the load balancing algorithm of an environment like WebDedip. This would enable an effective use of grid resources. PATH has been specifically developed for
such an application. In this paper, we show, through
simulation, that the task of optimum utilization of
compute nodes on a grid cannot be achieved without
the use of such an algorithm.
3. The Existing Algorithms
By general consensus, packet dispersion techniques are
the preferred methods for measurement of bandwidth.
A number of research workers [9 to 14] have worked
on developing new algorithms based on packet
dispersion. A packet dispersion algorithm injects
carefully spaced datagrams into the network and the
differences between the dispersions of probe packets
are observed. The packet dispersion techniques
estimate bandwidth of the link or the end-to-end path
by measuring the dispersion. Such methods are generally classified into single-packet methods, packet-pair methods, and methods which require a train of packets to be sent for estimating the bandwidth. For use on a grid, the criteria of practicability and performance have to be considered.
Practicability: Most of the packet-pair algorithms [9
to 11] for estimating the link bandwidth need to install
software on the receiver side. This may add another
layer to the middleware required for a grid
environment.
Single-packet and sender-based packet-pair algorithms are easy to implement. Based on the TCP/IP protocol, they do not need the installation of any special software on the receiver side to measure the parameters of the network. However, the accuracy of measurement of the existing sender-based methods is much poorer than that of receiver-based algorithms.
Performance: To get better estimates of bandwidth,
methods based on single-packet algorithms send out a
large number of probe packets, which may consume a
large part of the available bandwidth of the network. This affects the network performance adversely.
PAcket Train with Header (PATH) has been developed for use with grid applications. It is based on the previous work of the BProbe [12], packet-tailgating [10], Pathrate [12] and Cartouche [11] algorithms. PATH is sender-based, and it consumes a relatively small amount of bandwidth while achieving the same high accuracy of measurement that is available from some of the receiver-based algorithms.
4. The PATH Algorithm
4.1 Algorithm Description
There are two steps to implement the PATH algorithm:
First Step: We use a simple single-packet algorithm (SMSP) to check the network structure and to locate the bottleneck link Lk. Compared with the standard single-packet algorithm (SDSP) [13], the SMSP algorithm does not have to measure the bandwidth of each link of the whole network. It only determines the position of the minimal bandwidth, which is the bandwidth of the bottleneck link. Although SMSP can give an estimate of the bottleneck link bandwidth, this value may not be accurate. It is only used as a reference value for the second step of the PATH algorithm.
Let the links from the source to the destination be numbered 1 to n. Let Lk (lying between routers Rk-1 and Rk) be the bottleneck link. The main purpose of the SMSP algorithm is to estimate the value of k. This greatly reduces the volume of traffic compared to the traffic that the SDSP algorithm generates.
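A simulation-style sketch of how the first step could be realized is given below. It is our reconstruction under stated assumptions rather than the authors' code: it assumes per-hop round-trip times have already been collected with two probe sizes (for example with TTL-limited probes, in the spirit of pathchar [13, 14]), estimates each link capacity from the extra serialization delay of the larger probe, and returns the index k of the link with the smallest estimate.

```python
def locate_bottleneck(rtts_small, rtts_large, size_small, size_large):
    """Estimate the index k (1-based) of the bottleneck link from median
    per-hop round-trip times (seconds) measured with two probe sizes (bytes).
    Sketch only: the inputs, names and capacity model are our assumptions."""
    capacities = []
    prev_s, prev_l = 0.0, 0.0
    for rtt_s, rtt_l in zip(rtts_small, rtts_large):
        # Extra one-way serialization delay contributed by this link alone.
        delta = max(((rtt_l - prev_l) - (rtt_s - prev_s)) / 2.0, 1e-9)
        capacities.append((size_large - size_small) * 8 / delta)  # bits/second
        prev_s, prev_l = rtt_s, rtt_l
    # The bottleneck link Lk is the one with the smallest estimated capacity.
    return 1 + min(range(len(capacities)), key=capacities.__getitem__)
```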
Second Step: Use a packet-train-with-header probe to measure the bandwidth of the link Lk.

The source sends out a header packet H and a packet train T1, T2, …, Tn. Both the header and the packet-train packets are UDP packets. All the packets Ti of the packet train are of the same size. Sh, the size of the header packet H, is much larger than St, the size of Ti. Each packet Ti contains only 8 bytes, used for identifying the packet. We denote the time-to-live (TTL) of a packet by tj if the packet expires after reaching router Rj. The TTL of all the packet-train packets Ti is tj, so the Ti packets stop at router Rj, and Rj responds to the source with ICMP time-exceeded packets.

The ICMP packets include the 8 bytes of original data from the UDP packets, which identify each packet. These identifiers can be checked to take into account the loss of packets and the out-of-order receipt of packets.
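A minimal sketch of how such a probe could be emitted with the standard Python socket API is shown below. The packet sizes, destination port and train length are illustrative assumptions, and capturing the returning ICMP time-exceeded replies, which requires a privileged raw socket, is omitted.

```python
import socket

def send_path_probe(dst_ip, ttl, header_size=1400, train_len=16,
                    train_payload=8, dport=33434):
    """Emit one PATH probe: a large header packet H followed by a train of
    small packets T1..Tn, all UDP and all carrying the same TTL so that they
    expire at the same router, which returns ICMP time-exceeded replies."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.IPPROTO_IP, socket.IP_TTL, ttl)
    # Header packet H: its size Sh is much larger than St so that it keeps
    # the train queued close behind it on the links before the bottleneck.
    s.sendto(b"\x00" * header_size, (dst_ip, dport))
    # Packet train T1..Tn: an 8-byte sequence number as payload, so the
    # ICMP replies (which echo the first bytes of the original datagram)
    # let the sender detect lost and reordered probes.
    for i in range(train_len):
        s.sendto(i.to_bytes(train_payload, "big"), (dst_ip, dport))
    s.close()
```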
The estimated interval ΔtI is computed through the use of a probability density function [15]. The bottleneck bandwidth β can be calculated from

β = St / ΔtI    (2)

where St is the size of a packet in the packet train.
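As an illustration of Equation (2) (a sketch under our own assumptions, not the authors' implementation), the fragment below estimates ΔtI as the mode of the inter-arrival gaps of the ICMP replies, using a histogram whose bin width follows Scott's rule [15], and then returns β = St/ΔtI. The function and variable names are our own.

```python
import numpy as np

def estimate_bandwidth(reply_times, st_bytes):
    """Equation (2): beta = St / Delta_tI, where Delta_tI is estimated as the
    mode of the distribution of inter-arrival gaps of the ICMP replies.
    reply_times are arrival timestamps in seconds; st_bytes is the size of a
    packet-train packet on the wire.  Illustrative sketch only."""
    gaps = np.diff(np.sort(np.asarray(reply_times, dtype=float)))
    gaps = gaps[gaps > 0]
    # Scott's rule for the histogram bin width: 3.49 * sigma * n^(-1/3) [15].
    width = 3.49 * gaps.std() * len(gaps) ** (-1.0 / 3.0)
    bins = int(np.ceil((gaps.max() - gaps.min()) / width)) if width > 0 else 1
    counts, edges = np.histogram(gaps, bins=max(bins, 1))
    k = counts.argmax()
    delta_t = 0.5 * (edges[k] + edges[k + 1])      # estimated Delta_tI, seconds
    return st_bytes * 8 / delta_t                  # bits per second
```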
In both steps, a number of packets are sent until the result converges. This takes care of the problem of loss of some of the packets. The algorithm described in this paper has the limitation that it can be used only if the router at the beginning of the bottleneck link responds with an ICMP time-exceeded message. The routers on Internet paths are designed to respond with ICMP error messages. Moreover, ICMP rate limiting will not cause an error, because the final result is obtained on the basis of a large number of readings. The results will be distorted only if the router lies on a single-link path to a web server that is under a continuous denial-of-service attack. Under such an abnormal condition, however, no method of measurement will be able to provide correct results.
4.2 Cross Traffic
The PATH algorithm works better than the standard sender-based algorithms under cross traffic. We consider three cases to discuss the effect of cross traffic on the accuracy of measurements by PATH as compared to that of other available algorithms. Cross traffic can exist in the links between the source computer and router Rk-1, or between router Rk and the destination computer.
Case 1: cross traffic in the links between router Rk and the destination computer. Because all the PATH packets stop at router Rk, cross traffic in the links between router Rk and the destination computer does not affect the measurement made by the PATH algorithm. The other sender-based algorithms, however, need to send packets all the way to the destination computer. Cross traffic in the links between router Rk and the destination computer will distort the spacing among their probes, which may adversely affect the accuracy of their bandwidth estimates.
Case 2: cross traffic in the links between the source computer and router Rk-1. In this case, the large header of the PATH probe train keeps the packets T1, T2, …, Tn together before they reach router Rk-1. If Sh is large enough, H will temporarily block the link Lj. If the packet-train packets T1, T2, …, Tn reach router Rj-1 during the transmission of the header packet H on the link Lj, they will queue in router Rj-1 and wait until the header packet has been received by Rj. H will, therefore, be immediately followed by the packet train. The link Lj may introduce some dispersion among the packets, but they will wait in the queue of Rj and the intervals among the packet-train packets will become zero again. So if the header is large enough, it can keep the packet train together until the packets reach the bottleneck link.
Case 3: cross traffic on the link Lj. If the cross traffic is not destined for link Lj+1, it will not share the queue of router Rj with the probe packets, so the intervals between the packet-train packets will again become zero. Hence this kind of cross traffic will not affect the result of the PATH algorithm, but it will affect the spacing among the probes of the other sender-based algorithms.
So under cross traffic, PATH may generate more
accurate estimates of bandwidth than the estimates by
other sender-based methods.
4.3 Test Results for PATH
PATH and other bandwidth measurement algorithms
have been tested exhaustively using NS-2 [16] and
BAMA simulator [18]. The test results have been
reported in [17].
In [17], PATH has been compared with both sender-based and receiver-based algorithms. It is found that without cross traffic, a receiver-based algorithm may be as accurate as PATH in estimating the bottleneck bandwidth. But in the presence of cross traffic, PATH has a higher accuracy. Thus the tests
have shown that even though PATH is a sender-based
algorithm, it has an accuracy level similar to that of
one of the best receiver-based algorithms. It has also
been found that, during the First Step of Section 4.1,
the overhead introduced by PATH is of the order of
10% to 14% of the overhead introduced by SDSP
algorithm.
Load Balancing with PATH: PATH is embedded in a load balancing algorithm for use by a resource provider on a grid. If this algorithm is applied to a widely dispersed set of compute nodes, the algorithm should take the communication delays into account, in addition to the characteristics of the resources and the applications. If measurable communication delays are present in the system and the load balancing algorithm does not take these delays into account while allocating processes to the various compute nodes, the resources may not be used effectively.
For testing the usefulness of PATH to a load balancing
algorithm, we use a Simulator with a variable number
of compute nodes. The tests were conducted with 2 to
45 compute-nodes. The applications chosen were of
three types, as specified in [8]. That paper uses the three applications in a hybrid load balancing strategy. Two
of these applications have inter-dependencies with a
depth of 5. These applications require a large amount
of data to be transferred from one process to the next
process. Hence these applications are suitable for
studying the effect of bandwidth limitation on load
balancing systems. The third application consists of a
single task. The tests have been conducted for the
medium load and heavy load conditions, as defined in
[8]. We have assumed that the loads arrive at random
during a fixed period of time (Figure 1).
The network between Colby College, Waterville, Maine and the San Diego Supercomputing Center was named the SDSC dataset by Downey [13]. We have
used the SDSC dataset [13, 14] for testing. The
simulated network is distributed over 15 sites. Each
site is assumed to have a maximum of 3 compute
nodes. The links in the network have a bandwidth
varying from 1.544 Mbps to 622 Mbps. Since the processes in applications of types 1 and 2 in [8] require the transfer of large amounts of data, varying from 5 MB to 150 MB, running them over the SDSC dataset introduces large delays on some of the links. Thus this type of test
brings out the effect of network characteristics on load
balancing very effectively.
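As a back-of-the-envelope check (our own illustration, using only the data volumes and link speeds quoted above), the slowest and fastest links already imply very different transfer delays:

```python
def transfer_seconds(data_mb, link_mbps):
    """Time to move data_mb megabytes over a link of link_mbps (decimal units)."""
    return data_mb * 8 / link_mbps

print(transfer_seconds(150, 1.544))   # ~777 s over the slowest (1.544 Mbps) link
print(transfer_seconds(150, 622))     # ~1.9 s over the fastest (622 Mbps) link
```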
5. Test Results
The tests were conducted for obtaining the Session
Completion Time and the Percentage CPU Utilization
for both high and medium loads.
The case where the network delays are not considered while allocating the nodes is called the non-optimized case. In the non-optimized case the communication delays are still present, but the algorithm simply allocates the first available node to every task that is ready to be processed. However, the delays in transferring the results from the parent processes, for an inter-dependent set of tasks of a job, are still incurred.
When the network delays are taken into account by the load balancing algorithm while allocating the nodes, the case is called the optimized case. In this case, a node is selected such that the overall processing time for the job is reduced.
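A minimal sketch of the two allocation policies, under our own cost model and naming (not the authors' implementation), is given below. The optimized policy uses the PATH-measured bottleneck bandwidth to estimate the transfer time of the parents' outputs and picks the node minimizing the task's estimated finish time; the non-optimized policy simply takes the first free node.

```python
def pick_node_optimized(task_work, parent_data_mb, free_nodes,
                        node_speed, bandwidth_bps):
    """Choose the free node minimizing estimated finish time = time to pull
    the parents' outputs over the measured bottleneck bandwidth to that node
    + compute time on that node.  All names and units are illustrative."""
    def finish_time(node):
        transfer_s = parent_data_mb * 8e6 / bandwidth_bps[node]   # MB -> bits
        compute_s = task_work / node_speed[node]
        return transfer_s + compute_s
    return min(free_nodes, key=finish_time)

def pick_node_non_optimized(free_nodes):
    """Non-optimized policy: allocate the first available node,
    ignoring communication delays."""
    return free_nodes[0]
```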
Figure 1 shows the arrival profile of the applications.
Figures 2 and 3 show the response for the case where the first site has the 1st, 16th and 31st compute nodes.
The second site on the SDSC network has node
numbers 2, 17 and 32. The 15th site has the 15th, 30th
and the 45th nodes. The internal delays at a site are
again considered to be negligible.
6. Discussion on Test Results
The results clearly show that the session completion time reduces considerably when the network bandwidth is considered while allocating applications in a network of geographically distributed workstations.

When the delays are present but are not taken into account while allocating the nodes, the use of a load balancing algorithm does not produce consistent results. In fact, our tests show that a load balancing algorithm which does not consider communication delays may not, under the conditions of the tests, be effective. The graph is erratic and the session completion time varies widely.
The tests have been done for the SDSC dataset [13] and for application types as defined in [8]. Such a system may lead to high delays in transferring data from parent processes to child processes. Hence the results are valid for compute nodes located at large distances on the Internet and for cases where a large amount of data has to be communicated among the processes of a job. However, the results may not be valid for clusters connected through the high-speed dedicated network of an intragrid. For medium load,
when the number of nodes becomes high, a nearly
stable value for Normalized session completion time is
obtained for both the optimized and non-optimized
case. It is found that the session completion time
improves from a normalized value of about 6.5 to a
value of about 1.8. An equally good improvement is
observed for any case of more than 6 nodes. However
the values for the non-optimized case vary widely,
since the nodes are allocated ‘blindly’ without taking
into consideration the network delays. When the load
becomes high, the non-optimized case does not
stabilize and the session completion time fluctuates
widely. However the graph shows that the
improvement in the normalized session completion
time varies from 51.4% to 75.3% with respect to the
non-optimized values, when the number of nodes is
more than 6. For fewer than 6 nodes, for both the medium and the high load cases, the nodes remain rather busy. However, the graphs show that even in these cases there is a significant improvement in the session completion time.
CPU utilization improves considerably when the network delays are taken into consideration for the allocation of nodes to tasks. Attempts at load balancing without taking the delays into consideration again do not show consistency. As expected, the utilization falls as the number of compute nodes increases.
The results show that the use of bandwidth measurement algorithms like PATH as part of load balancing algorithms is necessary if load balancing of the compute resources on a grid is to be attempted.
7. Conclusions and Future Work
This paper embeds a new bottleneck link-bandwidth estimation algorithm, PATH, in a load balancing algorithm. Our tests show that load balancing on a grid is useful only when communication delays are taken into account.
Most sender-based algorithms do not work well when
the number of nodes is greater than 10. The
performance of the PATH algorithm should be better
because only the path between the sender and the
bottleneck link will affect the estimate. The
experiments that we used to verify the PATH
algorithm are based on the simulators (BAMA and
Network Simulator). The next stage of testing may be
on a test-bed. We are working on integrating PATH with the application-centric load balancing algorithm and WebDedip. The integrated system may then be tested on a large test-bed. By collaborating with a large ISP, production tests could be the final method of tuning the system.
We are also working on making the simulator used for obtaining the results in this paper more general, so that other load balancing algorithms and various priority schemes can be used in the simulation.
References:
[1] G. E. Fagg, K. Moore, J. J. Dongarra and A. Geist, "Scalable Networked Information Processing Environment (SNIPE)", Proc. of Supercomputing 97, San Jose, CA, November 1997.
[2] A. S. Grimshaw, W. A. Wulf and the Legion Team, "The Legion Vision of the World Wide Virtual Computer", Communications of the ACM, 40(1), 1997.
[3] M. Beck, J. Dongarra, G. Fagg, A. Geist, P. Gray, J. Kohl, M. Migliardi, K. Moore, T. Moore, P. Papadopoulous, S. Scott and V. Sunderam, "HARNESS: a Next Generation Distributed Virtual Machine", Future Generation Computer Systems, Vol. 15, 1999, pp. 571-582.
[4] J. Kohl and G. Geist, XPVM 1.0 Users' Guide, Technical Report ORNL/TM-12981, Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, TN, April 1995.
[5] J. Devaney, R. Lipman, M. Lo, W. Mitchell, M. Edwards and C. Clark, The Parallel Applications Development Environment (PADE), User's Manual, PADE Major Release 1.4, November 21, 1995.
[6] A. K. Aggarwal and H. S. Bhatt, "A Generalized Environment for Distributed Image Processing", Proceedings of the 15th Annual HPCS, Kluwer, 2002, pp. 209-223.
[7] H. S. Bhatt and A. K. Aggarwal, "Web enabled client-server model for development environment of distributed image processing", Proceedings of the International Conference on Metacomputing, GRID 2000, pp. 135-145.
[8] H. S. Bhatt, B. K. Singh and A. K. Aggarwal, "Application centric load balancing based on hybrid model", International Conference on Advanced Computing and Communication, ADCOM-2001, organized by IEEE and ACS India, December 2001, pp. 231-238.
[9] V. Paxson, "Measurements and Analysis of End-to-End Internet Dynamics", PhD thesis, Computer Science Division, University of California, Berkeley, April 1997.
[10] K. Lai and M. Baker, "Nettimer: A Tool for Measuring Bottleneck Link Bandwidth", in Proceedings of the USENIX Symposium on Internet Technologies and Systems, March 2001.
[11] K. Harfoush, A. Bestavros and J. Byers, "Measuring Bottleneck Bandwidth of Targeted Path Segments", Technical Report BUCS-TR-2001-016, Boston University, Computer Science Department, July 2001.
[12] C. Dovrolis, P. Ramanathan and D. Moore, "What do packet dispersion techniques measure?", in Proceedings of IEEE INFOCOM, 2001.
[13] A. B. Downey, "Using Pathchar to Estimate Internet Link Characteristics", ACM SIGCOMM '99, pp. 241-250.
[14] V. Jacobson, "Pathchar - a tool to infer characteristics of Internet paths", MSRI talk, 1997.
[15] D. Scott, "On optimal and data-based histograms", Biometrika, 66:605-610, 1979.
[16] Network Simulator (NS), version 2 (website): http://www.isi.edu/nsnam/ns/.
[17] A. Aggarwal and J. Wei, "PATH Algorithm for Bottleneck Bandwidth Measurement", WSEAS Transactions on Information Science and Applications, July 2004, pp. 346-354.
[18] A. K. Aggarwal and J. Wei, "BAMA (BAndwidth Measurement Algorithms) Simulator", UoWSCS Technical Report #04-003 (website): http://www.uwindsor.ca/HPGCGroup/
[19] J. Bolliger and T. Gross, "A Framework-Based Approach to the Development of Network-Aware Applications", IEEE Transactions on Software Engineering (Special Issue on Mobility and Network-Aware Computing), May 1998, 24(5), pp. 376-390.
[20] (Website) http://www.fp.globus.org/retreat00/presentations/nws_wolski/sld001.htm
[21] (Website) http://www.fp.globus.org/retreat00/presentations/w_tierney.smaller.pdf
Figure 1: Application Arrival Profile
Figure 2: Normalized Session Completion time for nodes 1, 2, 3 in one area
Figure 3: Percentage CPU utilization for nodes 1, 2, 3 in one area