Dynamic Load Balancing Using Space-Filling Curves

Levi Valgaerts
Technische Universität München
Institut für Informatik
Boltzmannstr. 3, 85748 Garching bei München
E-mail: levi.valgaerts@pandora.be
Abstract. Space-filling curves (SFCs) provide a continuous mapping from a
one-dimensional to a d-dimensional space and have been used to linearize spatially distributed data for partitioning, memory management and image processing. This paper gives a short introduction to the concept of SFCs and highlights some of the applications found in computer science. The main focus will
be on dynamic distribution and load balancing techniques for adaptive grid hierarchies based on SFCs. These methods may be used whenever it is necessary
to parallelize the solution of partial differential equations on a distributed system.
1. Introduction
The numerical solution of partial differential equations (PDEs) is obtained by computing an approximate solution at a finite number of points. A standard way of discretizing the computational domain is by introducing a grid. The PDE is then solved at each discrete grid point by finite difference, finite volume or finite element techniques. In many cases adaptive grid refinement is desirable for concentrating additional resolution and computational effort on demanding regions of the domain. Such dynamically adaptive techniques can yield very good cost/accuracy ratios compared to static uniform methods. The distributed implementation of these techniques, however, leads to challenges in data distribution and re-distribution. The overall efficiency of the adaptive techniques is mainly determined by their ability to partition the underlying data structure very fast at run-time and in such a way that communication between processors is minimized. The main focus of this text will be on dynamic load balancing techniques for adaptive grid hierarchies based on space-filling curves.
Space-filling curves (SFCs) have long resided in the realm of pure mathematics, ever since Peano constructed the first such curve in 1890. Only with recent developments in computer science has a growing interest in the applications of SFCs emerged. SFCs are curves that pass through every point of a closed d-dimensional region. They have favorable properties that are exploited in applications where it is important to impose an ordering upon multi-dimensional data, such that spatial adjacency is preserved as much as possible in the one-dimensional index sequence.
I will begin with a short introduction to the concept of SFCs, followed by a discussion of some of the applications. The study of SFCs relies heavily on set theory and topology. I will use a more intuitive approach, which makes the nature of SFCs more quickly comprehensible to the reader. To illustrate this approach I use the Hilbert curve as one of the most prominent examples. In chapter 4 I will arrive at the main topic of this paper, namely load balancing techniques using SFCs.
2. Space-Filling Curves
A two-dimensional SFC is a continuous, surjective mapping from the closed unit interval I = [0,1] onto the closed unit square Q = I². Surjective means that the SFC passes through every point of Q. Hilbert was the first to propose a geometric generation principle for the construction of a SFC ([8]). The procedure is an exercise in recursive thinking and can be summed up in a few lines:

- We assume that I can be mapped continuously onto the unit square Q. If we partition I into four congruent subintervals, then it should be possible to partition Q into four congruent sub-squares, such that each subinterval is mapped continuously onto one of the sub-squares. We can repeat this reasoning by again partitioning each subinterval into four congruent subintervals and doing the same for the respective sub-squares.
- When repeating this procedure ad infinitum, we have to make sure that the sub-squares are arranged in such a way that adjacent sub-squares correspond to adjacent subintervals. In this way we preserve the overall continuity of the mapping.
- If an interval corresponds to a square, then its subintervals must correspond to the sub-squares of that square. This inclusion relationship assures that the mapping of the nth iteration preserves the mapping of the (n−1)th iteration.
Now every t ∈ I can be regarded as the limit of a unique sequence of nested closed intervals. To this sequence corresponds a unique sequence of nested closed squares that shrink into a point of Q, the image fh(t) of t. fh(I) is called the Hilbert SFC. If we connect the midpoints of the sub-squares in the nth iteration of the geometric generation procedure in the right order by polygonal lines, we can make the convergence to the Hilbert curve visible. This is done in Figure 1 for the first three iterations. The sequence of piecewise linear functions that converges to the Hilbert SFC is called the sequence of nth order approximating polygons, or the discrete Hilbert curves.
We can derive an arithmetic description of the Hilbert curve which allows us to calculate the coordinates of the image point of any t ∈ I using a form of parameter representation ([8, 1]). For this, the quaternary representation of t is used and a traversal orientation for Q is defined. Recursively subjecting Q to similarity transforms yields an analytical representation of the Hilbert curve.
Fig. 1. The first three steps (a, b, c) in the generation process of the Hilbert space-filling curve.
There are other SFCs that can be generated by the same recursive principle as the Hilbert curve, such that the mapping in every iteration preserves the mapping of the previous iteration. If we order the sub-squares as in Figure 2, we obtain the Lebesgue SFC, also called the Morton or z-curve. Other popular SFCs that can be encountered in various applications are the Peano curve ([8, 1, 6]) and the Sierpinski curve ([8, 2]). The concept of SFCs can further be generalized to higher-dimensional spaces, where I is mapped onto a closed region in a d-dimensional Euclidean space (d ≥ 2).
Fig. 2. The first two steps in the generation of the Lebesgue or z-space-filling curve.
3. Applications of Space-Filling Curves
In the previous chapter we have seen that the Hilbert curve can be regarded as the limit of a uniformly converging sequence of piecewise linear functions. For n ≥ 1 and d ≥ 2, the d-dimensional nth order approximation of the Hilbert curve maps the integer set {0, 1, 2, …, 2^(nd)−1} onto the d-dimensional integer space {0, 1, 2, …, 2^n−1}^d. This means that the discrete Hilbert curve can be used to define a linear order by which the objects in a multidimensional space can be traversed. Additionally, this mapping preserves locality: entities that are neighbors on the interval (that have similar indices) are also neighbors in the d-dimensional space. The reverse is not generally true; neighbors in space can be separated by the inverse mapping.
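As an illustration, the discrete mapping for d = 2 can be sketched with a standard iterative algorithm (a generic sketch, not the parameter representation of [8, 1]) that converts a one-dimensional Hilbert index into grid coordinates:

```python
def hilbert_index_to_xy(n, d):
    """Map a Hilbert index d in {0, ..., n*n - 1} to (x, y) coordinates
    on an n x n grid, where n is a power of two.

    Works level by level: each pair of index bits selects a quadrant,
    which is rotated/reflected into the Hilbert orientation."""
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)              # quadrant bit for x
        ry = 1 & (t ^ rx)              # quadrant bit for y
        if ry == 0:                    # rotate the partial result into place
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

# Consecutive indices map to neighboring cells (locality preservation):
curve = [hilbert_index_to_xy(8, d) for d in range(64)]
```

Consecutive indices always yield cells at Manhattan distance 1, which is exactly the locality property exploited below; the reverse direction (coordinates to index) inverts each step analogously.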
There are several applications that benefit from a locality preserving mapping. In traditional databases, a multi-attribute data space must be mapped to a one-dimensional disc space. A typical range query can be regarded as a linear traversal through the d-dimensional subspace representing the query region. The linear traversal specifies the order in which all the data objects in the query are fetched from disc. It is highly desirable that objects close together in the multi-dimensional space
are also close together in the one-dimensional disc space, since it is more efficient to
fetch a set of consecutive disc blocks in order to reduce additional seek time. Thus the
number of disc accesses for a range query can be reduced by a good clustering of the
data points in space. In [7] analytical results of the clustering properties of the Hilbert
curve are given with respect to arbitrarily shaped range queries. A query region is represented as a rectilinear polyhedron in a d-dimensional space with finite granularity.
This means that every data object corresponds to a grid cell and the rectilinear query
region is bounded by a set of polygonal surfaces, each of which is perpendicular to
one of the coordinate axes (e.g. a closed piecewise linear curve for d=2). A cluster is
then defined as a group of grid points inside the query that are consecutively connected by a mapping. A cluster thus corresponds to a continuous run in the disc space and
the number of clusters corresponds to the number of nonconsecutive disc accesses.
The higher the average number of clusters introduced by a mapping in an arbitrary
query region, the more additional seek time is needed. For the Hilbert curve the number of clusters within a d-dimensional polyhedron is equal to the number of entries of
the curve in the polyhedron. In [7] it is analytically proven that the average number of clusters is proportional to the total surface area of the polyhedron. These results, together with results from experimental simulations, show that the Hilbert curve outperforms other SFCs like the z-curve in preserving locality.
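The cluster count of a mapping can be made concrete with a small sketch (function and parameter names are illustrative, not from [7]): given the grid cells in curve order and the set of cells inside a query region, a cluster is a maximal run of consecutive curve positions that lie inside the region:

```python
def count_clusters(curve_order, query_cells):
    """Count maximal runs of consecutive curve positions inside the query.

    curve_order: list of grid cells in SFC order
    query_cells: set of cells belonging to the query region
    Each run corresponds to one contiguous read from disc."""
    clusters = 0
    inside = False
    for cell in curve_order:
        hit = cell in query_cells
        if hit and not inside:   # entering the query region starts a new cluster
            clusters += 1
        inside = hit
    return clusters
```

For the query {(0,1), (1,1)} on a 2×2 grid, the scan-line order (0,0), (0,1), (1,0), (1,1) produces two clusters, while the Hilbert order (0,0), (0,1), (1,1), (1,0) visits both query cells consecutively and yields a single cluster.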
In image processing an image is transformed into a bit string, which subsequently can be compressed. It is often desirable that the spatial coherence (similar pixel values) be preserved in the one-dimensional pixel sequence. Scanning techniques using SFCs are an attractive alternative to the standard scan-line method because they exploit this two-dimensional locality. In [4] an algorithm is introduced to compute a group of mappings that try to exploit the inherent coherence of a particular image as much as possible. This yields even better results than the classical Hilbert scan, which only works well statistically.
4. Dynamic Load Balancing Using Space-Filling Curves
4.1. The Load Balancing Problem
Finite difference, finite element and finite volume methods for the solution of partial
differential equations are based on meshes or grids. These schemes typically result in
a system of linear equations which can be solved by various numerical methods. It is
important to mention that the discrete operators are in many cases local, which means
that for the evaluation in one grid point, only direct neighbors are needed. A natural
way of porting algorithms to a parallel computer is the data distribution approach. The
grid is decomposed into several partitions and these are mapped onto the processors
that operate in parallel on the domain portions. The main goals of a good partitioning scheme are load balance and minimal communication between the processors.
With an adaptive approach, the initial mesh and the numerical method are enhanced during the course of solution in order to optimize the computational cost for a
given level of accuracy. This might be the case if the solution contains shocks or discontinuities or in time-dependent problems. Enhancements typically involve mesh refinement or coarsening, moving a mesh of fixed topology and increasing or decreasing the order of the method. In this paper we only consider adaptive grid refinement,
which allows changing the grid resolution in certain regions in order to gain more accuracy. Parallelism greatly complicates adaptive computation because domain decomposition, data management and inter-processor communication must be dynamic.
In the case of adaptive grid refinement the partitioning scheme additionally has to be very fast and has to introduce only small changes to the original partition, because data migration is often a significant performance factor.
In [11] and [5] a formal description of the load balancing problem using SFCs is given. An inverse SFC f⁻¹ can be used for the mapping of a d-dimensional domain Ω to the unit interval I. Thus geometric entities such as nodes or grid elements can be mapped to a one-dimensional sequence. We can now solve the resulting one-dimensional partition problem: we divide the interval I into disjoint subintervals Ij of equal workload. On Ω this gives perfect load balance, and the separators or partition surfaces ∂f(Ij) ∩ Ω can be shown to be quasi-optimal. In the performance model the
transferred data between processors is proportional to the separators, so we have to
minimize the surface to volume ratio of the partitions for high parallel efficiency. An
estimate for the locality of a SFC is given by the constant C in the definition of
Hölder continuity. This constant determines to what extent points close together on I
will be close together after being mapped in space. In [11] this estimate is used to
prove that partitions formed by most SFCs give satisfactory surface to volume ratios,
with the ratio for the sphere being the lowest. For uniform grids the separator sizes
obtained by SFC partitioning are optimal up to a multiplicative constant.
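The one-dimensional partition step can be sketched as a simple greedy pass (an illustrative variant, not the algorithm of [11] or [5]), assuming the cells are already sorted by their SFC index and each carries an individual workload:

```python
def partition_by_workload(weights, p):
    """Cut the SFC-ordered sequence of cell workloads into p contiguous parts.

    weights: workload of each cell, in SFC order
    p:       number of processors
    Returns p lists of cell indices; each part holds roughly total/p
    workload, so the subintervals I_j are balanced."""
    total = sum(weights)
    parts, current = [], []
    acc = 0.0
    for i, w in enumerate(weights):
        current.append(i)
        acc += w
        # close the current part once its cumulative share is reached
        if len(parts) < p - 1 and acc >= total * (len(parts) + 1) / p:
            parts.append(current)
            current = []
    parts.append(current)
    return parts
```

Because each part is a contiguous piece of the curve, its preimage in the domain is a connected (or nearly connected) region, which is what keeps the separators small.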
[9] presents a performance characterization of dynamic load balancing techniques
for distributed adaptive grid hierarchies. The authors introduce three metrics to evaluate six load balancing schemes. These metrics measure the load balance and the time
spent to achieve it, the communication overhead between computational nodes and
the data re-distribution after each refinement step. Experimental evaluation during a
simulation test shows that the load balancing scheme based on SFCs provides good performance. Only the iterative tree balancing scheme, a graph partitioner, gave better results with respect to the metrics presented above.
4.2. Load Balancing Algorithms Using Space-Filling Curves
There exist many different implementations of grid partitioning algorithms based on
SFCs, often depending on the type of grid ([10, 2, 3]). In [10] a structured adaptive
grid hierarchy is cut into rectangular blocks. The resulting block graph is then partitioned using a SFC partitioning (SP). Various ways of carrying out the initial block
partition are then investigated in order to attain a block graph that is suitable for SP.
The heuristics used in the assessment consist of two assumptions that are generally
made when dealing with SP. First we assume that a graph node (domain block) communicates more with neighboring nodes than with distant nodes. Secondly we assume
that the SFC gives neighboring nodes nearby indices. In conclusion, the block graphs are found to be optimal for SP when the blocks are of equal size, or of equal workload, irrespective of the number of refinement levels within a block and irrespective of the way in which the domain is cut.
In this respect the SP is considered as part of a two-step partitioning approach. Algorithms presented in more recent papers do not perform the first block-partitioning step, but apply the indexing directly to the grid elements ([2] for the pamatos library using triangle bisection) or to rectangular quadrants with equal workload ([3] for the Zoltan library using octrees, §4.3) during mesh generation. These methods can be
used for unstructured irregular grids.
To create an enumeration of the grid cells according to a SFC ordering, there are, roughly speaking, two possible approaches. A standard continuous SFC can be superimposed onto the grid. We then have to map a node or grid cell center onto an iterate of the SFC. For Lebesgue/Morton-type SFCs this can be done easily, since the index results from bit-interleaving a finite binary representation of the node coordinates. The list of nodes, sorted by their position on the SFC, is then cut into pieces with an equal number of nodes and mapped to different processors. Although similar
procedures can be found for the Hilbert curve ([1]), another approach yields a much
faster and more elegant algorithm. We first create an enumeration of the grid dictated
by a discrete SFC. When we perform a grid refinement by substituting a coarse grid
element Ej by several smaller elements Ej,k, the enumeration is changed such that it
cycles through these new elements first, before going on to Ej+1. For the enumeration
we thus use several iterates of the SFC depending on the local level of refinement and
the enumeration is produced in parallel with the grid construction.
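The first approach can be sketched for the Morton curve, where the index is obtained by interleaving the coordinate bits (a minimal illustration with made-up function names, not the implementation of any of the cited libraries):

```python
def morton_index(x, y, bits):
    """Interleave the bits of x and y to obtain the z-curve index."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)        # x bits at even positions
        z |= ((y >> i) & 1) << (2 * i + 1)    # y bits at odd positions
    return z

def morton_partition(cells, p, bits):
    """Sort cells (x, y) by Morton index and cut the list into p equal pieces."""
    ordered = sorted(cells, key=lambda c: morton_index(c[0], c[1], bits))
    chunk = (len(ordered) + p - 1) // p
    return [ordered[i:i + chunk] for i in range(0, len(ordered), chunk)]
```

Sorting by the interleaved index is all that is needed to linearize the grid; the second, enumeration-based approach avoids even this sort by producing the ordering during grid construction.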
4.3. Dynamic Octree Load Balancing Using Space-Filling Curves
As an illustration of the use of SFCs in load balancing algorithms for adaptive distributed grid hierarchies, I present here in more detail the dynamic octree load balancing
method as implemented in the Zoltan dynamic load balancing library ([3]). In this
method the grid is associated with an octree (quadtree in two dimensions), a recursively defined hierarchical data structure. The technique exploits similarities between
the octree and the SFC construction.
In octree-based mesh generation, the entire computational domain is embedded in a
cubic (square) universe which is represented by the root octant (quadrant). Octant refinement is the replacement of a terminal octant by an interior octant with eight (four
for quadtrees) new terminal octants as children, allowing for a greater resolution in
parts of the domain. In two dimensions this refinement corresponds to the bisection of
a quadrant into four sub-quadrants. Octrees can also be constructed for unstructured
meshes generated by other procedures by associating a grid element with the octant
containing its center. Figure 3 shows a 40-element triangular mesh where each leaf
quadrant contains at most 5 grid elements.
A depth-first traversal (DFT) of the octree determines all subtree costs; the cost can be, for example, the number of elements in the subtree. The optimal partition size (OPS) is the total cost divided by the number of partitions. A second DFT adds octants to the current
partition as long as their inclusion does not exceed OPS. Figure 3 shows the result of
this process for 3 partitions. This second DFT defines a one-dimensional ordering of
the leaf octants, which is divided into segments corresponding to the partitions.
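The two traversals can be sketched on a minimal tree structure (class and function names are illustrative, not Zoltan's API):

```python
class Octant:
    """A node of the octree/quadtree; leaves carry the cost (e.g. element count)."""
    def __init__(self, cost=0, children=None):
        self.cost = cost
        self.children = children or []

def subtree_cost(node):
    """First depth-first traversal: total cost of a subtree."""
    if not node.children:
        return node.cost
    return sum(subtree_cost(c) for c in node.children)

def partition_leaves(root, p):
    """Second depth-first traversal: assign leaves to p partitions,
    closing a partition once its cumulative cost reaches the OPS share."""
    ops = subtree_cost(root) / p
    partitions, current = [], []
    acc = 0.0

    def dft(node):
        nonlocal current, acc
        if not node.children:
            current.append(node)
            acc += node.cost
            if len(partitions) < p - 1 and acc >= ops * (len(partitions) + 1):
                partitions.append(current)
                current = []
        for c in node.children:
            dft(c)

    dft(root)
    partitions.append(current)
    return partitions
```

The traversal order of the leaves is exactly the one-dimensional ordering mentioned above; the next subsection shows how a SFC dictates that order.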
For scalability, an octree used for dynamic repartitioning must be distributed across
the cooperating processes and the construction must happen in parallel. The octree
method specifies how each process calculates its own range of leaf octants and how each process afterwards calculates the cost of its local subtree. After the local traversals are complete, data is migrated to the assigned destination process in order to meet the OPS.
Fig. 3. Three-way quadtree partitioning of a 40-element triangular grid.
Partitions are formed from contiguous segments of the linearization of the leaf octants of the octree. Now SFCs can be used to dictate the leaf traversal of the distributed octree. For this purpose a string-rewriting rule is applied. The traversal ordering
and the partitioning are determined by reading the string from beginning to end.
In order to achieve a Hilbert or Morton ordering of the leaf octants we start from a
simple initial template for the first refinement level. This template can have four different orientations for the Hilbert curve. Figure 4 (left) shows one of these orientations. Upon refinement to the next level a unique translation rule determines the ordering and the orientation of the offspring. The indices of the terminal octants are
stored in a string that is rewritten upon each refinement. Rewriting means that each
string entry is replaced by eight (four) new ones. The final curve is determined by recursive application of the rewriting rule. In the case of adaptive refinement only certain string entries are replaced. Figure 5 gives the result of the octant traversal for the
grid from Figure 3 for Hilbert and Morton ordering. The index string representing the
Hilbert ordering in the left of Figure 5 is {00 02 03 01 1 3 23 21 202 201 200 203
22}. Notice that the number of digits that make up an index indicates the local refinement level of the terminal octant.
The transformation rules for refining a parent with a certain orientation are encoded in ordering and orientation tables. These tables correspond to the grammar rules presented in [1].
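A two-dimensional version of such ordering and orientation tables can be sketched as follows; the four states A–D and the quadrant numbering (0 = lower-left, 1 = lower-right, 2 = upper-left, 3 = upper-right) are one possible convention, not the exact tables of [1] or [3]:

```python
# ORDER[s]: the sequence in which state s visits the four child quadrants;
# STATE[s][k]: the orientation handed down to the k-th visited child.
ORDER = {'A': [0, 2, 3, 1], 'B': [0, 1, 3, 2],
         'C': [3, 2, 0, 1], 'D': [3, 1, 0, 2]}
STATE = {'A': ['B', 'A', 'A', 'C'], 'B': ['A', 'B', 'B', 'D'],
         'C': ['D', 'C', 'C', 'A'], 'D': ['C', 'D', 'D', 'B']}

def hilbert_cells(level, state='A', x=0, y=0):
    """Enumerate the cells of a (2^level x 2^level) grid in Hilbert order
    by recursively rewriting each quadrant with the tables above."""
    if level == 0:
        return [(x, y)]
    half = 1 << (level - 1)
    out = []
    for k, q in enumerate(ORDER[state]):
        out += hilbert_cells(level - 1, STATE[state][k],
                             x + (q & 1) * half,    # x bit of quadrant q
                             y + (q >> 1) * half)   # y bit of quadrant q
    return out
```

In the adaptive case only the entries of refined octants are rewritten, exactly as described for the index string above.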
Fig. 4. Generation of the first two levels of the two-dimensional Hilbert ordering using ordering and orientation tables. The number in the lower left corner of each quadrant is the index.
The results of a comparative performance test using Hilbert, Morton and Gray code
traversal of the octree are presented in [3] for a variety of mesh sizes with different
geometric complexity. Hilbert ordering produced a superior surface index, which is
the ratio between the surface and the volume of a partition and a measure of the inter-process communication volume. Inter-process adjacency is the percentage of other
processes with which each process must communicate. Hilbert ordering produced the
lowest values, but some mesh structures did not favor any ordering, which emphasizes the influence of the specific problem on the load balancing performance. The intra-process connectivity measures the number of disjoint regions assigned to a given process, and is closely related to the surface index. Morton and Gray code orderings perform worse in this respect because they tend to give disconnected partitions (see Figure 5). Finally, the total solution time is regarded as the most important measure. Because of the improved surface index and inter-process connectivity, Hilbert ordering gave the lowest average execution time.
Fig. 5. Hilbert (left) and Morton (right) ordering of an adaptively refined grid.
5. Conclusion and Remarks
In this paper I presented a short survey on the concept of space-filling curves and their
use in applications where it is important to define a linear order for multidimensional
data. I introduced the Hilbert curve as one of the most widely used representatives and
compared its performance to that of other SFCs like the z-curve in applications like
database management and load balancing.
When it comes to the data parallelization of a PDE solver, load balancing methods
using SFCs present a less complex and faster alternative to other partitioning methods
like graph partitioning. The partitioning algorithm based on SFCs simply chooses an interval of numbers, which translates into a geometric subdomain. The obtained partitions can be proven to be not only perfectly balanced but also to have almost optimal separators. SFCs are especially well suited for adaptively refined grid hierarchies
because of their self-similar and recursive nature. As an example of a load balancing
technique I presented the octree method with Hilbert ordering.
Besides the Hilbert curve, other SFCs can be considered. Peano ([6]) and Sierpinski ([2]) curves are known to have competitive properties. As a further extension and
optimization of the considered partitioning methods, the SFC traversal of the grid elements can be used as the processing order of the elements during the iterative solution of the PDE ([6]). The position of a grid element on the SFC can also be used as a
key in a key-based addressing scheme ([5]). In both cases the locality of the SFC indexing can optimize the use of cache hierarchies in modern computers.
References
[1] M. Bader, Raumfüllende Kurven, Begleitendes Skriptum zum entsprechenden Kapitel der Vorlesung „Algorithmen des Wissenschaftlichen Rechnens“, Technische Universität München, 2004.
[2] J. Behrens and J. Zimmermann, Parallelizing an unstructured grid generator with a space-filling curve approach, Euro-Par 2000 Parallel Processing – 6th International Euro-Par Conference, Munich, Germany, August/September 2000, Proceedings (A. Bode, T. Ludwig, W. Karl, and R. Wismüller, eds.), Lecture Notes in Computer Science, vol. 1900, Springer-Verlag, Berlin, 2000, pp. 815–823.
[3] P. C. Campbell, K. D. Devine, J. E. Flaherty, L. G. Gervasio, and J. D. Teresco, Dynamic Octree Load Balancing Using Space-Filling Curves, Williams College Department of Computer Science Technical Report CS-03-01, 2003.
[4] R. Dafner, D. Cohen-Or, and Y. Matias, Context based space filling curves, Computer Graphics Forum, 19:209–218, 2000.
[5] M. Griebel and G. Zumbusch, Hash based adaptive parallel multilevel methods with space-filling curves, in H. Rollnik and D. Wolf, editors, NIC Symposium 2001, volume 9 of NIC Series, ISBN 3-00-009055-X, pages 479–492, Forschungszentrum Jülich, Germany, 2002.
[6] F. Günther, M. Mehl, M. Pögl, and C. Zenger, A cache-aware algorithm for PDEs on hierarchical data structures based on space-filling curves, SIAM Journal on Scientific Computing, submitted.
[7] B. Moon, H. V. Jagadish, C. Faloutsos, and J. H. Saltz, Analysis of the clustering properties of the Hilbert space-filling curve, IEEE Transactions on Knowledge and Data Engineering, 13(1):124–141, January/February 2001.
[8] H. Sagan, Space-Filling Curves, Springer-Verlag, New York, 1994.
[9] M. Shee, S. Bhavsar, and M. Parashar, Characterizing the Performance of Dynamic Distribution and Load-Balancing Techniques for Adaptive Grid Hierarchies, Proceedings of the IASTED International Conference on Parallel and Distributed Computing and Systems, November 3–6, 1999, Cambridge, Massachusetts, USA.
[10] J. Steensland, Dynamic Structured Grid Hierarchy Partitioners Using Space-Filling Curves, Technical Report IT-series 2001-002, Uppsala University, 2001.
[11] G. Zumbusch, Adaptive Parallel Multilevel Methods for Partial Differential Equations, Habilitation, Universität Bonn, 2001.