Dynamic Load Balancing Using Space-Filling Curves

Levi Valgaerts
Technische Universität München, Institut für Informatik
Boltzmannstr. 3, 85748 Garching bei München
E-mail: levi.valgaerts@pandora.be

Abstract. Space-filling curves (SFCs) provide a continuous mapping from a one-dimensional to a d-dimensional space and have been used to linearize spatially distributed data for partitioning, memory management and image processing. This paper gives a short introduction to the concept of SFCs and highlights some of their applications in computer science. The main focus is on dynamic distribution and load balancing techniques for adaptive grid hierarchies based on SFCs. These methods can be used whenever the solution of partial differential equations has to be parallelized on a distributed system.

1. Introduction

The numerical solution of partial differential equations (PDEs) is obtained by computing an approximate solution at a finite number of points. A standard way of discretizing the computational domain is to introduce a grid. The PDE is then solved at each discrete grid point by finite difference, finite volume or finite element techniques. In many cases adaptive grid refinement is desirable in order to concentrate additional resolution and computational effort on demanding regions of the domain. Such dynamically adaptive techniques can yield very good cost/accuracy ratios compared to static uniform methods. Their distributed implementation, however, leads to challenges in data distribution and re-distribution. The overall efficiency of the adaptive techniques is mainly determined by their ability to partition the underlying data structure very fast at run-time and in such a way that communication between processors is minimized. The main focus of this text is on dynamic load balancing techniques for adaptive grid hierarchies based on space-filling curves.

Space-filling curves (SFCs) have long resided in the realm of pure mathematics, ever since Peano constructed the first such curve in 1890. Only with the recent developments in computer science has a growing interest in the applications of SFCs emerged. SFCs are curves that pass through every point of a closed d-dimensional region. They have favorable properties that are exploited in applications where it is important to impose an ordering on multi-dimensional data such that spatial adjacency is preserved as much as possible in the one-dimensional index sequence.

I will begin with a short introduction to the concept of SFCs, followed by a discussion of some of their applications. The study of SFCs relies heavily on set theory and topology; I will use a more intuitive approach, which makes the nature of SFCs comprehensible much faster. To illustrate this approach I use the Hilbert curve as one of the most prominent examples. In chapter 4 I arrive at the main topic of this paper, namely load balancing techniques using SFCs.

2. Space-Filling Curves

A two-dimensional SFC is a continuous, surjective mapping from the closed unit interval I = [0,1] onto the closed unit square Q = I^2. Surjective means that the SFC passes through every point of Q. Hilbert was the first to propose a geometric generation principle for the construction of an SFC ([8]). The procedure is an exercise in recursive thinking and can be summed up in a few lines: We assume that I can be mapped continuously onto the unit square Q.
If we partition I into four congruent subintervals, then it should be possible to partition Q into four congruent sub-squares, such that each subinterval is mapped continuously onto one of the sub-squares. We can repeat this reasoning by again partitioning each subinterval into four congruent subintervals and doing the same for the respective sub-squares. When repeating this procedure ad infinitum, we have to make sure that the sub-squares are arranged in such a way that adjacent sub-squares correspond to adjacent subintervals. In this way we preserve the overall continuity of the mapping. If an interval corresponds to a square, then its subintervals must correspond to the sub-squares of that square. This inclusion relationship ensures that the mapping of the nth iteration preserves the mapping of the (n-1)th iteration. Now every t ∈ I can be regarded as the limit of a unique sequence of nested closed intervals. To this sequence corresponds a unique sequence of nested closed squares that shrink into a point of Q, the image f_h(t) of t. f_h(I) is called the Hilbert SFC.

If we connect the midpoints of the sub-squares in the nth iteration of the geometric generation procedure in the right order by polygonal lines, we can make the convergence to the Hilbert curve visible. This is done in Figure 1 for the first three iterations. The sequence of piecewise linear functions that converges to the Hilbert SFC is called the sequence of nth order approximating polygons, or the discrete Hilbert curves. We can derive an arithmetic description of the Hilbert curve which allows us to calculate the coordinates of the image point of any t ∈ I using a form of parameter representation ([8, 1]). For this the quaternary representation of t is used and a traversal orientation of Q is defined. Recursively subjecting Q to similarity transforms yields an analytical representation of the Hilbert curve.

Fig. 1. The first three steps (a, b, c) in the generation process of the Hilbert space-filling curve.

There are other SFCs that can be generated by the same recursive principle as the Hilbert curve, such that the mapping in every iteration preserves the mapping of the previous iteration. If we order the sub-squares as in Figure 2, we obtain the Lebesgue SFC, also called the Morton or z-curve. Other popular SFCs that can be encountered in various applications are the Peano curve ([8, 1, 6]) and the Sierpinski curve ([8, 2]). The concept of SFCs can further be generalized to higher dimensional spaces, where I is mapped onto a closed region in a d-dimensional Euclidean space (d ≥ 2).

Fig. 2. The first two steps in the generation of the Lebesgue or z-space-filling curve.

3. Applications of Space-Filling Curves

In the previous chapter we have seen that the Hilbert curve can be regarded as the limit of a uniformly converging sequence of piecewise linear functions. For n ≥ 1 and d ≥ 2, the d-dimensional nth order approximation of the Hilbert curve maps the integer set {0, 1, 2, ..., 2^(nd) - 1} onto the d-dimensional integer space {0, 1, 2, ..., 2^n - 1}^d. This means that the discrete Hilbert curve can be used to define a linear order in which the objects in a multi-dimensional space can be traversed. Additionally, this mapping preserves locality: entities that are neighbors on the interval (that have similar indices) are also neighbors in the d-dimensional space. The reverse is not generally true; neighbors in space can be separated by the inverse mapping.
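To make this discrete mapping concrete, the following Python sketch (illustrative only, not part of the original references) converts an index d in {0, ..., n^2 - 1} into the coordinates (x, y) of the corresponding cell of an n x n grid, i.e. it evaluates the nth order approximation of the two-dimensional Hilbert curve in its common bit-manipulation formulation; the function name d2xy is an arbitrary choice.

```python
def d2xy(n, d):
    """Map a Hilbert index d in {0, ..., n*n - 1} to cell coordinates (x, y)
    on the n x n grid; n must be a power of two."""
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)       # does the current quadrant lie in the right half?
        ry = 1 & (t ^ rx)       # does the current quadrant lie in the upper half?
        if ry == 0:             # rotate/flip the lower quadrants
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx             # move into the selected quadrant
        y += s * ry
        t //= 4
        s *= 2
    return x, y

# The 16 cells of the 4 x 4 grid in Hilbert order (one possible orientation):
print([d2xy(4, d) for d in range(16)])
```

Consecutive indices always map to cells that share an edge, which is exactly the locality property discussed above.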
There are several applications that benefit from such a locality-preserving mapping.

In traditional databases, a multi-attribute data space must be mapped to a one-dimensional disc space. A typical range query can be regarded as a linear traversal through the d-dimensional subspace representing the query region. The linear traversal specifies the order in which all the data objects in the query are fetched from disk. It is highly desirable that objects close together in the multi-dimensional space are also close together in the one-dimensional disc space, since it is more efficient to fetch a set of consecutive disc blocks and thereby avoid additional seek time. Thus the number of disc accesses for a range query can be reduced by a good clustering of the data points in space.

In [7] analytical results on the clustering properties of the Hilbert curve are given with respect to arbitrarily shaped range queries. A query region is represented as a rectilinear polyhedron in a d-dimensional space with finite granularity. This means that every data object corresponds to a grid cell and the rectilinear query region is bounded by a set of polygonal surfaces, each of which is perpendicular to one of the coordinate axes (e.g. a closed piecewise linear curve for d = 2). A cluster is then defined as a group of grid points inside the query that are consecutively connected by a mapping. A cluster thus corresponds to a contiguous run in the disc space, and the number of clusters corresponds to the number of non-consecutive disc accesses. The higher the average number of clusters introduced by a mapping in an arbitrary query region, the more additional seek time is needed. For the Hilbert curve the number of clusters within a d-dimensional polyhedron is equal to the number of entries of the curve into the polyhedron. In [7] it is proven analytically that the average number of clusters is proportional to the total surface area of the polyhedron. These results, together with results from experimental simulations, show that the Hilbert curve outperforms other SFCs like the z-curve in preserving locality.
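The cluster count of a query region can be measured directly on the discrete curve. The sketch below reuses the illustrative d2xy function from Section 2 (and therefore shares its assumptions): it walks the n x n grid in Hilbert order and counts how often the curve enters the query region; the predicate in_query stands for an arbitrary rectilinear query.

```python
def count_clusters(n, in_query):
    """Count the clusters of a query region under the Hilbert order.

    in_query(x, y) -> bool tests membership of a grid cell; every entry of
    the curve into the region starts a new cluster, i.e. a new run of
    consecutive disc blocks."""
    clusters = 0
    previous_inside = False
    for d in range(n * n):
        inside = in_query(*d2xy(n, d))
        if inside and not previous_inside:
            clusters += 1
        previous_inside = inside
    return clusters

# Example: a small rectangular range query on a 16 x 16 grid.
print(count_clusters(16, lambda x, y: 5 <= x <= 7 and 9 <= y <= 10))
```

Replacing d2xy by the corresponding mapping for the z-curve allows the kind of comparison reported in [7].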
In image processing an image is transformed into a bit string, which can subsequently be compressed. It is often desirable that the spatial coherence (similar pixel values) be preserved in the one-dimensional pixel sequence. Scanning techniques using SFCs are an attractive alternative to the standard scan-line method because they exploit this two-dimensional locality. In [4] an algorithm is introduced that computes a family of mappings which exploit the inherent coherence of a particular image as much as possible. This yields even better results than the classical Hilbert scan, which only works well on average.

4. Dynamic Load Balancing Using Space-Filling Curves

4.1. The Load Balancing Problem

Finite difference, finite element and finite volume methods for the solution of partial differential equations are based on meshes or grids. These schemes typically result in a system of linear equations which can be solved by various numerical methods. It is important to mention that the discrete operators are in many cases local, which means that for the evaluation at one grid point only direct neighbors are needed. A natural way of porting such algorithms to a parallel computer is the data distribution approach: the grid is decomposed into several partitions and these are mapped onto the processors, which operate in parallel on their domain portions. The main goals of a good partitioning scheme are load balance and little communication between the processors.

With an adaptive approach, the initial mesh and the numerical method are enhanced during the course of the solution in order to optimize the computational cost for a given level of accuracy. This might be the case if the solution contains shocks or discontinuities, or in time-dependent problems. Enhancements typically involve mesh refinement or coarsening, moving a mesh of fixed topology, and increasing or decreasing the order of the method. In this paper we only consider adaptive grid refinement, which allows changing the grid resolution in certain regions in order to gain accuracy. Parallelism greatly complicates adaptive computation because domain decomposition, data management and inter-processor communication must be dynamic. In the case of adaptive grid refinement the partitioning scheme additionally has to be very fast, and it should introduce only small changes to the existing partition because data migration is often a significant performance factor.

In [11] and [5] a formal description of the load balancing problem using SFCs is given. An inverse SFC f^(-1) can be used to map a d-dimensional domain to the unit interval I. Thus geometric entities such as nodes or grid elements can be mapped to a one-dimensional sequence, and we can solve the resulting one-dimensional partition problem: we divide the interval I into disjoint subintervals I_j of equal workload. By construction this gives perfect load balance, and the separators, i.e. the boundaries of the partitions f(I_j), can be shown to be quasi-optimal. In the performance model the amount of data transferred between processors is proportional to the size of the separators, so we have to minimize the surface-to-volume ratio of the partitions for high parallel efficiency. An estimate for the locality of an SFC is given by the constant C in the definition of Hölder continuity, ||f(s) - f(t)|| ≤ C |s - t|^(1/d) for s, t ∈ I. This constant determines to what extent points close together on I remain close together after being mapped into space. In [11] this estimate is used to prove that partitions formed by most SFCs give satisfactory surface-to-volume ratios, the ratio of the sphere being the lowest. For uniform grids the separator sizes obtained by SFC partitioning are optimal up to a multiplicative constant.
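A minimal sketch of this one-dimensional partition problem (not taken from [11] or [5]; the function name sfc_partition and the data layout are assumptions made for illustration): given grid elements with their SFC indices and workloads, the list is sorted along the curve and cut greedily into subintervals of roughly equal workload.

```python
def sfc_partition(elements, num_parts):
    """Partition grid elements among num_parts processors along an SFC.

    elements is a list of (sfc_index, workload) pairs; returns the partition
    number of every element, listed in SFC order."""
    ordered = sorted(elements, key=lambda e: e[0])   # linearize by SFC index
    total = sum(w for _, w in ordered)
    target = total / num_parts                       # ideal workload per partition
    assignment, part, acc = [], 0, 0.0
    for _, workload in ordered:
        if acc + workload > target * (part + 1) and part < num_parts - 1:
            part += 1                                # start the next subinterval I_j
        assignment.append(part)
        acc += workload
    return assignment

# Example: sixteen elements with unit workload, distributed over three processors.
cells = [(d, 1.0) for d in range(16)]
print(sfc_partition(cells, 3))
```

Because the cuts are made in the one-dimensional index sequence, the locality of the SFC translates the subintervals back into geometrically compact subdomains.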
[9] presents a performance characterization of dynamic load balancing techniques for distributed adaptive grid hierarchies. The authors introduce three metrics to evaluate six load balancing schemes. These metrics measure the load balance and the time spent to achieve it, the communication overhead between computational nodes, and the data re-distribution after each refinement step. Experimental evaluation during a simulation test shows that the load balancing scheme based on SFCs provides good performance. Only the iterative tree balancing scheme, a graph partitioner, gave better results with respect to the above metrics.

4.2. Load Balancing Algorithms Using Space-Filling Curves

There exist many different implementations of grid partitioning algorithms based on SFCs, often depending on the type of grid ([10, 2, 3]). In [10] a structured adaptive grid hierarchy is cut into rectangular blocks. The resulting block graph is then partitioned using SFC partitioning (SP). Various ways of carrying out the initial block partition are investigated in order to obtain a block graph that is suitable for SP. The heuristics used in the assessment rest on two assumptions that are generally made when dealing with SP: first, a graph node (domain block) communicates more with neighboring nodes than with distant nodes; second, the SFC gives neighboring nodes nearby indices. The conclusion is that the block graphs are optimal for SP when the blocks are of equal size, or of equal workload, irrespective of the number of refinement levels within a block and irrespective of the way in which the domain is cut. In this approach SP is thus considered as one part of a two-step partitioning scheme. Algorithms presented in more recent papers do not single out a first block partitioning step, but apply the indexing directly to the grid elements ([2] for the pamatos library using triangle bisection) or to rectangular quadrants of equal workload ([3] for the Zoltan library using octrees, §4.3) during mesh generation. These methods can be used for unstructured irregular grids.

To create an enumeration of the grid cells according to an SFC ordering there are, roughly speaking, two possible approaches. In the first, a standard continuous SFC is superimposed onto the grid, and each node or grid cell center is mapped onto an iterate of the SFC. For Lebesgue/Morton-type SFCs this inverse mapping is computed easily by interleaving the bits of a finite binary representation of the node coordinates (see the sketch at the end of this section). The list of nodes, sorted by their position on the SFC, is then cut into pieces with an equal number of nodes and mapped to different processors. Although similar procedures exist for the Hilbert curve ([1]), a second approach yields a much faster and more elegant algorithm: we first create an enumeration of the grid dictated by a discrete SFC. When we refine the grid by substituting a coarse grid element E_j by several smaller elements E_j,k, the enumeration is changed such that it cycles through these new elements first, before going on to E_j+1. For the enumeration we thus use several iterates of the SFC, depending on the local level of refinement, and the enumeration is produced in parallel with the grid construction.
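To illustrate the first approach, the following sketch computes the Morton index of a cell by interleaving the bits of its integer coordinates (the function name morton_index is chosen for illustration; a d-dimensional version interleaves d coordinates in the same way).

```python
def morton_index(x, y, bits):
    """Interleave the lowest `bits` bits of x and y into a z-curve index.

    Bit k of x becomes bit 2k of the index and bit k of y becomes bit 2k + 1,
    so cells that share leading coordinate bits share a common index prefix."""
    index = 0
    for k in range(bits):
        index |= ((x >> k) & 1) << (2 * k)
        index |= ((y >> k) & 1) << (2 * k + 1)
    return index

# The 16 cells of the 4 x 4 grid in z-curve order (orientation may differ from Figure 2):
cells = sorted(((xx, yy) for xx in range(4) for yy in range(4)),
               key=lambda c: morton_index(c[0], c[1], 2))
print(cells)
```

Sorting the grid cells by this key and cutting the sorted list into pieces of equal size or workload already realizes the partitioning described above.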
4.3. Dynamic Octree Load Balancing Using Space-Filling Curves

As an illustration of the use of SFCs in load balancing algorithms for adaptive distributed grid hierarchies, I present here in more detail the dynamic octree load balancing method as implemented in the Zoltan dynamic load balancing library ([3]). In this method the grid is associated with an octree (a quadtree in two dimensions), a recursively defined hierarchical data structure. The technique exploits similarities between the octree and the SFC construction. In octree-based mesh generation, the entire computational domain is embedded in a cubic (square) universe which is represented by the root octant (quadrant). Octant refinement is the replacement of a terminal octant by an interior octant with eight (four for quadtrees) new terminal octants as children, allowing for greater resolution in parts of the domain. In two dimensions this refinement corresponds to the bisection of a quadrant into four sub-quadrants. Octrees can also be constructed for unstructured meshes generated by other procedures, by associating a grid element with the octant containing its center. Figure 3 shows a 40-element triangular mesh in which each leaf quadrant contains at most 5 grid elements.

A depth-first traversal (DFT) of the octree determines all subtree costs; the cost can be, for example, the number of elements in the subtree. The optimal partition size (OPS) is the total cost divided by the number of partitions. A second DFT adds octants to the current partition as long as their inclusion does not exceed the OPS. Figure 3 shows the result of this process for 3 partitions. This second DFT defines a one-dimensional ordering of the leaf octants, which is divided into segments corresponding to the partitions (a simplified sketch of these two traversals is given at the end of this section). For scalability, an octree used for dynamic repartitioning must be distributed across the cooperating processes and its construction must happen in parallel. The octree method specifies how each process determines its own range of leaf octants and how each process afterwards calculates the cost of its local subtree. After the local traversals are complete, data is migrated to the assigned destination processes in order to meet the OPS.

Fig. 3. Three-way quadtree partitioning of a 40-element triangular grid.

Partitions are thus formed from contiguous segments of the linearization of the leaf octants of the octree. SFCs can now be used to dictate the leaf traversal of the distributed octree. For this purpose a string-rewriting rule is applied: the traversal ordering and the partitioning are determined by reading a string of octant indices from beginning to end. In order to achieve a Hilbert or Morton ordering of the leaf octants we start from a simple initial template for the first refinement level. For the Hilbert curve this template can have four different orientations; Figure 4 (left) shows one of them. Upon refinement to the next level, a unique translation rule determines the ordering and the orientation of the offspring. The indices of the terminal octants are stored in a string that is rewritten upon each refinement. Rewriting means that each string entry is replaced by eight (four) new ones. The final curve is determined by recursive application of the rewriting rule; in the case of adaptive refinement only certain string entries are replaced. Figure 5 gives the result of the octant traversal for the grid of Figure 3 for Hilbert and Morton ordering. The index string representing the Hilbert ordering in the left of Figure 5 is {00 02 03 01 1 3 23 21 202 201 200 203 22}. Notice that the number of digits that make up an index indicates the local refinement level of the terminal octant. The transformation rules for refining a parent with a certain orientation are encoded in ordering and orientation tables. These tables correspond to the grammar rules presented in [1].

Fig. 4. Generation of the first two levels of the two-dimensional Hilbert ordering using ordering and orientation tables. The number in the lower left corner of each quadrant is its index.

The results of a comparative performance test using Hilbert, Morton and Gray code traversals of the octree are presented in [3] for a variety of mesh sizes with different geometric complexity. Hilbert ordering produced a superior surface index, which is the ratio between the surface and the volume of a partition and a measure of the inter-process communication volume. Inter-process adjacency is the percentage of other processes with which each process must communicate; Hilbert ordering produced the lowest values here as well, but some mesh structures did not favor any ordering, which emphasizes the influence of the specific problem on the load balancing performance. The intra-process connectivity measures the number of disjoint regions assigned to a given process and is closely related to the surface index. Morton and Gray code orderings achieve less in this respect because they tend to produce disconnected partitions (see Figure 5). Finally, the total solution time is regarded as the most important measure. Because of the improved surface index and inter-process connectivity, Hilbert ordering gave the lowest average execution time.

Fig. 5. Hilbert (left) and Morton (right) ordering of an adaptively refined grid.
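The two traversals described above can be summarized by a simplified serial sketch for a quadtree; the class Quadrant and the helper functions are illustrative assumptions and leave out the distributed bookkeeping of the actual Zoltan implementation. The children of every interior node are assumed to be stored already in the SFC-dictated order.

```python
class Quadrant:
    def __init__(self, cost=0.0, children=None):
        self.cost = cost                     # cost of the elements in a leaf
        self.children = children or []       # ordered according to the SFC template

def subtree_cost(q):
    """First DFT: accumulate the cost of every subtree."""
    if q.children:
        q.cost = sum(subtree_cost(c) for c in q.children)
    return q.cost

def partition(root, num_parts):
    """Second DFT: cut the SFC-ordered leaves into segments of roughly OPS cost."""
    ops = subtree_cost(root) / num_parts     # optimal partition size
    assignment, state = {}, {"part": 0, "acc": 0.0}

    def visit(q):
        if not q.children:                   # leaf octant
            if state["acc"] + q.cost > ops and state["part"] < num_parts - 1:
                state["part"] += 1           # current partition is full, open the next one
                state["acc"] = 0.0
            assignment[id(q)] = state["part"]
            state["acc"] += q.cost
        for c in q.children:
            visit(c)

    visit(root)
    return assignment
```

The actual method can also assign a whole subtree at once when its cost still fits into the current partition; the leaf-by-leaf version above omits this optimization for brevity.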
5. Conclusion and Remarks

In this paper I have presented a short survey of the concept of space-filling curves and their use in applications where it is important to define a linear order on multi-dimensional data. I introduced the Hilbert curve as one of the most widely used representatives and compared its performance to that of other SFCs, such as the z-curve, in applications like database management and load balancing.

When it comes to the data parallelization of a PDE solver, load balancing methods using SFCs present a less complex and faster alternative to other partitioning methods such as graph partitioning. A partitioning algorithm based on SFCs simply chooses an interval of indices, which translates into a geometric subdomain. The resulting partitions can be proven to be not only perfectly balanced but also to have almost optimal separators. SFCs are especially well suited for adaptively refined grid hierarchies because of their self-similar and recursive nature. As an example of such a load balancing technique I presented the octree method with Hilbert ordering.

Besides the Hilbert curve, other SFCs can be considered; the Peano ([6]) and Sierpinski ([2]) curves are known to have competitive properties. As a further extension and optimization of the considered partitioning methods, the SFC traversal of the grid elements can be used as the processing order of the elements during the iterative solution of the PDE ([6]). The position of a grid element on the SFC can also be used as a key in a key-based addressing scheme ([5]). In both cases the locality of the SFC indexing can improve the use of cache hierarchies in modern computers.

References

[1] M. Bader, Raumfüllende Kurven, accompanying lecture notes for the corresponding chapter of the course "Algorithmen des Wissenschaftlichen Rechnens", Technische Universität München, 2004.
[2] J. Behrens and J. Zimmermann, Parallelizing an unstructured grid generator with a space-filling curve approach, in: Euro-Par 2000 Parallel Processing, 6th International Euro-Par Conference, Munich, Germany, August/September 2000, Proceedings (A. Bode, T. Ludwig, W. Karl, and R. Wismüller, eds.), Lecture Notes in Computer Science, vol. 1900, Springer-Verlag, Berlin, 2000, pp. 815-823.
[3] P. C. Campbell, K. D. Devine, J. E. Flaherty, L. G. Gervasio, and J. D. Teresco, Dynamic Octree Load Balancing Using Space-Filling Curves, Williams College Department of Computer Science, Technical Report CS-03-01, 2003.
[4] R. Dafner, D. Cohen-Or, and Y. Matias, Context-based space filling curves, Computer Graphics Forum, 19:209-218, 2000.
[5] M. Griebel and G. Zumbusch, Hash based adaptive parallel multilevel methods with space-filling curves, in: H. Rollnik and D. Wolf (eds.), NIC Symposium 2001, NIC Series, vol. 9, ISBN 3-00-009055-X, Forschungszentrum Jülich, Germany, 2002, pp. 479-492.
[6] F. Günther, M. Mehl, M. Pögl, and C. Zenger, A cache-aware algorithm for PDEs on hierarchical data structures based on space-filling curves, SIAM Journal on Scientific Computing, submitted.
[7] B. Moon, H. V. Jagadish, C. Faloutsos, and J. H. Saltz, Analysis of the clustering properties of the Hilbert space-filling curve, IEEE Transactions on Knowledge and Data Engineering, 13(1):124-141, January/February 2001.
[8] H. Sagan, Space-Filling Curves, Springer-Verlag, New York, 1994.
[9] M. Shee, S. Bhavsar, and M. Parashar, Characterizing the Performance of Dynamic Distribution and Load-Balancing Techniques for Adaptive Grid Hierarchies, in: Proceedings of the IASTED International Conference on Parallel and Distributed Computing and Systems, Cambridge, Massachusetts, USA, November 3-6, 1999.
[10] J. Steensland, Dynamic Structured Grid Hierarchy Partitioners Using Space-Filling Curves, Technical Report, IT series 2001-002, Uppsala University, 2001.
[11] G. Zumbusch, Adaptive Parallel Multilevel Methods for Partial Differential Equations, Habilitation, Universität Bonn, 2001.