COT 5410 – Spring 2004
Domain decomposition in parallel
computing
Ashok Srinivasan
www.cs.fsu.edu/~asriniva
Florida State University
Outline
• Background
• Geometric partitioning
• Graph partitioning
– Static
– Dynamic
• Important points
Background
• Tasks in a parallel computation need access to
certain data
• Same datum may be needed by multiple tasks
– Example: In matrix-vector multiplication c = Ab, b2 is needed for the
computation of every ci, 1 ≤ i ≤ n
– If a process does not “own” a datum needed by its task, then
it has to get it from a process that has it
• This communication is expensive
– Aims of domain decomposition
• Distribute the data in such a manner that the required
communication is minimized
• Ensure that the computational loads on processes are
balanced
Domain decomposition example
• Finite difference computation
– New value of a node depends on old values of its
neighbors
• We want to divide the nodes amongst the processes so that
– Communication is minimized
• This is the measure of partition quality
– Computational load is evenly balanced
Geometric partitioning
• Partition a set of points
– Uses only coordinate information
• Balances the load
– The heuristic tries to ensure that communication
costs are low
• Algorithms are typically fast, but the partitions are not
of high quality
• Examples
– Orthogonal recursive bisection
– Inertial
– Space filling curves
Orthogonal recursive bisection
• Recursively bisect orthogonal to the longest
dimension
– Assume communication is proportional to the surface area of
the domain, and aligned with coordinate axes
– Recursive bisection
• Divide into two pieces, keeping load balanced
• Apply recursively, until desired number of partitions obtained
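Below is a minimal Python sketch of ORB (illustrative only: the function names are invented here, and the number of parts is assumed to be a power of two). It finds the longest coordinate direction, sorts the points along it, and splits at the median to keep the load balanced.

```python
import numpy as np

def orb(points, indices, n_parts):
    """Orthogonal recursive bisection of a point set (n_parts a power of two)."""
    if n_parts == 1:
        return [indices]
    coords = points[indices]
    # Bisect orthogonal to the longest dimension of the bounding box.
    dim = np.argmax(coords.max(axis=0) - coords.min(axis=0))
    order = indices[np.argsort(coords[:, dim])]
    half = len(order) // 2            # median split keeps the load balanced
    return (orb(points, order[:half], n_parts // 2) +
            orb(points, order[half:], n_parts // 2))

# Example: partition 1000 random 2-D points among 4 processes.
pts = np.random.rand(1000, 2)
parts = orb(pts, np.arange(len(pts)), 4)
```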
Inertial
• ORB may not be
effective if cuts along the
x, y, or z directions are
not good ones
• Inertial
– Recursively bisect
orthogonal to the inertial
axis
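A minimal sketch of one inertial bisection step (NumPy representation assumed): the inertial axis is taken as the dominant eigenvector of the covariance matrix of the centered coordinates, and the points are split at the median of their projections onto it.

```python
import numpy as np

def inertial_bisect(points):
    """Bisect a point set orthogonal to its inertial (principal) axis."""
    centered = points - points.mean(axis=0)
    # Dominant eigenvector of the covariance matrix = inertial axis.
    _, eigvecs = np.linalg.eigh(centered.T @ centered)
    axis = eigvecs[:, -1]             # eigh sorts eigenvalues ascending
    proj = centered @ axis            # project points onto the axis
    median = np.median(proj)
    return proj <= median, proj > median

pts = np.random.rand(500, 2)
left, right = inertial_bisect(pts)    # two boolean masks, balanced load
```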
Space filling curves
• Space filling curves
– A continuous curve that fills the space
– Order the points based on their relative
position on the curve
– Choose a curve that preserves proximity
• Points that are close in space should be
close in the ordering too
• Example
– Hilbert curve
Hilbert curve
[Figure: Hilbert curve construction – H1, H2, and the refinement step from Hi to Hi+1; the Hilbert curve is the limit of Hn as n → ∞]
Sources
– http://www.dcs.napier.ac.uk/~andrew/hilbert.html
– http://www.fractalus.com/kerry/tutorials/hilbert/hilbert-tutorial.html
Domain decomposition with a space
filling curve
• Order points based on their position on the
curve
• Divide into P parts
– P is the number of processes
• Space filling curves can be used in adaptive
computations too
• They can be extended to higher dimensions
too
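As a sketch, the classic bit-manipulation algorithm for the 2-D Hilbert index (described in the sources above) can drive such a partition; the grid size n must be a power of two, and the function names here are illustrative.

```python
def xy2d(n, x, y):
    """Hilbert-curve index of grid point (x, y) on an n-by-n grid (n a power of two)."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) > 0 else 0
        ry = 1 if (y & s) > 0 else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so lower levels connect up correctly.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d

def sfc_partition(points, n, P):
    """Order grid points along the Hilbert curve, then cut into P contiguous runs."""
    order = sorted(points, key=lambda p: xy2d(n, p[0], p[1]))
    chunk = -(-len(order) // P)       # ceiling division
    return [order[i * chunk:(i + 1) * chunk] for i in range(P)]

# Example: a 16x16 grid divided among 4 processes.
parts = sfc_partition([(x, y) for x in range(16) for y in range(16)], 16, 4)
```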
Graph partitioning
• Model as graph partitioning
– Graph G = (V, E)
– Each task is represented by a vertex
• A weight can be used to represent the computational effort
– An edge exists between tasks if one needs data owned by
the other
• Weights can be associated with edges too
– Goal
• Partition vertices into P parts such that each partition has equal
vertex weights
• Minimize the weights of edges cut
• The problem is NP-hard
– Edge cut metric
• Judge the quality of the partitioning by the number of edges cut
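The edge cut metric itself is simple to compute; a small sketch, assuming the graph is given as a weighted edge list and the partition as a vertex-to-part map:

```python
def edge_cut(edges, part):
    """Total weight of edges whose endpoints lie in different partitions."""
    return sum(w for (u, v, w) in edges if part[u] != part[v])

# Example: a unit-weight triangle with vertices 0, 1 in one part and 2 in the other.
edges = [(0, 1, 1), (1, 2, 1), (0, 2, 1)]
print(edge_cut(edges, {0: 0, 1: 0, 2: 1}))   # 2 edges cross the cut
```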
Static graph partitioning
• Combinatorial
– Levelized nested dissection
– Kernighan-Lin/Fiduccia-Mattheyses
• Spectral partitioning
• Multi-level methods
Combinatorial partitioning
• Use only connectivity information
• Examples
– Levelized nested dissection
– Kernighan-Lin/Fiduccia-Mattheyses
Levelized nested dissection (LND)
• Idea is similar to the geometric methods
– But cannot use coordinate information
– Instead of projecting vertices along the longest
axis, order them based on distance from a vertex
that may be one extreme of the longest dimension
of a graph
• Pseudo-peripheral vertex
– Perform a breadth-first search, starting from an arbitrary
vertex
– The vertex that is encountered last might be a good
approximation to a peripheral vertex
LND example – Finding a pseudoperipheral vertex
[Figure: BFS levels (1–4) spreading from an initial vertex; the last vertex encountered, at level 4, is the pseudoperipheral vertex]
LND example – Partitioning
[Figure: vertices labeled with BFS levels (1–6) from the initial vertex; the first half of the level ordering forms one partition. Recursively bisect the subgraphs.]
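A compact sketch of LND (connected graph assumed, given as a dictionary from vertex to neighbor set; names are illustrative). The BFS helper returns both the level of each vertex and the last vertex encountered, which serves as the pseudoperipheral vertex.

```python
from collections import deque

def bfs_levels(adj, start):
    """BFS from start: return each vertex's level and the last vertex reached."""
    level = {start: 0}
    queue = deque([start])
    last = start
    while queue:
        u = queue.popleft()
        last = u
        for v in adj[u]:
            if v not in level:
                level[v] = level[u] + 1
                queue.append(v)
    return level, last

def lnd_bisect(adj):
    """Levelized nested dissection: order vertices by BFS level from a
    pseudoperipheral vertex, then cut the ordering in half."""
    _, pseudo = bfs_levels(adj, next(iter(adj)))   # far vertex ~ periphery
    level, _ = bfs_levels(adj, pseudo)
    order = sorted(level, key=level.get)
    half = len(order) // 2
    return set(order[:half]), set(order[half:])
```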
Kernighan-Lin/Fiduccia-Mattheyses
• Refines an existing partition
• Kernighan-Lin
– Consider pairs of vertices from different partitions
– Choose a pair whose swapping will result in the best
improvement in partition quality
• The best improvement may actually be a worsening
– Perform several passes
• Choose best partition among those encountered
• Fiduccia-Mattheyses
– Similar but more efficient
• Boundary Kernighan-Lin
– Consider only boundary vertices to swap
• ... and many other variants
Kernighan-Lin example
[Figure: an existing partition with edge cut 4; swapping one pair of vertices across the cut yields a better partition with edge cut 3]
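The sketch below captures the flavor of a Kernighan-Lin pass, but simplified: unlike full KL, which tentatively accepts worsening swaps and keeps the best partition seen during a pass, this greedy version stops at the first local optimum. The edge-list/vertex-set representation is assumed for illustration.

```python
def cut(edges, A):
    """Edge cut of the bisection (A, V - A)."""
    return sum(1 for (u, v) in edges if (u in A) != (v in A))

def kl_pass(edges, A, B):
    """Greedily swap the vertex pair giving the largest cut reduction,
    until no swap improves the partition (simplified Kernighan-Lin)."""
    A, B = set(A), set(B)
    while True:
        base = cut(edges, A)
        # Gain of swapping a and b = reduction in edge cut.
        best = max(((base - cut(edges, (A - {a}) | {b}), a, b)
                    for a in A for b in B), default=None)
        if best is None or best[0] <= 0:
            return A, B
        _, a, b = best
        A.remove(a); A.add(b)
        B.remove(b); B.add(a)
```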
Spectral method
• Based on the observation that a Fiedler
vector of a graph contains connectivity
information
• Laplacian of a graph: L
– Lii = di (degree of vertex i)
– Lij = -1 if edge {i, j} exists, otherwise 0
• The smallest eigenvalue of L is 0, with eigenvector (1, 1, ..., 1)
• All other eigenvalues are positive for a connected graph
• Fiedler vector
– Eigenvector corresponding to the second smallest
eigenvalue
Fiedler vector
• Consider a partitioning of V into A and B
– Let yi = 1 if vi ∈ A, and yi = -1 if vi ∈ B
– For load balance, Σi yi = 0
– Also Σ{i,j}∈E (yi - yj)^2 = 4 × (number of edges across partitions)
– Also, y^T L y = Σi di yi^2 - 2 Σ{i,j}∈E yi yj = Σ{i,j}∈E (yi - yj)^2
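A quick numeric check of these identities on a 4-vertex path graph (chosen here just for illustration), with partition A = {v1, v2} and B = {v3, v4}:

```python
import numpy as np

# Path graph 0-1-2-3: L = D - A, y encodes the partition {0,1} vs {2,3}.
A = np.array([[0,1,0,0],[1,0,1,0],[0,1,0,1],[0,0,1,0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A
y = np.array([1.0, 1.0, -1.0, -1.0])
edges = [(0, 1), (1, 2), (2, 3)]
lhs = y @ L @ y
rhs = sum((y[i] - y[j]) ** 2 for (i, j) in edges)
print(lhs, rhs)    # both 4.0 = 4 x (1 edge across the partitions)
```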
Optimization problem
• The optimal partition is obtained by solving
– Minimize y^T L y
– Constraints:
• yi ∈ {-1, 1}
• Σi yi = 0
– This is NP-hard
• Relaxed problem
– Minimize y^T L y
– Constraints:
• Σi yi = 0
• Add a constraint on a norm of y, for example ||y||2 = n^0.5
– Note
• (1, 1, ..., 1)^T is an eigenvector with eigenvalue 0
• For a connected graph, all other eigenvalues are positive, and their
eigenvectors are orthogonal to (1, 1, ..., 1)^T, which implies Σi yi = 0
• The objective function is then minimized by a Fiedler vector
Spectral algorithm
• Find a Fiedler vector of the Laplacian of the graph
– Note that the Fiedler value (the second smallest eigenvalue)
yields a lower bound on the communication cost, when the
load is balanced
• From the Fiedler vector, bisect the graph
– Let all vertices with components in the Fiedler vector greater
than the median be in one component, and the rest in the
other
• Recursively apply this to each partition
• Note: Finding the Fiedler vector of a large graph can
be time consuming
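A dense-matrix sketch of the algorithm (fine for small graphs; a large graph would need a sparse iterative eigensolver such as Lanczos, which is exactly why this step can be time consuming):

```python
import numpy as np

def spectral_bisect(adjacency):
    """Split vertices by the median entry of the Fiedler vector of L = D - A."""
    A = np.asarray(adjacency, dtype=float)
    L = np.diag(A.sum(axis=1)) - A
    _, eigvecs = np.linalg.eigh(L)        # eigenvalues in ascending order
    fiedler = eigvecs[:, 1]               # vector of the 2nd smallest eigenvalue
    median = np.median(fiedler)
    return np.where(fiedler > median)[0], np.where(fiedler <= median)[0]

# Example: the path graph 0-1-2-3 splits into {0, 1} and {2, 3}.
path = np.array([[0,1,0,0],[1,0,1,0],[0,1,0,1],[0,0,1,0]])
print(spectral_bisect(path))
```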
Multilevel methods
• Idea
– It takes time to partition a large graph
– So partition a small graph instead!
• Three phases
– Graph coarsening
• Combine vertices to create a smaller graph
– Example: Find a suitable matching
• Apply this recursively until a suitably small graph is obtained
– Partitioning
• Use spectral or another partitioning algorithm to partition the
small graph
– Multilevel refinement
• Uncoarsen the graph to get a partitioning of the original graph
• At each level, perform some graph refinement
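The coarsening phase can be sketched as follows, using a greedy heavy-edge matching (one common choice for the matching step). The representation, integer vertex ids with weights keyed by sorted edge tuples, is an assumption made here for illustration.

```python
def coarsen(adj, w):
    """One coarsening level: greedily match each vertex with its heaviest
    unmatched neighbor, then collapse matched pairs into coarse vertices."""
    coarse_of, matched, nxt = {}, set(), 0
    for u in sorted(adj):
        if u in matched:
            continue
        free = [v for v in adj[u] if v not in matched]
        v = max(free, key=lambda v: w[tuple(sorted((u, v)))]) if free else None
        for x in ({u} if v is None else {u, v}):
            matched.add(x)
            coarse_of[x] = nxt
        nxt += 1
    # Coarse edge weights: sum fine weights between distinct coarse vertices.
    cw = {}
    for (a, b), wt in w.items():
        ca, cb = coarse_of[a], coarse_of[b]
        if ca != cb:
            key = tuple(sorted((ca, cb)))
            cw[key] = cw.get(key, 0) + wt
    cadj = {}
    for (a, b) in cw:
        cadj.setdefault(a, set()).add(b)
        cadj.setdefault(b, set()).add(a)
    return cadj, cw, coarse_of

# Example: a 4-cycle with unit weights coarsens to 2 vertices joined by a weight-2 edge.
adj = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}
w = {(0, 1): 1, (1, 2): 1, (2, 3): 1, (0, 3): 1}
print(coarsen(adj, w))
```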
Multilevel example (without refinement)
[Figure sequence: a 16-vertex graph is repeatedly coarsened by collapsing matched vertices, with the combined vertex and edge weights shown at each level]
Dynamic partitioning
• We have an initial partitioning
– Now, the graph changes
– Determine a good partition, fast
– Also minimize the number of vertices
that need to be moved
• Examples
– PLUM
– Jostle
– Diffusion
PLUM
• Partition based on the initial mesh
– Only the vertex and edge weights change
• Map partitions to processors
– Use more partitions than processors
• Ensures finer granularity
– Compute a similarity matrix based on data already on a
process
• Measures the savings in data redistribution cost for each (process,
partition) pair
• Choose assignment of partitions to processors
– Example: Maximum weight matching
» Duplicate each processor: # of partitions/P times
– Alternative: Greedy approximation algorithm
» Assign in order of maximum similarity value
• http://citeseer.nj.nec.com/oliker98plum.html
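The greedy approximation is easy to sketch: walk (process, partition) pairs in decreasing order of similarity and assign while capacity remains. The similarity dictionary below is an assumed stand-in for PLUM's actual similarity matrix.

```python
def greedy_assign(similarity, num_procs, parts_per_proc):
    """Assign partitions to processors in order of maximum similarity value,
    giving each processor at most parts_per_proc partitions."""
    pairs = sorted(((s, p, q) for (p, q), s in similarity.items()), reverse=True)
    load = {p: 0 for p in range(num_procs)}
    owner = {}
    for s, p, q in pairs:
        if q not in owner and load[p] < parts_per_proc:
            owner[q] = p      # keep partition q where the redistribution savings are largest
            load[p] += 1
    return owner

# Example: 2 processors, 4 partitions (2 per processor), made-up similarities.
sim = {(p, q): 10 - abs(2 * p + 1 - q) for p in range(2) for q in range(4)}
print(greedy_assign(sim, 2, 2))
```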
JOSTLE
• Use Hu and Blake’s scheme for load balancing
– Solve Lx = b using conjugate gradient
• L = Laplacian of the processor graph, bi = weight on process
Pi - average weight
– Move max(xi - xj, 0) weight between Pi and Pj
• Leads to a balanced load
– Equivalent to Pi sending xi load to each neighbor Pj, and each
neighbor Pj sending xj to Pi
– Net loss in load for Pi = di xi - Σ(neighbors j) xj = L(i) x = bi
» where L(i) is row i of L, and di is the degree of Pi
– New load for Pi = weight on Pi - bi = average weight
• Leads to the minimum L2 norm of load moved
– Using max(xi - xj, 0)
• Select vertices to move, based on relative gain
– http://citeseer.nj.nec.com/walshaw97parallel.html
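A sketch of the Hu-Blake solve on the processor graph, using SciPy's conjugate gradient (the processor graph and loads below are assumed inputs). L is singular, but CG converges here because b, the loads minus their average, is orthogonal to L's null space of constant vectors.

```python
import numpy as np
from scipy.sparse.linalg import cg

def hu_blake_flows(adj, loads):
    """Solve L x = b on the processor graph, then move max(xi - xj, 0)
    weight across each edge (i, j)."""
    P = len(loads)
    A = np.zeros((P, P))
    for i, nbrs in adj.items():
        for j in nbrs:
            A[i, j] = 1.0
    L = np.diag(A.sum(axis=1)) - A
    b = np.asarray(loads, dtype=float)
    b -= b.mean()                      # b_i = weight on P_i - average weight
    x, info = cg(L, b)                 # CG works: b is orthogonal to the null space
    return {(i, j): max(x[i] - x[j], 0.0) for i in adj for j in adj[i]}

# Example: 4 processors in a ring, all load initially on processor 0.
print(hu_blake_flows({0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]},
                     [10, 0, 0, 0]))
```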
Diffusion
• Involves only communication with neighbors
• A simple scheme
– Processor Pi repeatedly sends α·wi weight to each
neighbor
• wi = weight on Pi
• w^k = (I - α L) w^(k-1), where w^k is the weight vector at iteration k
– Simple criteria exist for choosing α to ensure convergence
» Example: α = 0.5/(maxi di)
• More sophisticated schemes exist
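The diffusion iteration is a one-line matrix recurrence; below, a small sketch on a ring of 4 processors with α = 0.5/(max degree) = 0.25 (example values chosen here for illustration):

```python
import numpy as np

def diffuse(adjacency, loads, alpha, steps):
    """Iterate w^k = (I - alpha*L) w^(k-1): each step moves alpha*(wi - wj)
    net load across every edge (i, j)."""
    A = np.asarray(adjacency, dtype=float)
    L = np.diag(A.sum(axis=1)) - A
    M = np.eye(len(loads)) - alpha * L
    w = np.asarray(loads, dtype=float)
    for _ in range(steps):
        w = M @ w
    return w

ring = np.array([[0,1,0,1],[1,0,1,0],[0,1,0,1],[1,0,1,0]])
print(diffuse(ring, [10, 0, 0, 0], 0.25, 50))   # converges toward [2.5, 2.5, 2.5, 2.5]
```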
Important points
• Goals of domain decomposition
– Balance the load
– Minimize communication
• Space filling curves
• Graph partitioning model
– Spectral method
• Relax the NP-hard integer optimization to a real-valued problem, then
discretize to get an approximate integer solution
– Multilevel methods
• Three phases
• Dynamic partitioning – additional requirements
– Use old solution to find new one fast
– Minimize number of vertices moved