Parallel Graph Algorithms
Sathish Vadhiyar
Graph Traversal
 Graph search plays an important role in
analyzing large data sets
 Relationship between data objects
represented in the form of graphs
 Breadth first search used in finding
shortest path or sets of paths
Level-synchronized algorithm
 Proceeds level-by-level starting with the
source vertex
 Level of a vertex – its graph distance
from the source
 How to decompose the graph (vertices,
edges and adjacency matrix) among
processors?
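A minimal sequential sketch of the level-synchronized traversal, in Python. The function name bfs_levels and the adjacency-dictionary input are assumptions made for illustration; the slides do not prescribe a data layout.

```python
def bfs_levels(adj, source):
    """Return {vertex: level} where level is the graph distance from source."""
    level = {source: 0}
    frontier = [source]
    depth = 0
    while frontier:
        depth += 1
        next_frontier = []
        # Every vertex of the current level is expanded before the next level starts.
        for v in frontier:
            for w in adj[v]:
                if w not in level:
                    level[w] = depth
                    next_frontier.append(w)
        frontier = next_frontier
    return level


# Small undirected example graph (hypothetical data).
adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
levels = bfs_levels(adj, 0)
```

The loop over frontier is the part that a parallel version distributes; the partitioning question above decides which processor runs which iterations.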
Distributed BFS with 1D
Partitioning
 Each vertex and edges emanating from it
are owned by one processor
 1-D partitioning of the adjacency matrix
 Edges emanating from vertex v form its
edge list = list of vertex indices in row v
of adjacency matrix A
1-D Partitioning
 At each level, each processor owns a set F –
set of frontier vertices owned by the
processor
 Edge lists of vertices in F are merged to
form a set of neighboring vertices, N
 Some vertices of N owned by the same
processor, while others owned by other
processors
 Messages are sent to those processors to
add these vertices to their frontier set for
the next level
L_vs(v) – level of v, i.e.,
graph distance from the
source vs
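The 1-D scheme can be simulated inside a single Python process, as a rough sketch. The cyclic owner function (v mod P), the name bfs_1d and the message counter are assumptions for illustration; a real implementation would exchange the inbox lists through message passing.

```python
def bfs_1d(adj, source, num_procs):
    """Simulate level-synchronized BFS with 1-D partitioning; return (levels, messages)."""
    def owner(v):
        return v % num_procs  # assumed cyclic ownership of vertices

    level = {source: 0}
    frontier = [[] for _ in range(num_procs)]  # F on each processor
    frontier[owner(source)].append(source)
    depth = 0
    messages_sent = 0
    while any(frontier):
        depth += 1
        inbox = [[] for _ in range(num_procs)]
        for p in range(num_procs):
            # Merge the edge lists of the local frontier into the neighbor set N.
            neighbors = {w for v in frontier[p] for w in adj[v]}
            for w in neighbors:
                q = owner(w)
                if q != p:
                    messages_sent += 1  # w must be sent to its remote owner
                inbox[q].append(w)
        frontier = [[] for _ in range(num_procs)]
        for q in range(num_procs):
            for w in set(inbox[q]):
                if w not in level:  # only the owner labels its own vertices
                    level[w] = depth
                    frontier[q].append(w)
    return level, messages_sent


adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
levels, sent = bfs_1d(adj, 0, 2)
```

With one processor no messages are counted; with more processors the count grows with the number of edges that cross partitions.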
2D Partitioning
 P = R x C processor mesh
 Adjacency matrix divided into R·C block rows and C
block columns
 A(i,j)(*) denotes a block owned by (i,j) processor;
each processor owns C blocks
2D Partitioning
 Processor (i,j) owns vertices belonging to
block row (j-1)·R+i
 Thus a process stores some edges
incident on its vertices, and some edges
that are not
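The ownership rule can be checked with a few lines of Python. This is a sketch under an assumption: the C blocks of processor (i,j) are taken to lie in block column j and block rows (s-1)·R+i for s = 1..C, which is the layout used in the Yoo et al. paper listed in the references; the slides only state that each processor owns C blocks. The function names are hypothetical.

```python
def owned_blocks(i, j, R, C):
    """(block_row, block_col) pairs held by processor (i, j); all indices 1-based."""
    return [((s - 1) * R + i, j) for s in range(1, C + 1)]


def owned_vertex_block(i, j, R):
    """Block row of the vertices owned by processor (i, j), 1-based."""
    return (j - 1) * R + i


R, C = 2, 3
all_blocks = [b
              for i in range(1, R + 1)
              for j in range(1, C + 1)
              for b in owned_blocks(i, j, R, C)]
vertex_blocks = [owned_vertex_block(i, j, R)
                 for i in range(1, R + 1)
                 for j in range(1, C + 1)]
```

Under this layout every one of the R·C x C blocks has exactly one holder, and every block row of vertices has exactly one owner. Only one of a processor's C blocks lies in the block row of its own vertices, which is why it also stores edges that are not incident on them.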
2D Partitioning
 Assume that the edge list for a given vertex is the
column of the adjacency matrix
 Each block in the 2D partitioning contains partial
edge lists
 Each processor has a frontier set of vertices, F,
owned by the processor
2D Partitioning
Expand Operation
 Consider v in F
 The owner of v sends messages to the
other processors in its processor column
telling them that v is in the frontier, since
any of these processors may hold a
partial edge list of v
2D Partitioning
Fold Operation
 Partial edge lists on each processor
merged to form N – potential vertices in
the next frontier
 Vertices in N sent to their owners to
form new frontier set F on those
processors
 These owner processors are in the same
processor row
 This communication step is referred to as the
fold operation
Analysis
 Advantage of 2D over 1D – processor-column and processor-row
communications involve only R and C
processors, respectively
BFS on GPUs
 One GPU thread for a vertex
 In each iteration, each vertex looks at
its entry in the frontier array
 If true, it visits its neighbors and adds the
unvisited ones to the frontier for the next level
 Severe load imbalance among the threads
 Scope for improvement
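The one-thread-per-vertex scheme can be emulated in Python to make the load imbalance visible. This is a sketch, not GPU code: the loop over tid stands in for the thread grid, and the separate next_frontier array is an assumption added so that the sequential emulation stays level-synchronous.

```python
def gpu_style_bfs(adj, source):
    """Emulate vertex-parallel BFS; return (cost, max neighbors scanned by one thread)."""
    n = len(adj)
    frontier = [False] * n
    visited = [False] * n
    cost = [-1] * n
    frontier[source] = True
    visited[source] = True
    cost[source] = 0
    max_work_per_thread = 0
    while any(frontier):
        next_frontier = [False] * n
        for tid in range(n):  # one "thread" per vertex
            if not frontier[tid]:
                continue  # this thread idles for the whole iteration
            max_work_per_thread = max(max_work_per_thread, len(adj[tid]))
            for nid in adj[tid]:
                if not visited[nid]:
                    visited[nid] = True
                    cost[nid] = cost[tid] + 1
                    next_frontier[nid] = True
        frontier = next_frontier
    return cost, max_work_per_thread


# Vertex 0 has four neighbors, vertices 1-3 and 5 have only one (hypothetical data).
adj = [[1, 2, 3, 4], [0], [0], [0], [0, 5], [4]]
cost, max_work = gpu_style_bfs(adj, 0)
```

In each iteration most threads do nothing while a frontier thread walks its whole edge list, so the running time of an iteration is set by the highest-degree frontier vertex.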
Parallel Depth First Search
 Easy to parallelize
 Left subtree can be searched in parallel
with the right subtree
 Statically assign a node to a processor –
the whole subtree rooted at that node
can be searched independently.
 Can lead to load imbalance; Load
imbalance increases with the number of
processors
Dynamic Load Balancing (DLB)
 Difficult to estimate the size of the
search space beforehand
 Need to balance the search space among
processors dynamically
 In DLB, when a processor runs out of
work, it gets work from another
processor
Maintaining Search Space
 Each processor searches the space
depth-first
 Unexplored states are saved on a stack; each
processor maintains its own local stack
 Initially, the entire search space is
assigned to one processor
Work Splitting
 When a processor receives a work request, it splits
its search space
 Half-split: Stack space divided into two equal
pieces – may result in load imbalance
 Giving stack space near the bottom of the stack
can lead to giving bigger trees
 Stack space near the top of the stack tends to have
small trees
 To avoid sending very small amounts of work –
nodes beyond a specified stack depth are not given
away – cutoff depth
Strategies
 1. Send nodes near the bottom of the
stack
 2. Send nodes near the cutoff depth
 3. Send half the nodes between the
bottom of the stack and the cutoff
depth
 Example: Figures 11.5(a) and 11.9
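Strategy 3 can be sketched in Python as follows. The stack representation (one list of untried alternatives per depth, index 0 at the bottom), the function name split_stack and the choice to donate the first half of each level are assumptions for illustration.

```python
def split_stack(stack, cutoff_depth):
    """Split a DFS stack into (kept, donated) parts.

    stack[d] lists the untried alternatives at depth d (index 0 = bottom).
    Levels at depth >= cutoff_depth are never given away.
    """
    kept, donated = [], []
    for depth, alternatives in enumerate(stack):
        if depth < cutoff_depth:
            half = len(alternatives) // 2
            donated.append(alternatives[:half])
            kept.append(alternatives[half:])
        else:
            donated.append([])
            kept.append(list(alternatives))
    return kept, donated


# Hypothetical node labels; the letter encodes the depth.
stack = [["a1", "a2"], ["b1", "b2", "b3", "b4"], ["c1"], ["d1", "d2"]]
kept, donated = split_stack(stack, cutoff_depth=3)
```

Nodes at depth 3 and deeper stay with the donor, which avoids shipping subtrees that are likely to be very small.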
Load Balancing Strategies
 Asynchronous round-robin: Each
processor keeps its own target processor to get
work from; after each request the target is
incremented modulo the number of processors
 Global round-robin: One single target
processor variable is maintained for all
processors
 Random polling: randomly select a donor
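The three donor-selection rules differ only in where the target variable lives. The sketch below uses hypothetical class and function names; skipping the requester's own id is an assumption, and the shared counter of global round-robin would need synchronized access on a real machine.

```python
import random


class AsyncRoundRobin:
    """Each processor keeps its own private target variable."""

    def __init__(self, my_id, num_procs):
        self.my_id = my_id
        self.num_procs = num_procs
        self.target = (my_id + 1) % num_procs

    def next_donor(self):
        donor = self.target
        self.target = (self.target + 1) % self.num_procs
        if self.target == self.my_id:  # never ask ourselves for work
            self.target = (self.target + 1) % self.num_procs
        return donor


class GlobalRoundRobin:
    """One target variable shared by all processors."""

    def __init__(self, num_procs):
        self.num_procs = num_procs
        self.target = 0

    def next_donor(self, my_id):
        donor = self.target
        if donor == my_id:  # skip the requester itself
            donor = (donor + 1) % self.num_procs
        self.target = (donor + 1) % self.num_procs
        return donor


def random_polling(my_id, num_procs, rng=random):
    """Pick a donor uniformly at random among the other processors."""
    return rng.choice([p for p in range(num_procs) if p != my_id])
```

Global round-robin spreads requests evenly but every request touches the shared variable; the other two rules need no shared state.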
Termination Detection
 Dijkstra’s Token Termination Detection
Algorithm
 Based on passing a token around a logical ring:
P0 initiates a token when it becomes idle; a processor
holds the token until it has completed its work,
and then passes it to the next processor; when
P0 receives the token again, all processors have
completed
 However, a processor may get more work
after becoming idle
Algorithm Continued….
 Taken care of by using white and black
tokens
 Initially, the token is white; a processor j
becomes black if it sends work to a processor i<j
 When a black processor j completes its work, it changes the token to
black and sends it to the next processor; after
sending, j changes to white.
 When P0 receives a black token, it reinitiates
the ring
Tree Based Termination
Detection
 Uses weights
 Initially processor 0 has weight 1
 When a processor transfers work to another
processor, its weight is halved: it keeps one half and
sends the other half along with the work
 When a processor finishes its work, it returns its weight
 Termination is detected when processor 0 gets back weight 1
 Goes with the DFS algorithm; No separate
communication steps
 Figure 11.10
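The weight bookkeeping can be sketched with exact fractions. Two simplifications are assumptions of this sketch: a finishing processor returns its weight directly to processor 0 rather than up a tree, and processor 0 must itself be idle before termination is declared. The class name is hypothetical.

```python
from fractions import Fraction


class WeightedTermination:
    """Track the weights used for termination detection."""

    def __init__(self, num_procs):
        self.weight = [Fraction(0)] * num_procs
        self.weight[0] = Fraction(1)  # processor 0 starts with weight 1
        self.p0_idle = False

    def transfer_work(self, donor, receiver):
        # The donor keeps half of its weight and ships the other half with the work.
        half = self.weight[donor] / 2
        self.weight[donor] -= half
        self.weight[receiver] += half
        if receiver == 0:
            self.p0_idle = False

    def finish(self, proc):
        if proc == 0:
            self.p0_idle = True
        else:
            # Assumption: the weight goes straight back to processor 0.
            self.weight[0] += self.weight[proc]
            self.weight[proc] = Fraction(0)

    def terminated(self):
        return self.p0_idle and self.weight[0] == 1
```

Fraction avoids the underflow that repeated halving would cause with floating-point weights; the total weight in the system always sums to exactly 1.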
 Minimal Spanning Tree, Single-Source
and All-pairs Shortest Paths
Minimal Spanning Tree – Prim’s
Algorithm
 Spanning tree of a graph G(V,E) – tree
containing all vertices of G
 MST – spanning tree with minimum sum
of weights
 Vertices are added to a set Vt that
holds the vertices of the MST; initially it contains
an arbitrary vertex, r, as root vertex
Minimal Spanning Tree – Prim’s
Algorithm
 An array d such that d[v], for v in (V-Vt), holds
the weight of the least-weight edge
between v and any vertex in Vt; initially
d[v] = w[r,v]
 Find the vertex u in (V-Vt) with minimum d[u]
and add it to Vt
 Update d: d[v] = min(d[v], w[u,v]) for every v still in (V-Vt)
 Time complexity – $O(n^2)$
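A compact Python sketch of the array-based formulation. The dense weight matrix w (with math.inf for missing edges), the function name prim_mst and the parent array are assumptions for illustration.

```python
import math


def prim_mst(w, r=0):
    """Return (total weight, parent array) of an MST of the dense graph w."""
    n = len(w)
    in_tree = [False] * n
    in_tree[r] = True
    d = list(w[r])  # d[v] = w[r][v] initially
    parent = [r] * n
    parent[r] = -1
    total = 0
    for _ in range(n - 1):
        # Find the vertex outside Vt with the smallest d value.
        d_u, u = min((d[v], v) for v in range(n) if not in_tree[v])
        if d_u == math.inf:
            raise ValueError("graph is not connected")
        in_tree[u] = True
        total += d_u
        # Update d for the vertices that are still outside Vt.
        for v in range(n):
            if not in_tree[v] and w[u][v] < d[v]:
                d[v] = w[u][v]
                parent[v] = u
    return total, parent


INF = math.inf
w = [[0, 1, 4, INF],
     [1, 0, 2, 6],
     [4, 2, 0, 3],
     [INF, 6, 3, 0]]
total, parent = prim_mst(w)
```

Both the minimum search and the update scan all n entries of d, and the outer loop runs n-1 times, which gives the quadratic bound.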
Parallelization
 The vertex set V and the array d are partitioned across P
processors
 Each processor finds the local minimum in its part of d
 The global minimum over all of d is then obtained
by a reduction onto one processor
 That processor determines the next vertex u and
broadcasts it to all processors
Parallelization
 All processors update d; The owning
processor of u marks u as belonging to Vt
 The process responsible for v must know w[u,v]
to update d[v]; hence the 1-D block mapping of the adjacency
matrix
 Complexity – $O(n^2/P)$ for computation + $O(n \log P)$ for
communication
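The per-iteration pattern (local minimum, reduction, broadcast, local update) can be simulated in one Python process. This is a sketch: the contiguous block ownership and the name parallel_prim_weight are assumptions, and the calls to min over local_minima stand in for the reduction and broadcast.

```python
import math


def parallel_prim_weight(w, num_procs, r=0):
    """Simulate Prim's algorithm with d partitioned over num_procs processors."""
    n = len(w)
    block = math.ceil(n / num_procs)
    # Processor p owns a contiguous slice of vertices and the matching part of d.
    owned = [range(p * block, min(n, (p + 1) * block)) for p in range(num_procs)]
    in_tree = [False] * n
    in_tree[r] = True
    d = list(w[r])
    total = 0
    for _ in range(n - 1):
        # Step 1: every processor computes the minimum over its own slice.
        local_minima = []
        for vertices in owned:
            candidates = [(d[v], v) for v in vertices if not in_tree[v]]
            if candidates:
                local_minima.append(min(candidates))
        # Step 2: reduction to the global minimum, then "broadcast" of u.
        d_u, u = min(local_minima)
        in_tree[u] = True
        total += d_u
        # Step 3: every processor updates its slice using w[u][v].
        for vertices in owned:
            for v in vertices:
                if not in_tree[v] and w[u][v] < d[v]:
                    d[v] = w[u][v]
    return total


INF = math.inf
w = [[0, 1, 4, INF],
     [1, 0, 2, 6],
     [4, 2, 0, 3],
     [INF, 6, 3, 0]]
weights = [parallel_prim_weight(w, p) for p in (1, 2, 3)]
```

Steps 1 and 3 touch about n/P entries per processor; step 2 is the only communication, repeated n-1 times.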
Single Source Shortest Path –
Dijkstra’s Algorithm
 Finds shortest path from the source
vertex to all vertices
 Follows a similar structure as Prim’s
 Instead of the d array, an array l that
holds the shortest path lengths found so far is
maintained
 Follow similar parallelization scheme
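The same loop structure as Prim's, with the update rule changed to path lengths. A sketch assuming a dense weight matrix with math.inf for missing edges; the function name is hypothetical and the array is called l to match the slides.

```python
import math


def dijkstra_dense(w, source):
    """Return the array l of shortest path lengths from source."""
    n = len(w)
    done = [False] * n
    l = list(w[source])  # l[v] = w[source][v] initially
    l[source] = 0
    done[source] = True
    for _ in range(n - 1):
        # Pick the unfinished vertex with the smallest tentative length.
        l_u, u = min((l[v], v) for v in range(n) if not done[v])
        if l_u == math.inf:
            break  # remaining vertices are unreachable
        done[u] = True
        # Relax: a path through u may be shorter than the best known one.
        for v in range(n):
            if not done[v] and l[u] + w[u][v] < l[v]:
                l[v] = l[u] + w[u][v]
    return l


INF = math.inf
w = [[0, 1, 4, INF],
     [1, 0, 2, 6],
     [4, 2, 0, 3],
     [INF, 6, 3, 0]]
lengths = dijkstra_dense(w, 0)
```

The only difference from Prim's update is that l[u] is added to w[u][v], so partitioning l across processors works the same way as partitioning d.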
Single Source Shortest Path on
GPUs
SSSP on GPUs
 A single kernel is not enough since the cost array Ca
cannot be updated while other threads are still reading it.
 Hence the costs are first updated in a temporary
array Ua
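The two-kernel structure can be emulated in Python. This is a sketch: Ca and Ua follow the slide, while the mask array that marks active vertices and the function name are assumptions, loosely following the Harish and Narayanan paper listed in the references. Each loop over tid stands in for one kernel launch.

```python
import math


def sssp_two_kernel(adj, source):
    """adj[v] is a list of (neighbor, weight) pairs; return the final cost array Ca."""
    n = len(adj)
    mask = [False] * n  # vertices whose cost changed in the last round
    Ca = [math.inf] * n
    Ua = [math.inf] * n
    mask[source] = True
    Ca[source] = 0
    Ua[source] = 0
    while any(mask):
        # Kernel 1: active vertices relax their edges, writing only to Ua.
        for tid in range(n):
            if mask[tid]:
                mask[tid] = False
                for nid, weight in adj[tid]:
                    if Ua[nid] > Ca[tid] + weight:
                        Ua[nid] = Ca[tid] + weight
        # Kernel 2: commit improved costs to Ca and reactivate those vertices.
        for tid in range(n):
            if Ca[tid] > Ua[tid]:
                Ca[tid] = Ua[tid]
                mask[tid] = True
            Ua[tid] = Ca[tid]
    return Ca


adj = [[(1, 1), (2, 4)],
       [(0, 1), (2, 2), (3, 6)],
       [(0, 4), (1, 2), (3, 3)],
       [(1, 6), (2, 3)]]
costs = sssp_two_kernel(adj, 0)
```

Kernel 1 reads Ca and writes Ua, kernel 2 does the reverse, so no thread ever reads a cost that another thread is changing in the same launch.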
All-Pairs Shortest Paths
 To find shortest paths between all pairs
of vertices
 Dijkstra’s algorithm for single-source
shortest path can be used for all
vertices
 Two approaches
All-Pairs Shortest Paths
 Source-partitioned formulation: Partition the
vertices across processors
 Works well if p<=n; No communication
 Can at best use only n processors
 Time complexity?
 Source-parallel formulation: Parallelize SSSP for a
vertex across a subset of processors
 Do for all vertices with different subsets of
processors
 Hierarchical formulation
 Exploits more parallelism
 Time complexity?
All-Pairs Shortest Paths
Floyd’s Algorithm
 Consider a subset $S = \{v_1, v_2, \ldots, v_k\}$ of
vertices for some $k \le n$
 Consider finding the shortest path between
$v_i$ and $v_j$
 Consider all paths from $v_i$ to $v_j$ whose
intermediate vertices belong to the set
$S$; let $p_{i,j}^{(k)}$ be the minimum-weight path
among them, with weight $d_{i,j}^{(k)}$
All-Pairs Shortest Paths
Floyd’s Algorithm
 If $v_k$ is not in the shortest path, then
$p_{i,j}^{(k)} = p_{i,j}^{(k-1)}$
 If $v_k$ is in the shortest path, then the
path is broken into two parts – from $v_i$ to
$v_k$, and from $v_k$ to $v_j$
 So $d_{i,j}^{(k)} = \min\{d_{i,j}^{(k-1)},\ d_{i,k}^{(k-1)} + d_{k,j}^{(k-1)}\}$
 The length of the shortest path from $v_i$
to $v_j$ is given by $d_{i,j}^{(n)}$.
 In general, the solution is a matrix $D^{(n)}$
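The recurrence translates directly into three nested loops. A sketch assuming a dense weight matrix with math.inf for missing edges; the function name is hypothetical, and the matrix is updated in place, which is safe because row k and column k do not change during iteration k.

```python
import math


def floyd_all_pairs(w):
    """Return the matrix of shortest path lengths between all pairs of vertices."""
    n = len(w)
    d = [row[:] for row in w]  # start from the edge weights
    for k in range(n):  # allow vertex k as an intermediate vertex
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d


INF = math.inf
w = [[0, 1, 4, INF],
     [1, 0, 2, 6],
     [4, 2, 0, 3],
     [INF, 6, 3, 0]]
D = floyd_all_pairs(w)
```

The two inner loops are independent for a fixed k, which is what the parallel formulations exploit.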
Parallel Formulation
2-D Block Mapping
 Processors laid in a 2D mesh
 During the kth iteration, each process $P_{i,j}$
needs certain segments of the kth row
and kth column of the $D^{(k-1)}$ matrix
 For $d_{l,r}^{(k)}$, the following are needed:
 $d_{l,k}^{(k-1)}$ (from a process along the same
process row)
 $d_{k,r}^{(k-1)}$ (from a process along the same
process column)
 Figure 10.8
Parallel Formulation
2D Block Mapping
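 During the kth iteration, each of the $\sqrt{p}$
processes containing part of the kth row
sends it to the $\sqrt{p} - 1$ other processes in the same column;
 Similarly, each process containing part of the kth column
sends it to the $\sqrt{p} - 1$ other processes in the same row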
 Figure 10.8
 Time complexity?
APSP on GPUs
 Space complexity of Floyd’s algorithm is $O(V^2)$ –
Impossible to go beyond a few vertices on GPUs
 Uses $V^2$ threads
 A single $O(V)$ operation looping over $O(V^2)$ threads can exhibit slowdown due to high context switching
overhead between threads
 Use Dijkstra’s – run SSSP algorithm from every
vertex in graph
 Will require only the final output size to be $O(V^2)$
 Intermediate outputs on GPU can be O(V) and can
be copied to CPU memory
Sources/References
 Paper: A Scalable Distributed Parallel
Breadth-First Search Algorithm on
BlueGene/L. Yoo et al. SC 2005.
 Paper: Accelerating large graph
algorithms on the GPU using CUDA.
Harish and Narayanan. HiPC 2007.
Speedup Anomalies in DFS
 The overall work (space searched) in
parallel DFS can be smaller or larger
than in sequential DFS
 Can cause superlinear or sublinear
speedups
 Figures 11.18, 11.19
Parallel Formulation
Pipelining
 In the 2D formulation, the kth iteration
in all processes starts only after the (k-1)th
iteration completes in all the processes
 A process can start working on the kth
iteration as soon as it has computed the (k-1)th iteration and has the relevant parts of
the $D^{(k-1)}$ matrix
 Example: Figure 10.9
 Time complexity