Parallel Graph Algorithms - Computer Science Division

advertisement
Parallel Graph Algorithms
Aydın Buluç
ABuluc@lbl.gov
http://gauss.cs.ucsb.edu/~aydin/
Lawrence Berkeley National Laboratory
CS267, Spring 2013
April 9, 2013
Slide acknowledgments: K. Madduri, J. Gilbert, D. Delling, S. Beamer
Graph Preliminaries
Define: Graph G = (V,E)
-a set of vertices and a set of edges
between vertices
Edge
Vertex
n=|V| (number of vertices)
m=|E| (number of edges)
D=diameter (max #hops between any pair of vertices)
• Edges can be directed or undirected, weighted or not.
• They can even have attributes (i.e. semantic graphs)
• Sequences of edges <u1,u2>, <u2,u3>, … ,<un-1,un> is a
path from u1 to un. Its length is the sum of its weights.
Lecture Outline
• Applications
• Designing parallel graph algorithms
• Case studies:
A.
B.
C.
D.
E.
Graph traversals: Breadth-first search
Shortest Paths: Delta-stepping, PHAST, Floyd-Warshall
Ranking: Betweenness centrality
Maximal Independent Sets: Luby’s algorithm
Strongly Connected Components
• Wrap-up: challenges on current systems
Lecture Outline
• Applications
• Designing parallel graph algorithms
• Case studies:
A.
B.
C.
D.
E.
Graph traversals: Breadth-first search
Shortest Paths: Delta-stepping, PHAST, Floyd-Warshall
Ranking: Betweenness centrality
Maximal Independent Sets: Luby’s algorithm
Strongly Connected Components
• Wrap-up: challenges on current systems
Routing in transportation networks
Road networks, Point-to-point shortest paths: 15 seconds (naïve)  10 microseconds
H. Bast et al., “Fast Routing in Road Networks with Transit Nodes”, Science 27, 2007.
Internet and the WWW
• The world-wide web can be represented as a directed graph
– Web search and crawl: traversal
– Link analysis, ranking: Page rank and HITS
– Document classification and clustering
• Internet topologies (router networks) are naturally modeled
as graphs
Scientific Computing
• Reorderings for sparse solvers
– Fill reducing orderings
 Partitioning, eigenvectors
– Heavy diagonal to reduce pivoting (matching)
• Data structures for efficient exploitation
of sparsity
• Derivative computations for optimization
Image source: Yifan Hu, “A gallery of large graphs”
– graph colorings, spanning trees
• Preconditioning
– Incomplete Factorizations
– Partitioning for domain decomposition
– Graph techniques in algebraic multigrid
 Independent sets, matchings, etc.
– Support Theory
 Spanning trees & graph embedding techniques
B. Hendrickson, “Graphs and HPC: Lessons for Future Architectures”,
http://www.er.doe.gov/ascr/ascac/Meetings/Oct08/Hendrickson%20ASCAC.pdf
3
1
6
8
4
9
7
10
5
G+(A)
[chordal]
2
Large-scale data analysis
• Graph abstractions are very useful to analyze complex data
sets.
• Sources of data: petascale simulations, experimental devices,
the Internet, sensor networks
• Challenges: data size, heterogeneity, uncertainty, data quality
Astrophysics: massive datasets,
temporal variations
Bioinformatics: data quality,
heterogeneity
Image sources: (1) http://physics.nmt.edu/images/astro/hst_starfield.jpg (2,3) www.visualComplexity.com
Social Informatics: new analytics
challenges, data uncertainty
Graph-theoretic problems in social networks
– Targeted advertising:
clustering and centrality
– Studying the spread of
information
Image Source: Nexus (Facebook application)
Network Analysis for Neurosciences
Graph-theoretical models are used to predict the course of
degenerative illnesses like Alzheimer’s.
Vertices: ROI (regions of interest)
Edges: structural/functional connectivity
Some disease indicators:
- Deviation from small-world property
- Emergence of “epicenters” with disease-associated patterns
Research in Parallel Graph Algorithms
Application
Areas
Methods/
Problems
Social Network
Analysis
Find central entities
Community detection
Network dynamics
WWW
Computational
Biology
Marketing
Social Search
Gene regulation
Metabolic pathways
Genomics
Graph
Algorithms
Traversal
Data size
Shortest Paths
Connectivity
Problem
Complexity
Max Flow
…
Scientific
Computing
Graph partitioning
Matching
Coloring
…
Engineering
VLSI CAD
Route planning
…
Architectures
GPUs, FPGAs
x86 multicore
servers
Massively
multithreaded
architectures
(Cray XMT, Sun
Niagara)
Multicore
clusters
(NERSC Hopper)
Clouds
(Amazon EC2)
Characterizing Graph-theoretic computations
Input data
Problem: Find ***
• paths
• clusters
• partitions
• matchings
• patterns
• orderings
Graph kernel
• traversal
• shortest path
algorithms
• flow algorithms
• spanning tree
algorithms
• topological
sort
…..
Factors that influence
choice of algorithm
• graph sparsity (m/n ratio)
• static/dynamic nature
• weighted/unweighted, weight
distribution
• vertex degree distribution
• directed/undirected
• simple/multi/hyper graph
• problem size
• granularity of computation at
nodes/edges
• domain-specific characteristics
Graph problems are often recast as sparse
linear algebra (e.g., partitioning) or linear
programming (e.g., matching) computations
Lecture Outline
• Applications
• Designing parallel graph algorithms
• Case studies:
A.
B.
C.
D.
E.
Graph traversals: Breadth-first search
Shortest Paths: Delta-stepping, PHAST, Floyd-Warshall
Ranking: Betweenness centrality
Maximal Independent Sets: Luby’s algorithm
Strongly Connected Components
• Wrap-up: challenges on current systems
The PRAM model
• Many PRAM graph algorithms in 1980s.
• Idealized parallel shared memory system model
• Unbounded number of synchronous processors;
no synchronization, communication cost; no
parallel overhead
• EREW (Exclusive Read Exclusive Write), CREW
(Concurrent Read Exclusive Write)
• Measuring performance: space and time
complexity; total number of operations (work)
PRAM Pros and Cons
• Pros
– Simple and clean semantics.
– The majority of theoretical parallel algorithms are designed
using the PRAM model.
– Independent of the communication network topology.
• Cons
–
–
–
–
–
Not realistic, too powerful communication model.
Communication costs are ignored.
Synchronized processors.
No local memory.
Big-O notation is often misleading.
Building blocks of classical PRAM graph algorithms
• Prefix sums
• Symmetry breaking
• Pointer jumping
• List ranking
• Euler tours
• Vertex collapse
• Tree contraction
[some covered in the “Tricks with Trees” lecture]
Work / Span Model
tp = execution time on p processors
Work / Span Model
tp = execution time on p processors
t1 = work
Work / Span Model
tp = execution time on p processors
t1 = work
t∞ = span *
*Also called critical-path length
or computational depth.
Work / Span Model
tp = execution time on p processors
t1 = work
t∞ = span *
WORK LAW
∙tp ≥t1/p
SPAN LAW
∙tp ≥ t∞
*Also called critical-path length
or computational depth.
Data structures: graph representation
Static case
• Dense graphs (m = Θ(n2)): adjacency matrix commonly used.
• Sparse graphs: adjacency lists, compressed sparse matrices
Dynamic
• representation depends on common-case query
• Edge insertions or deletions? Vertex insertions or deletions?
Edge weight updates?
• Graph update rate
• Queries: connectivity, paths, flow, etc.
• Optimizing for locality a key design consideration.
Graph representations
Compressed sparse rows (CSR) = cache-efficient adjacency lists
7
1
12
4
2
2
Index into
adjacency
array
3 26
14
1
26
1 12
19
3
6
4
2 14
4
3
Adjacencies 1
3
2
2
4
(column ids in CSR)
26
19
14
7
(numerical values in CSR)
12
4
2 19
1
Weights
3
3
(row pointers in CSR)
7
Distributed graph representations
• Each processor stores the entire graph
(“full replication”)
• Each processor stores n/p vertices and all adjacencies
out of these vertices (“1D partitioning”)
• How to create these “p” vertex partitions?
– Graph partitioning algorithms: recursively optimize for
conductance (edge cut/size of smaller partition)
– Randomly shuffling the vertex identifiers ensures that edge
count/processor are roughly the same
2D checkerboard distribution
• Consider a logical 2D processor grid (pr * pc = p) and
the matrix representation of the graph
• Assign each processor a sub-matrix (i.e, the edges
within the sub-matrix) 9 vertices, 9 processors, 3x3 processor grid
5
x
8
1
7
2
3
4
x
6
Flatten
Sparse matrices
x
x
x
x
x
x
x
x
x
Per-processor local graph
representation
x
x
x
0
x
x
x
x
x
x
x
x
x
x
High-performance graph algorithms
• Implementations are typically array-based for
performance (e.g. CSR representation).
• Concurrency = minimize synchronization (span)
• Where is the data? Find the distribution that
minimizes inter-processor communication.
• Memory access patterns
– Regularize them (spatial locality)
– Minimize DRAM traffic (temporal locality)
• Work-efficiency
– Is (Parallel time) * (# of processors) = (Serial work)?
Lecture Outline
• Applications
• Designing parallel graph algorithms
• Case studies:
A.
B.
C.
D.
E.
Graph traversals: Breadth-first search
Shortest Paths: Delta-stepping, PHAST, Floyd-Warshall
Ranking: Betweenness centrality
Maximal Independent Sets: Luby’s algorithm
Strongly Connected Components
• Wrap-up: challenges on current systems
Graph traversal: Depth-first search (DFS)
9
5
8
8
0
source
vertex
7
1
2
7
1
2
3
3
4
4
6
preorder
vertex number
6
5
9
procedure DFS(vertex v)
v.visited = true
previsit (v)
for all v s.t. (v, w) Î E
if(!w.visited) DFS(w)
postvisit (v)
Parallelizing DFS is a bad idea: span(DFS) = O(n)
J.H. Reif, Depth-first search is inherently sequential. Inform. Process. Lett. 20 (1985) 229-234.
Graph traversal : Breadth-first search (BFS)
1
Input:
Output:
5
8
1
0
source
vertex
7
2
1
2
3
3
3
4
6
distance from
source vertex
4
4
9
1
2
Memory requirements (# of machine words):
• Sparse graph representation: m+n
• Stack of visited vertices: n
• Distance array: n
Parallel BFS Strategies
1. Expand current frontier (level-synchronous approach, suited for low diameter graphs)
5
0
7
8
3
1
4
6
9
source
vertex
• O(D) parallel steps
• Adjacencies of all vertices
in current frontier are
visited in parallel
2
2. Stitch multiple concurrent traversals (Ullman-Yannakakis approach,
suited for high-diameter graphs)
source
vertex
0
5
8
7
3
2
1
4
6
9
• path-limited searches
from “super vertices”
• APSP between “super
vertices”
INTRODUCTION
60
Percentage of total
50
40
30
20
2
2
! Breadth-First Search (BFS)
a key building
1 is
1 0
0
0
1
1
Performance observations ofblock
the level-synchronous
algorithm
for many graph algorithms
! Common real-world graph properties
2
1
1
• Scale-free - exponential degree distribution
Youtube social network
Synthetic
network
Neighbor
Types
•
Small-world
low
effective
diameter
# of vertices in frontier array
! BFS implementation challenges
Execution time
• Often have little locality (spatial or temporal)
• Low arithmetic intensity
• Usually bottlenecked by memory system
• Scale-free graphs are harder to load-balance
• Small-world graphs are harder to partition
10
0
BREADTH-FIRST SEARCH
1
2
3
4
5
6
7
8
9 10 11 12 13 14 15
Phase #
When the frontier is
at its peak, almost all
edge examinations
“fail” to claim a child
Insight
2
Sample Search
with Classic Algorithm
! Many edges are
examined unnecessarily
! Classic algorithm is:
Top-Down
! What if went
Bottom-Up?
1
2
2
1
0
2
1
0
0 1 1 0 1
S
u
ffi
c
ie
n
tC
h
e
c
k
E
x
tr
a
C
h
e
c
k
Neighbor Types
2
1
2 11
Scott Beamer, Aydın Buluç, David Patterson, Krste A
Bottom-up BFS algorithm
L
E
L
C
O
M
P
U
T
I
N
G
L
A
Classical (top-down) algorithm is optimal in worst case, but
pessimistic
O P T I M I Zfor
I Nlow-diameter
G D I R E C graphs
T I O N (previous
S H Aslide).
RED MEM ANA
Top-Down
1
Edge Check Types
Bottom-Up
0
1
1
for all v in frontier
attempt to parent all
neighbors(v)
1
0
1
1
Direction-optimizing approach:
- Switch from top-down to
bottom-up search
- When the majority of the
vertices are discovered.
[Read paper for exact heuristic]
for all v in unvisited
find any parent
(neighbor(v) in frontier)
Hybrid of Top-Down & Bottom-Up
Scott Beamer, Krste Asanović, and David Patterson, "Direction-Optimizing Breadth-First Search",
! Combine
both to get best of both
Int. Conf. on High Performance Computing, Networking, Storage and Analysis (SC), 2012
! Top-down for small frontier, bottom-up for large
Time Breakdown
#
nodes
C
= # unexplored edges
C
=
2
1
4
5
7
1
3
6
Breadth-first search in
the language of linear
algebra
from
1
to
7
AT
7
2
1
4
5
7
1
6
3
Replace scalar operations
Multiply -> select
Add -> minimum
from
1
7
1
1
1
parents:

to
1
1
7
AT
X
ATX
2
1
Select vertex with
minimum label as parent
4
5
7
from
1
6
3
7
1
2
1
to
4
parents:
4
4

4
2
2
1
2
2
7
2
2
4
AT
X
ATX
2
1
4
5
7
1
6
3
from
7
1
3
1
to
5
4
parents:
1
2
3
2

3
5
3
7
7
AT
X
ATX
2
1
4
5
7
1
3
6
from
7
1

to
6
7
AT
X
ATX
1D Parallel BFS algorithm
1
x
4
3
AT
2
7
5
6
xfrontier
ALGORITHM:
1. Find owners of the current frontier’s adjacency [computation]
2. Exchange adjacencies via all-to-all. [communication]
3. Update distances/parents for unvisited vertices. [computation]
2D Parallel BFS algorithm
1
x
4
3
AT
2
7
5
6
xfrontier
ALGORITHM:
1. Gather vertices in processor column [communication]
2. Find owners of the current frontier’s adjacency [computation]
3. Exchange adjacencies in processor row [communication]
4. Update distances/parents for unvisited vertices. [computation]
BFS Strong Scaling
2D Flat MPI
2D Hybrid
16
GTEPS
12
Comm. time (seconds)
20
1D Flat MPI
1D Hybrid
12
8
4
5040
10008
20000
Number of cores
40000
1D Flat MPI
1D Hybrid
2D Flat MPI
2D Hybrid
10
8
6
4
2
5040
10008
20000
40000
Number of cores
• NERSC Hopper (Cray XE6, Gemini interconnect AMD Magny-Cours)
• Hybrid: In-node 6-way OpenMP multithreading
• Kronecker (Graph500): 4 billion vertices and 64 billion edges.
A. Buluç, K. Madduri. Parallel breadth-first search on distributed memory systems. Proc. Supercomputing, 2011.
4:
while f = ∅ do
T RAcores
NSPOSEV
ECTOR
(f i j )
115.6K
on
Hopper
f i ← A L L GATHERV (f i j , P (:, j ))
t i ← A i j ⊗f i
t is candida
t i j ← A L LTOA L LV (t i , P (i , :))
t i j ← t i Scaling
j ⊙πi j
Strong
w/ Twitter
πi j ← πi j + t i j
f i j ← ti j
5:
! Direction
Achieved optimizing
243 GTEPS with
BFS
with 2D
decomposition
Results on Titan
Weak Scaling w/ Kronecker
6:
7:
8:
9:
10:
11:
The pseudocode for parallel BFS algorithm with
tioning is given in Algorithm 3 for completeness. B
t are implemented as sparse vectors. For distribute
the syntax vi j denotes the local n/ p sized piece of
owned by the P (i , j )th processor. The syntax vi d
hypothetical n/ pr sized piece of the vector collectiv
by all the processors along the i th processor row P
algorithm has four major steps:
• Expand:
For all locally owned edges, g
Time
Breakdown
Sensitivity
to
Degree
• ORNL Titan (Cray XK6, Gemini interconnect
Interlagos)
originatingAMD
vertices
(line 6).
• Kronecker (Graph500): 16 billion •vertices
and 256y:billion
L ocal discover
Find edges.
(partial) owners of t
frontier’s adjacency (line 7).
• Fold:
Exchange
discovered
(line
Scott Beamer, Aydın Buluç, Krste Asanović, and David Patterson,
"Distributed
Memory
Breadth-Firstadjacencies
Search
Revisited: Enabling Bottom-Up Search”, Workshop on Multithreaded
Applications
(MTAAP), at
• L ocal Architectures
update: and
Update
distances/parents
fo
the International Parallel & Distributed Processing Symposium (IPDPS), 2013
vertices (lines 9, 10, and 11).
Lecture Outline
• Applications
• Designing parallel graph algorithms
• Case studies:
A.
B.
C.
D.
E.
Graph traversals: Breadth-first search
Shortest Paths: Delta-stepping, PHAST, Floyd-Warshall
Ranking: Betweenness centrality
Maximal Independent Sets: Luby’s algorithm
Strongly Connected Components
• Wrap-up: challenges on current systems
Parallel Single-source Shortest Paths
(SSSP) algorithms
• Famous serial algorithms:
– Bellman-Ford : label correcting - works on any graph
– Dijkstra : label setting – requires nonnegative edge weights
• No known PRAM algorithm that runs in sub-linear time and
O(m+n log n) work
• Ullman-Yannakakis randomized approach
• Meyer and Sanders, ∆ - stepping algorithm
• Distributed memory implementations based on graph
partitioning
• Heuristics for load balancing and termination detection
K. Madduri, D.A. Bader, J.W. Berry, and J.R. Crobak, “An Experimental Study of A Parallel
Shortest Path Algorithm for Solving Large-Scale Graph Instances,” Workshop on Algorithm
Engineering and Experiments (ALENEX), New Orleans, LA, January 6, 2007.
∆ - stepping algorithm
• Label-correcting algorithm: Can relax edges from
unsettled vertices also
• “approximate bucket implementation of Dijkstra”
• For random edge weighs [0,1], runs in O(n + m + D× L)
where L = max distance from source to any node
• Vertices are ordered using buckets of width ∆
• Each bucket may be processed in parallel
∆ < min w(e) : Degenerates into Dijkstra
∆ > max w(e) : Degenerates into Bellman-Ford
U. Meyer and P.Sanders, ∆ - stepping: a parallelizable shortest path algorithm.
Journal of Algorithms 49 (2003)
∆ - stepping algorithm: illustration
∆ = 0.1
(say)
0.05
3
0.56
0.07
0.01
0
0.02
0.13
0.15
2
0.23
4
0.18
5
1
d array
0
1
2
3
4
5
6
∞ ∞ ∞ ∞ ∞ ∞ ∞
Buckets
6
One parallel phase
while (bucket is non-empty)
i) Inspect light (w < ∆) edges
ii) Construct a set of “requests”
(R)
iii) Clear the current bucket
iv) Remember deleted vertices
(S)
v) Relax request pairs in R
Relax heavy request pairs (from S)
Go on to the next bucket
∆ - stepping algorithm: illustration
0.05
3
0.56
0.07
0.01
0
0.02
0.13
0.15
2
0.23
4
0.18
5
1
d array
0
1
2
3
0
∞ ∞ ∞ ∞ ∞ ∞
Buckets
0 0
4
5
6
6
One parallel phase
while (bucket is non-empty)
i) Inspect light (w < ∆) edges
ii) Construct a set of “requests”
(R)
iii) Clear the current bucket
iv) Remember deleted vertices
(S)
v) Relax request pairs in R
Relax heavy request pairs (from S)
Go on to the next bucket
Initialization:
Insert s into bucket, d(s) = 0
∆ - stepping algorithm: illustration
0.05
3
0.56
0.07
0.01
0
0.02
0.13
0.15
2
0.23
4
0.18
5
1
d array
0
1
2
3
0
∞ ∞ ∞ ∞ ∞ ∞
Buckets
0 0
4
5
6
One parallel phase
while (bucket is non-empty)
i) Inspect light (w < ∆) edges
ii) Construct a set of “requests”
(R)
iii) Clear the current bucket
iv) Remember deleted vertices
(S)
v) Relax request pairs in R
Relax heavy request pairs (from S)
Go on to the next bucket
6
R
2
.01
S
∆ - stepping algorithm: illustration
0.05
3
0.56
0.07
0.01
0
0.02
0.13
0.15
2
0.23
4
0.18
5
1
d array
0
1
2
3
0
∞ ∞ ∞ ∞ ∞ ∞
Buckets
4
5
6
6
One parallel phase
while (bucket is non-empty)
i) Inspect light (w < ∆) edges
ii) Construct a set of “requests”
(R)
iii) Clear the current bucket
iv) Remember deleted vertices
(S)
v) Relax request pairs in R
Relax heavy request pairs (from S)
Go on to the next bucket
R
2
.01
0
S
0
∆ - stepping algorithm: illustration
0.05
3
0.56
0.07
0.01
0
0.02
0.13
0.15
2
0.23
4
0.18
5
1
d array
0
1
2
3
0
∞
.01
∞ ∞ ∞ ∞
Buckets
0 2
4
5
6
6
One parallel phase
while (bucket is non-empty)
i) Inspect light (w < ∆) edges
ii) Construct a set of “requests”
(R)
iii) Clear the current bucket
iv) Remember deleted vertices
(S)
v) Relax request pairs in R
Relax heavy request pairs (from S)
Go on to the next bucket
R
S
0
∆ - stepping algorithm: illustration
0.05
3
0.56
0.07
0.01
0
0.02
0.13
0.15
2
0.23
4
0.18
5
1
d array
0
1
2
3
0
∞
.01
∞ ∞ ∞ ∞
Buckets
0 2
4
5
6
6
One parallel phase
while (bucket is non-empty)
i) Inspect light (w < ∆) edges
ii) Construct a set of “requests”
(R)
iii) Clear the current bucket
iv) Remember deleted vertices
(S)
v) Relax request pairs in R
Relax heavy request pairs (from S)
Go on to the next bucket
R
1 3
.03 .06
S
0
∆ - stepping algorithm: illustration
0.05
3
0.56
0.07
0.01
0
0.02
0.13
0.15
2
0.23
4
0.18
5
1
d array
0
1
2
3
0
∞
.01
∞ ∞ ∞ ∞
Buckets
4
5
6
6
One parallel phase
while (bucket is non-empty)
i) Inspect light (w < ∆) edges
ii) Construct a set of “requests”
(R)
iii) Clear the current bucket
iv) Remember deleted vertices
(S)
v) Relax request pairs in R
Relax heavy request pairs (from S)
Go on to the next bucket
R
1 3
.03 .06
0
S
0 2
∆ - stepping algorithm: illustration
0.05
3
0.56
0.07
0.01
0
0.02
0.13
0.15
2
0.23
4
0.18
5
1
d array
0
1
2
3
0
.03 .01 .06
Buckets
0 1 3
4
5
6
∞ ∞ ∞
6
One parallel phase
while (bucket is non-empty)
i) Inspect light (w < ∆) edges
ii) Construct a set of “requests”
(R)
iii) Clear the current bucket
iv) Remember deleted vertices
(S)
v) Relax request pairs in R
Relax heavy request pairs (from S)
Go on to the next bucket
R
S 0 2
∆ - stepping algorithm: illustration
0.05
3
0.56
0.07
0.01
2
0
0.02
0.13
0.15
0.23
4
0.18
5
1
d array
0
1
2
3
4
5
6
0 .03 .01 .06 .16 .29 .62
Buckets
1 4
2 5
6 6
6
One parallel phase
while (bucket is non-empty)
i) Inspect light (w < ∆) edges
ii) Construct a set of “requests”
(R)
iii) Clear the current bucket
iv) Remember deleted vertices
(S)
v) Relax request pairs in R
Relax heavy request pairs (from S)
Go on to the next bucket
R
S
0 2 1 3
No. of phases (machine-independent performance count)
1000000
high
diameter
No. of phases
100000
10000
1000
low
diameter
100
10
Rnd-rnd
Rnd-logU Scale-free LGrid-rnd LGrid-logU
SqGrid
USAd NE USAt NE
Graph Family
Too many phases in high diameter graphs:
Level-synchronous breadth-first search has the same problem.
Average shortest path weight for various graph families
~ 220 vertices, 222 edges, directed graph, edge weights normalized to [0,1]
Complexity:
O(n + m + D× L)
100000
L: maximum distance
(shortest path weight)
Average shortest path weight
10000
1000
100
10
1
0.1
0.01
Rnd-rnd
Rnd-logU Scale-free LGrid-rnd LGrid-logU
Graph Family
SqGrid
USAd NE USAt NE
PHAST – hardware accelerated shortest path trees
Preprocessing: Build a contraction hierarchy
• order nodes by importance (highway dimension)
• process in order
• add shortcuts to preserve distances
• assign levels (ca. 150 in road networks)
• 75% increase in number of edges (for road networks)
D. Delling, A. V. Goldberg, A. Nowatzyk, and R. F. Werneck. PHAST: HardwareAccelerated Shortest Path Trees. In IPDPS. IEEE Computer Society, 2011
PHAST – hardware accelerated shortest path trees
One-to-all search from source s:
• Run forward search from s
• Only follow edges to more important nodes
• Set distance labels d of reached nodes
PHAST – hardware accelerated shortest path trees
From top-down:
• process all nodes u in reverse level order:
• check incoming arcs (v,u) with lev(v) > lev(u)
• Set d(u)=min{d(u),d(v)+w(v,u)}
PHAST – hardware accelerated shortest path trees
From top-down:
• linear sweep without priority queue
• reorder nodes, arcs, distance labels by level
• accesses to d array becomes contiguous (no jumps)
• parallelize with SIMD (SSE instructions, and/or GPU)
PHAST – performance comparison
Inputs: Europe/USA Road Networks
GPU implementation
Edge weights:
estimated travel times
Edge weights:
physical distances
PHAST – hardware accelerated shortest path trees
• Specifically designed for road networks
• Fill-in can be much higher for other graphs
(Hint: think about sparse Gaussian Elimination)
• Road networks are (almost)
planar.
• Planar graphs have O( n)
separators.
• Good separators lead to
orderings with minimal fill.
Lipton, Richard J.; Tarjan, Robert E. (1979), "A separator theorem for planar graphs", SIAM Journal on
Applied Mathematics 36 (2): 177–189
Alan George. “Nested Dissection of a Regular Finite Element Mesh”. SIAM Journal on Numerical
Analysis, Vol. 10, No. 2 (Apr., 1973), 345-363
All-pairs shortest-paths problem
• Input: Directed graph with “costs” on edges
• Find least-cost paths between all reachable vertex pairs
• Classical algorithm: Floyd-Warshall
1
2
3
4
5
1
for k=1:n // the induction sequence
for i = 1:n
for j = 1:n
if(w(i→k)+w(k→j) < w(i→j))
w(i→j):= w(i→k) + w(k→j)
2
3
4
5
k = 1 case
• It turns out a previously overlooked recursive version is
more parallelizable than the triple nested loop
5
2
1
9
-3 4
V
1
-2
1
3
3
6
3
-1
7
V
2
C
A
5
4
4
0
¥
3
¥
¥
-3
B
C
D
D
B
4
5
+ is “min”,
A = A*;
é
ê
ê
ê
ê
ê
ê
êë
A
5 9 ¥ ¥ 4 ù
ú
0 1 -2 ¥ ¥ ú
¥ 0 4 ¥ 3 ú
¥ 5 0 4 ¥ ú
ú
¥ -1 ¥ 0 ¥ ú
¥ ¥ ¥ 7 0 úû
× is “add”
% recursive call
B = AB; C = CA;
D = D + CB;
D = D*;
% recursive call
B = BD; C = DC;
A = A + BC;
5
2
1
9
-3 4
3
3
6
é
ê
ê
ê
ê
ê
ê
êë
0
¥
3
¥
¥
-3
3
-1
7
-2
1 8
5
4
4
é ¥ ù
ê
ú
3
ë
û
é 5 9 ù
ë
û
C
B
The cost of 3-12 path
4
5
5 9 ¥ ¥ 4 ù
ú
0 1 -2 ¥ ¥ ú
8 0 4 ¥ 3 ú
¥ 5 0 4 ¥ ú
ú
¥ -1 ¥ 0 ¥ ú
¥ ¥ ¥ 7 0 úû
Distances
=
é ¥ ¥ ù
ê
ú
ë 8 12 û
a(3, 2) = a(3,1) + a(1, 2) ¾then
¾® P(3, 2) = P(1, 2)
é
ê
ê
ê
P =ê
ê
ê
êë
1
2
3
4
5
6
1
2
1
4
5
6
1
2
3
4
5
6
1
2
3
4
5
6
Parents
1
2
3
4
5
6
1
2
3
4
5
6
ù
ú
ú
ú
ú
ú
ú
úû
5
2
1
6
-3 4
6
é
ê
ê
ê
ê
ê
ê
êë
0
¥
3
¥
¥
-3
3
-1
7
-2
1 8
3
3
D = D*: no change
5
4
4
é 5 9 ù é 0 1 ù
ë
û ê 8 0 ú
ë
û
B
é
ù
=ë 5 6 û
D
Path:
1-2-3
4
5
5 6 ¥ ¥ 4 ù
ú
0 1 -2 ¥ ¥ ú
8 0 4 ¥ 3 ú
¥ 5 0 4 ¥ ú
ú
¥ -1 ¥ 0 ¥ ú
¥ ¥ ¥ 7 0 úû
Distances
a(1,3) = a(1, 2)+ a(2,3) ¾then
¾® P(1,3) = P(2,3)
é
ê
ê
ê
P =ê
ê
ê
êë
1
2
3
4
5
6
1
2
1
4
5
6
2
2
3
4
5
6
1
2
3
4
5
6
Parents
1
2
3
4
5
6
1
2
3
4
5
6
ù
ú
ú
ú
ú
ú
ú
úû
All-pairs shortest-paths problem
Floyd-Warshall
ported to GPU
480x
Naïve
recursive
implementation
The right
primitive
(Matrix multiply)
A. Buluç, J. R. Gilbert, and C. Budak. Solving path problems on the GPU. Parallel
Computing, 36(5-6):241 - 253, 2010.
Communication-avoiding APSP in distributed memory
Bandwidth: Wbc-2.5D (n, p) = O(n
Latency: Sbc-2.5D (p) = O
(
cp)
2
)
cp log (p)
2
c: number of
replicas
Optimal for any
memory size !
Communication-avoiding APSP in distributed memory
1200
c=16
c=4
c=1
1000
n=8192
600
400
n=4096
0
1
4
16
64
256
1024
200
1
4
16
64
256
1024
GFlops
800
Number of compute nodes
E. Solomonik, A. Buluç, and J. Demmel. Minimizing communication in all-pairs
shortest paths. In Proceedings of the IPDPS. 2013.
Lecture Outline
• Applications
• Designing parallel graph algorithms
• Case studies:
A.
B.
C.
D.
E.
Graph traversals: Breadth-first search
Shortest Paths: Delta-stepping, PHAST, Floyd-Warshall
Ranking: Betweenness centrality
Maximal Independent Sets: Luby’s algorithm
Strongly Connected Components
• Wrap-up: challenges on current systems
Betweenness Centrality (BC)
• Centrality: Quantitative measure to capture the
importance of a vertex/edge in a graph
– degree, closeness, eigenvalue, betweenness
• Betweenness Centrality
BC (v) 
 st (v)

s  v  t  st
(  st : No. of shortest paths between s and t)
• Applied to several real-world networks
–
–
–
–
Social interactions
WWW
Epidemiology
Systems biology
Algorithms for Computing Betweenness
• All-pairs shortest path approach: compute the length and
number of shortest paths between all s-t pairs (O(n3) time),
sum up the fractional dependency values (O(n2) space).
• Brandes’ algorithm (2003): Augment a single-source shortest
path computation to count paths; uses the Bellman criterion;
O(mn) work and O(m+n) space on unweighted graph.
d (v) =
s (v)
å s (w) (1+ d (w))
wÎadj(v)
Dependency of
source on v.
d (v) = ås st (v)
tÎV
Number of shortest
paths from source to w
Parallel BC algorithms for unweighted graphs
• High-level idea: Level-synchronous parallel breadthfirst search augmented to compute centrality scores
• Two steps (both implemented as BFS)
– traversal and path counting
– dependency accumulation
Shared-memory parallel algorithms for betweenness centrality
Exact algorithm: O(mn) work, O(m+n) space, O(nD+nm/p) time.
D.A. Bader and K. Madduri. Parallel algorithms for evaluating centrality indices in real-world networks, ICPP’08
Improved with lower synchronization overhead and fewer noncontiguous memory references.
K. Madduri, D. Ediger, K. Jiang, D.A. Bader, and D. Chavarria-Miranda. A faster parallel algorithm and efficient
multithreaded implementations for evaluating betweenness centrality on massive datasets. MTAAP 2009
Parallel BC Algorithm Illustration
0
5
8
7
3
2
1
4
6
9
Parallel BC Algorithm Illustration
1. Traversal step: visit adjacent vertices, update distance
and path counts.
0
5
8
7
3
source
vertex
2
1
4
6
9
Parallel BC Algorithm Illustration
1. Traversal step: visit adjacent vertices, update distance
and path counts.
S
5
0
7
source
vertex
2
8
3
6
P
0
1
4
D
9
2
7
5
0
1
0
1
0
1
0
0
Parallel BC Algorithm Illustration
1. Traversal step: visit adjacent vertices, update distance
and path counts.
S
5
0
7
source
vertex
2
8
3
6
P
0
1
4
D
9
8
3
2
7
5
0
1
2
0
2
1
0
1
2
0
5
Level-synchronous approach: The adjacencies of all vertices
in the current frontier can be visited in parallel
7
0
7
Parallel BC Algorithm Illustration
1. Traversal step: at the end, we have all reachable vertices,
their corresponding predecessor multisets, and D values.
5
0
7
source
vertex
2
8
3
1
4
6
9
S
D
2
1
6
4
8
3
2
7
5
0
0
P
1
2
1
1
2
Level-synchronous approach: The adjacencies of all vertices
in the current frontier can be visited in parallel
6
0
2
3
0
8
0
5
6
7
8
0
7
Parallel BC Algorithm Illustration
2. Accumulation step: Pop vertices from stack,
update dependence scores.
S
5
0
7
source
vertex
2
8
3
1
4
6
9
2
1
6
4
8
3
2
7
5
0
 (v ) 
 (v )
1   (w) 


(
w
)
vP ( w )
Delta P
6
0
2
3
0
8
0
5
6
7
8
0
7
Parallel BC Algorithm Illustration
2. Accumulation step: Can also be done in a
level-synchronous manner.
S
5
0
7
source
vertex
2
8
3
1
4
6
9
2
1
6
4
8
3
2
7
5
0
d (v) =
s (v)
(1+ d (w))
s
(w)
wÎadj(v)
å
Delta P
6
0
2
3
0
8
0
5
6
7
8
0
7
Distributed memory BC algorithm
Work-efficient parallel breadth-first search via parallel
sparse matrix-matrix multiplication over semirings

4
3
AT
B
7
6
.
(ATX) *¬B
Encapsulates three level of parallelism:
1. columns(B): multiple BFS searches in parallel
2. columns(AT)+rows(B): parallel over frontier vertices in each
BFS
3. rows(AT): parallel over incident edges of each frontier vertex
A. Buluç, J.R. Gilbert. The Combinatorial BLAS: Design, implementation, and applications. The International
Journal of High Performance Computing Applications, 2011.
5
Lecture Outline
• Applications
• Designing parallel graph algorithms
• Case studies:
A.
B.
C.
D.
E.
Graph traversals: Breadth-first search
Shortest Paths: Delta-stepping, PHAST, Floyd-Warshall
Ranking: Betweenness centrality
Maximal Independent Sets: Luby’s algorithm
Strongly Connected Components
• Wrap-up: challenges on current systems
Maximal Independent Set
• Graph with vertices V = {1,2,…,n}
• A set S of vertices is independent if no
two vertices in S are neighbors.
• An independent set S is maximal if it is
impossible to add another vertex and
stay independent
5
• An independent set S is maximum
if no other independent set has more
vertices
• Finding a maximum independent set is
intractably difficult (NP-hard)
• Finding a maximal independent set is
easy, at least on one processor.
1
2
3
4
7
8
The set of red vertices
S = {4, 5} is independent
and is maximal
but not maximum
6
Sequential Maximal Independent Set Algorithm
1
1. S = empty set;
2
2. for vertex v = 1 to n {
3.
if (v has no neighbor in S) {
4.
5.
add v to S
}
5
3
4
7
8
6. }
S={}
6
Sequential Maximal Independent Set Algorithm
1
1. S = empty set;
2
2. for vertex v = 1 to n {
3.
if (v has no neighbor in S) {
4.
5.
add v to S
}
5
3
4
7
8
6. }
S={1}
6
Sequential Maximal Independent Set Algorithm
1
1. S = empty set;
2
2. for vertex v = 1 to n {
3.
if (v has no neighbor in S) {
4.
5.
add v to S
}
5
3
4
7
8
6. }
S = { 1, 5 }
6
Sequential Maximal Independent Set Algorithm
1
1. S = empty set;
2
2. for vertex v = 1 to n {
3.
if (v has no neighbor in S) {
4.
5.
add v to S
}
5
3
4
7
8
6. }
S = { 1, 5, 6 }
work ~ O(n), but span ~O(n)
6
Parallel, Randomized MIS Algorithm
1. S = empty set; C = V;
1
2
2. while C is not empty {
3.
label each v in C with a random r(v);
4.
for all v in C in parallel {
5
if r(v) < min( r(neighbors of v) ) {
5.
6.
move v from C to S;
7.
remove neighbors of v from C;
8.
9.
10. }
}
}
3
4
7
8
S={}
C = { 1, 2, 3, 4, 5, 6, 7, 8 }
M. Luby. "A Simple Parallel Algorithm for the Maximal Independent Set
Problem". SIAM Journal on Computing 15 (4): 1036–1053, 1986
6
Parallel, Randomized MIS Algorithm
1. S = empty set; C = V;
1
2
2. while C is not empty {
3.
label each v in C with a random r(v);
4.
for all v in C in parallel {
5
if r(v) < min( r(neighbors of v) ) {
5.
6.
move v from C to S;
7.
remove neighbors of v from C;
8.
9.
10. }
}
}
3
4
7
8
S={}
C = { 1, 2, 3, 4, 5, 6, 7, 8 }
6
Parallel, Randomized MIS Algorithm
2.6
1
1. S = empty set; C = V;
2. while C is not empty {
3.
4.
for all v in C in parallel {
5
if r(v) < min( r(neighbors of v) ) {
6.
move v from C to S;
7.
remove neighbors of v from C;
8.
10. }
2
5.9
3.1
label each v in C with a random r(v);
5.
9.
4.1
}
}
1.2
3
4
7
8
9.7
5.8
6
9.3
S={}
C = { 1, 2, 3, 4, 5, 6, 7, 8 }
Parallel, Randomized MIS Algorithm
2.6
1
1. S = empty set; C = V;
2. while C is not empty {
3.
4.
for all v in C in parallel {
5
if r(v) < min( r(neighbors of v) ) {
6.
move v from C to S;
7.
remove neighbors of v from C;
8.
10. }
2
5.9
3.1
label each v in C with a random r(v);
5.
9.
4.1
}
}
1.2
3
4
7
8
9.7
9.3
S = { 1, 5 }
C = { 6, 8 }
5.8
6
Parallel, Randomized MIS Algorithm
1. S = empty set; C = V;
1
2
2. while C is not empty {
3.
4.
label each v in C with a random r(v);
for all v in C in parallel {
5
if r(v) < min( r(neighbors of v) ) {
5.
6.
move v from C to S;
7.
remove neighbors of v from C;
8.
9.
10. }
}
}
3
4
7
8
1.8
S = { 1, 5 }
C = { 6, 8 }
2.7
6
Parallel, Randomized MIS Algorithm
1. S = empty set; C = V;
1
2
2. while C is not empty {
3.
4.
label each v in C with a random r(v);
for all v in C in parallel {
5
if r(v) < min( r(neighbors of v) ) {
5.
6.
move v from C to S;
7.
remove neighbors of v from C;
8.
9.
10. }
}
}
3
4
7
8
1.8
S = { 1, 5, 8 }
C={}
2.7
6
Parallel, Randomized MIS Algorithm
1. S = empty set; C = V;
1
2
2. while C is not empty {
3.
label each v in C with a random r(v);
4.
for all v in C in parallel {
5
if r(v) < min( r(neighbors of v) ) {
5.
6.
move v from C to S;
7.
remove neighbors of v from C;
8.
9.
10. }
}
}
3
4
7
8
Theorem: This algorithm
“very probably” finishes
within O(log n) rounds.
work ~ O(n log n), but span ~O(log n)
6
Lecture Outline
• Applications
• Designing parallel graph algorithms
• Case studies:
A.
B.
C.
D.
E.
Graph traversals: Breadth-first search
Shortest Paths: Delta-stepping, PHAST, Floyd-Warshall
Ranking: Betweenness centrality
Maximal Independent Sets: Luby’s algorithm
Strongly Connected Components
• Wrap-up: challenges on current systems
Strongly connected components (SCC)
1
1
2
4
7
5
3
6
1
2
2
4
7
4
7
5
3
6
3
6
• Symmetric permutation to block triangular form
• Find P in linear time by depth-first search
Tarjan, R. E. (1972), "Depth-first search and linear graph algorithms",
SIAM Journal on Computing 1 (2): 146–160
5
Strongly connected components of directed graph
• Sequential: use depth-first search (Tarjan);
work=O(m+n) for m=|E|, n=|V|.
• DFS seems to be inherently sequential.
• Parallel: divide-and-conquer and BFS (Fleischer
et al.); worst-case span O(n) but good in practice
on many graphs.
L. Fleischer, B. Hendrickson, and A. Pınar. On identifying strongly connected components in
parallel. Parallel and Distributed Processing, pages 505–511, 2000.
Fleischer/Hendrickson/Pinar algorithm
- Partition the given graph into
three disjoint subgraphs
- Each can be processed
independently/recursively
Lemma: FW(v)∩ BW(v) is a
unique SCC for any v. For every
other SCC s, either
(a) s ⊂ FW(v)\BW(v),
(b) s ⊂ BW(v)\FW(v),
(c) s ⊂ V \ (FW(v)∪BW(v)).
FW(v): vertices reachable from vertex v.
BW(v): vertices from which v is reachable.
Lecture Outline
• Applications
• Designing parallel graph algorithms
• Case studies:
A.
B.
C.
D.
E.
Graph traversals: Breadth-first search
Shortest Paths: Delta-stepping, PHAST, Floyd-Warshall
Ranking: Betweenness centrality
Maximal Independent Sets: Luby’s algorithm
Strongly Connected Components
• Wrap-up: challenges on current systems
The locality challenge
“Large memory footprint, low spatial and temporal locality
impede performance”
Performance rate
(Million Traversed Edges/s)
Serial Performance of “approximate betweenness centrality” on a 2.67
GHz Intel Xeon 5560 (12 GB RAM, 8MB L3 cache)
Input: Synthetic R-MAT graphs (# of edges m = 8n)
No Last-level Cache (LLC) misses
60
50
40
30
O(m) LLC misses
20
10
0
10
12
14
16
18
20
22
24
Problem Size (Log2 # of vertices)
26
~ 5X drop in
performance
The parallel scaling challenge
“Classical parallel graph algorithms perform poorly on current
parallel systems”
• Graph topology assumptions in classical algorithms do not
match real-world datasets
• Parallelization strategies at loggerheads with techniques for
enhancing memory locality
• Classical “work-efficient” graph algorithms may not fully
exploit new architectural features
– Increasing complexity of memory hierarchy, processor heterogeneity,
wide SIMD.
• Tuning implementation to minimize parallel overhead is nontrivial
– Shared memory: minimizing overhead of locks, barriers.
– Distributed memory: bounding message buffer sizes, bundling
messages, overlapping communication w/ computation.
Designing parallel algorithms for large sparse graph analysis
System requirements: High (on-chip memory, DRAM, network, IO) bandwidth.
Solution: Efficiently utilize available memory bandwidth.
Algorithmic innovation
to avoid corner cases.
Improve
locality
“RandomAccess”-like
Problem
Complexity
Locality
Data reduction/
Compression
“Stream”-like
Faster
methods
Work
performed
O(n)
O(nlog n)
O(n2)
104
106
108
1012
Problem size (n: # of vertices/edges)
Peta+
Conclusions
• Best algorithm depends on the input.
• Communication costs (and hence data
distribution) is crucial for distributed memory.
• Locality is crucial for in-node performance.
• Graph traversal (BFS) is fundamental building
block for more complex algorithms.
• Best parallel algorithm can be significantly
different than the best serial algorithm.
Download