Can PRAM Graph Algorithms Provide Practical Speedups on Many-Core Machines?

Speakers:
George Caragea, James Edwards and Uzi Vishkin
University of Maryland


It has proven quite difficult to obtain significant performance improvements using current parallel computing platforms. As the National Research Council report [FM10] puts it: while heroic programmers can exploit vast amounts of parallelism today, whole new computing “stacks” are required to allow expert and typical programmers to do so easily.

The Parallel Random Access Machine (PRAM) is the simplest model of a parallel computer.
◦ Work-Depth is a conceptually simpler model that is equivalent to the PRAM: at each point in time, specify all of the operations that can be performed in parallel (see the sketch below).
◦ Any processor can access any memory address in constant time.
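As a concrete illustration (ours, not from the talk), here is a serial Python sketch of a Work-Depth description of array summation; each iteration of the outer loop is one parallel time step in which all of the listed additions could execute simultaneously, giving O(n) work and O(log n) depth:

```python
# Work-Depth sketch of parallel summation (illustrative).
def pram_sum(a):
    a = list(a)                 # working copy; a[0] will hold the result
    stride = 1
    while stride < len(a):      # O(log n) time steps
        # On a PRAM, all additions in this inner loop are independent
        # and would be performed in the same time step.
        for i in range(0, len(a) - stride, 2 * stride):
            a[i] += a[i + stride]
        stride *= 2
    return a[0]

print(pram_sum(range(1, 9)))    # 36
```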
• Advantages
◦ Ease of algorithm design
◦ Provability of correctness
◦ Ease of truly PRAM-like programming

• So, what’s the problem?
• Many doubt the direct practical relevance of PRAM algorithms.
◦ Example: the lack of any poly-logarithmic PRAM graph algorithms in the new NSF/IEEE-TCPP curriculum [TCPP10]
• Past work provided very limited evidence to alleviate these doubts.

• Graph algorithms in particular tend to be difficult to implement efficiently, as shown in two papers from Georgia Tech:
◦ Biconnectivity (IPDPS ’05, 12-processor Sun machine): speedups of up to 4x with a modified version of the Tarjan-Vishkin biconnectivity algorithm [CB05]
▪ No speedup without major changes to the algorithm
◦ Maximum flow (IPDPS ’10, hybrid GPU-CPU implementation): speedups of up to 2.5x [HH10]
• Cause: PRAM algorithms are not a good match for current hardware:
◦ Fine-grained parallelism = overheads
▪ Requires managing many threads
▪ Synchronization and communication are expensive
▪ Clustering reduces granularity, but at the cost of load balancing
◦ Irregular memory accesses = poor locality
▪ The cache is not used efficiently
▪ Performance becomes sensitive to memory latency
• Unlike models such as BSP and LogP, the PRAM does not explicitly take these factors into account.
• The Explicit Multi-Threading (XMT) architecture was developed at the University of Maryland with the following goals in mind:
◦ Good performance on parallel algorithms of any granularity
◦ Support for regular or irregular memory accesses
◦ Efficient execution of code derived from PRAM algorithms

• A 64-processor FPGA hardware prototype [WV08] and a software toolchain (compiler and simulator) [KTCBV11] are freely available for download.

• Note: unless otherwise specified, speedup results for XMT were obtained using the simulator and are given in terms of cycle counts.
• Main feature of XMT: using hardware resources (e.g., silicon area, power consumption) similar to those of existing CPUs and GPUs, provide a platform that looks to the programmer as close to a PRAM as possible.
◦ Instead of ~8 “heavy” processor cores, provide ~1,024 “light” cores for parallel code and one “heavy” core for serial code.
◦ Devote on-chip bandwidth to a high-speed interconnection network rather than to maintaining coherence between private caches.
◦ For the PRAM algorithms presented, the number of HW threads is more important than the processing power per thread, because these algorithms happen to perform more work than an equivalent serial algorithm; this cost is overridden by sufficient parallelism in hardware.
◦ Balance between the tight synchrony of the PRAM and hardware constraints (such as locality) is obtained through support for fine-grained multithreaded code, where a thread can advance at its own speed between (a form of) synchronization barriers.
• Consider the following two systems:
1. XMT running a PRAM algorithm with few or no modifications
2. A multi-core CPU or GPU running a heavily modified version of the same PRAM algorithm, or another algorithm solving the same problem
• It is perhaps surprising that (1) can outperform (2) while being easier to implement.

• This idea was demonstrated with the following four PRAM graph algorithms:
◦ BFS
◦ Connectivity
◦ Biconnectivity
◦ Maximum flow
• None of the 40+ students in a fall 2010 joint UIUC/UMD course got any speedups using OpenMP programming on simple irregular problems such as breadth-first search (BFS) on an 8-processor SMP, but they got 8x-25x speedups on the XMT FPGA prototype.

• On BFS, we show potential speedups of 5.4x over an optimized GPU implementation, and of 73x when the input graph provides a low degree of parallelism during execution.
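For reference, a minimal level-synchronous BFS sketch in Python (ours, not the talk’s XMT code); each iteration of the outer loop is one parallel round in which every vertex of the current frontier could be expanded simultaneously:

```python
# Level-synchronous BFS sketch (illustrative).
def bfs_levels(adj, source):
    level = {source: 0}
    frontier = [source]
    d = 0
    while frontier:                     # one iteration = one parallel round
        d += 1
        next_frontier = []
        for u in frontier:              # independent; parallel on a PRAM
            for v in adj[u]:
                if v not in level:      # a CRCW PRAM resolves concurrent
                    level[v] = d        # writes by an arbitrary winner
                    next_frontier.append(v)
        frontier = next_frontier
    return level

adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
print(bfs_levels(adj, 0))               # {0: 0, 1: 1, 2: 1, 3: 2}
```

The number of rounds grows with the graph’s diameter, which is why inputs that expose little parallelism per round stress platforms with expensive round-by-round synchronization.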
• Using the Shiloach-Vishkin (SV) PRAM connectivity algorithm [SV82a], we show potential speedups of 39x-100x over a best serial implementation, and of 2.2x-4x over an optimized GPU implementation that greatly modified the original algorithm.

• In fact, for XMT the SV PRAM connectivity algorithm did not need to wait for a research paper: it was given as one of six programming assignments in standard PRAM algorithms classes, and was even completed by a couple of 10th graders at Blair High School, Maryland.
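To indicate the flavor of SV, a simplified serial simulation of its parallel rounds in Python (ours; the real algorithm interleaves hooking and pointer jumping somewhat differently):

```python
# Simplified Shiloach-Vishkin connectivity sketch (illustrative).
def sv_connectivity(n, edges):
    parent = list(range(n))        # each vertex starts as a singleton tree
    changed = True
    while changed:
        changed = False
        # Hooking: in parallel, each edge tries to attach one root to the other.
        for u, v in edges:
            ru, rv = parent[u], parent[v]
            if parent[ru] == ru and parent[rv] == rv and ru != rv:
                # deterministic tie-break; a CRCW PRAM instead lets an
                # arbitrary concurrent write win
                parent[max(ru, rv)] = min(ru, rv)
                changed = True
        # Pointer jumping: in parallel, shortcut every vertex toward its root
        # (SV halves tree depth each round; we compress fully for simplicity).
        for v in range(n):
            while parent[v] != parent[parent[v]]:
                parent[v] = parent[parent[v]]
    return parent                  # parent[v] labels v's connected component

print(sv_connectivity(7, [(0, 1), (2, 3), (1, 2), (5, 6)]))  # [0,0,0,0,4,5,5]
```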
Dataset             Description             Nodes      Edges
1kv-500ke-complete  Complete graph          1,000      499,500
20kv-5me-random     Random graph            20,000     5,000,000
1mv-3me-planar      Maximal planar graph    1,000,002  3,000,000
USA-road-d.LKS      Great Lakes road graph  2,758,119  3,397,404
web-Google-con      Google web graph        855,802    4,291,352
• Complete graph: every vertex is connected to every other vertex

• Random graph: edges are added at random between unique pairs of vertices

• Great Lakes road graph: from the 9th DIMACS Implementation Challenge

• Google web graph: undirected version of the largest connected component of the Google graph of web pages and the hyperlinks between them, from the Stanford Network Analysis Platform
• Maximal planar graph
◦ Built layer by layer
◦ The first layer has three vertices and three edges.
◦ Each additional layer has three vertices and nine edges (one possible generator is sketched below).
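The slide specifies only the per-layer counts; the wiring below is our assumption of one natural construction that matches them (each new triangle is nested inside the previous one, with each new vertex joined to two vertices of the previous layer), yielding 3n - 6 edges on n vertices, i.e. a maximal planar graph:

```python
# Layer-by-layer maximal planar graph generator (wiring is our assumption).
def planar_layers(num_layers):
    edges = [(0, 1), (1, 2), (2, 0)]                 # first layer: a triangle
    for layer in range(1, num_layers):
        p = [3 * (layer - 1) + i for i in range(3)]  # previous layer's vertices
        q = [3 * layer + i for i in range(3)]        # new layer's vertices
        edges += [(q[0], q[1]), (q[1], q[2]), (q[2], q[0])]  # 3 new edges
        for i in range(3):                           # 6 edges to previous layer
            edges += [(q[i], p[i]), (q[i], p[(i + 1) % 3])]
    return edges

print(len(planar_layers(333334)))   # 3,000,000 edges on 1,000,002 vertices
```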
[Figure: Connectivity speedup relative to serial (higher is better) on the five datasets above, comparing XMT with 1024 and 2048 TCUs against the GTX 280 (Tesla) and GTX 480 (Fermi) GPUs; speedups in the chart range up to 135.79x.]
• On biconnectivity, we show potential speedups of 9x-33x using a direct implementation of the Tarjan-Vishkin (TV) biconnectivity algorithm [TV85], a logarithmic-time PRAM algorithm.

• When compared with two other algorithms, one based on BFS and the other on DFS, TV was the only algorithm that provided strong speedups on all of the evaluated input graphs.
◦ The other algorithms use less work but lose out to TV on balance.

• Furthermore, TV provided the best speedup on sparse graphs.
[Figure: Biconnectivity speedup relative to serial DFS (higher is better) for pDFS, TV, and TV-BFS on the five datasets, at 64, 1024, and 2048 TCUs each.]
• Biconnectivity provides a good example of how programming differs between XMT and other platforms.

• For both XMT and SMPs, a significant challenge was to improve the work efficiency of subroutines used within the biconnectivity algorithm.
• On XMT, we left the core algorithm as is, without reducing its available parallelism:
◦ When computing graph connectivity (first on the input graph, then on an auxiliary graph), compact the adjacency list every few iterations.
◦ When computing the preorder numbering of the spanning tree of the input graph, accelerate the iterations by choosing faster but more work-demanding list-ranking algorithms for different iterations (“accelerating cascades” [CV86]); a list-ranking sketch follows below.
◦ Transition as many computations as possible from the original input graph to the spanning tree.

• In contrast, speedups on SMPs could not be achieved without reducing the parallelism of TV (e.g., by performing a DFS traversal of the input graph), effectively replacing many of its components.
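List ranking is the basic building block here, so a minimal pointer-jumping sketch in Python (ours; the accelerating-cascades variants trade work for speed around this core). next_[v] is v’s successor in a linked list, the tail points to itself, and the result is each node’s distance to the tail:

```python
# List ranking by pointer jumping (illustrative).
def list_rank(next_):
    n = len(next_)
    rank = [0 if next_[v] == v else 1 for v in range(n)]
    nxt = list(next_)
    for _ in range(max(1, n.bit_length())):   # O(log n) parallel rounds
        # On a PRAM, all nodes jump simultaneously, reading the old arrays.
        rank = [rank[v] + rank[nxt[v]] for v in range(n)]
        nxt = [nxt[nxt[v]] for v in range(n)]
    return rank

# List 0 -> 3 -> 1 -> 2 (tail), given as a successor array:
print(list_rank([3, 2, 2, 1]))   # [3, 1, 0, 2]
```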
• On maximum flow, we show potential speedups of up to 108x compared to a modern CPU architecture running a best serial implementation [CV11].

• The XMT solution is a lock-free PRAM implementation, based on balancing the Goldberg-Tarjan push-relabel algorithm [GT88] (a serial sketch follows this list) with the first PRAM max-flow algorithm [SV82b].

• Performance is highly dependent on the structure of the graph, as determined by:
◦ The amount of parallelism available during execution
◦ The number of parallel steps (kernel invocations)
◦ The amount of memory queuing due to conflicts
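For orientation, a compact serial push-relabel sketch in Python (ours; the XMT version parallelizes the push and relabel work lock-free and balances it with SV, which this does not attempt):

```python
from collections import deque

# Serial FIFO push-relabel sketch (Goldberg-Tarjan; illustrative).
def max_flow(n, cap, s, t):
    res = {}                                  # residual capacities
    adj = [[] for _ in range(n)]
    for (u, v), c in cap.items():
        if (u, v) not in res and (v, u) not in res:
            adj[u].append(v)
            adj[v].append(u)
        res[(u, v)] = res.get((u, v), 0) + c
        res.setdefault((v, u), 0)
    height, excess = [0] * n, [0] * n
    height[s] = n
    for v in adj[s]:                          # saturate all source edges
        f = res[(s, v)]
        res[(s, v)] -= f; res[(v, s)] += f
        excess[v] += f; excess[s] -= f
    active = deque(v for v in range(n) if v not in (s, t) and excess[v] > 0)
    while active:
        u = active.popleft()
        while excess[u] > 0:                  # discharge u completely
            pushed = False
            for v in adj[u]:
                if res[(u, v)] > 0 and height[u] == height[v] + 1:
                    f = min(excess[u], res[(u, v)])
                    res[(u, v)] -= f; res[(v, u)] += f
                    excess[u] -= f; excess[v] += f
                    if v not in (s, t) and excess[v] == f:
                        active.append(v)      # v just became active
                    pushed = True
                    if excess[u] == 0:
                        break
            if not pushed:                    # relabel: lift u just above its
                height[u] = 1 + min(height[v] for v in adj[u]
                                    if res[(u, v)] > 0)
    return excess[t]

cap = {(0, 1): 3, (0, 2): 2, (1, 2): 1, (1, 3): 2, (2, 3): 3}
print(max_flow(4, cap, 0, 3))                 # 5
```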
Dataset    Description                    Nodes    Edges
ADG        Acyclic Dense Graph            1,200    719,400
RLG        Washington Random Level Graph  131,074  391,168
RMF-WIDE   GenRMF Wide Graph              8,192    23,040
RMF-LONG   GenRMF Long Graph              8,192    22,464
RANDOM     Random Graph                   65,536   96,759
• Acyclic Dense Graphs (ADG)
◦ From the 1st DIMACS Challenge [JM93]
◦ Complete directed acyclic graphs
◦ Node degrees range from N-1 down to 1

• Washington Random Level Graphs (RLG)
◦ From the 1st DIMACS Challenge [JM93]
◦ Rectangular grids; each vertex in a row has three edges to randomly chosen vertices in the next row
◦ Source and sink are external to the grid, connected to the first and last rows

• RMF Graphs
◦ From the 1st DIMACS Challenge [JM93] and [GG88]
◦ A sequence of a square grids of vertices (frames), with b x b vertices per frame; N = a x b x b
◦ Each vertex is connected to its neighbors within its frame and to one random vertex in the next frame
◦ Source in the first frame, sink in the last frame
◦ RMF-LONG: many “small” frames; RMF-WIDE: fewer “large” frames

• RANDOM
◦ Random unstructured graphs
◦ Edges are placed uniformly at random between pairs of nodes
◦ Average degree is 6
◦ Short diameter, high degree of parallelism
[Figure: Max-flow speedup (higher is better) on the five datasets for XMT push-relabel with 1024 TCUs (PR.1024) and 64 TCUs (PR.64), and for the GPU implementation cuda_mf; values in the chart range from 0.02x to 108.33x.]
• These experimental algorithmic results show not only that theory-based algorithms can provide good speedups in practice, but also that they are sometimes the only ones that can do so.

• Perhaps most surprising to theorists would be that, for a fair comparison among same-generation many-core platforms, the nominal number of processors matters less than silicon area.
References

[CB05] G. Cong and D. A. Bader. An Experimental Study of Parallel Biconnected Components Algorithms on Symmetric Multiprocessors (SMPs). In Proc. 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS), page 45b, April 2005.

[CKTV10] G. C. Caragea, F. Keceli, A. Tzannes, and U. Vishkin. General-purpose vs. GPU: Comparison of many-cores on irregular workloads. In HotPar ’10: Proceedings of the 2nd USENIX Workshop on Hot Topics in Parallelism, June 2010.

[CV86] R. Cole and U. Vishkin. Deterministic coin tossing and accelerating cascades: micro and macro techniques for designing parallel algorithms. In Proc. STOC 1986.

[CV11] G. Caragea and U. Vishkin. Better Speedups for Parallel Max-Flow. Brief Announcement, SPAA 2011.

[EV11] J. Edwards and U. Vishkin. An Evaluation of Biconnectivity Algorithms on Many-Core Processors. 2011. Under review.

[FM10] S. H. Fuller and L. I. Millett (Eds.). The Future of Computing Performance: Game Over or Next Level? Computer Science and Telecommunications Board, National Academies Press, December 2010.

[GG88] D. Goldfarb and M. Grigoriadis. A Computational Comparison of the Dinic and Network Simplex Methods for Maximum Flow. Annals of Operations Research, 13:81-123, 1988.

[GT88] A. Goldberg and R. Tarjan. A new approach to the maximum-flow problem. Journal of the ACM, 1988.

[HH10] Z. He and B. Hong. Dynamically Tuned Push-Relabel Algorithm for the Maximum Flow Problem on CPU-GPU-Hybrid Platforms. In Proc. 24th IEEE International Parallel and Distributed Processing Symposium (IPDPS’10), 2010.

[JM93] D. S. Johnson and C. C. McGeoch, editors. Network Flows and Matching: First DIMACS Implementation Challenge. AMS, Providence, RI, 1993.

[KTCBV11] F. Keceli, A. Tzannes, G. Caragea, R. Barua, and U. Vishkin. Toolchain for programming, simulating and studying the XMT many-core architecture. In Proc. 16th Int. Workshop on High-Level Parallel Programming Models and Supportive Environments (HIPS), in conjunction with IPDPS, Anchorage, Alaska, May 20, 2011, to appear.

[SV82a] Y. Shiloach and U. Vishkin. An O(log n) parallel connectivity algorithm. J. Algorithms, 3(1):57-67, 1982.

[SV82b] Y. Shiloach and U. Vishkin. An O(n² log n) parallel max-flow algorithm. J. Algorithms, 3:128-146, 1982.

[TCPP10] NSF/IEEE-TCPP curriculum initiative on parallel and distributed computing - core topics for undergraduates. http://www.cs.gsu.edu/~tcpp/curriculum/index.php, December 2010.

[TV85] R. E. Tarjan and U. Vishkin. An Efficient Parallel Biconnectivity Algorithm. SIAM J. Computing, 14(4):862-874, 1985.

[WV08] X. Wen and U. Vishkin. FPGA-Based Prototype of a PRAM-on-Chip Processor. In Proceedings of the 5th Conference on Computing Frontiers, CF ’08, pages 55-66, New York, NY, USA, 2008. ACM.