James Edwards and Uzi Vishkin
University of Maryland

Motivation
◦ Begin with a theory of parallel algorithms (PRAM)
◦ Develop an architecture (XMT) based on theory
◦ Validate theory using architecture
◦ Validate architecture using theory

 In order to validate XMT, we need to move beyond simple benchmark kernels.
◦ This is in line with the history of performance benchmarking (e.g., SPEC).

 Triconnectivity is the most complex algorithm that has been tested on XMT.
◦ Only one serial implementation is publicly available, and there is no prior parallel implementation.
◦ Prior work of similar complexity on XMT includes biconnectivity [EV12-PMAM] and maximum flow [CV11-SPAA].
[Figure: hierarchy of PRAM algorithms and supporting techniques: advanced planarity testing, advanced triconnectivity, planarity testing, triconnectivity, st-numbering, k-edge/vertex connectivity, minimum spanning forest, centroid decomposition, tree contraction, Euler tours, ear decomposition search, lowest common ancestors, biconnectivity, strong orientation, graph connectivity, tree Euler tour, list ranking, 2-ruling set, prefix-sums, deterministic coin tossing]
[Figure: input graph G on vertices 1-6 and the triconnected components of G]

High-level structure
◦ Key insight for serial and parallel algorithms: separation pairs lie on cycles in the input graph.
◦ Serial [HT73]: use depth-first search.
◦ Parallel [RV88, MR92]: use an ear decomposition.

[Figure: ear decomposition of G into ears E1, E2, and E3]
Low-level structure
◦ The bulk of the algorithm lies in general subroutines such as graph connectivity.
◦ Implementation of the triconnectivity algorithm was greatly assisted by reuse of a library developed during earlier work on biconnectivity (PMAM '12).
 Using this library, a majority of students successfully completed a programming assignment on biconnectivity in 2-3 weeks in a grad course on parallel algorithms.

The Explicit Multi-Threading (XMT) architecture was developed at the University of Maryland with the following goals in mind:
◦ Good performance on parallel algorithms of any granularity
◦ Support for regular or irregular memory access
◦ Efficient execution of code derived from PRAM algorithms

 A 64-processor FPGA hardware prototype and a software toolchain (compiler and simulator) exist; the latter is freely available for download.
Data set        Vertices (n)   Edges (m)   Sep. pairs (s)
Random-10K      10K            3000K       0
Random-20K      20K            5000K       0
Planar3-1000K   1000K          3000K       0
Ladder-20K      20K            30K         10K
Ladder-100K     100K           150K        50K
Ladder-1000K    1000K          1500K       500K
Random graph: edges are added at random between unique pairs of vertices (see the generator sketch below).
Planar3 graph: vertices are added in layers of three; each vertex in a layer is connected to the other two vertices in the layer and to two vertices of the preceding layer.
Ladder: similar to Planar3, but with two vertices per layer.
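The Random family can be generated as in the following minimal C sketch. The names (gen_random_graph, edge_t) are illustrative rather than taken from the actual benchmark code, and the adjacency matrix is used only to keep the duplicate check simple at sketch scale.

/* Sketch of the "Random" input family: m edges added between unique,
 * uniformly random pairs of distinct vertices.  Assumes m <= n*(n-1)/2. */
#include <stdlib.h>

typedef struct { int u, v; } edge_t;

edge_t *gen_random_graph(int n, int m) {
    edge_t *edges = malloc(m * sizeof *edges);
    char   *adj   = calloc((size_t)n * n, 1);   /* adj[u*n + v] = 1 if {u,v} already present */
    for (int i = 0; i < m; ) {
        int u = rand() % n, v = rand() % n;
        if (u == v || adj[(size_t)u * n + v])
            continue;                           /* reject self-loops and duplicate edges */
        adj[(size_t)u * n + v] = adj[(size_t)v * n + u] = 1;
        edges[i++] = (edge_t){ u, v };
    }
    free(adj);
    return edges;
}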
[Figure: runtime for each data set, normalized to the serial implementation (Core i7), for 64 TCUs (FPGA) and 1024 TCUs (simulated); per-data-set speedups over serial range from 0.99x up to 129.2x]
Predicted runtime: T(n, m, s) = (2.38n + 0.238m + 4.75s) log₂ n
[Figure: predicted vs. simulated runtime in billions of cycles for each data set; prediction error ranges from -25.2% to +29.4%]
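As a quick illustration of how the model above is applied, the sketch below simply evaluates T(n, m, s) for the inputs in the data-set table; nothing beyond the published formula is assumed, and the raw value of T is printed without attaching units to it.

/* Evaluate the runtime model T(n, m, s) = (2.38n + 0.238m + 4.75s) * log2(n)
 * for the benchmark inputs listed in the data-set table. */
#include <stdio.h>
#include <math.h>

static double predict(double n, double m, double s) {
    return (2.38 * n + 0.238 * m + 4.75 * s) * log2(n);
}

int main(void) {
    struct { const char *name; double n, m, s; } in[] = {
        { "Random-10K",      10e3, 3000e3,     0 },
        { "Random-20K",      20e3, 5000e3,     0 },
        { "Planar3-1000K", 1000e3, 3000e3,     0 },
        { "Ladder-20K",      20e3,   30e3,  10e3 },
        { "Ladder-100K",    100e3,  150e3,  50e3 },
        { "Ladder-1000K",  1000e3, 1500e3, 500e3 },
    };
    for (size_t i = 0; i < sizeof in / sizeof in[0]; i++)
        printf("%-14s T = %.4g\n", in[i].name, predict(in[i].n, in[i].m, in[i].s));
    return 0;
}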
 The speedups presented here (up to 129x), in conjunction with prior results for biconnectivity (up to 33x) and max-flow (up to 108x), demonstrate that the advantage of XMT is not limited to small kernels.
◦ Biconnectivity was an exceptional challenge due to the compactness of the serial algorithm.

 This work completes the capstone of the proof-of-concept of PRAM algorithms on XMT.
 With this work, we now have the foundation in place to advance to work on applications.
References
 [CV11-SPAA] G. Caragea and U. Vishkin. Better Speedups for Parallel Max-Flow. Brief announcement, SPAA 2011.
 [EV12-PMAM] J. Edwards and U. Vishkin. Better Speedups Using Simpler Parallel Programming for Graph Connectivity and Biconnectivity. PMAM, 2012.
 [EV12-SPAA] J. Edwards and U. Vishkin. Brief Announcement: Speedups for Parallel Graph Triconnectivity. SPAA, 2012.
 [HT73] J. E. Hopcroft and R. E. Tarjan. Dividing a graph into triconnected components. SIAM J. Computing, 2(3):135–158, 1973.
 [MR92] G. L. Miller and V. Ramachandran. A new graph triconnectivity algorithm and its parallelization. Combinatorica, 12(1):53–76, 1992.
 [KTCBV11] F. Keceli, A. Tzannes, G. Caragea, R. Barua, and U. Vishkin. Toolchain for programming, simulating and studying the XMT many-core architecture. Proc. 16th Int. Workshop on High-Level Parallel Programming Models and Supportive Environments (HIPS), in conjunction with IPDPS, 2011.
 [RV88] V. Ramachandran and U. Vishkin. Efficient parallel triconnectivity in logarithmic time. In Proc. AWOC, pages 33–42, 1988.
 [TV85] R. E. Tarjan and U. Vishkin. An Efficient Parallel Biconnectivity Algorithm. SIAM J. Computing, 14(4):862–874, 1985.
 [WV08] X. Wen and U. Vishkin. FPGA-Based Prototype of a PRAM-on-Chip Processor. In Proc. 5th Conference on Computing Frontiers (CF '08), pages 55–66. ACM, 2008.
 PRAM algorithms are not a good match for current hardware:
◦ Fine-grained parallelism = overheads
 Requires managing many threads
 Synchronization and communication are expensive
 Clustering reduces granularity, but at the cost of load balancing
◦ Irregular memory accesses = poor locality
 Cache is not used efficiently
 Performance becomes sensitive to memory latency
 Main feature of XMT: using similar hardware resources (e.g., silicon area, power consumption) as existing CPUs and GPUs, provide a platform that looks to the programmer as close to a PRAM as possible.
◦ Instead of ~8 “heavy” processor cores, provide ~1,024 “light” cores for parallel code and one “heavy” core for serial code.
◦ Devote on-chip bandwidth to a high-speed interconnection network rather than to maintaining coherence between private caches.
◦ For the PRAM algorithms presented, the number of hardware threads is more important than the processing power per thread, because these algorithms happen to perform more work than an equivalent serial algorithm; this cost is overcome by sufficient parallelism in hardware.
◦ Balance between the tight synchrony of the PRAM and hardware constraints (such as locality) is obtained through support for fine-grained multithreaded code, where a thread can advance at its own speed between (a form of) synchronization barriers (see the sketch below).
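The pattern described in the last bullet can be illustrated, very roughly, with standard POSIX threads. This is only an analogy: on XMT the same pattern is expressed with XMTC spawn/join blocks and hardware-supported scheduling rather than OS threads, which this sketch does not model.

/* Threads advance at their own speed on independent slices of work,
 * meeting only at a barrier before the next phase begins. */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define N 1000000

static double data[N];
static pthread_barrier_t barrier;

static void *worker(void *arg) {
    long id = (long)arg;
    /* Phase 1: each thread processes its own slice; slices may take
     * different amounts of time, and no thread waits until the barrier. */
    for (long i = id; i < N; i += NTHREADS)
        data[i] = data[i] * 2.0 + 1.0;
    pthread_barrier_wait(&barrier);   /* (a form of) synchronization barrier */
    /* Phase 2 would start here, after all threads finish phase 1. */
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    pthread_barrier_init(&barrier, NULL, NTHREADS);
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    pthread_barrier_destroy(&barrier);
    printf("done\n");
    return 0;
}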
 Maximal planar graph (the Planar3 family)
◦ Built layer by layer (see the sketch below)
◦ The first layer has three vertices and three edges.
◦ Each additional layer has three vertices and nine edges: three within the layer and six to the preceding layer.
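A minimal C sketch of this construction follows. The function name gen_planar3 and the particular choice of which two vertices of the preceding layer each new vertex connects to are assumptions; the slides do not specify the exact pairing.

/* Sketch of the Planar3 family: a triangle, then layers of three vertices,
 * each new vertex joined to the other two in its layer and to two vertices
 * of the previous layer (3 + 6 = 9 edges per additional layer). */
#include <stdlib.h>

typedef struct { int u, v; } edge_t;

/* layers >= 1; produces n = 3*layers vertices and m = 3 + 9*(layers-1) edges */
edge_t *gen_planar3(int layers, int *m_out) {
    int m = 3 + 9 * (layers - 1), k = 0;
    edge_t *e = malloc(m * sizeof *e);
    for (int L = 0; L < layers; L++) {
        int base = 3 * L;                      /* first vertex of this layer */
        /* triangle within the layer */
        e[k++] = (edge_t){ base, base + 1 };
        e[k++] = (edge_t){ base + 1, base + 2 };
        e[k++] = (edge_t){ base, base + 2 };
        if (L == 0) continue;
        int prev = base - 3;                   /* first vertex of previous layer */
        for (int i = 0; i < 3; i++) {          /* two edges back per vertex (assumed pairing) */
            e[k++] = (edge_t){ base + i, prev + i };
            e[k++] = (edge_t){ base + i, prev + (i + 1) % 3 };
        }
    }
    *m_out = m;
    return e;
}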