An Interactive Environment for Combinatorial Supercomputing

John R. Gilbert, University of California, Santa Barbara
Viral Shah (UCSB)
Steve Reinhardt (Silicon Graphics)
with thanks to Alan Edelman (MIT & ISC) and Jeremy Kepner (MIT-LL)

Support: DOE Office of Science, DARPA, SGI, ISC

Parallel Computing Today
[figures: Columbia, NASA Ames Research Center; lab Beowulf cluster]

Combinatorial Scientific Computing
Emerging large-scale, high-performance applications:
• Web search and information retrieval
• Knowledge discovery
• Machine learning
• Computational biology
• Bioinformatics
• Sparse matrix methods
• Geometric modeling
• ...
How will combinatorial methods be used by nonexperts?

Analogy: Matrix Division in Matlab
    x = A \ b;
• Works for either full or sparse A
• Is A square?  no => use QR to solve the least squares problem
• Is A triangular or permuted triangular?  yes => sparse triangular solve
• Is A symmetric with positive diagonal elements?  yes => attempt Cholesky after symmetric minimum degree
• Otherwise => use LU on A(:, colamd(A))

Interactive Data Exploration
[figure]

RMAT Approximate Power-Law Graph
[figure]

Strongly Connected Components
[figure]

Matlab*P
    A = rand(4000*p, 4000*p);
    x = randn(4000*p, 1);
    y = zeros(size(x));
    while norm(x-y) / norm(x) > 1e-11
        y = x;
        x = A*x;
        x = x / norm(x);
    end;

Matlab*P History
• Matlab*P 1.0 (1998): Edelman, Husbands, Isbell (MIT)
• Matlab*P 2.0 (2002- ): MIT / UCSB / LBNL
• Star-P (2004- ): Interactive Supercomputing / SGI

Data-Parallel Operations
    < M A T L A B >
    Copyright 1984-2001 The MathWorks, Inc.
    Version 6.1.0.1989a Release 12.1

    >> A = randn(500*p, 500*p)
    A = ddense object: 500-by-500
    >> E = eig(A);
    >> E(1)
    ans = -4.6711 + 22.1882i
    >> e = pp2matlab(E);
    >> ppwhos
    Name   Size        Bytes   Class
    A      500px500p   688     ddense object
    E      500px1      652     ddense object
    e      500x1       8000    double array (complex)

Task-Parallel Operations
    >> quad('4./(1+x.^2)', 0, 1)
    ans = 3.14159270703219
    >> a = (0:3*p) / 4
    a = ddense object: 1-by-4
    >> pp2matlab(a)
    ans = 0   0.25000000000000   0.50000000000000   0.75000000000000
    >> b = a + .25;
    >> c = ppeval('quad', '4./(1+x.^2)', a, b);
    c = ddense object: 1-by-4
    >> sum(c)
    ans = 3.14159265358979

Star-P Architecture
[diagram: a MATLAB® client holding ordinary Matlab variables and the Star-P client, connected to a server manager, matrix manager, and package manager; processors #0 through #n-1 hold the distributed matrices and run packages such as dense/sparse, sort, ScaLAPACK, FFTW, an FPGA interface, MPI user code, UPC user code, ...]

Distributed Sparse Array Structure
[diagram: the rows of the matrix are partitioned across processors P0, P1, P2, ..., Pn]
Each processor stores:
• # of local nonzeros (# local edges)
• range of local rows (local vertices)
• nonzeros in a compressed row data structure (local edges)

The sparse( ) Constructor
    A = sparse(I, J, V, nr, nc);
• Input: ddense vectors I, J, V, dimensions nr, nc
• Output: A(I(k), J(k)) = V(k)
• Sums values with duplicate indices
• Sorts triples < i, j, v > by < i, j >
• Inverse: [I, J, V] = find(A);

Sparse Array and Matrix Operations
• dsparse layout, same semantics as ddense
• Matrix arithmetic: +, max, sum, etc.
• matrix * matrix and matrix * vector
• Matrix indexing and concatenation
      A(1:3, [4 5 2]) = [ B(:, J) C ];
• Linear solvers: x = A \ b;  using SuperLU (MPI)
• Eigensolvers: [V, D] = eigs(A);  using PARPACK (MPI)

Large-Scale Graph Algorithms
• Graph theory, algorithms, and data structures are ubiquitous in sparse matrix computation.
• Time to turn the relationship around!
• Represent a graph as a sparse adjacency matrix.
• A sparse matrix language is a good start on primitives for computing with graphs.
• Leverage the mature techniques and tools of high-performance numerical computation.
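To make the last two slides concrete, here is a minimal sequential sketch in ordinary Matlab: a small directed graph is assembled from edge triples with sparse( ), and one breadth-first step is a sparse matrix-vector product. The example edge list and variable names are illustrative only; in Star-P the same code would use *p dimensions to run on distributed data.

    % A directed graph as a sparse adjacency matrix, and one BFS step
    % as a sparse matrix-vector product (sequential Matlab sketch).
    I = [1 1 2 3 4 5 6 7];        % source vertices of the edges (example data)
    J = [2 3 4 5 6 7 1 2];        % destination vertices
    n = 7;                        % number of vertices
    A = sparse(I, J, 1, n, n);    % A(i,j) = 1 for each edge i -> j

    x = sparse(n, 1);  x(1) = 1;  % indicator vector of the frontier {vertex 1}
    y = A' * x;                   % y(j) is nonzero iff some frontier vertex has an edge to j
    reached = find(y)'            % vertices reached in one BFS step (here: 2 and 3)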
Sparse Adjacency Matrix and Graph
[figure: a 7-vertex directed graph and its sparse adjacency matrix; the product A^T x is illustrated]
• Adjacency matrix: sparse array with nonzeros for graph edges
• Storage-efficient implementation from sparse data structures

Breadth-First Search: Sparse mat * vec
[figure: the same graph; starting from an indicator vector x, successive products A^T x, (A^T)^2 x, ... reach vertices level by level]
• Multiply by the adjacency matrix to step to neighbor vertices
• Work-efficient implementation from sparse data structures

Connected Components of a Graph
• Sequential Matlab uses depth-first search (dmperm), which doesn't parallelize well
• Pointer-jumping algorithms (Shiloach/Vishkin & descendants)
  – repeat
      • Link every (super)vertex to a neighbor
      • Shrink each tree to a supervertex by pointer jumping
  – until no further change
• Other graph kernels in progress:
  – Shortest-path search (after Husbands, LBNL)
  – Bipartite matching (after Riedy, UCB)
  – Strongly connected components (after Pinar, LBNL)

Maximal Independent Set
[figure: a 7-vertex example graph, updated step by step]
    degree = sum(G, 2);
    prob = 1 ./ (2 * degree);
    select = rand(n, 1) < prob;
    if any(select & (G * select))
        % keep higher-degree vertices
    end
    IndepSet = [IndepSet select];
    neighbor = neighbor | (G * select);
    remain = neighbor == 0;
    G = G(remain, remain);
• Starting guess: select some vertices randomly.
• If neighbors are selected, keep only a higher-degree one. Add selected vertices to the independent set.
• Discard neighbors of the independent set. Iterate on the rest of the graph.
A self-contained version of this loop is sketched below.
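The fragment above omits initialization and the conflict-resolution step. The following is a minimal sequential sketch of one Luby-style variant: the example graph, the key-based tie-breaking rule (keep the higher-degree endpoint of a conflicting edge), and all names not on the slide (vertices, key, keep, and so on) are illustrative assumptions, not the Star-P implementation.

    % Randomized maximal independent set (Luby-style), assuming G is a
    % symmetric sparse 0/1 adjacency matrix with no self-loops.
    n = 8;
    G = sparse([1 2 2 3 4 5 6 7], [2 3 4 5 6 7 8 8], 1, n, n);
    G = spones(G + G');                 % symmetrize the example graph
    vertices = (1:n)';                  % original labels of surviving vertices
    IndepSet = [];
    while size(G, 1) > 0
        m = size(G, 1);
        deg = full(sum(G, 2));
        prob = 1 ./ (2 * max(deg, 1));  % low-degree vertices selected more often
        select = double(rand(m, 1) < prob);        % random starting guess
        % Conflict resolution (one possible rule): if two selected vertices
        % are adjacent, keep the one with higher degree, ties broken by index.
        key = deg + (1:m)' / (m + 1);              % distinct keys, ordered mainly by degree
        nbrKey = max(G * spdiags(select .* key, 0, m, m), [], 2);  % max key among selected neighbors
        keep = (select > 0) & (nbrKey < key);
        IndepSet = [IndepSet; vertices(keep)];     % add kept vertices to the set
        neighbor = keep | (G * double(keep) > 0);  % kept vertices and their neighbors
        remain = ~neighbor;
        vertices = vertices(remain);
        G = G(remain, remain);                     % iterate on the rest of the graph
    end
    IndepSet

Kept vertices are pairwise nonadjacent by construction of the key test, and every removed vertex is either in the set or adjacent to it, so the loop ends with a maximal independent set.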
SSCA#2: “Graph Analysis” Benchmark (spec version 1)
Fine-grained, irregular data access; searching and clustering
• Many tight clusters, loosely interconnected
• Input data is edge triples < i, j, label(i,j) >
• Vertices and edges permuted randomly

SSCA#2: Scalable Data Generator
• Given “scale” = log2(#vertices)
• Creates edge triples < i, j, label(i,j) >
• Randomly permutes triples and vertex numbers

Statistics for SSCA2 spec v1.1:

    Scale   #Vertices       #Cliques    #Edges Directed       #Edges Undirected
    10      1,024           186         13,212                3,670
    15      32,768          2,020       1,238,815             344,116
    20      1,048,576       20,643      126,188,649           35,052,403
    25      33,554,432      207,082     12,951,350,000        3,597,598,000
    30      1,073,741,824   2,096,264   1,317,613,000,000     366,003,600,000

Concise SSCA#2 in Star-P
Kernel 1: Construct graph data structures
• Graphs are dsparse matrices, created by sparse( )

Kernels 2 and 3
Kernel 2: Search by edge labels
• About 12 lines of executable Matlab or Star-P
Kernel 3: Extract subgraphs
• Returns subgraphs consisting of vertices and edges within fixed distance of given starting vertices
• Sparse matrix-matrix product for multiple breadth-first search
• About 25 lines of executable Matlab or Star-P

Kernel 4: Clustering by BFS
• Grow local clusters from many seeds in parallel
• Breadth-first search by sparse matrix * matrix
• Cluster vertices connected by many short paths
    % Grow each seed to vertices reached by
    % at least k paths of length 1 or 2
    C = sparse(seeds, 1:ns, 1, n, ns);
    C = A * C;
    C = C + A * C;
    C = C >= k;

Kernel 4: Clustering by Peer Pressure
[figure: a 13-vertex example graph]
Steps in a peer pressure algorithm:
1. Vote for a cluster leader
2. Collect neighbor votes
3. Vote for a new leader (based on neighbor votes)
• Clustering quality depends on details of each step.
• Want relatively few potential leaders, e.g. a maximal independent set. Other choices are possible – for the SSCA2 graph, simpler rules work too.
• Neighbor votes can be combined using various weightings.
• Each version of kernel 4 is about 25 lines of code.

Kernel 4: Clustering by Peer Pressure
[figure: the example graph with each vertex labeled by its current leader]
    [ignore, leader] = max(G);
    S = G * sparse(1:n, leader, 1, n, n);
    [ignore, leader] = max(S);
• Each vertex votes for its highest-numbered neighbor as its leader
• Number of leaders is approximately the number of clusters (small relative to the number of nodes)

Kernel 4: Clustering by Peer Pressure
[figure: the example graph after neighbor votes are gathered]
    [ignore, leader] = max(G);
    S = sparse(leader, 1:n, 1, n, n) * G;
    [ignore, leader] = max(S);
• Matrix multiplication gathers neighbor votes
• S(i,j) is the number of votes for i from j's neighbors
• In SSCA2 (spec 1.0), most of the graph structure is recovered right away; iteration is needed for harder graphs
A self-contained version of this loop is sketched below.
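The peer-pressure fragments above can be combined into one runnable sequential sketch. The example graph, the self-loops, the convergence loop, and the row weighting are illustrative assumptions, not the Star-P kernel 4 code; the weighting is needed because max on a 0/1 column returns its first nonzero row, whereas the slide's rule votes for the highest-numbered neighbor.

    % One peer-pressure clustering pass (sequential Matlab sketch).
    n = 9;
    G = sparse([1 1 2 4 4 5 7 7 8 9], [2 3 3 5 6 6 8 9 9 3], 1, n, n);
    G = spones(G + G' + speye(n));      % symmetrize; self-loops let a vertex vote for itself

    % Step 1: each vertex votes for its highest-numbered neighbor as leader.
    W = spdiags((1:n)', 0, n, n) * G;   % weight row i by vertex number i
    [ignore, leader] = max(W, [], 1);   % leader(j) = highest-numbered neighbor of j

    for iter = 1:10                     % a few rounds suffice for this example
        % Step 2: gather neighbor votes.  S(i,j) = # of j's neighbors voting for i.
        S = sparse(leader, 1:n, 1, n, n) * G;
        % Step 3: each vertex switches to the leader with the most neighbor votes.
        [ignore, newleader] = max(S, [], 1);
        if isequal(newleader, leader), break; end
        leader = newleader;
    end
    leader                              % vertices with the same leader form a cluster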
Scaling Up
Recent results on SGI Altix (up to 128 processors):
• Have run SSCA2 (spec v1.0) on graphs with 2^27 = 134 million vertices and about one billion (10^9) edges
• Have manipulated graphs with 400 million vertices and 4 billion edges
• Timings scale well – for large graphs,
  • 2x problem size => 2x time
  • 2x problem size & 2x processors => same time
• Tracking new SSCA2 draft spec v2.0, in progress
Using this benchmark to tune Star-P sparse array infrastructure

Toolbox for Graph Analysis and Pattern Discovery
Layer 1: Graph Theoretic Tools
• Graph operations
• Global structure of graphs
• Graph partitioning and clustering
• Graph generators
• Visualization and graphics
• Scan and combining operations
• Utilities

Sparse Matrix times Sparse Matrix
• Shows up often as a primitive.
• Graphs are mostly not mesh-like, i.e. they lack geometric locality and good separators.
• On a 2D processor grid, the parallel sparse algorithm looks much like the parallel dense algorithm.
• Redistribute to round-robin cyclic or random distribution for load balance.

Load Balance Without Redistribution
[figure]

Load Balance With Redistribution
[figure]

Compressed Sparse Matrix Storage
Example matrix:
    31   0  53
     0  59   0
    41  26   0
Full storage:
• 2-dimensional array
• (nrows * ncols) memory
Sparse storage:
• Compressed storage by columns (CSC)
• Three 1-dimensional arrays:
      value:     31 41 59 26 53
      row:        1  3  2  3  1
      colstart:   1  3  5  6
• (2*nzs + ncols + 1) memory
• Similarly, CSR

Single Processor MatMul: C = A * B
    C(:, :) = 0;
    for i = 1:n
        for j = 1:n
            for k = 1:n
                C(i, j) = C(i, j) + A(i, k) * B(k, j);
            end
        end
    end
• The n^3 scalar updates can be done in any order.
• Six possible algorithms: ijk, ikj, jik, jki, kij, kji (lots more if you consider blocking for cache)
• Goal: sparse algorithm with time = O(nonzero flops)
• For sparse, even time = O(n^2) is too slow!

Possible Organizations of Sparse MatMul (and barriers to O(flops) work)

Outer product:
    for k = 1:n
        C = C + A(:, k) * B(k, :)
  Barrier: inserting updates into C costs too much if C is sparse.

Inner product:
    for i = 1:n
        for j = 1:n
            C(i, j) = A(i, :) * B(:, j)
  Barrier: n^2 loop iterations cost too much.

Column by column:
    for j = 1:n
        for k where B(k, j) ≠ 0
            C(:, j) = C(:, j) + A(:, k) * B(k, j)
  Loop k only over nonzeros in column j of B; use a sparse accumulator (SPA) for the column updates.

Sparse Accumulator (SPA)
• Abstract data type for a single matrix column
• Operations on SPA:
  – initialize spa: O(n) time, O(n) storage
  – spa = spa + scalar * (CSC vector): O(nnz(vector)) time
  – (CSC vector) = spa: O(nnz(spa)) time
  – spa = 0: O(nnz(spa)) time
  – ... possibly other ops
• Standard implementation of SPA (many variations):
  – n-element array “value”
  – n-element flag array “is-nonzero”
  – linked structure (or other) to gather nonzeros from spa

CSC Sparse Matrix Multiplication with SPA
[figure: for j = 1:n, C(:, j) = A * B(:, j); column j of B is gathered, columns of A are scattered/accumulated into the SPA, and the SPA is gathered back into column j of C]
All matrix columns and vectors are stored compressed except the SPA.

Remarks
• Tools for combinatorial methods built on parallel sparse matrix infrastructure
• Easy-to-use interactive programming environment
  – Rapid prototyping tool for algorithm development
  – Interactive exploration and visualization of data
  – Alan Edelman: “parallel computing is fun again”
• Sparse matrix * sparse matrix is a key primitive
• Matrices over semirings like (min, +) as well as (+, *)
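To make the column-by-column organization and the SPA concrete, here is a minimal sequential sketch written in ordinary Matlab rather than on raw CSC arrays. The function name spmatmul and all variable names are illustrative; a real implementation would work directly on the value/row/colstart arrays described on the earlier slides.

    function C = spmatmul(A, B)
    % Sparse matrix multiplication, column by column, using a sparse
    % accumulator: an m-element value array, an m-element "is-nonzero"
    % flag array, and a list of the rows scattered into the SPA.
    [m, kA] = size(A);  [kB, n] = size(B);
    if kA ~= kB, error('inner matrix dimensions must agree'); end
    value = zeros(m, 1);                  % SPA value array
    isnz  = false(m, 1);                  % SPA occupancy flags
    CI = []; CJ = []; CV = [];            % triples collected for the result
    for j = 1:n
        nzrows = [];                      % rows currently held in the SPA
        [ks, ignore, bvals] = find(B(:, j));
        for t = 1:length(ks)              % spa = spa + B(k,j) * A(:,k), sparsely
            [rows, ignore2, avals] = find(A(:, ks(t)));
            value(rows) = value(rows) + bvals(t) * avals;
            newrows = rows(~isnz(rows));
            isnz(newrows) = true;
            nzrows = [nzrows; newrows];
        end
        CI = [CI; nzrows];                % gather the SPA into column j of C
        CJ = [CJ; j * ones(length(nzrows), 1)];
        CV = [CV; value(nzrows)];
        value(nzrows) = 0;                % reset the SPA in O(nnz(spa)) time
        isnz(nzrows) = false;
    end
    C = sparse(CI, CJ, CV, m, n);

Comparing spmatmul(A, B) with A * B on a couple of sprand matrices is an easy sanity check; the work per column is proportional to the nonzero flops, which is the point of the SPA.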
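The last bullet about semirings can also be illustrated with a tiny sketch: replacing (+, *) by (min, +) turns repeated "matrix-vector products" with a weighted adjacency matrix into Bellman-Ford-style shortest-path relaxation. The example weights and names below are illustrative, and dense matrices are used purely for brevity.

    % Single-source shortest paths by repeated (min,+) matrix-vector products.
    n = 5;
    W = inf(n);                                 % W(i,j) = weight of edge i -> j
    W(1,2) = 3; W(1,3) = 1; W(3,2) = 1; W(2,4) = 2; W(3,5) = 4; W(5,4) = 1;
    for i = 1:n, W(i,i) = 0; end                % zero-weight self-loops keep old distances

    dist = inf(n, 1);  dist(1) = 0;             % distances from source vertex 1
    for step = 1:n-1                            % dist(j) = min over i of (dist(i) + W(i,j))
        dist = min(repmat(dist, 1, n) + W, [], 1)';
    end
    dist                                        % [0 2 1 4 5]' for this example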