Graphs and Sparse Matrices in an Interactive Supercomputing Environment
John R. Gilbert and Viral Shah, University of California at Santa Barbara
with thanks to David Cheng, Ron Choy, Alan Edelman, Parry Husbands

Page Rank Matrix
• Web crawl of 170,000 pages from mit.edu
• Matlab*P spy plot of the matrix of the graph

Parallel Computing Today
[Figures: the Earth Simulator machine; a departmental Beowulf cluster]
But! How do you program it?

C with MPI

    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>
    #include "mpi.h"

    /* TOL is used below but was never defined on the slides; 1e-11 matches
       the tolerance in the Matlab*P version later. */
    #define TOL 1e-11
    #define A(i,j) ( 1.0/((1.0*(i)+(j))*(1.0*(i)+(j)+1)/2 + (1.0*(i)+1)) )

    void errorExit(void);
    double normalize(double* x, int mat_size);

    int main(int argc, char **argv)
    {
        int num_procs;
        int rank;
        int mat_size = 64000;
        int num_components;
        double *x = NULL;
        double *y_local = NULL;
        double norm_old = 1;
        double norm = 0;
        int i, j;
        int count;

        if (MPI_SUCCESS != MPI_Init(&argc, &argv))
            exit(1);
        if (MPI_SUCCESS != MPI_Comm_size(MPI_COMM_WORLD, &num_procs))
            errorExit();
        /* rank is used below but was never initialized on the slides: */
        if (MPI_SUCCESS != MPI_Comm_rank(MPI_COMM_WORLD, &rank))
            errorExit();

C with MPI (2)

        /* pad the matrix size so every process owns the same number of rows */
        if (0 == mat_size % num_procs)
            num_components = mat_size / num_procs;
        else
            num_components = mat_size / num_procs + 1;
        mat_size = num_components * num_procs;

        if (0 == rank) printf("Matrix Size = %d\n", mat_size);
        if (0 == rank) printf("Num Components = %d\n", num_components);
        if (0 == rank) printf("Num Processes = %d\n", num_procs);

        x = (double*) malloc(mat_size * sizeof(double));
        y_local = (double*) malloc(num_components * sizeof(double));
        if ((NULL == x) || (NULL == y_local)) {
            free(x);
            free(y_local);
            errorExit();
        }

        if (0 == rank) {
            for (i = 0; i < mat_size; i++) {
                x[i] = rand();
            }
            norm = normalize(x, mat_size);
        }

C with MPI (3)

        if (MPI_SUCCESS != MPI_Bcast(x, mat_size, MPI_DOUBLE, 0,
                                     MPI_COMM_WORLD))
            errorExit();

        count = 0;
        while (fabs(norm - norm_old) > TOL) {
            count++;
            norm_old = norm;
            for (i = 0; i < num_components; i++) {
                y_local[i] = 0;
            }
            /* local block of the matrix-vector product y = A*x */
            for (i = 0; i < num_components
                        && (i + num_components*rank) < mat_size; i++) {
                for (j = mat_size - 1; j >= 0; j--) {
                    y_local[i] += A(i + rank*num_components, j) * x[j];
                }
            }
            if (MPI_SUCCESS != MPI_Allgather(y_local, num_components,
                                             MPI_DOUBLE, x, num_components,
                                             MPI_DOUBLE, MPI_COMM_WORLD))
                errorExit();

C with MPI (4)

            norm = normalize(x, mat_size);
        }

        if (0 == rank) {
            printf("result = %16.15e\n", norm);
        }
        free(x);
        free(y_local);
        MPI_Finalize();
        exit(0);
    }

    void errorExit(void)
    {
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        printf("%d died\n", rank);
        MPI_Finalize();
        exit(1);
    }

C with MPI (5)

    double normalize(double* x, int mat_size)
    {
        int i;
        double norm = 0;
        for (i = mat_size - 1; i >= 0; i--) {
            norm += x[i] * x[i];
        }
        norm = sqrt(norm);
        for (i = 0; i < mat_size; i++) {
            x[i] /= norm;
        }
        return norm;
    }

Matlab*P

    A = rand(4000*p, 4000*p);
    x = randn(4000*p, 1);
    y = zeros(size(x));
    while norm(x-y) / norm(x) > 1e-11
        y = x;
        x = A*x;
        x = x / norm(x);
    end;

Matlab*P
• Version 1.0: Parry Husbands (now at NERSC) with Alan Edelman, Charles Isbell
• Now: Ron Choy, Alan Edelman, Shah, Gilbert, . . .
• Uses Matlab as an interactive front end to parallel software libraries (Scalapack, FFTW, . . .)
• Data stays on the backend until retrieved explicitly
• Two views of the same data:
  – Data-parallel SIMD
  – Task-parallel SPMD
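For reference, here is a sequential MATLAB sketch of the same computation as the C/MPI program above: it materializes the matrix defined by the A(i,j) macro (with 0-based i and j) and runs the same normalized power iteration. The size n = 1000 is an illustrative stand-in for mat_size = 64000; note that the eight-line Matlab*P version iterates on a random matrix rather than on this one.

    % Sequential sketch of the power iteration in the C/MPI code above.
    % Entries follow the A(i,j) macro, with 0-based i and j.
    n = 1000;                               % illustrative; the C code uses 64000
    [J, I] = meshgrid(0:n-1, 0:n-1);        % I(r,c) = r-1, J(r,c) = c-1
    A = 1 ./ ((I+J).*(I+J+1)/2 + I + 1);

    x = rand(n, 1);
    x = x / norm(x);
    norm_old = 1;
    norm_new = 0;
    while abs(norm_new - norm_old) > 1e-11  % same test as the C code's TOL
        norm_old = norm_new;
        x = A*x;
        norm_new = norm(x);
        x = x / norm_new;
    end
    fprintf('result = %16.15e\n', norm_new);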
Data-Parallel Operations

    < M A T L A B >
    Copyright 1984-2001 The MathWorks, Inc.
    Version 6.1.0.1989a Release 12.1

    >> A = randn(500*p, 500*p)
    A = ddense object: 500-by-500
    >> E = eig(A);
    >> E(1)
    ans = -4.6711 +22.1882i
    >> e = E(:);
    >> whos
      Name   Size      Bytes   Class
      A      500x500     688   ddense object
      E      500x1       652   ddense object
      e      500x1      8000   double array (complex)

Task-Parallel Operations

    >> quad('4./(1+x.^2)', 0, 1)
    ans = 3.14159270703219
    >> a = (0:3*p) / 4
    a = ddense object: 1-by-4
    >> a(:)
    ans =
        0
        0.25000000000000
        0.50000000000000
        0.75000000000000
    >> b = a + .25;
    >> c = mm('quad', '4./(1+x.^2)', a, b);   % mm should really be "feval"!
    c = ddense object: 1-by-4
    >> sum(c(:))
    ans = 3.14159265358979

FFT2 in four lines

    >> A = randn(4096, 4096*p)
    A = ddense object: 4096-by-4096
    >> tic;
    >> B = mm('fft', A);
    >> C = B.';
    >> D = mm('fft', C);
    >> F = D.';
    >> toc
    elapsed_time = 73.50

    >> a = A(:,:);
    >> tic; g = fft2(a); toc
    elapsed_time = 202.95

MultiMatlab (MM) mode
• A Matlab (or Octave) process runs on each node
• Each node runs a sequential Matlab program
• Same distributed data model as in "SIMD" mode
• New: global address space semantics . . .

Matrix multiplication in MM mode

    >> % A, B, C are ddense matrices
    >> C = mm('dgemm', A, dglobal(B))

    function C = dgemm (A, B)
    [m n] = size(B);
    bs = n / getnproc(B);                  % block size: columns per processor
    for i = 1:bs:n                         % the slide's loop bound "ncb" was undefined
        C(:, i:i+bs-1) = A * B(:, i:i+bs-1);   % the slide's "i:i+bs" overran each block
    end

GAS-MM infrastructure
• Built on GASNet (UCB/LBL)
• Network-independent, high-performance primitives for one-sided communication
• Matlab dglobal array indexing is overloaded to call GASNet

Combinatorial Scientific Computing
• Sparse matrix methods (direct, iterative, preconditioning); adaptive multilevel modeling; geometric meshing; searching and information retrieval; machine learning; computational biology; . . .
• How will combinatorial methods be used by people who don't understand them in detail?

Matrix division in Matlab

    x = A \ b;

• Works for either full or sparse A
• Is A square? no => use QR to solve the least-squares problem
• Is A triangular or permuted triangular? yes => sparse triangular solve
• Is A symmetric with positive diagonal elements? yes => attempt Cholesky after symmetric minimum degree ordering
• Otherwise => use LU on A(:, colamd(A))
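A minimal sequential sketch of that dispatch in MATLAB, with the tests applied in the order listed above. This is illustrative only, not the actual MathWorks polyalgorithm: the permuted-triangular check is omitted, and the function name backslash_sketch is an invention.

    function x = backslash_sketch (A, b)
    % Illustrative sketch of the sparse A \ b dispatch described above;
    % not the real polyalgorithm (the permuted-triangular case is omitted).
    [m, n] = size (A);
    if m ~= n                                    % not square:
        [c, R] = qr (A, b);                      %   least squares via sparse QR
        x = R \ c;
    elseif nnz (tril (A, -1)) == 0 || nnz (triu (A, 1)) == 0
        x = A \ b;                               % triangular: builtin triangular solve
    else
        solved = false;
        if isequal (A, A') && all (diag (A) > 0)
            prm = symamd (A);                    % symmetric minimum degree ordering
            [R, flag] = chol (A (prm, prm));     % attempt Cholesky
            if flag == 0
                x (prm, 1) = R \ (R' \ b (prm));
                solved = true;
            end
        end
        if ~solved                               % otherwise: LU on A(:, colamd(A))
            q = colamd (A);
            [L, U, P] = lu (A (:, q));
            x (q, 1) = U \ (L \ (P * b));
        end
    end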
Matlab sparse matrix design principles
• All operations should give the same results for sparse and full matrices (almost all)
• Sparse matrices are never created automatically, but once created they propagate
• Performance is important -- but usability, simplicity, completeness, and robustness are more important
• Storage for a sparse matrix should be O(nonzeros)
• Time for a sparse operation should be O(flops) (as nearly as possible)
• Matlab*P dsparse matrices: same principles, but some different tradeoffs

Sparse matrix operations
• dsparse layout, same semantics as ddense
• For now, only row distribution
• Matrix operators: +, -, max, etc.
• Matrix indexing and concatenation:
      A(1:3, [4 5 2]) = [ B(:, 7) C ];
• A \ b by direct methods
• Conjugate gradients

Sparse data structure
• Full: a 2-dimensional array of real or complex numbers; (nrows*ncols) memory
• Sparse: compressed row storage; about (1.5*nzs + .5*nrows) memory
• Example:

      full:   31   0  53        compressed rows:  value = [31 53 59 41 26]
               0  59   0                          col   = [ 1  3  2  1  2]
              41  26   0

Distributed sparse data structure
[Figure: the matrix's rows distributed in blocks across processors P0, P1, P2, . . ., Pn]
Each processor stores:
• # of local nonzeros
• range of local rows
• nonzeros in CSR form

Sparse matrix times dense vector
• y = A*x
• The first call to matvec caches a communication schedule for matrix A. Later calls to multiply any vector by A use the cached schedule.
• Communication and computation overlap.
• Can use a tuned sequential matvec kernel on each processor.

Sparse linear systems
• x = A\b
• Matrix division uses MPI-based direct solvers:
  – SuperLU_dist: nonsymmetric, static pivoting
  – MUMPS: nonsymmetric, multifrontal
  – PSPASES: Cholesky

      ppsetoption('SparseDirectSolver', 'SUPERLU')

• Iterative solvers implemented in Matlab*P
• Some preconditioners; ongoing work

Application: Fluid dynamics
• Modeling density-driven instabilities in miscible fluids (Goyal, Meiburg)
• Groundwater modeling, oil recovery, etc.
• Mixed finite difference & spectral method
• Large sparse generalized eigenvalue problem:

    function lambda = peigs (A, B, sigma, iter, tol)
    % inverse iteration with shift sigma for the pencil (A, B)
    [m n] = size (A);
    C = A - sigma * B;
    y = rand (m, 1);
    for k = 1:iter
        q = y ./ norm (y);
        v = B * q;
        y = C \ v;
        theta = dot (q, y);
        res = norm (y - theta*q);
        if res <= tol
            break;
        end;
    end;
    lambda = sigma + 1 / theta;   % the slide's "1/theta" is the sigma = 0 case

Combinatorial algorithms in Matlab*P
• Sparse matrices are a good start on primitives for combinatorial scientific computing:
  – Random-access indexing: A(i,j)
  – Neighbor sequencing: find(A(i,:))
  – Sparse table construction: sparse(I, J, V)
• What else do we need?

Sorting in Matlab*P
• [V, perm] = sort(V)
• Common primitive for many sparse matrix and array algorithms: sparse(), indexing, transpose
• Matlab*P uses a parallel sample sort; a sequential sketch follows the example below

Sample sort
• (Perform a random permutation)
• Select p-1 "splitters" to form p buckets
• Route each element to the correct bucket
• Sort each bucket locally
• "Starch" the result to match the distribution of the input vector

Sample sort example
(9 elements on 3 processors; | marks processor boundaries)

    Initial data (after randomizing):    3 6 8 | 1 5 4   | 7 2 9
    Choose splitters (2 and 6), route:   1 2   | 3 6 5 4 | 8 7 9
    Sort local data:                     1 2   | 3 4 5 6 | 7 8 9
    Starch:                              1 2 3 | 4 5 6   | 7 8 9
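The steps above can be sketched sequentially in MATLAB, with cell-array buckets standing in for processors. The function name, the oversampling rate, and the element-by-element routing loop are illustrative assumptions, not the Matlab*P implementation.

    function [V, perm] = samplesort_sketch (V, nproc)
    % Sequential sketch of the sample sort steps above; cell-array buckets
    % stand in for processors.  Assumes V is a row vector.
    n = length (V);
    perm0 = randperm (n);                            % random permutation
    W = V (perm0);
    samp = sort (W (ceil (n * rand (1, 10*nproc)))); % oversample the data
    splitters = samp (10 : 10 : 10*(nproc-1));       % choose p-1 splitters
    bucket = cell (nproc, 1);
    src = cell (nproc, 1);
    for k = 1:n                                      % route each element
        b = 1 + sum (W(k) > splitters);
        bucket{b}(end+1) = W(k);
        src{b}(end+1) = perm0(k);
    end
    V = [];
    perm = [];
    for b = 1:nproc                                  % sort each bucket locally
        [s, ord] = sort (bucket{b});
        V = [V, s];
        perm = [perm, src{b}(ord)];
    end
    % "Starch": in Matlab*P the concatenated result would now be re-blocked
    % to match the distribution of the input vector; sequentially the
    % concatenation is already the answer, with V_sorted == V_input(perm).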
How sparse( ) works
• A = sparse (I, J, V)
• Input: ddense vectors I, J, V (optionally, also dimensions and distribution info)
• Sort triples (i, j, v) by (i, j)
• Starch the vectors for the desired row distribution
• Locally convert to compressed row indices
• Sum values with duplicate indices

Graph / mesh partitioning
• Reduce communication in matvec and other parallel computations
• Reordering for sparse GE
• PARMETIS
• Parts of the Gilbert/Teng Matlab meshpart toolbox
[Figures: a vertex separator in the graph of a matrix; the matrix reordered by nested dissection (nz = 844)]

Geometric mesh partitioning
• Algorithm of Miller, Teng, Thurston, Vavasis
• Partitions irregular finite element meshes into equal-size pieces with few connecting edges
• Guaranteed quality partitions for well-shaped meshes, often very good results in practice
• Existing implementation in sequential Matlab
• Code runs in Matlab*P with very minor changes

Outline of algorithm
1. Project points stereographically from R^d to R^(d+1)
2. Find "centerpoint" (generalized median)
3. Conformal map: rotate and dilate
4. Find great circle
5. Unmap and project down
6. Convert circle to separator

Geometric mesh partitioning
[Figure: example geometric partitions of irregular meshes]

Matching and depth-first search in Matlab
• dmperm: Dulmage-Mendelsohn decomposition
• Square, full-rank A:
  – [p, q, r] = dmperm(A);
  – A(p,q) is block upper triangular with nonzero diagonal
  – also: strongly connected components of a directed graph
  – also: connected components of an undirected graph
• Arbitrary A:
  – [p, q, r, s] = dmperm(A);
  – maximum-size matching in a bipartite graph
  – minimum-size vertex cover in a bipartite graph
  – decomposition into strong Hall blocks

Connected components
• Sequential Matlab uses depth-first search (dmperm), which doesn't parallelize well
• Shiloach-Vishkin algorithm:
  – repeat
    • Link every (super)vertex to a random neighbor
    • Shrink each tree to a supervertex by pointer jumping
  – until no further change
• Originally a processor-efficient PRAM algorithm
• Matlab*P code looks much like the PRAM code

Pointer jumping

    % jump until every local vertex's label points at a root
    while ~all( C(myrows) == C(C(myrows)) )
        C(myrows) = C(C(myrows));
    end

    % keep the smaller of the two candidate labels when linking
    C(myrows) = min( C(myrows), C(C(myrows)) );

Example of execution
[Figures: components after the first iteration; final components]

Page Rank
• Importance ranking of web pages
• Stationary distribution of a Markov chain
• Power method: matvec and vector arithmetic
• Matlab*P page ranking demo (from SC'03) on a web crawl of mit.edu (170,000 pages); a sequential sketch of the power method follows the remarks below

Remarks
• Easy-to-use interactive programming environment
• Interface to existing parallel packages
• Combinatorial methods toolbox being built on parallel sparse matrix infrastructure
  – Much to be done: spanning trees, searches, etc.
• A few issues for ongoing work:
  – Dynamic resource management
  – Fault management
  – Programming in the large
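To close the loop with the Page Rank slide above, here is a minimal sequential sketch of the power method it describes. The adjacency convention, the damping factor 0.85, and the dangling-page treatment are illustrative assumptions, not details of the SC'03 demo.

    function x = pagerank_sketch (G, tol)
    % Sequential sketch of the page-rank power method; not the demo code.
    % G is a sparse 0/1 adjacency matrix with G(i,j) = 1 for a link j -> i.
    n = size (G, 2);
    d = full (sum (G, 1))';             % out-degree of each page
    d (d == 0) = 1;                     % crude fix for dangling pages
    alpha = 0.85;                       % customary damping factor
    x = ones (n, 1) / n;                % start from the uniform distribution
    y = zeros (n, 1);
    while norm (x - y, 1) > tol
        y = x;
        x = alpha * (G * (y ./ d)) + (1 - alpha) / n;  % one matvec per step
        x = x / sum (x);                % renormalize (absorbs dangling mass)
    end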