Computation on meshes, sparse matrices, and graphs
Some slides are from David Culler, Jim Demmel, Bob Lucas, Horst Simon, Kathy Yelick, et al., UCB CS267

Parallelizing Stencil Computations
• Parallelism is simple
• Grid is a regular data structure
• Even decomposition across processors gives load balance
• Spatial locality limits communication cost
  • Communicate only boundary values from neighboring patches
• Communication volume
  • v = total # of boundary cells between patches

Two-dimensional block decomposition
• n mesh cells, p processors
• Each processor has a patch of n/p cells
• Block row (or block column) layout: v = 2 * p * sqrt(n)
• 2-dimensional block layout: v = 4 * sqrt(p) * sqrt(n)

Ghost Nodes in Stencil Computations
Comm cost = α * (#messages) + β * (total size of messages)
[Figure: green = my interior nodes; blue = my boundary nodes; yellow = neighbors' boundary nodes = my "ghost nodes"]
• Keep a ghost copy of neighbors' boundary nodes
• Communicate every second iteration, not every iteration
• Reduces #messages, not total size of messages
• Costs extra memory and computation
• Can also use more than one layer of ghost nodes

Parallelism in Regular Meshes
• Computing a stencil on a regular mesh
  • Need to communicate mesh points near the boundary to neighboring processors
  • Often done with "ghost" regions, which add memory overhead
• Surface-to-volume ratio keeps communication down, but it may still be problematic in practice

Irregular mesh: NASA Airfoil in 2D

Composite Mesh from a Mechanical Structure

Adaptive Mesh Refinement (AMR)
• Adaptive mesh around an explosion
• Refinement done by calculating errors

Adaptive Mesh
Shock waves in gas dynamics, using AMR (Adaptive Mesh Refinement)
See: http://www.llnl.gov/CASC/SAMRAI/

Irregular mesh: Tapered Tube (Multigrid)

Challenges of Irregular Meshes for PDEs
• How to generate them in the first place
  • For example, Triangle, a 2D mesh generator (Shewchuk)
  • 3D mesh generation is harder! For example, QMD (Vavasis)
• How to partition them into patches
  • For example, ParMetis, a parallel graph partitioner (Karypis)
• How to design iterative solvers
  • For example, PETSc, Aztec, Hypre (all from national labs)
  • … Prometheus, a multigrid solver for finite elements on irregular meshes
• How to design direct solvers
  • For example, SuperLU, parallel sparse Gaussian elimination
• These are challenging to do sequentially, and more so in parallel

Converting the Mesh to a Matrix

Sparse matrix data structure (stored by rows)

  31   0  53
   0  59   0
  41  26   0

• Full:
  • 2-dimensional array of real or complex numbers
  • (nrows*ncols) memory
• Sparse:
  • Compressed row storage (CSR): values 31 53 59 41 26, column indices 1 3 2 1 2
  • About (2*nzs + nrows) memory

Distributed row sparse matrix data structure
[Figure: matrix rows divided among processors P0, P1, P2, …, Pn; P0 holds local nonzeros 31 41 59 26 53 with column indices 1 3 2 3 1]
• Each processor stores:
  • # of local nonzeros
  • Range of local rows
  • Nonzeros in CSR form (a matvec sketch using this layout follows below)
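A minimal sequential CSR matvec sketch for the 3-by-3 example above, written as plain Matlab (the array names val, col, and rowptr are illustrative, not from the slides):

  % CSR storage of the example matrix [31 0 53; 0 59 0; 41 26 0]
  val    = [31 53 59 41 26];   % nonzero values, row by row
  col    = [1 3 2 1 2];        % column index of each nonzero
  rowptr = [1 3 4 6];          % row i occupies val(rowptr(i):rowptr(i+1)-1)
  x = [1; 2; 3];
  y = zeros(3,1);
  for i = 1:3
      for k = rowptr(i):rowptr(i+1)-1
          y(i) = y(i) + val(k) * x(col(k));   % y = A*x, touching only nonzeros
      end
  end

In the distributed structure, each processor would run this same loop over its local row range.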
Conjugate gradient iteration to solve A*x = b

  x_0 = 0, r_0 = b, d_0 = r_0
  for k = 1, 2, 3, …
      α_k = (rᵀ_{k-1} r_{k-1}) / (dᵀ_{k-1} A d_{k-1})   [step length]
      x_k = x_{k-1} + α_k d_{k-1}                        [approximate solution]
      r_k = r_{k-1} − α_k A d_{k-1}                      [residual]
      β_k = (rᵀ_k r_k) / (rᵀ_{k-1} r_{k-1})              [improvement this step]
      d_k = r_k + β_k d_{k-1}                            [search direction]

• One matrix-vector multiplication per iteration
• Two vector dot products per iteration
• Four n-vectors of working storage

Parallel Vector Operations
• Suppose x, y, and z are column vectors
• Data is partitioned among all processors
• Vector add: z = x + y
  • Embarrassingly parallel
• Scaled vector add (DAXPY): z = α*x + y
  • Broadcast the scalar α, then independent * and +
• Vector dot product (DDOT): β = xᵀy = Σ_j x[j] * y[j]
  • Independent *, then + reduction

Broadcast and reduction
• Broadcast of 1 value to p processors in log p time
• Reduction of p values to 1 in log p time
• Takes advantage of associativity in +, *, min, max, etc.
[Figure: tree broadcast of α; tree add-reduction of 1 3 1 0 4 −6 3 2]

Parallel Dense Matrix-Vector Product (Review)
• y = A*x, where A is a dense matrix
• Layout: 1D by rows
[Figure: block rows of A and segments of x, y owned by P0–P3]
• Algorithm:
    for each processor j:
        broadcast X(j)
        each processor i computes Y(i) = Y(i) + A(i,j)*X(j)
• A(i) is the n/p-by-n block row that processor Pi owns
• Algorithm uses the formula Y(i) = A(i)*X = Σ_j A(i,j)*X(j)

Parallel sparse matrix-vector product
• Lay out matrix and vectors by rows
• y(i) = sum(A(i,j)*x(j))
• Only compute terms with A(i,j) ≠ 0
[Figure: block rows of sparse A and segments of x, y owned by P0–P3]
• Algorithm:
    each processor i:
        broadcast x(i)
        compute y(i) = A(i,:)*x
• Optimizations
  • Only send each processor the parts of x it needs, to reduce communication
  • Reorder the matrix for better locality by graph partitioning
  • Worry about balancing the number of nonzeros per processor, if rows have very different nonzero counts

Other memory layouts for matrix-vector product
• Column layout of the matrix eliminates the broadcast
  • But adds a reduction to update the destination; same total communication
• Blocked layout uses a broadcast and a reduction, each on only sqrt(p) processors; less total communication
• Blocked layout has advantages in multicore / Cilk++ too
[Figure: column layout over P0–P3 vs. blocked layout over P0–P15]

Graphs and Sparse Matrices
• Sparse matrix is a representation of a (sparse) graph
[Figure: 6-vertex graph and its 6×6 sparsity matrix]
• Matrix entries are edge weights
• Number of nonzeros per row is the vertex degree
• Edges represent data dependencies in matrix-vector multiplication

Graph partitioning
• Assigns subgraphs to processors
• Determines parallelism and locality
• Tries to make subgraphs all the same size (load balance)
• Tries to minimize edge crossings (communication)
• Exact minimization is NP-complete
[Figure: two partitions of the same graph, with edge crossings = 6 vs. edge crossings = 10]

Link analysis of the web
• Web page = vertex
• Link = directed edge
• Link matrix: A(i,j) = 1 if page i links to page j
[Figure: 7-page web graph and its 7×7 link matrix]

Web graph: PageRank (Google) [Brin, Page]
An important page is one that many important pages point to.
• Markov process: follow a random link most of the time; otherwise, go to any page at random
• Importance = stationary distribution of the Markov process
• Transition matrix is p*A + (1-p)*ones(size(A)), scaled so each column sums to 1
• Importance of page i is the i-th entry in the principal eigenvector of the transition matrix
• But the matrix is 1,000,000,000,000 by 1,000,000,000,000
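A minimal power-method sketch in Matlab, following the slide's transition-matrix formula on a tiny made-up link matrix (L, p = 0.85, and the iteration count are illustrative; whether a transpose is needed depends on how the link matrix is oriented):

  % Tiny made-up web: L(i,j) = 1 if page i links to page j (slide convention)
  L = [0 1 1; 1 0 0; 0 1 0];
  p = 0.85;                          % probability of following a link
  % Transition matrix per the slide; L is transposed so that column j
  % holds page j's out-links and the chain moves j -> i with prob M(i,j)
  M = p*L' + (1-p)*ones(size(L));
  M = M ./ sum(M,1);                 % scale so each column sums to 1
  x = ones(3,1)/3;                   % start from the uniform distribution
  for k = 1:100                      % power method: repeated matvec
      x = M*x;
  end
  disp(x')                           % stationary distribution = page importance

At web scale only sparse matvecs with L are affordable, so the dense (1-p) rank-one term would be applied implicitly rather than formed.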
A PageRank Matrix
• Importance ranking of web pages
• Stationary distribution of a Markov chain
• Power method: matvec and vector arithmetic
• Matlab*P page ranking demo (from SC'03) on a web crawl of mit.edu (170,000 pages)

Social Network Analysis in Matlab: 1993
Co-author graph from the 1993 Householder symposium

Social Network Analysis in Matlab: 1993
Sparse adjacency matrix
Which author has the most collaborators?

  >> [count, author] = max(sum(A))
  count = 32
  author = 1
  >> name(author,:)
  ans = Golub

Social Network Analysis in Matlab: 1993
Have Gene Golub and Cleve Moler ever been coauthors?

  >> A(Golub, Moler)
  ans = 0

No. But how many coauthors do they have in common?

  >> AA = A^2;
  >> AA(Golub, Moler)
  ans = 2

And who are those common coauthors?

  >> name( find( A(:,Golub) .* A(:,Moler) ), : )
  ans =
  Wilkinson
  VanLoan
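A sketch of how such a coauthor matrix might be assembled from an edge list, in Matlab (the edge list and size here are made up for illustration; the slides' A came from the Householder symposium coauthorship data):

  % Hypothetical edge list: authors i(k) and j(k) coauthored a paper
  i = [1 1 2 3];
  j = [2 3 3 4];
  n = 4;
  A = sparse([i j], [j i], 1, n, n);   % symmetric: coauthorship is mutual
  A = spones(A);                       % duplicate edges collapse to 0/1
  [count, author] = max(sum(A))        % same query as on the slide

The A^2 query works because (A^2)(i,j) = Σ_k A(i,k)*A(k,j) counts walks of length 2 from i to j, i.e., shared coauthors.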