Graphs and Sparse Matrices in an Interactive Supercomputing Environment

John R. Gilbert and Viral Shah
University of California at Santa Barbara
with thanks to
David Cheng, Ron Choy, Alan Edelman, Parry Husbands
Page Rank Matrix
• Web crawl of 170,000 pages from mit.edu
• Matlab*P spy plot of the matrix of the graph
Parallel Computing Today
Earth Simulator machine
Departmental Beowulf cluster
But!
How do you program it?
C with MPI
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include "mpi.h"

#define TOL 1e-11   /* convergence tolerance (same as the Matlab*P version) */
#define A(i,j) ( 1.0/((1.0*(i)+(j))*(1.0*(i)+(j)+1)/2 + (1.0*(i)+1)) )

void errorExit(void);
double normalize(double* x, int mat_size);

int main(int argc, char **argv)
{
    int num_procs;
    int rank;
    int mat_size = 64000;
    int num_components;
    double *x = NULL;
    double *y_local = NULL;
    double norm_old = 1;
    double norm = 0;
    int i, j;
    int count;

    if (MPI_SUCCESS != MPI_Init(&argc, &argv)) exit(1);
    if (MPI_SUCCESS != MPI_Comm_size(MPI_COMM_WORLD, &num_procs))
        errorExit();
    if (MPI_SUCCESS != MPI_Comm_rank(MPI_COMM_WORLD, &rank))
        errorExit();
C with MPI (2)
    /* Pad the matrix size so it divides evenly among the processes */
    if (0 == mat_size % num_procs) num_components = mat_size / num_procs;
    else num_components = mat_size / num_procs + 1;
    mat_size = num_components * num_procs;

    if (0 == rank) printf("Matrix Size = %d\n", mat_size);
    if (0 == rank) printf("Num Components = %d\n", num_components);
    if (0 == rank) printf("Num Processes = %d\n", num_procs);

    x = (double*) malloc(mat_size * sizeof(double));
    y_local = (double*) malloc(num_components * sizeof(double));
    if ((NULL == x) || (NULL == y_local))
    {
        free(x);
        free(y_local);
        errorExit();
    }

    /* Rank 0 generates and normalizes the random starting vector */
    if (0 == rank)
    {
        for (i = 0; i < mat_size; i++)
        {
            x[i] = rand();
        }
        norm = normalize(x, mat_size);
    }
C with MPI (3)
    if (MPI_SUCCESS !=
        MPI_Bcast(x, mat_size, MPI_DOUBLE, 0, MPI_COMM_WORLD))
        errorExit();

    /* Power iteration: y = A*x on local rows, gather, renormalize */
    count = 0;
    while (fabs(norm - norm_old) > TOL) {
        count++;
        norm_old = norm;
        for (i = 0; i < num_components; i++)
        {
            y_local[i] = 0;
        }
        for (i = 0; i < num_components && (i + num_components*rank) < mat_size; i++)
        {
            for (j = mat_size - 1; j >= 0; j--)
            {
                y_local[i] += A(i + rank*num_components, j) * x[j];
            }
        }
        if (MPI_SUCCESS != MPI_Allgather(y_local, num_components,
                MPI_DOUBLE, x, num_components, MPI_DOUBLE, MPI_COMM_WORLD))
            errorExit();
C with MPI (4)
        norm = normalize(x, mat_size);
    }

    if (0 == rank)
    {
        printf("result = %16.15e\n", norm);
    }
    free(x);
    free(y_local);
    MPI_Finalize();
    exit(0);
}

void errorExit(void)
{
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("%d died\n", rank);
    MPI_Finalize();
    exit(1);
}
C with MPI (5)
double normalize(double* x, int mat_size)
{
    int i;
    double norm = 0;
    for (i = mat_size - 1; i >= 0; i--)
    {
        norm += x[i] * x[i];
    }
    norm = sqrt(norm);
    for (i = 0; i < mat_size; i++)
    {
        x[i] /= norm;
    }
    return norm;
}
Matlab*P
A = rand(4000*p, 4000*p);
x = randn(4000*p, 1);
y = zeros(size(x));
while norm(x-y) / norm(x) > 1e-11
    y = x;
    x = A*x;
    x = x / norm(x);
end
Matlab*P
• Version 1.0: Parry Husbands (now at NERSC) with
Alan Edelman, Charles Isbell
• Now: Ron Choy, Alan Edelman, Shah, Gilbert, . . .
• Uses Matlab as an interactive front end to parallel
software libraries (ScaLAPACK, FFTW, . . . )
• Data stays on backend until retrieved explicitly
• Two views of the same data:
– Data-parallel SIMD
– Task-parallel SPMD
Data-Parallel Operations
< M A T L A B >
Copyright 1984-2001 The MathWorks, Inc.
Version 6.1.0.1989a Release 12.1
>> A = randn(500*p, 500*p)
A = ddense object: 500-by-500
>> E = eig(A);
>> E(1)
ans = -4.6711 +22.1882i
>> e = E(:);
>> whos
  Name    Size       Bytes  Class
  A       500x500      688  ddense object
  E       500x1        652  ddense object
  e       500x1       8000  double array (complex)
Task-Parallel Operations
>> quad('4./(1+x.^2)', 0, 1)
ans = 3.14159270703219
>> a = (0:3*p) / 4
a = ddense object: 1-by-4
>> a(:)
ans =
0
0.25000000000000
0.50000000000000
0.75000000000000
>> b = a+.25;
>> c = mm('quad','4./(1+x.^2)', a, b);
c = ddense object: 1-by-4
>> sum(c(:))
ans = 3.14159265358979
% Should be "feval"!
FFT2 in four lines
>> A = randn(4096, 4096*p)
A = ddense object: 4096-by-4096
>> tic;
>> B = mm('fft', A);
>> C = B.';
>> D = mm('fft', C);
>> F = D.';
>> toc
elapsed_time = 73.50
>> a = A(:,:);
>> tic; g = fft2(a); toc
elapsed_time = 202.95
MultiMatlab (MM) mode
• A Matlab (or Octave) process runs on each node
• Each node runs a sequential Matlab program
• Same distributed data model as in “SIMD” mode
• New: Global address space semantics …
Matrix multiplication in MM mode
>> % A, B, C are ddense matrices
>> C = mm('dgemm', A, dglobal(B))

function C = dgemm (A, B)
[m, n] = size(B);
bs = n / getnproc(B);                  % columns per processor
for i = 1:bs:n
    C(:, i:i+bs-1) = A * B(:, i:i+bs-1);
end
GAS-MM infrastructure
• Built on GASNet
(UCB/LBL)
• Network-independent,
high-performance
primitives for one-sided
communication
• Matlab dglobal array
indexing is overloaded
to call GASNet
Combinatorial Scientific Computing
• Sparse matrix methods (direct, iterative,
preconditioning); adaptive multilevel
modeling; geometric meshing; searching and
information retrieval; machine learning;
computational biology; . . .
• How will combinatorial methods be used by
people who don’t understand them in detail?
Matrix division in Matlab
x = A \ b;
• Works for either full or sparse A
• Is A square?
    no  => use QR to solve least squares problem
• Is A triangular or permuted triangular?
    yes => sparse triangular solve
• Is A symmetric with positive diagonal elements?
    yes => attempt Cholesky after symmetric minimum degree
• Otherwise
    => use LU on A(:, colamd(A))
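The decision tree above can be sketched in sequential Matlab for a sparse A. This is an illustration only, not MathWorks' implementation: the real polyalgorithm handles more cases (permuted triangular, banded, Hessenberg), lu_solve is a helper name invented here, and linsolve stands in for the triangular solves.

function x = backslash_sketch (A, b)
% Sketch of the dispatch above for a sparse A -- illustration only,
% not the actual Matlab implementation.
[m, n] = size(A);
if m ~= n
    [Q, R] = qr(full(A), 0);               % rectangular: least squares via QR
    x = linsolve(R, Q' * full(b), struct('UT', true));
elseif istriu(A)
    x = linsolve(full(A), full(b), struct('UT', true));    % back substitution
elseif istril(A)
    x = linsolve(full(A), full(b), struct('LT', true));    % forward substitution
elseif isequal(A, A') && all(diag(A) > 0)
    p = symamd(A);                         % symmetric minimum degree ordering
    [R, flag] = chol(A(p, p));
    if flag == 0
        x(p, 1) = R \ (R' \ b(p));         % Cholesky succeeded
    else
        x = lu_solve(A, b);                % not positive definite after all
    end
else
    x = lu_solve(A, b);
end
end

function x = lu_solve (A, b)               % invented helper name
q = colamd(A);                             % column ordering for sparsity
[L, U, P] = lu(A(:, q));
x(q, 1) = U \ (L \ (P * b));
end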
Matlab sparse matrix design principles
• All operations should give the same results for
sparse and full matrices (almost all)
• Sparse matrices are never created automatically,
but once created they propagate
• Performance is important -- but usability, simplicity,
completeness, and robustness are more important
• Storage for a sparse matrix should be O(nonzeros)
• Time for a sparse operation should be O(flops)
(as nearly as possible)
Matlab*P dsparse matrices: same principles,
but some different tradeoffs
Sparse matrix operations
• dsparse layout, same semantics as ddense
• For now, only row distribution
• Matrix operators: +, -, max, etc.
• Matrix indexing and concatenation
    A(1:3, [4 5 2]) = [ B(:,7) C ];
• A \ b by direct methods
• Conjugate gradients
Sparse data structure
Example from the slide figure:
        [ 31   0  53 ]
    A = [  0  59   0 ]      values:       31 53 59 41 26
        [ 41  26   0 ]      col indices:   1  3  2  1  2
• Full:
    – 2-dimensional array of real or complex numbers
    – (nrows*ncols) memory
• Sparse:
    – compressed row storage
    – about (1.5*nzs + .5*nrows) memory
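For concreteness, here is the example matrix above pushed through a sequential CSR matvec in plain Matlab (illustration only; the Matlab*P kernel is compiled code):

% Compressed row storage for the 3-by-3 example above, and a CSR matvec.
vals   = [31 53 59 41 26];     % nonzero values, row by row
colidx = [ 1  3  2  1  2];     % column index of each nonzero
rowptr = [ 1  3  4  6];        % where each row starts in vals (length nrows+1)
x = [1; 2; 3];
y = zeros(3, 1);
for i = 1:3
    for k = rowptr(i) : rowptr(i+1) - 1
        y(i) = y(i) + vals(k) * x(colidx(k));
    end
end
A = sparse([1 1 2 3 3], colidx, vals, 3, 3);   % same matrix, built-in format
assert(isequal(y, A * x))                      % CSR loop matches A*x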
Distributed sparse data structure
[Figure: block rows of the example matrix distributed across processors P0, P1, P2, . . . , Pn]
Each processor stores:
• # of local nonzeros
• range of local rows
• nonzeros in CSR form
Sparse matrix times dense vector
• y=A*x
• The first call to matvec caches a
communication schedule for matrix A.
Later calls to multiply any vector by A use
the cached schedule.
• Communication and computation overlap.
• Can use a tuned sequential matvec kernel
on each processor.
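The caching pattern can be sketched sequentially with a persistent variable standing in for the per-matrix communication schedule. This is a hypothetical illustration, not the Matlab*P implementation (matvec_cached is an invented name); in the sketch the "schedule" is just the set of columns A actually touches.

function y = matvec_cached (A, x)
% Hypothetical sketch of the idea above: analyze A once, reuse the
% analysis on every later multiply with the same matrix.
persistent lastA needed
if isempty(lastA) || ~isequal(lastA, A)     % first call with this matrix:
    lastA = A;                              % analyze it and cache the result
    [~, j] = find(A);
    needed = unique(j);                     % entries of x this matrix reads
end
y = A(:, needed) * x(needed);               % later calls reuse the cached "schedule"
end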
Sparse linear systems
• x=A\b
• Matrix division uses MPI-based direct solvers:
– SuperLU_dist: nonsymmetric static pivoting
– MUMPS: nonsymmetric multifrontal
– PSPASES: Cholesky
ppsetoption('SparseDirectSolver', 'SUPERLU')
• Iterative solvers implemented in Matlab*P
• Some preconditioners; ongoing work
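Conjugate gradients shows why iterative solvers port to Matlab*P with little change: the loop below uses only matvec, inner products, and vector updates, all of which are overloaded for dsparse and ddense operands. A generic textbook CG, not the Matlab*P library routine:

function x = cg_sketch (A, b, tol, maxit)
% Generic conjugate gradients for symmetric positive definite A.
x = zeros(size(b));
r = b;                         % residual (x starts at zero)
p = r;                         % search direction
rr = r' * r;
for k = 1:maxit
    Ap = A * p;                % the only matvec per iteration
    alpha = rr / (p' * Ap);
    x = x + alpha * p;
    r = r - alpha * Ap;
    rr_new = r' * r;
    if sqrt(rr_new) < tol * norm(b)
        break
    end
    p = r + (rr_new / rr) * p;
    rr = rr_new;
end
end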
Application: Fluid dynamics
• Modeling density-driven
instabilities in miscible
fluids (Goyal, Meiburg)
• Groundwater modeling,
oil recovery, etc.
• Mixed finite difference &
spectral method
• Large sparse generalized
eigenvalue problem
function lambda = peigs (A, B, sigma, iter, tol)
% Inverse iteration with shift sigma for A*x = lambda*B*x
[m, n] = size (A);
C = A - sigma * B;
y = rand (m, 1);
for k = 1:iter
    q = y ./ norm (y);
    v = B * q;
    y = C \ v;                 % parallel sparse solve
    theta = dot (q, y);
    res = norm (y - theta*q);
    if res <= tol
        break;
    end
end
lambda = sigma + 1 / theta;
Combinatorial algorithms in Matlab*P
• Sparse matrices are a good start on primitives
for combinatorial scientific computing.
– Random-access indexing: A(i,j)
– Neighbor sequencing: find(A(i,:))
– Sparse table construction: sparse(I, J, V)
• What else do we need?
Sorting in Matlab*P
• [V, perm] = sort (V)
• Common primitive for many sparse matrix and
array algorithms: sparse(), indexing, transpose
• Matlab*P uses a parallel sample sort
Sample sort
• (Perform a random permutation)
• Select p-1 “splitters” to form p buckets
• Route each element to the correct bucket
• Sort each bucket locally
• “Starch” the result to match the distribution
of the input vector
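The same steps in sequential Matlab with p = 4 buckets on one processor (illustration only; in Matlab*P each bucket lives on its own processor and starching does the final redistribution):

p = 4;                                     % number of buckets ("processors")
V = rand(1000, 1);
V = V(randperm(numel(V)));                 % random permutation
idx = randperm(numel(V));
S = sort(V(idx(1:p-1)));                   % p-1 splitters from a random sample
bucket = sum(bsxfun(@gt, V, S'), 2) + 1;   % route each element to its bucket
sorted = [];
for b = 1:p
    sorted = [sorted; sort(V(bucket == b))];   % sort each bucket locally
end
assert(issorted(sorted))                   % starch step omitted: it only moves
                                           % elements back to the input distribution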
Sample sort example
Initial data (after randomizing):  3 6 8 | 1 5 4 | 7 2 9
Choose splitters (2 and 6):        three buckets (<= 2), (3..6), (> 6)
Route and sort local data:         1 2 | 3 4 5 6 | 7 8 9
Starch:                            1 2 3 | 4 5 6 | 7 8 9
How sparse( ) works
• A = sparse (I, J, V)
• Input: ddense vectors I, J, V (optionally, also
dimensions and distribution info)
• Sort triples (i, j, v) by (i, j)
• Starch the vectors for desired row distribution
• Locally convert to compressed row indices
• Sum values with duplicate indices
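A sequential sketch of these steps, using triples that rebuild the 3-by-3 example matrix from earlier (one duplicate entry shows the summing step; the distributed version sorts and starches in parallel):

% Triples that rebuild the 3-by-3 example matrix; (1,1) appears twice.
I = [2 1 3 1 3 1];  J = [2 1 1 3 2 1];  V = [59 20 41 53 26 11];
[IJ, perm] = sortrows([I(:) J(:)]);        % sort triples by (i, j)
Vs = V(perm);
[IJu, ~, pos] = unique(IJ, 'rows');        % find duplicate (i, j) pairs
Vu = accumarray(pos, Vs(:));               % sum values with duplicate indices
A = sparse(IJu(:,1), IJu(:,2), Vu);        % local compressed conversion, in effect
assert(isequal(A, sparse(I, J, V)))        % matches the built-in semantics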
Graph / mesh partitioning
• Reduce communication in
matvec and other parallel
computations
• Reordering for sparse GE
• ParMETIS
• Parts of the Gilbert/Teng Matlab meshpart toolbox
Vertex separator in graph of matrix
[Figure: the matrix reordered by nested dissection (spy plot, nz = 844)]
Geometric mesh partitioning
• Algorithm of Miller, Teng, Thurston, Vavasis
• Partitions irregular finite element meshes into equal-size
pieces with few connecting edges
• Guaranteed quality partitions for well-shaped meshes,
often very good results in practice
• Existing implementation in sequential Matlab
• Code runs in Matlab*P with very minor changes
Outline of algorithm
1. Project points stereographically from R^d onto the unit sphere in R^(d+1)
2. Find “centerpoint” (generalized median)
3. Conformal map: Rotate and dilate
4. Find great circle
5. Unmap and project down
6. Convert circle to separator
Geometric mesh partitioning
Matching and depth-first search in Matlab
• dmperm: Dulmage-Mendelsohn decomposition
• Square, full rank A:
    – [p, q, r] = dmperm(A);
    – A(p,q) is block upper triangular with nonzero diagonal
    – also, strongly connected components of a directed graph
    – also, connected components of an undirected graph
• Arbitrary A:
    – [p, q, r, s] = dmperm(A);
    – maximum-size matching in a bipartite graph
    – minimum-size vertex cover in a bipartite graph
    – decomposition into strong Hall blocks
Connected components
• Sequential Matlab uses depth-first search (dmperm),
which doesn’t parallelize well
• Shiloach-Vishkin algorithm:
– repeat
    • Link every (super)vertex to a random neighbor
    • Shrink each tree to a supervertex by pointer jumping
– until no further change
• Originally a processor-efficient PRAM algorithm
• Matlab*P code looks much like the PRAM code
Pointer jumping
while ~all( C(myrows) == C(C(myrows)) )
    C(myrows) = C(C(myrows));
end
C(myrows) = min (C(myrows), C(C(myrows)));
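A sequential sketch of the whole iteration, hooking each supervertex to its minimum-labeled neighbor rather than a random one (a common deterministic variant; the Matlab*P code runs the same steps on distributed vectors):

function C = sv_components (A)
% Each vertex ends up labeled with the smallest vertex number in its
% connected component. A is a symmetric sparse adjacency matrix.
% Example: A = sparse([1 2 4], [2 3 5], 1, 5, 5); C = sv_components(A + A')
n = size(A, 1);
C = (1:n)';                       % every vertex starts as its own supervertex
[i, j] = find(A);                 % edge list
while true
    Cold = C;
    for k = 1:numel(i)            % hook: link each root to its smallest neighbor label
        C(Cold(i(k))) = min(C(Cold(i(k))), Cold(j(k)));
    end
    while ~all(C == C(C))         % shrink: pointer jumping, as above
        C = C(C);
    end
    if isequal(C, Cold)           % no further change
        break
    end
end
end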
Example of execution
[Figures: component labels after the first iteration, and the final components]
Page Rank
• Importance ranking
of web pages
• Stationary distribution
of a Markov chain
• Power method: matvec
and vector arithmetic
• Matlab*P page ranking
demo (from SC’03) on
a web crawl of mit.edu
(170,000 pages)
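A hypothetical sequential sketch of that computation on a random link matrix (the size, density, and damping factor 0.85 are illustrative choices; dangling pages are handled crudely by leaving their columns zero):

n = 1000;  d = 0.85;                        % d: standard damping factor
G = double(sprand(n, n, 4/n) > 0);          % G(i,j) = 1 if page j links to page i
c = full(sum(G, 1));  c(c == 0) = 1;        % out-degrees (avoid divide by zero)
A = G * spdiags(1./c', 0, n, n);            % scale columns: transition matrix
x = ones(n, 1) / n;                         % start from the uniform distribution
for k = 1:1000
    xnew = d * (A * x) + (1 - d) / n;       % one matvec plus teleportation
    if norm(xnew - x, 1) < 1e-10, break; end
    x = xnew;
end
[~, top] = sort(x, 'descend');              % top(1:10): highest-ranked pages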
Remarks
• Easy-to-use interactive programming environment
• Interface to existing parallel packages
• Combinatorial methods toolbox being built on
parallel sparse matrix infrastructure
– Much to be done: spanning trees, searches, etc.
• A few issues for ongoing work
– Dynamic resource management
– Fault management
– Programming in the large
Download