Challenges in
Combinatorial Scientific Computing
John R. Gilbert
University of California, Santa Barbara
Georgia Tech CSE Colloquium
March 12, 2010
Support: NSF, DARPA, SGI
Combinatorial Scientific Computing

"I observed that most of the coefficients in our matrices were zero; i.e., the nonzeros were 'sparse' in the matrix, and that typically the triangular matrices associated with the forward and back solution provided by Gaussian elimination would remain sparse if pivot elements were chosen with care."

– Harry Markowitz, describing the 1950s work on portfolio theory that won the 1990 Nobel Prize in Economics
Graphs and Sparse Matrices: Cholesky factorization

Symmetric Gaussian elimination:

    for j = 1 to n
        add edges between j's higher-numbered neighbors

Fill: new nonzeros in the factor.

[Figure: a 10-vertex graph G(A) and its filled graph G+(A), which is chordal.]
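Since the deck's code is MATLAB-flavored, here is a minimal sketch of that elimination loop acting on the graph, assuming A is a sparse symmetric matrix with a nonzero diagonal; `filledGraph` is a hypothetical name.

    function G = filledGraph(A)
        % Symbolic elimination: returns the filled graph G+(A) as a 0/1 matrix.
        n = size(A,1);
        G = spones(A);               % structure of A only
        for j = 1:n
            nbrs = find(G(:,j));     % current neighbors of vertex j
            hi = nbrs(nbrs > j);     % j's higher-numbered neighbors
            G(hi,hi) = 1;            % fill: make them pairwise adjacent
        end
    end

The nonzeros of G beyond those of spones(A) are exactly the fill.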
Large graphs are everywhere…

• Internet structure
• Social interactions
• Scientific datasets: biological, chemical, cosmological, ecological, …

[Figures: WWW snapshot, courtesy Y. Hyun; yeast protein interaction network, courtesy H. Jeong]
An analogy?

[Diagram: Continuous physical modeling → Linear algebra → Computers]

As the "middleware" of scientific computing, linear algebra has supplied or enabled:

• Mathematical tools
• "Impedance match" to computer operations
• High-level primitives
• High-quality software libraries
• Ways to extract performance from computer architecture
• Interactive environments
An analogy?

[Diagram: Continuous physical modeling → Linear algebra → Computers, beside Discrete structure analysis → Graph theory → Computers]
An analogy? Well, we're not there yet ….

[Diagram: Discrete structure analysis → Graph theory → Computers]

✓ Mathematical tools
? "Impedance match" to computer operations
? High-level primitives
? High-quality software libraries
? Ways to extract performance from computer architecture
? Interactive environments
All-Pairs Shortest Paths on a GPU [Buluç et al.]

Based on the R-Kleene algorithm, over the semiring where + is "min" and × is "add", with the matrix partitioned into 2×2 blocks [A B; C D]:

    A = A*;            % recursive call
    B = AB;  C = CA;
    D = D + CB;
    D = D*;            % recursive call
    B = BD;  C = DC;
    A = A + BC;

Well suited to the GPU architecture:

• In-place computation => low memory bandwidth
• Few, large MatMul calls => low GPU dispatch overhead
• Recursion stack on the host CPU, not on the GPU
• Careful tuning of the GPU code
• Fast matrix-multiply kernel
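As a concrete serial illustration of the recursion, here is a minimal MATLAB sketch on that (min,+) semiring. It is an assumed reconstruction, not the GPU code: `rkleene` and `mpmul` are hypothetical names, and it assumes a dense n-by-n distance matrix with Inf for missing edges, zeros on the diagonal, and MATLAB R2016b+ implicit expansion.

    function D = rkleene(D)
        % All-pairs shortest paths: D = D* on the (min,+) semiring.
        n = size(D,1);
        if n <= 64                              % base case: Floyd-Warshall
            for k = 1:n
                D = min(D, D(:,k) + D(k,:));    % relax through vertex k
            end
            return
        end
        h = floor(n/2); i1 = 1:h; i2 = h+1:n;   % 2x2 blocking [A B; C D]
        A = D(i1,i1); B = D(i1,i2); C = D(i2,i1); Dd = D(i2,i2);
        A  = rkleene(A);                        % A = A*   (recursive call)
        B  = mpmul(A,B);   C  = mpmul(C,A);     % B = AB;  C = CA
        Dd = min(Dd, mpmul(C,B));               % D = D + CB
        Dd = rkleene(Dd);                       % D = D*   (recursive call)
        B  = mpmul(B,Dd);  C  = mpmul(Dd,C);    % B = BD;  C = DC
        A  = min(A, mpmul(B,C));                % A = A + BC
        D = [A B; C Dd];
    end

    function C = mpmul(A,B)
        % (min,+) matrix product: C(i,j) = min over k of A(i,k) + B(k,j).
        C = Inf(size(A,1), size(B,2));
        for k = 1:size(A,2)
            C = min(C, A(:,k) + B(k,:));
        end
    end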
APSP: Experiments and observations

[Figure: runtime vs. matrix dimension, log-log, comparing Floyd-Warshall lifted to the GPU against the unorthodox R-Kleene algorithm; the right primitive yields a 480x speedup.]

• High performance is achievable but not simple
• Carefully chosen and optimized primitives are key
• Matching the architecture and the algorithm is key
Landscape connectivity modeling [Shah et al.]

• Habitat quality, gene flow, corridor identification, conservation planning
• Pumas in southern California: 12 million nodes, < 1 hour
• Targeting larger problems: Yellowstone-to-Yukon corridor

Figures courtesy of Brad McRae, NCEAS
Coloring for parallel nonsymmetric preconditioning [Aggarwal et al.]

• Level set method for multiphase interface problems in 3D
• Nonsymmetric-structure, second-order-accurate octree discretization
• BiCGSTAB preconditioned by parallel triangular solves
• 263 million DOF
NMF traffic analysis results [Karpinski et al.]
Model reduction and graph decomposition [Mezic group, UCSB]

Spectral graph decomposition combined with dynamical-systems analysis deconstructs a possibly unknown network into inputs, outputs, and forward and feedback loops, and allows identification of a minimal functional unit (MFU) of the system.

Approach:
1. Decompose networks
2. Propagate uncertainty through components
3. Iteratively aggregate component uncertainty

Trim the network, preserve dynamics!

[Figure: an H-V decomposition labeling the input (initiator), forward production unit, feedback loops, additional functional requirements, and output (execution); node 4 and several connections are pruned with no loss of performance. Plots of output level for the MFU alone and with feedback loops show that sensitive edges (whose removal leads to lack of production) are easily identifiable, and that the roles of different feedback loops can be identified.]
Horizontal-vertical decomposition [Mezic et al.]

[Figure: a 9-vertex graph and its H-V decomposition: strongly connected components arranged into levels 1 through 4 of a DAG.]

• Strongly connected components, ordered by levels of the DAG
• Linear time sequentially; no work/span-efficient parallel algorithms known
• … but we don't want an exact algorithm anyway; edges are probabilities, and the application is a kind of statistical model reduction …
The Primitives Challenge

• By analogy to numerical scientific computing . . .

[Figure: Basic Linear Algebra Subroutines (BLAS): speed (MFlops) vs. matrix size n for μ = xᵀy, y = A*x, and C = A*B.]

• What should the combinatorial BLAS look like?
Primitives should…

• Supply a common notation to express computations
• Have broad scope but fit into a concise framework
• Allow programming at the appropriate level of abstraction and granularity
• Scale seamlessly from desktop to supercomputer
• Hide architecture-specific details from users
Frameworks for graph primitives

Many possibilities; none completely satisfactory; little work on common frameworks or interoperability.

• Visitor-based, distributed-memory: PBGL
• Visitor-based, multithreaded: MTGL
• Heterogeneous, tuned kernels: SNAP
• Scan-based, vectorized: NESL
• Map-reduce: lots of visibility
• Sparse array-based: Matlab*P, KDT, Combinatorial BLAS
Sparse array-based primitives

Identification of primitives:

• Sparse matrix-matrix multiplication (SpGEMM)
• Sparse matrix-dense vector multiplication
• Element-wise operations (.*)
• Sparse matrix indexing

Matrices on various semirings: (×, +), (and, or), (+, min), …
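To make the semiring point concrete, a tiny assumed example (not from the slides): one Bellman-Ford relaxation round is exactly a matrix-vector product on the (+, min) semiring. The edge weights here are made up, and MATLAB R2016b+ implicit expansion is assumed.

    n = 4; W = inf(n);                                    % Inf = no edge
    W(sub2ind([n n], [1 1 2 3], [2 3 3 4])) = [5 2 1 4];  % weights of edges i -> j
    d = [0; inf(n-1,1)];                                  % distances from vertex 1
    for round = 1:n-1
        % (min,+) product: d(j) <- min(d(j), min over i of d(i) + W(i,j))
        d = min(d, min(W' + d', [], 2));
    end
    disp(d')                                              % prints 0 5 2 6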
Multiple-source breadth-first search

[Figure: a 7-vertex graph and its transposed adjacency matrix AT; multiplying AT by a sparse matrix X, one frontier column per search, gives the next frontiers ATX.]

• Sparse array representation => space efficient
• Sparse matrix-matrix multiplication => work efficient
• Load balance depends on the SpGEMM implementation
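A minimal serial sketch of the idea, assuming A is a sparse adjacency matrix with A(i,j) nonzero for edge i -> j; `msbfs` is a hypothetical name, and a real implementation would keep X sparse inside a boolean-semiring SpGEMM rather than masking with dense arrays.

    function levels = msbfs(A, sources)
        % One BFS per source; column j of X is the frontier of search j.
        n = size(A,1); k = numel(sources);
        X = sparse(sources, 1:k, 1, n, k);
        levels = -ones(n, k);                          % -1 = not yet reached
        levels(sub2ind([n k], sources(:).', 1:k)) = 0;
        d = 0;
        while nnz(X) > 0
            d = d + 1;
            X = spones(A.' * X);                       % expand all frontiers at once
            X = X .* (levels < 0);                     % keep newly discovered vertices
            levels(logical(X)) = d;
        end
    end

For example, msbfs(A, [1 5 7]) runs three searches in one sequence of SpGEMMs.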
SpGEMM: Sparse Matrix x Sparse Matrix

Why focus on SpGEMM?

• Graph clustering (Markov, peer pressure)
• Subgraph / submatrix indexing
• Shortest path calculations
• Betweenness centrality
• Graph contraction
• Cycle detection
• Multigrid interpolation & restriction
• Colored intersection searching
• Applying constraints in finite element computations
• Context-free parsing ...
Distributed-memory sparse matrix-matrix multiplication

• 2D block layout
• Outer product formulation
• Sequential "hypersparse" kernel

    Cij += Aik * Bkj

• Scales well to hundreds of processors
• Betweenness centrality benchmark: over 200 MTEPS
• Experiments: TACC Lonestar cluster

[Figure: time vs. number of cores for a 1M-vertex RMAT graph.]
The Combinatorial BLAS: Example of use

Betweenness centrality (BC): what fraction of shortest paths pass through this node?

[Figure: software stack for an application of the Combinatorial BLAS, from applications and algorithms at the top down through Brandes' algorithm to the CBLAS primitives.]
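For reference, a condensed serial sketch of Brandes' algorithm for unweighted graphs, in array form. This is an assumed illustration (`brandes` is a hypothetical name), not the one-page Combinatorial BLAS code.

    function bc = brandes(A)
        % Betweenness centrality of a 0/1 adjacency matrix A, one BFS per source.
        n = size(A,1); bc = zeros(n,1);
        for s = 1:n
            sigma = zeros(n,1); sigma(s) = 1;      % shortest-path counts
            dist  = -ones(n,1); dist(s)  = 0;      % BFS levels (-1 = unreached)
            levels = {s}; d = 0;
            while true
                t = A.' * (sigma .* (dist == d));  % path counts into next level
                new = find(t & dist < 0);
                if isempty(new), break; end
                sigma(new) = t(new); dist(new) = d + 1;
                d = d + 1; levels{d+1} = new;
            end
            delta = zeros(n,1);                    % dependency accumulation
            for l = d:-1:1
                w = levels{l+1};
                coeff = zeros(n,1);
                coeff(w) = (1 + delta(w)) ./ sigma(w);
                delta = delta + sigma .* (A * coeff) .* (dist == l - 1);
            end
            bc = bc + delta;
            bc(s) = bc(s) - delta(s);              % a source gets no credit
        end
    end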
BC Performance in distributed memory

[Figure: BC performance on RMAT power-law graphs with 2^scale vertices and average degree 8; TEPS score (millions) vs. number of cores, from 25 up to 484, for scales 17 through 20.]

• TEPS = Traversed Edges Per Second
• One page of code using the Combinatorial BLAS
The Architecture & Algorithms Challenge

[Images: two Nvidia 8800 GPUs, > 1 TFLOPS; the Oak Ridge / Cray Jaguar, > 1.75 PFLOPS; Intel's 80-core chip, > 1 TFLOPS]

• Parallelism is no longer optional…
• … in every part of a computation.
High-performance architecture

• Most high-performance computer designs allocate resources to optimize Gaussian elimination on large, dense matrices: P·A = L·U.
• Originally, because linear algebra is the middleware of scientific computing.
• Nowadays, largely for bragging rights.
Strongly connected components

[Figure: a 7-vertex graph G(A) and the symmetric permutation P·A·Pᵀ of its adjacency matrix to block triangular form.]

• Symmetric permutation to block triangular form
• Diagonal blocks are strong Hall (irreducible / strongly connected)
• Sequential: linear time by depth-first search [Tarjan]
• Parallel: divide & conquer; work and span depend on the input [Fleischer, Hendrickson, Pinar]
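A short sketch of the sequential computation using MATLAB's built-in digraph tools (R2016a or later), assuming A is a sparse adjacency matrix; the level assignment mirrors the horizontal-vertical decomposition shown earlier.

    G = digraph(A);
    comp  = conncomp(G);          % strongly connected component of each vertex
    C     = condensation(G);      % the DAG of components
    order = toposort(C);          % a valid sequential ordering
    lev = zeros(1, numnodes(C));  % DAG level = longest path from a source
    for v = order
        pred = predecessors(C, v);
        if ~isempty(pred), lev(v) = 1 + max(lev(pred)); end
    end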
The memory wall blues

• Most of memory is hundreds or thousands of cycles away from the processor that wants it.
• You can buy more bandwidth, but you can't buy less latency. (Speed of light, for one thing.)
• You can hide latency with either locality or parallelism.
• Most interesting graph problems have lousy locality.
• Thus the algorithms need even more parallelism!
Architectural impact on algorithms

Matrix multiplication, C = A * B, takes O(n^3) operations:

    C = zeros(n);
    for i = 1:n
        for j = 1:n
            for k = 1:n
                C(i,j) = C(i,j) + A(i,k) * B(k,j);
            end
        end
    end
Architectural impact on algorithms

Naïve 3-loop matrix multiply [Alpern et al., 1992]:

[Figure from Larry Carter: log cycles/flop vs. log problem size for the naïve loop; measured time grows as T = N^4.7. Size 2000 took 5 days; size 12000 would take 1095 years.]

The naïve algorithm is O(N^5) time under the UMH (uniform memory hierarchy) model; BLAS-3 DGEMM and recursive blocked algorithms are O(N^3).
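To show the blocking that recovers O(N^3) behavior, here is a minimal MATLAB sketch, assuming the block size b divides n (`blockedMatmul` is a hypothetical name). Each b-by-b tile, once loaded, is reused across a whole tile product instead of a single scalar update.

    function C = blockedMatmul(A, B, b)
        % Cache-blocked C = A*B: same flops, far fewer slow-memory transfers.
        n = size(A,1); C = zeros(n);
        for i = 1:b:n
            for j = 1:b:n
                for k = 1:b:n
                    I = i:i+b-1; J = j:j+b-1; K = k:k+b-1;
                    C(I,J) = C(I,J) + A(I,K) * B(K,J);   % tile multiply-add
                end
            end
        end
    end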
The architecture & algorithms challenge

• A big opportunity exists for computer architecture to influence combinatorial algorithms.
• (Maybe even vice versa.)
A novel architectural approach: Cray MTA / XMT

• Hide latency by massive multithreading
• Per-tick context switching
• Uniform (sort of) memory access time
• But the economic case is still not completely clear.
Some Challenges
Features of (many) large graph applications

• "Feasible" means O(n), or even less.
• You can't scan all the data:
  – you want to poke at it in various ways
  – maybe interactively.
• Multiple simultaneous queries to the same graph
  – maybe to differently filtered subgraphs
  – throughput and response time are both important.
• Benchmark data sets are a big challenge!
• Sometimes, think of data not as a finite object but as a statistical sample of an infinite process.
Productivity

Raw performance isn't the only criterion. Other factors include:

• Seamless scaling from desktop to HPC
• Interactive response for data exploration and visualization
• Rapid prototyping
• Just plain programmability
The Education Challenge

• How do you teach this stuff?
• Where do you go to take courses in
  – graph algorithms …
  – … on massive data sets …
  – … in the presence of uncertainty …
  – … analyzed on parallel computers …
  – … applied to a domain science?
Final thoughts

• Combinatorial algorithms are pervasive in scientific computing and will become more so.
• Linear algebra and combinatorics can support each other in computation as well as in theory.
• A big opportunity exists for computer architecture to influence combinatorial algorithms.
• This is a great time to be doing research in combinatorial scientific computing!
Extra Slides
A few research questions in high-performance combinatorial computing

• Primitives for computation on graphs and other discrete structures
  – What is the right set of primitives?
  – What is the API?
  – How do you get performance on a wide range of platforms?
• Tools and libraries: how will results be used by nonspecialists?
• Building useful reference data sets, data generators, and benchmarks
• How can computer architecture influence combinatorial algorithms?
• Computing global properties of networks
  – Not just density, diameter, degree distribution, clustering coefficient
  – Connectivity, robustness, spectral properties
  – Sensitivity analysis, stochastic and dynamic settings
SpGEMM Details

Two versions of sparse GEMM:

• 1D block-column distribution: processor i owns block column Bi and Ci, and computes Ci = Ci + A * Bi.
• 2D block distribution: processor (i,j) owns block Cij and computes Cij += Aik * Bkj over k.

[Figure: the 1D block-column layout (A1 … A8) x (B1 … B8) = (C1 … C8), and the Cij += Aik Bkj update in the 2D block layout.]
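A toy serial sketch of the 2D formulation, assuming q evenly divides n; `spgemm2d` is a hypothetical name, and each (i,j,k) iteration stands in for work a real implementation assigns to one processor. The 1D variant is the special case Ci = Ci + A * Bi.

    function C = spgemm2d(A, B, q)
        % 2D-blocked sparse GEMM on a simulated q-by-q processor grid.
        n = size(A,1); b = n / q;
        blk = @(i) (i-1)*b+1 : i*b;       % index range of block i
        C = sparse(n, n);
        for i = 1:q
            for j = 1:q
                for k = 1:q               % Cij += Aik * Bkj
                    C(blk(i),blk(j)) = C(blk(i),blk(j)) + ...
                        A(blk(i),blk(k)) * B(blk(k),blk(j));
                end
            end
        end
    end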
Modeled bounds on speedup, sparse 1-D & 2-D

[Figure: modeled speedup of the 1-D and 2-D algorithms as functions of problem size N and processor count P.]

• 1-D algorithms do not scale beyond 40x
• Break-even point is around 50 processors
Submatrices are hypersparse (i.e., nnz << n)

On a √p × √p grid of blocks, a matrix with an average of c nonzeros per column has nnz' = c/√p nonzeros per column in each block, which tends to 0 as p grows. Total storage in compressed form goes from O(n + nnz) for the whole matrix to O(n·√p + nnz) summed over the blocks.

• An algorithm whose complexity depends on the matrix dimension n is asymptotically too wasteful.
Complexity measure trends with increasing p

The standard sequential algorithm is O(nnz + flops + n). For the blocks handled on a √p × √p grid, the measures scale as

    n' (dimension) = n / √p
    nnz' (data size) = nnz / p
    flops' (work) = flops / (p·√p)

The work shrinks fastest, so as p grows the O(n) dimension term comes to dominate unless the kernel's complexity is independent of n.
Sequential hypersparse kernel [IPDPS 2008]

• Strictly O(nnz) data structure
• Complexity independent of the matrix dimension
• Outer-product formulation with multi-way merging
• Scalable in terms of the amount of work it performs
Parallel Interactive Tools: Star-P & Knowledge Discovery Toolbox
Star-P

    % Power iteration in Star-P: the "*p" in 4000*p makes that dimension
    % distributed, so A and x live on the parallel server while the loop
    % below is ordinary Matlab. x converges to the principal eigenvector.
    A = rand(4000*p, 4000*p);
    x = randn(4000*p, 1);
    y = zeros(size(x));
    while norm(x-y) / norm(x) > 1e-11
        y = x;
        x = A*x;
        x = x / norm(x);
    end;
Star-P Architecture

[Diagram: a MATLAB® client holding ordinary Matlab variables connects through the Star-P client manager to a server manager; on the server, a package manager and matrix manager coordinate processors #0 through #n-1 running dense/sparse operations, sort, ScaLAPACK, FFTW, an FPGA interface, and MPI or UPC user code. Distributed matrices live on the server.]
Distributed Sparse Array Structure

[Figure: a sparse matrix distributed by block rows across processors P0, P1, P2, …, Pn; the pictured block holds the nonzeros 31, 41, 59, 26, 53.]

• Each processor stores its local vertices & edges in a compressed row structure.
• Has been scaled to >10^8 vertices, >10^9 edges in an interactive session.
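For concreteness, a tiny illustration of one block's compressed-row storage; the values match the figure, but the positions are assumed.

    % CSR arrays for a 3-row local block holding the nonzeros
    % 31, 41, 59, 26, 53 (assumed positions).
    vals   = [31 41 59 26 53];  % nonzero values, row by row
    colind = [1 3 2 1 3];       % column index of each nonzero
    rowptr = [1 3 4 6];         % row i is vals(rowptr(i) : rowptr(i+1)-1)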
Star-P History

• 1998: Matlab*P 1.0: Edelman, Husbands, Isbell (MIT)
• 2002–2006: Matlab*P 2.0: MIT / UCSB / LBNL
• 2004–2009: Interactive Supercomputing (originally with SGI)
• 2008: many large and small Star-P installations, including
  – San Diego Supercomputing Center
  – Pittsburgh Supercomputing Center
• 2009: ISC bought by Microsoft
KDT: A toolbox for graph analysis and pattern discovery [G, Reinhardt, Shah]

Layer 1: graph-theoretic tools

• Graph operations
• Global structure of graphs
• Graph partitioning and clustering
• Graph generators
• Visualization and graphics
• Scan and combining operations
• Utilities
Sample application stack

• Applications: computational ecology, CFD, data exploration
• Preconditioned iterative methods: CG, BiCGStab, etc., plus combinatorial preconditioners (AMG, Vaidya)
• Graph analysis & pattern toolbox: graph querying & manipulation, connectivity, spanning trees, geometric partitioning, nested dissection, NNMF, …
• Distributed sparse matrices: arithmetic, matrix multiplication, indexing, solvers (\, eigs)