CSR-K + Locality-Aware Scheduling = Faster Sparse Computations on Multicores?
Padma Raghavan
Computer Science and Engineering
Institute for Cyberscience
The Pennsylvania State University
Collaborators:
Humayun Kabir and Joshua Booth
Yves Robert and Anne Benoit (scheduling)
Research Supported by NSF
True or False?
- Sparse computations are "old school" numerical computing
- Sparse computations have no role in our "big data" world
- Sparse computations are inherently sequential
- Optimizations for any performance measure are "one-off" --- are there any useful abstractions?
True or False?
- Sparse computations are "old school" numerical computing
  - Yes -- try doing without them!
- Sparse computations have no role in our "big data" world
  - No -- big data are sparse!
- Sparse computations are inherently sequential
  - Looks are deceptive
- Optimizations for any performance measure are "one-off" --- are there any useful abstractions?
  - There is one --- and it begins with the word "Multi"
Goal: Match the algorithm to hardware features
Challenge: Depends on data known only at run-time
Solution? CSR-K + Scheduling: multi-level for multi-core
Compressed Sparse Row (CSR)
- Store and operate only on nonzeroes
- CSR format: store subscripts and nonzero values
- Example: sparse matrix-vector multiplication

    y(i) = 0
    for each nonzero in row i
        load subscript k, A[i][k], x(k)
        y(i) = y(i) + A[i][k] * x(k)
    end for
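The per-row loop above can be sketched concretely. This is a minimal CSR matvec in Python; the array names `row_ptr`, `col_idx`, and `vals` are the conventional three-array CSR layout, not names from the talk.

```python
# Minimal sketch of CSR sparse matrix-vector multiplication (y = A*x).
# row_ptr[i]..row_ptr[i+1] delimits the nonzeros of row i in col_idx/vals.

def csr_matvec(row_ptr, col_idx, vals, x):
    """Multiply a CSR-format sparse matrix by a dense vector x."""
    n = len(row_ptr) - 1
    y = [0.0] * n
    for i in range(n):
        # Only the stored nonzeros of row i are touched.
        for p in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += vals[p] * x[col_idx[p]]
    return y

# 3x3 example: A = [[2, 0, 1], [0, 3, 0], [4, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
vals    = [2.0, 1.0, 3.0, 4.0, 5.0]
print(csr_matvec(row_ptr, col_idx, vals, [1.0, 1.0, 1.0]))  # [3.0, 3.0, 9.0]
```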
CSR-k: k is a level, e.g., 3

NUMA Multicores and Multilevel Sparse Structures
Multicore NUMA

[Figure: NUMA multicore hierarchy -- main memory; two sockets, each with a shared L3 cache, L2 caches, private L1 caches, and cores 0-3]
- Distance from processor to cache/memory varies
- Non-uniform memory access; latencies vary 10x-100x
- Increasing thread-level concurrency
Multilevel Substructuring of Data

[Figure: nested substructures of the matrix/graph at Levels 1, 2, and 3]

- Spatial locality in matrix or graph
- Temporal locality in vector (reuse)
Parallel Sparse Matrix Vector Multiplication:
An Example
[Figure: y = A*x with nonzeros marked; overlapping column accesses of A show read reuse in x]

- Row-wise calculations have no dependencies
- Elements of A and y have no reuse
- Read reuse in x can be substantial
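Because row-wise calculations are independent, rows can be assigned to threads with no synchronization on y. A sketch using Python threads; the fixed-size row blocks here merely stand in for the super-row partitioning discussed later, and all names are illustrative.

```python
# Row-parallel CSR matvec: each worker owns a disjoint block of rows,
# so writes to y never conflict and no locking is needed.
from concurrent.futures import ThreadPoolExecutor

def parallel_csr_matvec(row_ptr, col_idx, vals, x, nthreads=4):
    n = len(row_ptr) - 1
    y = [0.0] * n

    def rows(lo, hi):
        for i in range(lo, hi):
            s = 0.0
            for p in range(row_ptr[i], row_ptr[i + 1]):
                s += vals[p] * x[col_idx[p]]
            y[i] = s          # row i is owned by exactly one worker

    chunk = (n + nthreads - 1) // nthreads
    with ThreadPoolExecutor(max_workers=nthreads) as pool:
        for lo in range(0, n, chunk):
            pool.submit(rows, lo, min(lo + chunk, n))
    return y                  # the pool joins all workers on exit
```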
Parallel Sparse Triangular Solution:
Another Example
[Figure: lower triangular system A*x = b, nonzeros l_11 through l_99 marked; solving for x1 ... x9]

- Lots of dependencies (purely sequential as shown)
- Elements of A and b have no reuse
- Read reuse in x can be substantial
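The dependency chain can be seen in plain forward substitution: x[i] needs every earlier x[j] with a nonzero l_ij, which is why the solve is sequential as shown. A sketch over a CSR-stored lower triangle; array names are illustrative.

```python
# Forward substitution for a sparse lower triangular system L*x = b.
# Each x[i] depends on already-solved x[j] for the off-diagonal nonzeros.

def sparse_lower_solve(row_ptr, col_idx, vals, b):
    n = len(b)
    x = [0.0] * n
    for i in range(n):
        s = b[i]
        diag = 1.0
        for p in range(row_ptr[i], row_ptr[i + 1]):
            j = col_idx[p]
            if j == i:
                diag = vals[p]        # diagonal entry l_ii
            else:
                s -= vals[p] * x[j]   # dependency on earlier row j
        x[i] = s / diag
    return x

# 2x2 example: L = [[2, 0], [1, 4]], b = [2, 5] -> x = [1, 1]
print(sparse_lower_solve([0, 1, 3], [0, 0, 1], [2.0, 1.0, 4.0], [2.0, 5.0]))
```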
Speeding-up Sparse Computations on Multicores:
What techniques are known to work?
- Ordering of the matrix affects data locality & reuse in x
  - Profile-reducing orderings (e.g., RCM) are generally good for matvec
  - Level-set and coloring orderings are generally good for tri-solve
- Utilizing dense sub-blocks to reduce loading of nonzero subscripts can help
  - Tradeoffs between # of loads & # of operations
  - Dense blocks can be created artificially by adding fill, or naturally occurring dense blocks can be exploited
- Loop unrolling and register blocking can further speed up calculations
- Contributions by many: Catalyurek, Demmel, ..., Gropp, ..., Schreiber, Ucar, ..., Yelick
CSR-K: A multilevel form of CSR
Example: for K=3, symmetric A
- Start with A1 = A and G1 = graph of A1
- Coarsen G1 to get G2 (with super-rows); order G2
- Coarsen G2 to get G3 (with super-super-rows); order G3
- Expand G3 to G2; refine ordering in each super-super-row
- Expand G2 to G1; refine ordering in each super-row

Motivation: To get uniform-length tasks at any desired granularity, with locality- and parallelism-enhancing orderings
CSR-K + Scheduling: Sparse Mat-Vec & Tri-Solve

Mat-Vec
- Coarsen: heavy edge matching or consecutive rows of a band ordering
- Ordering of G2, G3: a weighted form of band ordering

Tri-Solve
- Coarsen: same as above
- Ordering of G2, G3: coloring (serialization is removed)

Scheduling goal: utilize temporal locality in the vector and balance load
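One coarsening pass by heavy-edge matching, as used above to form super-rows, can be sketched as follows: visit each unmatched vertex and pair it with its unmatched neighbor of largest edge weight. The graph representation and all names here are illustrative, not the paper's implementation.

```python
# One pass of heavy-edge matching: matched pairs merge into one super-row.

def heavy_edge_matching(adj):
    """adj: {vertex: {neighbor: weight}}. Returns vertex -> super-row id."""
    match = {}
    super_id = {}
    next_id = 0
    for u in adj:
        if u in match:
            continue
        # Candidate neighbors that are still unmatched, keyed by edge weight.
        cands = [(w, v) for v, w in adj[u].items() if v not in match]
        if cands:
            _, v = max(cands)                     # heaviest incident edge
            match[u], match[v] = v, u
            super_id[u] = super_id[v] = next_id   # u and v merge
        else:
            match[u] = u                          # no free neighbor: stay alone
            super_id[u] = next_id
        next_id += 1
    return super_id
```

Repeating this pass on the coarsened graph yields the super-super-rows of the next level.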
CSR-K Computations

    in parallel for each super-super-row l = 1 ... r
        for each super-row j in super-super-row l
            for each row i in super-row j
                for each nonzero in row i
                    load subscript k, A[i][k], x(k)
                    ...
                end for
            ...
Tunable Degree of Parallelism: # Tasks vs Task Granularity

[Figure: # of tasks (0 to 8,000) vs. task size in thousands of operations (8 to 15,725), for CSR-3 tri-solve and CSR-2 mat-vec]
CSR-K: How does it perform?
- CSR-2 Sparse Mat-Vec (to appear, HiPC 2014)
- CSR-3 Tri-Solve
Sparse Matrix Suite for Tests

Matrix                  # of Rows    # of Nonzeroes   Row Density
G1: ldoor                 952,203        42,493,817      44.63
D1: rgg_n_2_21_s0       2,097,152        31,073,142      14.82
S1: nlpkkt160           8,345,600       225,422,112      27.01
D2: delaunay_n23        8,388,608        58,720,176       7.00
D3: road_central       14,081,816        47,948,642       3.41
D4: hugetrace-20       16,002,413        64,000,039       4.00
D5: delaunay_n24       16,777,216       117,440,418       7.00
D6: hugebubbles-0      18,318,143        73,258,305       4.00
D7: hugebubbles-10     19,458,087        77,817,615       4.00
D8: hugebubbles-20     21,198,119        84,778,477       4.00
D9: road_usa           23,947,347        81,655,971       3.41
D10: europe_osm        50,912,018       159,021,338       3.12
CSR-2: NUMA-Awareness on Westmere
- L1 access: 4 cycles (private)
- L2 access: 10 cycles (private)
- L3 access: 38-170 cycles (shared)
- Memory access: 175-290 cycles (shared)

[Figure: performance of MKL, pOSKI, and CSR-2 on the test matrices, 32 cores of Intel Westmere]
pOSKI vs CSR-2
- Number of operations on q processors: F* = 2 x nonzeroes / q
- Number of loads on q processors: L* = 3 x nonzeroes / q + n/q
- Load factor: measured loads of method / L*
- Fill factor: measured operations of method / F*
- pOSKI: Load factor < 1; Fill factor > 1 because of fill-in
- CSR-2: Load factor and Fill factor ~ 1
- Consider Load factor x Fill factor

Loads-to-ops tradeoff: values closer to 1 are preferred
- pOSKI succeeds in managing the tradeoff for denser sparse matrices
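The ideal counts above make the combined metric easy to evaluate. A small sketch computing F*, L*, and the load-factor x fill-factor product from the slide's formulas; the measured inputs in `tradeoff` are hypothetical placeholders.

```python
# Ideal per-processor counts from the slide:
# F* = 2*nnz/q flops; L* = 3*nnz/q + n/q loads
# (value, subscript, and x(k) per nonzero, plus one access per row).

def ideal_counts(n, nnz, q):
    f_star = 2 * nnz / q
    l_star = 3 * nnz / q + n / q
    return f_star, l_star

def tradeoff(measured_loads, measured_ops, n, nnz, q):
    """Load factor x fill factor; values near 1 are preferred."""
    f_star, l_star = ideal_counts(n, nnz, q)
    return (measured_loads / l_star) * (measured_ops / f_star)
```

A method that performs exactly the ideal loads and operations scores 1.0; pOSKI trades extra operations (fill factor > 1) for fewer loads (load factor < 1).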
CSR-2 NUMA-Awareness: accesses to x are mostly in cache
- Observed x accesses for D10: europe_osm

[Figure: x accesses vs. task size in number of rows]
Tri-Solve: Performance of Level-Set and CSR-3, 32 cores of Intel Westmere

Tri-Solve: Speed-up relative to Level-Set
- CSR-3 is 20x faster than level-set

[Figures: serial steps in level-set relative to CSR-3; average number of parallel tasks in CSR-3 relative to level-set]
CSR-K: Scheduling models and open problems
Model 1: CSR-2 & List Scheduling with q cores
- n uniform-length tasks (n can be chosen within certain constraints)
- An edge between t_i and t_(i+1), i = 1, ..., n-1, indicates reuse in x (data affinity/sharing from temporal locality); unit edge weights

t_1 -- t_2 -- t_3 -- t_4 -- ... -- t_n

- Assign available cores 1, ..., q to tasks t_1, ..., t_q
- At some point tasks t_1, ..., t_k are completed
- If core j becomes free, assign it to the next task t_(k+1)
- This ensures sharing of x in L3 between t_k and t_(k+1)
- Explains performance results on Westmere
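The scheduling rule above can be simulated in a few lines: a freed core always takes the next task in the chain, so consecutive tasks (which share entries of x) execute close together in time. A sketch with uniform unit-time tasks; all names are illustrative.

```python
# List scheduling of a chain of n uniform tasks on q cores:
# the earliest-free core is assigned the next task in chain order.
import heapq

def list_schedule_chain(n, q, task_time=1.0):
    """Returns a list of (start_time, core) per task, in chain order."""
    free_at = [(0.0, core) for core in range(q)]   # min-heap of (free time, core)
    heapq.heapify(free_at)
    schedule = []
    for _ in range(n):
        t, core = heapq.heappop(free_at)           # earliest-free core
        schedule.append((t, core))
        heapq.heappush(free_at, (t + task_time, core))
    return schedule
```

With q = 2 and four tasks, t_1/t_2 start at time 0 and t_3/t_4 at time 1, so adjacent tasks overlap in time and their shared portion of x can still be resident in L3.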
Summary and Open Problems
- CSR-K allows efficient parallel implementations of sparse matrix-vector multiplication and triangular solution
- Extracts large levels of parallelism with multi-level spatial and temporal locality of data accesses
- The triangular solution can be studied further through a bipartite graph model for NUMA-aware scheduling
- Allows multiple task-graph models to determine lower bounds on data movement and optimal schedules