CSR-K + Locality-Aware Scheduling = Faster Sparse Computations on Multicores?

Padma Raghavan
Computer Science and Engineering, Institute for Cyberscience
The Pennsylvania State University

Collaborators: Humayun Kabir and Joshua Booth; Yves Robert and Anne Benoit (scheduling)
Research supported by NSF

True or False?
- Sparse computations are “old school” numerical computing.
- Sparse computations have no role in our “big data” world.
- Sparse computations are inherently sequential.
- Optimizations for any performance measure are “one-off”: are there any useful abstractions?

True or False? The answers:
- Sparse computations are “old school” numerical computing. Yes: try doing without them!
- Sparse computations have no role in our “big data” world. No: big data are sparse!
- Sparse computations are inherently sequential. Looks are deceptive.
- Optimizations for any performance measure are “one-off”: are there any useful abstractions? There is one, and it begins with the word “Multi”.

Goal: match the algorithm to hardware features.
Challenge: the match depends on data known only at run-time.
Solution: CSR-K + scheduling, a multi-level approach for multi-core.

Compressed Sparse Row (CSR)
- Store and operate only on nonzeroes.
- CSR format: store column subscripts and nonzero values.
- Example: sparse matrix-vector multiplication (a C sketch appears at the end of this overview):

    y(i) = 0
    for each nonzero in row i
      load subscript k, A[i][k], x(k)
      y(i) = y(i) + A[i][k] * x(k)
    end for

- CSR-k: k is a level, e.g., 3.

NUMA Multicores and Multilevel Sparse Structures
[Diagram: a two-socket NUMA multicore. Each socket has Cores 0-3 with private L1 caches, L2 caches, and a socket-wide shared L3 cache over NUMA memory.]
- Distance from processor to cache varies.
- Non-uniform memory access; latencies vary 10x-100x.
- Increasing thread-level concurrency.

Multilevel Substructuring of Data
[Diagram: data substructured at Levels 1, 2, and 3.]
- Spatial locality in the matrix or graph.
- Temporal locality in the vector (reuse).

Parallel Sparse Matrix-Vector Multiplication: An Example
[Diagram: y = A*x for a small sparse matrix A, annotated with reuse counts for entries of x.]
- Row-wise calculations have no dependencies.
- Elements of A and y have no reuse.
- Read reuse in x can be substantial.

Parallel Sparse Triangular Solution: Another Example
[Diagram: a 9x9 sparse lower triangular system Ax = b with entries l11, l21, l22, ..., l99.]
- Lots of dependencies (purely sequential as shown).
- Elements of A and b have no reuse.
- Read reuse in x can be substantial.

Speeding Up Sparse Computations on Multicores: What techniques are known to work?
- The ordering of the matrix affects data locality and reuse in x.
  - Profile-reducing orderings (e.g., RCM) are generally good for mat-vec.
  - Level sets and coloring are generally good for tri-solve.
- Utilizing dense sub-blocks to reduce loading of nonzero subscripts can help.
  - There are tradeoffs between the # of loads and the # of operations.
  - Dense blocks can be created artificially by adding fill, or naturally occurring dense blocks can be exploited.
- Loop unrolling and register blocking can further speed up calculations.
- Contributions by many: Catalyurek, Demmel, ..., Gropp, ..., Schreiber, Ucar, ..., Yelick.
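To make the CSR mat-vec loop above concrete, here is a minimal C sketch. The array names (row_ptr, col_idx, val) are the conventional CSR arrays, assumed here for illustration rather than taken from the talk.

    /* Minimal CSR mat-vec sketch: y = A*x.
       Standard CSR layout assumed: row_ptr has n+1 entries, and the
       nonzeroes of row i occupy positions row_ptr[i] .. row_ptr[i+1]-1
       of col_idx (column subscripts) and val (nonzero values). */
    void csr_matvec(long n, const long *row_ptr, const long *col_idx,
                    const double *val, const double *x, double *y)
    {
        for (long i = 0; i < n; i++) {
            double sum = 0.0;
            for (long k = row_ptr[i]; k < row_ptr[i + 1]; k++)
                sum += val[k] * x[col_idx[k]];  /* subscript, A value, x value */
            y[i] = sum;
        }
    }

Note the access pattern the slides emphasize: the values and subscripts of A stream through once with no reuse, while x[col_idx[k]] may be reused heavily when the ordering clusters subscripts, which is what CSR-K's orderings aim to exploit.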
CSR-K: A Multilevel Form of CSR
Example, for K=3 and symmetric A:
- Start with A1 = A and G1 = graph of A1.
- Coarsen G1 to get G2 (with super-rows); order G2.
- Coarsen G2 to get G3 (with super-super-rows); order G3.
- Expand G3 to G2; refine the ordering within each super-super-row.
- Expand G2 to G1; refine the ordering within each super-row.
Motivation: to get uniform-length tasks at any desired granularity, with locality- and parallelism-enhancing orderings.

CSR-K + Scheduling: Sparse Mat-Vec and Tri-Solve
Mat-Vec (CSR-2)
- Coarsening: heavy-edge matching, or consecutive rows of a band ordering.
- Ordering of G2, G3: a weighted form of band ordering.
Tri-Solve (CSR-3)
- Coarsening: same as above.
- Ordering of G2, G3: coloring (serialization is removed).
Scheduling: exploit temporal locality in the vector and balance the load.

CSR-K Computations
A sketch of the traversal (the inner update shown is the mat-vec one; tri-solve is analogous):

    in parallel, for each super-super-row l = 1, ..., r
      for each super-row j in super-super-row l
        for each row i in super-row j
          for each nonzero in row i
            load subscript k, A[i][k], x(k)
            y(i) = y(i) + A[i][k] * x(k)
          end for

A C sketch of this traversal follows.
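Below is a minimal C rendering of the traversal above for mat-vec. The super-structure arrays (ssrow_ptr, srow_ptr) are an assumed encoding, not the authors' data structures: after CSR-K reordering, the rows of a super-row and the super-rows of a super-super-row are contiguous, so each extra level needs only one extra pointer array.

    /* Hypothetical CSR-3-style mat-vec traversal (a sketch, not the
       authors' code). Assumed encoding after reordering:
         ssrow_ptr[l] .. ssrow_ptr[l+1]-1 : super-rows of super-super-row l
         srow_ptr[j]  .. srow_ptr[j+1]-1  : rows of super-row j
       Each super-super-row is one schedulable task whose rows share
       entries of x, giving temporal locality in cache. */
    void csrk_matvec(long n_ssrows, const long *ssrow_ptr, const long *srow_ptr,
                     const long *row_ptr, const long *col_idx,
                     const double *val, const double *x, double *y)
    {
        #pragma omp parallel for schedule(dynamic)  /* one task per super-super-row */
        for (long l = 0; l < n_ssrows; l++)
            for (long j = ssrow_ptr[l]; j < ssrow_ptr[l + 1]; j++)      /* super-rows */
                for (long i = srow_ptr[j]; i < srow_ptr[j + 1]; i++) {  /* rows */
                    double sum = 0.0;
                    for (long k = row_ptr[i]; k < row_ptr[i + 1]; k++)
                        sum += val[k] * x[col_idx[k]];
                    y[i] = sum;
                }
    }

The schedule(dynamic) clause hands the next super-super-row to whichever core frees first, which matches the list scheduling policy analyzed in Model 1 near the end of this section.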
Tunable Degree of Parallelism: # of Tasks vs. Task Granularity
[Plot: # of tasks (0 to 8,000) against task size in thousands of operations (8 to 15,725); series: CSR-3 tri-solve and CSR-2 mat-vec.]

CSR-K: How Does It Perform?
- CSR-2 sparse mat-vec (to appear, HiPC 2014)
- CSR-3 tri-solve

Sparse Matrix Suite for Tests

  Matrix                # of Rows    # of Nonzeroes    Row Density
  G1:  ldoor              952,203      42,493,817         44.63
  D1:  rgg_n_2_21_s0    2,097,152      31,073,142         14.82
  S1:  nlpkkt160        8,345,600     225,422,112         27.01
  D2:  delaunay_n23     8,388,608      58,720,176          7.00
  D3:  road_central    14,081,816      47,948,642          3.41
  D4:  hugetrace-20    16,002,413      64,000,039          4.00
  D5:  delaunay_n24    16,777,216     117,440,418          7.00
  D6:  hugebubbles-0   18,318,143      73,258,305          4.00
  D7:  hugebubbles-10  19,458,087      77,817,615          4.00
  D8:  hugebubbles-20  21,198,119      84,778,477          4.00
  D9:  road_usa        23,947,347      81,655,971          3.41
  D10: europe_osm      50,912,018     159,021,338          3.12

CSR-2: NUMA-Awareness on Westmere
- L1 access: 4 cycles (private)
- L2 access: 10 cycles (private)
- L3 access: 38-170 cycles (shared)
- Memory access: 175-290 cycles (shared)

[Chart: performance of MKL, pOSKI, and CSR-2 on the test matrices, 32 cores of Intel Westmere.]

pOSKI vs. CSR-2
- Number of operations on q processors: F* = 2 x nonzeroes / q.
- Number of loads on q processors: L* = 3 x nonzeroes / q + n / q.
- Load factor: measured loads of the method / L*. Fill factor: measured operations of the method / F*.
- pOSKI: load factor < 1; fill factor > 1 because of fill-in.
- CSR-2: load factor and fill factor are both ~1.
- Consider load factor x fill factor as the loads-to-ops tradeoff; values closer to 1 are preferred.
- pOSKI succeeds in managing this tradeoff for the denser sparse matrices.
- CSR-2 NUMA-awareness: accesses to x are mostly in cache.

[Chart: observed x accesses for D10 (europe_osm) against task size in number of rows.]

Tri-Solve: Performance of Level-Set and CSR-3, 32 Cores of Intel Westmere
[Charts: tri-solve speed-up of CSR-3 relative to level-set; serial steps in level-set relative to CSR-3; average number of parallel tasks in CSR-3 relative to level-set.]
- CSR-3 is 20x faster than level-set.

CSR-K: Scheduling Models and Open Problems

Model 1 of CSR-2 and List Scheduling with q Cores
- n uniform-length tasks (n can be chosen within certain constraints).
- An edge between t_i and t_(i+1), i = 1, ..., n-1, indicates reuse in x (data affinity/sharing from temporal locality); unit edge weights.

    t_1 -- t_2 -- t_3 -- t_4 -- ... -- t_n

- Assign the q available cores to tasks t_1, ..., t_q.
- At some point tasks t_1, ..., t_k are completed; when core j becomes free, assign it to the next task t_(k+1).
- This ensures sharing of x in L3 between t_k and t_(k+1).
- The model explains the performance results on Westmere (a C sketch of the scheduler appears after the summary).

Summary and Open Problems
- CSR-K allows efficient parallel implementations of sparse matrix-vector multiplication and sparse triangular solution.
- It extracts large degrees of parallelism together with multilevel spatial and temporal locality of data accesses.
- The triangular solution can be studied further through a bipartite graph model for NUMA-aware scheduling.
- CSR-K admits multiple task-graph models for determining lower bounds on data movement and optimal schedules.
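As an illustration of Model 1 above, here is a minimal C sketch of list scheduling a chain of uniform tasks. The worker/task_fn interface is hypothetical, chosen for illustration; only the policy (the first free core claims the next task in chain order) is from the talk.

    #include <stdatomic.h>

    /* List scheduling a chain of n uniform tasks t_0, ..., t_{n-1} in
       which consecutive tasks share reads of x. Whichever core frees
       first claims the next task in order, so t_k and t_(k+1) run close
       together in time and their shared slice of x is likely still
       resident in the shared L3. */
    typedef void (*task_fn)(long task_id, void *arg);

    static atomic_long next_task;   /* index of the next unclaimed task */

    /* Each of the q worker threads runs this loop until the chain is done. */
    void worker(long n_tasks, task_fn run, void *arg)
    {
        for (;;) {
            long t = atomic_fetch_add(&next_task, 1);  /* claim the next task */
            if (t >= n_tasks)
                break;              /* all tasks claimed; this core idles */
            run(t, arg);
        }
    }

This is the same policy that schedule(dynamic) provides in the earlier OpenMP sketch; writing it out makes the L3-sharing argument explicit: with uniform task lengths, the core finishing t_k starts t_(k+1) immediately, and the reads of x in t_(k+1) overlap those of t_k.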