Optimizing Sparse Matrix Vector Multiplication on Emerging Multicores

Orhan Kislal, Wei Ding, Mahmut Kandemir
The Pennsylvania State University, University Park, Pennsylvania, USA
{omk103, wzd109, kandemir}@cse.psu.edu

Ilteris Demirkiran
Embry-Riddle Aeronautical University, San Diego, California, USA
demir4a4@erau.edu

Slide 2 - Introduction
- Importance of Sparse Matrix-Vector Multiplication (SpMV): a dominant component in solving eigenvalue problems and large-scale linear systems.
- Differences from uniform/regular dense matrix computations:
  - irregular data access patterns
  - compact data structures

Slide 3 - Background
- SpMV usually takes the form b = Ax + b, where A is a sparse matrix and x and b are dense vectors.
  - x: source vector
  - b: destination vector
- Only x and b can be reused.
- One of the most common data structures for A: the Compressed Sparse Row (CSR) format.

Slide 4 - Background (cont'd)
- CSR format: each row of A is packed one after the other in a dense array val; an integer array col stores the column index of each stored element; ptr keeps track of where each row starts in val and col.

    // Basic SpMV implementation,
    // b = A*x + b, where A is in CSR format and has m rows
    for (int i = 0; i < m; ++i) {
        double d = b[i];                  // accumulate row i into a scalar
        for (int k = ptr[i]; k < ptr[i+1]; ++k)
            d += val[k] * x[col[k]];      // val/col hold row i's nonzeros
        b[i] = d;
    }

Slide 5 - Motivation
- Computation mapping and scheduling.
- [Figure: two on-chip cache hierarchies, (a) and (b), each with six cores C0-C5, private L1 caches, shared L2 caches, and a shared L3.]
- Mapping assigns the computation that involves one or more rows of A (a "computation block") to a core.
- Scheduling determines the execution order of those computations.
- Question: how can the on-chip cache hierarchy be taken into account to improve data locality?

Slide 6 - Motivation (cont'd)
- If two computations share data, it is better to map them to cores that share a cache at some layer (more frequent sharing -> higher layer). Mapping!
- For those two computations, it is also better to have the shared data accessed by the two cores in close temporal proximity. Scheduling!

Slide 7 - Motivation (cont'd)
- Data reordering: the source vector x is read-only.
- Ideally, x can have a customized layout for each row computation r·x, i.e., the elements of x that correspond to the nonzero elements of row r are placed contiguously in memory (reducing the cache footprint).
- However, can we have a smarter scheme?

Slide 8 - Framework
- Input: the original SpMV code and a description of the cache hierarchy.
- Three cache hierarchy-aware components: Mapping, Scheduling, and Data Reordering.
- Output: the optimized SpMV code.
- Data Reordering seeks the minimal number of layouts for x that keep the cache footprint during computation as small as possible.

Slide 9 - Mapping
- Considers only the data sharing among the cores.
- Basic idea: the more data two computation blocks share, the higher the cache layer at which the cores running them should share a cache.
- We quantify the data sharing between two computation blocks as the sum, over all columns, of the number of nonzero elements the two blocks have in the same column (a code sketch follows the next slide).

Slide 10 - Mapping (cont'd)
- Constructing the reuse graph.
- [Figure: a reuse graph over computation blocks v1-v12 with weighted edges.]
- Vertex: a computation block.
- Weight on an edge: the amount of data sharing between the two computation blocks.
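The sharing metric above can be made concrete with a short sketch. This is our illustration rather than code from the paper; it assumes each computation block is a contiguous row range [lo, hi) of the CSR matrix, and it reads "nonzero elements at the same column" as: for every column touched by both blocks, count the nonzeros both blocks have in it.

    #include <stdlib.h>

    /* Weight of the reuse-graph edge between two computation blocks.
     * ptr/col are the CSR arrays; n_cols is the column count of A.
     * Interpretation (our assumption): a column touched by both blocks
     * implies reuse of x[c], so its nonzeros from both blocks count. */
    static long block_sharing(const int *ptr, const int *col, int n_cols,
                              int a_lo, int a_hi, int b_lo, int b_hi)
    {
        int *cnt_a = calloc(n_cols, sizeof(int));
        int *cnt_b = calloc(n_cols, sizeof(int));
        long weight = 0;

        for (int i = a_lo; i < a_hi; ++i)           /* nonzeros of block A */
            for (int k = ptr[i]; k < ptr[i + 1]; ++k)
                cnt_a[col[k]]++;
        for (int i = b_lo; i < b_hi; ++i)           /* nonzeros of block B */
            for (int k = ptr[i]; k < ptr[i + 1]; ++k)
                cnt_b[col[k]]++;

        for (int c = 0; c < n_cols; ++c)            /* columns shared by both */
            if (cnt_a[c] > 0 && cnt_b[c] > 0)
                weight += cnt_a[c] + cnt_b[c];

        free(cnt_a);
        free(cnt_b);
        return weight;
    }

Computing this weight for every pair of computation blocks yields the edge weights of the reuse graph shown on the previous slide.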
Slide 11 - Mapping: Algorithm
1. SORT: edges are sorted by their weights in decreasing order.
2. PARTITION: vertices are visited in that edge order, and the reuse graph is hierarchically partitioned; the number of partitioning levels equals the number of cache levels.
3. LOOP: repeat Step 2 until the partition for the last-level cache (LLC) is reached. Each partition is then assigned to a set of cores based on the cache hierarchy.

Slide 12 - Mapping: Example
[Figure: two mappings, (a) and (b), of computation blocks with sharing weights 1, 2, 5, and 10 onto four cores C0-C3 under an L1/L2/L3 hierarchy; the cache hierarchy-aware mapping places heavily sharing blocks on cores that share a cache.]

Slide 13 - Scheduling
- Specifies the order in which each row block is executed.
- Goal: ensure that the data sharing among computation blocks is caught at the expected cache level.

Slide 14 - Scheduling (cont'd)
1. SORT: same as in the mapping component.
2. INITIAL: assign a logical time slot to the two nodes vl and vr joined by the highest-weight edge, and set up an offset o(v) for each vertex v: o(vl) = +1, o(vr) = -1.
- Purpose of the offset: ensure that nodes with high data sharing that are mapped to the same core are scheduled as close together in time as possible.

Slide 15 - Scheduling (cont'd)
3. SCHEDULE: for the next edge (vx, vy):
   - Case 1: vx and vy are mapped to different cores. Assign them the same time slot or, if that is not possible, slots that minimize |T(vx) - T(vy)|.
   - Case 2: vx and vy are mapped to the same core. If vx is already assigned, then vy is assigned T(vx) + o(vx), with o(vy) = o(vx). Otherwise, initialize vx and vy as in Step 2.
4. LOOP: repeat Step 3 until all vertices are scheduled.

(A sketch of this procedure follows the example below.)

Slide 16 - Scheduling: Example
[Figure: (a) a portion of the reuse graph over v1, v2, v3 with edge weights 20 and 19; (b) two candidate schedules for v3 over time slots t0-2, t0-1, t0, t0+1.]
- The first schedule places v3 next to v1; the second places v3 next to v2. Using the offset, our scheme successfully generates the first schedule instead of the second one.
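A minimal sketch of the offset-driven scheduler, under our own simplifications: edges arrive pre-sorted by decreasing weight, Case 1 simply reuses the partner's slot instead of minimizing |T(vx) - T(vy)| across collisions, and later connected components are re-seeded at slot 0 without collision checks. The Edge type, the slot/offset arrays, and the UNSET sentinel are illustrative names, not the paper's.

    #define UNSET (-1000000)   /* "no time slot assigned yet" */

    typedef struct { int u, v, weight; } Edge;

    /* core[v]: core that vertex v was mapped to (from the mapping step).
     * slot[v]: logical time slot T(v); offset[v]: o(v). The caller is
     * expected to initialize slot[] to UNSET and offset[] to 0. */
    void schedule(const Edge *edges, int n_edges,
                  const int *core, int *slot, int *offset)
    {
        for (int e = 0; e < n_edges; ++e) {
            int x = edges[e].u, y = edges[e].v;
            if (slot[x] == UNSET && slot[y] == UNSET) {
                /* INITIAL: seed both endpoints of the heaviest edge. */
                slot[x] = 0;  offset[x] = +1;
                slot[y] = 0;  offset[y] = -1;
            } else {
                if (slot[x] == UNSET) { int t = x; x = y; y = t; }
                if (slot[y] != UNSET) continue;        /* both scheduled */
                if (core[x] != core[y]) {
                    slot[y] = slot[x];                 /* Case 1 */
                } else {
                    slot[y]   = slot[x] + offset[x];   /* Case 2 */
                    offset[y] = offset[x];
                }
            }
        }
    }

Following the offset in Case 2 is what keeps a chain of heavily sharing blocks on one core packed into adjacent time slots, reproducing the "v3 next to v1" schedule from the example.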
Slide 17 - Data Reordering
[Figure: three candidate layouts of x. Original layout: x1 x2 x3 ... x10 x11 x12, with the intervening memory blocks storing x4-x9. Layout 1 packs x1, x2, x12 contiguously (the remaining blocks store x3-x11). Layout 2 packs x3, x11, x12 contiguously (the remaining blocks store x1, x2 and x4-x9).]
- Goal: find a customized data layout of x for each set of rows (or row blocks) such that the cache footprint generated by the computation of those rows is minimized.

Slide 18 - Data Reordering (cont'd)
[Figure: (a) combining two non-overlapping layouts, Layout 1 (x1 x2 x3 ...) and Layout 2 (x4 x5 x6 ...), into the combined layout x1 x2 x3 x4 x5 x6 ...; (b) the block-level picture for two overlapping rows, with segments of sizes p % b, (ni - p) % b, and b - (ni - p) % b - p % b inside a memory block of size b.]
- Case 1: r1 and r2 have no common nonzero columns. Then x can use the same data layout for both r1·x and r2·x; the two per-row layouts are simply concatenated (see (a)).
- Case 2: otherwise, let the rows share p common nonzero columns, let the memory block size be b, and let the numbers of nonzero elements in r1 and r2 be ni and nj, respectively. The combined layout packs the p shared elements together so that both row computations touch as few memory blocks as possible (see (b) and the sketch below).
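To make Case 2 concrete, here is a hypothetical sketch, not the paper's algorithm: it builds a single combined layout for two overlapping rows by packing the p shared columns first and then each row's private columns, so both row computations read mostly contiguous memory. Column lists are assumed sorted, as in CSR.

    #include <stdio.h>

    /* Append the columns of cols[] that are (want_shared == 1) or are
     * not (want_shared == 0) also present in other[]. Returns new n. */
    static int append_cols(int *out, int n, const int *cols, int len,
                           const int *other, int olen, int want_shared)
    {
        for (int i = 0; i < len; ++i) {
            int shared = 0;
            for (int j = 0; j < olen; ++j)
                if (cols[i] == other[j]) { shared = 1; break; }
            if (shared == want_shared)
                out[n++] = cols[i];
        }
        return n;
    }

    /* Combined layout for rows r1 and r2: shared columns first, then
     * the columns private to each row. Returns the layout's length. */
    int combined_layout(const int *c1, int n1,
                        const int *c2, int n2, int *layout)
    {
        int n = 0;
        n = append_cols(layout, n, c1, n1, c2, n2, 1); /* p shared cols */
        n = append_cols(layout, n, c1, n1, c2, n2, 0); /* only in r1   */
        n = append_cols(layout, n, c2, n2, c1, n1, 0); /* only in r2   */
        return n;
    }

    int main(void)
    {
        int c1[] = {0, 3, 5, 9}, c2[] = {3, 4, 9, 11}, out[8];
        int n = combined_layout(c1, 4, c2, 4, out);
        for (int i = 0; i < n; ++i)
            printf("%d ", out[i]);   /* prints: 3 9 0 5 4 11 */
        printf("\n");
        return 0;
    }

With such a layout the elements shared by both rows occupy the leading blocks, which is the situation the p % b and (ni - p) % b segment sizes in the figure account for.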
Slide 19 - Experiment Setup
The matrices are taken from [1] and represent a wide variety of application domains; they also exhibit diversity regarding structure type, matrix dimension, and number of nonzeros. All the versions are written to use SIMD extensions in the innermost loop whenever possible.

Tables I and II - Important characteristics of the AMD Opteron and Intel Dunnington architectures:

    Feature            AMD Opteron (6100 series)            Intel Dunnington (Xeon E7450)
    Number of cores    48 cores (4 sockets)                 12 cores (2 sockets)
    Clock frequency    2.20 GHz                             2.40 GHz
    L1                 64 KB, 64-byte line size             32 KB, 8-way, 64-byte line size, 3-cycle latency
    L2                 512 KB, 4-way, 64-byte line size     3 MB, 12-way, 64-byte line size, 12-cycle latency
    L3                 12 MB, 16-way, 64-byte line size     12 MB, 16-way, 64-byte line size, 40-cycle latency
    TLB size           1024 4K pages                        -
    Off-chip latency   -                                    about 85 ns
    Address sizes      48 bits physical, 48 bits virtual    40 bits physical, 48 bits virtual

On the Dunnington, each Xeon E7450 socket employs six 32 KB, 8-way private L1 caches, three 3 MB, 12-way L2 caches shared among pairs of cores, and a 12 MB, 16-way L3 cache shared by all cores on the socket; the 1066 MT/s front-side bus provides a theoretical 8.5 GB/s data transfer rate. On a single socket of the Opteron 6174 there are twelve 64 KB private L1 caches and twelve 512 KB private L2 caches, plus a single 12 MB L3 cache shared among all cores of the socket; HyperTransport links connect the CPUs to improve system balance and scalability, providing an I/O bandwidth of 102.4 GB/s, and each processor has an integrated DDR3 memory controller.

Table III - Tested matrices:

    Name              Structure     Dimension   Non-zeros
    caidaRouterLevel  symmetric     192244      1218132
    net4-1            symmetric     88343       2441727
    shallow_water2    square        81920       327680
    ohne2             square        181343      6869939
    lpl1              square        32460       328036
    rma10             unsymmetric   46835       2329092
    kim1              unsymmetric   38415       933195
    bcsstk17          symmetric     10974       428650
    TSC_OPF_300       symmetric     9774        820783
    ins2              symmetric     309412      2751484

Slide 20 - Experiment Setup (cont'd)
For each matrix in Table III, we experimented with four different versions, which differ in how loop iterations are assigned to cores, how the iterations assigned to a core are scheduled, and how the data in x is laid out:
- Default: iterations are distributed across cores in a blocked fashion, and the successive iterations assigned to a core are scheduled in their default (lexicographic) order.
- Mapping: employs the cache hierarchy-aware mapping discussed earlier; once the mapping is formed, the iterations assigned to a core keep their default relative order.
- Mapping+Scheduling: adds the cache hierarchy-aware scheduling strategy as a constraint on top of the mapping.
- Mapping+Scheduling+Layout: additionally applies the data reordering, i.e., customized layouts for x.

Slide 21 - Experimental Results
[Figure: performance improvement (%) over Default on the Intel Dunnington system for Mapping, Mapping+Scheduling, and Mapping+Scheduling+Layout across the tested matrices.]
- Mapping over Default: 8.1%
- Mapping+Scheduling over Mapping: 1.8%
- Mapping+Scheduling+Layout over Mapping+Scheduling: 1.7%

Slide 22 - Experimental Results (cont'd)
[Figure: performance improvement (%) over Default on the 48-core AMD based system for the same three versions; labeled matrices include caidaRouterLevel, net4-1, shallow_water2, and ohne2.]
- Mapping over Default: 9.1%
- Mapping+Scheduling over Default: 11%
- Mapping+Scheduling+Layout over Default: 14%

Slide 23 - Thank you!