Tuning Sparse Matrix-Vector Multiplication for Multicore SMPs
Samuel Williams (1,2), Richard Vuduc (3), Leonid Oliker (1,2), John Shalf (2), Katherine Yelick (1,2), James Demmel (1,2)
(1) University of California, Berkeley  (2) Lawrence Berkeley National Laboratory  (3) Georgia Institute of Technology
samw@cs.berkeley.edu

Overview
- Multicore is the de facto performance solution for the next decade.
- We examine the Sparse Matrix-Vector Multiplication (SpMV) kernel: an important HPC kernel that is memory intensive and challenging for multicore.
- We present two autotuned threaded implementations: a Pthreads, cache-based implementation and a Cell local-store-based implementation.
- We benchmark performance across four diverse multicore architectures: Intel Xeon (Clovertown), AMD Opteron, Sun Niagara2, and the IBM Cell Broadband Engine.
- We compare against a leading MPI implementation (PETSc) combined with an autotuned serial kernel (OSKI).

Sparse Matrix-Vector Multiplication
- Sparse matrix: most entries are 0.0, so there is a performance advantage in storing and operating only on the nonzeros; doing so requires significant metadata.
- We evaluate y = A*x, where A is a sparse matrix and x and y are dense vectors.
- Challenges: difficult to exploit ILP (bad for superscalar) and DLP (bad for SIMD), irregular memory access to the source vector, difficult load balancing, and very low computational intensity (often more than 6 bytes/flop).

Test Suite: Matrices
- Pruned the original SPARSITY suite down to 14 matrices; none should fit in cache; matrix dimensions (rank) range from 2K to 1M.
- Subdivided into four categories: a dense matrix stored in sparse format, well structured matrices (sorted by nonzeros/row), poorly structured ("hodgepodge") matrices, and an extreme-aspect-ratio linear programming matrix.
- Matrices: Dense (2K x 2K), Protein, FEM/Spheres, FEM/Cantilever, Wind Tunnel, FEM/Harbor, QCD, FEM/Ship, Economics, Epidemiology, FEM/Accelerator, Circuit, webbase, LP.
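The naive implementations later in the deck store A in compressed sparse row (CSR) format. As a point of reference, a minimal C sketch of that kernel is shown below (array and function names are illustrative, not the authors' code):

    /* Reference CSR SpMV, y = A*x, double precision.
     * row_ptr has n+1 entries; col_idx/val hold the nonzeros of each row. */
    void spmv_csr(int n, const int *row_ptr, const int *col_idx,
                  const double *val, const double *x, double *y)
    {
        for (int i = 0; i < n; i++) {            /* one dot product per row */
            double sum = 0.0;
            for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
                sum += val[k] * x[col_idx[k]];   /* irregular access into x */
            y[i] = sum;
        }
    }

Each nonzero contributes two flops but at least 12 bytes of matrix traffic (an 8-byte value plus a 4-byte index), which is where the "more than 6 bytes/flop" figure above comes from.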
Multicore SMP Systems
The four platforms, with the figures annotated on the architecture diagrams (caches, peak double-precision flop rates, peak read bandwidth, NUMA):
- Intel Clovertown: 2 sockets x 4 Core2 cores; each pair of cores shares a 4MB L2 (16MB total); two front-side buses (10.6 GB/s each) into a chipset with four 64b fully buffered (FBDIMM) DDR2 channels; 21.3 GB/s read, 10.6 GB/s write. Peak: 75 Gflop/s (w/SIMD), ~21 GB/s read bandwidth.
- AMD Opteron: 2 sockets x 2 cores; 1MB victim L2 per core (4MB total); one integrated DDR2 memory controller per socket (10.6 GB/s each, ~21 GB/s total) linked by HyperTransport at 4 GB/s in each direction. Peak: 17 Gflop/s. NUMA.
- Sun Niagara2: 8 multithreaded (MT) UltraSparc cores (8 hardware threads each, 64 total) with 8KB D$ per core, sharing a 4MB, 16-way L2 through a crossbar switch (90 GB/s write-through, 179 GB/s fill); four 128b FBDIMM memory controllers; 42.7 GB/s read, 21.3 GB/s write (~43 GB/s peak read). Peak: 11 Gflop/s.
- IBM Cell Blade: 2 sockets, each with a PPE (512KB L2) and 8 SPEs (256KB local store each, 4MB aggregate) on an EIB ring network; 25.6 GB/s XDR DRAM per socket (~51 GB/s total), with a BIF link between sockets well under 20 GB/s in each direction. Peak: 29 Gflop/s (w/SIMD). NUMA.

The aggregate caches and local stores (16MB, 4MB, 4MB, 4MB) are large enough that the vectors fit, but the matrices do not; the Opteron and Cell blade are NUMA, so data placement matters.
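Because SpMV is bandwidth bound, the read bandwidths above largely cap achievable performance. A back-of-envelope sketch of that bound, assuming the nominal peak read bandwidths from the list above and the ~6 bytes/flop of unblocked CSR (vector traffic ignored, sustained bandwidth is lower in practice):

    #include <stdio.h>

    int main(void)
    {
        /* unblocked CSR: each nonzero moves ~12 bytes (8B value + 4B index)
         * and performs 2 flops, i.e. ~6 bytes/flop of compulsory matrix traffic */
        double bytes_per_flop = (8.0 + 4.0) / 2.0;

        const char *name[] = { "Clovertown", "Opteron", "Niagara2", "Cell blade" };
        double read_gbs[]  = { 21.3, 21.3, 42.7, 51.2 };   /* nominal peak read GB/s */

        for (int i = 0; i < 4; i++)
            printf("%-10s  bandwidth-bound SpMV <= %.1f GFlop/s\n",
                   name[i], read_gbs[i] / bytes_per_flop);
        return 0;
    }

Compression and blocking (later slides) reduce the bytes/flop and so raise this ceiling; compare the bound against the naive and tuned medians that follow.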
Naive Serial Implementation
- Vanilla C implementation for the cache-based machines; the matrix is stored in CSR (compressed sparse row) format.
- Explored compiler options; only the best is presented here.
- A median performance number across the suite is reported alongside the per-matrix results.
[Charts: naive single-thread performance per matrix on Intel Clovertown, AMD Opteron, and Sun Niagara2; GFlop/s scale 0.0-1.0]

Pthread Implementation
- Optimized for multicore/threading; a variety of shared-memory programming models would be acceptable (not just Pthreads).
- In the stacked bar charts that follow, more colors = more optimizations = more work.

Parallelization
- The matrix is partitioned by rows and balanced by the number of nonzeros; each sub-matrix is stored separately in CSR.
- SPMD-like approach: a barrier() is called before and after the SpMV kernel.
- Load balancing can be challenging.
- The number of threads is explored in powers of 2 (see the paper).
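A minimal sketch of this partitioning and the SPMD worker, assuming one shared CSR matrix and hypothetical type and function names (the actual implementation stores each thread's sub-matrix separately):

    #include <pthread.h>

    typedef struct { int n; int *row_ptr; int *col_idx; double *val; } csr_t;

    typedef struct {
        const csr_t  *A;
        const double *x;
        double       *y;
        int row_start, row_end;          /* this thread's block of rows        */
        pthread_barrier_t *barrier;      /* entered before and after the SpMV  */
    } thread_arg_t;

    /* Choose [row_start, row_end) per thread by walking row_ptr until each
     * thread owns roughly nnz/nthreads nonzeros. */
    void partition_by_nnz(const csr_t *A, int nthreads, int *row_start, int *row_end)
    {
        long nnz = A->row_ptr[A->n];
        int  r   = 0;
        for (int t = 0; t < nthreads; t++) {
            long target = nnz * (t + 1) / nthreads;
            row_start[t] = r;
            while (r < A->n && A->row_ptr[r + 1] <= target)
                r++;
            row_end[t] = r;
        }
        row_end[nthreads - 1] = A->n;    /* make sure every row is covered */
    }

    /* Per-thread SPMD worker: barrier, local rows of y = A*x, barrier. */
    void *spmv_thread(void *p)
    {
        thread_arg_t *a = (thread_arg_t *)p;
        pthread_barrier_wait(a->barrier);
        for (int i = a->row_start; i < a->row_end; i++) {
            double sum = 0.0;
            for (int k = a->A->row_ptr[i]; k < a->A->row_ptr[i + 1]; k++)
                sum += a->A->val[k] * a->x[a->A->col_idx[k]];
            a->y[i] = sum;
        }
        pthread_barrier_wait(a->barrier);
        return NULL;
    }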
Naive Parallel Performance
- Intel Clovertown: 8x cores = 1.9x performance (1.4% of peak flops, 29% of bandwidth).
- AMD Opteron: 4x cores = 1.5x performance (4% of peak flops, 20% of bandwidth).
- Sun Niagara2: 64x threads = 41x performance (25% of peak flops, 39% of bandwidth).
[Charts: naive Pthreads vs. naive single thread per matrix on the three cache-based machines; GFlop/s scale 0.0-3.0]

The Case for Autotuning
- How do we deliver good performance across all these architectures and all these matrices without exhaustively hand-optimizing every combination?
- Autotuning: write a Perl script that generates all possible optimizations (SSE/SIMD intrinsics, prefetching, loop transformations, alternate data structures, etc.), then search the optimizations heuristically or exhaustively.
- Existing SpMV solution: OSKI (developed at UCB). This work adds optimizations geared for multicore and multithreading, a "prototype for parallel OSKI".

Exploiting NUMA and Affinity
- Bandwidth on the Opteron (and Cell) can vary substantially based on the placement of data.
[Diagram: data placement cases on the dual-socket Opteron -- a single thread; multiple threads using one memory controller; multiple threads using both memory controllers]
- Bind each sub-matrix and the thread that processes it together; adjacent blocks are bound to adjacent cores.
- Explored libnuma, Linux, and Solaris routines.
[Charts: +NUMA/Affinity, naive Pthreads, naive single thread; GFlop/s scale 0.0-3.0]

Performance (+Software Prefetching)
- Software prefetching is then added on top of NUMA/affinity.
[Charts: +Software Prefetching, +NUMA/Affinity, naive Pthreads, naive single thread; GFlop/s scale 0.0-3.0]
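A minimal sketch of what a prefetching variant looks like, using the GCC/Clang builtin rather than the platform-specific intrinsics the generator emits; PF_DIST is a hypothetical prefetch distance of the kind the autotuner would search over:

    #define PF_DIST 64   /* prefetch distance in nonzeros; a tuning parameter */

    /* CSR SpMV with explicit software prefetch of upcoming values and indices
     * (prefetching slightly past the end of the arrays is harmless in practice). */
    void spmv_csr_prefetch(int n, const int *row_ptr, const int *col_idx,
                           const double *val, const double *x, double *y)
    {
        for (int i = 0; i < n; i++) {
            double sum = 0.0;
            for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++) {
                __builtin_prefetch(&val[k + PF_DIST], 0, 0);
                __builtin_prefetch(&col_idx[k + PF_DIST], 0, 0);
                sum += val[k] * x[col_idx[k]];
            }
            y[i] = sum;
        }
    }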
Matrix Compression
- For memory-bound kernels, minimizing memory traffic should maximize performance.
- Compress the metadata, and exploit structure to eliminate metadata.
- Heuristic: select the compression that minimizes the matrix size -- power-of-2 register blocking, CSR/COO format, 16b/32b indices, etc. (a register-blocked kernel sketch follows the cache-blocking results below).
- Side effect: the matrix may shrink to the point where it fits entirely in cache.
[Charts: +Matrix Compression added to the optimization stack; GFlop/s scale now 0.0-6.0]

Cache and TLB Blocking
- Accesses to the matrix and destination vector are streaming, but accesses to the source vector can be random.
- Reorganize the matrix (and thus the access pattern) to maximize reuse; the same idea applies to TLB blocking (caching PTEs).
- Heuristic: block the destination, then keep adding columns as long as the number of source-vector cache lines (or pages) touched is less than the cache (or TLB) capacity.
- Apply all previous optimizations individually to each cache block; search over: no blocking, cache blocking, cache & TLB blocking.
- Better locality comes at the expense of confusing the hardware prefetchers.
[Charts: +Cache/TLB Blocking added to the optimization stack; GFlop/s scale 0.0-6.0]
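The sketch promised under Matrix Compression: a 2x2 register-blocked (BCSR) kernel with 16-bit block-column indices, one of the many compressed variants a code generator could emit. The struct and names are illustrative; 16-bit indices are only valid when the (cache-blocked) column range fits in 16 bits, and blocks are padded with explicit zeros.

    #include <stdint.h>

    /* 2x2 block CSR: mb block rows, 4 values per block, one 16-bit
     * block-column index per block; rows/columns padded to even sizes. */
    typedef struct {
        int       mb;         /* number of 2-row block rows            */
        int      *brow_ptr;   /* block-row pointers, length mb+1       */
        uint16_t *bcol_idx;   /* block-column indices (16-bit)         */
        double   *bval;       /* 2x2 blocks, row-major, 4 per block    */
    } bcsr22_t;

    void spmv_bcsr22(const bcsr22_t *A, const double *x, double *y)
    {
        for (int i = 0; i < A->mb; i++) {
            double y0 = 0.0, y1 = 0.0;            /* two destination rows */
            for (int k = A->brow_ptr[i]; k < A->brow_ptr[i+1]; k++) {
                const double *b = &A->bval[4*k];
                int j = 2 * (int)A->bcol_idx[k];  /* first source column  */
                y0 += b[0]*x[j] + b[1]*x[j+1];
                y1 += b[2]*x[j] + b[3]*x[j+1];
            }
            y[2*i]   = y0;
            y[2*i+1] = y1;
        }
    }

Blocking amortizes one index over four values (4 bytes over 32 bytes of values instead of 4 over 8), and halving the index width saves further traffic, at the cost of explicit zero fill.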
Banks, Ranks, and DIMMs
- In this SPMD approach, as the number of threads increases, so too does the number of concurrent streams to memory; most memory controllers have only a finite capability to reorder requests (DMA can avoid or minimize this).
- Addressing/bank conflicts therefore become increasingly likely; adding more DIMMs and configuring the ranks can help.
- The Clovertown system was already fully populated.

Performance (+More DIMMs, Rank Configuration, etc.)
- With the full optimization stack, median efficiency reaches: Intel Clovertown 4% of peak flops and 52% of bandwidth; AMD Opteron 20% of peak flops and 66% of bandwidth; Sun Niagara2 52% of peak flops and 54% of bandwidth.
- Between two and four of the optimizations proved essential on each platform.
[Charts: full optimization stack -- +More DIMMs/rank configuration, +Cache/TLB Blocking, +Matrix Compression, +Software Prefetching, +NUMA/Affinity, naive Pthreads, naive single thread; GFlop/s scale 0.0-6.0]

Cell Implementation
- No vanilla C implementation (aside from the PPE): even SIMDized double precision is extremely weak, and scalar double precision is unbearable.
- The minimum register blocking is 2x1 (SIMDizable), which can increase memory traffic by 66% (a 2x1 block holding a single true nonzero stores 20 bytes -- two values plus one index -- instead of CSR's 12).
- The cache-blocking optimization is transformed into local-store blocking: spatial and temporal locality is captured in software when the matrix is optimized; in essence, the high bits of the column indices are grouped into DMA lists.
- The SPEs have no branch prediction, so branches are replaced with conditional operations.
- In some cases, what were optional optimizations on the cache-based machines are requirements for correctness on Cell.
- Despite the performance, Cell is still handicapped by double precision.
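A plain-C sketch of the DMA-list bookkeeping described above: the "high bits" of each column index name a fixed-size chunk of the source vector, and each distinct chunk becomes one element of a gather list (the real code would issue these as MFC DMA-list transfers into the local store). CHUNK_DOUBLES and all names here are illustrative, not the authors' code.

    #include <stdlib.h>

    #define CHUNK_DOUBLES 256        /* granularity of one gathered chunk of x */

    typedef struct { int x_offset; int n_doubles; } gather_elem_t;

    static int cmp_int(const void *a, const void *b)
    { return *(const int *)a - *(const int *)b; }

    /* For one local-store block of the matrix, find the distinct chunks of the
     * source vector that its column indices touch; each chunk becomes one
     * element of a gather list.  Returns the number of list elements. */
    int build_gather_list(const int *col_idx, int nnz, gather_elem_t *list)
    {
        int *chunk = malloc(nnz * sizeof(int));
        for (int k = 0; k < nnz; k++)
            chunk[k] = col_idx[k] / CHUNK_DOUBLES;    /* "high bits" of the index */
        qsort(chunk, nnz, sizeof(int), cmp_int);

        int n = 0;
        for (int k = 0; k < nnz; k++) {
            if (k == 0 || chunk[k] != chunk[k - 1]) { /* first visit to this chunk */
                list[n].x_offset  = chunk[k] * CHUNK_DOUBLES;
                list[n].n_doubles = CHUNK_DOUBLES;
                n++;
            }
        }
        free(chunk);
        return n;
    }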
Performance (all platforms, fully tuned)
[Charts: fully tuned SpMV per matrix on Intel Clovertown, AMD Opteron, Sun Niagara2, and the IBM Cell Broadband Engine with 1, 8, and 16 SPEs; GFlop/s scale 0.0-12.0]
- The Cell blade achieves 39% of peak flops and 89% of bandwidth.

Multicore MPI Implementation
- MPI is the default approach to programming multicore today.
- Used PETSc with shared-memory MPICH, and OSKI (developed at UCB) to optimize each process -- i.e., a highly optimized MPI baseline.
[Charts: naive single thread vs. MPI (autotuned) vs. Pthreads (autotuned) on Intel Clovertown and AMD Opteron; GFlop/s scale 0.0-4.0]

Median Performance & Efficiency
- Used a digital power meter to measure sustained system power; FBDIMM drives up Clovertown and Niagara2 power.
- Power efficiency is reported as sustained MFlop/s per sustained Watt.
- The default approach (MPI) achieves very low performance and efficiency.
[Charts: median GFlop/s (left, scale to 7.0) and median MFlop/s/Watt (right, scale to 24.0) for Naive, PETSc+OSKI (autotuned MPI), and Pthreads (autotuned) on Intel Clovertown, AMD Opteron, Sun Niagara2, and IBM Cell]

Summary
- Paradoxically, the most complex/advanced architectures required the most tuning yet delivered the lowest performance.
- Most machines achieved less than 50-60% of DRAM bandwidth.
- Niagara2 delivered both very good performance and productivity.
- Cell delivered very good performance and efficiency: 90% of memory bandwidth, high power efficiency, and easily understood performance (extra traffic = lower performance; future work can address this).
- The multicore-specific autotuned implementation significantly outperformed a state-of-the-art MPI implementation (PETSc + OSKI).
- The key multicore-oriented optimizations were matrix compression, NUMA-aware allocation, and prefetching.

Acknowledgments
- UC Berkeley: RADLab cluster (Opterons), PSI cluster (Clovertowns)
- Sun Microsystems: Niagara2
- Forschungszentrum Jülich: Cell blade cluster

Questions?