Lattice Boltzmann Simulation Optimization on Leading Multicore Platforms
Samuel Williams(1,2), Jonathan Carter(2), Leonid Oliker(1,2), John Shalf(2), Katherine Yelick(1,2)
(1) University of California, Berkeley   (2) Lawrence Berkeley National Laboratory
samw@eecs.berkeley.edu

Motivation
- Multicore is the de facto solution for improving peak performance for the next decade.
- How do we ensure this applies to sustained performance as well?
- Processor architectures are extremely diverse, and compilers can rarely fully exploit them.
- We require a HW/SW solution that guarantees performance without completely sacrificing productivity.

Overview
- Examined the Lattice-Boltzmann magnetohydrodynamics (LBMHD) application.
- Present and analyze two threaded & auto-tuned implementations.
- Benchmarked performance across 5 diverse multicore microarchitectures:
  - Intel Xeon (Clovertown)
  - AMD Opteron (rev.F)
  - Sun Niagara2 (Huron)
  - IBM QS20 Cell Blade (PPEs)
  - IBM QS20 Cell Blade (SPEs)
- We show:
  - Auto-tuning can significantly improve application performance.
  - Cell consistently delivers good performance and efficiency.
  - Niagara2 delivers good performance and productivity.

Section: Multicore SMPs Used

Multicore SMP Systems
[Figure: block diagrams of the four evaluated SMPs.
 Intel Xeon (Clovertown): 2 sockets x 4 Core2 cores, 4MB shared L2 per core pair, dual FSBs (10.66 GB/s each) into a chipset with 4x64b memory controllers, 21.3 GB/s read / 10.6 GB/s write to 667MHz FBDIMMs.
 AMD Opteron (rev.F): 2 sockets x 2 cores, 1MB victim cache per core, SRI/crossbar, one 128b memory controller per socket (10.6 GB/s to 667MHz DDR2 DIMMs), sockets coupled by HyperTransport at 4 GB/s each direction.
 Sun Niagara2 (Huron): 8 multithreaded SPARC cores with 8KB L1 each, crossbar switch (90 GB/s write-through, 179 GB/s fill), 4MB shared L2 (16-way, address interleaved across 8x64B banks), 4x128b memory controllers (2 banks each), 42.66 GB/s read / 21.33 GB/s write to 667MHz FBDIMMs.
 IBM QS20 Cell Blade: 2 sockets, each with 1 PPE (512KB L2) and 8 SPEs (256KB local store plus MFC each) on an EIB ring network, 25.6 GB/s XDR memory per socket (512MB XDR DRAM each), sockets coupled by the BIF at <20 GB/s each direction.]

Multicore SMP Systems (memory hierarchy)
[Figure: the same four block diagrams, annotated to highlight each system's cache / local-store hierarchy.]
Multicore SMP Systems (peak flops)
[Figure: the same block diagrams, annotated with peak double-precision rates: Intel Clovertown 75 Gflop/s, AMD Opteron 17 Gflop/s, Sun Niagara2 11 Gflop/s, IBM Cell Blade PPEs 13 Gflop/s and SPEs 29 Gflop/s.]

Multicore SMP Systems (peak DRAM bandwidth)
[Figure: the same block diagrams, annotated with peak DRAM bandwidth: Clovertown 21 GB/s read / 10 GB/s write, Opteron 21 GB/s, Niagara2 42 GB/s read / 21 GB/s write, Cell Blade 51 GB/s.]
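To put the peak rates in context, it helps to compare each machine's ratio of flops to DRAM bandwidth. The arithmetic below is our own back-of-the-envelope calculation from the numbers above (read bandwidth only), not a figure from the original slides:

\[
\text{balance} = \frac{\text{peak Gflop/s}}{\text{peak GB/s (read)}}:\quad
\text{Clovertown} \approx \tfrac{75}{21} \approx 3.5,\quad
\text{Opteron} \approx \tfrac{17}{21} \approx 0.8,\quad
\text{Niagara2} \approx \tfrac{11}{42} \approx 0.26,\quad
\text{Cell SPEs} \approx \tfrac{29}{51} \approx 0.57 \ \text{flops/byte}
\]

LBMHD's flop:byte ratio (introduced below) is roughly 0.7-1.1, so a machine whose balance sits far above that range, such as Clovertown, is likely to be memory bound.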
Section: Auto-tuning

Auto-tuning
- Hand-optimizing each architecture/dataset combination is not feasible.
- Our auto-tuning approach finds a good solution through a combination of heuristics and exhaustive search:
  - A Perl script generates many possible kernels (including SIMD-optimized kernels).
  - An auto-tuning benchmark runs the kernels and reports the best one for the current architecture/dataset/compiler/...
  - Performance depends on the optimizations generated.
  - Heuristics are often desirable when the search space isn't tractable.
- Auto-tuning has proven value in dense linear algebra (ATLAS), spectral methods (FFTW, SPIRAL), and sparse methods (OSKI).
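As a rough illustration of the search half of this approach (not the authors' actual harness), the sketch below times a set of pre-generated kernel variants over candidate tuning parameters and keeps the fastest; the kernel list, the lbmhd_kernel_t type, and the parameter ranges are hypothetical:

```c
#include <stdio.h>
#include <float.h>
#include <sys/time.h>

/* Hypothetical signature for one generated variant of the collision() kernel. */
typedef void (*lbmhd_kernel_t)(int vl, int unroll);

static double now(void) {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + 1e-6 * tv.tv_usec;
}

/* Time each (kernel, vector-length, unrolling) combination and report the best. */
void autotune(lbmhd_kernel_t kernels[], int nkernels) {
    double best = DBL_MAX;
    int best_k = 0, best_vl = 0, best_u = 0;
    for (int k = 0; k < nkernels; k++)
        for (int vl = 8; vl <= 1024; vl *= 2)          /* candidate vector lengths   */
            for (int u = 1; u <= 8; u *= 2) {          /* candidate unrolling depths */
                double t0 = now();
                kernels[k](vl, u);                     /* run one trial time step    */
                double t = now() - t0;
                if (t < best) { best = t; best_k = k; best_vl = vl; best_u = u; }
            }
    printf("best kernel %d: VL=%d unroll=%d (%.3f s)\n", best_k, best_vl, best_u, best);
}
```

In practice the heuristics mentioned above prune this search; an exhaustive sweep over every generated kernel is only tractable for the smaller parameter spaces.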
Section: Introduction to LBMHD

Introduction to Lattice Methods
- Structured-grid codes that advance through a series of time steps.
- Popular in CFD because they allow for complex boundary conditions.
- Overlay a higher-dimensional phase space on the spatial grid: a simplified kinetic model that maintains the macroscopic quantities.
- Distribution functions (e.g. 5-27 velocities per point in space) are used to reconstruct the macroscopic quantities.
- Significant memory capacity requirements.
[Figure: 27-point lattice showing the velocity directions (0-26) around a grid point, with the +X/+Y/+Z axes.]

LBMHD (general characteristics)
- Plasma turbulence simulation that couples CFD with Maxwell's equations.
- Two distributions:
  - momentum distribution (27 scalar velocities)
  - magnetic distribution (15 vector velocities)
- Three macroscopic quantities:
  - density
  - momentum (vector)
  - magnetic field (vector)
[Figure: the macroscopic variables, the 27-velocity momentum distribution, and the 15-velocity magnetic distribution on the lattice.]

LBMHD (flops and bytes)
- Must read 73 doubles and update 79 doubles per point in space (a minimum of 1200 bytes).
- Requires about 1300 floating-point operations per point in space.
- Flop:byte ratio:
  - 0.71 on write-allocate architectures
  - 1.07 ideal
- Rule of thumb for LBMHD: architectures with more flops than bandwidth are likely memory bound (e.g. Clovertown).

LBMHD (implementation details)
- Data structure choices:
  - Array of structures: no spatial locality, strided access.
  - Structure of arrays: a huge number of memory streams per thread, but guarantees spatial locality and unit stride, and vectorizes well.
- Parallelization:
  - The original Fortran version used MPI to communicate between tasks, a bad match for multicore.
  - The version in this work uses pthreads within a node and MPI between nodes; MPI is not used when auto-tuning.
- Two problem sizes: 64^3 (~330MB) and 128^3 (~2.5GB).
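A minimal sketch of the structure-of-arrays layout described above, assuming (as an illustration only, not the authors' actual data structure) that each scalar component of each distribution is stored as its own contiguous array over the grid:

```c
#include <stddef.h>

#define NX 64
#define NY 64
#define NZ 64
#define NPTS ((size_t)NX * NY * NZ)

/* Structure-of-arrays layout: each of the ~150 scalar components is a
 * separate, unit-stride array over the grid, which preserves spatial
 * locality and vectorizes well (at the cost of many memory streams). */
typedef struct {
    double *momentum[27];     /* 27 scalar velocities                      */
    double *magnetic[15][3];  /* 15 vector velocities (x, y, z components) */
    double *density;          /* macroscopic quantities reconstructed      */
    double *mom_vec[3];       /*   from the distributions                  */
    double *mag_vec[3];
} lattice_soa_t;

/* An array-of-structures layout would instead interleave every component
 * of one grid point, so sweeping any single component across the grid
 * becomes a long-stride access pattern with no spatial locality.         */
```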
Stencil for Lattice Methods
- Very different from the canonical heat-equation stencil:
  - there are multiple read and write arrays
  - there is no reuse
[Figure: read_lattice[][] gathered from neighboring points into write_lattice[][].]

Side Note on Performance Graphs
- Threads are mapped first to cores, then to sockets (i.e. multithreading, then multicore, then multisocket).
- Niagara2 always used 8 threads/core.
- We show two problem sizes (64^3 and 128^3).
- We step through performance as optimizations/features are enabled within the auto-tuner; more colors implies more optimizations were necessary.
- This lets us compare architecture performance while keeping programmer effort (productivity) constant.

Section: Performance and Analysis of Pthreads Implementation

Pthread Implementation
- Not naïve: fully unrolled loops, NUMA-aware, 1D parallelization.
- Always used 8 threads per core on Niagara2.
- 1-socket Niagara2 is faster than the 2-socket x86 machines.
- On the x86 machines, performance degrades on the larger problem despite the improved surface-to-volume ratio.
[Figure: GFlop/s vs. thread count for the 64^3 and 128^3 problems. Fractions of peak: Clovertown 4.8% of peak flops / 17% of bandwidth; Opteron 14% / 17%; Niagara2 54% / 14%; Cell PPEs 1% / 0.3%.]

Cache Effects
- We want to maintain a working set of velocities in the L1 cache: ~150 arrays, each trying to keep at least one cache line resident.
- That is impossible within Niagara2's 1KB-per-thread share of the L1, so the working set causes capacity misses.
- On the other architectures, the combination of low-associativity L1 caches (2-way on the Opteron), a large number of arrays, and near-power-of-2 problem sizes can cause large numbers of conflict misses.
- Solution: apply a lattice-(offset-)aware padding heuristic to the velocity arrays to avoid or minimize conflict misses.

Auto-tuned Performance (+Stencil-aware Padding)
- This lattice method is essentially 79 simultaneous 72-point stencils.
- That can cause conflict misses even with highly associative L1 caches (let alone the Opteron's 2-way L1).
- Solution: pad each component so that, when accessed with its corresponding stencil (spatial) offset, the components are distributed uniformly across the cache (see the sketch below).
[Figure: GFlop/s vs. thread count with stencil-aware padding added on top of the naïve+NUMA baseline.]
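A minimal sketch of the kind of skewing such a padding heuristic applies; the constants and the alloc_padded() helper are illustrative assumptions, not the authors' exact heuristic, and the real heuristic also accounts for each component's stencil offset:

```c
#include <stdlib.h>

#define LINE_BYTES 64                           /* cache line size                    */
#define LINE_DBLS  (LINE_BYTES / sizeof(double))
#define NSETS      32                           /* distinct skews to spread across    */

/* One padded component array: 'data' is what the kernel indexes. */
typedef struct { double *base; double *data; } padded_array_t;

/* Allocate component 'c' of the lattice with a component-dependent skew so
 * that the ~150 component arrays do not all map to the same cache sets when
 * the grid dimension is a near power of two.                                 */
padded_array_t alloc_padded(size_t npts, int c) {
    size_t skew = (size_t)(c % NSETS) * LINE_DBLS;   /* skew component c by c lines */
    padded_array_t a;
    a.base = malloc((npts + skew) * sizeof(double));
    a.data = a.base ? a.base + skew : NULL;
    return a;
}
```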
Blocking for the TLB
- Touching 150 different arrays will thrash any TLB with fewer than 128 entries, so we try to maximize TLB page locality.
- Solution: borrow a technique from compilers for vector machines:
  - fuse the spatial loops,
  - strip-mine them into vectors of size VL (the vector length),
  - interchange the spatial and velocity loops.
- This can be generalized by varying:
  - the number of velocities simultaneously accessed,
  - the number of macroscopics/velocities simultaneously updated.
- It has the side benefit of expressing more ILP and DLP (SIMDization) and a cleaner loop structure, at the cost of increased L1 cache traffic.

Multicore SMP Systems (TLB organization)
[Figure: the four block diagrams annotated with TLB sizes: Clovertown 16 entries / 4KB pages, Opteron 32 entries / 4KB pages, Niagara2 128 entries / 4MB pages, Cell PPEs 1024 entries and SPEs 256 entries / 4KB pages.]

Cache / TLB Tug-of-War
- For cache locality we want a small VL; for TLB page locality we want a large VL.
- Each architecture strikes a different balance between these two forces (L1 miss penalty vs. TLB miss penalty), roughly L1 size / 1200 <= VL <= page size / 8.
- Solution: auto-tune to find the optimal VL.

Auto-tuned Performance (+Vectorization)
- Each update touches ~150 components, each likely to be on a different page, so TLB misses can significantly impact performance.
- Solution: vectorization. Fuse the spatial loops, strip-mine them into vectors of size VL, and interchange them with the phase-dimension (velocity) loops, as sketched below.
- Auto-tune: search for the optimal vector length.
- Significant benefit on some architectures; becomes irrelevant once bandwidth dominates performance.
[Figure: GFlop/s vs. thread count with vectorization added on top of padding and the naïve+NUMA baseline.]
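A minimal sketch of the fuse / strip-mine / interchange transformation described above, assuming a flattened grid of npts points and the hypothetical SoA component arrays from earlier; the real generated kernels are far more involved:

```c
#include <stddef.h>

/* Before: for each grid point, loop over all velocities (touching ~150
 * arrays per point, which thrashes small TLBs).
 *
 * After: fuse the spatial loops into one flattened loop, strip-mine it
 * into blocks of VL points, and move the velocity loop outside the block,
 * so each velocity array is streamed over VL contiguous points (one or
 * two pages) before the next array is touched.                            */
void stream_soa(double *vel[], int nvel, double *tmp, size_t npts, size_t VL) {
    for (size_t block = 0; block < npts; block += VL) {
        size_t n = (npts - block < VL) ? (npts - block) : VL;
        for (int v = 0; v < nvel; v++) {          /* velocity loop now outermost   */
            const double *src = vel[v] + block;   /* unit stride within the block  */
            for (size_t i = 0; i < n; i++)
                tmp[v * VL + i] = src[i];         /* gather the block into a small */
        }                                         /*   working set                 */
        /* ... collision operator applied to tmp[] for these VL points ... */
    }
}
```

The auto-tuner's job is then to pick VL inside the window bounded by the L1 capacity on one side and the TLB reach on the other.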
Auto-tuned Performance (+Explicit Unrolling/Reordering)
- Give the compilers a helping hand with the complex loops.
- Code generator: a Perl script generates all power-of-2 possibilities.
- Auto-tune: search for the best unrolling and expression of data-level parallelism.
- This is essential when using SIMD intrinsics.
[Figure: GFlop/s vs. thread count with unrolling added to the previous optimizations.]

Auto-tuned Performance (+Software Prefetching)
- Expanded the code generator to insert software prefetches in case the compiler doesn't.
- Auto-tune over: no prefetch, prefetch 1 line ahead, prefetch 1 vector ahead.
- Relatively little benefit for relatively little work.
[Figure: GFlop/s vs. thread count with software prefetching added. Fractions of peak: Clovertown 6% of peak flops / 22% of bandwidth; Opteron 32% / 40%; Niagara2 59% / 15%; Cell PPEs 10% / 3.7%.]

Auto-tuned Performance (+SIMDization, including non-temporal stores)
- The compilers (gcc & icc) failed to exploit SIMD, so we expanded the code generator to emit SIMD intrinsics.
- Explicit unrolling/reordering was extremely valuable here.
- Exploited movntpd (non-temporal stores) to minimize memory traffic, the only hope when memory bound.
- Significant benefit for significant work.
[Figure: GFlop/s vs. thread count with SIMDization added. Fractions of peak: Clovertown 7.5% of peak flops / 18% of bandwidth; Opteron 42% / 35%; Niagara2 59% / 15%; Cell PPEs 10% / 3.7%. Overall speedups shown: Clovertown 1.6x, Opteron 4.3x, Niagara2 1.5x, Cell PPEs 10x.]
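To give a flavor of what the generated x86 kernels look like, here is a small hand-written sketch (not the authors' generated code) combining 2-way unrolling, a software prefetch, and SSE2 non-temporal stores (movntpd via _mm_stream_pd) for a simple streaming update; the array names and prefetch distance are illustrative:

```c
#include <stddef.h>
#include <emmintrin.h>   /* SSE2: __m128d, _mm_stream_pd */
#include <xmmintrin.h>   /* _mm_prefetch, _mm_sfence     */

/* dst[i] = a[i] + b[i], written with non-temporal (cache-bypassing) stores
 * so the written lines neither displace the read streams nor incur
 * write-allocate traffic.  dst must be 16-byte aligned and n a multiple
 * of 4 in this sketch.                                                    */
void stream_add(double *dst, const double *a, const double *b, size_t n) {
    for (size_t i = 0; i < n; i += 4) {                        /* 2x unrolled SIMD ops */
        _mm_prefetch((const char *)(a + i + 64), _MM_HINT_T0);  /* prefetch ahead      */
        _mm_prefetch((const char *)(b + i + 64), _MM_HINT_T0);
        __m128d x0 = _mm_add_pd(_mm_loadu_pd(a + i),     _mm_loadu_pd(b + i));
        __m128d x1 = _mm_add_pd(_mm_loadu_pd(a + i + 2), _mm_loadu_pd(b + i + 2));
        _mm_stream_pd(dst + i,     x0);   /* movntpd: bypass the cache */
        _mm_stream_pd(dst + i + 2, x1);
    }
    _mm_sfence();   /* make the streaming stores globally visible */
}
```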
Section: Performance and Analysis of Cell Implementation

Cell Implementation
- Double-precision implementation; DP severely hampers performance.
- Vectorized and double buffered, but not auto-tuned:
  - no NUMA optimizations
  - no unrolling
  - VL is fixed
  - written straight to SIMD intrinsics
  - prefetching replaced by DMA list commands
- Only collision() was implemented.
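A minimal sketch of the double-buffered DMA pattern used to stream lattice data through each SPE's local store, assuming the standard spu_mfcio.h interface; the chunk size, effective-address handling, and the process_chunk() routine are illustrative, not the authors' code:

```c
#include <stdint.h>
#include <spu_mfcio.h>

#define CHUNK 2048                       /* doubles per DMA (16KB, the MFC maximum) */
static volatile double buf[2][CHUNK] __attribute__((aligned(128)));

void process_chunk(volatile double *d, int n);   /* collision() on one chunk (stub) */

/* Stream 'nchunks' chunks from main memory (effective address 'ea'),
 * overlapping the DMA of chunk i+1 with computation on chunk i.          */
void stream_lattice(uint64_t ea, int nchunks) {
    int cur = 0;
    mfc_get(buf[cur], ea, CHUNK * sizeof(double), cur, 0, 0);     /* prime buffer 0  */
    for (int i = 0; i < nchunks; i++) {
        int nxt = cur ^ 1;
        if (i + 1 < nchunks)                                      /* start next DMA  */
            mfc_get(buf[nxt], ea + (uint64_t)(i + 1) * CHUNK * sizeof(double),
                    CHUNK * sizeof(double), nxt, 0, 0);
        mfc_write_tag_mask(1 << cur);                             /* wait for current */
        mfc_read_tag_status_all();
        process_chunk(buf[cur], CHUNK);
        cur = nxt;
    }
}
```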
Auto-tuned Performance (Local Store Implementation)
- First attempt at a Cell SPE implementation: VL, unrolling, and reordering fixed; no NUMA; exploits DMA and double buffering to load vectors; written straight to SIMD intrinsics.
- Despite the strong relative performance, Cell's DP implementation severely impairs absolute performance.
[Figure: GFlop/s vs. thread count, collision() only. Cell SPEs reach 57% of peak flops / 33% of bandwidth, vs. Niagara2 at 59% / 15%, Opteron at 42% / 35%, and Clovertown at 7.5% / 18%.]

Speedup from Heterogeneity
- Cell SPEs: 13x over the auto-tuned PPE code (collision() only); for comparison, auto-tuning gained 1.6x on Clovertown, 4.3x on Opteron, and 1.5x on Niagara2.

Speedup over Naïve
- Clovertown 1.6x, Opteron 4.3x, Niagara2 1.5x, Cell SPEs 130x (collision() only).

Section: Summary

Aggregate Performance (fully optimized)
- Cell SPEs deliver the best full-system performance, although Niagara2 delivers nearly comparable per-socket performance.
- The dual-core Opteron delivers far better performance (bandwidth) than Clovertown; Clovertown has far too little effective FSB bandwidth.
[Figure: LBMHD (64^3) GFlop/s vs. core count for Opteron, Clovertown, Niagara2, and the Cell Blade.]

Parallel Efficiency (average performance per thread, fully optimized)
- Measured as aggregate MFlop/s divided by the number of cores.
- Niagara2 and Cell showed very good multicore scaling.
- Clovertown showed very poor multicore scaling (FSB limited).
[Figure: LBMHD (64^3) GFlop/s per core vs. core count.]

Power Efficiency (fully optimized)
- Used a digital power meter to measure sustained power; power efficiency is calculated as sustained performance / sustained power.
- All cache-based machines delivered similar power efficiency.
- FBDIMMs (~12W each) raise sustained power: 8 DIMMs on Clovertown (total ~330W), 16 DIMMs on the Niagara2 machine (total ~450W).
[Figure: LBMHD (64^3) MFlop/s per watt for Clovertown, Opteron, Niagara2, and the Cell Blade.]
Productivity
- Niagara2 required significantly less work to deliver good performance (just vectorization for large problems).
- Clovertown, Opteron, and Cell all required SIMD (which hampers productivity) for best performance.
- The cache-based machines required search for some optimizations, while Cell relied solely on heuristics (less time to tune).

Summary
- Niagara2 delivered both very good performance and productivity.
- Cell delivered very good performance and efficiency (both processor and power).
- On the memory-bound Clovertown, parallelism wins out over optimization and auto-tuning.
- Our multicore auto-tuned LBMHD implementation significantly outperformed the already optimized serial implementation.
- Sustainable memory bandwidth is essential even on kernels with moderate computational intensity (flop:byte ratio).
- Architectural transparency is invaluable when optimizing code.

Section: Multi-core Arms Race

New Multicores
- Barcelona (2.3GHz Opteron) is a quad-core Opteron; Victoria Falls (1.16GHz) is a dual-socket, 128-thread Niagara2.
- Each has the same total bandwidth as its predecessor.
[Figure: GFlop/s vs. thread count for the 2.2GHz Opteron (rev.F), 2.3GHz Opteron (Barcelona), 1.40GHz Niagara2, and 1.16GHz Victoria Falls, with the optimization stack extended by a "+Smaller pages" step.]

Speedup from Multicore/Socket
- Barcelona is 1.9x faster than the rev.F Opteron (1.8x frequency normalized); Victoria Falls is 1.6x faster than Niagara2 (1.9x frequency normalized).

Speedup from Auto-tuning
- Opteron rev.F 4.3x, Barcelona 3.9x, Niagara2 1.5x, Victoria Falls 16x.

Section: Questions?

Acknowledgements
- UC Berkeley: RADLab cluster (Opterons), PSI cluster (Clovertowns)
- Sun Microsystems: Niagara2 donations
- Forschungszentrum Jülich: Cell blade cluster access
- George Vahala et al.: original version of LBMHD
- ASCR Office in the DOE Office of Science, contract DE-AC02-05CH11231