Performance Optimization and Autotuning in the SUPER Project
Presenter: Shirley Moore, University of Texas at El Paso
October 7, 2014

SUPER Team
Paul Hovland, K. Narayanan, Stefan Wild, Lenny Oliker, Sam Williams, B. de Supinski, Todd Gamblin, Chun "Leo" Liao, Kathryn Mohror, Daniel Quinlan, Kevin Huck, Allen Malony, Boyana Norris, Sameer Shende, Eduardo D'Azevedo, Philip Roth, Patrick Worley, Laura Carrington, Ananta Tiwari, Rey Chen, J. Hollingsworth, Sukhyun Song, Rob Fowler, Anirban Mandal, Allan Porterfield, Raul Ruth, Jacque Chame, Pedro Diniz, Bob Lucas (PI), Shirley Moore, George Bosilca, Anthony Danalis, Jack Dongarra, Heike McCraw, G. Gopalakrishnan, Mary Hall
www.super-scidac.org

SUPER Research Themes
• Architectures and applications are evolving rapidly
  – Exponential growth in cores, which are becoming heterogeneous
  – Scientific questions spanning multiple scales and phenomena
• High performance is increasingly difficult to achieve and maintain
• Energy consumption has emerged as a major concern
  – The cost of power and cooling is constraining science
• Resilience will be a challenge in the very near future
  – Shrinking VLSI geometries and voltages will reduce device reliability
• These concerns can be traded off against each other
  – Maximize scientific throughput per dollar or watt

Presentation Focus
• Tool integration and development
• Optimization of MPAS-Ocean
• Summary of other measurement/analysis/optimization efforts

SUPER Tool Integration
• PBOUND: performance modeling from code (ANL)
• ROSE: compiler infrastructure (LLNL)
• TAU: tuning and analysis system (Oregon)
• CHiLL: transformation and code generation (Utah/USC)
• PAPI: performance API (UTK)
• Active Harmony: online and offline autotuning (UMD)
• Orio: empirical tuning system (ANL/Oregon)
The focus of the first year was on tool integration; edges/arrows in the diagram show the SUPER tools that have been integrated.
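The empirical-tuning workflow these tools share (generate a code variant, time it, keep the best parameter setting) can be sketched as below. This is a minimal illustrative loop, not Active Harmony's or Orio's actual API; the toy kernel and the tile-size search space are assumptions for the example.

```python
# Minimal sketch of an empirical autotuning loop in the spirit of
# Active Harmony / Orio: time each candidate parameter setting of a
# kernel and keep the best.  The kernel and parameter space are
# illustrative stand-ins, not part of any SUPER tool's interface.
import time

def kernel(n, tile):
    """Toy tiled summation standing in for a generated code variant."""
    total = 0
    for start in range(0, n, tile):
        total += sum(range(start, min(start + tile, n)))
    return total

def autotune(n, tile_candidates):
    """Exhaustively search the candidates, returning the fastest tile size."""
    best_tile, best_time = None, float("inf")
    for tile in tile_candidates:
        t0 = time.perf_counter()
        kernel(n, tile)
        elapsed = time.perf_counter() - t0
        if elapsed < best_time:
            best_tile, best_time = tile, elapsed
    return best_tile

best = autotune(100000, [64, 256, 1024, 4096])
print("selected tile size:", best)
```

In practice the SUPER tools replace the exhaustive loop with guided search and store each measured configuration for later analysis.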
SUPER Performance Focus
• Tools must support architectural changes...
  – Heterogeneous processing units
  – Hierarchical parallelism, mixed threads and processes
  – Wide SIMD
• ... and application requirements
  – Scalability to many cores and processes
  – Dynamic parallelism
  – Batched linear algebra
  – Sparse matrices

Progress on Tool Development
• PBOUND (ANL): generating performance models; initial roofline support
• ROSE (LLNL): wide SIMD, threading, and OpenACC code generation; expanded Fortran support
• TAU (Oregon): hybrid profiling; RAPL/CUPTI support
• CHiLL (Utah/USC): CUDA/OpenCL code generation; sparse matrices; PETSc integration
• PAPI (UTK): RAPL power/energy measurement
• Active Harmony (UMD): autotuning communication; non-blocking FFT library
• Orio (ANL/Oregon): optimized control flow; sparse matrices; threading (CUDA & OpenMP)

Example: TAU + ROSE + CHiLL + Active Harmony
• Combine tools to create, store, and query the results of autotuning experiments in TAUdb

Example: TAU + ROSE + CHiLL + Active Harmony (cont.)
• Apply machine learning to the data stored in TAUdb
  – Generate decision trees based on code features
• Consider matrix multiply (e.g., in Nek5000)
  – Matrices of different sizes exhibit different performance
  – Automatically generate a function that selects among specialized variants (MM_v0 ... MM_v7) based on parameters such as PARAM_k, PARAM_m, and PARAM_n
[Figure: decision tree mapping parameter values to matrix-multiply variants]

Applying Tools to SciDAC Codes
• SUPER interacts with SciDAC application teams
• Approach:
  – A collection of SUPER researchers examines an application excerpt or the full application to identify performance bottlenecks
  – Opportunities to improve the code in different ways are explored
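The variant-selection function that the decision tree produces can be sketched as follows. The branching structure below is illustrative only (the tree in the slide's figure is not fully recoverable); the variant names and PARAM_* features follow the slide.

```python
# Hedged sketch of the TAUdb machine-learning idea: a decision tree
# learned from stored autotuning results maps problem-size features
# (PARAM_k, PARAM_m, PARAM_n) to a specialized matrix-multiply variant.
# The tree shape here is an illustrative assumption, not the learned tree.
def select_variant(k, m, n):
    """Return the name of the specialized variant to dispatch to."""
    if k == 2:
        return "MM_v1" if m == 10 else "MM_v0"   # small-k specialization
    if k == 10:
        return "MM_v7" if n == 10 else "MM_v0"   # common hot shape
    return "MM_v0"                               # fallback: generic variant

print(select_variant(10, 10, 10))  # → MM_v7
```

At run time, the generated dispatcher calls `select_variant` once per distinct problem shape, so the per-call overhead is a few comparisons.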
MPAS-Ocean
• A new code → an opportunity to have early impact
• Complex representation: variable-resolution irregular mesh of hexagonal grid cells
• http://mpas-dev.github.io

Issue 1: Limited Scalability
• MPI-only scaling leads to additional halo computation and communication
• OpenMP scaling is promising, but the reduction in FLOPs/core is offset by an increase in cache misses

Issue 2: Load Imbalance Due to Partitioning
• Add TAU_METADATA calls to capture the sizes of data structures (nCells, nEdges, nCellsSolve, totalLevelCells, tauTotalLevelEdgeTop)
• Explore different partitions with METIS
• Correlate computation balance with the metadata fields: ADV time correlates with the metadata; a linear fit (y = 3.4824x + 664.93) relates MPAS-O execution time to DHE, the difference between the maximum and minimum number of halo edges per partition, and execution time differs by 17% across partitions
• Conclusion: partition on nCellsSolve, but its value is unknown before partitioning
[Figure: MPAS-O execution time (seconds) vs. DHE (edges); better load balance gives shorter execution time]

Issue 3: Communication Costs
• Intra-node: the ordering of cells impacts cache behavior
  – Solution: use space-filling curves (Hilbert, Morton, Peano) to explore different cell orderings, compared against random and the original ordering
• Inter-node: communication frequency adds overhead
  – Solution: aggregate communication, represented by a benchmark (original vs. optimized, measured from 4 to 4096 MPI processes)
[Figure: time vs. number of MPI processes for the cell orderings and for original vs. aggregated communication]

Issue 4: Indirection Increases Instruction Count
• Performance issue: significant structure indirection inside loop nests, e.g.
    block%mesh%cellsOnEdge%array(1,iEdge)
    block%mesh%cellsOnEdge%array(2,iEdge)
• Solution: replace the indirection with pointer buffers that point directly to the array, applied by a pointer-buffer auto-generator
• Per-function reductions in instruction count relative to the base case yield speedups from 1.33x to 5.32x; using pointer buffers in all functions gives an overall 1.16x application speedup
[Figure: % reduction in instruction count and speedup per function, compared with the base case]
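The space-filling-curve reordering for Issue 3 can be sketched with the Morton (Z-order) curve, one of the orderings the slide compares. Interleaving the bits of each cell's grid coordinates gives a key; sorting cells by that key places spatially nearby cells close together in memory. The cell coordinates below are illustrative, not MPAS-Ocean data.

```python
# Sketch of space-filling-curve cell reordering for cache locality.
# Morton (Z-order) keys interleave the bits of (x, y) grid coordinates;
# sorting by key clusters spatially adjacent cells in memory.
def morton_key(x, y, bits=16):
    """Interleave the low `bits` bits of x and y into one Z-order key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)       # x bits at even positions
        key |= ((y >> i) & 1) << (2 * i + 1)   # y bits at odd positions
    return key

def reorder_cells(cells):
    """cells: list of (x, y) coordinates; returns a Morton-ordered copy."""
    return sorted(cells, key=lambda c: morton_key(*c))

cells = [(3, 1), (0, 0), (1, 1), (2, 2)]
print(reorder_cells(cells))  # → [(0, 0), (1, 1), (3, 1), (2, 2)]
```

Hilbert and Peano orderings follow the same pattern with different key functions; the slide's experiments compare all three against random and original orderings.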
" "S T "S T OR CT RE OR EP " 1.33" "C TY CI LO VE Pointer buffer Auto-­‐generator InstrucIon"Count"and"Speedup"per"FuncIon" EP " %"reducIon"in"instrucIon"count""" compared"to"the"base"case" block => domain%blocklist do while (associated(block)) !... do iEdge=1,block%mesh%nEdges cell1 = block%mesh%cellsOnEdge%array(1,iEdge) cell2 = block%mesh%cellsOnEdge%array(2,iEdge) !... end do block => block % next end do ! block ! define Pointer buffers real(kind=RKIND),dimension(:,:),pointer :: cellsOnEdge block => domain%blocklist do while (associated(block)) ! associate pointer buffer to target variable cellsOnEdge => block%mesh%cellsOnEdge%array !... do iEdge=1,block%mesh%nEdges !Using pointer buffers to do calculation cell1 = cellsOnEdge(1,iEdge) cell2 = cellsOnEdge(2,iEdge) !... end do ! nullify the association when computation finished nullify(cellsOnEdge) ! deallocate buffers deallocate(cellsOnEdge) block => block % next end do ! block 15 SUPER Applying Tools to SciDAC Codes, cont. • Other code examples employing SUPER tools – XGC1 • Performance measurement using TAU/CUPTI (also, HPC Toolkit and PerfExpert) • Conclusion: low IPC and high data accesses in PUSHE • Currently working on semi-automated generation of OpenACC directives – USQCD • Sub-optimal performance for quantum linear algebra • Rewrote to higher level and used CHiLL to regenerate lowlevel complex matrix-vector multiplies • Need to generalize to higher-dimensional lattices 16 SUPER NWChem TCE module • Tensor Contraction Engine – implements approximations that converge toward exact solution of the Schrödinger equation • libtensor – Key Challenge: Master-worker task parallelism of small BLAS calls, agnostic of NUMA and cache locality – Implementing NUMA measurement capability in PAPI to enable optimization of task scheduling 17 SUPER “Analysis and Tuning of Libtensor Library” Khaled Ibrahim and Samuel Williams (Lawrence Berkeley National Lab) Objec7ves: Impact: § Mul1core architectures are imposing mul1ple 
challenges for scaling scientific applications.
• Tuning scientific codes leads to performance improvement and efficient execution.
• This work aims at identifying performance bottlenecks in multiple quantum chemistry frameworks.

Impact:
• Lets domain scientists focus their efforts on chemistry research rather than computational aspects.
• Exposes challenges in quantum chemistry codes to computer scientists.
• Allows computer scientists to explore efficient techniques to increase the productivity of developing codes for quantum chemistry computations.

Progress and Accomplishments:
• More than 1.8x performance improvement on a 24-core Intel Ivy Bridge system (Edison); more than 2x improvement on AMD Magny-Cours.
• Speedup improved from 9.5x to 15.3x on the 24-core Intel Ivy Bridge.
• Results and techniques to be published.

Performance improvement on the 24-core Intel Ivy Bridge (Edison), varying tiling factor and tasking queue size (defaults: 4 and 4; OoM = out of memory):

    Tiling factor:   2     3     4     5     6     7     8
    Queue size 1:  0.30  0.93  1.53  1.66  1.68  1.76  1.81
    Queue size 2:  0.28  0.87  1.45  1.54  1.57  1.55  OoM
    Queue size 4:  0.26  0.74  1.21  1.31  1.28  1.33  OoM
    Queue size 8:  0.23  0.60  0.97  1.03  0.99  1.05  OoM

Conclusion
• SUPER creates the opportunity for state-of-the-art performance tuning to be applied to SciDAC applications (e.g., MPAS-Ocean).
• SUPER fosters a community of performance-tuning researchers, informed by the needs of application programmers.
• SUPER performance tools have been integrated and are continuously extended to meet these needs.