Performance Optimization and Autotuning in the Super Project

Presenter: Shirley Moore
University of Texas at El Paso
October 7, 2014
SUPER Team
Paul Hovland
K. Narayanan
Stefan Wild
Lenny Oliker
Sam Williams
B. de Supinski
Todd Gamblin
Chun Leo Liao
Kathryn Mohror
Daniel Quinlan
Kevin Huck
Allen Malony
Boyana Norris
Sameer Shende
Eduardo D’Azevedo
Philip Roth
Patrick Worley
Laura Carrington
Ananta Tiwari
Rey Chen
J. Hollingsworth
Sukhyun Song
Rob Fowler
Anirban Mandal
Allan Porterfield
Raul Ruth
Jacque Chame
Pedro Diniz
Bob Lucas (PI)
Shirley Moore
George Bosilca
Anthony Danalis
Jack Dongarra
Heike McCraw
G. Gopalakrishnan
Mary Hall
www.super-scidac.org

SUPER Research Themes
•  Architectures and applications are evolving rapidly
   –  Exponential growth in cores, which are becoming heterogeneous
   –  Scientific questions spanning multiple scales and phenomena
•  High performance is increasingly difficult to achieve and maintain
•  Energy consumption has emerged as a major concern
   –  Cost of power and cooling is constraining science
•  Resilience will be a challenge in the very near future
   –  Shrinking VLSI geometries and voltages will reduce device reliability
•  These can be traded off against each other
   –  Maximize scientific throughput per dollar or Watt
SUPER Presentation Focus
•  Tool Integration and Development
•  Optimization of MPAS-Ocean
•  Summary of other Measurement/Analysis/Optimization Efforts
SUPER Tool Integration
•  PBOUND: performance modeling from code (ANL)
•  ROSE: compiler infrastructure (LLNL)
•  TAU: tuning and analysis system (Oregon)
•  CHiLL: transformation and code generation (Utah/USC)
•  PAPI: performance API (UTK)
•  Active Harmony: online and offline autotuning (UMD)
•  ORIO: empirical tuning system (ANL/Oregon)
Focus of first year on tool integration. Edges/arrows in the original diagram show SUPER tools that have been integrated.

SUPER Performance Focus
•  Tools must support architectural changes…
–  Heterogeneous processing units
–  Hierarchical parallelism, mixed threads and processes
–  Wide SIMD
•  … and application requirements
–  Scalability to many cores, processes
–  Dynamic parallelism
–  Batched linear algebra
–  Sparse matrices
Progress on Tool Development
•  ROSE: wide SIMD, threading & OpenACC code generation; expanded Fortran support
•  PBOUND: generating performance models, initial roofline support
•  TAU: hybrid profiling, RAPL/CUPTI support
•  CHiLL: optimized control flow, sparse matrices, threading (CUDA & OMP)
•  PAPI: RAPL power/energy measurement
•  Active Harmony: autotuning communication, non-blocking FFT library
•  ORIO: CUDA/OpenCL code generation, sparse matrices, PETSc integration

Example: TAU + ROSE + CHiLL + Active Harmony
•  Combine tools to create, store and query results of autotuning experiments in TAUdb
•  Apply machine learning to data stored in TAUdb
–  Generate decision trees based upon code features
•  Consider matrix multiply (e.g., Nek5000)
–  Matrices of different sizes
with different performance
–  Automatically generate function
to select specialized variants
[Figure: decision tree branching on PARAM_k, PARAM_m, and PARAM_n values (e.g., =2, =8, =10, =100) to select among matrix-multiply variants MM_v0 through MM_v7, with MM_v0 as the default fallback]
Applying Tools to SciDAC Codes
•  SUPER interaction with SciDAC application teams
•  Approach:
–  Collection of SUPER researchers examine
application excerpt or full application to identify
performance bottlenecks
–  Opportunities to improve code in different ways
are explored.
MPAS-Ocean
•  A new code → an opportunity to have early impact
•  Complex representation: variable-resolution irregular mesh of hexagonal grid cells
http://mpas-dev.github.io

Issue 1: Limited Scalability
•  MPI-only scaling leads to additional halo computation and communication
•  OpenMP scaling is promising, but the reduction in FLOPs/core is offset by an increase in cache misses
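The halo cost under MPI-only strong scaling can be seen with a back-of-the-envelope model: owned cells shrink with subdomain area while halo cells shrink only with subdomain perimeter, so the halo fraction grows as the process count rises. The sketch below assumes square tiles of a regular grid, which MPAS's irregular hexagonal mesh is not, so it only illustrates the trend.

```python
# Back-of-the-envelope sketch (not MPAS data): an N x N cell mesh split
# into P square subdomains, each with a 1-cell-deep halo ring. Owned
# cells scale like area, halo cells like perimeter, so the halo
# fraction grows as P increases under MPI-only strong scaling.

def halo_fraction(n_cells_per_side, n_procs):
    """Halo cells as a fraction of owned cells, assuming square tiles."""
    side = n_cells_per_side / (n_procs ** 0.5)   # tile edge length
    owned = side * side
    halo = 4 * side + 4                          # 1-deep ring around the tile
    return halo / owned
```

Under this model the halo fraction roughly doubles for every 4x increase in process count, which is why hybrid MPI+OpenMP (fewer, larger subdomains) reduces redundant halo work.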
Issue 2: Load Imbalance Due to Partitioning
•  Add TAU_METADATA calls to capture sizes of structures (nCells, nCellsSolve, nEdges, totalLevelCells, tauTotalLevelEdgeTop)
•  Correlate computation balance with metadata fields
•  Explore different partitions with METIS
•  Conclusion: ADV time is correlated with metadata; partition on nCellsSolve, but its value is unknown before partitioning
[Figure: MPAS-O execution time (seconds) vs. DHE, the difference between max and min number of halo edges per partition; smaller DHE means better load balance and shorter execution time; times differ by 17%; linear fit y = 3.4824x + 664.93]
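The linear fit shown on the slide (y = 3.4824x + 664.93, relating execution time to DHE) is an ordinary least-squares fit. A minimal sketch follows; the data points are synthetic, generated noise-free from that fit rather than taken from the measured runs.

```python
# Minimal ordinary-least-squares sketch of the kind of fit shown on the
# slide (exec time vs. DHE). The points below are synthetic, generated
# from the slide's fit y = 3.4824x + 664.93, not the measured data.

def linear_fit(xs, ys):
    """Least-squares slope and intercept for y ~ a*x + b."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

dhe = [60, 70, 80, 90, 100, 110, 120]          # halo-edge spread per partition
time = [3.4824 * x + 664.93 for x in dhe]      # synthetic, noise-free
slope, intercept = linear_fit(dhe, time)
# On this synthetic data the fit recovers slope 3.4824, intercept 664.93.
```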
Issue 3: Communication Costs
•  Intra-node: ordering of cells impacts cache behavior
   –  Solution: use space-filling curves (Hilbert, Morton, Peano) to explore different cell orderings
•  Inter-node: communication frequency adds overhead
[Figure: left panel, time (ms) vs. number of MPI processes for random, original, Hilbert, Morton, and Peano cell orderings; right panel, time (s) vs. number of MPI processes (4 to 4096) for original vs. optimized communication]
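Of the space-filling curves mentioned, Morton (Z-order) indexing is the simplest: interleave the bits of the cell coordinates and sort by the result, so spatially nearby cells end up nearby in memory. The sketch below works on a regular x/y grid; MPAS's hexagonal mesh is irregular, so this only illustrates the idea, not the actual MPAS reordering code.

```python
# Sketch of Morton (Z-order) indexing, one of the space-filling curves
# mentioned above. Sorting cells by the interleaved-bit key keeps
# spatially close cells close in memory, improving cache behavior.

def morton_key(x, y, bits=16):
    """Interleave the bits of x and y into a single Z-order index."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)        # x bits -> even positions
        key |= ((y >> i) & 1) << (2 * i + 1)    # y bits -> odd positions
    return key

def reorder_cells(cells):
    """Sort (x, y) cell coordinates along the Z-order curve."""
    return sorted(cells, key=lambda c: morton_key(c[0], c[1]))
```

Hilbert and Peano curves preserve locality somewhat better than Morton order but need more involved index computations; sweeping all of them, as the figure does, is itself a small tuning experiment.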
•  Solution: aggregate communication, represented by a benchmark

Issue 4: Indirection Increases Instruction Count
•  Performance issue: significant structure indirection inside loop nests, e.g.
   block%mesh%cellsOnEdge%array(1,iEdge)
   block%mesh%cellsOnEdge%array(2,iEdge)
•  Solution: replace with pointer buffers that point directly to the array; a pointer-buffer auto-generator applies the transformation
•  Using pointer buffers in all functions: overall 1.16x application speedup
[Figure: percent reduction in instruction count compared to the base case, and speedup per function; per-function speedups of 5.32x, 2.87x, 2.43x, 2.12x, and 1.33x for routines including the velocity predictor/corrector and baroclinic steps]
Original code with structure indirection:
block => domain%blocklist
do while (associated(block))
!...
do iEdge=1,block%mesh%nEdges
cell1 = block%mesh%cellsOnEdge%array(1,iEdge)
cell2 = block%mesh%cellsOnEdge%array(2,iEdge)
!...
end do
block => block % next
end do ! block
Optimized code using pointer buffers:
! define pointer buffers
real(kind=RKIND),dimension(:,:),pointer :: cellsOnEdge
block => domain%blocklist
do while (associated(block))
! associate pointer buffer to target variable
cellsOnEdge => block%mesh%cellsOnEdge%array
!...
do iEdge=1,block%mesh%nEdges
!Using pointer buffers to do calculation
cell1 = cellsOnEdge(1,iEdge)
cell2 = cellsOnEdge(2,iEdge)
!...
end do
! nullify the association when computation is finished
! (do not deallocate: the pointer only aliases mesh data owned elsewhere,
! and deallocating a disassociated pointer is an error)
nullify(cellsOnEdge)
block => block % next
end do ! block
Applying Tools to SciDAC Codes, cont.
•  Other code examples employing SUPER tools
–  XGC1
•  Performance measurement using TAU/CUPTI (also HPCToolkit and PerfExpert)
•  Conclusion: low IPC and high data accesses in PUSHE
•  Currently working on semi-automated generation of
OpenACC directives
–  USQCD
•  Sub-optimal performance for quantum linear algebra
•  Rewrote to a higher level and used CHiLL to regenerate low-level complex matrix-vector multiplies
•  Need to generalize to higher-dimensional lattices
NWChem TCE Module
•  Tensor Contraction Engine
–  implements approximations that converge toward the exact solution of the Schrödinger equation
•  libtensor
–  Key Challenge: Master-worker task parallelism of
small BLAS calls, agnostic of NUMA and cache
locality
–  Implementing NUMA measurement capability in
PAPI to enable optimization of task scheduling
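A minimal sketch of the kind of locality-aware task selection such NUMA measurements could drive: each worker prefers tasks whose operands reside on its own NUMA node and steals remote work only when idle. The queue structure and identifiers here are hypothetical, not libtensor's actual scheduler.

```python
# Hypothetical sketch of NUMA-aware task scheduling of the kind the
# slide motivates for master-worker parallelism over small BLAS calls.
# Not a real libtensor API; node and task identifiers are illustrative.

from collections import deque

class LocalityAwareQueue:
    def __init__(self, n_nodes):
        # One queue per NUMA node.
        self.queues = [deque() for _ in range(n_nodes)]

    def push(self, task, home_node):
        """Enqueue a task on the NUMA node holding its operands."""
        self.queues[home_node].append(task)

    def pop(self, worker_node):
        """Prefer local work; steal from remote nodes only when idle."""
        if self.queues[worker_node]:
            return self.queues[worker_node].popleft()
        for q in self.queues:          # fall back to remote queues
            if q:
                return q.popleft()
        return None
```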
“Analysis and Tuning of Libtensor Library”
Khaled Ibrahim and Samuel Williams (Lawrence Berkeley National Lab)
Objectives:
•  Multicore architectures impose multiple challenges for scaling scientific applications.
•  Tuning scientific codes leads to performance improvement and efficient execution.
•  This work aims at identifying performance bottlenecks in multiple quantum chemistry frameworks.
Impact:
•  Lets domain scientists focus their efforts on chemistry research rather than computational aspects.
•  Exposes challenges in quantum chemistry codes to computer scientists.
•  Allows computer scientists to explore efficient techniques to increase the productivity of developing codes for quantum chemistry computations.
Progress and Accomplishments:
•  More than 1.8x performance improvement on a 24-core Intel Ivy Bridge system (Edison); more than 2x improvement on AMD Magny-Cours
•  Speedup improvement from 9.5x to 15.3x on the 24-core Intel Ivy Bridge
•  Results and techniques to be published

Performance improvement on 24-core Intel Ivy Bridge (Edison), edison_p1_opt, as a function of tasking queue size (default 4) and tiling factor (default 4); OoM = out of memory:

Queue size \ Tiling:    2     3     4     5     6     7     8
1                     0.30  0.93  1.53  1.66  1.68  1.76  1.81
2                     0.28  0.87  1.45  1.54  1.57  1.55   OoM
4                     0.26  0.74  1.21  1.31  1.28  1.33   OoM
8                     0.23  0.60  0.97  1.03  0.99  1.05   OoM
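A table like the one above comes from an exhaustive sweep over the two tuning parameters, keeping the best-performing configuration. A sketch of that sweep follows; the measure() callback stands in for an actual timed libtensor run and is not a real API.

```python
# Sketch of the exhaustive parameter sweep behind a tuning table like
# the one above: evaluate every (queue size, tiling factor) pair and
# keep the best configuration. measure() is a hypothetical stand-in
# for a timed run; it returns None when the run fails (e.g. OoM).

from itertools import product

def sweep(queue_sizes, tiling_factors, measure):
    """Return the configuration with the highest measured improvement."""
    best_cfg, best_perf = None, float("-inf")
    for q, t in product(queue_sizes, tiling_factors):
        perf = measure(q, t)
        if perf is not None and perf > best_perf:
            best_cfg, best_perf = (q, t), perf
    return best_cfg, best_perf

# Example using a few of the slide's numbers for queue size 1:
table = {(1, 2): 0.30, (1, 7): 1.76, (1, 8): 1.81}
cfg, perf = sweep([1], [2, 7, 8], lambda q, t: table.get((q, t)))
# cfg == (1, 8), perf == 1.81
```

For larger parameter spaces a search-based autotuner such as Active Harmony replaces the exhaustive loop, but the measure-and-compare structure is the same.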
Conclusion
•  SUPER creates the opportunity for state-of-the-art performance tuning to be applied to SciDAC applications (e.g., MPAS-Ocean).
•  SUPER fosters a community of performance tuning researchers, informed by the needs of application programmers.
•  SUPER performance tools have been integrated, and are continuously extended to meet these needs.