COMMIT/ Henri Bal GPU Programming: eScience or Engineering?

COMMIT/
GPU Programming:
eScience or Engineering?
Henri Bal
Vrije Universiteit Amsterdam
Graphics Processing Units
● GPUs and other accelerators take top-500 by storm
● Many application success stories
● But GPUs are very difficult to program and optimize
http://www.nvidia.com/object/tesla-case-studies.html
Example: convolution
[Figure: naive vs. fully optimized convolution kernel]
● About half a Ph.D. thesis
Parallel Programming Lab course
● Lab course for MSc students (next to lectures)
● CUDA:
  ● Simple image processing application on 1 node
● MPI:
  ● Parallel all-pairs shortest path algorithms
● CUDA: 11 out of 21 passed (52 %)
● MPI: 17 out of 21 passed (81 %)
Questions
● Why are accelerators so difficult to program?
● What are the challenges for Computer Science?
● What role do applications play?
Background
● Netherlands eScience Center
  ● Bridge between ICT and applications
  ● Climate modeling, astronomy, water management, digital forensics, …
● COMMIT: (100 M€) public-private ICT program
  ● http://www.commit-nl.nl/
● Distributed ASCI Supercomputer (DAS)
  ● Testbed for Computer Science (Euro-Par 2014 keynote)
My background
• Cluster computing
  • Zoo (1994), Orca
• Wide-area computing
  • DAS-1 (1997), Albatross
• Grid computing
  • DAS-2 (2002), Manta, Satin
• eScience & optical grids
  • DAS-3 (2006), Ibis
• Hybrid computing
  • DAS-4 (2010), Glasswing, MCL
Background (team)
Ph.D. students
● Ben van Werkhoven
● Alessio Sclocco
● Ismail El Helw
● Pieter Hijma
Staff
● Rob van Nieuwpoort (NLeSC)
● Ana Varbanescu (UvA)
Scientific programmers
● Rutger Hofman
● Ceriel Jacobs
Agenda
• Application case studies
  • Multimedia kernel (convolution)
  • Astronomy kernel (dedispersion)
  • Climate modelling: optimizing multiple kernels
• Lessons learned: why is GPU programming hard?
• Programming methodologies
  • “Stepwise refinement for performance” methodology
  • Glasswing: MapReduce on accelerators
Application case study 1: Convolution operations
Image I of size Iw by Ih
Filter F of size Fw by Fh
Thread block of size Bw by Bh
Naïve CUDA kernel:
● 1 thread per output pixel
● Per filter element, each thread does 2 arithmetic operations (multiply and add) and 2 loads (8 bytes)
● Arithmetic Intensity (AI) = 2 operations / 8 bytes = 0.25
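The arithmetic intensity quoted for the naive kernel can be checked in a few lines. This is a minimal sketch, assuming 4-byte single-precision values and counting one multiply, one add, one image load and one filter load per filter element. The function name is illustrative and does not come from the slides.

```python
def arithmetic_intensity(flops, bytes_moved):
    """Floating-point operations performed per byte of memory traffic."""
    return flops / bytes_moved

BYTES_PER_FLOAT = 4  # assumption: single-precision data

# Per filter element the naive kernel does one multiply and one add,
# and loads one image pixel and one filter coefficient.
naive_ai = arithmetic_intensity(flops=2, bytes_moved=2 * BYTES_PER_FLOAT)
print(naive_ai)
```

At a quarter of an operation per byte, the naive kernel's speed is set by memory bandwidth rather than by the arithmetic units.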
Hierarchy of concurrent threads
[Figure: a grid of 2 × 3 thread blocks, Thread Block (0,0) to (1,2); each block holds 3 × 4 threads indexed (0,0) to (2,3)]
Memory optimizations for tiled convolution
● Threads within a block cooperatively load the entire area they need into a small (e.g. 96KB) shared memory
● Filter (small) goes into constant memory
[Figure: memory hierarchy of a grid with Block (0, 0) and Block (1, 0): registers per thread, shared memory per block, and global and constant memory visible to all blocks]
Tiled convolution
16x16 thread block processing an 11x7 filter
● Arithmetic Intensity:
Analysis
● If filter size increases:
  ● Arithmetic Intensity increases:
    ● Kernel shifts from memory-bandwidth bound to compute-bound
  ● Amount of shared memory needed increases → fewer thread blocks can run concurrently on each SM
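Both trends in this analysis can be illustrated with a simple model. The sketch below is an estimate under stated assumptions, not the analysis from the talk. It assumes 4-byte values, counts only the input tile each block loads from global memory (filter reads from constant memory and output stores are ignored), and treats shared memory as the only limit on resident blocks. The 96 KB figure reuses the example size mentioned earlier, and all function names are hypothetical.

```python
BYTES_PER_FLOAT = 4  # assumption: single-precision data

def tile_elements(bw, bh, fw, fh):
    """Input elements one thread block needs: its outputs plus the filter border."""
    return (bw + fw - 1) * (bh + fh - 1)

def tiled_arithmetic_intensity(bw, bh, fw, fh):
    """FLOPs per byte read from global memory, for one thread block."""
    flops = 2 * bw * bh * fw * fh  # multiply + add per filter element
    bytes_loaded = tile_elements(bw, bh, fw, fh) * BYTES_PER_FLOAT
    return flops / bytes_loaded

def blocks_per_sm(bw, bh, fw, fh, shared_bytes_per_sm):
    """Upper bound on resident blocks, considering shared memory only."""
    needed = tile_elements(bw, bh, fw, fh) * BYTES_PER_FLOAT
    return shared_bytes_per_sm // needed

for fw, fh in [(3, 3), (11, 7), (17, 17)]:
    ai = tiled_arithmetic_intensity(16, 16, fw, fh)
    blocks = blocks_per_sm(16, 16, fw, fh, shared_bytes_per_sm=96 * 1024)
    print(f"{fw}x{fh} filter: AI = {ai:.1f} FLOP/byte, at most {blocks} blocks/SM")
```

In this model a larger filter raises the arithmetic intensity and lowers the number of blocks that fit, which is the trade-off the slide describes.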
Tiling
● Each thread block computes 1xN tiles in horizontal direction
  + Increases amount of work per thread
  + Saves loading overlapping borders
  + Saves redundant instructions
  - More shared memory, fewer concurrent thread blocks
No shared memory bank conflicts
Adaptive tiling
● Tiling factor is selected at runtime depending on the input data and the resource limitations of the device
  ● Highest possible tiling factor that fits within the shared memory available (depending on filter size)
● Plus loop unrolling, memory banks, search optimal configuration
Ph.D. thesis Ben van Werkhoven, 27 Oct. 2014 + FGCS journal, 2014
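The "highest tiling factor that fits" rule can be sketched as follows. This is only an illustration and not the runtime decision system from the thesis. It considers shared memory alone, assumes 4-byte values and 1xN horizontal tiling, and uses hypothetical names and a hypothetical `max_tiling` cap.

```python
BYTES_PER_FLOAT = 4  # assumption: single-precision data

def shared_bytes_needed(tiling, bw, bh, fw, fh):
    """Shared memory for a block that computes `tiling` horizontally adjacent tiles."""
    return (tiling * bw + fw - 1) * (bh + fh - 1) * BYTES_PER_FLOAT

def select_tiling_factor(bw, bh, fw, fh, shared_bytes_available, max_tiling=8):
    """Highest tiling factor whose input area fits in shared memory (0 if none fits)."""
    best = 0
    for tiling in range(1, max_tiling + 1):
        if shared_bytes_needed(tiling, bw, bh, fw, fh) <= shared_bytes_available:
            best = tiling
    return best

# Hypothetical budget of 4 KB of shared memory per thread block.
for fw, fh in [(11, 7), (17, 17)]:
    factor = select_tiling_factor(16, 16, fw, fh, shared_bytes_available=4096)
    print(f"{fw}x{fh} filter: tiling factor {factor}")
```

Because the filter border grows with the filter size, the same memory budget admits a smaller tiling factor for a larger filter. This is why the factor has to be chosen per input rather than fixed at compile time.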
Lessons learned
● Everything must be in balance to obtain high performance
  ● Subtle interactions between resource limits
● Runtime decision system (adaptive tiling), in combination with standard optimizations
  ● Loop unrolling, memory bank conflicts
Application case study 2: Auto-tuning Dedispersion
● Used for searching pulsars in radio astronomy data
● Pulsar signals get dispersed: lower radio frequencies arrive progressively later
  ● Can be reversed by shifting in time the signal’s lower frequencies (dedispersion)
Alessio Sclocco et al.: Auto-Tuning Dedispersion for Many-Core Accelerators, IPDPS 2014
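The shift-and-sum idea behind dedispersion can be sketched in plain Python. This is a sequential illustration, not the tuned many-core kernel from the paper. It uses the standard cold-plasma dispersion delay, with frequencies in MHz and the dispersion measure in pc cm^-3. The function names, the channel layout `data[channel][sample]` and the test signal are assumptions made for the example.

```python
K_DM = 4.148808e3  # dispersion constant in s * MHz^2 * cm^3 / pc

def delay_samples(dm, freq_mhz, ref_freq_mhz, sample_time_s):
    """Delay of freq_mhz relative to ref_freq_mhz, rounded to whole samples."""
    delay_s = K_DM * dm * (freq_mhz ** -2 - ref_freq_mhz ** -2)
    return int(round(delay_s / sample_time_s))

def dedisperse(data, freqs_mhz, dm, sample_time_s):
    """Shift every channel back by its delay and sum over channels.

    data[c][t] is the power in frequency channel c at time sample t.
    """
    ref = max(freqs_mhz)
    shifts = [delay_samples(dm, f, ref, sample_time_s) for f in freqs_mhz]
    n_out = len(data[0]) - max(shifts)
    return [sum(data[c][t + shifts[c]] for c in range(len(data)))
            for t in range(n_out)]

# Synthetic test signal: one pulse at sample 5, dispersed with DM = 10.
freqs = [140.0, 145.0, 150.0]
signal = [[0.0] * 64 for _ in freqs]
for c, f in enumerate(freqs):
    signal[c][5 + delay_samples(10.0, f, 150.0, 0.01)] = 1.0
series = dedisperse(signal, freqs, dm=10.0, sample_time_s=0.01)
print(series.index(max(series)))
```

In a pulsar search the true DM is unknown, so this computation is repeated for many trial DMs. That repetition is what makes the kernel expensive and worth tuning.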
Auto-tuning
● Using auto-tuning to find optimal configuration for:
  ● Different many-core platforms
    ● NVIDIA & AMD GPUs, Intel Xeon Phi
  ● Different observational scenarios
    ● LOFAR, Apertif
  ● Different number of Dispersion Measures (DMs)
    ● Represents number of free electrons between source & receiver
    ● Measure of distance between emitting object & receiver
● Parameters:
  ● Number of threads per sample or DM, thread block
Auto-tuning: number of threads per thread block
[Figure panels: LOFAR; Apertif]
Histogram of achieved GFLOP/s
● 396 configurations, the winner is an outlier
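An exhaustive auto-tuner of the kind described here can be sketched in a few lines. This is a generic illustration and not the tuner used in the paper. The parameter names, the validity rule and the synthetic `fake_gflops` function are all hypothetical stand-ins; in practice `measure` would compile and time the real kernel for each configuration.

```python
import itertools
import math

def autotune(measure, search_space, is_valid=lambda cfg: True):
    """Measure every valid configuration; return (best_config, best_score, results)."""
    names = list(search_space)
    results = []
    for values in itertools.product(*(search_space[name] for name in names)):
        cfg = dict(zip(names, values))
        if is_valid(cfg):
            results.append((measure(cfg), cfg))
    best_score, best_cfg = max(results, key=lambda r: r[0])
    return best_cfg, best_score, results

# Hypothetical search space: powers of two for both thread dimensions.
space = {
    "threads_per_sample": [2 ** i for i in range(9)],
    "threads_per_dm": [2 ** i for i in range(9)],
}

def fits_in_block(cfg):
    # Assumed hardware limit of 1024 threads per thread block.
    return cfg["threads_per_sample"] * cfg["threads_per_dm"] <= 1024

def fake_gflops(cfg):
    # Synthetic stand-in for a real measurement, peaking at (32, 8).
    return 100.0 / (1.0 + abs(math.log2(cfg["threads_per_sample"] / 32))
                    + abs(math.log2(cfg["threads_per_dm"] / 8)))

best_cfg, best_score, results = autotune(fake_gflops, space, fits_in_block)
print(best_cfg, best_score, len(results))
```

Keeping all results, not only the winner, is what allows a histogram like the one on this slide to be drawn.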
Lessons learned
● Auto-tuning allows algorithms to adapt to different platforms and scenarios
● Auto-tuning has large impact on dedispersion
● Guessing a good configuration without auto-tuning is difficult
Application case study 3: Global Climate Modeling
● Understand future local sea level changes
● Needs high-resolution simulations
● Combine two approaches:
  ● Distributed computing (multiple resources)
  ● GPUs
Distributed Computing
● Use Ibis to couple different simulation models
  ● Land, ice, ocean, atmosphere
● Wide-area optimizations similar to Albatross project (16 years ago), like hierarchical load balancing
Enlighten Your Research Global award
[Figure: supercomputers coupled over 10G links: STAMPEDE (USA), EMERALD (UK), CARTESIUS (NLD), KRAKEN (USA), SUPERMUC (GER); labels #7 and #10]
GPU Computing
● Offload expensive kernels for Parallel Ocean Program (POP) from CPU to GPU
  ● Many different kernels, fairly easy to port to GPUs
  ● Execution time becomes virtually 0
● New bottleneck: moving data between CPU & GPU
[Figure: host (CPU, host memory) connected to device (GPU, device memory) by a PCI Express link]
Different methods for CPU-GPU communication
● Memory copies (explicit)
  ● No overlap with GPU computation
● Device-mapped host memory (implicit)
  ● Allows fine-grained overlap between computation and communication in either direction
● CUDA Streams or OpenCL command-queues
  ● Allows overlap between computation and communication in different streams
● Any combination of the above
Problem
● Problem:
  ● Which method will be most efficient for a given GPU kernel? Implementing all can be a large effort
● Solution:
  ● Create a performance model that identifies the best implementation:
    ● What implementation strategy for overlapping computation and communication is best for my program?
Ben van Werkhoven, Jason Maassen, Frank Seinstra & Henri Bal: Performance models for CPU-GPU data transfers, CCGrid 2014 (nominated for best-paper-award)
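To give a feel for what such a model decides, here is a back-of-the-envelope comparison of explicit copies against a stream pipeline. It is a deliberately simplified sketch and not the model from the CCGrid paper. It assumes the device can copy in both directions while computing, splits the work into equal chunks, and charges a fixed, hypothetical `overhead` per transfer or kernel launch. All times are in milliseconds and all names are illustrative.

```python
def explicit_copy_time(t_h2d, t_kernel, t_d2h, overhead=0.0):
    """No overlap: copy in, compute, copy out, one after the other."""
    return (t_h2d + overhead) + (t_kernel + overhead) + (t_d2h + overhead)

def streamed_time(t_h2d, t_kernel, t_d2h, n_streams, overhead=0.0):
    """Idealized three-stage pipeline over n_streams equal chunks."""
    stages = [t / n_streams + overhead for t in (t_h2d, t_kernel, t_d2h)]
    # The first chunk passes through all stages; each later chunk adds the slowest one.
    return sum(stages) + (n_streams - 1) * max(stages)

def best_strategy(t_h2d, t_kernel, t_d2h, n_streams=4, overhead=0.0):
    """Return the name of the cheaper strategy and both estimates."""
    estimates = {
        "explicit": explicit_copy_time(t_h2d, t_kernel, t_d2h, overhead),
        "streams": streamed_time(t_h2d, t_kernel, t_d2h, n_streams, overhead),
    }
    return min(estimates, key=estimates.get), estimates

print(best_strategy(4.0, 2.0, 4.0, overhead=0.1))    # large kernel
print(best_strategy(0.1, 0.05, 0.1, overhead=0.1))   # small kernel
```

Even this crude estimate shows why the choice is not obvious: overlapping pays off for large transfers, but for small kernels the per-chunk overhead can outweigh the gain.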
Example result
[Figure: measured vs. modeled results on different GPUs, for the state kernel and the buoydiff kernel]
Comes with spreadsheet
Lessons learned
● PCIe transfers can have a large performance impact for applications with many small kernels
● Several methods for transferring data and overlapping computation & communication exist
● Performance modelling helps to select the best mechanism
Why is GPU programming hard?
● Mapping algorithm to architecture is difficult, especially as the architecture is difficult:
  ● Many levels of parallelism
  ● Limited resources (registers, shared memory)
    ● Less of everything than CPU (except parallelism), especially per thread, makes problem-partitioning difficult
  ● Everything must be in balance to obtain performance
Why is GPU programming hard?
● Many crucial high-impact optimizations needed:
  ● Data reuse
    ● Use shared memory efficiently
    ● Limited by #registers per thread, shared memory per thread block
  ● Memory access patterns
    ● Shared memory bank conflicts, global memory coalescing
  ● Instruction stream optimization
    ● Control flow divergence, loop unrolling
  ● Moving data to/from the GPU
    ● PCIe transfers
Why is GPU programming hard?
● Portability
  ● Optimizations are architecture-dependent, and the architectures change frequently
● Optimizations are often input dependent
● Finding the right parameter settings is difficult
● Need better performance models
  ● Like Roofline and our I/O model
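The Roofline model mentioned here bounds attainable performance by the smaller of the peak compute rate and the arithmetic intensity times the memory bandwidth. A minimal sketch follows. The device numbers (1000 GFLOP/s peak, 150 GB/s bandwidth) are hypothetical and chosen only for illustration.

```python
def roofline_gflops(arithmetic_intensity, peak_gflops, bandwidth_gb_s):
    """Attainable GFLOP/s for a kernel with the given FLOP/byte ratio."""
    return min(peak_gflops, arithmetic_intensity * bandwidth_gb_s)

def ridge_point(peak_gflops, bandwidth_gb_s):
    """Arithmetic intensity above which a kernel becomes compute-bound."""
    return peak_gflops / bandwidth_gb_s

PEAK_GFLOPS = 1000.0     # hypothetical device
BANDWIDTH_GB_S = 150.0   # hypothetical device

for ai in (0.25, 4.0, 16.0):
    bound = roofline_gflops(ai, PEAK_GFLOPS, BANDWIDTH_GB_S)
    regime = "compute" if ai >= ridge_point(PEAK_GFLOPS, BANDWIDTH_GB_S) else "memory"
    print(f"AI = {ai}: at most {bound} GFLOP/s ({regime}-bound)")
```

On this hypothetical device, a kernel with an arithmetic intensity of 0.25, like the naive convolution, is capped at 37.5 GFLOP/s however well its arithmetic is optimized.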
Why is GPU programming hard?
● Bottom line: tension between
  ● control over hardware to achieve performance
  ● higher abstraction level to ease programming
● Programmers need understandable performance
● Old problem in Computer Science, but now in extreme form
(1989)
Programming methodology: stepwise refinement for performance
● Methodology:
  ● Programmers can work on multiple levels of abstraction
  ● Integrate hardware descriptions into programming model
  ● Performance feedback from compiler, based on hardware description and kernel
  ● Cooperation between compiler and programmer
P. Hijma et al., “Stepwise-refinement for Performance: a methodology for many-core programming,” Concurrency and Computation: Practice and Experience (accepted)
MCL: Many-Core Levels
● MCL program is an algorithm mapped to hardware
● Start at a suitable abstraction level
  ● E.g. idealized accelerator, NVIDIA Kepler GPU, Xeon Phi
● MCL compiler tells the programmer which optimizations to apply at the given abstraction level, or when to move to deeper levels
MCL ecosystem
Convolution example
Compiler feedback
Performance
(GTX480, 9×9 filters)
380 GFLOPS
MCL:
302 GFLOPS
Compiler +
Performance evaluation
Compared to known, fully optimized versions
(* measured on a C2050, ** using a different input).
Current work on MCL: Heterogeneous many-core clusters
● New GPUs become available frequently, but older-generation GPUs often still are fast enough
  ● Clusters become heterogeneous and contain different types of accelerators
● VU DAS-4 cluster:
  ● NVIDIA GTX480 GPUs (22)
  ● NVIDIA K20 GPUs (8)
  ● Intel Xeon Phi (2)
  ● NVIDIA C2050 (2), Titan, GTX680 GPU
  ● AMD HD7970 GPU
Cashmere
● Integration MCL + Satin divide-and-conquer system
● Satin [ACM TOPLAS 2010] does:
  ● Load-balancing (cluster-aware random work-stealing)
  ● Latency hiding
● MCL allows kernels to be written and optimized for each type of hardware
● Cashmere does integration, application logic, mapping, and load balancing for multiple GPUs/node
Cashmere skeleton
Kernel performance (GFLOP/s)
K-Means on a homogeneous GTX480 cluster
[Figure panels: scalability; absolute performance]
Heterogeneous performance
● Homogeneous: efficiency on 16 GTX480
● Heterogeneous: efficiency over total combined hardware
Lessons learned
● MCL
  ● Enables us to develop many optimized many-core kernels
  ● Key: stepwise refinement + multiple abstraction levels
● Cashmere
  ● High performance and automatic load balancing even when the many-core devices differ widely
  ● Efficiency >90% in 3 out of 4 applications in heterogeneous executions
Other approaches that deal with performance vs abstraction
● Domain specific languages
● Patterns, skeletons, frameworks
● Berkeley Dwarfs
Glasswing: Rethinking MapReduce
● Use accelerators (OpenCL) as mainstream feature
● Massive out-of-core data sets
● Scale vertically & horizontally
● Maintain MapReduce abstraction
Ismail El Helw, Rutger Hofman, Henri Bal [HPDC’2014, SC’2014]
Glasswing Pipeline
● Overlaps computation, communication & disk access
● Supports multiple buffering levels
GPU optimizations
● Glasswing framework does:
  ● Memory management
  ● Some shared memory optimizations
  ● Data movement, data staging
● Programmer:
  ● Focusses on the map and reduce kernels (using OpenCL)
  ● Can do kernel optimizations if needed
    ● Coalescing, memory banks, etc.
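The division of labour described here, where the programmer supplies only a map and a reduce function and the framework does the rest, can be illustrated with a tiny sequential MapReduce. This is a plain-Python sketch of the abstraction and not Glasswing's API. In Glasswing the two kernels are written in OpenCL, and the framework adds data staging, overlap and distribution. All names below are hypothetical.

```python
from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn):
    """Framework side: run the map, group values by key, run the reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# Programmer side: only these two functions are application-specific.
def word_count_map(line):
    for word in line.split():
        yield word.lower(), 1

def word_count_reduce(word, counts):
    return sum(counts)

lines = ["GPU programming is hard", "GPU tuning is harder"]
counts = map_reduce(lines, word_count_map, word_count_reduce)
print(counts)
```

Because the application never touches the grouping step, a framework is free to reimplement it with buffering, pipelining and accelerators without changing user code.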
Glasswing vs. Hadoop: 64-node CPU Infiniband cluster
Glasswing vs. Hadoop: 16-Node GTX480 GPU Cluster
Performance K-Means
[Figure series: Hadoop; Glasswing CPU; GPMR compute; Glasswing GPU]
Compute Device Comparison
Lessons learned
● Scalable MapReduce framework combining coarse-grained and fine-grained parallelism
● Handles out-of-core data, sticks with MapReduce model
● Overlaps kernel executions with memory transfers, network communication and disk access
● Outperforms Hadoop by 1.2 – 4x on CPUs and 20 – 30x on GPUs
Discussion
● eScience applications help us to
  ● Understand the complexity of GPU programming
  ● Validate our ideas and software
  ● Give inspiration for new CS research
● Applications do need performance of GPUs
  ● Next in line: SKA, digital forensics, water management …
● GPU programming and optimization is too time-consuming for real applications
Discussion
● Dealing with performance
  ● GPU programs need many complex optimizations to obtain high performance
  ● Auto-tuning, performance modelling, machine learning, compiler-based reasoning
● How to deal with the tension between abstraction-level and control?
  ● New programming methodologies that allow a choice
  ● Frameworks that do separation of concerns
Questions?