Communication-Minimizing 2D Convolution in GPU Registers

Forrest N. Iandola
David Sheffield
Michael Anderson
P. Mangpo Phothilimthana
Kurt Keutzer
University of California, Berkeley
forresti@eecs.berkeley.edu
Overview
• Convolution is a recurring computational pattern in a broad range of computer vision applications
• Memory communication is the bottleneck for convolution on modern GPUs
• How to minimize memory communication overhead in convolution:
  – Texture cache
  – Loop blocking
• Up to 4.5x speedup over existing GPU implementations from NVIDIA, OpenCV, and others
Why focus on convolution?
• The Berkeley ParLab project identified 15 recurring computational patterns in computer vision
  [Figure: the 15 computer vision patterns]
• Small filters (2x2 – 7x7)
• Feature extraction
• Sliding-window object detection
• If we want fast computer vision, we need fast convolution
What limits the performance of convolution?
• The roofline model [1] divides a program's execution time into two parts:
  – Computational cost (GFLOP/s)
  – Communication cost (GB/s) – memory traffic, I/O, etc.
• No program can outperform the hardware bound on computation or communication

[1] S. Williams, A. Waterman, and D. Patterson. Roofline: An Insightful Visual Performance Model for Floating Point Programs and Multicore Architectures. Communications of the ACM, 2009.
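As a back-of-the-envelope illustration of which side of the roofline convolution lands on (this example, and the approximate GTX680 peak figures in it, are mine rather than from the slides):

    FLOPs per output pixel (3x3 filter):  9 multiply-adds                ~ 18 FLOPs
    Bytes per output pixel (no reuse):    9 float loads + 1 float store  = 40 bytes
    Arithmetic intensity:                 18 / 40                        ~ 0.45 FLOP/byte
    GTX680 machine balance:               ~3 TFLOP/s / ~192 GB/s         ~ 16 FLOP/byte

An arithmetic intensity far below the machine balance means a straightforward convolution kernel is bounded by memory communication, not by computation.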
What limits the performance of convolution?
[Figure: roofline model of computational performance, with attainable performance ranging from slow to fast and a computation-bounded region.]
What limits the performance of convolution?
• Convolution on NVIDIA GPUs:
  – Communication between the GPU's off-chip DRAM and on-chip caches is the bottleneck
  – This doesn't include communication between the CPU and GPU, though that can also be an issue
• If we want fast computer vision, we need fast convolution.
• If we want fast convolution on GPUs, we need to optimize memory communication.
Exploiting the GPU Memory Architecture
[Figure: memory hierarchy of the NVIDIA GTX680. Per-multiprocessor memory (registers and L1 cache / shared memory) serves the threads at 893 GB/s, and the texture cache delivers 129 Gtexels/s; the L2 cache and GPU global memory (DRAM) provide 123 GB/s, and CPU DRAM is reachable at 8 GB/s.]
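The figure above shows why the texture cache is one of the two optimizations listed in the Overview: it is a separate, high-throughput read path. Below is a minimal sketch of how an input image can be read through the texture cache with the CUDA runtime API. This is my own illustration, not code from the talk; makeImageTexture, d_image, width, height, and pitchInBytes are placeholder names.

    #include <cuda_runtime.h>
    #include <cstring>

    // Wrap a pitched 2D float image already resident in GPU global memory in a
    // texture object, so kernel reads of the image go through the texture cache.
    cudaTextureObject_t makeImageTexture(float* d_image, int width, int height,
                                         size_t pitchInBytes) {
        cudaResourceDesc resDesc;
        std::memset(&resDesc, 0, sizeof(resDesc));
        resDesc.resType = cudaResourceTypePitch2D;
        resDesc.res.pitch2D.devPtr = d_image;
        resDesc.res.pitch2D.desc = cudaCreateChannelDesc<float>();
        resDesc.res.pitch2D.width = width;
        resDesc.res.pitch2D.height = height;
        resDesc.res.pitch2D.pitchInBytes = pitchInBytes;

        cudaTextureDesc texDesc;
        std::memset(&texDesc, 0, sizeof(texDesc));
        texDesc.addressMode[0] = cudaAddressModeClamp;  // clamp reads at the image border
        texDesc.addressMode[1] = cudaAddressModeClamp;
        texDesc.filterMode = cudaFilterModePoint;       // no interpolation
        texDesc.readMode = cudaReadModeElementType;
        texDesc.normalizedCoords = 0;                   // use pixel coordinates

        cudaTextureObject_t tex = 0;
        cudaCreateTextureObject(&tex, &resDesc, &texDesc, nullptr);
        return tex;
    }

    // Inside a kernel, a pixel is then fetched with
    //     float v = tex2D<float>(tex, x + 0.5f, y + 0.5f);
    // which is served by the texture cache rather than the regular L1 path.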
Data Reuse with Loop Blocking
[Figure: in a typical implementation, each output pixel is computed from 9 input pixels (a 3x3 filter), with no data reuse at the register level.]
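The typical implementation in the figure above can be sketched as a CUDA kernel in which each thread produces one output pixel and issues nine global-memory reads for it. This is my own minimal illustration of that baseline (not code from the paper); it assumes a single-channel float image and a 3x3 filter in constant memory, skips border handling, and conv3x3_naive / c_filter are placeholder names.

    // 3x3 filter coefficients in constant memory (correlation form; flip the
    // filter indices for true convolution).
    __constant__ float c_filter[3 * 3];

    // Naive version: one thread per output pixel, 9 global loads per output,
    // no data reuse at the register level. Output is indexed by the window's
    // top-left corner; border pixels are skipped in this sketch.
    __global__ void conv3x3_naive(const float* __restrict__ in, float* out,
                                  int width, int height) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x + 2 >= width || y + 2 >= height) return;

        float acc = 0.0f;
        for (int fy = 0; fy < 3; fy++) {
            for (int fx = 0; fx < 3; fx++) {
                acc += in[(y + fy) * width + (x + fx)] * c_filter[fy * 3 + fx];
            }
        }
        out[y * width + x] = acc;
    }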
Data Reuse with Loop Blocking
[Figure: a typical implementation loads 9 input pixels per output pixel with no register-level reuse; the blocked version computes 4 output pixels per thread from 16 input pixels (4 inputs per output).]
Our approach: reuse data by doing more work per thread
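A minimal sketch of that register-blocking idea follows (my illustration, not the kernel released with the paper): each thread computes a 2x2 tile of outputs from a 4x4 input patch held in registers, so each loaded pixel is reused for up to four outputs and the global loads per output drop from 9 to 4. It reuses the c_filter constant array declared in the previous sketch and again omits border handling.

    // Register-blocked version: each thread computes a 2x2 tile of output pixels.
    // The 4x4 input patch is loaded once and reused across the 4 outputs.
    __global__ void conv3x3_blocked(const float* __restrict__ in, float* out,
                                    int width, int height) {
        int x0 = 2 * (blockIdx.x * blockDim.x + threadIdx.x);  // top-left of the 2x2 tile
        int y0 = 2 * (blockIdx.y * blockDim.y + threadIdx.y);
        if (x0 + 3 >= width || y0 + 3 >= height) return;       // skip the ragged edge

        // 4x4 input patch; with full unrolling (all indices compile-time
        // constants) the compiler can keep this small array in registers.
        float patch[4][4];
        #pragma unroll
        for (int i = 0; i < 4; i++) {
            #pragma unroll
            for (int j = 0; j < 4; j++) {
                patch[i][j] = in[(y0 + i) * width + (x0 + j)];
            }
        }

        // Each of the 4 outputs reads a 3x3 window of the register-resident patch.
        #pragma unroll
        for (int oy = 0; oy < 2; oy++) {
            #pragma unroll
            for (int ox = 0; ox < 2; ox++) {
                float acc = 0.0f;
                #pragma unroll
                for (int fy = 0; fy < 3; fy++) {
                    #pragma unroll
                    for (int fx = 0; fx < 3; fx++) {
                        acc += patch[oy + fy][ox + fx] * c_filter[fy * 3 + fx];
                    }
                }
                out[(y0 + oy) * width + (x0 + ox)] = acc;
            }
        }
    }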
Comparison with Related Work
[Figure: inverse roofline model on an NVIDIA GTX680 (Kepler), comparing related implementations with ours (texture cache and blocking); our implementation achieves up to a 4.5x speedup.]
Are we done?
• Are we done optimizing memory communication?
• I think so. We achieved the memory bandwidth bound for small filters.
• Future work: optimize computation some more!
Conclusions
• If we want fast computer vision, we need fast convolution.
• If we want fast convolution on GPUs, we need to optimize memory communication.
• Up to 4.5x faster than existing GPU languages and libraries
• Download our code! https://github.com/forresti/convolution
  – Use/modify it for your language/library/application