Harnessing the Power of GPUs for
Non-Graphics Applications
Michael Boyer
Department of Computer Science
University of Virginia
Advisor: Kevin Skadron
© 2010 Michael Boyer
Outline
• GPU architecture
• Programming GPUs using CUDA
• Case study: Leukocyte Tracking
• Current work: CPU-GPU Task Sharing
Graphics Processors
• Graphics Processing Units (GPUs) are designed
specifically for graphics rendering applications
[Image courtesy of GameSpot]
Graphics Applications
• Graphics applications involve applying the same
operation to many pieces of data
• Application characteristics:
– Massively parallel
– Only aggregate performance matters
CPU vs. GPU: Architectural Difference 1
[Figure: block diagram of a CPU core (fetch/decode, branch predictor, register file, out-of-order logic, execute units, memory pre-fetcher, data cache) next to a GPU core with most of those structures removed]

Avoid structures that only improve single-thread performance
CPU vs. GPU: Architectural Difference 2
[Figure: in the GPU core, a single fetch/decode unit is shared by a group of threads, each with its own register file (RF) and execution unit (EXE)]

Amortize the overhead of control logic across multiple execution units (SIMD processing)
CPU vs. GPU: Architectural Difference 3
[Figure: the GPU core holds multiple thread groups (Thread Groups 1-4 in the diagram), each with its own register files, all sharing the same fetch/decode logic and execution units]

Use multiple groups of threads to keep execution units busy and hide memory latency
CPU vs. GPU: Architectural Difference 4
[Figure: the CPU replicates a few large cores (Cores 1-4 in the diagram), while the GPU replicates many small SIMD cores (Cores 1-30)]

Replicate cores to leverage more parallelism
CPU vs. GPU: Architectural Differences
• Summary: take advantage of abundant parallelism
– Lots of threads, so focus on aggregate performance
– Parallelism in space:
• SIMD processing in each core
• Many independent SIMD cores across the chip
– Parallelism in time:
• Multiple SIMD groups in each core
CPU vs. GPU: Peak Performance
Processor Type | Product                    | Throughput (GFLOPs) | Memory Bandwidth (GB/s) | Cost
CPU            | Intel Xeon W5590 (Nehalem) | 107                 | 32                      | $1,700
GPU            | AMD Radeon HD 5870         | 2,720               | 154                     | $450
• Note that these are peak numbers
• What we really care about is performance on real-world applications
General-Purpose Computing on GPUs
• Lots of recent interest in using GPUs to run non-graphics applications (GPGPU)
• Why GPUs? Why now?
– Recent increases in performance via parallelism
– Recent increases in programmability
– Ubiquity in multiple market segments
• Old approach: graphics languages
• New approach: GPGPU languages
– OpenCL, CUDA
CUDA
• Programming model for running general-purpose
applications on NVIDIA GPUs
• Extension to the C programming language
• GPU is a co-processor:
– Main program runs on the CPU
– Large computations (kernels) are offloaded to the GPU
– CPU and GPU have separate memory, so data must be
transferred back and forth
CUDA: Typical Program Structure
void function(…) {
    Allocate memory on the GPU
    Transfer input data to the GPU
    Launch kernel on the GPU
    Transfer output data to CPU
}

__global__ void kernel(…) {
    Code executed on the GPU goes here…
}

[Figure: the CPU runs function() against CPU memory; the GPU runs kernel() against GPU memory; data moves between the two memories]
CUDA: Typical Program Transformation
for (i = 0; i < N; i++) {
    Process array element i
}

Body of loop becomes body of kernel:

__global__ void kernel(…) {
    Determine this thread’s value of i
    Process array element i
}
CUDA Kernel
• Scalar program invoked across many threads
– Typically one thread per data element
• Overall computation decomposed into a grid of thread
blocks
– Thread blocks are independent and cannot communicate
(with some exceptions)
– Threads within the same block can communicate
[Figure: the grid of a kernel decomposed into Thread Block 1 through Thread Block 5]
Simple Example: Vector Addition
C = A + B

A = [ 1  2  3  4  5  6  7  8 ]
B = [ 9 10 11 12 13 14 15 16 ]
C = [10 12 14 16 18 20 22 24 ]
C Code
float *CPU_add_vectors(float *A, float *B, int N) {
    // Allocate memory for the result
    float *C = (float *) malloc(N * sizeof(float));

    // Compute the sum
    for (int i = 0; i < N; i++) {
        C[i] = A[i] + B[i];
    }

    // Return the result
    return C;
}
CUDA Kernel
// GPU kernel that computes the vector sum C = A + B
// (each thread computes a single value of the result)
__global__ void kernel(float *A, float *B, float *C, int N) {
    // Determine which element this thread is computing
    int i = blockDim.x * blockIdx.x + threadIdx.x;

    // Compute a single element of the result vector
    if (i < N) {
        C[i] = A[i] + B[i];
    }
}
CUDA Host Code
float *GPU_add_vectors(float *A_CPU, float *B_CPU, int N) {
    // Allocate GPU memory for the inputs and the result
    int vector_size = N * sizeof(float);
    float *A_GPU, *B_GPU, *C_GPU;
    cudaMalloc((void **) &A_GPU, vector_size);
    cudaMalloc((void **) &B_GPU, vector_size);
    cudaMalloc((void **) &C_GPU, vector_size);

    // Transfer the input vectors to GPU memory
    cudaMemcpy(A_GPU, A_CPU, vector_size, cudaMemcpyHostToDevice);
    cudaMemcpy(B_GPU, B_CPU, vector_size, cudaMemcpyHostToDevice);

    // Execute the kernel to compute the vector sum on the GPU
    int num_blocks = ceil((double) N / (double) THREADS_PER_BLOCK);
    kernel <<< num_blocks, THREADS_PER_BLOCK >>> (A_GPU, B_GPU, C_GPU, N);

    // Transfer the result vector from the GPU to the CPU
    float *C_CPU = (float *) malloc(vector_size);
    cudaMemcpy(C_CPU, C_GPU, vector_size, cudaMemcpyDeviceToHost);

    // Free the GPU memory
    cudaFree(A_GPU);
    cudaFree(B_GPU);
    cudaFree(C_GPU);

    return C_CPU;
}
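For reference, a minimal driver that exercises both versions above might look like the following; the initialization values and the THREADS_PER_BLOCK constant are assumptions for illustration, not part of the original program.

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define THREADS_PER_BLOCK 256

float *CPU_add_vectors(float *A, float *B, int N);           // shown earlier
float *GPU_add_vectors(float *A_CPU, float *B_CPU, int N);   // shown earlier

int main(int argc, char **argv) {
    int N = (argc > 1) ? atoi(argv[1]) : 50000000;

    // Allocate and initialize the input vectors on the CPU
    float *A = (float *) malloc(N * sizeof(float));
    float *B = (float *) malloc(N * sizeof(float));
    for (int i = 0; i < N; i++) {
        A[i] = (float) i;
        B[i] = (float) (N - i);
    }

    // Compute the sum on the CPU and on the GPU
    float *C_CPU = CPU_add_vectors(A, B, N);
    float *C_GPU = GPU_add_vectors(A, B, N);

    // Check that the two results match
    for (int i = 0; i < N; i++) {
        if (C_CPU[i] != C_GPU[i]) {
            printf("Mismatch at element %d\n", i);
            return 1;
        }
    }
    printf("Results match\n");

    free(A); free(B); free(C_CPU); free(C_GPU);
    return 0;
}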
Example Program Output
./vector_add 50,000,000

GPU:
  Transfer to GPU:    0.236 sec
  Kernel execution:   0.005 sec
  Transfer from GPU:  0.152 sec
  Total:              0.404 sec

CPU: 0.136 sec

Execution: GPU outperformed CPU by 27.2x
Overall: CPU outperformed GPU by 2.97x

Vector addition does not do enough work per memory operation to justify offload!
Case Study:
Leukocyte Tracking
Leukocyte Tracking
• Important for evaluating inflammatory drugs
• Velocity measured by tracking leukocytes through
multiple frames
• Current approaches:
– Manual analysis: 1 minute video in tens of hours
– MATLAB: 1 minute video in 5 hours
Goal: Leverage CUDA and a GPU to
accelerate leukocyte tracking
to near real-time speeds
Acceleration
1. Translation: convert MATLAB code to C
2. Parallelization:
– OpenMP for multi-core CPU
– CUDA for GPU
• Experimental setup:
– CPU: 3.2 GHz quad-core Intel Core 2 Extreme X9770
– GPU: NVIDIA GeForce GTX 280 (PCIe 2.0)
Tracking Algorithm
Inputs: Video frame
Location of cells in previous frame
Output: Location of cells in current frame
For each cell:
– Extract sub-image near cell’s old location
– Compute MGVF matrix over sub-image  (→ 99.8% of runtime)
– Evolve active contour using MGVF matrix
Computing the MGVF Matrix
• Motion Gradient Vector Flow
• MGVF matrix is approximated via an iterative solution
procedure
[Figure: sub-image near a cell and the resulting MGVF matrix]
MGVF Pseudo-code
MGVF = normalized sub-image gradient

do {
    Compute eight matrices based on current MGVF
    Compute Heaviside function across each matrix
    Update MGVF matrix
    Compute convergence criterion
} while (! converged)
Naïve CUDA Implementation
[Chart of speedup over MATLAB: C 2.0x; C + OpenMP 7.7x; Naïve CUDA 0.8x]
• Kernel is called ~50,000 times per frame
• Amount of work per call is small
• Runtime dominated by CUDA overheads:
– Memory allocation, memory copying, kernel call overhead
Kernel Overhead
• Kernel calls are not cheap!
– Overhead of one kernel call: 9 microseconds
– Overhead of one CPU function call: 3 nanoseconds
– Kernel call is 3,000 times more expensive
• Heaviside kernel:
– 27% of kernel runtime due to computation
– 73% of kernel runtime due to kernel overhead
Lesson 1: Reduce Kernel Overhead
• Increase amount of work per kernel call
– Decrease total number of kernel calls
– Amortize overhead of each kernel call across more computation (see the sketch below)
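A generic sketch of the fusion idea (hypothetical kernels, not the actual leukocyte code): instead of launching one tiny kernel per processing step, the steps are combined so that each launch carries more work.

// Separate tiny kernels: each launch pays the fixed kernel-call overhead
__global__ void scale_step(float *data, int N) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N) data[i] *= 2.0f;
}
__global__ void offset_step(float *data, int N) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N) data[i] += 1.0f;
}

// Fused kernel: the same work in a single launch, so the per-call
// overhead is amortized across twice as much computation
__global__ void fused_step(float *data, int N) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N) data[i] = data[i] * 2.0f + 1.0f;
}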
Larger Kernel Implementation
MGVF = normalized sub-image gradient

do {
    Compute eight matrices based on current MGVF
    Compute Heaviside function across each matrix
    Update MGVF matrix
    Compute convergence criterion
} while (! converged)
Larger Kernel Implementation
[Chart of speedup over MATLAB: C 2.0x; C + OpenMP 7.7x; Naïve CUDA 0.8x; Larger Kernel 6.3x]

[Chart of the resulting runtime breakdown: Memory Allocation 71%; Memory Copying 15%; Kernel Execution 9%]
Memory Allocation Overhead
[Chart: time per call (microseconds, log scale) vs. megabytes allocated per call, comparing malloc (CPU memory) and cudaMalloc (GPU memory)]
Lesson 2: Reduce Memory Management Overhead
• Reduce the number of memory allocations
– Allocate memory once and reuse it throughout the
application
– If memory size is not known a priori, estimate and only reallocate if the estimate is too small (see the sketch below)
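A minimal sketch of the allocate-once pattern, using a hypothetical helper that caches a GPU buffer and grows it only when a request exceeds the current capacity (illustrative, not the original leukocyte code):

#include <cuda_runtime.h>

// Cached GPU buffer reused across calls
static void *gpu_buffer = NULL;
static size_t gpu_buffer_capacity = 0;

void *get_gpu_buffer(size_t bytes) {
    if (bytes > gpu_buffer_capacity) {
        // Existing buffer is too small: free it and allocate a larger one
        if (gpu_buffer != NULL) cudaFree(gpu_buffer);
        // Over-allocate by 25% so small size increases do not trigger
        // another expensive cudaMalloc on the next call
        gpu_buffer_capacity = bytes + bytes / 4;
        cudaMalloc(&gpu_buffer, gpu_buffer_capacity);
    }
    return gpu_buffer;
}

Each call then reuses the same allocation instead of repeatedly paying the cudaMalloc cost shown in the previous chart.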
Reduced Allocation Implementation
[Chart of speedup over MATLAB: C 2.0x; C + OpenMP 7.7x; Naïve CUDA 0.8x; Larger Kernel 6.3x; Reduced Allocation 25.4x]

[Chart of the resulting runtime breakdown: Memory Allocation 3%; Memory Copying 56%; Kernel Execution 31%]
Memory Transfer Overhead
[Chart: transfer time (milliseconds, log scale) vs. megabytes per transfer, for CPU-to-GPU and GPU-to-CPU transfers]
Lesson 3: Reduce Memory Transfer Overhead
• If the CPU operates on values produced by the GPU:
– Move the operation to the GPU
– May improve performance even if the operation itself is slower on the GPU (see the reduction sketch below)
[Figure: timeline comparing the two options: performing the operation on the CPU requires a memory transfer of the values produced by the GPU and another transfer of the values consumed by the GPU, while performing the operation on the GPU avoids both transfers]
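For example, rather than copying the whole MGVF matrix back to the CPU just to evaluate the convergence criterion, a reduction can run on the GPU so that only a handful of partial sums cross the bus. The kernel below is a simple, unoptimized sketch, assuming 256 threads per block; it is not the tuned reduction used in the actual implementation.

// Each thread block reduces its chunk of the input to one partial sum in
// shared memory; only the small block_sums array is copied to the CPU.
__global__ void partial_sums(const float *in, float *block_sums, int N) {
    __shared__ float buffer[256];          // assumes blockDim.x == 256
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    buffer[tid] = (i < N) ? in[i] : 0.0f;
    __syncthreads();

    // Tree reduction within the thread block
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) buffer[tid] += buffer[tid + stride];
        __syncthreads();
    }

    if (tid == 0) block_sums[blockIdx.x] = buffer[0];
}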
GPU Reduction Implementation
MGVF = normalized sub-image gradient

do {
    Compute eight matrices based on current MGVF
    Compute Heaviside function across each matrix
    Update MGVF matrix
    Compute convergence criterion
} while (! converged)
GPU Reduction Implementation
[Chart of speedup over MATLAB: C 2.0x; C + OpenMP 7.7x; Naïve CUDA 0.8x; Larger Kernel 6.3x; Reduced Allocation 25.4x; GPU Reduction 60.7x]

[Chart of the resulting runtime breakdown: Memory Allocation 7%; Memory Copying 1%; Kernel Execution 80%]
Persistent Thread Block
MGVF = normalized sub-image gradient

do {
    Compute eight matrices based on current MGVF
    Compute Heaviside function across each matrix
    Update MGVF matrix
    Compute convergence criterion
} while (! converged)
Persistent Thread Block
• Problem: need a global memory fence
– Multiple thread blocks compute the MGVF matrix
– Thread blocks cannot communicate with each other
– So each iteration requires a separate kernel call
• Solution: compute entire matrix in one thread block
– Arbitrary number of iterations can be computed in a single kernel call
Persistent Thread Block: Example
[Figure: a 3x3 MGVF matrix processed two ways: in the canonical CUDA approach (1-to-1 mapping between threads and data elements), thread blocks 1-9 each cover part of the matrix; in the persistent thread block approach, a single thread block (1) covers the entire matrix]
Persistent Thread Block: Example
[Figure: with the canonical CUDA approach (1-to-1 mapping between threads and data elements), all of the GPU's SMs work on the thread blocks of a single cell (Cell 1); with the persistent thread block approach, each SM processes a different cell (Cells 1-9), so independent cells run concurrently]

SM = Streaming Multiprocessor (GPU core)
Lesson 4: Avoid Global Memory Fences
• Confine dependent computations to a single thread block
– Execute an iterative algorithm until convergence in a single kernel call (see the sketch below)
– Only efficient if there are multiple independent computations
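A skeletal example of the persistent-thread-block pattern (hypothetical names; the real MGVF update rule is omitted): a single thread block owns one cell's matrix and iterates to convergence inside one kernel call, using __syncthreads() in place of a global memory fence.

__global__ void persistent_iterate(float *matrix, int size, int max_iters) {
    __shared__ int not_converged;

    for (int iter = 0; iter < max_iters; iter++) {
        if (threadIdx.x == 0) not_converged = 0;
        __syncthreads();

        // Each thread updates a strided subset of the matrix
        // (placeholder update rule standing in for the MGVF update)
        for (int i = threadIdx.x; i < size; i += blockDim.x) {
            float old_val = matrix[i];
            float new_val = 0.5f * old_val + 0.1f;
            matrix[i] = new_val;
            if (fabsf(new_val - old_val) > 1e-4f) not_converged = 1;
        }
        __syncthreads();

        // Every thread reads the flag before thread 0 can reset it
        int done = (not_converged == 0);
        __syncthreads();
        if (done) break;
    }
}

One such block is launched per cell, so the GPU's SMs stay busy as long as there are several independent cells to track.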
Persistent Thread Block Implementation
[Chart of speedup over MATLAB: C 2.0x; C + OpenMP 7.7x; Naïve CUDA 0.8x; Larger Kernel 6.3x; Reduced Allocation 25.4x; GPU Reduction 60.7x; Persistent Thread Block 211.3x (27x over OpenMP)]
Absolute Performance
[Chart of frames per second (FPS): MATLAB 0.11; C 0.22; C + OpenMP 0.83; CUDA 21.6]
Video Example
Conclusions
• CUDA overheads can be significant bottlenecks
• CUDA provides enormous performance improvements
for leukocyte tracking
– 200x over MATLAB
– 27x over OpenMP
• Processing time reduced from > 4.5 hours to < 1.5 minutes
• Real-time analysis feasible in near future

M. Boyer, D. Tarjan, S. T. Acton, and K. Skadron. "Accelerating Leukocyte Tracking using CUDA: A Case Study in Leveraging Manycore Coprocessors." In Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS), May 2009.
Current work:
CPU-GPU Task Sharing
CPU-GPU Task Sharing
• Offloading decision is generally considered to be binary (GPU or CPU?)
CPU-GPU Task Sharing
• Offload decision does not need to be binary!
– Dividing a task between the CPU and GPU can provide improved performance over either device alone (see the illustrative sketch below)
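As a simplified illustration of the idea (not the proposed framework itself), the vector addition from earlier could be split so that the CPU computes the first portion of the elements while the GPU computes the rest; the split fraction and the assumption that the inputs already reside in GPU memory are hypothetical.

// Hypothetical work division: the first "split" elements on the CPU, the
// remaining N - split elements on the GPU (using the kernel shown earlier).
void add_vectors_shared(float *A, float *B, float *C,
                        float *A_GPU, float *B_GPU, float *C_GPU,
                        int N, float gpu_fraction) {
    int split = (int) (N * (1.0f - gpu_fraction));
    int gpu_count = N - split;

    // Launch the GPU portion first; the launch is asynchronous, so the
    // CPU is free to work on its own portion in the meantime
    if (gpu_count > 0) {
        int num_blocks = (gpu_count + 255) / 256;
        kernel <<< num_blocks, 256 >>> (A_GPU + split, B_GPU + split,
                                        C_GPU + split, gpu_count);
    }

    // CPU portion, computed while the GPU works
    for (int i = 0; i < split; i++) {
        C[i] = A[i] + B[i];
    }

    // Copy back only the GPU's part of the result (this call also waits
    // for the kernel to finish)
    if (gpu_count > 0) {
        cudaMemcpy(C + split, C_GPU + split, gpu_count * sizeof(float),
                   cudaMemcpyDeviceToHost);
    }
}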
Theoretical Performance
[Chart: performance normalized to the best single device vs. the ratio of GPU to CPU performance (0.01 to 100), for GPU only, CPU only, CPU+GPU with equal sharing, and CPU+GPU with optimal sharing]
Research Goal
1. Given an input program written in CUDA or OpenCL,
automatically generate a program that can execute
on the CPU and GPU concurrently
2. Automatically determine best division of work:
– When beneficial, share work between CPU and GPU
– Otherwise, execute on CPU or GPU exclusively
– Optimal decision can change at runtime:
• With different inputs
• With contention
Proposed System
OpenCL code
→ Source-to-source Translation Framework
  (transforms all GPU memory allocations, memory transfers, and kernel launches into a form supporting concurrent CPU-GPU execution)
→ Modified OpenCL code
→ OpenCL Compiler
→ CPU/GPU binary
Potential Problems
• One version of the kernel for multiple devices
– Optimizations for GPU may hurt performance on CPU and
vice versa
• Possible (but rare) for thread blocks to communicate
with each other
– Do we try to support this?
• Statically predicting data access patterns can be hard
(or even impossible for some applications)
Data Sharing
• If we cannot predict data access patterns statically,
then the CPU and the GPU must have a consistent
view of memory
[Figure: 1) the CPU and GPU each compute on their portion of the data; 2) data is transferred so that both memories keep a consistent view]
Data Sharing (2)
• If we can predict data access patterns statically, then
we can minimize the data transfer overhead
[Figure: 1) the CPU and GPU each compute on their portion of the data; 2) only the data each device actually needs is transferred]
Preliminary Results (HotSpot)
[Chart: execution time (seconds) vs. percent of computation on the GPU (0 to 100%) for HotSpot, comparing static analysis, dynamic analysis, and no sharing]
Conclusions
• GPUs are designed to provide good performance on
graphics workloads
– But they have evolved to support any workload with
abundant parallelism
• GPUs can provide large performance improvements
– But we need to take into account the overheads involved to
fully take advantage
• Allowing the CPU and GPU to work together can
provide an even larger performance improvement
Acknowledgements
• Funding provided by:
– NSF grant IIS-0612049
– SRC grant 1607.001
– NVIDIA research grant
– GRC AMD/Mahboob Kahn Ph.D. fellowship
• Equipment donated by NVIDIA
BACKUP
3D Rendering APIs
• High-level abstractions for rendering geometry
[Figure: graphics pipeline: Graphics Application → Vertex Program → Rasterization → Fragment Program → Display (courtesy of D. Luebke, NVIDIA)]
CUDA: Abstractions
1. Kernel function
– Mapped onto a grid of thread blocks
2. Scratchpad memory
– For sharing data within a thread block
3. Barrier synchronization
– For synchronizing within a thread block
Kernel Function
__global__ void kernel(int *in, int *out) {
    // Determine this thread’s index
    int i = threadIdx.x;

    // Add one to the input value
    out[i] = in[i] + 1;
}
Grid of Thread Blocks
• Grid: 2-dimensional, ≤ 4.3 billion blocks
• Thread block: 3-dimensional, ≤ 512 threads
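A sketch of launching a 2D grid of 1D thread blocks and recovering a flat element index inside the kernel; the grid-width constant and kernel are arbitrary choices for illustration.

__global__ void increment(float *data, int N) {
    // Flatten the 2D grid of 1D blocks into a single element index
    int block_id = blockIdx.y * gridDim.x + blockIdx.x;
    int i = block_id * blockDim.x + threadIdx.x;
    if (i < N) data[i] += 1.0f;
}

void launch_increment(float *data_GPU, int N) {
    int threads_per_block = 256;
    int total_blocks = (N + threads_per_block - 1) / threads_per_block;

    // Spread the blocks across two grid dimensions when one is not enough;
    // extra blocks in the last row do nothing because of the i < N guard
    int grid_x = (total_blocks < 4096) ? total_blocks : 4096;
    int grid_y = (total_blocks + grid_x - 1) / grid_x;
    dim3 grid(grid_x, grid_y);
    dim3 block(threads_per_block, 1, 1);

    increment <<< grid, block >>> (data_GPU, N);
}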
Launching a Kernel
int num_threads = ...;
int threads_per_block = 256;

// Determine how many thread blocks are needed
// (using either of the two methods shown below)
int num_blocks = ceil((double) num_threads / (double) threads_per_block);
int num_blocks = (num_threads + threads_per_block - 1) / threads_per_block;

// Make structures for grid and thread block dimensions
dim3 grid(num_blocks, 1);
dim3 thread_block(threads_per_block, 1, 1);

// Launch the kernel
kernel <<< grid, thread_block >>> (in, out);
Scratchpad Memory
• Each multiprocessor has 16 KB of software-controlled
shared memory
• Variables declared “__shared__” get mapped into this
memory
• Values can only be shared among threads within the
same thread block
Scratchpad Memory: Example
__global__ void kernel() {
    int i = threadIdx.x;

    // Compute some function
    int v = foo(i);

    // Write the value into shared memory
    __shared__ int values[THREADS_PER_BLOCK];
    values[i] = v;

    // Use the shared values
    ...
}
Barrier Synchronization
• __syncthreads() function
• Each thread waits for all other threads in the thread
block
• All values written by every thread are now visible to all
other threads
Barrier Synchronization: Example
__global__ void kernel(float *out, int N) {
    int i = threadIdx.x;
    __shared__ int values[THREADS_PER_BLOCK];
    values[i] = foo(i);

    // Wait to ensure all values have been written
    __syncthreads();

    // Compute average of two values
    out[i] = (values[i] + values[(i + 1) % N]) / 2.0f;
}
CUDA Overheads
• Driver initialization: 0.14 seconds
• Kernel launch: 13 μs
• GPU memory allocation and deallocation: orders of
magnitude slower than on CPU
• Memory transfer: 15 μs + 1 ms/MB
Acceleration using CUDA
[Figure: the CPU-side program allocates GPU memory, transfers input data, launches the CUDA kernel on the GPU, transfers results back, and frees GPU memory]

Step 1: Determine which code to offload to the GPU as a CUDA kernel
Step 2: Write the CPU-side CUDA code
Step 3: Write and optimize the GPU kernel
Performance Issues
• Branch divergence
• Memory coalescing
• Key concept: Warp
– Group of threads that execute concurrently
– In current hardware, warp size is 32 threads
Branch Divergence
• Remember: hardware is SIMD
• What if threads in the same warp follow two different
paths?
• Solution: entire warp executes both paths
– Unneeded values are simply ignored
– Performance can suffer with many divergent branches (see the example below)
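A small example of code that diverges; the kernels are hypothetical. In the first kernel, even-numbered and odd-numbered threads in the same warp follow different paths, so the warp executes both. Branching on a value that is uniform across a warp, as in the second kernel, avoids the penalty.

// Divergent: threads within a warp take different paths
__global__ void divergent(float *data, int N) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N) {
        if (i % 2 == 0) data[i] = data[i] * 2.0f;   // even threads
        else            data[i] = data[i] + 1.0f;   // odd threads
    }
}

// Uniform within a warp: all 32 threads of a warp take the same path
__global__ void warp_uniform(float *data, int N) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N) {
        if ((i / 32) % 2 == 0) data[i] = data[i] * 2.0f;
        else                   data[i] = data[i] + 1.0f;
    }
}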
Memory Coalescing
• Threads in the same half-warp access memory together
• If all threads access successive memory locations:
– All of the accesses are combined (coalesced)
– Result: significantly improved memory performance
• Otherwise:
– Each thread accesses memory separately
– Result: significantly reduced memory performance
Memory Coalescing: Examples
[Figure: coalesced accesses, where consecutive threads access consecutive memory locations, vs. a non-coalesced access pattern, where threads access scattered locations]
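A sketch of the two patterns using simple copy kernels (hypothetical examples, not from the original deck):

// Coalesced: consecutive threads in a half-warp read consecutive
// addresses, so the loads combine into a few wide memory transactions
__global__ void copy_coalesced(const float *in, float *out, int N) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N) out[i] = in[i];
}

// Non-coalesced: consecutive threads read addresses that are "width"
// elements apart, so each load becomes a separate memory transaction
__global__ void copy_strided(const float *in, float *out, int width, int height) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < width * height) {
        // Treat i as a column-major index into a row-major matrix
        int row = i % height;
        int col = i / height;
        out[i] = in[row * width + col];
    }
}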
Parallelization Granularity
[Figure: the CPU with its CPU memory and the GPU with its GPU memory]
Kernel Overhead Revisited
• Overhead depends on calling pattern:
– One at a time (synchronous): 9 microseconds
– Back-to-back (asynchronous): 3 microseconds
[Figure: timeline of the two patterns: in the synchronous case, each kernel call is followed by a memory transfer that implicitly synchronizes before the next call; in the asynchronous case, kernel calls are issued back-to-back with no synchronization between them]
Lesson 1 Revisited: Reduce Kernel Overhead
• Increase amount of work per kernel call
– Decrease total number of kernel calls
– Amortize overhead of each kernel call across more computation
• Launch kernels back-to-back (see the sketch below)
– Kernel calls are asynchronous: avoid explicit or implicit synchronization between kernel calls
– Overlap kernel execution on the GPU with driver access on the CPU
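A sketch of the back-to-back pattern (the update kernel is a hypothetical stand-in): because launches into the default stream are asynchronous and execute in order on the GPU, the host can enqueue every iteration before synchronizing once at the end.

__global__ void update_kernel(float *data, int N) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N) data[i] *= 0.9f;          // placeholder per-iteration update
}

void run_iterations(float *data_GPU, int N, int iterations) {
    int threads_per_block = 256;
    int num_blocks = (N + threads_per_block - 1) / threads_per_block;

    for (int iter = 0; iter < iterations; iter++) {
        // No memory transfer or explicit synchronization between calls,
        // so each launch only pays the cheaper asynchronous overhead
        update_kernel <<< num_blocks, threads_per_block >>> (data_GPU, N);
    }

    // Wait once, after all of the work has been submitted
    cudaThreadSynchronize();
}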