Presentation - IEEE High Performance Extreme Computing

General Purpose Computing on Graphics
Processing Units: Optimization Strategy
Henry Au
Space and Naval Warfare Center Pacific
henry.au@navy.mil
09/12/12
Outline
▼ Background
▼ NVIDIA’s CUDA
▼ Decomposition & Porting
▼ CUDA Optimizations
▼ GPU Results
▼ Conclusion
Background
▼ Parallel Programming on GPUs
 General-Purpose Computation on Graphics Processing Units (GPGPU)
 Compute Unified Device Architecture (CUDA)
 Open Computing Language (OpenCL™)
Background
▼ GPUs vs. CPUs
 GPU and CPU cores are not the same
 A CPU core is faster and more robust, but a CPU has far fewer cores
 A GPU core is neither as robust nor as fast, but GPUs handle repetitive tasks quickly and in parallel
▼ NVIDIA GeForce GTX 470
 448 cores
 Memory Bandwidth = 133.9 GB/sec
 544.32 GFLOPS DP
▼ Intel Core i7-965
 4 cores
 Memory Bandwidth = 25.6 GB/sec
 69.23 GFLOPS DP
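On paper, these specifications give the GTX 470 roughly 7.9× the double-precision throughput (544.32 vs. 69.23 GFLOPS) and about 5.2× the memory bandwidth (133.9 vs. 25.6 GB/sec) of the Core i7-965; realizing that advantage requires enough parallel work to keep all 448 cores busy.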
CUDA by NVIDIA
▼ Compute Unified Device Architecture
 Low- and high-level APIs available
 C for CUDA
 High-latency memory transfers
 Limited cache
 Scalable programming model
 Requires NVIDIA graphics cards
Decomposition and Porting
▼ Amdahl's and Gustafson's Laws
▼ Estimate speedup before porting
 P is the amount of parallel scaling achieved
 γ is the fraction of the algorithm that is serial
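The slide leaves the formulas implicit; in their standard statements, with γ and P defined as above, the two laws estimate the speedup as

\[ S_{\mathrm{Amdahl}} = \frac{1}{\gamma + \frac{1 - \gamma}{P}} \qquad\qquad S_{\mathrm{Gustafson}} = P - \gamma\,(P - 1) \]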
Decomposition and Porting
▼ TAU Profile
 Determine call paths and consider subroutine calls
 Pay attention to large for-loops and redundant computations
▼ Visual Studio 2008
 Initialize the profiler: TAU_PROFILE("StartFor", "Main", TAU_USER);
 Place timers around suspect code (a sketch follows below):
− TAU_START("FunctionName");
− TAU_STOP("FunctionName");
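As a minimal sketch of that instrumentation (the filter routine, frame size, and per-pixel arithmetic are hypothetical stand-ins; TAU must be installed and its include and library paths configured in the project):

#include <TAU.h>   // TAU instrumentation API

// Hypothetical hot routine identified from the call-path profile.
void filter_frame(float* px, int n)
{
    TAU_START("filter_frame_loop");     // timer around the large for-loop
    for (int i = 0; i < n; ++i)
        px[i] = 0.5f * px[i] + 1.0f;    // placeholder per-pixel work
    TAU_STOP("filter_frame_loop");      // stop the same named timer
}

int main()
{
    TAU_PROFILE("StartFor", "Main", TAU_USER);   // initialize profiling in main
    static float frame[960 * 720];
    filter_frame(frame, 960 * 720);
    return 0;
}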
Decomposition and Porting
▼ CUDA Overhead
 High latency associated with memory transfers
 Can be hidden by large amounts of mathematical computation
 Reduce device-to-host memory transfers
− Prefer fewer, larger transfers over many small ones (see the sketch below)
− Perform even serial tasks on the parallel processors when that avoids a round trip
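A minimal sketch of the batching idea (array names and sizes are illustrative): three small host arrays are packed into one staging buffer so a single transfer pays the per-call latency once instead of three times.

#include <cuda_runtime.h>
#include <cstring>

int main()
{
    const size_t N = 1 << 16;
    static float a[N], b[N], c[N];                  // three small host arrays

    // Pack the arrays into one contiguous staging buffer on the host...
    static float staged[3 * N];
    std::memcpy(staged,         a, N * sizeof(float));
    std::memcpy(staged + N,     b, N * sizeof(float));
    std::memcpy(staged + 2 * N, c, N * sizeof(float));

    // ...so one large copy replaces three small ones.
    float* d_staged = nullptr;
    cudaMalloc(&d_staged, 3 * N * sizeof(float));
    cudaMemcpy(d_staged, staged, 3 * N * sizeof(float), cudaMemcpyHostToDevice);

    cudaFree(d_staged);
    return 0;
}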
CUDA Optimizations
▼ Thread and Block Occupancy
 Optimal values vary depending on the graphics card
▼ Page-Locked Memory
 cudaHostAlloc()
 A limited resource that should not be overused
▼ Streams
 A queue of GPU operations, such as GPU computation "kernels" and memory copies
▼ Asynchronous Memory Calls
 Ensure calls are non-blocking
 cudaMemcpyAsync() or a kernel launch (see the sketch below)
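A minimal sketch tying the last three items together (buffer size and names are illustrative): cudaMemcpyAsync() is only truly non-blocking when the host buffer is page-locked, so the two techniques go hand in hand.

#include <cuda_runtime.h>

int main()
{
    const size_t bytes = 1 << 20;

    // Page-locked host buffer: required for truly asynchronous copies.
    float* h_data = nullptr;
    cudaHostAlloc(&h_data, bytes, cudaHostAllocDefault);

    float* d_data = nullptr;
    cudaMalloc(&d_data, bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Non-blocking: the call queues the copy on the stream and returns
    // immediately, letting the CPU do other work in the meantime.
    cudaMemcpyAsync(d_data, h_data, bytes, cudaMemcpyHostToDevice, stream);

    cudaStreamSynchronize(stream);   // wait for the queued work to finish

    cudaStreamDestroy(stream);
    cudaFree(d_data);
    cudaFreeHost(h_data);            // pinned memory is freed with cudaFreeHost
    return 0;
}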
Thread Occupancy
▼ Ensure enough threads are operating at the same time
 256 threads per block performed best for the ALF workload
 Maximum of 1024 threads per block on this card
 Monitor occupancy (a launch sketch follows the figure below)
[Figure: ALF Threads Per Block vs. Frames Per Second — ALF FPS plotted for block sizes from 4 to 1024 threads, peaking near 256 threads per block]
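As a minimal sketch of how the chosen block size drives the launch configuration (the kernel and frame size are illustrative), the grid size is rounded up so every element gets a thread:

#include <cuda_runtime.h>

// Illustrative kernel: one thread per pixel.
__global__ void scale(float* px, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)               // guard: the last block may run past n
        px[i] *= 0.5f;
}

int main()
{
    const int n = 960 * 720;
    float* d_px = nullptr;
    cudaMalloc(&d_px, n * sizeof(float));

    const int threadsPerBlock = 256;   // best value found in testing above
    const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // round up
    scale<<<blocks, threadsPerBlock>>>(d_px, n);

    cudaDeviceSynchronize();
    cudaFree(d_px);
    return 0;
}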
CUDA Optimizations
▼ Page Locked Host Memory
 cudaHostAlloc() vs. malloc vs. new
[Figure: Processing Time (ms) vs. Data Processed (MB) for new, malloc, and cudaHostAlloc allocations, over 0 to 14 MB]
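For reference, a minimal sketch of the three allocations the figure compares (the size is illustrative); each needs its matching release call.

#include <cuda_runtime.h>
#include <cstdlib>

int main()
{
    const size_t n = 1 << 20;

    float* a = new float[n];                             // C++ pageable memory
    float* b = (float*)std::malloc(n * sizeof(float));   // C pageable memory

    float* c = nullptr;
    cudaHostAlloc(&c, n * sizeof(float),                 // page-locked memory:
                  cudaHostAllocDefault);                 // fast DMA, limited resource

    delete[] a;
    std::free(b);
    cudaFreeHost(c);                        // pinned memory uses cudaFreeHost
    return 0;
}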
CUDA Optimizations
▼ Stream Structure Non-Optimized
 Processing time: 49.5ms
cudaMemcpyAsync(d_dataA0, h_dataA0, size, cudaMemcpyHostToDevice, stream0);
cudaMemcpyAsync(d_dataB0, h_dataB0, size, cudaMemcpyHostToDevice, stream0);
kernel<<<blocks, threads, 0, stream0>>>(d_result0, d_dataA0, d_dataB0);
cudaMemcpyAsync(h_result0, d_result0, size, cudaMemcpyDeviceToHost, stream0);
cudaMemcpyAsync(d_dataA1, h_dataA1, size, cudaMemcpyHostToDevice, stream1);
cudaMemcpyAsync(d_dataB1, h_dataB1, size, cudaMemcpyHostToDevice, stream1);
kernel<<<blocks, threads, 0, stream1>>>(d_result1, d_dataA1, d_dataB1);
cudaMemcpyAsync(h_result1, d_result1, size, cudaMemcpyDeviceToHost, stream1);
CUDA Optimizations
▼ Stream Structure Optimized
 Processing time: 49.4ms
cudaMemcpyAsync(d_dataA0, h_dataA0, size, cudaMemcpyHostToDevice, stream0);
cudaMemcpyAsync(d_dataA1, h_dataA1, size, cudaMemcpyHostToDevice, stream1);
cudaMemcpyAsync(d_dataB0, h_dataB0, size, cudaMemcpyHostToDevice, stream0);
cudaMemcpyAsync(d_dataB1, h_dataB1, size, cudaMemcpyHostToDevice, stream1);
kernel<<<blocks, threads, 0, stream0>>>(d_result0, d_dataA0, d_dataB0);
kernel<<<blocks, threads, 0, stream1>>>(d_result1, d_dataA1, d_dataB1);
cudaMemcpyAsync(h_result0, d_result0, size, cudaMemcpyDeviceToHost, stream0);
cudaMemcpyAsync(h_result1, d_result1, size, cudaMemcpyDeviceToHost, stream1);
CUDA Optimizations
▼ Stream Structure Optimized & Modified
 Processing time: 41.1ms
cudaMemcpyAsync(d_dataA0, h_dataA0, size, cudaMemcpyHostToDevice, stream0);
cudaMemcpyAsync(d_dataA1, h_dataA1, size, cudaMemcpyHostToDevice, stream1);
cudaMemcpyAsync(d_dataB0, h_dataB0, size, cudaMemcpyHostToDevice, stream0);
cudaMemcpyAsync(d_dataB1, h_dataB1, size, cudaMemcpyHostToDevice, stream1);
kernel<<<blocks, threads, 0, stream0>>>(d_result0, d_dataA0, d_dataB0);
cudaMemcpyAsync(h_result0, d_result0, size, cudaMemcpyDeviceToHost, stream0);
kernel<<<blocks, threads, 0, stream1>>>(d_result1, d_dataA1, d_dataB1);
cudaMemcpyAsync(h_result1, d_result1, size, cudaMemcpyDeviceToHost, stream1);
CUDA Optimizations
▼ Stream Structure Is Not Always Beneficial
 Stream overhead can result in a net performance reduction
 Profile to compare kernel execution time against data transfer time (see the sketch below)
− NVIDIA Visual Profiler
− cudaEventRecord()
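A minimal sketch of event-based timing (the kernel and problem size are illustrative); wrapping a cudaMemcpyAsync() in the same pair of events yields the transfer time for comparison.

#include <cuda_runtime.h>
#include <cstdio>

__global__ void work_kernel(float* p, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] *= 2.0f;                 // placeholder work
}

int main()
{
    const int n = 1 << 20;
    float* d_p = nullptr;
    cudaMalloc(&d_p, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);                  // mark the stream before the kernel
    work_kernel<<<(n + 255) / 256, 256>>>(d_p, n);
    cudaEventRecord(stop);                   // mark the stream after the kernel
    cudaEventSynchronize(stop);              // wait until 'stop' is reached

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // GPU-side elapsed time in milliseconds
    std::printf("kernel: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_p);
    return 0;
}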
GPU Results
▼ Optimization Stages
 0: No Optimizations (65 FPS)
 1: Page-Locked Memory (67 FPS)
 2: Asynchronous GPU Calls (81 FPS)
 3: Non-Optimized Streaming (82 FPS)
 4: Optimized Streaming (85 FPS)
[Figure: ALF Processing Speed (FPS) vs. Optimization Stage — 65.14, 67.05, 81.74, 82.88, and 85.51 FPS at stages 0 through 4]
GPU Results
▼ ALF CPU vs. GPU Processing
Adaptive Linear Filter processing speed (FPS) vs. image height (4:3 aspect):

Image Height | CPU FPS | GPU FPS
624          | 77.64   | 92.78
720          | 67.78   | 85.51
1248         | 20.05   | 31.95
1440         | 17.07   | 26.13
1872         | 8.92    | 15.23
2160         | 7.59    | 12.91
Conclusion
▼ Test various threads-per-block allocations
▼ Use page-locked memory for data transfers
 Pair with asynchronous memory transfers and non-blocking calls
▼ Ensure proper coordination of streams
 Exploit both data parallelism and task parallelism
QUESTIONS?
References
▼ Amdahl, G., "Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities." AFIPS Spring Joint Computer Conference, 1967.
▼ CUDA C Best Practices Guide, Version 4.0, May 2011.
▼ Gustafson, J., "Reevaluating Amdahl's Law." Communications of the ACM, Vol. 31, No. 5, May 1988.
▼ Sanders, J. and Kandrot, E., CUDA by Example: An Introduction to General-Purpose GPU Programming. Addison-Wesley, 2011.
▼ NVIDIA CUDA C Programming Guide, Version 4.0, May 6, 2011.
▼ TAU User Guide. Department of Computer and Information Science, University of Oregon, Advanced Computing Laboratory, 2011.