GPU Programming - California State University, Los Angeles

GPU PROGRAMMING
David Gilbert
California State University, Los Angeles
Outline
• CUDA
• CPU vs GPU Architecture
• Scalability
• Blocks
• Performance
• Speed Up
• Graphics Cards
• How It Works
• Program Flow
• When to Use the GPU
• Example: Matrix Row Sum
• References
CUDA
• Compute Unified Device Architecture (CUDA)
• High performance computing on your GPU
• CUDA is Nvidia's proprietary architecture for GPU computing;
OpenCL is an open alternative that also runs on AMD/ATI hardware
CPU vs GPU Architecture
• ALU does the computations
Scalability
• Code automatically scales upward
• GPUs with more cores will execute the same code in less time
• Can add additional graphics cards to your computer for further performance gains (at best roughly linear in the number of cards)
Blocks
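• A block is essentially a group of threads
• The number of blocks and the number of threads per block (ThreadsPerBlock) are defined on the host, before the kernel is launched
• To compute a thread's global index from its block and its position within the block
i = blockDim.x * blockIdx.x + threadIdx.x;
j = blockDim.y * blockIdx.y + threadIdx.y;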
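The index formula can be checked with a small sketch in which every thread writes its own global index into an array. The names `WriteGlobalIndex` and `GlobalIndices` are hypothetical and only serve the illustration; error checking is left out for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread computes its global index and stores it at that position.
__global__ void WriteGlobalIndex(int* out, int n)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n)
        out[i] = i;
}

// Hypothetical host helper: launches the kernel with the given
// configuration and returns the indices the threads wrote.
std::vector<int> GlobalIndices(int blocks, int threadsPerBlock)
{
    int n = blocks * threadsPerBlock;
    std::vector<int> h_out(n, -1);

    int* d_out = nullptr;
    cudaMalloc(&d_out, n * sizeof(int));

    WriteGlobalIndex<<<blocks, threadsPerBlock>>>(d_out, n);

    // cudaMemcpy waits for the kernel to finish before copying.
    cudaMemcpy(h_out.data(), d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return h_out;
}
```

With 4 blocks of 4 threads, thread 2 of block 3 gets index 4 * 3 + 2 = 14, so the 16 threads cover positions 0 through 15 exactly once.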
Performance
• Supercomputer performance is measured in Floating
Point Operations Per Second (FLOPS)
• Megaflops = 10^6 FLOPS
• Gigaflops = 10^9 FLOPS
• Teraflops = 10^12 FLOPS
• Petaflops = 10^15 FLOPS
• Japan’s K Computer: 10.51 petaflops
• Nvidia GTX 480: ~1300 gigaflops
• Core i7 920 @ 3.4 GHz: 69 gigaflops
Graphics Cards
• Consumer
  • AMD 6950, $250
    • 2.25 TFLOPs Single Precision compute power
    • 562.5 GFLOPs Double Precision compute power
    • 1408 Stream Processors
  • Nvidia GTX 470, $150
    • 1.09 TFLOPs Single Precision compute power
    • 544.32 GFLOPs Double Precision compute power
    • 448 CUDA Cores
  • About $110 to $140 per single-precision TFLOP
Speed Up?
How it works
• The CPU sends the code and data to the GPU
• GPU does the computing
• GPU returns the results to System Memory
• The transfers between system memory and the GPU are the biggest bottleneck in the system
[Diagram: Code, CPU, GPU, Results]
Program Flow
1. Allocate System Memory
2. Allocate Device Memory
3. Copy Memory from System to Device
4. Execute the Code
5. Copy Results back to the System from the Device
6. Free Device Memory
7. Process Results
8. Free System Memory
• Steps 3 and 5 create the bottleneck
When to Use the GPU
• Let dT = transfer time between device and system
• Let st = serial execution time
• Let pt = parallel execution time
• Use the GPU when 2(dT) + pt < st (one transfer in each direction plus the parallel run must beat the serial run)
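As a rough sketch of how this test could be applied, the code below evaluates the inequality and measures one system-to-device copy with CUDA events. The helper names `WorthOffloading` and `MeasureTransferMs` are hypothetical, and using one measured copy as dT assumes the return transfer costs about the same.

```cuda
#include <cuda_runtime.h>
#include <cstdlib>

// True when 2*dT + pt < st. All three times must use the same unit.
bool WorthOffloading(float dT, float pt, float st)
{
    return 2.0f * dT + pt < st;
}

// Times a single system-to-device copy of `bytes` bytes, in milliseconds.
float MeasureTransferMs(size_t bytes)
{
    void* h_buf = calloc(1, bytes);   // zero-filled system memory
    void* d_buf = nullptr;
    cudaMalloc(&d_buf, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);       // wait until the stop event has occurred

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_buf);
    free(h_buf);
    return ms;
}
```

For example, with dT = 2 ms, pt = 1 ms and st = 10 ms the GPU wins (5 ms < 10 ms); with dT = 5 ms it does not (11 ms > 10 ms).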
Example: Matrix Row Sum
0.5   0.25  0.25  0
0.25  0.25  0.25  0.25
0     0.5   0.5   0
0     0     0.75  0.25
Block size: 4 x 1
0.5   0.25  0.25  0
0.25  0.25  0.25  0.25
0     0.5   0.5   0
0     0     0.75  0.25
Example: Matrix Row Sum
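// Device code: each thread sums one row
__global__ void RowSum(const float* B, float* Sum, int N, int M)
{
    // The global thread index selects the row
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N) {
        float total = 0.0f;
        // B is stored row by row, so element (i, j) is B[i * M + j]
        for (int j = 0; j < M; j++)
            total += B[i * M + j];
        Sum[i] = total;
    }
}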
• B is the matrix being summed
• Sum is the array storing the row sum
• N is # of rows
• M is # of cols
Example: Matrix Row Sum
#include <stdio.h>
#include <stdlib.h>

int main()
{
    int M = 4, N = 4;
    // Allocate System Memory
    size_t size = N * M * sizeof(float);   // the N x M matrix
    size_t sumSize = N * sizeof(float);    // one sum per row
    float * h_B = (float *)malloc(size);
    float * h_sum = (float *)malloc(sumSize);
    // Fill the matrix (placeholder values; every row sums to 1)
    for (int k = 0; k < N * M; k++)
        h_B[k] = 0.25f;
    // Allocate Device Memory
    float * d_B, * d_sum;
    cudaMalloc(&d_B, size);
    cudaMalloc(&d_sum, sumSize);
    // Copy System Memory to Device
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
    // Execute the code with enough threads to cover the N rows
    int threadsPerBlock = 4;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
    RowSum<<<blocksPerGrid, threadsPerBlock>>>(d_B, d_sum, N, M);
    // Copy Results from Device Back to System Memory
    cudaMemcpy(h_sum, d_sum, sumSize, cudaMemcpyDeviceToHost);
    // Free device Memory
    cudaFree(d_B);
    cudaFree(d_sum);
    // Process Results
    for (int i = 0; i < N; i++)
        printf("Row %d sum = %f\n", i, h_sum[i]);
    // Free System Memory
    free(h_B);
    free(h_sum);
    return 0;
}
Example: Matrix Row Sum
• Now, imagine a matrix of 1000 x 1000
• I don’t guarantee that this code will run
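For the 1000 x 1000 case, the main change is the launch configuration: the number of blocks has to be rounded up so that every row gets a thread. The sketch below assumes one thread per row, a row-major matrix, and 256 threads per block (an arbitrary but common choice). `RowSumKernel` and `RowSums` are hypothetical names, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

// One thread per row; B is row-major, so element (i, j) is B[i * M + j].
__global__ void RowSumKernel(const float* B, float* Sum, int N, int M)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N) {                       // threads past the last row do nothing
        float total = 0.0f;
        for (int j = 0; j < M; ++j)
            total += B[i * M + j];
        Sum[i] = total;
    }
}

// Hypothetical host helper: returns the N row sums of an N x M matrix.
std::vector<float> RowSums(const std::vector<float>& B, int N, int M)
{
    std::vector<float> sums(N, 0.0f);
    size_t matrixBytes = B.size() * sizeof(float);
    size_t sumBytes = N * sizeof(float);

    float* d_B = nullptr;
    float* d_sum = nullptr;
    cudaMalloc(&d_B, matrixBytes);
    cudaMalloc(&d_sum, sumBytes);
    cudaMemcpy(d_B, B.data(), matrixBytes, cudaMemcpyHostToDevice);

    // Round up: for N = 1000 this gives 4 blocks (1024 threads, 24 idle).
    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
    RowSumKernel<<<blocksPerGrid, threadsPerBlock>>>(d_B, d_sum, N, M);

    cudaMemcpy(sums.data(), d_sum, sumBytes, cudaMemcpyDeviceToHost);
    cudaFree(d_B);
    cudaFree(d_sum);
    return sums;
}
```

Calling `RowSums(B, 1000, 1000)` on a matrix filled with 0.25 should return 1000 sums of 250. The `i < N` guard is what makes the rounded-up grid safe.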
References
• Newegg.com
• CUDA C Programming Guide
http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/CUDA_C_Programming_Guide.pdf
• AMD.com
http://www.amd.com/us/products/desktop/graphics/amd-radeon-hd6000/hd-6950/Pages/amd-radeon-hd-6950-overview.aspx
• PCGameshardware.com
http://www.pcgameshardware.com/aid,743498/Geforce-GTX-480and-GTX-470-reviewed-Fermi-performance-benchmarks/Reviews/
• Nvidia.com
http://www.nvidia.com/object/product_geforce_gtx_470_us.html