GPU Tutorial: How To Program for GPUs
High Performance Computing with GPUs: An Introduction

Krešimir Ćosić <cosic.kresimir@gmail.com>, University of Split, Croatia
LSST All Hands Meeting 2010, Tucson, AZ, Thursday, August 12th, 2010
Overview
• CUDA
• Hardware architecture
• Programming model
• Convolution on GPU
CUDA
• 'Compute Unified Device Architecture'
  – Hardware and software architecture for issuing and managing computations on the GPU
• Massively parallel architecture
  – over 8000 threads is common
• C for CUDA (C++ for CUDA)
  – C/C++ language with some additions and restrictions
• Enables GPGPU – 'General Purpose Computing on GPUs'
GPU: a multithreaded coprocessor
• SP: scalar processor ('CUDA core')
  – executes one thread
• SM: streaming multiprocessor
  – 32xSP (or 16, 48 or more)
  – fast local 'shared memory', shared between SPs, 16 KiB (or 64 KiB)
• Global memory (on device)

[Figure: one SM, drawn as a grid of SPs plus shared memory, connected to global memory on the device]
• GPU: SMs
  – 30xSM on GT200
  – 14xSM on Fermi
• For example, GTX 480: 14 SMs x 32 cores = 448 cores on a GPU
• GDDR memory: global memory, 512 MiB - 6 GiB (on device)
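The SM count and memory sizes above can be read back at run time. A minimal sketch (not from the talk) using cudaGetDeviceProperties:

// Sketch: query device 0 for the figures quoted above.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // device 0

    printf("SMs:                     %d\n", prop.multiProcessorCount);
    printf("Global memory:           %lu MiB\n", (unsigned long)(prop.totalGlobalMem / (1024 * 1024)));
    printf("Shared memory per block: %lu KiB\n", (unsigned long)(prop.sharedMemPerBlock / 1024));
    printf("Compute capability:      %d.%d\n", prop.major, prop.minor);
    return 0;
}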
How To Program For GPUs
• Parallelization
  – Decomposition to threads
• Memory
  – shared memory, global memory
Important Things To Keep In Mind
• Avoid divergent branches
  – Threads of a single SM must be executing the same code
  – Code that branches heavily and unpredictably will execute slowly
• Threads should be as independent as possible
  – Synchronization and communication can be done efficiently only for threads of a single multiprocessor
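As a hedged illustration (not from the talk): a branch on a value that differs between neighbouring threads splits every warp across both paths, while a branch on a per-block value does not.

// Sketch: every warp takes both paths, so the two branches run one after the other.
__global__ void divergent(float* data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 2 == 0)
        data[i] *= 2.0f;   // even threads
    else
        data[i] += 1.0f;   // odd threads
}

// Sketch: all threads of a block (and hence of a warp) take the same path; no divergence.
__global__ void uniform(float* data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (blockIdx.x % 2 == 0)
        data[i] *= 2.0f;
    else
        data[i] += 1.0f;
}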
How To Program For GPUs
• Parallelization
  – Decomposition to threads
  – Avoid divergence
• Memory
  – shared memory, global memory
• Enormous processing power
• Thread communication
  – Synchronization, no interdependencies
Programming model
Thread blocks
• Threads are grouped in thread blocks
  – 128, 192 or 256 threads in a block
• One thread block executes on one SM
  – All threads share the 'shared memory'
  – 32 threads are executed simultaneously (a 'warp')

[Figure: BLOCK 1 drawn as a grid of threads (0,0) to (1,2)]
Thread blocks
• Blocks execute on SMs
  – execute in parallel
  – execute independently!
• Blocks form a grid
• Thread ID: unique within a block
• Block ID: unique within the grid

[Figure: a grid of blocks BLOCK 0 to BLOCK 8, each block containing threads (0,0) to (1,2)]
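As an illustrative sketch (kernel and variable names are made up, not from the talk), a launch matching the figure, a 3x3 grid of nine blocks with 2x3 threads each, could be written as:

// Hypothetical kernel: each of the 54 threads writes one element of a 6 x 9 output.
__global__ void Fill(float* out, int w)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column index, 0..5
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row index, 0..8
    out[y * w + x] = 1.0f;
}

// Launch: 9 blocks of 2x3 threads = 54 threads in total
// dim3 block(2, 3);
// dim3 grid(3, 3);
// Fill<<<grid, block>>>(outGPU, 6);   // outGPU: a device array of 54 floats (placeholder)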
Code that executes on GPU: Kernels
• Kernel
  – a simple C function
  – executes on GPU
  – executes in parallel, as many times as there are threads
• The keyword __global__ tells the compiler to make a function a kernel (and compile it for the GPU, instead of the CPU)
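A minimal sketch of such a kernel (illustrative only; the names are made up, not from the talk):

// Each launched thread scales one element of the array.
__global__ void Scale(float* data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // this thread's element
    if (i < n)                                       // guard: launching more threads than elements is fine
        data[i] *= factor;
}

// Called from the CPU with the kernel launch syntax:
//   Scale<<<32, 128>>>(dataGPU, 2.0f, n);           // 32 blocks x 128 threads = 4096 threads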
Convolution
• To get one pixel of the output image:
  – multiply (pixelwise) the mask with the image at the corresponding position
  – sum the products
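For comparison, a plain single-threaded CPU version of this operation might look as follows (a sketch, using the same argument names as the GPU kernel on the next slide):

// Hypothetical CPU reference version of the convolution described above.
void ConvolveCPU(float* img, int imgW, int imgH,
                 float* filt, int filtW, int filtH,
                 float* out)
{
    int outW = imgW - filtW + 1;
    int outH = imgH - filtH + 1;

    for (int y = 0; y < outH; y++)
        for (int x = 0; x < outW; x++)
        {
            float sum = 0;
            for (int filtY = 0; filtY < filtH; filtY++)
                for (int filtX = 0; filtX < filtW; filtX++)
                    sum += img[(y + filtY) * imgW + (x + filtX)]
                         * filt[filtY * filtW + filtX];
            out[y * outW + x] = sum;
        }
}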
__global__
void Convolve( float* img, int imgW, int imgH,
               float* filt, int filtW, int filtH,
               float* out)
{
    const int nThreads = blockDim.x * gridDim.x;                 // total threads in the grid
    const int idx      = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's global index

    const int outW    = imgW - filtW + 1;
    const int outH    = imgH - filtH + 1;
    const int nPixels = outW * outH;

    // Each thread computes every nThreads-th output pixel.
    for (int curPixel = idx; curPixel < nPixels; curPixel += nThreads)
    {
        int x = curPixel % outW;
        int y = curPixel / outW;

        float sum = 0;
        for (int filtY = 0; filtY < filtH; filtY++)
            for (int filtX = 0; filtX < filtW; filtX++)
            {
                int sx = x + filtX;
                int sy = y + filtY;
                sum += img[sy*imgW + sx] * filt[filtY*filtW + filtX];
            }

        out[y * outW + x] = sum;
    }
}
Setup and data transfer
• cudaMemcpy
  – transfer data to and from GPU (global memory)
• cudaMalloc
  – allocate memory on GPU (global memory)
• GPU is the 'device', CPU is the 'host'
• Kernel call syntax
int main() {
    ...
    float* img ...
    int imgW, imgH ...

    float* imgGPU;
    cudaMalloc((void**)&imgGPU, imgW * imgH * sizeof(float));
    cudaMemcpy(
        imgGPU,                              // Destination
        img,                                 // Source
        imgW * imgH * sizeof(float),         // Size in bytes
        cudaMemcpyHostToDevice               // Direction
    );

    float* filter ...
    int filterW, filterH ...

    float* filterGPU;
    cudaMalloc((void**)&filterGPU, filterW * filterH * sizeof(float));
    cudaMemcpy(
        filterGPU,                           // Destination
        filter,                              // Source
        filterW * filterH * sizeof(float),   // Size in bytes
        cudaMemcpyHostToDevice               // Direction
    );

    int resultW = imgW - filterW + 1;
    int resultH = imgH - filterH + 1;
    float* result = (float*) malloc(resultW * resultH * sizeof(float));
    float* resultGPU;
    cudaMalloc((void**)&resultGPU, resultW * resultH * sizeof(float));

    /* Call the GPU kernel */
    dim3 block(128);
    dim3 grid(30);
    Convolve<<<grid, block>>>(
        imgGPU, imgW, imgH,
        filterGPU, filterW, filterH,
        resultGPU
    );

    cudaMemcpy(
        result,                              // Destination
        resultGPU,                           // Source
        resultW * resultH * sizeof(float),   // Size in bytes
        cudaMemcpyDeviceToHost               // Direction
    );

    cudaThreadExit();
    ...
}
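The runtime calls above all return status codes that the listing ignores. A common pattern (a sketch, not part of the original example; it would slot into the same main() and assumes <cstdio> for fprintf) is to check them:

// Hypothetical error-checking sketch; cudaMalloc, cudaMemcpy and kernel
// launches all report failures through cudaError_t.
cudaError_t err = cudaMalloc((void**)&imgGPU, imgW * imgH * sizeof(float));
if (err != cudaSuccess)
    fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));

Convolve<<<grid, block>>>(imgGPU, imgW, imgH, filterGPU, filterW, filterH, resultGPU);
cudaThreadSynchronize();                  // wait for the kernel to finish
err = cudaGetLastError();                 // picks up kernel launch/execution errors
if (err != cudaSuccess)
    fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));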
Speedup
• Linear combination of 3 filters sized 15x15
• Image size: 2k x 2k
• CPU: Core 2 @ 2.0 GHz (1 core): 6.58 s (0.89 Mpixels/s)
• GPU: Tesla S1070 (GT200): 30xSM, 240 CUDA cores, 1.3 GHz: 0.21 s (27.99 Mpixels/s)
• 31 times faster!
CUDA capabilities
• 1.0: GeForce 8800 Ultra/GTX/GTS
• 1.1: GeForce 9800 GT, GTX, GTS 250
  – + atomic instructions …
• 1.2: GeForce GT 220
• 1.3: Tesla S1070, C1060, GeForce GTX 275, 285
  – + double precision (slow) …
• 2.0: Tesla C2050, GeForce GTX 480, 470
  – + ECC, L1 and L2 cache, faster IMUL, faster atomics, faster double precision on Tesla cards …
CUDA essentials
• Download: developer.nvidia.com/object/cuda_3_1_downloads.html
  – Driver
  – Toolkit (compiler nvcc)
  – SDK (examples) (recommended)
• CUDA Programming Guide
Other tools
• 'Emulator'
  – executes on CPU
  – slow
• Simple profiler
• cuda-gdb (Linux)
• Parallel Nsight (Vista)
  – simple profiler
  – on-device debugger
...
Logical thread hierarchy
• Thread ID: unique within a block
• Block ID: unique within the grid
• To get a globally unique thread ID: combine block ID and thread ID
• Threads can access both shared and global memory
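For a one-dimensional launch this is the same expression the Convolve kernel uses:

// Globally unique index of this thread within the whole grid (1D case):
const int idx = blockIdx.x * blockDim.x + threadIdx.x;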