CUDA - TZI

Algorithm Engineering
„GPGPU“
Stefan Edelkamp
Graphics Processing Units
 GPGPU = (GP)²U
General Purpose Programming on the GPU
 „Parallelism for the masses“
 Applications: Fourier transform, model checking, bioinformatics; see CUDA Zone
Programming the Graphics Processing Unit with CUDA
Overview
 Cluster / Multicore / GPU comparison
 Computing on the GPU
 GPGPU languages
 CUDA
 Small Example
Cluster / Multicore / GPU
 Cluster system
   many individual systems
   each one has
     one (or more) processors
     internal memory
     often an HDD
   communication over the network
     slow compared to internal communication
     no shared memory
[Figure: three nodes, each with CPU, RAM and HDD, connected through a switch]
Cluster / Multicore / GPU
 Multicore systems
   multiple CPUs
   RAM
   external memory on HDD
   communication over RAM
[Figure: CPU1 to CPU4 sharing one RAM, with an attached HDD]
Cluster / Multicore / GPU
 System with a Graphics Processing Unit
   Many (240) parallel processing units
   Hierarchical memory structure
     RAM
     VideoRAM
     SharedRAM
   Communication
     PCI bus
[Figure: CPU with RAM and hard disk drive, connected to a graphics card that holds the GPU with SRAM and VRAM]
Computing on the GPU
 Hierarchical execution
   Groups
     executed sequentially
   Threads
     executed in parallel
     lightweight (creation / switching nearly free)
   one kernel function
     executed by each thread
Computing on the GPU
 Hierarchical memory
 Video RAM
 1 GB
 Comparable to RAM
 Shared RAM in the GPU
 16 KB
 Comparable to registers
 parallel access by threads
[Figure: graphics card containing the GPU with its SRAM, backed by VideoRAM]
Example architecture: G200, e.g. in the GTX 280
Example problems
Ranking and unranking with parity
2-bit BFS
1-bit BFS
Sliding-tile puzzle
Some results …
Further results …
GPGPU Languages
 RapidMind
   Supports multicore CPUs, ATI, NVIDIA and Cell
   C++ code is analysed and compiled for the target hardware
 Accelerator (Microsoft)
   Library for .NET languages
 BrookGPU (Stanford University)
   Supports ATI and NVIDIA
   Its own language, a variant of ANSI C
CUDA
 Programming language
 Similar to C
 File suffix .cu
 Its own compiler, called nvcc
 Can be linked with C/C++ code
CUDA
 C++ code: compile with GCC
 CUDA code: compile with nvcc
 Link both with ld
 Executable
CUDA
 Additional variable types
dim3
int3
char3
CUDA
 Different types of functions
__global__ kernel: invoked from the host, runs on the device
__device__ called from the device, runs on the device
 Different types of variables
__device__ located in VRAM
__shared__ located in SRAM
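The qualifiers above can be seen together in one small sketch. It is an illustration only: the names `GROUP_SIZE`, `offset_d`, `addOffset`, `reverseAndOffset` and `runReverseAndOffset` are hypothetical and do not come from the slides. Each group loads its part of the array into a `__shared__` tile, then every thread writes back the mirrored element plus a value read from a `__device__` variable.

```cuda
#include <cassert>
#include <cuda_runtime.h>

#define GROUP_SIZE 64   // hypothetical number of threads per group

// __device__ variable: located in VRAM, visible to all threads.
__device__ int offset_d = 100;

// __device__ function: can only be called from code running on the GPU.
__device__ int addOffset(int x) { return x + offset_d; }

// __global__ function (kernel): invoked from the host, executed on the GPU.
__global__ void reverseAndOffset(int *data)
{
    // __shared__ array: located in SRAM, one copy per group.
    __shared__ int tile[GROUP_SIZE];

    int base = blockIdx.x * GROUP_SIZE;
    tile[threadIdx.x] = data[base + threadIdx.x];
    __syncthreads();   // wait until the whole tile has been loaded

    // Each thread reads the mirrored element of its own group's tile.
    data[base + threadIdx.x] = addOffset(tile[GROUP_SIZE - 1 - threadIdx.x]);
}

// Host helper: runs the kernel in place on n ints.
// Assumption: n is a positive multiple of GROUP_SIZE.
bool runReverseAndOffset(int *host, int n)
{
    assert(n > 0 && n % GROUP_SIZE == 0);
    int *dev = nullptr;
    size_t bytes = n * sizeof(int);
    if (cudaMalloc((void **)&dev, bytes) != cudaSuccess) return false;
    bool ok = cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        reverseAndOffset<<<n / GROUP_SIZE, GROUP_SIZE>>>(dev);
        ok = cudaGetLastError() == cudaSuccess &&
             cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dev);
    return ok;
}
```

Calling `runReverseAndOffset(h, 128)` on an array holding 0..127 reverses each block of 64 elements and adds 100 to every value.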
CUDA
 Calling the kernel function
name<<<dim3 grid, dim3 block>>>(...)
grid: grid dimensions (number of groups)
block: block dimensions (threads per group)
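A possible way to pick the two launch parameters for a two-dimensional problem is sketched below. The kernel `fill2D`, the helper `runFill2D` and the 16 x 16 group shape are assumptions made for illustration. The grid size is rounded up so that every element is covered, and the kernel guards against threads that fall outside the matrix.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Hypothetical kernel: writes x + y into a row-major width x height matrix.
__global__ void fill2D(int *m, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)   // the last groups may be only partly used
        m[y * width + x] = x + y;
}

// Host helper: fills host[0 .. width*height) on the GPU.
bool runFill2D(int *host, int width, int height)
{
    assert(width > 0 && height > 0);
    size_t bytes = (size_t)width * height * sizeof(int);
    int *dev = nullptr;
    if (cudaMalloc((void **)&dev, bytes) != cudaSuccess) return false;

    dim3 block(16, 16);   // 256 threads per group; unset z defaults to 1
    dim3 grid((width + block.x - 1) / block.x,     // round up in x
              (height + block.y - 1) / block.y);   // round up in y
    fill2D<<<grid, block>>>(dev, width, height);

    bool ok = cudaGetLastError() == cudaSuccess &&
              cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(dev);
    return ok;
}
```

Every component of a `dim3` must be at least 1; a dimension of 0 makes the launch fail.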
CUDA
 Memory handling
cudaMalloc(...) - allocates VRAM
cudaMemcpy(...) - copies memory between host and device
cudaFree(...) - frees VRAM
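The three calls are typically used together as allocate, copy in, copy out, free. The sketch below shows that round trip with basic error checking; the helper name `roundTrip` is hypothetical. Note that all sizes are given in bytes, not in elements.

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

// Copies n ints to the GPU and back again. Returns true if every call
// succeeded and the data arrived unchanged in dst.
bool roundTrip(const int *src, int *dst, int n)
{
    assert(n > 0);
    int *dev = nullptr;
    size_t bytes = n * sizeof(int);   // sizes are given in bytes

    cudaError_t err = cudaMalloc((void **)&dev, bytes);   // allocate VRAM
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc: %s\n", cudaGetErrorString(err));
        return false;
    }

    err = cudaMemcpy(dev, src, bytes, cudaMemcpyHostToDevice);   // RAM -> VRAM
    if (err == cudaSuccess)
        err = cudaMemcpy(dst, dev, bytes, cudaMemcpyDeviceToHost);   // VRAM -> RAM

    cudaFree(dev);   // free VRAM
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy: %s\n", cudaGetErrorString(err));
        return false;
    }
    for (int i = 0; i < n; i++)
        if (src[i] != dst[i]) return false;
    return true;
}
```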
CUDA
 Distinguish threads
gridDim – number of groups
blockDim – number of threads per group
blockIdx – id of the group (starting with 0)
threadIdx – id of the thread within its group (starting with 0)
id = blockDim.x*blockIdx.x+threadIdx.x
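To check that this formula gives every thread a distinct, consecutive id, a small sketch can let each thread store its id and compare the result on the host. The names `writeIds` and `idsAreConsecutive` are hypothetical. `blockDim.x` holds the number of threads per group, so group `blockIdx.x` starts at id `blockDim.x*blockIdx.x`.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Every thread stores its global id at the position given by that id.
__global__ void writeIds(int *out, int n)
{
    int id = blockDim.x * blockIdx.x + threadIdx.x;
    if (id < n)   // surplus threads of the last group do nothing
        out[id] = id;
}

// Launches enough groups to cover n elements and checks out[i] == i.
bool idsAreConsecutive(int n, int threadsPerGroup)
{
    assert(n > 0 && threadsPerGroup > 0);
    int groups = (n + threadsPerGroup - 1) / threadsPerGroup;   // round up

    int *dev = nullptr;
    if (cudaMalloc((void **)&dev, n * sizeof(int)) != cudaSuccess) return false;
    writeIds<<<groups, threadsPerGroup>>>(dev, n);

    std::vector<int> host(n, -1);
    bool ok = cudaGetLastError() == cudaSuccess &&
              cudaMemcpy(host.data(), dev, n * sizeof(int),
                         cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(dev);

    for (int i = 0; ok && i < n; i++)
        ok = (host[i] == i);
    return ok;
}
```

With `n = 1000` and 256 threads per group, four groups are launched and the last 24 threads are masked out by the `id < n` test.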
CUDA
// Sequential CPU version
void inc(int *a, int b, int N)
{
    for (int i = 0; i < N; i++)
        a[i] = a[i] + b;
}
int main()
{
    ...
    inc(a, b, N);
}
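// CUDA version: one thread per array element
__global__ void inc(int *a, int b, int N)
{
    int id = blockDim.x * blockIdx.x + threadIdx.x;
    if (id < N)   // the last group may be only partly used
        a[id] = a[id] + b;
}
int main()
{
    ...
    int *a_d;
    cudaMalloc((void **)&a_d, N * sizeof(int));   // size in bytes
    cudaMemcpy(a_d, a, N * sizeof(int), cudaMemcpyHostToDevice);
    dim3 dimBlock(blocksize, 1, 1);   // unused dimensions must be 1, not 0
    dim3 dimGrid((N + blocksize - 1) / blocksize, 1, 1);   // round up
    inc<<<dimGrid, dimBlock>>>(a_d, b, N);
    cudaMemcpy(a, a_d, N * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(a_d);
}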
Real-World Example
 LTL model checking
Traversing an implicit graph G=(V,E)
Vertices are called states
Edges are represented by transitions
Duplicate removal is needed
Real-World Example
 External model checking
Generate the graph with external BFS
Each BFS layer needs to be sorted
GPUs have proven to be fast at sorting
Real-World Example
 Challenges
Millions of states in one layer
Huge state size
Fast access only in SRAM
Elements need to be moved
Real-World Example
 Solutions:
Gpuqsort
Quicksort optimized for GPUs
Intensive swapping in VRAM
Bitonic-based sorting
Fast for subgroups
Concatenating groups is slow
Real-World Example
 Our solution
States S presorted by hash H(S)
Each bucket sorted in SRAM by a group
[Figure: SRAM and VRAM]
Real-World Example
 Our solution
Order given by (H(S), S)
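The idea of sorting one bucket per group in SRAM, ordered by (H(S), S), can be sketched as follows. This is not the implementation behind the slides: the hash `hashState`, the fixed bucket size `BUCKET`, the use of plain `unsigned int` values as states, and the choice of odd-even transposition sort are all assumptions made to keep the example short.

```cuda
#include <cassert>
#include <cuda_runtime.h>

#define BUCKET 64   // hypothetical bucket size = threads per group

// Hypothetical stand-in for the hash H(S); the slides do not define it.
__host__ __device__ unsigned int hashState(unsigned int s)
{
    return (s * 2654435761u) >> 28;   // 16 possible hash values
}

// True if state a must come before state b in the order (H(S), S).
__host__ __device__ bool before(unsigned int a, unsigned int b)
{
    unsigned int ha = hashState(a), hb = hashState(b);
    return ha < hb || (ha == hb && a < b);
}

// Each group copies one bucket from VRAM to SRAM, sorts it there with
// odd-even transposition sort, and writes it back.
__global__ void sortBuckets(unsigned int *states)
{
    __shared__ unsigned int s[BUCKET];
    int t = threadIdx.x;
    int base = blockIdx.x * BUCKET;
    s[t] = states[base + t];
    __syncthreads();

    for (int phase = 0; phase < BUCKET; phase++) {
        // Even phases compare pairs (0,1),(2,3),...; odd phases (1,2),(3,4),...
        if ((t & 1) == (phase & 1) && t + 1 < BUCKET && before(s[t + 1], s[t])) {
            unsigned int tmp = s[t];
            s[t] = s[t + 1];
            s[t + 1] = tmp;
        }
        __syncthreads();   // all swaps of this phase are done
    }
    states[base + t] = s[t];
}

// Host helper: sorts numBuckets buckets of BUCKET states each, in place.
bool sortBucketsOnGpu(unsigned int *host, int numBuckets)
{
    assert(numBuckets > 0);
    size_t bytes = (size_t)numBuckets * BUCKET * sizeof(unsigned int);
    unsigned int *dev = nullptr;
    if (cudaMalloc((void **)&dev, bytes) != cudaSuccess) return false;
    bool ok = cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        sortBuckets<<<numBuckets, BUCKET>>>(dev);
        ok = cudaGetLastError() == cudaSuccess &&
             cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dev);
    return ok;
}
```

Odd-even transposition sort needs as many phases as there are elements, so it only makes sense for buckets small enough to fit in SRAM; a bitonic network would be a natural alternative inside each group.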
Real-World Example
 Results
Programming the GPU
Questions?