CUDA - TZI

Algorithm Engineering "GPGPU"
Stefan Edelkamp

Graphics Processing Units
- GPGPU = (GP)²U: General Purpose Programming on the GPU
- "Parallelism for the masses"
- Applications: Fourier transformation, model checking, bioinformatics; see CUDA-ZONE

Programming the Graphics Processing Unit with CUDA

Overview
- Cluster / Multicore / GPU comparison
- Computing on the GPU
- GPGPU languages
- CUDA
- Small Example

Cluster / Multicore / GPU
- Cluster system
  - many separate systems
  - each one has
    - one (or more) processors
    - internal memory
    - often an HDD
  - communication over the network
    - slow compared to internal communication
    - no shared memory
[Diagram: three nodes, each with CPU, RAM and HDD, connected through a switch]

Cluster / Multicore / GPU
- Multicore systems
  - multiple CPUs
  - RAM
  - external memory on HDD
  - communication over RAM
[Diagram: CPU1 to CPU4 sharing one RAM and one HDD]

Cluster / Multicore / GPU
- System with a Graphics Processing Unit
  - many (240) parallel processing units
  - hierarchical memory structure
    - RAM
    - Video RAM
    - Shared RAM
  - communication over the PCI bus
[Diagram: CPU with RAM and hard disk drive; graphics card with GPU, SRAM and VRAM]

Computing on the GPU
- Hierarchical execution
  - Groups
    - executed sequentially
  - Threads
    - executed in parallel
    - lightweight (creation / switching nearly free)
  - one kernel function
    - executed by each thread
[Diagram label: Group 0]

Computing on the GPU
- Hierarchical memory
  - Video RAM
    - 1 GB
    - comparable to RAM
  - Shared RAM in the GPU
    - 16 KB
    - comparable to registers
    - parallel access by threads
[Diagram: graphics card with GPU, SRAM and Video RAM]

Example architecture
- G200, e.g. in the 280GTX

Example problems
- Ranking and unranking with parity
- 2-bit BFS
- 1-bit BFS
- Sliding-tile puzzle

Some Results ...

Further Results ...

GPGPU Languages
- RapidMind
  - supports multicore, ATI, NVIDIA and Cell
  - C++ analysed and compiled for the target hardware
- Accelerator (Microsoft)
  - library for .NET languages
- BrookGPU (Stanford University)
  - supports ATI, NVIDIA
  - own language, a variant of ANSI C

CUDA
- Programming language
  - similar to C
  - file suffix .cu
  - own compiler called nvcc
  - can be linked to C
[Diagram: C++ code is compiled with GCC, CUDA code is compiled with nvcc, both are linked with ld into one executable]

CUDA
- Additional variable types
  - dim3
  - int3
  - char3

CUDA
- Different types of functions
  - __global__: invoked from the host
  - __device__: called from the device
- Different types of variables
  - __device__: located in VRAM
  - __shared__: located in SRAM

CUDA
- Calling the kernel function
    name<<<dim3 grid, dim3 block>>>(...)
  - grid: grid dimensions (groups)
  - block: block dimensions (threads)

CUDA
- Memory handling
  - cudaMalloc(...): allocate VRAM
  - cudaMemcpy(...): copy memory
  - cudaFree(...): free VRAM

CUDA
- Distinguishing threads
  - blockDim: number of threads per group
  - blockIdx: ID of the group (starting with 0)
  - threadIdx: ID of the thread within its group (starting with 0)
  - Id = blockDim.x*blockIdx.x+threadIdx.x

CUDA: Small Example

CPU version:

    void inc(int *a, int b, int N) {
        for (int i = 0; i < N; i++)
            a[i] = a[i] + b;
    }

    int main() {
        ...
        inc(a, b, N);
    }

GPU version:

    // Each thread increments the one element selected by its global ID.
    __global__ void inc(int *a, int b, int N) {
        int id = blockDim.x*blockIdx.x+threadIdx.x;
        if (id < N)   // the last group may have more threads than elements
            a[id] = a[id] + b;
    }

    int main() {
        ...
        int *a_d;
        cudaMalloc((void **)&a_d, N * sizeof(int));   // size in bytes
        cudaMemcpy(a_d, a, N * sizeof(int), cudaMemcpyHostToDevice);
        dim3 dimBlock(blocksize, 1, 1);               // unused dimensions are 1, not 0
        dim3 dimGrid((N + blocksize - 1) / blocksize, 1, 1);   // round up
        inc<<<dimGrid, dimBlock>>>(a_d, b, N);
    }

Realworld Example
- LTL model checking
  - traversing an implicit graph G=(V,E)
  - vertices are called states
  - edges are represented by transitions
  - duplicate removal needed

Realworld Example
- External model checking
  - generate the graph with external BFS
  - each BFS layer needs to be sorted
  - GPUs have proven to be fast at sorting

Realworld Example
- Challenges
  - millions of states in one layer
  - huge state size
  - fast access only in SRAM
  - elements need to be moved

Realworld Example
- Solutions:
  - Gpuqsort
    - quicksort optimized for GPUs
    - intensive swapping in VRAM
  - Bitonic-based sorting
    - fast for subgroups
    - concatenating groups is slow

Realworld Example
- Our solution
  - states S presorted by hash H(S)
  - each bucket sorted in SRAM by a group
[Diagram labels: SRAM, VRAM]

Realworld Example
- Our solution
  - order given by (H(S), S)

Realworld Example
- Results

Programming the GPU
Questions?
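The small example from the slides (the `inc` kernel) leaves the host side elided. As a minimal sketch, the missing steps could be filled in as below. The wrapper name `incOnGpu`, the block size of 256, the error handling, and the copy back to the host are assumptions added for illustration; they are not part of the original slides.

```cuda
#include <cuda_runtime.h>

// Kernel from the slides: each thread increments one array element by b.
__global__ void inc(int *a, int b, int n) {
    int id = blockDim.x * blockIdx.x + threadIdx.x;
    if (id < n)                      // guard: the last block may be only partly used
        a[id] = a[id] + b;
}

// Hypothetical host wrapper (not from the slides): copies the host array a
// to VRAM, runs the kernel, and copies the result back into a.
cudaError_t incOnGpu(int *a, int b, int n) {
    if (n <= 0) return cudaSuccess;  // nothing to do

    const int blockSize = 256;       // assumed threads per block
    const size_t bytes = static_cast<size_t>(n) * sizeof(int);

    int *a_d = nullptr;              // device (VRAM) copy of a
    cudaError_t err = cudaMalloc(&a_d, bytes);
    if (err != cudaSuccess) return err;

    err = cudaMemcpy(a_d, a, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        dim3 dimBlock(blockSize, 1, 1);
        dim3 dimGrid((n + blockSize - 1) / blockSize, 1, 1);  // round up
        inc<<<dimGrid, dimBlock>>>(a_d, b, n);
        err = cudaGetLastError();    // reports an invalid launch configuration
    }
    if (err == cudaSuccess)          // this blocking copy also waits for the kernel
        err = cudaMemcpy(a, a_d, bytes, cudaMemcpyDeviceToHost);

    cudaFree(a_d);
    return err;
}
```

Calling `incOnGpu(a, 5, 1000)` on a host array of 1000 integers should leave every element increased by 5 and return `cudaSuccess`, provided a CUDA-capable device is present. Rounding the grid size up rather than using `N / blocksize` keeps the trailing elements covered when N is not a multiple of the block size, which is why the kernel needs the `id < n` guard.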
