Algorithm Engineering „GPGPU“
Stefan Edelkamp

Graphics Processing Units
  GPGPU = (GP)²U: General Purpose Programming on the GPU
  „Parallelism for the masses“
  Applications: Fourier transformation, model checking, bioinformatics; see CUDA Zone

Programming the Graphics Processing Unit with CUDA

Overview
  Cluster / Multicore / GPU comparison
  Computing on the GPU
  GPGPU languages
  CUDA
  Small Example

Cluster / Multicore / GPU
  Cluster system
    many separate systems
    each one has
      one (or more) processors
      internal memory
      often an HDD
    communication over the network
      slow compared to internal communication
      no shared memory
  [Diagram: three nodes, each with CPU, RAM and HDD, connected by a switch]

Cluster / Multicore / GPU
  Multicore systems
    multiple CPUs
    RAM
    external memory on HDD
    communication over RAM
  [Diagram: CPU1, CPU2, CPU3, CPU4 sharing RAM and HDD]

Cluster / Multicore / GPU
  System with a Graphics Processing Unit
    many (240) parallel processing units
    hierarchical memory structure
      RAM
      VideoRAM
      SharedRAM
    communication over the PCI bus
  [Diagram: CPU with RAM and hard disk drive; graphics card with GPU, SRAM and VRAM]

Computing on the GPU
  Hierarchical execution
    Groups (thread blocks)
      may be executed one after another; no order is guaranteed
    Threads
      executed in parallel
      lightweight (creation / switching nearly free)
    one kernel function
      executed by each thread
  [Diagram label: Group 0]

Computing on the GPU
  Hierarchical memory
    Video RAM
      1 GB
      comparable to RAM
    Shared RAM in the GPU
      16 KB
      comparable to registers in speed
      parallel access by threads
  [Diagram: graphics card with GPU, SRAM and VideoRAM]

Example architecture: G200, e.g.
in the GeForce GTX 280

Example problems
Ranking and unranking with parity
2-bit BFS
1-bit BFS
Sliding-tile puzzle
Some results …
Further results …

GPGPU Languages
  RapidMind
    supports multicore, ATI, NVIDIA and Cell
    C++ analysed and compiled for the target hardware
  Accelerator (Microsoft)
    library for .NET languages
  BrookGPU (Stanford University)
    supports ATI, NVIDIA
    own language, a variant of ANSI C

CUDA
  Programming language
    similar to C
    file suffix .cu
    own compiler called nvcc
    can be linked to C
  [Diagram: C++ code is compiled with GCC, CUDA code is compiled with nvcc, both are linked with ld into one executable]

CUDA
  Additional variable types
    dim3
    int3
    char3

CUDA
  Different types of functions
    __global__  invoked from the host
    __device__  called from the device
  Different types of variables
    __device__  located in VRAM
    __shared__  located in SRAM

CUDA
  Calling the kernel function
    name<<<dim3 grid, dim3 block>>>(...)
  Grid dimensions (groups)
  Block dimensions (threads)

CUDA
  Memory handling
    cudaMalloc(...)  allocates VRAM
    cudaMemcpy(...)  copies memory
    cudaFree(...)    frees VRAM

CUDA
  Distinguish threads
    blockDim: number of threads per group
    gridDim: number of groups
    blockIdx: id of the group (starting with 0)
    threadIdx: id of the thread within its group (starting with 0)
    Id = blockDim.x*blockIdx.x+threadIdx.x

CUDA: Small Example
  Sequential version:

    void inc(int *a, int b, int N) {
        for (int i = 0; i < N; i++)
            a[i] = a[i] + b;
    }

    int main() {
        ...
        inc(a, b, N);
    }

  CUDA version:

    __global__ void inc(int *a, int b, int N) {
        // one thread per array element
        int id = blockDim.x * blockIdx.x + threadIdx.x;
        if (id < N)  // guard: the last group may have surplus threads
            a[id] = a[id] + b;
    }

    int main() {
        ...
        int *a_d;
        cudaMalloc((void **)&a_d, N * sizeof(int));                   // allocate VRAM
        cudaMemcpy(a_d, a, N * sizeof(int), cudaMemcpyHostToDevice);  // sizes are in bytes
        dim3 dimBlock(blocksize, 1, 1);                               // unused dimensions must be 1, not 0
        dim3 dimGrid((N + blocksize - 1) / blocksize, 1, 1);          // round up so every element is covered
        inc<<<dimGrid, dimBlock>>>(a_d, b, N);
        cudaMemcpy(a, a_d, N * sizeof(int), cudaMemcpyDeviceToHost);  // copy the result back
        cudaFree(a_d);
    }

Real-world example
  LTL model checking
    traversing an implicit graph G = (V, E)
    vertices are called states
    edges are represented by transitions
    duplicate removal is needed

Real-world example
  External model checking
    generate the graph with external BFS
    each BFS layer needs to be sorted
    GPUs have proven to be fast at sorting

Real-world example
  Challenges
    millions of states in one layer
    huge state size
    fast access only in SRAM
    elements need to be moved

Real-world example
  Solutions:
    Gpuqsort
      quicksort optimized for GPUs
      intensive swapping in VRAM
    Bitonic-based sorting
      fast for subgroups
      concatenating groups is slow

Real-world example
  Our solution
    states S presorted by hash H(S)
    each bucket sorted in SRAM by a group
  [Diagram labels: SRAM, VRAM]

Real-world example
  Our solution
    order given by (H(S), S)

Real-world example
  Results

Programming the GPU
  Questions?