CS179: GPU Programming
Lecture 10: GPU-Accelerated Libraries

Today
- Some useful libraries:
  - cuRAND
  - cuBLAS
  - cuFFT

cuRAND
- Oftentimes, we want random data
  - Simulations often need entropy to behave realistically
- How do we obtain it on the GPU?
  - There is no rand(), or simple equivalent, on the device
- We could use a pseudo-random function with inputs based on thread properties
  - Ex.: int i = cos(999 * threadIdx.x + 123 * threadIdx.y);
  - Works okay, but not great

cuRAND
- What we could do with our current tools:
  - Generate N random numbers on the CPU
  - Allocate space on the GPU
  - Memcpy to the GPU
- Not bad -- if we only need to do this once
- Issues:
  - Number generation is synchronous
  - Memcpy can be slow
- Much more ideal if the random data can live entirely on the GPU

cuRAND
- Solution: cuRAND, the CUDA random number library
  - Works on both host and device
  - Lots of different distributions
    - Uniform, normal, log-normal, Poisson, etc.

cuRAND Performance
- (benchmark charts not reproduced here)

cuRAND Host API
- Using cuRAND on the host:
  - Called from host code
  - Allocates memory on the GPU
  - Generates random numbers on the GPU
- Several pseudorandom generators available
- Several random distributions available

cuRAND Host API
- Functions to know (see the sketch after this list):
  - curandCreateGenerator(&g, GEN_TYPE)
    - GEN_TYPE = CURAND_RNG_PSEUDO_DEFAULT, CURAND_RNG_PSEUDO_XORWOW, ...
    - The choice doesn't particularly matter; the differences are small
  - curandSetPseudoRandomGeneratorSeed(g, SEED)
    - Again, SEED doesn't matter too much; just pick one (ex.: time(NULL))
  - curandGenerate______(...)
    - Depends on the distribution
    - Ex.: curandGenerate(g, dst, n) (random unsigned ints), curandGenerateUniform(g, dst, n), curandGenerateNormal(g, dst, n, mean, stddev)
  - curandDestroyGenerator(g)
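The calls above fit together roughly as follows. This is a minimal sketch, assuming a uniform distribution; the buffer name d_rand and size N are our own choices, and error checking is omitted:

    // Minimal cuRAND host API sketch (illustrative, not from the slides)
    #include <cuda_runtime.h>
    #include <curand.h>
    #include <time.h>

    int main() {
        const size_t N = 1 << 20;     // N is an arbitrary example size
        float *d_rand;
        cudaMalloc(&d_rand, N * sizeof(float));  // device output buffer

        curandGenerator_t g;
        curandCreateGenerator(&g, CURAND_RNG_PSEUDO_DEFAULT);
        curandSetPseudoRandomGeneratorSeed(g, time(NULL));

        // Fill d_rand with N uniform floats in (0, 1], generated on the GPU
        curandGenerateUniform(g, d_rand, N);

        // ... launch kernels that consume d_rand ...

        curandDestroyGenerator(g);
        cudaFree(d_rand);
        return 0;
    }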
cuRAND Host API
- curandGenerate() launches asynchronously
  - Much faster than serial CPU generation
- However, all of the random data must be generated and stored up front, in one big buffer
  - (With a generator created via curandCreateGeneratorHost, the results land in host memory and must additionally be copied to the GPU)
  - Introduces some undesired overhead
  - Might need more memory than we can pass in one go
- Solution: the cuRAND device API

cuRAND Device API
- Supports RNG inside kernels
  - No need to generate all the random data before the kernel launches
  - We don't have to copy and store all of the data at once
- Stores the RNG states entirely on the GPU
  - The host still needs to allocate device memory for the states

cuRAND Device API
- Example (host side):

    curandState *devStates;
    cudaMalloc(&devStates, sizeof(curandState) * nThreads);
    kernel<<<gD, bD, sM>>>(devStates, ...);
    cudaFree(devStates);  // don't forget to free!

- Example continued (device side; requires #include <curand_kernel.h>; the parameters v, a, b, and seed are passed in for illustration):

    __global__ void kernel(curandState *states, float *v,
                           float a, float b, unsigned long long seed) {
        int id = blockIdx.x * blockDim.x + threadIdx.x;  // calculate thread id
        curand_init(seed, id, 0, &states[id]);
        // generate a random value in (0, 1]
        v[id] = curand_uniform(&states[id]);
        // transform to a random value in (a, b]
        v[id] = v[id] * (b - a) + a;
    }

cuRAND Device API
- Note the difference between cuRAND states and the actual values
  - A state tracks where a thread is in its pseudorandom sequence
  - Numbers aren't produced until curand_<DISTRIBUTION>(&state) is called

cuRAND Overview
- Can generate numbers on either host or device
- Whether generating on host or device, the host must allocate the device memory
- Many different random seeds and distributions available
- Check out these for more details:
  - http://docs.nvidia.com/cuda/curand/host-api-overview.html
  - http://docs.nvidia.com/cuda/curand/device-api-overview.html

cuBLAS
- Linear algebra is extremely important in many applications
  - Physics, engineering, mathematics, computer graphics, networking, ...
  - Anything STEM, really
- Linear algebra systems are oftentimes HUGE
  - Ex.: inverting a matrix of size 10^6 x 10^6 would take a while on a CPU...
- Linear algebra systems are oftentimes parallelizable
  - Element a[0][0] doesn't care about what a[1][0] will be, just what it was
- Linear algebra is a perfect candidate for the GPU

cuBLAS
- cuBLAS: CUDA's linear algebra library
- Based on BLAS (Basic Linear Algebra Subprograms)
  - Supports all 152 standard BLAS routines
  - Works pretty similarly to BLAS

cuBLAS Performance
- (benchmark charts not reproduced here)

cuBLAS
- Several levels of BLAS:
  - BLAS1: handles vector and vector-vector functions
    - Sum, min, max, etc.
    - Add, scale, dot, etc.
  - BLAS2: handles matrix-vector functions
    - Multiplication, generally
  - BLAS3: handles matrix-matrix functions
    - Multiplication, addition, etc.

cuBLAS
- Using it is fairly simple (a minimal sketch follows)
  - Call initialization before any cuBLAS calls: cublasInit()
  - Call whatever routines you need (from host code; they launch GPU kernels for you)
  - Call shutdown once you're done with cuBLAS: cublasShutdown()
- Check out the following for more info:
  - http://docs.nvidia.com/cuda/cublas/index.html
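Here is a minimal sketch of that lifecycle, using the legacy cuBLAS API named on this slide (the newer handle-based API uses cublasCreate()/cublasDestroy() instead). A BLAS1 scal call stands in for "whatever routines you need"; the names h_x, d_x, and N are our own, and error checking is omitted:

    // Minimal legacy-cuBLAS sketch (illustrative, not from the slides)
    #include <stdio.h>
    #include <cublas.h>

    int main() {
        const int N = 1024;
        float h_x[N];
        for (int i = 0; i < N; i++) h_x[i] = 1.0f;

        cublasInit();  // initialize before using cuBLAS

        float *d_x;
        cublasAlloc(N, sizeof(float), (void **)&d_x);       // device vector
        cublasSetVector(N, sizeof(float), h_x, 1, d_x, 1);  // host -> device

        cublasSscal(N, 2.0f, d_x, 1);  // BLAS1: x = 2 * x, runs on the GPU

        cublasGetVector(N, sizeof(float), d_x, 1, h_x, 1);  // device -> host
        printf("x[0] = %f\n", h_x[0]);  // expect 2.0

        cublasFree(d_x);
        cublasShutdown();  // shut down once done with cuBLAS
        return 0;
    }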
cuBLAS
- Alternative: cuSPARSE
  - Another CUDA linear algebra library
  - Generally works well when dealing with sparse matrices (most entries are 0)
  - Works pretty well even with dense vectors

cuFFT
- Another concept with lots of applications, scalability, and parallelizability: the Fourier transform
  - Commonly used in physics, signal processing, etc.
  - Oftentimes needs to run in real time
  - Makes great use of the GPU

cuFFT
- Supports 1D, 2D, and 3D Fourier transforms
  - 1D transforms can have up to 128 million elements
- Based on the Cooley-Tukey and Bluestein FFT algorithms
- Similar API to FFTW, if you're familiar with it
- Thread-safe, with streamed, asynchronous execution
- Supports both in-place and out-of-place transforms
- Supports real and complex data, in float and double precision

cuFFT Performance
- (benchmark charts not reproduced here)

cuFFT
- Usage is fairly simple (an end-to-end sketch appears at the very end of these notes)
  - Allocate space on the GPU
    - Same old cudaMalloc() call
  - Create a cuFFT plan
    - Tells cuFFT the dimensions, sizes, and data types
    - cufftPlan3d(&plan, nx, ny, nz, TYPE)
    - TYPE = CUFFT_C2C, CUFFT_C2R, CUFFT_R2C (complex-to-complex, complex-to-real, real-to-complex)

cuFFT
  - Execute the plan
    - cufftExecC2C(plan, in_data, out_data, CUFFT_FORWARD)
    - Replace C2C with your plan type
    - Can replace CUFFT_FORWARD with CUFFT_INVERSE
  - Destroy the plan, clean up data
    - cufftDestroy(plan)
    - cudaFree(in_data), cudaFree(out_data)
- Check out more here:
  - http://docs.nvidia.com/cuda/cufft/index.html

GPU-Accelerated Libraries
- Many more available
  - https://developer.nvidia.com/gpu-accelerated-libraries
  - OpenCV: computer vision library (has GPU-accelerated modules)
  - NPP: performance primitives library; helps with signal/image processing
  - Check them out!
- Best practice for learning:
  - Check out the documentation
  - Check out the examples
  - Modify the example code
  - Repeat the above until familiar, then use it in your own code!
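The end-to-end cuFFT sketch referenced above: a minimal in-place complex-to-complex transform (1D rather than 3D for brevity; the buffer name d_data and size N are our own choices, and error checking is omitted):

    // Minimal cuFFT sketch (illustrative, not from the slides)
    #include <cuda_runtime.h>
    #include <cufft.h>

    int main() {
        const int N = 1024;

        // Allocate space on the GPU (in-place transform: one buffer)
        cufftComplex *d_data;
        cudaMalloc(&d_data, N * sizeof(cufftComplex));
        // ... fill d_data with the input signal, e.g. cudaMemcpy from host ...

        // Create the plan: 1D, N points, complex-to-complex, batch of 1
        cufftHandle plan;
        cufftPlan1d(&plan, N, CUFFT_C2C, 1);

        // Execute: passing the same pointer in and out makes it in-place
        cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);
        // (use CUFFT_INVERSE for the inverse transform; note that cuFFT's
        // inverse is unnormalized -- divide by N to recover the input)

        // Destroy the plan, clean up data
        cufftDestroy(plan);
        cudaFree(d_data);
        return 0;
    }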