CS179: GPU Programming
Lecture 10: GPU-Accelerated Libraries

Today
- Some useful libraries:
  - cuRAND
  - cuBLAS
  - cuFFT

cuRAND
- Oftentimes, we want random data
  - Simulations often need entropy to behave realistically
- How do we obtain it on the GPU?
  - There is no rand(), or simple equivalent, on the device
- We could use a pseudo-random function with inputs based on thread properties
  - Ex.: int i = cos(999 * threadIdx.x + 123 * threadIdx.y);
  - Works okay, but not great

cuRAND
- What we could do with our current tools:
  - Generate N random numbers on the CPU
  - Allocate space on the GPU
  - Memcpy to the GPU
- Not bad -- if we only need to do this once
- Issues:
  - Number generation is synchronous
  - Memcpy can be slow
- Much more ideal if the random data can live entirely on the GPU

cuRAND
- Solution: cuRAND, the CUDA random number library
  - Works on both host and device
  - Lots of different distributions
    - Uniform, normal, log-normal, Poisson, etc.

cuRAND Performance
- (benchmark charts not reproduced here)

cuRAND Host API
- Using cuRAND on the host:
  - Called from host code
  - Allocates memory on the GPU
  - Generates random numbers on the GPU
- Several pseudorandom generators available
- Several random distributions available

cuRAND Host API
- Functions to know (see the sketch after this list):
  - curandCreateGenerator(&g, GEN_TYPE)
    - GEN_TYPE = CURAND_RNG_PSEUDO_DEFAULT, CURAND_RNG_PSEUDO_XORWOW, ...
    - The choice doesn't particularly matter; the differences are small
  - curandSetPseudoRandomGeneratorSeed(g, SEED)
    - Again, SEED doesn't matter too much; just pick one (ex.: time(NULL))
  - curandGenerate______(...)
    - Depends on the distribution
    - Ex.: curandGenerate(g, dst, n) (random unsigned ints), curandGenerateUniform(g, dst, n), curandGenerateNormal(g, dst, n, mean, stddev)
  - curandDestroyGenerator(g)
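The calls above fit together roughly as follows. This is a minimal sketch, assuming a uniform distribution; the buffer name d_rand and size N are our own choices, and error checking is omitted:

    // Minimal cuRAND host API sketch (illustrative, not from the slides)
    #include <cuda_runtime.h>
    #include <curand.h>
    #include <time.h>

    int main() {
        const size_t N = 1 << 20;     // N is an arbitrary example size
        float *d_rand;
        cudaMalloc(&d_rand, N * sizeof(float));  // device output buffer

        curandGenerator_t g;
        curandCreateGenerator(&g, CURAND_RNG_PSEUDO_DEFAULT);
        curandSetPseudoRandomGeneratorSeed(g, time(NULL));

        // Fill d_rand with N uniform floats in (0, 1], generated on the GPU
        curandGenerateUniform(g, d_rand, N);

        // ... launch kernels that consume d_rand ...

        curandDestroyGenerator(g);
        cudaFree(d_rand);
        return 0;
    }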
cuRAND Host API
- curandGenerate() launches asynchronously
  - Much faster than serial CPU generation
- However, all of the random data must be generated and stored up front, in one big buffer
  - (With a generator created via curandCreateGeneratorHost, the results land in host memory and must additionally be copied to the GPU)
  - Introduces some undesired overhead
  - Might need more memory than we can pass in one go
- Solution: the cuRAND device API

cuRAND Device API
- Supports RNG inside kernels
  - No need to generate all the random data before the kernel launches
  - We don't have to copy and store all of the data at once
- Stores the RNG states entirely on the GPU
  - The host still needs to allocate device memory for the states

cuRAND Device API
- Example (host side):

    curandState *devStates;
    cudaMalloc(&devStates, sizeof(curandState) * nThreads);
    kernel<<<gD, bD, sM>>>(devStates, ...);
    cudaFree(devStates);  // don't forget to free!

- Example continued (device side; requires #include <curand_kernel.h>; the parameters v, a, b, and seed are passed in for illustration):

    __global__ void kernel(curandState *states, float *v,
                           float a, float b, unsigned long long seed) {
        int id = blockIdx.x * blockDim.x + threadIdx.x;  // calculate thread id
        curand_init(seed, id, 0, &states[id]);
        // generate a random value in (0, 1]
        v[id] = curand_uniform(&states[id]);
        // transform to a random value in (a, b]
        v[id] = v[id] * (b - a) + a;
    }

cuRAND Device API
- Note the difference between cuRAND states and the actual values
  - A state tracks where a thread is in its pseudorandom sequence
  - Numbers aren't produced until curand_<DISTRIBUTION>(&state) is called

cuRAND Overview
- Can generate numbers on either host or device
- Whether generating on host or device, the host must allocate the device memory
- Many different random seeds and distributions available
- Check out these for more details:
  - http://docs.nvidia.com/cuda/curand/host-api-overview.html
  - http://docs.nvidia.com/cuda/curand/device-api-overview.html

cuBLAS
- Linear algebra is extremely important in many applications
  - Physics, engineering, mathematics, computer graphics, networking, ...
  - Anything STEM, really
- Linear algebra systems are oftentimes HUGE
  - Ex.: inverting a matrix of size 10^6 x 10^6 would take a while on a CPU...
- Linear algebra systems are oftentimes parallelizable
  - Element a[0][0] doesn't care about what a[1][0] will be, just what it was
- Linear algebra is a perfect candidate for the GPU

cuBLAS
- cuBLAS: CUDA's linear algebra library
- Based on BLAS (Basic Linear Algebra Subprograms)
  - Supports all 152 standard BLAS routines
  - Works pretty similarly to BLAS

cuBLAS Performance
- (benchmark charts not reproduced here)

cuBLAS
- Several levels of BLAS:
  - BLAS1: handles vector and vector-vector functions
    - Sum, min, max, etc.
    - Add, scale, dot, etc.
  - BLAS2: handles matrix-vector functions
    - Multiplication, generally
  - BLAS3: handles matrix-matrix functions
    - Multiplication, addition, etc.

cuBLAS
- Using it is fairly simple (a minimal sketch follows)
  - Call initialization before any cuBLAS calls: cublasInit()
  - Call whatever routines you need (from host code; they launch GPU kernels for you)
  - Call shutdown once you're done with cuBLAS: cublasShutdown()
- Check out the following for more info:
  - http://docs.nvidia.com/cuda/cublas/index.html
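Here is a minimal sketch of that lifecycle, using the legacy cuBLAS API named on this slide (the newer handle-based API uses cublasCreate()/cublasDestroy() instead). A BLAS1 scal call stands in for "whatever routines you need"; the names h_x, d_x, and N are our own, and error checking is omitted:

    // Minimal legacy-cuBLAS sketch (illustrative, not from the slides)
    #include <stdio.h>
    #include <cublas.h>

    int main() {
        const int N = 1024;
        float h_x[N];
        for (int i = 0; i < N; i++) h_x[i] = 1.0f;

        cublasInit();  // initialize before using cuBLAS

        float *d_x;
        cublasAlloc(N, sizeof(float), (void **)&d_x);       // device vector
        cublasSetVector(N, sizeof(float), h_x, 1, d_x, 1);  // host -> device

        cublasSscal(N, 2.0f, d_x, 1);  // BLAS1: x = 2 * x, runs on the GPU

        cublasGetVector(N, sizeof(float), d_x, 1, h_x, 1);  // device -> host
        printf("x[0] = %f\n", h_x[0]);  // expect 2.0

        cublasFree(d_x);
        cublasShutdown();  // shut down once done with cuBLAS
        return 0;
    }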
cuBLAS
- Alternative: cuSPARSE
  - Another CUDA linear algebra library
  - Generally works well when dealing with sparse matrices (most entries are 0)
  - Works pretty well even with dense vectors

cuFFT
- Another concept with lots of applications, scalability, and parallelizability: the Fourier transform
  - Commonly used in physics, signal processing, etc.
  - Oftentimes needs to run in real time
  - Makes great use of the GPU

cuFFT
- Supports 1D, 2D, and 3D Fourier transforms
  - 1D transforms can have up to 128 million elements
- Based on the Cooley-Tukey and Bluestein FFT algorithms
- Similar API to FFTW, if you're familiar with it
- Thread-safe, with streamed, asynchronous execution
- Supports both in-place and out-of-place transforms
- Supports real and complex data, in float and double precision

cuFFT Performance
- (benchmark charts not reproduced here)

cuFFT
- Usage is fairly simple (an end-to-end sketch appears at the very end of these notes)
  - Allocate space on the GPU
    - Same old cudaMalloc() call
  - Create a cuFFT plan
    - Tells cuFFT the dimensions, sizes, and data types
    - cufftPlan3d(&plan, nx, ny, nz, TYPE)
    - TYPE = CUFFT_C2C, CUFFT_C2R, CUFFT_R2C (complex-to-complex, complex-to-real, real-to-complex)

cuFFT
  - Execute the plan
    - cufftExecC2C(plan, in_data, out_data, CUFFT_FORWARD)
    - Replace C2C with your plan type
    - Can replace CUFFT_FORWARD with CUFFT_INVERSE
  - Destroy the plan, clean up data
    - cufftDestroy(plan)
    - cudaFree(in_data), cudaFree(out_data)
- Check out more here:
  - http://docs.nvidia.com/cuda/cufft/index.html

GPU-Accelerated Libraries
- Many more available
  - https://developer.nvidia.com/gpu-accelerated-libraries
  - OpenCV: computer vision library (has GPU-accelerated modules)
  - NPP: performance primitives library; helps with signal/image processing
  - Check them out!
- Best practice for learning:
  - Check out the documentation
  - Check out the examples
  - Modify the example code
  - Repeat the above until familiar, then use it in your own code!
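The end-to-end cuFFT sketch referenced above: a minimal in-place complex-to-complex transform (1D rather than 3D for brevity; the buffer name d_data and size N are our own choices, and error checking is omitted):

    // Minimal cuFFT sketch (illustrative, not from the slides)
    #include <cuda_runtime.h>
    #include <cufft.h>

    int main() {
        const int N = 1024;

        // Allocate space on the GPU (in-place transform: one buffer)
        cufftComplex *d_data;
        cudaMalloc(&d_data, N * sizeof(cufftComplex));
        // ... fill d_data with the input signal, e.g. cudaMemcpy from host ...

        // Create the plan: 1D, N points, complex-to-complex, batch of 1
        cufftHandle plan;
        cufftPlan1d(&plan, N, CUFFT_C2C, 1);

        // Execute: passing the same pointer in and out makes it in-place
        cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);
        // (use CUFFT_INVERSE for the inverse transform; note that cuFFT's
        // inverse is unnormalized -- divide by N to recover the input)

        // Destroy the plan, clean up data
        cufftDestroy(plan);
        cudaFree(d_data);
        return 0;
    }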