CS179: GPU Programming
Lecture 8: More CUDA Runtime

Today
- CUDA arrays for textures
- CUDA runtime
- Helpful CUDA functions

CUDA Arrays
- Recall texture memory:
  - Used to store large data
  - Stored on the GPU
  - Accessible to all blocks and threads

CUDA Arrays
- We used texture memory for buffers (lab 3)
  - Allows vertex data to remain on the GPU
- How else can we access texture memory? CUDA arrays

CUDA Arrays
- Why CUDA arrays over normal arrays?
  - Better caching, including 2D caching (spatial locality)
  - Support wrapping/clamping
  - Support filtering

CUDA Linear Textures
- "Textures," but in global memory
- Usage (a consolidated sketch of these steps appears at the end of this texture section):
  - Step 1: Create a texture reference
    - texture<TYPE> tex; where TYPE = float, float3, int, etc.
  - Step 2: Bind memory to the texture reference
    - cudaBindTexture(offset, tex, devPtr, size);
  - Step 3: Read data on the device via tex1Dfetch
    - tex1Dfetch(tex, x); where x is the element index we want to read
  - Step 4: Clean up when finished
    - cudaUnbindTexture(tex);

CUDA Linear Textures
- Texture reference properties: texture<Type, Dim, ReadMode>
  - Type = float, int, float3, etc.
  - Dim = number of dimensions (1, 2, or 3)
  - ReadMode:
    - cudaReadModeElementType: standard read
    - cudaReadModeNormalizedFloat: maps integers to floats, e.g. 0 -> 0.0 and 255 -> 1.0

CUDA Linear Textures
- Important warning:
  - Textures live in a global memory space
  - Threads can read and write the same texture memory at the same time
  - This can cause synchronization problems!
  - Do not rely on thread execution order, ever

CUDA Linear Textures
- Other limitations:
  - 1D only, which can make indexing and caching less convenient
  - Pitch may not be ideal for a 2D array
  - Not read-write
- Solution: CUDA arrays

CUDA Arrays
- Live in the texture memory space
- Accessed via texture fetches

CUDA Arrays
- Step 1: Create a channel description
  - Tells CUDA the texture's attributes
  - cudaCreateChannelDesc(int x, int y, int z, int w, enum cudaChannelFormatKind f)
    - x, y, z, w are the number of bits per component
    - f is cudaChannelFormatKindFloat, etc.

CUDA Arrays
- Step 2: Allocate memory
  - Must be done dynamically, with
    cudaMallocArray(cudaArray **array, const cudaChannelFormatDesc *desc, size_t width, size_t height)
  - Most global memory functions work with CUDA arrays too (cudaMemcpyToArray, etc.)

CUDA Arrays
- Step 3: Create a texture reference
  - texture<TYPE, dim, mode> texRef; -- just as before
  - Parameters must match the channel description where applicable
- Step 4: Edit texture settings
  - Settings are encoded as members of the texRef struct

CUDA Arrays
- Step 5: Bind the texture reference to the array
  - cudaBindTextureToArray(texRef, array)
- Step 6: Access the texture (see the second sketch below)
  - Similar to before, but now we have more options:
    - tex1D(texRef, x)
    - tex2D(texRef, x, y)

CUDA Arrays
- Final notes:
  - Coordinates can be normalized to [0, 1] in float mode
  - Filter modes: nearest point or linear
    - Tells CUDA how to blend the texture
  - Wrap vs. clamp:
    - Wrap: out-of-bounds accesses wrap around to the other side
      - Ex.: (1.5, 0.5) -> (0.5, 0.5)
    - Clamp: out-of-bounds accesses are clamped to the border value
      - Ex.: (1.5, 0.5) -> (1.0, 0.5)

CUDA Arrays
[Figure: point sampling vs. linear sampling]

CUDA Arrays
[Figure: wrap vs. clamp]
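The four linear-texture steps above fit together as follows. This is a minimal sketch, not lecture code: the names tex, copyKernel, d_data, and d_out are illustrative, and it uses the legacy texture-reference API current at the time of this lecture.

    #include <cuda_runtime.h>

    // Step 1: texture reference, declared at file scope.
    texture<float> tex;   // defaults: 1D, cudaReadModeElementType

    __global__ void copyKernel(float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = tex1Dfetch(tex, i);   // Step 3: fetch by element index
    }

    void runCopy(float *d_data, float *d_out, int n) {
        // Step 2: bind the device pointer to the texture reference.
        size_t offset;
        cudaBindTexture(&offset, tex, d_data, n * sizeof(float));

        copyKernel<<<(n + 255) / 256, 256>>>(d_out, n);

        // Step 4: clean up once we are done reading through the texture.
        cudaUnbindTexture(tex);
    }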
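And a sketch of the six CUDA-array steps under the same caveats: texRef2D, readKernel, runRead, and d_array are made-up names, and the settings chosen in Step 4 are just one possible configuration.

    #include <cuda_runtime.h>

    // Step 3: texture reference matching the channel description below.
    texture<float, 2, cudaReadModeElementType> texRef2D;

    __global__ void readKernel(float *out, int w, int h) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < w && y < h)
            out[y * w + x] = tex2D(texRef2D, x + 0.5f, y + 0.5f);  // Step 6
    }

    void runRead(const float *h_data, float *d_out, int w, int h) {
        // Step 1: channel description -- one 32-bit float component.
        cudaChannelFormatDesc desc =
            cudaCreateChannelDesc(32, 0, 0, 0, cudaChannelFormatKindFloat);

        // Step 2: allocate the CUDA array and copy data into it.
        cudaArray *d_array;
        cudaMallocArray(&d_array, &desc, w, h);
        cudaMemcpyToArray(d_array, 0, 0, h_data, w * h * sizeof(float),
                          cudaMemcpyHostToDevice);

        // Step 4: texture settings are members of the texRef struct.
        texRef2D.addressMode[0] = cudaAddressModeClamp;   // clamp in x
        texRef2D.addressMode[1] = cudaAddressModeClamp;   // clamp in y
        texRef2D.filterMode     = cudaFilterModeLinear;   // blend texels
        texRef2D.normalized     = false;                  // coords in [0, w), [0, h)

        // Step 5: bind, launch, unbind, free.
        cudaBindTextureToArray(texRef2D, d_array);
        dim3 block(16, 16), grid((w + 15) / 16, (h + 15) / 16);
        readKernel<<<grid, block>>>(d_out, w, h);
        cudaUnbindTexture(texRef2D);
        cudaFreeArray(d_array);
    }

Switching addressMode to cudaAddressModeWrap, or filterMode to cudaFilterModePoint, gives the wrap and point-sampling behaviors pictured above.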
CUDA Runtime
- Nothing new here: every cuda____ function is part of the runtime
- Lots of other helpful functions
- Many runtime functions exist to make your program robust
  - Check properties of the card, set up multiple GPUs, etc.
  - Necessary for multi-platform development!

CUDA Runtime
- Starting the runtime:
  - Simply call any cuda_____ function!
- CUDA can waste a lot of resources
  - Stop CUDA with cudaThreadExit()
  - Called automatically on CPU exit, but you may want to call it earlier

CUDA Runtime
- Getting devices and properties:
  - cudaGetDeviceCount(int *n);
    - Returns the number of CUDA-capable devices
    - Can be used to check whether the machine is CUDA-capable!
  - cudaSetDevice(int n);
    - Makes device n the currently used device
  - cudaGetDeviceProperties(cudaDeviceProp *prop, int n);
    - Loads data about device n into prop

Device Properties
- char name[256]: ASCII identifier of the GPU
- size_t totalGlobalMem: total global memory available
- size_t sharedMemPerBlock: shared memory available per block
- int regsPerBlock: number of registers available per block
- int warpSize: size of our warps
- size_t memPitch: maximum pitch allowed for array allocation
- int maxThreadsPerBlock: maximum number of threads per block
- int maxThreadsDim[3]: maximum dimensions of a block

Device Properties
- int maxGridSize[3]: maximum grid dimensions
- size_t totalConstMem: total available constant memory
- int major, int minor: major and minor compute capability versions
- int clockRate: clock rate of the device in kHz
- size_t textureAlignment: memory alignment required for textures
- int deviceOverlap: does this device allow memory copying while a kernel is running? (0 = no, 1 = yes)
- int multiProcessorCount: number of multiprocessors on the device

Device Properties
- Uses? (see the first sketch at the end of this section)
  - Get actual values for memory limits instead of guessing
  - Make a program portable across multiple systems
  - Pick the best device

Device Properties
- Getting the best device:
  - Pick a metric (e.g., most multiprocessors could be good)

    int num_devices, device;
    cudaGetDeviceCount(&num_devices);
    if (num_devices > 1) {
        int max_mp = 0, best_device = 0;
        for (device = 0; device < num_devices; device++) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, device);
            int mp_count = prop.multiProcessorCount;
            if (mp_count > max_mp) {
                max_mp = mp_count;
                best_device = device;
            }
        }
        cudaSetDevice(best_device);
    }

Device Properties
- We can also use this to launch on multiple GPUs (see the second sketch below)
  - Each GPU must have its own host thread
  - Multithread on the CPU; each thread calls a different device
  - Set the device for a thread using cudaSetDevice(n);
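A minimal sketch of the first two uses above: query the device instead of guessing, then size a launch from the queried limits. The launch-sizing choice at the end is a made-up example, not lecture code.

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        int n;
        cudaGetDeviceCount(&n);
        if (n == 0) { printf("No CUDA-capable device found\n"); return 1; }

        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);   // query device 0
        printf("%s: %d multiprocessors, %zu bytes shared mem/block, "
               "max %d threads/block\n",
               prop.name, prop.multiProcessorCount,
               prop.sharedMemPerBlock, prop.maxThreadsPerBlock);

        // Size the launch from the queried limit instead of hard-coding.
        int threads = prop.maxThreadsPerBlock < 512 ? prop.maxThreadsPerBlock : 512;
        printf("Launching with %d threads per block\n", threads);
        return 0;
    }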
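And a sketch of the one-host-thread-per-GPU pattern just described, assuming C++11 threads are available; work and dummyKernel are hypothetical names for illustration.

    #include <thread>
    #include <vector>
    #include <cuda_runtime.h>

    __global__ void dummyKernel(float *data) {
        data[threadIdx.x] *= 2.0f;   // trivial per-device work
    }

    void work(int device) {
        cudaSetDevice(device);            // bind this host thread to one GPU
        float *d_data;
        cudaMalloc(&d_data, 64 * sizeof(float));
        cudaMemset(d_data, 0, 64 * sizeof(float));
        dummyKernel<<<1, 64>>>(d_data);
        cudaThreadSynchronize();          // wait for this device to finish
        cudaFree(d_data);
    }

    int main() {
        int n;
        cudaGetDeviceCount(&n);
        std::vector<std::thread> threads;
        for (int i = 0; i < n; i++)
            threads.emplace_back(work, i); // one host thread per device
        for (auto &t : threads)
            t.join();
        return 0;
    }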
CUDA Runtime Synchronization
- Note: most calls to the GPU/CUDA are asynchronous
  - Some are synchronous (usually things dealing with memory)
- You can force synchronization with cudaThreadSynchronize()
  - Blocks until the device has finished all preceding work
  - Good for error checking, timing, etc.

CUDA Events
- Great for timing!
- Place event markers in CUDA code to measure elapsed time
- Example code:

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);

    // DO SOME GPU CODE HERE

    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);   // wait until the stop event is reached
    float elapsed_time;
    cudaEventElapsedTime(&elapsed_time, start, stop);   // time in ms

CUDA Streams
- Streams manage concurrency and ordering
  - Ex.: call malloc, then kernel 1, then kernel 2, etc.
- Calls in different streams are asynchronous!
  - We don't know when each stream is where in the code

Using Streams
- Create a stream:
  - cudaStreamCreate(cudaStream_t *stream)
- Copy memory using async calls:
  - cudaMemcpyAsync(…, cudaStream_t stream)
- Pass to a kernel as another launch parameter:
  - kernel<<<gridDim, blockDim, sMem, stream>>>
- Query whether a stream is done:
  - cudaStreamQuery(cudaStream_t stream)
  - Returns cudaSuccess if the stream is done, cudaErrorNotReady otherwise
- Block the process until a stream is done:
  - cudaStreamSynchronize(cudaStream_t stream)
- Destroy a stream and clean up:
  - cudaStreamDestroy(cudaStream_t stream)

Using Streams
- Example:

    cudaStream_t stream[2];
    for (int i = 0; i < 2; ++i)
        cudaStreamCreate(&stream[i]);
    for (int i = 0; i < 2; ++i)
        cudaMemcpyAsync(inputDevPtr + i * size, hostPtr + i * size,
                        size, cudaMemcpyHostToDevice, stream[i]);
    for (int i = 0; i < 2; ++i)
        myKernel<<<100, 512, 0, stream[i]>>>(outputDevPtr + i * size,
                                             inputDevPtr + i * size, size);
    for (int i = 0; i < 2; ++i)
        cudaMemcpyAsync(hostPtr + i * size, outputDevPtr + i * size,
                        size, cudaMemcpyDeviceToHost, stream[i]);
    cudaThreadSynchronize();
    for (int i = 0; i < 2; ++i)    // clean up the streams when done
        cudaStreamDestroy(stream[i]);

Next Time
- Lab 4
- Recitation:
  - 3D Textures
  - Pixel Buffer Objects (PBOs)
  - Fractals!