CS179: GPU Programming
Lecture 5: Memory

Today
• GPU Memory Overview
• CUDA Memory Syntax
• Tips and tricks for memory handling

Memory Overview
• Very slow access:
  • Between host and device
• Slow access:
  • Global memory, local memory (both live off-chip in device memory)
• Fast access:
  • Shared memory, constant memory, texture memory
• Very fast access:
  • Register memory

Global Memory
• Read/write
• Shared between blocks and grids
• Contents persist across multiple kernel executions
• Very slow to access
• No caching (on older hardware)

Constant Memory
• Read-only in device code
• Cached on each multiprocessor
• Fairly quick
• Cache can broadcast a value to all active threads

Texture Memory
• Read-only in device code
• Cached with 2D locality -- quick access
• Filtering methods available

Shared Memory
• Read/write per block
• Memory is shared within a block
• Generally quick
• Has bad worst cases (bank conflicts)

Local Memory
• Read/write per thread
• Not too fast (stored off-chip, not on the multiprocessor)
• Each thread can only see its own local memory
• Indexable (can hold arrays)

Register Memory
• Read/write per thread
• Extremely fast
• Each thread can only see its own registers
• Not indexable (can't hold arrays)

Syntax: Register Memory
• Default memory type
• Declare as normal -- no special syntax:
  int var = 1;
• Only accessible by the current thread

Syntax: Local Memory
• Per-thread "global" variables: persist across device-function calls within a thread
• No dedicated keyword: the compiler places per-thread automatic variables in local memory when they cannot live in registers (e.g., arrays indexed at runtime, or register spills):
  int var[100]; // likely placed in local memory

Syntax: Shared Memory
• Shared across threads in a block, not across blocks
• Cannot be initialized at declaration; use array syntax for arrays
• Declare with the __device__ __shared__ keywords:
  __device__ __shared__ int var[LEN];
• Can also just use __shared__
• Don't need a compile-time size for arrays: declare them extern and pass the size in bytes as the third kernel-launch parameter:
  extern __shared__ int var[];

Syntax: Global Memory
• Created with cudaMalloc
• Can pass pointers between host and kernel
• Transfer between host and device is slow!
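The declaration syntax from the slides above can be collected into a single kernel sketch. All names here (memorySpacesDemo, g_offset, c_scale, and so on) are illustrative, not part of the lecture:

```cuda
#include <cuda_runtime.h>

__device__ int g_offset = 1;              // global memory: __device__, visible to every thread
__device__ __constant__ int c_scale = 2;  // constant memory: read-only in kernels, cached

__global__ void memorySpacesDemo(float *out) {  // out points to global memory from cudaMalloc
    int idx = threadIdx.x;                // scalar locals normally live in registers

    float scratch[64];                    // large or indexed per-thread arrays may be
    for (int i = 0; i < 64; ++i)          // placed in (slow, off-chip) local memory
        scratch[i] = (float)i;

    __shared__ float tileStatic[128];     // shared memory, size fixed at compile time
    extern __shared__ float tileDyn[];    // shared memory, size chosen at launch time

    tileStatic[idx] = scratch[idx % 64];
    tileDyn[idx] = tileStatic[idx];
    __syncthreads();                      // make shared writes visible to the whole block

    out[idx] = tileDyn[idx] * c_scale + g_offset;
}

// Launch: the third <<<...>>> argument gives the dynamic shared memory size in bytes:
//   memorySpacesDemo<<<1, 128, 128 * sizeof(float)>>>(d_out);
```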
• Declare statically with the __device__ keyword:
  __device__ int var = 1;

Syntax: Constant Memory
• Declare with the __device__ __constant__ keywords:
  __device__ __constant__ int var = 1;
• Can also just use __constant__
• Set using cudaMemcpyToSymbol (or cudaMemcpy):
  cudaMemcpyToSymbol(var, src, count);

Syntax: Texture Memory
• To be discussed later…

Memory Issues
• Each multiprocessor has a set amount of memory
• This limits the number of blocks we can have:
  (# of blocks) x (memory used per block) <= total memory
• Either get lots of blocks each using little memory, or fewer blocks each using lots of memory

Memory Issues
• Register memory is limited!
• Similar to shared memory within a block
• Can have many threads using few registers, or few threads using many registers
• The former is better: more parallelism

Memory Issues
• Global accesses: slow!
• Can be sped up when the accessed memory is contiguous
• Memory coalescing: arranging accesses so they are contiguous
• Coalesced accesses are:
  • Contiguous accesses
  • In-order accesses
  • Aligned accesses

Memory Coalescing: Aligned Accesses
• Threads read 4, 8, or 16 bytes at a time from global memory
• Accesses must be aligned to the access size!
  • Good: 0x00, 0x04, 0x14 (multiples of 4 for 4-byte reads)
  • Bad: 0x07 (not a multiple of the access size)
• Which is worse, reading 16 bytes from 0xABCD0 or 0xABCDE?
  • 0xABCDE: 0xABCD0 is a multiple of 16, so that read is aligned; 0xABCDE is not

Memory Coalescing: Aligned Accesses
• Also bad: a run of accesses whose beginning is unaligned

Memory Coalescing: Aligned Accesses
• Built-in types force alignment
  • float3 (12 B) takes up the same space as float4 (16 B)
  • float3 array elements are not 16-byte aligned!
• To align a struct, use __align__(x) // x = 4, 8, 16
• cudaMalloc aligns the start of each allocation automatically
• cudaMallocPitch aligns the start of each row for 2D arrays

Memory Coalescing: Contiguous Accesses
• Contiguous = the accessed addresses sit together in memory
• Example: non-contiguous memory
  • Threads 3 and 4 swapped accesses!

Memory Coalescing: Contiguous Accesses
• Which is better?
  index = threadIdx.x + blockDim.x * (blockIdx.x + gridDim.x * blockIdx.y);
  index = threadIdx.x + blockDim.y * (blockIdx.y + gridDim.y * blockIdx.x);

Memory Coalescing: Contiguous Accesses
• Case 1: Contiguous accesses

Memory Coalescing: In-order Accesses
• In-order accesses:
  • Do not skip addresses
  • Access addresses in order in memory
• Bad example:
  • Left: address 140 skipped
  • Right: lots of skipped addresses

Memory Coalescing
• Good example:

Memory Coalescing
• Not as much of an issue on newer hardware
• Many restrictions have been relaxed -- e.g., accesses no longer need to be strictly sequential
• However, memory coalescing and alignment are still good practice!

Memory Issues
• Shared memory can also be limiting
• It is broken up into banks
• Optimal when the entire warp reads shared memory together
• Banks:
  • Each bank services only one thread at a time
  • Bank conflict: when two threads try to access the same bank
  • Causes slowdowns in the program!

Bank Conflicts
• Bad: many threads trying to access the same bank

Bank Conflicts
• Good: few to no bank conflicts

Bank Conflicts
• Banks service 32-bit words; addresses map to banks mod 64 bytes (16 banks x 4 bytes)
  • Bank 0 services 0x00, 0x40, 0x80, etc.; bank 1 services 0x04, 0x44, 0x84, etc.
• Want to avoid multiple threads accessing the same bank
  • Keep data spread out
  • Split data elements larger than 4 bytes into multiple accesses
  • Be careful of data elements with an even stride

Broadcasting
• Fast distribution of data to threads
• Happens when an entire warp tries to read the same address
• The value is broadcast to all threads in a single read

Summary
• Best memory management balances memory optimization with parallelism
• Break the problem up into coalesced chunks
• Process data in shared memory, then copy back to global memory
• Remember to avoid bank conflicts!

Next Time
• Texture memory
• CUDA applications in graphics
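As a closing sketch, the coalescing, alignment, and bank-conflict advice from this lecture can be combined in one example: a 2D array allocated with cudaMallocPitch so each row start is aligned, transposed through a shared-memory tile padded by one word so the column reads hit distinct banks. Kernel and variable names are illustrative, and the tile size matches the 16-bank layout in the slides:

```cuda
#include <cuda_runtime.h>

#define TILE 16

// Each thread block stages a TILE x TILE tile in shared memory.
// The +1 pad shifts each row by one word, so the column reads below
// land in different banks instead of all hitting the same one.
__global__ void transposePitched(const float *in, size_t inPitch,
                                 float *out, size_t outPitch) {
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;

    // Coalesced read: threads in a warp (varying threadIdx.x) touch
    // consecutive global addresses within one aligned row
    const float *inRow = (const float *)((const char *)in + y * inPitch);
    tile[threadIdx.y][threadIdx.x] = inRow[x];
    __syncthreads();

    // Transposed but still-coalesced write; the shared-memory column
    // read is conflict-free thanks to the padding
    int tx = blockIdx.y * TILE + threadIdx.x;
    int ty = blockIdx.x * TILE + threadIdx.y;
    float *outRow = (float *)((char *)out + ty * outPitch);
    outRow[tx] = tile[threadIdx.x][threadIdx.y];
}

int main(void) {
    float *d_in, *d_out;
    size_t inPitch, outPitch;
    // cudaMallocPitch pads each row so every row start is aligned;
    // index with the returned pitch (in bytes), not the logical width
    cudaMallocPitch((void **)&d_in, &inPitch, 64 * sizeof(float), 64);
    cudaMallocPitch((void **)&d_out, &outPitch, 64 * sizeof(float), 64);

    dim3 block(TILE, TILE);
    dim3 grid(64 / TILE, 64 / TILE);
    transposePitched<<<grid, block>>>(d_in, inPitch, d_out, outPitch);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```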