Programming Massively Parallel
Processors
Lecture Slides for Chapter 5:
CUDA Memories
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2010
ECE408, University of Illinois, Urbana Champaign
Importance of Memory Access Efficiency
• So far, we have learned to write kernel functions that are executed by a massive number of threads. The data is first transferred from the host memory to the device global memory. The threads then access their portion of the data from the global memory using their block IDs and thread IDs. However, such a kernel is likely to achieve only a small fraction of the potential speed of the underlying hardware, because global memory (implemented with DRAM) tends to have long access latencies (hundreds of clock cycles) and finite access bandwidth.
• Although having many threads available for execution can theoretically tolerate long memory access latencies, one can easily run into a situation where traffic congestion in the global memory access paths prevents all but very few threads from making progress, thus rendering some of the SMs idle.
• To circumvent such congestion, CUDA provides a number of additional methods for accessing memory that can remove the majority of data requests to the global memory.
CGMA Ratio from an Example
• Inner product calculation for matrix multiplication:
  for (int k = 0; k < Width; ++k)
      Pvalue += d_M[Row*Width + k] * d_N[k*Width + Col];
• In every iteration of this loop, two global memory accesses are performed for one floating-point multiplication and one floating-point addition. One global memory access fetches a d_M[] element and the other fetches a d_N[] element. The ratio of floating-point calculations to global memory access operations is thus 1:1. We refer to this ratio as the compute to global memory access (CGMA) ratio, defined as the number of floating-point calculations performed for each access to the global memory within a region of a CUDA program.
• In a high-end device today, the global memory bandwidth is around 200 GB/s. With 4 bytes in each single-precision floating-point value, one can expect to load no more than 50 (200/4) giga single-precision operands per second. With a CGMA of 1, this kernel will execute no more than 50 giga floating-point operations per second (GFLOPS), only a fraction of the peak single-precision performance of 1,500 GFLOPS or higher of high-end devices. A CGMA of about 30 is needed to reach peak performance.
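The bound can be worked out directly. A minimal sketch in plain C, using the assumed numbers from this slide (200 GB/s bandwidth, 1,500 GFLOPS peak):

  #include <stdio.h>

  /* Back-of-the-envelope estimate: achievable GFLOPS <= CGMA * bandwidth / bytes per operand. */
  int main(void) {
      double bandwidth_GBps  = 200.0;   /* assumed global memory bandwidth (GB/s) */
      double bytes_per_value = 4.0;     /* single-precision float */
      double cgma            = 1.0;     /* one FLOP per global memory access */
      double peak_gflops     = 1500.0;  /* assumed peak single-precision rate */

      double memory_bound = cgma * bandwidth_GBps / bytes_per_value;
      printf("Memory-bound limit: %.1f GFLOPS\n", memory_bound);        /* 50 GFLOPS */
      printf("CGMA needed for peak: %.1f\n",
             peak_gflops / (bandwidth_GBps / bytes_per_value));         /* 30 */
      return 0;
  }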
G80 Implementation of CUDA Memories
• Each thread can:
– Read/write per-thread registers
– Read/write per-thread local memory
– Read/write per-block shared memory
– Read/write per-grid global memory
– Read-only per-grid constant memory
[Figure: CUDA device memory model — each block has its own shared memory and per-thread registers; all blocks in the grid share the global and constant memories, which the host can also access.]
By declaring CUDA variables in one of the CUDA memory types, a CUDA programmer dictates the visibility and access speed of the variables.
Register and Global Memory Types
• Registers reside on the processor chip, which implies very short access latency and high access bandwidth. In a typical device, the aggregated access bandwidth of the registers is about two orders of magnitude higher than that of the global memory. Accessing a register variable does not consume off-chip global memory bandwidth, which is reflected as an increase in the CGMA ratio.
• In addition, each access to a register involves fewer instructions than an access to global memory. The energy consumed for accessing a value from a register is at least an order of magnitude lower than for accessing a value from the global memory.
• However, the number of registers available to each thread is very limited. Do not oversubscribe this resource.
• Global memory provides no efficient way to synchronize threads from different blocks or to ensure data consistency across threads, other than terminating the current kernel execution. Therefore, global variables are often used to pass information from one kernel invocation to the next.
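As an illustration of that last point, here is a minimal sketch (hypothetical kernel and buffer names) of two kernel launches communicating through a global memory buffer; the kernel boundary guarantees that the first kernel's global writes are visible to the second:

  __global__ void producerKernel(float* d_buf, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) d_buf[i] = (float)i;           // write results to global memory
  }

  __global__ void consumerKernel(const float* d_buf, float* d_out, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) d_out[i] = 2.0f * d_buf[i];    // read what the previous kernel produced
  }

  void runPipeline(int n) {
      float *d_buf, *d_out;
      cudaMalloc(&d_buf, n * sizeof(float));
      cudaMalloc(&d_out, n * sizeof(float));
      int threads = 256, blocks = (n + threads - 1) / threads;
      producerKernel<<<blocks, threads>>>(d_buf, n);
      consumerKernel<<<blocks, threads>>>(d_buf, d_out, n);   // sees producerKernel's results
      cudaFree(d_buf); cudaFree(d_out);
  }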
Shared Memory and Registers
• Although both are on-chip memories, they differ significantly in functionality and cost of access. Shared memory is designed as part of the memory space and resides on the processor chip. When the processor accesses shared memory, it needs to perform a memory load operation, just like accessing data in the global memory. However, because shared memory resides on-chip, it can be accessed with much lower latency and much higher bandwidth than global memory.
• All threads in a block share the shared memory. It is designed to support efficient, high-bandwidth sharing of data among threads in a block. The threads of a block are spread across the SPs of an SM, and these SPs can simultaneously access the shared memory contents to support data sharing among the threads in the block.
• Thus, shared memory is used to hold the portion of global memory data that is heavily used in an execution phase of a kernel.
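A minimal sketch of that idea (hypothetical kernel), in which each block stages its portion of global memory in shared memory so threads can use one another's loads:

  #define BLOCK_SIZE 256

  __global__ void blockReverse(const float* d_in, float* d_out, int n) {
      __shared__ float tile[BLOCK_SIZE];                  // one element per thread
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      tile[threadIdx.x] = (i < n) ? d_in[i] : 0.0f;       // cooperative load into shared memory
      __syncthreads();                                    // make all loads visible to the block
      int j = blockDim.x - 1 - threadIdx.x;               // read an element loaded by another thread
      if (i < n) d_out[i] = tile[j];
  }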
CUDA Variable Type Qualifiers
Variable declaration                       Memory     Scope      Lifetime
__device__ __local__    int LocalVar;      register   thread     thread
__device__ __shared__   int SharedVar;     shared     block      kernel
__device__              int GlobalVar;     global     all grids  application
__device__ __constant__ int ConstantVar;   constant   all grids  application
• __device__ is optional when used with __local__, __shared__, or __constant__
• Constant variables must be declared outside of any function body. Constant memory is usually cached for fast access, and its total size is typically 65,536 bytes.
• Automatic variables without any qualifier reside in a register
– Except per-thread arrays, which reside in global memory and are private to each thread. However, they are rarely used.
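A minimal sketch (illustrative names) showing the qualifiers from the table in use:

  __constant__ float coeffs[16];              // constant memory: all grids, application lifetime
  __device__   float globalScale;             // global memory: all grids, application lifetime

  __global__ void exampleKernel(float* d_data) {
      __shared__ float tile[256];             // shared memory: per block, kernel lifetime
      float t = d_data[threadIdx.x];          // automatic scalar: resides in a register
      tile[threadIdx.x] = t * coeffs[0] * globalScale;
      __syncthreads();
      d_data[threadIdx.x] = tile[threadIdx.x];
  }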
Where to Declare Variables?
• Can the host access it?
– Yes → global and constant variables: declare outside of any function
– No → registers (automatic variables) and shared variables: declare in the kernel
Variable Type Restrictions
• Pointers can only point to memory allocated or
declared in global memory:
– Allocated in the host and passed to the kernel:
__global__ void KernelFunc(float* ptr)
– Obtained as the address of a global variable:
float* ptr = &GlobalVar;
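A minimal sketch of both cases (hypothetical host function):

  __device__ float GlobalVar;                      // a variable in global memory

  __global__ void KernelFunc(float* ptr) {         // (1) pointer allocated on the host and passed in
      float* p = &GlobalVar;                       // (2) address of a global variable
      *p = ptr[threadIdx.x];
  }

  void hostSide(int n) {
      float* d_ptr;
      cudaMalloc(&d_ptr, n * sizeof(float));       // allocate global memory from the host
      KernelFunc<<<1, n>>>(d_ptr);                 // pass the device pointer to the kernel
      cudaFree(d_ptr);
  }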
A Common Programming Strategy
• Global memory resides in device memory (DRAM)
- much slower access than shared memory
• So, a profitable way of performing computation on
the device is to tile data / partition the data to take
advantage of fast shared memory:
– Partition data into subsets that fit into shared memory
– Handle each data subset with one thread block by:
• Loading the subset from global memory to shared memory,
using multiple threads to exploit memory-level parallelism
• Performing the computation on the subset from shared
memory; each thread can efficiently make multiple passes
over any data element
• Copying results from shared memory to global memory
A Common Programming Strategy
(Cont.)
• Constant memory also resides in device memory (DRAM) - much slower access than shared memory
– But… cached!
– Highly efficient access for read-only data
• Carefully divide data according to access patterns
– Read-only → constant memory (very fast parallel access if in cache)
– Read/write shared within a block → shared memory (very fast)
– Read/write within each thread → registers (very fast)
– Read/write inputs/results → global memory (very slow)
For texture memory usage, see the NVIDIA documentation.
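A minimal sketch (hypothetical filter example) of placing read-only data in constant memory and filling it from the host with cudaMemcpyToSymbol:

  __constant__ float d_filter[9];                   // read-only, cached, visible to all threads

  __global__ void applyFilter(const float* d_in, float* d_out, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) d_out[i] = d_in[i] * d_filter[0];  // every thread reads the cached constant
  }

  void setupFilter(const float* h_filter) {
      // Copy host data into the constant-memory symbol before launching the kernel.
      cudaMemcpyToSymbol(d_filter, h_filter, 9 * sizeof(float));
  }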
GPU Atomic Integer Operations
• Atomic operations on integers in global memory:
– Associative operations on signed/unsigned ints
– add, sub, min, max, ...
– and, or, xor
– increment, decrement
– exchange, compare-and-swap
• Requires hardware with compute capability 1.1
and above.
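A minimal sketch (hypothetical counting kernel) of one such operation, atomicAdd on an integer in global memory:

  __global__ void countEvens(const int* d_vals, int n, unsigned int* d_count) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n && d_vals[i] % 2 == 0)
          atomicAdd(d_count, 1u);   // race-free read-modify-write on global memory
  }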
Matrix Multiplication using
Shared Memory
Review: Matrix Multiplication
Kernel using Multiple Blocks
__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
// Calculate the row index of the Pd element and M
int Row = blockIdx.y*TILE_WIDTH + threadIdx.y;
// Calculate the column index of Pd and N
int Col = blockIdx.x*TILE_WIDTH + threadIdx.x;
float Pvalue = 0;
// each thread computes one element of the block sub-matrix
for (int k = 0; k < Width; ++k)
Pvalue += Md[Row*Width+k] * Nd[k*Width+Col];
Pd[Row*Width+Col] = Pvalue;
}
How about performance on G80?
• All threads access global memory for their input matrix elements
– Two memory accesses (8 bytes) per floating-point multiply-add
– 4 B of memory bandwidth per FLOP
– 4 * 346.5 = 1,386 GB/s required to achieve the peak FLOP rating
– 86.4 GB/s limits the code to 21.6 GFLOPS
• The actual code runs at about 15 GFLOPS
• Need to drastically cut down memory accesses to get closer to the peak 346.5 GFLOPS
Idea: Use Shared Memory to reuse
global memory data
• Each input element is read by Width threads (e.g., all threads computing the same row of P)
• Load each element into shared memory and have several threads use the local copy to reduce the memory bandwidth
– Tiled algorithms
[Figure: The WIDTH x WIDTH matrices M, N, and P, with thread (tx, ty) reading a row of M and a column of N.]
Tiled Multiply
• Break up the execution of the kernel into phases so that the data accesses in each phase are focused on one subset (tile) of Md and Nd
[Figure: Md, Nd, and Pd divided into TILE_WIDTH x TILE_WIDTH tiles; block (bx, by), thread (tx, ty) computes one element of the Pdsub tile of Pd.]
A Small Example
[Figure: A small example — a block computes the 2x2 tile Pd(0,0)…Pd(1,1) of a 4x4 Pd, using two rows of Md and two columns of Nd.]
Every Md and Nd Element is used
exactly twice in generating a 2X2 tile of P
Access order (top to bottom):

P0,0 (thread0,0)   P1,0 (thread1,0)   P0,1 (thread0,1)   P1,1 (thread1,1)
M0,0 * N0,0        M0,0 * N1,0        M0,1 * N0,0        M0,1 * N1,0
M1,0 * N0,1        M1,0 * N1,1        M1,1 * N0,1        M1,1 * N1,1
M2,0 * N0,2        M2,0 * N1,2        M2,1 * N0,2        M2,1 * N1,2
M3,0 * N0,3        M3,0 * N1,3        M3,1 * N0,3        M3,1 * N1,3
Breaking Md and Nd into Tiles
• Break up the inner product loop of each thread into phases
• At the beginning of each phase, load the Md and Nd elements that everyone needs during the phase into shared memory
• Everyone accesses the Md and Nd elements from the shared memory during the phase
[Figure: The Md, Nd, and Pd matrices of the small example, divided into 2x2 tiles.]
Each phase of a Thread Block uses one
tile from Md and one from Nd
       Phase 1                                          Phase 2
T0,0   Md0,0 → Mds0,0                                   Md2,0 → Mds0,0
       Nd0,0 → Nds0,0                                   Nd0,2 → Nds0,0
       PValue0,0 += Mds0,0*Nds0,0 + Mds1,0*Nds0,1       PValue0,0 += Mds0,0*Nds0,0 + Mds1,0*Nds0,1

T1,0   Md1,0 → Mds1,0                                   Md3,0 → Mds1,0
       Nd1,0 → Nds1,0                                   Nd1,2 → Nds1,0
       PValue1,0 += Mds0,0*Nds1,0 + Mds1,0*Nds1,1       PValue1,0 += Mds0,0*Nds1,0 + Mds1,0*Nds1,1

T0,1   Md0,1 → Mds0,1                                   Md2,1 → Mds0,1
       Nd0,1 → Nds0,1                                   Nd0,3 → Nds0,1
       PdValue0,1 += Mds0,1*Nds0,0 + Mds1,1*Nds0,1      PdValue0,1 += Mds0,1*Nds0,0 + Mds1,1*Nds0,1

T1,1   Md1,1 → Mds1,1                                   Md3,1 → Mds1,1
       Nd1,1 → Nds1,1                                   Nd1,3 → Nds1,1
       PdValue1,1 += Mds0,1*Nds1,0 + Mds1,1*Nds1,1      PdValue1,1 += Mds0,1*Nds1,0 + Mds1,1*Nds1,1

time →

(Total Width/TILE_WIDTH phases. Data locality: the Md and Nd elements loaded in each phase are reused by several threads, so only a small amount of shared memory is needed.)
Tiled Matrix Multiplication Kernel
__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
  __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
  __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

  int bx = blockIdx.x;  int by = blockIdx.y;
  int tx = threadIdx.x; int ty = threadIdx.y;

  // Identify the row and column of the Pd element to work on
  int Row = by * TILE_WIDTH + ty;
  int Col = bx * TILE_WIDTH + tx;

  float Pvalue = 0;
  // Loop over the Md and Nd tiles required to compute the Pd element
  for (int m = 0; m < Width/TILE_WIDTH; ++m) {
    // Collaborative loading of Md and Nd tiles into shared memory
    Mds[ty][tx] = Md[Row*Width + (m*TILE_WIDTH + tx)];
    Nds[ty][tx] = Nd[(m*TILE_WIDTH + ty)*Width + Col];
    __syncthreads();

    for (int k = 0; k < TILE_WIDTH; ++k)
      Pvalue += Mds[ty][k] * Nds[k][tx];
    __syncthreads();
  }
  Pd[Row*Width + Col] = Pvalue;
}
CUDA Code – Kernel Execution
Configuration
// Setup the execution configuration
dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);
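With this configuration, the launch itself (a minimal sketch, assuming the device pointers Md, Nd, and Pd from the earlier slides) would look like:

// One thread per Pd element; each block computes one TILE_WIDTH x TILE_WIDTH tile of Pd.
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);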
First-order Size Considerations in G80
• Each thread block should have many threads
– TILE_WIDTH of 16 gives 16*16 = 256 threads
• There should be many thread blocks
– A 1024*1024 Pd gives 64*64 = 4096 Thread Blocks
– TILE_WIDTH of 16 gives each SM 3 blocks, 768 threads (full
capacity)
• Each thread block performs 2*256 = 512 float loads from
global memory for 256 * (2*16) = 8,192 mul/add
operations.
– Memory bandwidth no longer a limiting factor
Tiled Multiply
• Each block computes one square sub-matrix Pdsub of size TILE_WIDTH x TILE_WIDTH
• Each thread computes one element of Pdsub
[Figure: Md, Nd, and Pd with block indices (bx, by), thread indices (tx, ty), tile index m, inner index k, and the Row and Col of the Pdsub element being computed.]
G80 Shared Memory and Threading
• Each SM in G80 has 16KB of shared memory (queryable as dev_prop.sharedMemPerBlock, the amount of shared memory available per block)
– The shared memory size is implementation dependent!
– For TILE_WIDTH = 16, each thread block uses 2*256*4B = 2KB of shared memory.
– The 16KB of shared memory can potentially have up to 8 thread blocks actively executing
• This allows up to 8*512 = 4,096 pending loads (2 per thread, 256 threads per block)
• The threading model limits the number of thread blocks to 3, so shared memory is not the limiting factor here
– The next TILE_WIDTH, 32, would lead to 2*32*32*4B = 8KB of shared memory usage per thread block, allowing only up to two thread blocks to be active at the same time
• Using 16x16 tiling, we reduce the accesses to global memory by a factor of 16; CGMA increases from 1 to 16.
– The 86.4 GB/s bandwidth can now support (86.4/4)*16 = 345.6 GFLOPS!
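A minimal sketch of querying these hardware limits at run time with cudaGetDeviceProperties:

  #include <cuda_runtime.h>
  #include <cstdio>

  int main() {
      cudaDeviceProp dev_prop;
      cudaGetDeviceProperties(&dev_prop, 0);   // properties of device 0
      printf("Shared memory per block: %zu bytes\n", dev_prop.sharedMemPerBlock);
      printf("Registers per block:     %d\n", dev_prop.regsPerBlock);
      printf("Max threads per block:   %d\n", dev_prop.maxThreadsPerBlock);
      return 0;
  }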
Tiled Matrix Multiplication Kernel
• The tiled kernel on the previous slides declares its shared memory arrays with a compile-time size:
  __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
  __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];
• Or use extern declarations to enable a run-time size for the shared memory:
  extern __shared__ float Mds[];
  extern __shared__ float Nds[];
• The kernel is then launched as MatrixMulKernel<<<dimGrid, dimBlock, size>>>(...), where the third configuration parameter size gives the number of bytes of shared memory to allocate per block; size_t is a built-in type for declaring a variable to hold such size information for dynamically allocated data structures.
• The rest of the kernel (tile loading, __syncthreads(), and the inner product loop) is unchanged from the previous tiled kernel.
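One caveat not shown on the slide: all extern __shared__ declarations in a kernel refer to the same starting address, so in practice a single extern array is declared and partitioned manually. A minimal sketch (my naming, not from the slides) of a run-time-sized version of the tiled kernel:

  __global__ void MatrixMulKernelDyn(float* Md, float* Nd, float* Pd,
                                     int Width, int tile_width) {
      extern __shared__ float tiles[];               // one dynamic allocation holds both tiles
      float* Mds = tiles;                            // first tile_width*tile_width floats
      float* Nds = tiles + tile_width * tile_width;  // second tile

      int Row = blockIdx.y * tile_width + threadIdx.y;
      int Col = blockIdx.x * tile_width + threadIdx.x;
      float Pvalue = 0;
      for (int m = 0; m < Width / tile_width; ++m) {
          Mds[threadIdx.y * tile_width + threadIdx.x] = Md[Row * Width + m * tile_width + threadIdx.x];
          Nds[threadIdx.y * tile_width + threadIdx.x] = Nd[(m * tile_width + threadIdx.y) * Width + Col];
          __syncthreads();
          for (int k = 0; k < tile_width; ++k)
              Pvalue += Mds[threadIdx.y * tile_width + k] * Nds[k * tile_width + threadIdx.x];
          __syncthreads();
      }
      Pd[Row * Width + Col] = Pvalue;
  }

  // Launch: the third configuration parameter is the dynamic shared memory size in bytes.
  // size_t size = 2 * tile_width * tile_width * sizeof(float);
  // MatrixMulKernelDyn<<<dimGrid, dimBlock, size>>>(Md, Nd, Pd, Width, tile_width);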
Tiling Size Effects
[Figure: Bar chart of measured GFLOPS (0–100) for the untiled kernel and for 4x4, 8x8, 12x12, and 16x16 tiles, each shown as "tiled only" and "tiled & unrolled" versions.]
Summary- Typical Structure of a
CUDA Program
• Global variables declaration
– __host__
– __device__... __global__, __constant__, __texture__
• Function prototypes
– __global__ void kernelOne(…)
– float handyFunction(…)
• Main()
– allocate memory space on the device – cudaMalloc(&d_GlblVarPtr, bytes)
– transfer data from host to device – cudaMemcpy(d_GlblVarPtr, h_Gl…)
– execution configuration setup
– kernel call – kernelOne<<<execution configuration>>>(args…);
– transfer results from device to host – cudaMemcpy(h_GlblVarPtr, …)
(the kernel call and result transfer are repeated as needed)
– optional: compare against golden (host computed) solution
• Kernel – void kernelOne(type args, …)
– variables declaration - __local__, __shared__
• automatic variables transparently assigned to registers or local memory
– __syncthreads()…
• Other functions
– float handyFunction(int inVar…);
A minimal code sketch of this structure follows.
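The sketch below follows the structure above; the sizes, the kernel body, and the constant scale factor are hypothetical, chosen only to make the skeleton compile and run:

  #include <cuda_runtime.h>
  #include <cstdio>
  #include <cstdlib>

  __constant__ float scale;                                 // global (constant) variable declaration

  __global__ void kernelOne(float* d_data, int n) {         // kernel
      int i = blockIdx.x * blockDim.x + threadIdx.x;        // automatic variables -> registers
      if (i < n) d_data[i] *= scale;
  }

  float handyFunction(float x) { return x + 1.0f; }         // other (host) functions

  int main() {
      const int n = 1 << 20;
      size_t bytes = n * sizeof(float);
      float* h_data = (float*)malloc(bytes);
      for (int i = 0; i < n; ++i) h_data[i] = handyFunction((float)i);

      float* d_data;
      cudaMalloc(&d_data, bytes);                                  // allocate device memory
      cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);   // transfer host -> device
      float s = 2.0f;
      cudaMemcpyToSymbol(scale, &s, sizeof(float));                // fill constant memory

      dim3 block(256), grid((n + block.x - 1) / block.x);          // execution configuration setup
      kernelOne<<<grid, block>>>(d_data, n);                       // kernel call (repeat as needed)

      cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);   // transfer device -> host
      printf("h_data[1] = %f\n", h_data[1]);                       // optional: check the result
      cudaFree(d_data);
      free(h_data);
      return 0;
  }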