matmult

1 Content

1 Content
2 Summary
3 Introduction to CUDA
3.1 Scalability (1)
3.2 Kernels (1)
3.3 Thread-Hierarchy (1)
3.4 Memory-Hierarchy (1)
3.5 GPU as Coprocessor
4 Matrix Multiplication
4.1 Mathematics
4.2 Column-major order
5 Implementations
5.1 Global memory
5.1.1 Calculating memory address
5.1.2 Kernel function
5.2 Shared memory
5.2.1 Dynamic shared memory
5.2.2 Calculating memory address
5.2.3 Kernel function
5.3 Serial C++ code
6 Time measurements
6.1 High performance counter
6.2 Time for matrices up to 300x300 elements
6.3 Conclusions
7 Legal
8 Download
9 References
Created by Christian Arrer.
2 Summary
This document is a summary of the first 25 pages of the CUDA programmer’s guide written by NVIDIA
Corporation. It is more or less an approach of putting together the important stuff in a few lines, so one
can directly go on to an implementation for matrix multiplication and various algorithms for Fourier
transformation.
An adapted version of the shared memory matrix multiplication and of the global memory matrix
multiplication is implemented, but not described in full detail in this document. In a measurement section
the execution times in seconds of CUDA, the MATLAB built-in operator and serial C++ code on the CPU are
compared. The results are purely of scientific interest, acquired in good faith, and not intended for any
competitive benchmarking.
NSIGHT is not used here.
These implementations of matrix multiplication do not force the matrices to be multiples of the CUDA
block dimensions. Arbitrary matrix sizes are allowed, which the implementation in the CUDA
programming guide prohibits. Matrices could of course be padded with zeros to fit the CUDA block grid
exactly; in that case it should be checked whether extracting the result matrix from zero-padded sources
is more efficient than out-of-bounds checking in the kernel function.
3 Introduction to CUDA
This section gives a short overview of the most important CUDA concepts needed to get started.
3.1 Scalability (1)
CUDA provides a framework that allows the programmer to write code that dynamically adapts to the
amount of available execution cores. This is achieved by a local and a more global parallelism. On the
one hand the programmer can have a local parallelism with threads; on the other hand those threads
are organized in blocks, which can be launched over the physically available multiprocessor either in
serial or parallel. NVIDIA Hardware and Firmware handles the parallelism for the hardware’s
possibilities. An NVIDIA-Chip with 10 streaming multiprocessors will have the ability to launch 10 blocks
at once, while an NVIDIA-Chip with 4 streaming multiprocessors will only execute 4 blocks at the same
time.
3.2 Kernels (1)
Kernels are the basic elements which are executed in parallel. In C++ a CUDA-kernel is defined with the
___global___ keyword. Inside the Kernel the threadIdx structure is build-in accessible. Kernels are
launched in one ore multiple threads.
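A minimal sketch of such a kernel and its launch (the names addKernel, devA, devB, devC and N are illustrative and not part of the implementation described later):

__global__ void addKernel(const float* a, const float* b, float* c) {
    unsigned int i = threadIdx.x;   // built-in thread index
    c[i] = a[i] + b[i];             // each thread handles one element
}

// host side: one block of N threads executes the kernel body N times in parallel
addKernel<<<1, N>>>(devA, devB, devC);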
3.3 Thread-Hierarchy (1)
Figure 1: Hierarchy
Threads: Threads are the smallest unit of the hierarchy. Threads live inside blocks. A thread's index is
always accessible via the threadIdx structure inside the kernel. The maximum number of threads per
block is currently 1024.
Blocks: Blocks are arranged inside a grid. The number of blocks can exceed the number of streaming
multiprocessors on the hardware. A block's index is always accessible via the blockIdx structure. Grids are
only two dimensional!
A one-, two- or three-dimensional structure can be used, depending on the dimensionality of the matrices
in the underlying calculation.
Again it is important to mention that blocks must allow arbitrary serial or parallel execution; they are
what enables scalability across the currently installed hardware.
Threads use barrier synchronization to read with low latency from the current streaming
multiprocessor's shared memory. __syncthreads() acts as a barrier that every thread of a block must
reach before any thread continues.
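As a sketch of a typical two-dimensional launch configuration (the kernel name someKernel and the sizes numRows and numCols are illustrative; the ceiling division makes sure that sizes which are not multiples of the block edge are still covered):

dim3 threads(16, 16);                                  // 16 * 16 = 256 threads per block, below the 1024 limit
dim3 blocks((numRows + threads.x - 1) / threads.x,     // enough blocks in x to cover all rows
            (numCols + threads.y - 1) / threads.y);    // enough blocks in y to cover all columns
someKernel<<<blocks, threads>>>(/* arguments */);      // launch the two-dimensional grid of blocks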
All built-in variables are compiled in Table 1.
Table 1: Important variables and language constructs

blockIdx.x: X-block-index inside the grid.
blockIdx.y: Y-block-index inside the grid.
blockIdx.z: Z-block-index inside the grid.
blockDim.x: Number of thread elements in X inside a block.
blockDim.y: Number of thread elements in Y inside a block.
blockDim.z: Number of thread elements in Z inside a block.
threadIdx.x: X-thread-index inside a block.
threadIdx.y: Y-thread-index inside a block.
threadIdx.z: Z-thread-index inside a block.
<<<dim3, dim3, dynamicSharedSize>>>: Invocation of the threads (kernel launch configuration).

3.4 Memory-Hierarchy (1)
There are five different types of memory, see Table 2:
Table 2: Memory location types

each thread's private local memory: Only the thread itself can access this space.
each block's shared memory: All threads in a block can read and write to this location.
global memory: Every thread has access to this storage.
constant memory: Read-only for all threads, for special access patterns.
texture memory: Read-only for all threads, for special access patterns.
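As an illustration of where these memory types appear in code, here is a small sketch (texture memory is omitted because it needs extra setup; all names are made up for this example):

__constant__ float coeff[16];                    // constant memory, read-only for all threads

__global__ void memoryDemo(float* globalData) {  // globalData points into global memory
    float localValue;                            // thread-private local memory (usually a register)
    __shared__ float tile[256];                  // shared memory, visible to all threads of the block
    tile[threadIdx.x] = globalData[threadIdx.x] * coeff[0];   // assumes at most 256 threads per block
    __syncthreads();                             // make the whole tile visible to every thread
    localValue = tile[threadIdx.x];
    globalData[threadIdx.x] = localValue;        // write the result back to global memory
}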
3.5 GPU as Coprocessor
The GPU is more or less an auxiliary unit for the main program launched on the CPU. Data is transferred
from the host memory (the system's DRAM) to the device memory (the NVIDIA accelerator's DRAM) and
back.
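A minimal sketch of this round trip with the CUDA runtime API (devData, hostData and numBytes are illustrative names):

float* devData = NULL;
cudaMalloc((void**)&devData, numBytes);                            // allocate device DRAM
cudaMemcpy(devData, hostData, numBytes, cudaMemcpyHostToDevice);   // host -> device
// ... launch kernels that work on devData ...
cudaMemcpy(hostData, devData, numBytes, cudaMemcpyDeviceToHost);   // device -> host
cudaFree(devData);                                                 // release the device memory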
The major device number of the graphics card corresponds to the compute architecture (Table 3).

Table 3: Compute architecture

Major-Device-Number: Architecture
3: Kepler
2: Fermi
1: Tesla
4 Matrix Multiplication
4.1 Mathematics
When two matrices are multiplied, the result is again a matrix. If the matrix product is written as
A * B = C, then the number of columns of A must equal the number of rows of B. General
matrices A and B will then look like this (Equation 1).
Equation 1
\[ A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1m} \\ a_{21} & a_{22} & \cdots & a_{2m} \\ \vdots & & \ddots & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nm} \end{pmatrix}, \qquad B = \begin{pmatrix} b_{11} & b_{12} & \cdots & b_{1r} \\ b_{21} & b_{22} & \cdots & b_{2r} \\ \vdots & & \ddots & \vdots \\ b_{m1} & b_{m2} & \cdots & b_{mr} \end{pmatrix} \]
A is an n * m matrix, that means A has n rows and m columns. B is an m * r matrix, so B has m rows and r
columns. m is the dimension both matrices have in common. The result matrix C is then an n * r matrix.
Each element of C is defined as the scalar product of a row vector of matrix A and a column vector of B
(Equation 2).
Equation 2
\[ c_{ij} = \sum_{k=1}^{m} a_{ik}\, b_{kj}, \qquad i = 1, \ldots, n, \quad j = 1, \ldots, r \]
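As a small worked example (the values are chosen only for illustration), with n = 2, m = 3 and r = 2:
\[ \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{pmatrix} \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ 2 & 2 \end{pmatrix} = \begin{pmatrix} 1\cdot1+2\cdot0+3\cdot2 & 1\cdot0+2\cdot1+3\cdot2 \\ 4\cdot1+5\cdot0+6\cdot2 & 4\cdot0+5\cdot1+6\cdot2 \end{pmatrix} = \begin{pmatrix} 7 & 8 \\ 16 & 17 \end{pmatrix} \]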
4.2 Column-major order
Inside MATLAB and some other numerical programs, two-dimensional matrices are stored in column-major
order, which means the matrix column vectors are arranged one after another in a flat float or double
array. In column-major order, the matrix A from Equation 1 becomes the array in Equation 3.
Equation 3
\[ A \;\rightarrow\; (a_{11},\, a_{21},\, \ldots,\, a_{n1},\; a_{12},\, a_{22},\, \ldots,\, a_{n2},\; \ldots,\; a_{1m},\, \ldots,\, a_{nm}) \]
If the row index and the column index of an element are known, the linear index is easily calculated
with Equation 4 (all indices zero-based, numRows being the number of rows of the matrix):
Equation 4
\[ \mathrm{lin} = \mathrm{col} \cdot \mathrm{numRows} + \mathrm{row} \]
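A small helper in C++ that performs exactly this address calculation (the function name is made up for illustration; all indices are zero-based):

// Linear index of element (row, col) of a matrix with numRows rows that is
// stored column by column in a flat array.
inline unsigned int linearIndex(unsigned int row, unsigned int col, unsigned int numRows) {
    return col * numRows + row;
}

// usage: element in the third row and second column (zero-based (2, 1)) of an n x m matrix A
// float value = A[linearIndex(2, 1, n)];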
5 Implementations
NVIDIA CUDA distinguishes five types of memory. For matrix multiplication two of them are used in two
different implementations: one uses the GPU's global memory, the other its shared memory.
5.1 Global memory
Basically the matrices A, B and C are loaded into the device's global memory using the cudaMemcpy API
function. The CUDA API can be called from standard ".cpp" files inside Visual C++, but CUDA kernels and
__device__ functions can only be defined and called in ".cu" files, which are compiled by the NVIDIA C
compiler nvcc.
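As a sketch, the host side of the global memory version could look roughly like the following; the caMat fields numRows, numCols, elemHost and elemDevice are taken from Code 1 and Code 4, while the 16 x 16 block size and the ceiling division for the grid size are illustrative choices, not necessarily those of the original project:

caMat A, B, C;                         // numRows, numCols and elemHost assumed to be filled on the host
size_t bytesA = A.numRows * A.numCols * sizeof(float);
cudaMalloc((void**)&A.elemDevice, bytesA);                             // allocate in device global memory
cudaMemcpy(A.elemDevice, A.elemHost, bytesA, cudaMemcpyHostToDevice);  // host -> device
// ... the same allocation and copy for B; C.elemDevice is allocated and
// zeroed (e.g. with cudaMemset), because the kernel accumulates into it

dim3 threads(16, 16);                                      // 256 threads per block
dim3 blocks((C.numRows + threads.x - 1) / threads.x,       // enough blocks in x to cover all rows of C
            (C.numCols + threads.y - 1) / threads.y);      // enough blocks in y to cover all columns of C
mmKernel<<<blocks, threads>>>(A, B, C);                    // the kernel from Code 1

cudaMemcpy(C.elemHost, C.elemDevice, C.numRows * C.numCols * sizeof(float),
    cudaMemcpyDeviceToHost);                               // copy the result back to the host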
As mentioned in the introduction, threads are grouped into blocks and blocks are grouped into a grid.
The maximum number of threads in a block is limited by the hardware; if the matrix becomes bigger than
the maximum number of threads in a block, it is necessary to create more blocks inside a two-dimensional
grid. Blocks can have three dimensions, but so far grids are two-dimensional only.
For this implementation a single for loop inside each thread is necessary to do the mathematics of the
standard scalar product. The main challenge inside a thread is to compute the linear address of each
element. As mentioned before, matrices are stored in column-major order.
5.1.1 Calculating memory address
Figure 2: Matrix sketch
Each thread has to be aware of its location in the result matrix. A thread can compute its row and column
index with the following formulas (Equation 5), since the data structures gridDim, blockDim, blockIdx and
threadIdx are available inside any kernel function. NVIDIA and C++ start counting from 0, which is
consistent here; it means there is a thread with index (0, 0, 0) and a block with index (0, 0). Compare
Figure 2.
Equation 5
\[ \mathrm{row}_C = \mathrm{blockIdx.x} \cdot \mathrm{blockDim.x} + \mathrm{threadIdx.x}, \qquad \mathrm{col}_C = \mathrm{blockIdx.y} \cdot \mathrm{blockDim.y} + \mathrm{threadIdx.y} \]
To write in column-major order we need to convert the row and column indexes into a column-major
linear index. The calculation in the next formula (Equation 6) is only valid if the thread is not outside of
C's boundaries, see Figure 2.
Equation 6
\[ \mathrm{lin}_C = \mathrm{col}_C \cdot \mathrm{numRows}_C + \mathrm{row}_C \]
The linear index of the source matrix A can be computed as in Equation 7:
Equation 7
\[ \mathrm{lin}_A = \mathrm{inner} \cdot \mathrm{numRows}_A + \mathrm{row}_C \]
The variable "inner" in Equation 7 above is the iteration counter of the standard scalar product for loop.
The linear index for the source matrix B is calculated as follows (Equation 8):
Equation 8
\[ \mathrm{lin}_B = \mathrm{col}_C \cdot \mathrm{numRows}_B + \mathrm{inner} \]
5.1.2 Kernel function
The CUDA kernel function is listed below in Code 1.
void __global__ mmKernel(caMat A, caMat B, caMat C) {
    caMatIdx idxC;
    idxC.row = blockIdx.x * blockDim.x + threadIdx.x;
    idxC.col = blockIdx.y * blockDim.y + threadIdx.y;
    if(idxC.col < C.numCols && idxC.row < C.numRows) { // check that we are not out of bounds here
        idxC.lin = idxC.col * C.numRows + idxC.row;
        for(unsigned int i = 0; i < A.numCols; i++) { // loop over the inner matrix dimension
            C.elemDevice[idxC.lin] = C.elemDevice[idxC.lin]
                + (A.elemDevice[idxC.row + (i * A.numRows)]
                *  B.elemDevice[(idxC.col * B.numRows) + i]); // all matrices are column major
        }
    }
}
Code 1: global memory kernel
5.2 Shared memory
The matrix multiplication is divided into blocks so that it fits into the device's resources. Shared memory
is a fast memory available to all threads inside one block. Loading blocks into shared memory before
starting the calculation increases performance because it saves global-memory bandwidth. Therefore the
threads of each block need to be synchronized to wait until all data is loaded. All threads must also
synchronize after the result element calculation, before the next matrix block is loaded to shared memory.
With the shared memory implementation the data is loaded to global memory as usual and then
transferred from global to shared memory.
5.2.1 Dynamic shared memory
CUDA allows a single dynamically sized shared memory array per kernel launch. To have multiple shared
memory variables, pointers to given offsets of this single shared memory array can be taken to divide it
logically into several variables. The third parameter of the kernel launch is the total size in bytes the
dynamic shared memory will use. The shared memory needs to be declared exactly as extern __shared__
<custom_name>[]. nvcc wants to see the brackets "[]" at the end; using the dereferencing operator "*"
will give errors. The size in bytes for a given data type is calculated with the sizeof(<data_type>)
operator and the number of elements stored in the dynamic shared memory. The kernel launch looks
like this (Code 2):
sharedSize = threads.x * threads.x;
sharedBytes = sharedSize * sizeof(float) * 2;
mmKernelShared<<<blocks, threads, sharedBytes>>>(A, B, C, sharedSize);
Code 2: kernel launch with dynamic shared memory
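Inside the kernel the single dynamic array is then split logically into two block-sized matrices, exactly as done in Code 3 (shrSize is the per-matrix element count passed from the host):

extern __shared__ float shared[];   // the single dynamic array, sized by the third launch parameter
float* shrA = &shared[0];           // first shrSize elements: current block of A
float* shrB = &shared[shrSize];     // next shrSize elements: current block of B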
5.2.2 Calculating memory address
Figure 3: Shared memory matrix sketch
For the computation of each result block, blocks of A and B are loaded into shared memory, so there are
two nested for loops. The outer for loop iteratively loads the source blocks for each result block into
shared memory. The inner for loop does the mathematics of the partial standard scalar product inside
each thread. Compare Figure 3.
Like in the previous example, some address calculation needs to be done. For the left and right source
matrices the linear indexes can be computed from their row and column index, the current block
iteration and the size of the inputs (Equation 9); the roles of row and column are exactly swapped
between A and B. Triple checking the equations is recommended, because things can easily get confusing
here. Kernels can only be debugged with the NSIGHT plug-in.
Equation 9
\[ \mathrm{row}_A = \mathrm{row}_C, \quad \mathrm{col}_A = \mathrm{curBlk} \cdot \mathrm{blockDim.y} + \mathrm{threadIdx.y}, \qquad \mathrm{row}_B = \mathrm{curBlk} \cdot \mathrm{blockDim.x} + \mathrm{threadIdx.x}, \quad \mathrm{col}_B = \mathrm{col}_C \]
\[ \mathrm{lin}_A = \mathrm{col}_A \cdot \mathrm{numRows}_A + \mathrm{row}_A, \qquad \mathrm{lin}_B = \mathrm{col}_B \cdot \mathrm{numRows}_B + \mathrm{row}_B \]
The calculation of the indexes for the result matrix C stays the same as in the global memory
implementation.
The column and row index of matrix A and matrix B can be out of bounds. If this is the case, the thread
does not need to load its shared memory element. A thread is only truly idle if it is out of bounds in
matrix C's columns and rows; otherwise it may still have to load a shared memory element. Therefore,
out of bounds in A and B needs to be checked separately. If the thread is out of bounds in one of the
source matrices, it loads a 0 into shared memory.
The number of blocks in the outer loop is given by Equation 10.
Equation 10
\[ \mathrm{numBlk} = \left\lfloor \frac{\mathrm{numCols}_A}{\mathrm{blockDim.x}} \right\rfloor + 1 \]
For this use of shared memory, the blocks need to be square.
After loading into shared memory in one outer for loop iteration, the threads need to be synchronized.
The inner for loop calculates a part of the scalar product; when this loop starts, all data must already be
present in shared memory. The threads are therefore synchronized to ensure that every thread has
loaded its data before the scalar product loop starts.
Inside the scalar product loop the kernel must not access global memory; this would compromise the
bandwidth advantage of using shared memory. To accumulate the partial value of the scalar product a
local kernel variable is used.
If the matrices are not multiples of the block size, the partial scalar product loop can run out of bounds
at the right edge of matrix A or, respectively, the bottom edge of matrix B. To prevent access to garbage
memory, the for loop needs an additional break condition (Equation 11).
Equation 11
\[ \mathrm{curBlk} \cdot \mathrm{blockDim.y} + j < \mathrm{numCols}_A \quad\Longleftrightarrow\quad \mathrm{curBlk} \cdot \mathrm{blockDim.x} + j < \mathrm{numRows}_B \]
Both conditions in Equation 11 above are equivalent, because multiplication requires the inner dimensions
to agree (the number of columns of A equals the number of rows of B) and the blocks are square.
The linear indexes for the shared memory are easily calculated (Equation 12). The shared memory
elements need to be multiplied and added to the local partial standard scalar product variable.
Equation 12
\[ \mathrm{lin}_{\mathrm{shrA}} = j \cdot \mathrm{blockDim.x} + \mathrm{threadIdx.x}, \qquad \mathrm{lin}_{\mathrm{shrB}} = \mathrm{threadIdx.y} \cdot \mathrm{blockDim.x} + j \]
5.2.3 Kernel function
The CUDA kernel function is listed below in Code 3.
void __global__ mmKernelShared(caMat A, caMat B, caMat C, unsigned int shrSize) {
    caMatIdx idxA; // linear, row and column indexes in global nvidia memory
    caMatIdx idxB;
    caMatIdx idxC;
    extern __shared__ float shared[]; // needed for dynamically allocated shared memory
    float* shrA = NULL;
    float* shrB = NULL;
    float Cval;
    unsigned int numBlk = 0; // the number of shared memory blocks
    idxC.row = blockDim.x * blockIdx.x + threadIdx.x;
    idxC.col = blockDim.y * blockIdx.y + threadIdx.y;
    idxC.lin = idxC.col * C.numRows + idxC.row;
    numBlk = A.numCols / (unsigned int)blockDim.x;
    numBlk = numBlk + 1;
    for(unsigned int curBlk = 0; curBlk < numBlk; curBlk++) { // loop over the shared memory blocks
        Cval = 0;
        shrA = &shared[0];
        shrB = &shared[shrSize];
        idxA.row = idxC.row;
        idxA.col = curBlk * blockDim.y + threadIdx.y;
        idxB.row = curBlk * blockDim.x + threadIdx.x;
        idxB.col = idxC.col;
        if(idxA.row < A.numRows && idxA.col < A.numCols) { // check out of bounds, thread doesn't need to load anything
            idxA.lin = idxA.col * A.numRows + idxA.row;
            shrA[blockDim.x * threadIdx.y + threadIdx.x] = A.elemDevice[idxA.lin]; // load to shared, read from global
        }
        else {
            shrA[blockDim.x * threadIdx.y + threadIdx.x] = 0;
        }
        if(idxB.row < B.numRows && idxB.col < B.numCols) { // check out of bounds, thread doesn't need to load anything
            idxB.lin = idxB.col * B.numRows + idxB.row;
            shrB[blockDim.x * threadIdx.y + threadIdx.x] = B.elemDevice[idxB.lin]; // load to shared, read from global
        }
        else {
            shrB[blockDim.x * threadIdx.y + threadIdx.x] = 0;
        }
        __syncthreads(); // the previous copy operation needs to finish before we go on
        if(idxC.col < C.numCols && idxC.row < C.numRows) {
            for(unsigned int j = 0; j < blockDim.x && ((curBlk * blockDim.y + j) < A.numCols); j++) { // do not loop out of bounds, ((curBlk * blockDim.x + j) < B.numRows) is the same
                Cval = Cval + shrA[j * blockDim.x + threadIdx.x] * shrB[threadIdx.y * blockDim.x + j];
            }
        }
        __syncthreads(); // wait until the previous operation is finished
        if(idxC.col < C.numCols && idxC.row < C.numRows)
            C.elemDevice[idxC.lin] = C.elemDevice[idxC.lin] + Cval; // write to global, do not write to global inside the loop
    }
}
Code 3: shared memory kernel
5.3 Serial C++ code
The used serial C++ code is as follows (Code 4).
for(idxC.col = 0; idxC.col < this->C.numCols; idxC.col++) {
for(idxC.row = 0; idxC.row < this->C.numRows; idxC.row++) {
for(unsigned int inner = 0; inner < this->A.numCols; inner++) {
idxA.lin = inner * A.numRows + idxC.row;
idxB.lin = idxC.col * B.numRows + inner;
this->C.elemHost[idxC.lin] = this->C.elemHost[idxC.lin] +
this->A.elemHost[idxA.lin] * this->B.elemHost[idxB.lin];
}
idxC.lin++;
}
}
Code 4: serial C++ code
The serial code does not take advantage of CPU threads, multiple cores or hyper-threading. Three nested
loops are used. The code is highly inefficient.
6 Time measurements
6.1 High performance counter
The “High precision event timer” is available on most PC-machines. On multi-core processors the HPET is
somewhat difficult to use. Before taking the frequency of the counter the start counts and the stop
counts it is necessary to set the process affinity to a single CPU core, otherwise the results will not be
reliable.
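A sketch of such a measurement with the Win32 performance counter API (the affinity mask, the function name and the omitted error handling are simplifications for illustration):

#include <windows.h>

double measureSeconds() {
    LARGE_INTEGER freq, start, stop;
    SetProcessAffinityMask(GetCurrentProcess(), 1);   // pin the process to CPU core 0
    QueryPerformanceFrequency(&freq);                 // counter ticks per second
    QueryPerformanceCounter(&start);
    // ... run the matrix multiplication that is to be measured ...
    QueryPerformanceCounter(&stop);
    return (double)(stop.QuadPart - start.QuadPart) / (double)freq.QuadPart;
}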
6.2 Time for matrices up to 300x300 elements
Diagram 1: time measurement
The serial C++ code grows steeply with the matrix size (the number of operations scales with the third
power of the matrix dimension) and performs by far the worst. The two CUDA implementations with
shared memory and global memory are nearly indistinguishable. This diagram (Diagram 1) is not really
suitable for analyzing CUDA behavior. The outliers could be removed by averaging over several series
instead of taking a single sequence.
6.3 Conclusions
Higher matrix sizes should be tested to compare the development of shared memory against global
memory with increasing matrix size. Serial C++ code should be greatly improved; the implementation is
maybe highly inefficient. Serial C++ on CPU should be able to beat MATLAB’s virtual machine code.
7 Legal
1. MATLAB is a trademark of MathWorks.
2. NVIDIA, CUDA, GPGPU, NSIGHT and GPU are trademarks of NVIDIA.
3. Visual C++ is a trademark of Microsoft.
8 Download

http://ch.arrer.net/external/gpgpu/matmult
http://ch.arrer.net/external/gpgpu/matmult.zip
http://ch.arrer.net/external/gpgpu/matmult.docx
http://ch.arrer.net/external/gpgpu/matmult.pdf
9 References
1. NVIDIA Corporation. NVIDIA CUDA C Programming Guide. [Online] http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/CUDA_C_Programming_Guide.pdf