More CUDA Examples

Different Levels of parallelism
• Thread parallelism
– each thread is an independent thread of execution
• Data parallelism
– across threads in a block
– across blocks in a kernel
• Task parallelism
– different blocks are independent
– independent kernels
Amrita School of Biotechnology
Thread Ids
• Each thread that executes the kernel
is given a unique thread ID that is
accessible within the kernel through
the built-in threadIdx variable.
– threadIdx is a 3-component vector, so
that threads can be identified using a
one-dimensional, two-dimensional, or
three-dimensional thread index,
forming a one-dimensional, two-dimensional, or three-dimensional
thread block.
– This provides a natural way to invoke
computation across the elements in a
domain such as a vector, matrix, or
volume.
Block ID: 1D or 2D
blockIdx.{x,y}
Thread ID: 1D, 2D, or 3D
threadIdx.{x,y,z}
Amrita School of Biotechnology
• A general guideline is that a block should consist of at least 192
threads in order to hide memory access latency. Therefore,
256 and 512 threads per block are common and practical choices.
• The following kernel launch uses one block with N threads
// Kernel invocation with N threads
VecAdd<<<1, N>>>(A, B, C);
• Here, each of the N threads that execute VecAdd() performs
one pair-wise addition.
__global__ void VecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;
    if (i < N)
        C[i] = A[i] + B[i];
}
Simplest choice is to have each thread calculate one, and only one,
element in the final result array
The number of threads per block and the number of blocks per grid
specified in the <<<…>>> syntax can be of type int or dim3.
The dimension of the thread block is accessible within the kernel
through the built-in blockDim variable.
Suppose we have 10000 elements and
No. of threads per block: 256
Then
No. of blocks required = ceil(10000 / 256) = 40 (plain integer division gives 39, which would leave 16 elements without a thread, so we round up)
An array of 16 elements divided into 4 blocks
N=16, blockDim=4 -> 4 blocks
int idx = blockDim.x * blockIdx.x + threadIdx.x;
blockIdx.x=0
blockDim.x=4
threadIdx.x=0,1,2,3
idx=0,1,2,3
blockIdx.x=1
blockDim.x=4
threadIdx.x=0,1,2,3
idx=4,5,6,7
blockIdx.x=2
blockDim.x=4
threadIdx.x=0,1,2,3
idx=8,9,10,11
blockIdx.x=3
blockDim.x=4
threadIdx.x=0,1,2,3
idx=12,13,14,15
2D Examples
• Add two matrices
• Case 1: matrix dimension and block dimension are the same
• Works only for small matrices (rows × columns ≤ 1024, the per-block thread limit)
• No. of blocks needed: 1
– dim3 threadsPerBlock(rows, columns)
– dim3 blocksPerGrid(1)
– Kernel invocation
• AddMatrix<<<blocksPerGrid, threadsPerBlock>>>(a, b, c, cols)
The following code adds two matrices A and B of size NxN and stores
the result into matrix C:
// Kernel definition
__global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N])
{
int i = threadIdx.x;
int j = threadIdx.y;
C[i][j] = A[i][j] + B[i][j];
}
int main()
{
...
// Kernel invocation with one block of N * N * 1 threads
int numBlocks = 1;
dim3 threadsPerBlock(N, N);
MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
...
} // N * N must not exceed 1024, the maximum number of threads per block (so N <= 32)
Case 2: MatAdd() example to handle multiple blocks
// Kernel definition
__global__ void MatAdd(float A[N][N], float B[N][N],
float C[N][N])
{
int i = blockIdx.x * blockDim.x + threadIdx.x;
int j = blockIdx.y * blockDim.y + threadIdx.y;
if (i < N && j < N)
C[i][j] = A[i][j] + B[i][j];
}
int main()
{
...
// Kernel invocation (N is assumed to be a multiple of 16)
dim3 threadsPerBlock(16, 16);
dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y);
MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
...
}
Total number of threads = number of threads per block × number of blocks.
Block dimension     Thread index   threadId    dataIndex
1D                  (x)            x           i = blockIdx.x * blockDim.x + threadIdx.x
2D, size (Dx, Dy)   (x, y)         x + y*Dx    i = blockIdx.x * blockDim.x + threadIdx.x
                                               j = blockIdx.y * blockDim.y + threadIdx.y
There is a limit to the number of threads per block, since all
threads of a block are expected to reside on the same processor
core and must share the limited memory resources of that core.
• If
– Max Number of Threads per Block: 512
– Max Number of Blocks per Streaming
Multiprocessor: 8
– Number of Streaming Multiprocessors: 30
• Total Number of Threads Available =
– 30 x 8 x 512 = 122880
Compute Capability
Technical Specifications                                   1.0   1.1   1.2   1.3   2.0
Maximum dimensionality of a grid of thread blocks          2     2     2     2     3
Maximum x-, y-, or z-dimension of a grid of thread blocks  65535
Maximum dimensionality of a thread block                   3
Maximum x- or y-dimension of a block                       512   512   512   512   1024
Maximum z-dimension of a block                             64
Maximum number of threads per block                        512   512   512   512   1024
Compute Capability 1.x
Thread block dimension: 1D, 2D, or 3D
Grid dimension: 1D or 2D
Max threads / block: 512
Compute Capability 2.x
Thread block dimension: 1D, 2D, or 3D
Grid dimension: 1D, 2D, or 3D
Max threads / block: 1024
Matrix Multiplication
Memory layout of a matrix
Matrices in CUDA C are stored in row-major order: with the figure's Mx,y notation (column x, row y), element Mx,y of a 4-wide matrix lives at M[y * 4 + x].
M0,0 M1,0 M2,0 M3,0
M0,1 M1,1 M2,1 M3,1
M0,2 M1,2 M2,2 M3,2
M0,3 M1,3 M2,3 M3,3
Linear layout: M0,0 M1,0 M2,0 M3,0 M0,1 M1,1 M2,1 M3,1 M0,2 M1,2 M2,2 M3,2 M0,3 M1,3 M2,3 M3,3
(M x P) * (P x N) => (M x N)
Matrix Multiplication
// Matrix multiplication on the (CPU) host
void MatrixMulOnHost(float* mat1, float* mat2, float* R, int M, int P, int N)
{
    for (int i = 0; i < M; ++i)
        for (int j = 0; j < N; ++j) {
            double sum = 0;
            for (int k = 0; k < P; ++k) {
                double a = mat1[i * P + k];
                double b = mat2[k * N + j];
                sum += a * b;
            }
            R[i * N + j] = sum;
        }
}
[Figure: mat1 (M x P) times mat2 (P x N) gives R (M x N); row i of mat1 is combined with column j of mat2 over index k.]
Matrix multiplication on GPU
Each thread calculates one value in the resulting matrix
__global__ void MatrixMulOnDevice(float* mat1, float* mat2, float* R,
                                  int M, int P, int N)
{
    float sum = 0;
    int row = threadIdx.y;
    int col = threadIdx.x;
    for (int k = 0; k < P; ++k)
    {
        float a = mat1[row * P + k];
        float b = mat2[k * N + col];
        sum += a * b;
    }
    R[row * N + col] = sum;
}
MatrixMulOnDevice<<<blocksPerGrid, threadsPerBlock>>>(A, B, C, m, p, n);
(the grid dimension comes first in the <<<...>>> syntax, then the block dimension)
• Limitation:
– Size of a matrix is limited by the number of threads
allowed in a thread block
• Solution: Use multiple thread blocks
– Kernel invocation (assuming m and n are multiples of threads)
• int threads = 16; // 16 x 16 = 256 threads per block, under the limit
• dim3 threadsPerBlock(threads, threads);
• dim3 blocksPerGrid(m / threads, n / threads);
• Multiply<<<blocksPerGrid, threadsPerBlock>>>(A, B, C, m, p, n);
– threadIds
• int row = blockIdx.y * blockDim.y + threadIdx.y;
• int col = blockIdx.x * blockDim.x + threadIdx.x;
• Another solution:
– Give each thread more work
– Instead of doing one operation, each thread is
assigned more jobs
• A tile of WIDTH * WIDTH
entries
Question??
• Write a program to implement the kernel function
– Increment(a[], b)
• The function increments each element of the array a by b
units.
– The array needs to be dynamically allocated
– No. of threads per block: 256
– No. of blocks is computed at run time from the size of the
array
– Each thread should perform one increment operation on one
array element
• Do the same for a two-dimensional array
– With one block
– With a number of blocks
Amrita School of Biotechnology