Introduction to CUDA (1)
BIN ZHOU
USTC
Fall 2012
Clarification for last week’s class
Warp0: LOAD R1, [R2]
Warp0: Add  R3, R4, R5
Warp0: Add  R1, R2, R7
Key point: if the arithmetic instructions that follow do not depend on the
memory access instruction, the load does not stall the pipeline.
A stall occurs only when a dependent instruction is reached.
Announcements
•HW2 released today. Pay attention to the due time!
•HW1 went well. Good students:
•张振国,刘源,秦子龙,李鑫,陈俊仕,张海博,
•周学进,李丰,张爱民,程亦超,张然,王锋,陈宇超
•Some students completed neither the lab work nor the
homework. Please pay attention!
•We’ll start to use the network resource system:
• szkc.jingpinke.com
Advertisement
•NVIDIA Corp. is holding a campus recruitment event.
•Tomorrow (Sunday, 10/14), 18:00, 西活学术报告厅
(West Campus Activity Center lecture hall)
Acknowledgements
• Many slides are from David Kirk and
Wen-mei Hwu’s UIUC course
• Most slides are from Patrick Cozzi,
University of Pennsylvania CIS 565
GPU Architecture Review
• GPUs are specialized for
– Compute-intensive, highly parallel computation
– Graphics!
• Transistors are devoted to:
– Processing
– Not:
• Data caching
• Flow control
GPU Architecture Review
Transistor Usage
Image from: http://developer.download.nvidia.com/compute/cuda/3_2_prod/toolkit/docs/CUDA_C_Programming_Guide.pdf
Let’s program this thing!
GPU Computing History
• 2001/2002 – researchers see GPU as data-parallel coprocessor
– The GPGPU field is born
• 2007 – NVIDIA releases CUDA
– CUDA – Compute Unified Device Architecture
– GPGPU shifts to GPU Computing
• 2008 – Khronos releases OpenCL specification
CUDA Abstractions
• A hierarchy of thread groups
• Shared memories
• Barrier synchronization
High Level View
[Diagram: CPU and chipset connected over PCIe to the GPU and its global memory]
Fermi: 14 Streaming Multiprocessors (SMs),
448 CUDA cores
Fermi Multiprocessor
2 warp schedulers
In-order issue
Up to 1536 concurrent threads
32 CUDA Cores
Full IEEE 754-2008 FP32 and FP64
32 FP32 ops/clock
16 FP64 ops/clock
Up to 48 KB shared memory
Up to 48 KB L1 cache
Not coherent across multiprocessors
4 SFUs
32K 32-bit registers
Up to 63 registers per thread
[Fermi SM diagram: instruction cache; 2 warp schedulers, each with a dispatch
unit; register file; 32 CUDA cores; 16 load/store units; 4 special function
units; interconnect network; 64 KB configurable L1 cache / shared memory;
uniform cache]
CUDA Terminology
• Host – typically the CPU
– Code written in ANSI C
• Device – typically the GPU (data-parallel)
– Code written in extended ANSI C
• Host and device have separate memories
• CUDA Program
– Contains both host and device code
CUDA Terminology
• Kernel – data-parallel function
– Invoking a kernel creates lightweight
threads on the device
• Threads are generated and scheduled by
hardware
CUDA Kernels
• Executed N times in parallel by N
different CUDA threads
Thread ID
Declaration Specifier
Execution Configuration
CUDA Code Example
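A minimal sketch of the three elements named above: declaration specifier, thread ID, and execution configuration. The VecAdd name and the d_A, d_B, d_C pointers are illustrative, not from the slides.
__global__ void VecAdd(float* A, float* B, float* C)   // declaration specifier
{
    int i = threadIdx.x;                                // thread ID
    C[i] = A[i] + B[i];
}

// Launched from host code with N threads in a single block:
// VecAdd<<<1, N>>>(d_A, d_B, d_C);                     // execution configuration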
CUDA Program Execution
Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
Thread Hierarchies
• Grid – one or more thread blocks
– 1D or 2D
• Block – array of threads
– 1D, 2D, or 3D
– Each block in a grid has the same number of
threads
– Each thread in a block can
• Synchronize
• Access shared memory
Thread Hierarchies
Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
Thread Hierarchies
• Block – 1D, 2D, or 3D
– Example: Index into vector, matrix,
volume
Thread Hierarchies
• Thread ID: Scalar thread identifier
• Thread Index: threadIdx
• 1D: Thread ID == Thread Index
• 2D with size (Dx, Dy)
– Thread ID of index (x, y) == x + y Dx
• 3D with size (Dx, Dy, Dz)
– Thread ID of index (x, y, z) == x + y Dx + z Dx Dy
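As a sketch, the same scalar thread ID can be computed by hand inside a kernel from the built-in variables; the kernel name and output array are illustrative.
__global__ void writeThreadId(int* out)
{
    // scalar thread ID for a 3D block of size (Dx, Dy, Dz) = blockDim
    int tid = threadIdx.x
            + threadIdx.y * blockDim.x                  // x + y*Dx
            + threadIdx.z * blockDim.x * blockDim.y;    // + z*Dx*Dy
    out[tid] = tid;
}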
Thread Hierarchies
[Figure: a 2D index into a single 2D thread block]
Thread Hierarchies
• Thread Block
– Group of threads
• G80 and GT200: Up to 512 threads
• Fermi: Up to 1024 threads
• Kepler: Up to 1024 threads
– Reside on same SM/SMX
– Share memory of that SM/SMX
Thread Hierarchies
• Thread Block
– Group of threads
• G80 and GT200: Up to 512 threads
• Fermi: Up to 1024 threads
– Reside on same processor core
– Share memory of that core
Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
Thread Hierarchies
• Block index: blockIdx (1D or 2D)
• Block dimension: blockDim (threads per block in each dimension)
Thread Hierarchies
[Figure: a 2D thread block with 16x16 threads per block]
Thread Hierarchies
• Example: N = 32
– 16x16 threads per block (independent of N)
• threadIdx ([0, 15], [0, 15])
– 2x2 thread blocks in grid
• blockIdx ([0, 1], [0, 1])
• blockDim = 16

i = [0, 1] * 16 + [0, 15]
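A sketch of the corresponding index computation in a kernel, assuming the 16x16 blocks and 2x2 grid above and a row-major N x N array; the kernel and array names are illustrative.
__global__ void touch(float* data, int N)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // [0, 1]*16 + [0, 15]
    int row = blockIdx.y * blockDim.y + threadIdx.y;   // [0, 1]*16 + [0, 15]
    if (row < N && col < N)
        data[row * N + col] += 1.0f;
}

// Launch: touch<<<dim3(2, 2), dim3(16, 16)>>>(d_data, 32);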
Thread Hierarchies
• Thread blocks execute independently
– In any order: parallel or series
– Scheduled in any order by any number of
cores
• Allows code to scale with SM count
Thread Hierarchies
Image from: http://developer.download.nvidia.com/compute/cuda/3_2_prod/toolkit/docs/CUDA_C_Programming_Guide.pdf
Thread Hierarchies
• Threads in a block
– Share (limited) low-latency memory
– Synchronize execution
• To coordinate memory accesses
• __syncthreads()
– Barrier – threads in block wait until all threads
reach this
– Lightweight
Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
Memory Spaces
CPU and GPU have separate memory spaces
Data is moved across PCIe bus
Use functions to allocate/set/copy memory on GPU
Very similar to corresponding C functions
Pointers are just addresses
Can’t tell from the pointer value whether the address is on
CPU or GPU
Must exercise care when dereferencing:
Dereferencing a CPU pointer on the GPU will likely crash,
and vice versa
GPU Memory Allocation / Release
Host (CPU) manages device (GPU) memory:
cudaMalloc (void ** pointer, size_t nbytes)
cudaMemset (void * pointer, int value, size_t count)
cudaFree (void* pointer)
int n = 1024;
int nbytes = n*sizeof(int);
int * d_a = 0;
cudaMalloc( (void**)&d_a, nbytes );
cudaMemset( d_a, 0, nbytes);
cudaFree(d_a);
CUDA Memory Transfers
Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
CUDA Memory Transfers
• Host can transfer to/from device
– Global memory
– Constant memory
Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
CUDA Memory Transfers
• cudaMalloc()
– Allocate global memory on device
• cudaFree()
– Frees memory
Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
CUDA Memory Transfers
Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
CUDA Memory Transfers
Pointer to device memory
Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
CUDA Memory Transfers
Size in bytes
Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
CUDA Memory Transfers
• cudaMemcpy()
– Memory transfer
• Host to host
• Host to device
• Device to host
• Device to device
[Figure: host memory and device global memory, showing the transfer directions]
CUDA Memory Transfers
Host to device
Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
CUDA Memory Transfers
Destination (device), Source (host)
Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
CUDA Memory Transfers
Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
Data Copies
cudaMemcpy( void *dst, const void *src, size_t nbytes,
enum cudaMemcpyKind direction);
returns after the copy is complete
blocks CPU thread until all bytes have been copied
doesn’t start copying until previous CUDA calls complete
enum cudaMemcpyKind
cudaMemcpyHostToDevice
cudaMemcpyDeviceToHost
cudaMemcpyDeviceToDevice
Non-blocking memcopies are provided
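A short sketch of both forms, assuming pinned host memory allocated with cudaMallocHost (needed for a truly asynchronous copy); the buffer names are illustrative.
size_t nbytes = 1024 * sizeof(float);
float *h_buf = 0, *d_buf = 0;
cudaMallocHost( (void**)&h_buf, nbytes );   // pinned (page-locked) host memory
cudaMalloc( (void**)&d_buf, nbytes );

// Blocking: returns only after all bytes have been copied
cudaMemcpy( d_buf, h_buf, nbytes, cudaMemcpyHostToDevice );

// Non-blocking: returns immediately; the copy proceeds in stream 0
cudaMemcpyAsync( d_buf, h_buf, nbytes, cudaMemcpyHostToDevice, 0 );
cudaThreadSynchronize();                    // wait before touching h_buf again

cudaFreeHost( h_buf );
cudaFree( d_buf );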
Code Walkthrough 1
Allocate CPU memory for n integers
Allocate GPU memory for n integers
Initialize GPU memory to 0s
Copy from GPU to CPU
Print the values
Code Walkthrough 1
#include <stdio.h>

int main()
{
    int dimx = 16;
    int num_bytes = dimx*sizeof(int);

    int *d_a=0, *h_a=0; // device and host pointers

    h_a = (int*)malloc(num_bytes);
    cudaMalloc( (void**)&d_a, num_bytes );

    if( 0==h_a || 0==d_a )
    {
        printf("couldn't allocate memory\n");
        return 1;
    }
Code Walkthrough 1
    cudaMemset( d_a, 0, num_bytes );
    cudaMemcpy( h_a, d_a, num_bytes, cudaMemcpyDeviceToHost );

    for(int i=0; i<dimx; i++)
        printf("%d ", h_a[i] );
    printf("\n");

    free( h_a );
    cudaFree( d_a );

    return 0;
}
#include <stdio.h>

int main()
{
    int dimx = 16;
    int num_bytes = dimx*sizeof(int);

    int *d_a=0, *h_a=0; // device and host pointers

    h_a = (int*)malloc(num_bytes);
    cudaMalloc( (void**)&d_a, num_bytes );

    if( 0==h_a || 0==d_a )
    {
        printf("couldn't allocate memory\n");
        return 1;
    }

    cudaMemset( d_a, 0, num_bytes );
    cudaMemcpy( h_a, d_a, num_bytes, cudaMemcpyDeviceToHost );

    for(int i=0; i<dimx; i++)
        printf("%d ", h_a[i] );
    printf("\n");

    free( h_a );
    cudaFree( d_a );

    return 0;
}
Basic Kernels and Execution on GPU
CUDA Programming Model
Serial code executes on the host
Parallel code (kernel) is launched and executed on
a GPU by many threads
Parallel code is written for a thread
Each thread is free to execute a unique code path
Built-in thread and block ID variables
[Figure: C program sequential execution, serial host code alternating with
parallel kernel launches on the device: Kernel0<<<>>>() on Grid 0 (blocks
(0,0) through (2,1)) and Kernel1<<<>>>() on Grid 1 (blocks (0,0) through (1,2))]
Thread Hierarchy
Threads launched for a parallel section are
partitioned into a Grid of Thread Blocks
Grid = all blocks for a given launch
A thread block is a group of threads that can:
Synchronize their execution
Communicate via shared memory
Size of grid and blocks are specified during kernel
launch
dim3 grid(3,2), block(12);
kernel<<<grid, block>>>(…);
[Figure: Grid 0 containing blocks (0,0) through (2,1)]
IDs and Dimensions
Threads:
3D IDs, unique within a block
Blocks:
2D IDs, unique within a grid
Dimensions set at launch time
Can be unique for each grid
Built-in variables:
threadIdx
blockIdx
blockDim
gridDim
[Figure: Device holding Grid 1 with blocks (0,0) through (2,1); Block (1,1)
expanded into a 5x3 array of threads (0,0) through (4,2)]
CUDA Code Example
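A sketch that uses all four built-in variables from the previous slide; the grid-stride loop pattern and the names are illustrative, not from the slides.
__global__ void scale(float* data, int n, float s)
{
    int idx    = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    int stride = gridDim.x * blockDim.x;                   // total threads in the grid
    for (int i = idx; i < n; i += stride)                  // grid-stride loop
        data[i] *= s;
}

// Launch example: scale<<<32, 256>>>(d_data, n, 2.0f);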
GPU and Programming Model
Software → GPU hardware
Thread → Scalar processor
Threads are executed by scalar processors
Thread block → Multiprocessor
Thread blocks are executed on multiprocessors
Thread blocks do not migrate
Several concurrent thread blocks can reside on one multiprocessor,
limited by multiprocessor resources
Grid → Device
A kernel is launched as a grid of thread blocks
Only one kernel can execute on a device at one time
Code executed on GPU
Kernel: C function with some restrictions
Can only access GPU memory
No variable number of arguments
No static variables
No recursion
No dynamic memory allocation
Must be declared with a qualifier:
__global__ : launched by the CPU,
cannot be called from the GPU, must return void
__device__ : called from other GPU functions,
cannot be launched by the CPU
__host__ : can be executed by the CPU
__host__ and __device__ qualifiers can be combined
sample use: overloading operators
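A sketch of a combined qualifier used for operator overloading; float2 is a CUDA built-in vector type, and the kernel below is illustrative.
// Compiled for, and callable from, both CPU and GPU code
__host__ __device__ inline float2 operator+(float2 a, float2 b)
{
    return make_float2(a.x + b.x, a.y + b.y);
}

__global__ void addPairs(float2* out, const float2* a, const float2* b)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = a[i] + b[i];   // uses the same operator+ as host code
}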
Code Walkthrough 2
Build on Walkthrough 1
Write a kernel to initialize integers
Copy the result back to CPU
Print the values
Kernel Code (executed on GPU)
__global__ void kernel( int *a )
{
int idx = blockIdx.x*blockDim.x + threadIdx.x;
a[idx] = 7;
}
dim3 grid, block;
block.x = 4;
grid.x = dimx / block.x;
kernel<<<grid, block>>>( d_a );
#include <stdio.h>

__global__ void kernel( int *a )
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    a[idx] = 7;
}

int main()
{
    int dimx = 16;
    int num_bytes = dimx*sizeof(int);

    int *d_a=0, *h_a=0; // device and host pointers

    h_a = (int*)malloc(num_bytes);
    cudaMalloc( (void**)&d_a, num_bytes );

    if( 0==h_a || 0==d_a ){
        printf("couldn't allocate memory\n");
        return 1;
    }

    cudaMemset( d_a, 0, num_bytes );

    dim3 grid, block;
    block.x = 4;
    grid.x  = dimx / block.x;
    kernel<<<grid, block>>>( d_a );

    cudaMemcpy( h_a, d_a, num_bytes, cudaMemcpyDeviceToHost );

    for(int i=0; i<dimx; i++)
        printf("%d ", h_a[i] );
    printf("\n");

    free( h_a );
    cudaFree( d_a );

    return 0;
}
Launching kernels on GPU
Execution Configuration :
<<<Grid, Block, Smem, Stream>>>
grid dimensions (up to 2D), dim3 type
thread-block dimensions (up to 3D), dim3 type
shared memory: number of bytes per block
for extern smem variables declared without size
Optional, 0 by default
stream ID
Optional, 0 by default
dim3 grid(16, 16);
dim3 block(16,16);
kernel<<<grid, block, 0, 0>>>(...);
kernel<<<32, 512>>>(...);
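A sketch of how the shared-memory and stream parameters are used; the reduceSum kernel is illustrative and assumes the block size is a power of two.
__global__ void reduceSum(const int* in, int* out)
{
    extern __shared__ int partial[];          // sized by the launch's smem parameter
    int tid = threadIdx.x;
    partial[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();

    // tree reduction within the block
    for (int s = blockDim.x / 2; s > 0; s /= 2) {
        if (tid < s) partial[tid] += partial[tid + s];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = partial[0];
}

// 256 threads per block, 256*sizeof(int) bytes of shared memory, stream 0:
// reduceSum<<<numBlocks, 256, 256 * sizeof(int), 0>>>(d_in, d_out);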
Kernel Variations and Output
__global__ void kernel( int *a )
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    a[idx] = 7;
}
Output: 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7

__global__ void kernel( int *a )
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    a[idx] = blockIdx.x;
}
Output: 0 0 0 0 1 1 1 1 2 2 2 2 3 3 3 3

__global__ void kernel( int *a )
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    a[idx] = threadIdx.x;
}
Output: 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3
Blocks must be independent
Thread blocks can run in any order
Concurrently or sequentially
Facilitates scaling of the same code across many devices
Scalability
CUDA Memory System
Memory Model Review
CPU and GPU have separate memory spaces
Data is moved across the PCIe bus
[Figure: cudaMemcpy() transfers between host memory, device 0 memory, and
device 1 memory]
GPU Memory Model Review
Per-thread local memory: private to each thread
Per-block shared memory: shared by the threads of one block
Per-device global memory: shared by all threads and by sequential kernels
(Kernel 0, Kernel 1, ...)
Global Memory
[Figure: sequential kernels (Kernel 0, Kernel 1, ...) all accessing the same
per-device global memory]
Accessible by all threads as well as host (CPU)
Data lifetime = from allocation to deallocation
GPU Memory Allocation / Release
Host (CPU) manages device (GPU) memory:
cudaMalloc (void ** pointer, size_t nbytes)
cudaMemset (void * pointer, int value, size_t count)
cudaFree (void* pointer)
int n = 1024;
int nbytes = n*sizeof(int);
int * d_a = 0;
cudaMalloc( (void**)&d_a, nbytes );
cudaMemset( d_a, 0, nbytes);
cudaFree(d_a);
Shared Memory
__shared__ int a[SIZE];
Allocated per thread block
Data lifetime = block lifetime
Accessible by any thread in the thread block
Not accessible to other thread blocks
Several uses:
Sharing data among threads in a thread block
User-managed cache (reducing global memory accesses)
Registers and Local Memory
Thread
Per-thread
Local Storage
Automatic variables (scalar/array) inside kernels are placed in registers
They spill to local memory when register resources are exhausted
Data lifetime = thread lifetime
Shared Memory
Shared Memory
On-chip memory
2 orders of magnitude lower latency than global memory
Order of magnitude higher bandwidth than global memory
16 KB or 48 KB per multiprocessor for Fermi architecture
(up to 15 multiprocessors)
Allocated per thread block
Accessible to any thread in the thread block
Not accessible to other thread blocks
Several uses:
Sharing data among threads in a thread block
User-managed cache (reducing global memory accesses)
Example of Using Shared Memory
Applying a 1D stencil to a 1D array of elements:
Each output element is the sum of all elements within a
radius
For example, for radius = 3, each output element is
the sum of 7 input elements:
[Figure: an output element and the radius input elements on either side of it]
Implementation with Shared Memory
Each block outputs one element per thread, so a total
of BLOCK_SIZE output elements:
BLOCK_SIZE = number of threads per block
Read (BLOCK_SIZE + 2 * RADIUS) elements from global
memory to shared memory
Compute BLOCK_SIZE output elements in shared memory
Write BLOCK_SIZE output elements to global memory
[Figure: the shared-memory tile holds a "halo" of RADIUS elements on the left,
the BLOCK_SIZE input elements corresponding to the output elements, and a
"halo" of RADIUS elements on the right]
Kernel Code
RADIUS = 3
BLOCK_SIZE = 16
__global__ void stencil(int* in, int* out)
{
    __shared__ int shared[BLOCK_SIZE + 2 * RADIUS];

    int globIdx = blockIdx.x * blockDim.x + threadIdx.x;
    int locIdx  = threadIdx.x + RADIUS;

    // each thread loads one element; the first RADIUS threads also load the halos
    shared[locIdx] = in[globIdx];
    if (threadIdx.x < RADIUS) {
        shared[locIdx - RADIUS]     = in[globIdx - RADIUS];
        shared[locIdx + BLOCK_SIZE] = in[globIdx + BLOCK_SIZE];
    }
    __syncthreads();

    int value = 0;
    for (int offset = -RADIUS; offset <= RADIUS; offset++)
        value += shared[locIdx + offset];
    out[globIdx] = value;
}
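A sketch of launching the kernel above; it assumes RADIUS and BLOCK_SIZE are defined as listed (e.g. as #defines), N is a multiple of BLOCK_SIZE, and d_in was allocated with N + 2*RADIUS elements so the halo loads stay in bounds. d_in, d_out, and N are illustrative.
int N = 1024;                                   // number of output elements
int numBlocks = N / BLOCK_SIZE;                 // N assumed divisible by BLOCK_SIZE
// skip the left padding so in[globIdx - RADIUS] never reads before the array
stencil<<<numBlocks, BLOCK_SIZE>>>(d_in + RADIUS, d_out);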
Thread Synchronization Function
void __syncthreads();
Synchronizes all threads in a thread block
Needed because threads are scheduled at run time in no guaranteed order
Once all threads have reached this point, execution
resumes normally
Used to avoid RAW / WAR / WAW hazards when accessing
shared memory
Should be used in conditional code only if the
conditional is uniform across the entire thread
block
Coordinating CPU and GPU
Execution
Synchronizing GPU and CPU
All kernel launches are asynchronous
Control returns to CPU immediately
Kernel starts executing after all preceding CUDA calls
complete
cudaMemcpy() is synchronous
Control returns to CPU once the copy is complete
Copy starts once all previous CUDA calls have completed
cudaMemcpyAsync() is asynchronous
cudaThreadSynchronize()
Blocks until all previous CUDA calls complete
Asynchronous CUDA calls provide ability to:
Overlap memory copies and kernel execution
Concurrently execute several kernels
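A sketch of overlapping a copy with a kernel using two streams; it assumes h_a is pinned host memory and that kernel, grid, block, d_a, d_b, and nbytes are defined elsewhere.
cudaStream_t s0, s1;
cudaStreamCreate(&s0);
cudaStreamCreate(&s1);

cudaMemcpyAsync(d_a, h_a, nbytes, cudaMemcpyHostToDevice, s0);
kernel<<<grid, block, 0, s1>>>(d_b);      // runs concurrently with the copy in s0
cudaThreadSynchronize();                  // wait for both streams to finish

cudaStreamDestroy(s0);
cudaStreamDestroy(s1);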
CUDA Error Reporting to CPU
All CUDA calls return error code
except kernel launches
cudaError_t type
cudaError_t cudaGetLastError(void)
returns the code for the last error (“no error” has a code)
char* cudaGetErrorString(cudaError_t code)
returns a null-terminated character string describing the error
printf("%s\n", cudaGetErrorString(cudaGetLastError()));
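A common pattern built only on the calls above is to wrap them in a small checking macro; the CHECK_CUDA name is illustrative, not part of the CUDA API.
#define CHECK_CUDA(call)                                        \
  do {                                                          \
      cudaError_t err = (call);                                 \
      if (err != cudaSuccess)                                   \
          printf("CUDA error: %s\n", cudaGetErrorString(err));  \
  } while (0)

// Usage:
// CHECK_CUDA( cudaMalloc((void**)&d_a, nbytes) );
// kernel<<<grid, block>>>(d_a);
// CHECK_CUDA( cudaGetLastError() );   // kernel launches themselves return no error code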
CUDA Event API
Events are inserted (recorded) into CUDA call streams
Usage scenarios:
Measure elapsed time for CUDA calls
Query the status of an asynchronous CUDA call
Block CPU until CUDA calls prior to the event are completed
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
kernel<<<grid, block>>>(...);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);

float time;
cudaEventElapsedTime(&time, start, stop);

cudaEventDestroy(start);
cudaEventDestroy(stop);
Device Management
CPU can query and select GPU devices
cudaGetDeviceCount(int* count)
cudaSetDevice(int device)
cudaGetDevice(int* current_device)
cudaGetDeviceProperties(cudaDeviceProp* prop, int device)
cudaChooseDevice(int *device, cudaDeviceProp* prop)
Multi-GPU setup:
Device 0 is used by default
One CPU thread can control one GPU
Multiple CPU threads can control the same GPU
Calls are serialized by the driver
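A sketch that ties these calls together, enumerating the available devices and selecting one (illustrative only).
int count = 0;
cudaGetDeviceCount(&count);
for (int dev = 0; dev < count; ++dev) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, dev);
    printf("Device %d: %s, %d multiprocessors\n",
           dev, prop.name, prop.multiProcessorCount);
}
cudaSetDevice(0);    // device 0 is the default anyway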
CUDA Development Resources
CUDA Programming Resources
CUDA toolkit
Compiler, libraries, and documentation
free download for Windows, Linux, and MacOS
CUDA SDK
code samples
whitepapers
Instructional materials on CUDA Zone
slides and audio
parallel programming course at the University of Illinois at Urbana-Champaign
tutorials
forums
GPU Tools
Profiler
Available now for all supported OSs
Command-line or GUI
Sampling signals on GPU for:
Memory access parameters
Execution (serialization, divergence)
Debugger
Currently Linux only (cuda-gdb)
Runs on the GPU
Emulation mode
Matrix Multiply Reminder
• Vectors
• Dot products
• Row major or column major?
• Dot product per output element
Matrix Multiply
• P = M * N
• Assume M and N are square for simplicity
Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
Matrix Multiply
• 1,000 x 1,000 matrix
• 1,000,000 dot products
• Each dot product: 1,000 multiplies and 1,000 adds
Matrix Multiply: CPU Implementation
void MatrixMulOnHost(float* M, float* N, float* P, int width)
{
    for (int i = 0; i < width; ++i)
        for (int j = 0; j < width; ++j)
        {
            float sum = 0;
            for (int k = 0; k < width; ++k)
            {
                float a = M[i * width + k];
                float b = N[k * width + j];
                sum += a * b;
            }
            P[i * width + j] = sum;
        }
}
Code from: http://courses.engr.illinois.edu/ece498/al/lectures/lecture3%20cuda%20threads%20spring%202010.ppt
Matrix Multiply: CUDA Skeleton
Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
Matrix Multiply: CUDA Skeleton
Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
Matrix Multiply: CUDA Skeleton
Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
Matrix Multiply
• Step 1
– Add CUDA memory transfers to the skeleton
Matrix Multiply: Data Transfer
Allocate input
Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
Matrix Multiply: Data Transfer
Allocate output
Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
Matrix Multiply: Data Transfer
Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
Matrix Multiply: Data Transfer
Read back
from device
Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
Matrix Multiply: Data Transfer
Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
Matrix Multiply
• Step 2
– Implement the kernel in CUDA C
Matrix Multiply: CUDA Kernel
Accessing a matrix, so using a 2D block
Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
Matrix Multiply: CUDA Kernel
Each kernel computes one output
Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
Matrix Multiply: CUDA Kernel
Where did the two outer for loops
in the CPU implementation go?
Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
Matrix Multiply: CUDA Kernel
No locks or synchronization, why?
Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
Matrix Multiply
• Step 3
– Invoke the kernel in CUDA C
Matrix Multiply: Invoke Kernel
One block with width
by width threads
Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
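A sketch consistent with the description in these slides (one thread block of width x width threads, one thread per output element of Pd); not the verbatim textbook listing.
__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int width)
{
    // the 2D thread index selects one element of Pd
    int row = threadIdx.y;
    int col = threadIdx.x;

    float sum = 0;
    for (int k = 0; k < width; ++k)
        sum += Md[row * width + k] * Nd[k * width + col];

    Pd[row * width + col] = sum;
}

// Step 3: invoke with a single block of width x width threads
// dim3 dimGrid(1, 1);
// dim3 dimBlock(width, width);
// MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, width);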
Matrix Multiply
• One block of threads computes matrix Pd
– Each thread computes one element of Pd
• Each thread
– Loads a row of matrix Md
– Loads a column of matrix Nd
– Performs one multiply and addition for each pair of Md and Nd elements
– Compute to off-chip memory access ratio close to 1:1 (not very high)
• Size of matrix limited by the number of threads allowed in a thread block
[Figure: Grid 1 with Block 1; thread (2, 2) computes one element of Pd from a
row of Md and a column of Nd; matrices are WIDTH x WIDTH]
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009
ECE 498AL Spring 2010, University of Illinois, Urbana-Champaign
Slide from: http://courses.engr.illinois.edu/ece498/al/lectures/lecture2%20cuda%20spring%2009.ppt
Matrix Multiply
• What is the major performance problem with
our implementation?
• What is the major limitation?