Introduction to CUDA (1)
BIN ZHOU, USTC, Fall 2012

Clarification for last week's class
  Warp0  LOAD R1, [R2]
  Warp0  ADD  R3, R4, R5
  Warp0  ADD  R1, R2, R7
Key point: arithmetic instructions that follow a memory access do not stall the pipeline as long as they do not depend on it. The warp stalls only when a dependent instruction is reached.

Announcements
• Hw2 released today. Pay attention to the due date!
• Hw1 went well. Good students: 张振国, 刘源, 秦子龙, 李鑫, 陈俊仕, 张海博, 周学进, 李丰, 张爱民, 程亦超, 张然, 王锋, 陈宇超
• Some students attended neither the lab work nor the homework. Please pay attention!
• We will start to use the network resource system: szkc.jingpinke.com

Advertisement
• NVIDIA Corp. is holding its campus recruitment.
• Tomorrow (Sunday, 10/14) at 18:00, 西活学术报告厅.

Acknowledgements
• Many slides are from David Kirk and Wen-mei Hwu's UIUC course.
• Most slides are from Patrick Cozzi, University of Pennsylvania, CIS 565.

GPU Architecture Review
• GPUs are specialized for
  – Compute-intensive, highly parallel computation
  – Graphics!
• Transistors are devoted to:
  – Processing
  – Not: data caching, flow control
• [Figure: CPU vs. GPU transistor usage. Image from: http://developer.download.nvidia.com/compute/cuda/3_2_prod/toolkit/docs/CUDA_C_Programming_Guide.pdf]
• Let's program this thing!

GPU Computing History
• 2001/2002 – researchers see the GPU as a data-parallel coprocessor
  – The GPGPU field is born
• 2007 – NVIDIA releases CUDA
  – CUDA – Compute Unified Device Architecture
  – GPGPU shifts to GPU Computing
• 2008 – Khronos releases the OpenCL specification

CUDA Abstractions
• A hierarchy of thread groups
• Shared memories
• Barrier synchronization

High Level View
• [Figure: CPU and chipset connected over PCIe to the GPU and its global memory]
• Fermi: 14 Streaming Multiprocessors (SMs), 448 CUDA cores

Fermi Multiprocessor
• 2 warp schedulers, in-order issue, up to 1536 concurrent threads
• 32 CUDA cores
  – Full IEEE 754-2008 FP32 and FP64
  – 32 FP32 ops/clock, 16 FP64 ops/clock
• Up to 48 KB shared memory
• Up to 48 KB L1 cache (not coherent across multiprocessors)
• 4 Special Function Units (SFUs), 16 load/store units
• 32K 32-bit registers, up to 63 registers per thread
• [Figure: SM block diagram – instruction cache, 2 schedulers and dispatch units, register file, 32 cores, 16 load/store units, 4 SFUs, interconnect network, 64 KB configurable L1 cache / shared memory, uniform cache]

CUDA Terminology
• Host – typically the CPU
  – Code written in ANSI C
• Device – typically the GPU (data-parallel)
  – Code written in extended ANSI C
• Host and device have separate memories
• CUDA Program – contains both host and device code
• Kernel – data-parallel function
  – Invoking a kernel creates lightweight threads on the device
  – Threads are generated and scheduled by hardware

CUDA Kernels
• Executed N times in parallel by N different CUDA threads
• Key pieces: declaration specifier, execution configuration, thread ID
• [CUDA code example and program execution. Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf]

Thread Hierarchies
• Grid – one or more thread blocks
  – 1D or 2D
• Block – array of threads
  – 1D, 2D, or 3D
  – Each block in a grid has the same number of threads
  – Each thread in a block can
    • Synchronize
    • Access shared memory
• Example: a 1D, 2D, or 3D block indexes into a vector, matrix, or volume
• Thread ID: scalar thread identifier
• Thread Index: threadIdx
  – 1D: Thread ID == Thread Index
  – 2D block of size (Dx, Dy): the thread of index (x, y) has Thread ID x + y*Dx
  – 3D block of size (Dx, Dy, Dz): the thread of index (x, y, z) has Thread ID x + y*Dx + z*Dx*Dy
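As a concrete illustration of the flattening formulas above, here is a minimal sketch (not from the original slides; the kernel name and output array are invented for illustration) in which each thread writes its own computed Thread ID into global memory:

  __global__ void writeThreadIds(int* out)
  {
      // Flatten the 3D thread index exactly as in the formulas above
      int tid = threadIdx.x
              + threadIdx.y * blockDim.x
              + threadIdx.z * blockDim.x * blockDim.y;
      out[tid] = tid;
  }

  // Launch with a single 3D block of size (Dx, Dy, Dz) = (4, 2, 2):
  // dim3 block(4, 2, 2);
  // writeThreadIds<<<1, block>>>(d_out);   // d_out holds 4*2*2 = 16 ints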
Thread Hierarchies
• [Figure: a 2D thread index within one 2D thread block]
• Thread Block – group of threads
  – G80 and GT200: up to 512 threads
  – Fermi: up to 1024 threads
  – Kepler: up to 1024 threads
  – Reside on the same SM/SMX
  – Share the memory of that SM/SMX
  [Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf]
• Block Index: blockIdx
• Dimension: blockDim
  – 1D or 2D
• [Figure: a 2D thread block with 16x16 threads per block]
• Example: N = 32
  – 16x16 threads per block (independent of N)
    • threadIdx in ([0, 15], [0, 15])
  – 2x2 thread blocks in the grid
    • blockIdx in ([0, 1], [0, 1])
    • blockDim = 16
  – i = [0, 1] * 16 + [0, 15]
• Thread blocks execute independently
  – In any order: in parallel or in series
  – Scheduled in any order by any number of cores
  – This allows code to scale with the SM count
• Threads in a block
  – Share (limited) low-latency memory
  – Synchronize execution
    • To coordinate memory accesses
    • __syncthreads(): barrier – threads in the block wait until all threads reach it; lightweight

Memory Spaces
  CPU and GPU have separate memory spaces
    Data is moved across the PCIe bus
    Use functions to allocate/set/copy memory on the GPU
      Very similar to the corresponding C functions
  Pointers are just addresses
    Can't tell from the pointer value whether the address is on the CPU or the GPU
    Must exercise care when dereferencing:
      Dereferencing a CPU pointer on the GPU will likely crash, and vice versa

GPU Memory Allocation / Release
  Host (CPU) manages device (GPU) memory:
    cudaMalloc(void** pointer, size_t nbytes)
    cudaMemset(void* pointer, int value, size_t count)
    cudaFree(void* pointer)

  int n = 1024;
  int nbytes = n * sizeof(int);
  int* d_a = 0;
  cudaMalloc( (void**)&d_a, nbytes );
  cudaMemset( d_a, 0, nbytes );
  cudaFree( d_a );

CUDA Memory Transfers
• Host can transfer to/from device
  – Global memory
  – Constant memory
• cudaMalloc() – allocates global memory on the device
• cudaFree() – frees device memory
• [Code figure with annotations: pointer to device memory, size in bytes. Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf]
• cudaMemcpy() – memory transfer between host memory and device global memory
  – Host to host
  – Host to device
  – Device to host
  – Device to device
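To make the transfer directions concrete, here is a small sketch (mine, not from the slides; the array names and size are placeholders) that copies an array to the device and back:

  int nbytes = 1024 * sizeof(int);
  int* h_a = (int*)malloc(nbytes);       // host memory
  int* d_a = 0;
  cudaMalloc( (void**)&d_a, nbytes );    // device (global) memory

  cudaMemcpy( d_a, h_a, nbytes, cudaMemcpyHostToDevice );  // host -> device
  // ... launch kernels that read/write d_a ...
  cudaMemcpy( h_a, d_a, nbytes, cudaMemcpyDeviceToHost );  // device -> host

  cudaFree( d_a );
  free( h_a );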
CUDA Memory Transfers
• Host-to-device copy: the destination is device memory, the source is host memory
  [Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf]

Data Copies
  cudaMemcpy( void* dst, void* src, size_t nbytes, enum cudaMemcpyKind direction );
    Returns after the copy is complete
    Blocks the CPU thread until all bytes have been copied
    Doesn't start copying until previous CUDA calls complete
  enum cudaMemcpyKind:
    cudaMemcpyHostToDevice
    cudaMemcpyDeviceToHost
    cudaMemcpyDeviceToDevice
  Non-blocking memcopies are also provided

Code Walkthrough 1
  Allocate CPU memory for n integers
  Allocate GPU memory for n integers
  Initialize GPU memory to 0s
  Copy from GPU to CPU
  Print the values

  #include <stdio.h>

  int main()
  {
      int dimx = 16;
      int num_bytes = dimx * sizeof(int);

      int *d_a = 0, *h_a = 0;   // device and host pointers

      h_a = (int*)malloc(num_bytes);
      cudaMalloc( (void**)&d_a, num_bytes );

      if( 0 == h_a || 0 == d_a ) {
          printf("couldn't allocate memory\n");
          return 1;
      }

      cudaMemset( d_a, 0, num_bytes );
      cudaMemcpy( h_a, d_a, num_bytes, cudaMemcpyDeviceToHost );

      for(int i = 0; i < dimx; i++)
          printf("%d ", h_a[i]);
      printf("\n");

      free( h_a );
      cudaFree( d_a );

      return 0;
  }

Basic Kernels and Execution on GPU

CUDA Programming Model
  Parallel code (kernel) is launched and executed on a GPU by many threads
  Parallel code is written for a thread
    Each thread is free to execute a unique code path
    Built-in thread and block ID variables
  [Figure: a C program's sequential execution – serial code on the host alternates with parallel kernels (Kernel0<<<>>>, Kernel1<<<>>>) launched on the device, each as a grid of blocks]

Thread Hierarchy
  Threads launched for a parallel section are partitioned into a grid of thread blocks
    Grid = all blocks for a given launch
  A thread block is a group of threads that can:
    Synchronize their execution
    Communicate via shared memory
  The sizes of the grid and of the blocks are specified at kernel launch:
    dim3 grid(3,2), block(12);
    kernel<<<grid, block>>>(…);
  [Figure: Grid 0 with blocks (0,0) through (2,1)]
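For the dim3 grid(3,2), block(12) launch above, each thread can derive a unique global index from the built-in variables. A small sketch (mine; the kernel and output array are invented for illustration):

  __global__ void tagThreads(int* out)
  {
      // Flatten the 2D block index, then add this thread's offset within the block
      int block = blockIdx.y * gridDim.x + blockIdx.x;   // 0 .. 5 for a 3x2 grid
      int idx   = block * blockDim.x + threadIdx.x;      // 0 .. 71 for 12-thread blocks
      out[idx] = block;
  }

  // dim3 grid(3,2), block(12);
  // tagThreads<<<grid, block>>>(d_out);   // d_out holds 3*2*12 = 72 ints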
IDs and Dimensions
  Threads: 3D IDs, unique within a block
  Blocks: 2D IDs, unique within a grid
  Dimensions are set at launch time
    Can be unique for each grid
  Built-in variables: threadIdx, blockIdx, blockDim, gridDim
  [Figure: Grid 1 of blocks (0,0) through (2,1); block (1,1) expanded into a 5x3 array of threads]

CUDA Code Example
  [Code figure]

GPU and Programming Model
  Software        GPU
  Thread          Scalar Processor   Threads are executed by scalar processors
  Thread Block    Multiprocessor     Thread blocks are executed on multiprocessors
                                     Thread blocks do not migrate
                                     Several concurrent thread blocks can reside on one
                                     multiprocessor, limited by multiprocessor resources
  Grid            Device             A kernel is launched as a grid of thread blocks
                                     Only one kernel can execute on a device at one time

Code executed on GPU
  Kernel: a C function with some restrictions:
    Can only access GPU memory
    No variable number of arguments
    No static variables
    No recursion
    No dynamic memory allocation
  Must be declared with a qualifier:
    __global__ : launched by the CPU, cannot be called from the GPU, must return void
    __device__ : called from other GPU functions, cannot be launched by the CPU
    __host__   : executed by the CPU
    __host__ and __device__ qualifiers can be combined (sample use: overloading operators)

Code Walkthrough 2
  Build on Walkthrough 1
  Write a kernel to initialize the integers
  Copy the result back to the CPU
  Print the values

Kernel Code (executed on GPU)
  __global__ void kernel( int *a )
  {
      int idx = blockIdx.x * blockDim.x + threadIdx.x;
      a[idx] = 7;
  }

  dim3 grid, block;
  block.x = 4;
  grid.x  = dimx / block.x;
  kernel<<<grid, block>>>( d_a );

Full program:
  #include <stdio.h>

  __global__ void kernel( int *a )
  {
      int idx = blockIdx.x * blockDim.x + threadIdx.x;
      a[idx] = 7;
  }

  int main()
  {
      int dimx = 16;
      int num_bytes = dimx * sizeof(int);

      int *d_a = 0, *h_a = 0;   // device and host pointers

      h_a = (int*)malloc(num_bytes);
      cudaMalloc( (void**)&d_a, num_bytes );

      if( 0 == h_a || 0 == d_a ) {
          printf("couldn't allocate memory\n");
          return 1;
      }

      cudaMemset( d_a, 0, num_bytes );

      dim3 grid, block;
      block.x = 4;
      grid.x  = dimx / block.x;
      kernel<<<grid, block>>>( d_a );

      cudaMemcpy( h_a, d_a, num_bytes, cudaMemcpyDeviceToHost );

      for(int i = 0; i < dimx; i++)
          printf("%d ", h_a[i]);
      printf("\n");

      free( h_a );
      cudaFree( d_a );

      return 0;
  }

Launching kernels on GPU
  Execution Configuration: <<<Grid, Block, Smem, Stream>>>
    Grid dimensions (up to 2D), dim3 type
    Thread-block dimensions (up to 3D), dim3 type
    Shared memory: number of bytes per block for extern smem variables declared without size
      Optional, 0 by default
    Stream ID
      Optional, 0 by default

  dim3 grid(16, 16);
  dim3 block(16, 16);
  kernel<<<grid, block, 0, 0>>>(...);
  kernel<<<32, 512>>>(...);

Kernel Variations and Output
  __global__ void kernel( int *a )
  {
      int idx = blockIdx.x * blockDim.x + threadIdx.x;
      a[idx] = 7;
  }
  Output: 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7

  __global__ void kernel( int *a )
  {
      int idx = blockIdx.x * blockDim.x + threadIdx.x;
      a[idx] = blockIdx.x;
  }
  Output: 0 0 0 0 1 1 1 1 2 2 2 2 3 3 3 3

  __global__ void kernel( int *a )
  {
      int idx = blockIdx.x * blockDim.x + threadIdx.x;
      a[idx] = threadIdx.x;
  }
  Output: 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3
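The launches above assume dimx is an exact multiple of block.x. A common variant (my sketch, not part of the original walkthrough) rounds the grid size up and guards the store so that arbitrary sizes also work:

  __global__ void kernel( int *a, int n )
  {
      int idx = blockIdx.x * blockDim.x + threadIdx.x;
      if (idx < n)          // extra threads in the last, partially filled block do nothing
          a[idx] = 7;
  }

  // block.x = 4;
  // grid.x  = (dimx + block.x - 1) / block.x;   // round up instead of assuming divisibility
  // kernel<<<grid, block>>>( d_a, dimx );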
Blocks must be independent
  Thread blocks can run in any order
    Concurrently or sequentially
  This facilitates scaling of the same code across many devices

Scalability

CUDA Memory System

Memory Model Review
  CPU and GPU have separate memory spaces
  Data is moved across the PCIe bus with cudaMemcpy()
  [Figure: host memory, device 0 memory, and device 1 memory connected by cudaMemcpy()]

GPU Memory Model Review
  Thread → per-thread local memory
  Block → per-block shared memory
  Sequential kernels (Kernel 0, Kernel 1, …) → per-device global memory
    Accessible by all threads as well as by the host (CPU)
    Data lifetime = from allocation to deallocation

GPU Memory Allocation / Release (as before)
  Host (CPU) manages device (GPU) memory: cudaMalloc, cudaMemset, cudaFree

Shared Memory
  __shared__ int a[SIZE];
  Allocated per thread block
  Data lifetime = block lifetime
  Accessible by any thread in the thread block
    Not accessible to other thread blocks
  Several uses:
    Sharing data among threads in a thread block
    User-managed cache (reducing global memory accesses)

Registers and Local Memory
  Per-thread local storage
  Automatic variables (scalar/array) inside kernels
    Spills to local memory
  Data lifetime = thread lifetime

Shared Memory
  On-chip memory
    Two orders of magnitude lower latency than global memory
    An order of magnitude higher bandwidth than global memory
  16 KB or 48 KB per multiprocessor on the Fermi architecture (up to 15 multiprocessors)
  Allocated per thread block
  Accessible to any thread in the thread block, not accessible to other thread blocks
  Several uses:
    Sharing data among threads in a thread block
    User-managed cache (reducing global memory accesses)

Example of Using Shared Memory
  Applying a 1D stencil to a 1D array of elements:
    Each output element is the sum of all input elements within a radius
    For example, for radius = 3, each output element is the sum of 7 input elements
      (the element itself plus the radius elements on either side)

Implementation with Shared Memory
  Each block outputs one element per thread, so a total of BLOCK_SIZE output elements
    BLOCK_SIZE = number of threads per block
  Read (BLOCK_SIZE + 2 * RADIUS) input elements from global memory into shared memory:
    a "halo" of RADIUS elements on the left,
    the BLOCK_SIZE input elements corresponding to the output elements,
    and a "halo" of RADIUS elements on the right
  Compute BLOCK_SIZE output elements in shared memory
  Write BLOCK_SIZE output elements to global memory

Kernel Code (RADIUS = 3, BLOCK_SIZE = 16)
  __global__ void stencil(int* in, int* out)
  {
      __shared__ int shared[BLOCK_SIZE + 2 * RADIUS];

      int globIdx = blockIdx.x * blockDim.x + threadIdx.x;
      int locIdx  = threadIdx.x + RADIUS;

      shared[locIdx] = in[globIdx];
      if (threadIdx.x < RADIUS) {
          shared[locIdx - RADIUS]     = in[globIdx - RADIUS];      // left halo
          shared[locIdx + BLOCK_SIZE] = in[globIdx + BLOCK_SIZE];  // right halo
      }
      __syncthreads();

      int value = 0;
      for (int offset = -RADIUS; offset <= RADIUS; offset++)
          value += shared[locIdx + offset];

      out[globIdx] = value;
  }

Thread Synchronization Function
  void __syncthreads();
  Synchronizes all threads in a thread block
    Needed because threads are scheduled at run time
    Once all threads have reached this point, execution resumes normally
  Used to avoid RAW / WAR / WAW hazards when accessing shared memory
  Should be used in conditional code only if the conditional is uniform across the entire thread block

Coordinating CPU and GPU Execution

Synchronizing GPU and CPU
  All kernel launches are asynchronous
    Control returns to the CPU immediately
    The kernel starts executing after all preceding CUDA calls have completed
  cudaMemcpy() is synchronous
    Control returns to the CPU once the copy is complete
    The copy starts once all previous CUDA calls have completed
  cudaMemcpyAsync() is asynchronous
  cudaThreadSynchronize()
    Blocks until all previous CUDA calls complete
  Asynchronous CUDA calls provide the ability to:
    Overlap memory copies and kernel execution
    Concurrently execute several kernels
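Tying these pieces together, a host-side driver for the stencil kernel above might look like the following. This is my sketch, not from the slides: N, the host arrays h_in and h_out, and the decision to pad the input with RADIUS elements on each side (so the halo loads of the first and last blocks stay in bounds) are all assumptions; RADIUS and BLOCK_SIZE are assumed to be compile-time constants as on the slide.

  int N = 1024;                                    // number of output elements (multiple of BLOCK_SIZE)
  int in_bytes  = (N + 2 * RADIUS) * sizeof(int);  // input is padded by RADIUS on each side
  int out_bytes = N * sizeof(int);

  int *d_in = 0, *d_out = 0;
  cudaMalloc( (void**)&d_in,  in_bytes );
  cudaMalloc( (void**)&d_out, out_bytes );

  cudaMemset( d_in, 0, in_bytes );                 // zero the padding (and the data, initially)
  cudaMemcpy( d_in + RADIUS, h_in, out_bytes, cudaMemcpyHostToDevice );  // real data after the left pad

  // Asynchronous launch: one thread per output element; pass the pointer offset past the left padding
  stencil<<< N / BLOCK_SIZE, BLOCK_SIZE >>>( d_in + RADIUS, d_out );

  // Blocking copy: waits for the kernel to finish, then brings the result back
  cudaMemcpy( h_out, d_out, out_bytes, cudaMemcpyDeviceToHost );

  cudaFree( d_in );
  cudaFree( d_out );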
CUDA Error Reporting to CPU
  All CUDA calls return an error code, except kernel launches
    cudaError_t type
  cudaError_t cudaGetLastError(void)
    Returns the code for the last error ("no error" has a code)
  const char* cudaGetErrorString(cudaError_t code)
    Returns a null-terminated character string describing the error

  printf("%s\n", cudaGetErrorString(cudaGetLastError()));

CUDA Event API
  Events are inserted (recorded) into CUDA call streams
  Usage scenarios:
    Measure elapsed time for CUDA calls
    Query the status of an asynchronous CUDA call
    Block the CPU until CUDA calls prior to the event have completed

  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);
  cudaEventRecord(start, 0);
  kernel<<<grid, block>>>(...);
  cudaEventRecord(stop, 0);
  cudaEventSynchronize(stop);
  float time;
  cudaEventElapsedTime(&time, start, stop);
  cudaEventDestroy(start);
  cudaEventDestroy(stop);

Device Management
  The CPU can query and select GPU devices
    cudaGetDeviceCount(int* count)
    cudaSetDevice(int device)
    cudaGetDevice(int* current_device)
    cudaGetDeviceProperties(cudaDeviceProp* prop, int device)
    cudaChooseDevice(int* device, cudaDeviceProp* prop)
  Multi-GPU setup:
    Device 0 is used by default
    One CPU thread can control one GPU
    Multiple CPU threads can control the same GPU
      Calls are serialized by the driver

CUDA Development Resources

CUDA Programming Resources
  CUDA toolkit
    Compiler, libraries, and documentation
    Free download for Windows, Linux, and MacOS
  CUDA SDK
    Code samples
    Whitepapers
  Instructional materials on CUDA Zone
    Slides and audio
    Parallel programming course at the University of Illinois (UIUC)
    Tutorials
    Forums

GPU Tools
  Profiler
    Available now for all supported OSs
    Command-line or GUI
    Samples signals on the GPU for:
      Memory access parameters
      Execution (serialization, divergence)
  Debugger
    Currently Linux only (gdb)
    Runs on the GPU
  Emulation mode
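Before moving on to the matrix-multiply example, here is one way to fold the error-reporting calls above into a reusable check. The macro name is my invention, and cudaDeviceSynchronize is used so that errors raised during asynchronous kernel execution are also caught; this is a sketch, not a prescribed pattern from the slides.

  #include <stdio.h>

  // Check the result of a CUDA runtime call (hypothetical helper, not from the slides)
  #define CUDA_CHECK(call)                                              \
      do {                                                              \
          cudaError_t err = (call);                                     \
          if (err != cudaSuccess) {                                     \
              printf("CUDA error at %s:%d: %s\n",                       \
                     __FILE__, __LINE__, cudaGetErrorString(err));      \
          }                                                             \
      } while (0)

  // Kernel launches return no error code, so query the error state afterwards:
  // kernel<<<grid, block>>>(d_a);
  // CUDA_CHECK( cudaGetLastError() );        // launch-configuration errors
  // CUDA_CHECK( cudaDeviceSynchronize() );   // errors raised during execution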
Matrix Multiply Reminder
• Vectors
• Dot products
• Row major or column major?
• Dot product per output element

Matrix Multiply
• P = M * N
• Assume M and N are square for simplicity
  [Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf]
• A 1,000 x 1,000 matrix requires 1,000,000 dot products
  – Each with 1,000 multiplies and 1,000 adds

Matrix Multiply: CPU Implementation
  void MatrixMulOnHost(float* M, float* N, float* P, int width)
  {
      for (int i = 0; i < width; ++i)
          for (int j = 0; j < width; ++j)
          {
              float sum = 0;
              for (int k = 0; k < width; ++k)
              {
                  float a = M[i * width + k];
                  float b = N[k * width + j];
                  sum += a * b;
              }
              P[i * width + j] = sum;
          }
  }
  Code from: http://courses.engr.illinois.edu/ece498/al/lectures/lecture3%20cuda%20threads%20spring%202010.ppt

Matrix Multiply: CUDA Skeleton
  [Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf]

Matrix Multiply
• Step 1 – Add CUDA memory transfers to the skeleton

Matrix Multiply: Data Transfer
• Allocate the input and output matrices on the device, copy the inputs over, and read the result back from the device
  [Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf]

Matrix Multiply
• Step 2 – Implement the kernel in CUDA C

Matrix Multiply: CUDA Kernel
• Accessing a matrix, so use a 2D block
• Each thread computes one output element
• Where did the two outer for loops of the CPU implementation go?
• No locks or synchronization: why?
  [Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf]

Matrix Multiply
• Step 3 – Invoke the kernel in CUDA C

Matrix Multiply: Invoke Kernel
• One block with width-by-width threads
  [Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf]

Matrix Multiply
  One block of threads computes matrix Pd
    Each thread computes one element of Pd
  Each thread
    Loads a row of matrix Md
    Loads a column of matrix Nd
    Performs one multiply and one addition for each pair of Md and Nd elements
    Compute to off-chip memory access ratio is close to 1:1 (not very high)
  Size of the matrix is limited by the number of threads allowed in a thread block
  [Figure: Grid 1 / Block 1 with matrices Md, Nd, Pd of size WIDTH x WIDTH; thread (2, 2) highlighted]
  © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, ECE 498AL Spring 2010, University of Illinois, Urbana-Champaign
  Slide from: http://courses.engr.illinois.edu/ece498/al/lectures/lecture2%20cuda%20spring%2009.ppt

Matrix Multiply
• What is the major performance problem with our implementation?
• What is the major limitation?
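For reference when thinking about these questions: the slides show the kernel only as an image, but a first-cut kernel along the lines described above would look roughly like this. The names follow the Md/Nd/Pd convention of the slide; the exact code is my reconstruction, not the original.

  // One thread per output element; a single width-by-width block computes all of Pd.
  __global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int width)
  {
      int row = threadIdx.y;   // with one block, threadIdx alone identifies the element
      int col = threadIdx.x;

      float sum = 0;
      for (int k = 0; k < width; ++k)
          sum += Md[row * width + k] * Nd[k * width + col];   // dot product of row and column

      Pd[row * width + col] = sum;
  }

  // dim3 block(width, width);               // one block with width x width threads
  // MatrixMulKernel<<<1, block>>>(Md, Nd, Pd, width);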