CUDA Performance Considerations
Patrick Cozzi
University of Pennsylvania
CIS 565 - Spring 2011

Agenda
- Data Prefetching
- Loop Unrolling
- Thread Granularity

Data Prefetching

Independent instructions between a global memory read and its use can hide memory latency:

  float m = Md[i];          // Read global memory
  float f = a * b + c * d;  // Execute instructions that are not dependent on the memory read
  float f2 = m * f;         // Use of m; if the line above executes in enough warps, it hides the memory latency

Prefetching data from global memory can effectively increase the number of independent instructions between a global memory read and its use.

Recall tiled matrix multiply:

  for (/* ... */)
  {
    // Load current tile into shared memory
    __syncthreads();
    // Accumulate dot product
    __syncthreads();
  }

Tiled matrix multiply with prefetch:

  // Load first tile into registers
  for (/* ... */)
  {
    // Deposit registers into shared memory
    __syncthreads();
    // Load next tile into registers: a prefetch for the next iteration of the loop
    // Accumulate dot product: executed by enough warps, these instructions
    // hide the memory latency of the prefetch
    __syncthreads();
  }
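The pattern above can be sketched as a complete kernel. This is an illustrative sketch, not code from the slides: it assumes square Width x Width row-major matrices, Width divisible by TILE_WIDTH, the lecture's Md/Nd/Pd naming, and a TILE_WIDTH x TILE_WIDTH thread block; the kernel name is hypothetical.

```cuda
#define TILE_WIDTH 16

// Tiled matrix multiply with prefetching: Pd = Md * Nd.
// Launch with blockDim = (TILE_WIDTH, TILE_WIDTH).
__global__ void MatrixMulPrefetch(const float* Md, const float* Nd,
                                  float* Pd, int Width)
{
    __shared__ float Ms[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Ns[TILE_WIDTH][TILE_WIDTH];

    int tx = threadIdx.x, ty = threadIdx.y;
    int row = blockIdx.y * TILE_WIDTH + ty;
    int col = blockIdx.x * TILE_WIDTH + tx;

    // Load first tile into registers
    float mReg = Md[row * Width + tx];
    float nReg = Nd[ty * Width + col];

    float Pvalue = 0.0f;
    int numTiles = Width / TILE_WIDTH;
    for (int m = 0; m < numTiles; ++m)
    {
        // Deposit registers into shared memory
        Ms[ty][tx] = mReg;
        Ns[ty][tx] = nReg;
        __syncthreads();

        // Load next tile into registers (prefetch). These loads are
        // independent of the dot product below, so with enough warps
        // resident, the loop below hides their latency.
        if (m + 1 < numTiles)
        {
            mReg = Md[row * Width + (m + 1) * TILE_WIDTH + tx];
            nReg = Nd[((m + 1) * TILE_WIDTH + ty) * Width + col];
        }

        // Accumulate dot product
        for (int k = 0; k < TILE_WIDTH; ++k)
            Pvalue += Ms[ty][k] * Ns[k][tx];
        __syncthreads();
    }
    Pd[row * Width + col] = Pvalue;
}
```

Note the cost paid for the overlap: two extra registers per thread (mReg, nReg), which can reduce occupancy.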
Data Prefetching Cost
- Added complexity
- More registers: what does this imply?

Loop Unrolling

  for (int k = 0; k < BLOCK_SIZE; ++k)
  {
    Pvalue += Ms[ty][k] * Ns[k][tx];
  }

Instructions per iteration:
- One floating-point multiply
- One floating-point add
- What else?

Other instructions per iteration:
- Loop counter update
- Branch
- Address arithmetic

Instruction mix:
- 2 floating-point arithmetic instructions
- 1 loop branch instruction
- 2 address arithmetic instructions
- 1 loop counter increment instruction

Only 1/3 of the instructions are floating-point calculations, but I want my full theoretical 346.5 GFLOPS (G80). Consider loop unrolling:

  Pvalue += Ms[ty][0] * Ns[0][tx] +
            Ms[ty][1] * Ns[1][tx] +
            ...
            Ms[ty][15] * Ns[15][tx]; // BLOCK_SIZE = 16

- No more loop
- No loop counter update
- No branch
- Constant indices: no address arithmetic instructions

Thread Granularity

How much work should one thread do?
- Parallel reduction: reduce two elements?
- Matrix multiply: compute one element of Pd?
Thread Granularity: Matrix Multiply

(Image from http://courses.engr.illinois.edu/ece498/al/textbook/Chapter5-CudaPerformance.pdf)

Both Pd elements require the same row of Md.

Compute both Pd elements in the same thread:
- Reduces global memory access by 1/4
- Increases the number of independent instructions
- What is the benefit?
- The new kernel uses more registers and shared memory: what does that imply?

Matrix Multiply

What improves performance: prefetching, loop unrolling, or coarser thread granularity? And for what inputs?

8x8 tiles:
- Coarser thread granularity helps
- Prefetching doesn't
- Loop unrolling doesn't

16x16 tiles:
- Coarser thread granularity helps
- Full loop unrolling can help
- Prefetching helps for 1x1 tiling

Floating-Point Considerations

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, University of Illinois, Urbana-Champaign

What is IEEE floating-point format?

A floating-point binary number consists of three parts: sign (S), exponent (E), and mantissa (M). Each (S, E, M) pattern uniquely identifies a floating-point number. For each bit pattern, its IEEE floating-point value is derived as:

  value = (-1)^S * M * 2^E, where 1.0B ≤ M < 10.0B (i.e., 1.0 ≤ M < 2.0 in decimal)

The interpretation of S is simple: S = 0 gives a positive number and S = 1 a negative number.
IEEE 754 Format

Single precision: 1-bit sign, 8-bit exponent (bias 127), 23-bit fraction.
Double precision: 1-bit sign, 11-bit exponent (bias 1023), 52-bit fraction.

(Worked examples from http://kipirvine.com/asm/workbook/floating_tut.htm)

The Mantissa

Take -3.154 x 10^5 as an example: the sign is negative, the mantissa is 3.154, and the exponent is 5. The fractional portion of the mantissa is the sum of each digit multiplied by a power of 10:

  .154 = 1/10 + 5/100 + 4/1000

A binary floating-point number is similar. For example, in the number +11.1011 x 2^3, the sign is positive, the mantissa is 11.1011, and the exponent is 3. The fractional portion of the mantissa is the sum of successive powers of 2:

  .1011 = 1/2 + 0/4 + 1/8 + 1/16 = 0.6875D

Combined with the integer part 11B, the decimal value of the number is 3.6875.

Normalizing the Mantissa

Before a floating-point binary number can be stored correctly, its mantissa must be normalized. The process is basically the same as when normalizing a floating-point decimal number. For example, decimal 1234.567 is normalized as 1.234567 x 10^3 by moving the decimal point so that only one digit appears before it.

The Exponent

Single-precision exponents are stored as 8-bit unsigned integers with a bias of 127 (that is, 2^(n-1) - 1 for n = 8). An example: for 1.101 x 2^5, the exponent (5) is added to 127 and the sum (132) is stored as binary 10000100.

Creating the IEEE Bit Representation

1.101 x 2^0 is stored as sign = 0 (positive), mantissa = 101, and exponent = 01111111 (the exponent value 0 added to the bias 127). The "1" to the left of the binary point is dropped from the mantissa.
Arithmetic Instruction Throughput

int and float add, shift, min, max and float mul, mad: 4 cycles per warp.
int multiply (*) is by default 32-bit and requires multiple cycles per warp; use the __mul24()/__umul24() intrinsics for a 4-cycle, 24-bit int multiply.
Integer divide and modulo are expensive:
- The compiler will convert literal power-of-2 divides to shifts.
- Be explicit in cases where the compiler can't tell that the divisor is a power of 2!
- Useful trick: foo % n == foo & (n - 1) if n is a power of 2.

Reciprocal, reciprocal square root, sin/cos, log, exp: 16 cycles per warp.
- These are the versions prefixed with "__", e.g. __rcp(), __sin(), __exp().
Other functions are combinations of the above:
- y / x == rcp(x) * y: 20 cycles per warp
- sqrt(x) == rcp(rsqrt(x)): 32 cycles per warp

Runtime Math Library

There are two types of runtime math operations:
- __func(): maps directly to the hardware ISA. Fast but lower accuracy (see the programming guide for details). Examples: __sin(x), __exp(x), __pow(x,y).
- func(): compiles to multiple instructions. Slower but higher accuracy (5 ulp, units in the last place, or less). Examples: sin(x), exp(x), pow(x,y).

The -use_fast_math compiler option forces every func() to compile to __func().

Make your program float-safe!
Future hardware will have double-precision support:
- G80 is single-precision only.
- Double precision will have an additional performance cost.
- Careless use of double or undeclared types may run more slowly on G80+.

It is important to be float-safe (be explicit whenever you want single precision) to avoid using double precision where it is not needed.

Add the 'f' suffix to float literals:

  foo = bar * 0.123;  // double assumed
  foo = bar * 0.123f; // float explicit

Use the float versions of standard library functions:

  foo = sin(bar);  // double assumed
  foo = sinf(bar); // single precision explicit

Deviations from IEEE-754

- Addition and multiplication are IEEE 754 compliant: maximum 0.5 ulp (units in the last place) error.
- However, they are often combined into a multiply-add (FMAD), whose intermediate result is truncated.
- Division is non-compliant (2 ulp).
- Not all rounding modes are supported.
- Denormalized numbers are not supported.
- There is no mechanism to detect floating-point exceptions.

GPU Floating Point Features

  Feature                           | G80                          | SSE                                    | IBM Altivec                | Cell SPE
  Precision                         | IEEE 754                     | IEEE 754                               | IEEE 754                   | IEEE 754
  Rounding modes for FADD and FMUL  | Round to nearest and to zero | All 4 IEEE: nearest, zero, +inf, -inf  | Round to nearest only      | Round to zero/truncate only
  Denormal handling                 | Flush to zero                | Supported, 1000s of cycles             | Supported, 1000s of cycles | Flush to zero
  NaN support                       | Yes                          | Yes                                    | Yes                        | No
  Overflow and infinity support     | Yes, only clamps to max norm | Yes                                    | Yes                        | No, infinity
  Flags                             | No                           | Yes                                    | Yes                        | Some
  Square root                       | Software only                | Hardware                               | Software only              | Software only
  Division                          | Software only                | Hardware                               | Software only              | Software only
  Reciprocal estimate accuracy      | 24 bit                       | 12 bit                                 | 12 bit                     | 12 bit
  Reciprocal sqrt estimate accuracy | 23 bit                       | 12 bit                                 | 12 bit                     | 12 bit
  log2(x) and 2^x estimate accuracy | 23 bit                       | No                                     | 12 bit                     | No