CUDA Performance
Considerations
Patrick Cozzi
University of Pennsylvania
CIS 565 - Spring 2011
Agenda
Data Prefetching
 Loop Unrolling
 Thread Granularity

Data Prefetching

Independent instructions between a global
memory read and its use can hide memory
latency
float m = Md[i];
float f = a * b + c * d;
float f2 = m * f;
Data Prefetching

Independent instructions between a global
memory read and its use can hide memory
latency
float m = Md[i];           // read from global memory
float f = a * b + c * d;
float f2 = m * f;
Data Prefetching

Independent instructions between a global
memory read and its use can hide memory
latency
float m = Md[i];
float f = a * b + c * d;   // execute instructions that are not dependent on the memory read
float f2 = m * f;
Data Prefetching

Independent instructions between a global
memory read and its use can hide memory
latency
float m = Md[i];
float f = a * b + c * d;
float f2 = m * f;          // use the global memory value; the line above, executed by enough warps, hides the memory latency
Data Prefetching

Prefetching data from global memory can effectively increase the number of
independent instructions between the global memory read and its use
Data Prefetching

Recall tiled matrix multiply:
for (/* ... */)
{
// Load current tile into shared memory
__syncthreads();
// Accumulate dot product
__syncthreads();
}
Data Prefetching

Tiled matrix multiply with prefetch:
// Load first tile into registers
for (/* ... */)
{
// Deposit registers into shared memory
__syncthreads();
// Load next tile into registers
// Accumulate dot product
__syncthreads();
}
Data Prefetching

Tiled matrix multiply with prefetch:
// Load first tile into registers
for (/* ... */)
{
// Deposit registers into shared memory
__syncthreads();
// Load next tile into registers (prefetch for the next iteration of the loop)
// Accumulate dot product
__syncthreads();
}
Data Prefetching

Tiled matrix multiply with prefetch:
// Load first tile into registers
for (/* ... */)
{
// Deposit registers into shared memory
__syncthreads();
// Load next tile into registers
// Accumulate dot product (these instructions,
// executed by enough warps, hide the memory latency of the prefetch)
__syncthreads();
}
Data Prefetching

Cost
 Added complexity
 More registers – what does this imply?
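A minimal CUDA sketch of the prefetch pattern above, assuming square TILE_SIZE x TILE_SIZE tiles, row-major Md/Nd/Pd, and a Width that is a multiple of TILE_SIZE; the kernel name and exact indexing are illustrative rather than the course's reference kernel:

#define TILE_SIZE 16

// Illustrative sketch: tiled matrix multiply with prefetching.
__global__ void MatrixMulPrefetch(const float* Md, const float* Nd,
                                  float* Pd, int Width)
{
    __shared__ float Ms[TILE_SIZE][TILE_SIZE];
    __shared__ float Ns[TILE_SIZE][TILE_SIZE];

    int tx = threadIdx.x, ty = threadIdx.y;
    int row = blockIdx.y * TILE_SIZE + ty;
    int col = blockIdx.x * TILE_SIZE + tx;

    // Load the first tile into registers
    float mReg = Md[row * Width + tx];
    float nReg = Nd[ty * Width + col];

    float Pvalue = 0.0f;
    int numTiles = Width / TILE_SIZE;

    for (int t = 0; t < numTiles; ++t)
    {
        // Deposit the prefetched registers into shared memory
        Ms[ty][tx] = mReg;
        Ns[ty][tx] = nReg;
        __syncthreads();

        // Prefetch the next tile into registers; with enough resident
        // warps, this load is hidden behind the dot-product work below
        if (t + 1 < numTiles)
        {
            mReg = Md[row * Width + (t + 1) * TILE_SIZE + tx];
            nReg = Nd[((t + 1) * TILE_SIZE + ty) * Width + col];
        }

        // Accumulate the dot product for this tile
        for (int k = 0; k < TILE_SIZE; ++k)
            Pvalue += Ms[ty][k] * Ns[k][tx];

        __syncthreads();
    }

    Pd[row * Width + col] = Pvalue;
}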
Loop Unrolling
for (int k = 0; k < BLOCK_SIZE; ++k)
{
Pvalue += Ms[ty][k] * Ns[k][tx];
}

Instructions per iteration
 One floating-point multiply
 One floating-point add
 What else?
Loop Unrolling
for (int k = 0; k < BLOCK_SIZE; ++k)
{
Pvalue += Ms[ty][k] * Ns[k][tx];
}

Other instructions per iteration
 Update loop counter
Loop Unrolling
for (int k = 0; k < BLOCK_SIZE; ++k)
{
Pvalue += Ms[ty][k] * Ns[k][tx];
}

Other instructions per iteration
 Update loop counter
 Branch
Loop Unrolling
for (int k = 0; k < BLOCK_SIZE; ++k)
{
Pvalue += Ms[ty][k] * Ns[k][tx];
}

Other instructions per iteration
 Update loop counter
 Branch
 Address arithmetic
Loop Unrolling
for (int k = 0; k < BLOCK_SIZE; ++k)
{
Pvalue += Ms[ty][k] * Ns[k][tx];
}

Instruction Mix
 2 floating-point arithmetic instructions
 1 loop branch instruction
 2 address arithmetic instructions
 1 loop counter increment instruction
Loop Unrolling

Only 1/3 are floating-point calculations


But I want my full theoretical 346.5 GFLOPs (G80)
Consider loop unrolling
Loop Unrolling
Pvalue +=
Ms[ty][0] * Ns[0][tx] +
Ms[ty][1] * Ns[1][tx] +
...
Ms[ty][15] * Ns[15][tx]; // BLOCK_SIZE = 16

No more loop
 No loop count update
 No branch
 Constant indices – no address arithmetic instructions
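In practice the same effect can also be requested from the compiler; a minimal sketch using #pragma unroll, assuming BLOCK_SIZE is a compile-time constant (the helper function is illustrative, not the course kernel):

#define BLOCK_SIZE 16

// With a compile-time trip count, #pragma unroll asks nvcc to replicate
// the loop body, removing the counter update, the branch, and the
// variable indexing of k.
__device__ float DotProduct(float Ms[BLOCK_SIZE][BLOCK_SIZE],
                            float Ns[BLOCK_SIZE][BLOCK_SIZE],
                            int tx, int ty)
{
    float Pvalue = 0.0f;
    #pragma unroll
    for (int k = 0; k < BLOCK_SIZE; ++k)
    {
        Pvalue += Ms[ty][k] * Ns[k][tx];  // k is a constant in each unrolled copy
    }
    return Pvalue;
}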
Thread Granularity

How much work should one thread do?
 Parallel reduction
  Reduce two elements?
 Matrix multiply
  Compute one element of Pd?
Thread Granularity

Matrix Multiply
Image from http://courses.engr.illinois.edu/ece498/al/textbook/Chapter5-CudaPerformance.pdf
Thread Granularity

Matrix Multiply
 Both elements of Pd require the same row of Md
Image from http://courses.engr.illinois.edu/ece498/al/textbook/Chapter5-CudaPerformance.pdf
Thread Granularity

Matrix Multiply
 Compute both Pd elements in the same thread
  Reduces global memory access by ¼
  Increases number of independent instructions
   What is the benefit?
  New kernel uses more registers and shared memory
   What does that imply?
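A sketch of the coarser-granularity idea, assuming each thread computes two horizontally adjacent Pd elements so that one Md tile in shared memory is reused for both; the names, tile layout, and the assumption that Width is a multiple of 2 * TILE_SIZE are illustrative:

#define TILE_SIZE 16

// Illustrative sketch: each thread block covers two adjacent tiles of Pd,
// so the Md tile is loaded into shared memory once and used for both outputs.
__global__ void MatrixMulCoarse(const float* Md, const float* Nd,
                                float* Pd, int Width)
{
    __shared__ float Ms[TILE_SIZE][TILE_SIZE];
    __shared__ float Ns0[TILE_SIZE][TILE_SIZE];
    __shared__ float Ns1[TILE_SIZE][TILE_SIZE];

    int tx = threadIdx.x, ty = threadIdx.y;
    int row  = blockIdx.y * TILE_SIZE + ty;
    int col0 = (blockIdx.x * 2)     * TILE_SIZE + tx;  // first output column
    int col1 = (blockIdx.x * 2 + 1) * TILE_SIZE + tx;  // second output column

    float Pvalue0 = 0.0f;
    float Pvalue1 = 0.0f;

    for (int t = 0; t < Width / TILE_SIZE; ++t)
    {
        // One Md tile load serves both output elements
        Ms[ty][tx]  = Md[row * Width + t * TILE_SIZE + tx];
        Ns0[ty][tx] = Nd[(t * TILE_SIZE + ty) * Width + col0];
        Ns1[ty][tx] = Nd[(t * TILE_SIZE + ty) * Width + col1];
        __syncthreads();

        for (int k = 0; k < TILE_SIZE; ++k)
        {
            Pvalue0 += Ms[ty][k] * Ns0[k][tx];
            Pvalue1 += Ms[ty][k] * Ns1[k][tx];
        }
        __syncthreads();
    }

    Pd[row * Width + col0] = Pvalue0;
    Pd[row * Width + col1] = Pvalue1;
}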
Matrix Multiply

What improves performance?
 Prefetching?
 Loop unrolling?
 Thread granularity?
 For what inputs?
Matrix Multiply
Image from http://courses.engr.illinois.edu/ece498/al/textbook/Chapter5-CudaPerformance.pdf
Matrix Multiply
8x8 Tiles
• Coarser thread granularity helps
• Prefetching doesn’t
• Loop unrolling doesn’t
Image from http://courses.engr.illinois.edu/ece498/al/textbook/Chapter5-CudaPerformance.pdf
Matrix Multiply
16x16 Tiles
• Coarser thread granularity helps
Image from http://courses.engr.illinois.edu/ece498/al/textbook/Chapter5-CudaPerformance.pdf
Matrix Multiply
16x16 Tiles
• Full loop unrolling can help
Image from http://courses.engr.illinois.edu/ece498/al/textbook/Chapter5-CudaPerformance.pdf
Matrix Multiply
16x16 Tiles
• Prefetch helps for 1x1 tiling
Image from http://courses.engr.illinois.edu/ece498/al/textbook/Chapter5-CudaPerformance.pdf
Floating-Point Considerations
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009
University of Illinois, Urbana-Champaign
What is IEEE floating-point format?

A floating point binary number consists of three parts: sign (S), exponent (E), and mantissa (M).
 Each (S, E, M) pattern uniquely identifies a floating point number.

For each bit pattern, its IEEE floating-point value is derived as:
 value = (-1)^S * M * 2^E, where 1.0B ≤ M < 10.0B

The interpretation of S is simple: S=0 results in a positive number and S=1 a negative number.
IEEE 754 Format
 Single precision: 1-bit sign, 8-bit exponent (bias 127), 23-bit fraction
 Double precision: 1-bit sign, 11-bit exponent (bias 1023), 52-bit fraction
http://kipirvine.com/asm/workbook/floating_tut.htm
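Putting the single-precision fields together: a normalized number decodes as follows (standard IEEE 754 behavior, added here for reference rather than taken from the slide), where M is the 23-bit fraction and E is the stored, biased exponent:

value = (-1)^S * 1.M * 2^(E - 127)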
Mantissa
 Using -3.154 x 10^5 as an example, the sign is negative, the mantissa is 3.154, and the exponent is 5.
 The fractional portion of the mantissa is the sum of each digit multiplied by a power of 10:
  .154 = 1/10 + 5/100 + 4/1000
 A binary floating-point number is similar. For example, in the number +11.1011 x 2^3, the sign is positive, the mantissa is 11.1011, and the exponent is 3.
 The fractional portion of the mantissa is the sum of successive powers of 2. In our example, it is expressed as:
  .1011 = 1/2 + 0/4 + 1/8 + 1/16 = 0.6875D
 Combined with the left-hand side of 11.1011, the decimal value of the number is 3.6875.
http://kipirvine.com/asm/workbook/floating_tut.htm
Normalizing the Mantissa
Before a floating-point binary number can be stored correctly, its mantissa
must be normalized. The process is basically the same as when normalizing
a floating-point decimal number.
For example, decimal 1234.567 is normalized as 1.234567 x 10^3 by
moving the decimal point so that only one digit appears before the
decimal point.
http://kipirvine.com/asm/workbook/floating_tut.htm
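As a small worked example (not from the original slide), a binary value normalizes the same way: only one nonzero digit remains to the left of the binary point, and the shift count becomes the exponent:

1101.101B = 1.101101B x 2^3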
The Exponent

8-bit unsigned integer with a bias of 127.

An example: 1.101 x 2^5. The exponent (5) is added to 127 (2^(n-1) - 1) and the sum (132) is binary 10000100.
http://kipirvine.com/asm/workbook/floating_tut.htm
Creating the IEEE Bit
Representation

1.101 x 2^0 is stored as sign = 0 (positive), exponent = 01111111 (the exponent value 0 is added to 127), and mantissa = 101; the leading "1" to the left of the binary point is dropped from the mantissa.
http://kipirvine.com/asm/workbook/floating_tut.htm
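A short host-side C sketch (not from the slides) that unpacks the sign, exponent, and mantissa fields of a float, useful for checking encodings like the one above; the variable names are illustrative:

#include <stdio.h>
#include <string.h>
#include <stdint.h>

int main(void)
{
    float f = 1.625f;               /* 1.101 (binary) x 2^0 */
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits); /* reinterpret the float's bit pattern */

    unsigned sign     = bits >> 31;
    unsigned exponent = (bits >> 23) & 0xFFu;   /* stored with bias 127 */
    unsigned mantissa = bits & 0x7FFFFFu;       /* leading 1 is implicit */

    printf("sign=%u exponent=%u (unbiased %d) mantissa=0x%06X\n",
           sign, exponent, (int)exponent - 127, mantissa);
    /* prints: sign=0 exponent=127 (unbiased 0) mantissa=0x500000 */
    return 0;
}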
Arithmetic Instruction Throughput

 int and float add, shift, min, max and float mul, mad: 4 cycles per warp

 int multiply (*) is by default 32-bit
  Requires multiple cycles / warp
  Use __mul24() / __umul24() intrinsics for 4-cycle 24-bit int multiply

 Integer divide and modulo are expensive
  Compiler will convert literal power-of-2 divides to shifts
  Be explicit in cases where compiler can't tell that divisor is a power of 2!
  Useful trick: foo % n == foo & (n-1) if n is a power of 2
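A short device-side sketch of the two tricks above; __mul24() and __umul24() are CUDA intrinsics, while the kernel name and parameters here are hypothetical:

// Illustrative only: assumes n is a power of 2 and the launch covers len threads.
__global__ void IndexingTricks(const int* in, int* out, int len, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= len) return;

    // 24-bit integer multiply intrinsic (faster than 32-bit * on G80-class hardware)
    int scaled = __mul24(tid, 3);

    // foo % n rewritten as foo & (n - 1); valid only when n is a power of 2
    int wrapped = scaled & (n - 1);

    out[tid] = in[wrapped];
}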
Arithmetic Instruction Throughput

 Reciprocal, reciprocal square root, sin/cos, log, exp: 16 cycles per warp
  These are the intrinsic versions prefixed with "__", e.g. __sinf(), __expf()

 Other functions are combinations of the above
  y / x == rcp(x) * y == 20 cycles per warp
  sqrt(x) == rcp(rsqrt(x)) == 32 cycles per warp
Runtime Math Library

 There are two types of runtime math operations
  __func(): direct mapping to hardware ISA
   Fast but low accuracy (see the programming guide for details)
   Examples: __sinf(x), __expf(x), __powf(x,y)
  func(): compile to multiple instructions
   Slower but higher accuracy (5 ulp, units in the last place, or less)
   Examples: sin(x), exp(x), pow(x,y)

 The -use_fast_math compiler option forces every func() to compile to __func()
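A small sketch contrasting the two flavors; sinf() and the __sinf() intrinsic are both part of the CUDA math library, and the kernel itself is only illustrative:

__global__ void SineBothWays(const float* x, float* accurate, float* fast, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
    {
        accurate[i] = sinf(x[i]);    // compiles to multiple instructions, higher accuracy
        fast[i]     = __sinf(x[i]);  // hardware special-function unit, lower accuracy
    }
}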
Make your program float-safe!

 Future hardware will have double precision support
  G80 is single-precision only
  Double precision will have additional performance cost
  Careless use of double or undeclared types may run more slowly on G80+
  Important to be float-safe (be explicit whenever you want single precision) to avoid using double precision where it is not needed

 Add 'f' specifier on float literals:
  foo = bar * 0.123;  // double assumed
  foo = bar * 0.123f; // float explicit

 Use float version of standard library functions
  foo = sin(bar);  // double assumed
  foo = sinf(bar); // single precision explicit
Deviations from IEEE-754

 Addition and multiplication are IEEE 754 compliant
  Maximum 0.5 ulp (units in the last place) error
 However, they are often combined into multiply-add (FMAD)
  Intermediate result is truncated
 Division is non-compliant (2 ulp)
 Not all rounding modes are supported
 Denormalized numbers are not supported
 No mechanism to detect floating-point exceptions
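When the separately rounded result matters, CUDA's __fadd_rn() and __fmul_rn() intrinsics are documented not to be contracted into FMAD; a minimal sketch (function name and inputs are illustrative):

__device__ float CompareFmad(float a, float b, float c, float d)
{
    // Left to the compiler, a * b + c * d may be contracted into FMAD,
    // which truncates the intermediate product.
    float fused = a * b + c * d;

    // __fmul_rn / __fadd_rn are never contracted, giving separately
    // rounded IEEE single-precision multiply and add.
    float separate = __fadd_rn(__fmul_rn(a, b), __fmul_rn(c, d));

    return fused - separate;  // often 0, but can differ in the last place
}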
GPU Floating Point Features
(columns: G80 | SSE | IBM Altivec | Cell SPE)

Precision:                          IEEE 754 | IEEE 754 | IEEE 754 | IEEE 754
Rounding modes for FADD and FMUL:   Round to nearest and round to zero | All 4 IEEE (round to nearest, zero, inf, -inf) | Round to nearest only | Round to zero/truncate only
Denormal handling:                  Flush to zero | Supported, 1000's of cycles | Supported, 1000's of cycles | Flush to zero
NaN support:                        Yes | Yes | Yes | No
Overflow and Infinity support:      Yes, only clamps to max norm | Yes | Yes | No, infinity
Flags:                              No | Yes | Yes | Some
Square root:                        Software only | Hardware | Software only | Software only
Division:                           Software only | Hardware | Software only | Software only
Reciprocal estimate accuracy:       24 bit | 12 bit | 12 bit | 12 bit
Reciprocal sqrt estimate accuracy:  23 bit | 12 bit | 12 bit | 12 bit
log2(x) and 2^x estimates accuracy: 23 bit | No | 12 bit | No