MMM Optimizations
● MMM – matrix-matrix multiplication
● Reference: How to Write Fast Numerical Code: A Small Introduction
  ● Srinivas Chellappa, Franz Franchetti, and Markus Püschel – Carnegie Mellon University
Parallelism
● Task of achieving the highest performance for an implementation lies to a great extent with the programmer
● Platform dependent
  ● Needs to be repeated for every new hardware release
Automatic Performance Tuning
● Active research area
● Adaptive libraries
  ● Strategy determined at runtime from the platform's memory hierarchy
● Source code generators
  ● Produce algorithm implementations from scratch
  ● Generate many different variants of an algorithm & select the fastest one
  ● Example: ATLAS – Automatically Tuned Linear Algebra Software
Reuse
● Measures how often a given input value is used in a computation during the algorithm
● High degree of reuse – algorithm may perform better with the memory hierarchy
  ● Number of computations dominates data transfers from memory to CPU
  ● Said to be CPU bound
● Low degree of reuse
  ● Number of data transfers from memory to CPU is high compared to the number of computations
  ● Said to be memory bound
Matrix-Matrix Multiplication
● Generally consider C = C + AB (a naive kernel is sketched after this list)
● If A & B are n×n matrices
  ● n³ multiplications
  ● n³ additions
  ● O(n³) floating point operations
● Reuse: O(n³/n²) = O(n)
● Better implementations exist
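
A minimal sketch of the naive triple-loop kernel behind these counts (the row-major layout and names are assumptions for illustration):

// Naive MMM: C = C + A*B for n x n row-major matrices.
// Each of the n^2 entries of C accumulates n products,
// giving n^3 multiplications and n^3 additions in total.
void mmm(int n, const double *A, const double *B, double *C)
{ for(int i=0; i<n; i++)
    for(int j=0; j<n; j++)
      for(int k=0; k<n; k++)
        C[i*n+j] += A[i*n+k] * B[k*n+j];
}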
Memory Hierarchy
Locality
● Temporal memory locality
  ● A memory location that is referenced will likely be referenced again in the future
● Spatial memory locality
  ● Likelihood of referencing a memory location is higher if a nearby location was recently referenced
● High performance software needs to take advantage of locality
Registers
● Register spill – register contents written to lower levels of memory – expensive
● Compiler optimizations
Cache Memory
● Cache miss
  ● Data not in cache – fetched from memory
  ● Need to minimize the number of cache misses
● Divided into cache lines (blocks)
  ● Data moved in & out of cache in chunks of the cache line size
  ● Take advantage of spatial locality
Design Principles
● Once data is brought into cache – should be reused as much as possible before it is evicted
● Programs need to be designed to perform computations on neighboring data before the cache line is evicted (see the traversal sketch below)
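
As an illustration of this principle (a sketch, assuming a row-major n×n matrix of doubles): the row-wise loop visits the elements of each cache line consecutively, while the column-wise loop jumps n doubles per access.

// Row-wise traversal: unit stride, every loaded cache line fully used.
double sum_rowwise(int n, const double *A)
{ double s = 0;
  for(int i=0; i<n; i++)
    for(int j=0; j<n; j++)
      s += A[i*n+j];
  return s;
}

// Column-wise traversal: stride of n doubles, so a new cache line
// may be touched (and later evicted) on every access.
double sum_colwise(int n, const double *A)
{ double s = 0;
  for(int j=0; j<n; j++)
    for(int i=0; i<n; i++)
      s += A[i*n+j];
  return s;
}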
Cache Analysis
● Consider a simple direct mapped 16 byte data cache with 2 cache lines – each of size 8 bytes
  ● 2 floats per line
  ● Assume data is cache-aligned (data in question starts at the beginning of a cache line)
● Consider the following code fragment
float X[8];
for(int j=0; j<2; j++)
  for(int i=0; i<8; i++)
    access(X[i]);
● 8 cache hits & 8 cache misses
Cache Analysis
● Stride of 2
float X[8];
for(int j=0; j<2; j++)
{ for(int i=0; i<7; i+=2)
    access(X[i]);
  for(int i=1; i<8; i+=2)
    access(X[i]);
}
● 0 hits & 16 misses
Cache Analysis
● Different code fragment – same logical result
float X[8];
for(int i=0; i<2; i++)
  for(int k=0; k<2; k++)
    for(int j=0; j<4; j++)
      access(X[j+(i*4)]);
● 12 hits & 4 misses
CPU Features
● Most processors contain pipelined superscalar out-of-order cores with multiple execution units
  ● Pipelining – different parts of a processor work simultaneously on different components of different instructions
  ● Superscalar cores – can retire more than 1 instruction per processor clock cycle
  ● Out-of-order – reschedule instruction sequence around dependencies
Peak Performance
● Theoretical peak performance – theoretical rate at which a processor can perform floating point operations
  ● Measured in FLOPS
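● Hypothetical worked example: a 4-core processor at 3 GHz whose cores each complete 16 floating point operations per cycle peaks at 4 × 3×10⁹ × 16 = 192 GFLOPS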
Using Compilers
● Use appropriate compiler flags & language extensions, and monitor & analyze the compiler's output
Variable Declarations
● C assigns variables to different storage classes by default
● Overridden by a storage class specifier
  ● extern – shared by different source files
  ● static – exists as long as the program executes
  ● auto – allocated on the stack
  ● register – compiler allocates space directly in registers
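
A minimal sketch of the four specifiers in one translation unit (identifiers are illustrative):

extern double shared_buf[64];   // defined in another source file
static long call_count;         // exists as long as the program runs

double kernel(void)
{ auto int i;                   // on the stack (auto is the default)
  register double acc = 0.0;    // request a register; compiler may ignore
  for(i=0; i<64; i++)
    acc += shared_buf[i];
  call_count++;                 // persists across calls
  return acc;
}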
Variable Declarations
● Qualifiers – specify variable attributes
  ● const – constant
  ● volatile – values may be influenced by sources external to the compiler's knowledge
  ● restrict – tells the compiler that the memory will only be accessed through a specific pointer (sketched below)
    ● No aliasing
● Memory alignment – request that variables be aligned to cache line boundaries or virtual memory pages
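
A sketch of restrict on a simple kernel (the function and names are illustrative): the qualifier promises that dst and src never alias, so the compiler may reorder and vectorize the loop freely.

// C99 restrict: each pointer is the only access path to its data.
void scale(int n, double * restrict dst, const double * restrict src, double a)
{ for(int i=0; i<n; i++)
    dst[i] = a * src[i];        // no aliasing, safe to vectorize
}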
Inline Assemblies
● Can include assembly code
● Language can provide intrinsics – ability to access special machine instructions
Compiler Flags
● C standards – version of C
● Architecture specifications
● Optimization levels
● Specialized compiler options
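
One plausible GCC invocation combining all four kinds of options (the flag choices are illustrative, not prescriptive):

gcc -std=c11 -march=native -O3 -funroll-loops -o kernel kernel.c

● -std=c11 – C standard
● -march=native – architecture specification
● -O3 – optimization level
● -funroll-loops – specialized compiler option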
Compiler Output
● Optimization reports – inform what optimizations are possible or why certain attempted optimizations did not work
Performance Optimization
● Finding hotspots (most frequently executed code regions)
● Timing hotspots
● Analyzing measured runtimes
Finding the Hotspots
● Profiling tools
  ● GNU gprof
  ● Intel VTune
  ● Valgrind
Gprof Example
#include <stdio.h>

float function1()
{ int i; float retval=0;
  for(i=1; i<1000000; i++)
    retval += (1.0f/i);
  return(retval);
}

float function2()
{ int i; float retval=0;
  for(i=1; i<10000000; i++)
    retval += (1.0f/(i+1));
  return(retval);
}

void function3() { return; }

int main()
{ printf("Result: %.2f\n", function1());
  printf("Result: %.2f\n", function2());
  if (1==2) function3();
  return(0);
}
Gprof Example
● Compile & link:
  ● gcc -O0 -lm -g -pg -o ourProgram ourProgram.c
● Execute:
  ● ./ourProgram
● Run gprof:
  ● gprof ourProgram gmon.out > profile.txt
Timing a Hotspot
● Read current time
● Execute hotspot – a number of times
● Read current time
● Calculate time per iteration (see the sketch below)
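
A minimal POSIX sketch of this recipe (the clock choice and helper names are assumptions):

#include <time.h>

extern void hotspot(void);               // kernel under test

double time_hotspot(long iters)
{ struct timespec t0, t1;
  clock_gettime(CLOCK_MONOTONIC, &t0);   // read current time
  for(long i=0; i<iters; i++)
    hotspot();                           // execute hotspot many times
  clock_gettime(CLOCK_MONOTONIC, &t1);   // read current time again
  double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec)*1e-9;
  return secs / iters;                   // time per iteration
}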
Known Problems
● Too few iterations of the function to be timed are executed between the two time stamp readings, and the resulting timing is inaccurate due to poor timer resolution.
● Too many iterations are executed between the two time stamp readings, and the resulting timing is affected by system events.
● The machine is under load and the load has side effects on the measured program.
Known Problems
● Multiple timing jobs are executed concurrently, and they interfere with one another.
● Data alignment of input and output triggers cache problems.
● Virtual-to-physical memory translation makes timing irreproducible.
● The time stamp counter overflows and either triggers an interrupt or produces a meaningless value.
Known Problems
● Reading the timestamp counters requires hundred(s) of cycles, which itself affects the timing.
● The linking order of object files changes locality of static constants and this produces cache interference.
● The machine was not rebooted in a long time and the operating system state causes problems.
Known Problems
● The control flow in the numerical kernel being timed is data-dependent and the test data is not representative.
● The kernel is in-place (e.g., the input is a vector x and the output is written back to x), and the norm of the output is larger than the norm of the input. Repetitive application of the kernel leads to an exponential growth of the norm and finally triggers floating-point exceptions which interfere with the timing.
Known Problems
● The transform is timed with a zero vector, and the operating system is "smart," and responds to a request for a large zero-vector dynamic memory allocation by returning a special zero-valued copy-on-write virtual memory page. Read accesses to this "page" would be much faster than accesses to a page that is actually allocated, since this page is a special one maintained by the operating system for efficiency.
Analyzing Measured Runtime
● 2 basic questions:
  ● What is the limiting resource?
  ● How efficient is the implementation with respect to the limiting resource?
● Normalization
  ● Asymptotic or exact operations count
● Relative performance
  ● Comparing measured performance to theoretical peak performance
Optimization for Memory Hierarchy
● Performance conscious programming
● Optimizations for cache
● Optimizations for registers & CPU
● Parameter-based performance tuning
Performance-Conscious Programming
● Object oriented programming – avoided for performance critical parts of the program
  ● Late binding
● Languages not compiled to native machine code – avoided
● Use 1-dimensional arrays whenever possible
  ● Linearize higher-dimensional arrays (sketched below)
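
A sketch of linearization (names are illustrative): the 2-dimensional access A[i][j] becomes explicit index arithmetic on one contiguous allocation.

// 2-dimensional: double A[M][N];                          element A[i][j]
// linearized:    double *A = malloc(M*N*sizeof(double));  element A[i*N+j]
double get(const double *A, int N, int i, int j)
{ return A[i*N + j]; }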
Performance-Conscious Programming
● Avoid complicated struct & union data types
  ● Multiple arrays over an array of structs (sketched after this list)
● Dynamically generated data structures must be avoided if the algorithm can be implemented using arrays
● While loops & loops with complicated termination conditions must be avoided
  ● Use for loops with loop counters & loop bounds known at compile time
● Selection structures need to be avoided in hot spots & inner loops
● Macros instead of small functions
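
A sketch of the "multiple arrays over array of structs" rule (the particle fields are hypothetical): with separate arrays, a loop that reads only x streams through fully used cache lines.

// Array of structs: x, y, z interleaved – a loop over x alone
// wastes two thirds of every loaded cache line.
struct particle { double x, y, z; };

// Multiple arrays: each field contiguous in memory.
double xs[1024], ys[1024], zs[1024];

double sum_x(void)
{ double s = 0;
  for(int i=0; i<1024; i++)
    s += xs[i];                 // unit stride over one field
  return s;
}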
Cache Optimization
● Cache & TLB – goal is to reuse data as much as possible
● Optimization methods
  ● Blocking – working on chunks of data that fit into cache
  ● Loop merging – merging consecutive loops that traverse data
  ● Buffering – copying into contiguous temporary buffers
Blocking
● Perform computation in blocks that operate on a subset of input data – memory locality
● Tiling – split a loop into smaller loops (a blocked MMM is sketched below)
● Sometimes recursion
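
A sketch of blocking applied to MMM (the block size NB is an assumption to be tuned per platform, and n is taken to be divisible by NB): each NB×NB tile of A, B, and C is reused while it is resident in cache.

#define NB 64   // chosen so three NB x NB tiles fit in cache

// C = C + A*B for n x n row-major matrices, computed tile by tile.
void mmm_blocked(int n, const double *A, const double *B, double *C)
{ for(int i=0; i<n; i+=NB)
    for(int j=0; j<n; j+=NB)
      for(int k=0; k<n; k+=NB)
        // multiply one tile pair into the corresponding C tile
        for(int i1=i; i1<i+NB; i1++)
          for(int j1=j; j1<j+NB; j1++)
            for(int k1=k; k1<k+NB; k1++)
              C[i1*n+j1] += A[i1*n+k1] * B[k1*n+j1];
}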
Loop Merging
● For sequential loops – if all operations of the first loop do not need to be executed before the second loop – possible to merge into 1 loop (sketched below)
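
A sketch (the arrays and operations are illustrative): the second loop needs only x[i] from the same iteration, so the two traversals of x can be fused and x[i] reused while still in a register.

// Before: x is traversed twice.
for(int i=0; i<n; i++) x[i] = a[i] + b[i];
for(int i=0; i<n; i++) y[i] = 2.0 * x[i];

// After merging: one traversal of x.
for(int i=0; i<n; i++)
{ x[i] = a[i] + b[i];
  y[i] = 2.0 * x[i];
}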
Buffering
● Logically close data elements could be stored apart from each other
  ● Elements in a column of a matrix that is stored in row-major form
● Copy logically close data elements into a contiguous buffer (sketched below)
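
A sketch for the column example (names are illustrative): gather column j of a row-major matrix into a contiguous buffer, operate on the buffer, then scatter it back.

// Column j of row-major A has stride n; buf is unit stride.
void process_column(int n, double *A, int j, double *buf)
{ for(int i=0; i<n; i++)
    buf[i] = A[i*n + j];        // gather into contiguous buffer
  for(int i=0; i<n; i++)
    buf[i] *= 2.0;              // repeated passes now stay in cache
  for(int i=0; i<n; i++)
    A[i*n + j] = buf[i];        // scatter back
}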
CPU & Register Optimization
● Optimization goals for a modern CPU
  ● Have inner loops with adequately large loop bodies
  ● Have many independent operations inside an inner loop body
  ● Use automatic variables whenever possible
  ● Reuse loaded data elements to the extent possible
  ● Avoid math library function calls inside an inner loop
Blocking
● Partitions data into chunks on which computation can be performed within the register set
for(i=0; i<8; i++)
{ y[2*i] = x[2*i] + x[2*i+1];
  y[2*i+1] = x[2*i] - x[2*i+1];
}
● Blocked into
for(i1=0; i1<4; i1++)
  for(i2=0; i2<2; i2++)
  { y[4*i1+2*i2] = x[4*i1+2*i2] + x[4*i1+2*i2+1];
    y[4*i1+2*i2+1] = x[4*i1+2*i2] - x[4*i1+2*i2+1];
  }
Unrolling & Scheduling
● Unrolling produces larger basic blocks
  ● Decreases number of conditional branches
  ● Increases number of operations in a basic block
  ● Easier to determine data dependencies
  ● Easier to do rescheduling
for(i1=0; i1<4; i1++)
{ y[4*i1]   = x[4*i1]   + x[4*i1+1];
  y[4*i1+1] = x[4*i1]   - x[4*i1+1];
  y[4*i1+2] = x[4*i1+2] + x[4*i1+3];
  y[4*i1+3] = x[4*i1+2] - x[4*i1+3];
}
Scalar Replacement
● Pointer analysis – complicated
● Replace arrays that are fully inside the scope of the innermost loop by 1 automatic scalar variable per array element
double t[2];
for(i=0; i<8; i++)
{ t[0] = x[2*i] + x[2*i+1];
  t[1] = x[2*i] - x[2*i+1];
  y[2*i]   = t[0] * D[2*i];
  y[2*i+1] = t[1] * D[2*i];
}
● Scalar replaced:
double t0, t1;
for(i=0; i<8; i++)
{ t0 = x[2*i] + x[2*i+1];
  t1 = x[2*i] - x[2*i+1];
  y[2*i]   = t0 * D[2*i];
  y[2*i+1] = t1 * D[2*i];
}
● Automatic variables – can be stored in registers – arrays stay in memory
Precomputation of Constants
● All constants known ahead of time – should be precomputed at compile time or initialization time & stored in a data array
for(i=0; i<8; i++)
  y[i] = x[i] * sin(M_PI * i / 8);
● Precomputed:
static double D[8];
void init()
{ for(int i=0; i<8; i++)
    D[i] = sin(M_PI * i / 8);
}
...
// in the kernel
for(i=0; i<8; i++)
  y[i] = x[i] * D[i];