Compiler Techniques for Single Processor Tuning An Introduction TM

advertisement
TM
Compiler Techniques for
Single Processor Tuning
An Introduction
TM
The Compiler
Compiler manages processor resources:
• registers
• integer/floating-point execution units
• load/store/prefetch for data flow in/out of processor
• the implementation details of processor and system
architecture are built into the compiler
User Program (C/C++/Fortran, etc.)
– high level representation
Compilation
process
– low level representation
Machine instructions
Solving:
–data dependencies
–control flow dependencies
–parallelization
–compactification of code
–optimal scheduling of the code
TM
MIPSpro Compiler Components
source
Executable object
InterProcedural
Analyzer
Linker
F77/f90
cc/CC
driver
Macro
preprocessor
Front-end
(source to
WHIRL
format)
Global
optimizer
InterProcedural
Analyzer
Loop
nest
optimizer
Code
generator
Parallel
optimizer
• There are no source-to-source optimizers or parallelizers
• Source code is translated to WHIRL (Winning Hierarchical Intermediate
Representation Language);
– same IR for different levels of representation
– whirl2f and whirl2c translates back into Fortran or C from IRs
• Inter-Procedural analyzer requires final translation at link time
TM
Compiler Optimizations
• Global Optimizer:
– dead code elimination
– copy propagation
– loop normalization
• stride one loops
• single induction variable
– memory alias analysis
– strength reduction
• Inter-Procedural Analyzer:
–
–
–
–
–
cross-file function inlining
dead function elimination
dead variable elimination
padding of variables in common blocks
inter-procedural constant propagation
• Automatic Parallelizer
– loop level work distribution
• Loop Nest Optimizer:
–
–
–
–
–
–
loop unrolling (outer)
loop interchange
loop fusion/fission
loop blocking
memory prefetch
padding local variables
• Code Generator:
–
–
–
–
–
–
software pipelining
inner loop unrolling
if-conversion
read/write optimization
recurrence breaking
instruction scheduling
inside basic blocks
TM
SGI Architecture, ABI, Languages
• Instruction Set Architecture (ISA):
– -mips4 (R1x000, R8000, R5000 processors)
– -mips3 (R4400)
– -mips[1|2] (R3000, R4000 processors, invokes old ucode compiler)
• ABI (Application Binary Interface):
– -n32 (32 bit pointers, 4 byte integers, 4 byte real)
– -64 (64 bit pointers, 4 byte integers, 4 byte real)
Variable
• Languages:
–
–
–
–
C
C++
Fortran 77
Fortran 90
C size[bit]
-n32 -64
char/character 8
8
short
16
16
int/integer
32
32
long
32
64
long long
64
64
logical
float/real
32
32
double
64
64
pointer
32
64
F size[bit]
-n32 -64
8
8
32
32
32
32
64
32
32
64
TM
Options: ABI & ISA
Option
-n32
-64
-o32/-32
-mips[1234]
Functionality
invoke the MIPSpro Compiler, use 32 bit addressing
invoke the MIPSpro Compiler, use 64 bit addressing
invoke the old ucode compiler, 32 bit addressing
ISA; -mips[12] implies ucode compiler
There are two more ways to define the ABI and ISA:
• environment variable “SGI_ABI” can be set to -n32 or -64
• the ABI/ISA/Processor/optimization can be set in a file
~/compiler.defaults or /etc/compiler.defaults.
In addition, the location of the file can be defined by
“COMPILER_DEFAULTS_PATH” environment variable.
The file should contain a line like:
DEFAULT:abi=n32:isa=mips4:proc=r10000:arith=3:opt=O3
There is a way to find which compiler flags were used:
dwarfdump -i file.o | grep DW_AT_producer
TM
Optimization Levels
Compilation speed degrades with higher optimization
• -O0
turn off all optimizations
• -O1
only local optimizations
• -O2 or -O extensive but conservative optimizations
• -O3
aggressive optimizations, LNO, software pipelining
• -ipa
inter-procedural analysis (only at -O2 and -O3)
• -apo
automatic parallelization option (same as -pfa)
• -g[0|3]
debugging switch:
-g0 forces -O0
-g3 to debug with -O3
TM
Options: Performance
Option
-r10000
-r8000
Functionality
Generate optimal instruction schedule for the R10000 proc
Generate optimal instruction schedule for the R8000 proc
-O[0|1|2|3]
Set optimization Level to 0, 1, 2, 3
-Ofast=[ipXX]
Select best optimization for the given architecture
XX
machine (output of the hinv -c processor command)
27
Origin2000 (all cpu frequencies and cache sizes)
35
Origin3000 (all cpu frequencies and cache sizes)
optimizations may differ on the version of the compiler. Currently:
-O3 -IPA -TARG:platform=ip27 -n32
-OPT:Olimit=0:roundoff=3:div_split=ON:alias=typed
(thus -Ofast switch invokes the Interprocedural Analyzer)
-mp
-mpio
-apo
Enable multi-processing directives
Support I/O from a parallel region
Invoke automatic parallelization option
TM
Options: Porting
Option
-d8/d16
-r8
-i8
-static
Functionality
Double precision variables as 8 or 16 bytes
Convert REAL to REAL*8 and COMPLEX to COMPLEX*16
(1)
Convert INTEGER to INTEGER*8 and LOGICAL to 8 byte sizes (1)
Local variables will be initialized in fixed locations on the heap
(-static_threadprivate makes static variables private to each thread)
-col[72|120]
Source line is 72 or 120 columns
-Dname
-Idir
-alignN
-G0
-xgot
-multigot
Define name for the pre-processor
Define include directory dir
Assume alignment on the N=8,16,32,64,128 bit boundary
Put all static data into indirect address area
make big tables for static data and program addresses
-version
-show
Show compiler version
Put the compiler in verbose mode: all switches are displayed
Automatic choice of table sizes for static variables and addresses
(1) Note: explicit sizes are preserved, i.e. REAL*4 remains 32 bit
TM
Options: Debugging
Option
-g
-DEBUG:
Functionality
Disable optimization and keep all symbol tables
the DEBUG group option (man DEBUG_GROUP):
• check_div=n n=1 (default) check integer divide by zero
n=2 check integer overflow
n=3 check integer divide by zero and overflow
• subscript_check
(default ON) to check for subscripts out of range
C/C++: produces trap #8
f77:
aborts run and dumps core
f90:
aborts run if setenv F90_BOUNDS_CHECK_ABORT
• verbose_runtime
(default OFF) to give source line number of failures
• trap_uninitialized
(default OFF) initialize all variables to 0xFFFA5A5
when used as pointer - access violation
when used as fp values - NaN causes fp trap
Example:
f77 -n32 -mips4 -g file.f
\
-DEBUG:subscript_check:verbose_runtime=ON \
-DEBUG:check_div=3 -DEBUG:trap_uninitialized=ON
TM
Compilation Examples
1. Produce executable a.out with default compilation options:
f77 source.f
cc source.f
be aware of the defaults setting (e.g. /etc/compiler.defaults )
same flags for Fortran and C
2. Options for debugging:
f77/cc -o prog -n32 -g -static source.f
2. Explicit setting of ABI/ISA/Processor, highest opt:
f77/cc -o prog -n32 -mips4 -r10000 -O3 source.f
3. Detailed control of the optimization process with the
group options :
f77/cc -o prog -64 -mips4 -O3 -Ofast=ip27
-OPT:round=3:IEEE_arith=3 -IPA:dfe=on ...
TM
Fine Tuning Compiler Actions
Compiler performs many sophisticated optimizations on the source
code under certain assumption about the program. Typically:
•
•
•
•
program data is large (does not fit into the cache)
program does not violate language standard
program is insensitive to roundoff errors
all data in the program is alias-ed, unless it can be
proved otherwise
if one or more of these assumptions does not hold, compiler should
be tuned to the program with the compiler options. Most important:
• OPT for general optimizations assumptions
• LNO for the Loop Nest optimizer options
• IPA for the Inter-Procedural Analyzer options
Additional options that help to tune the compiler properly:
• TENV, TARG for the target machine and environment description
-TENV:align_aggregates=x (bytes)
• LIST, DEBUG for the listing and debugging options
TM
Group Options
Compiler options can be set with the key=value expressions on the command line. These
options are combined in logical groups. Multiple key=val expressions are colon separated; same
group headings can be specified several times, the effects are cumulative:
E.g.: -OPT:roundoff=2:alias=restrict -OPT:IEEE_arithmetic=3 etc.
Group Heading
Reference page
Usage comments
-OPT:key=val
-TENV:key=val
-TARG:key=val
-FLIST/CLIST
-LIST:key=val
-DEBUG:key=val
-IPA:key=val
-INLINE:key=val
-LNO:key=val
-MP:key=val
-LANG:
-CG:
-WOPT:
cc(1) f77(1) opt(5)
cc(1) f77(1)
cc(1) f77(1)
cc(1) f77(1)
cc(1) f77(1)
debug_group(5)
ipa(5)
ipa(5)
lno(5)
cc(1) f77(1)
cc(1) f77(1)
cc(1) f77(1)
cc(1) f77(1)
Optimizations
Control target environment
Control target architecture
Listing control
Options to control listing
Debugging options
Inter-Procedural Analyzer control
Procedure inliner control
Loop Nest Optimizer control
Parallelization control
language compatibility features
code generation
global optimizer
TM
Compiler man Pages
• Primary man pages:
man f77(1) f90(1) cc(1) CC(1) ld(1)
• some of the compiler option groups are rather large and deserve their
own man pages
man opt(5)
man lno(5)
man ipa(5)
man DEBUG_GROUP(5)
man mp(3F)
man pe_environ(5)
man sigfpe(3C)
TM
The Run-Time Library Structure
*.a, *.so
ucode-compiler
lib/
o32
nonshared/*.a
*.a, *.so
Cmplrs/mongoose-compiler
nonshared/*.a
/usr
lib32/
mips3
mips4
*.a, *.so
nonshared/*.a
*.a, *.so
nonshared/*.a
n32
*.a, *.so
lib64/
Cmplrs/mongoose-compiler
nonshared/*.a
Environment variables:
LD_LIBRARY_PATH
LD_LIBRARY32_PATH
LD_LIBRARY64_PATH
mips3
mips4
*.a, *.so
nonshared/*.a
*.a, *.so
nonshared/*.a
n64
TM
The Scientific Libraries
Standard scientific libraries containing:
• Basic Linear Algebra operations and algorithms:
– BLAS1, BLAS2, BLAS3 (see man intro_blas1,_blas2,_blas3)
– LAPACK (see man intro_lapack)
• Fast Fourier Transformations (FFT):
– 1D, 2D, 3D, multiple 1D transformations (see man intro_fft)
• Convolutions (Signal Processing, e.g. man SIIR2D)
• Sparse Solvers (see man solvers; man PSLDLT)
To use:
– -lscs serial versions ( -lscs_i8, -lscs_i8_mp for long integers)
– -lscs_mp -mp for parallel versions
– man intro_scsl for detailed description
– -lcomplib.sgimath or -lcomplib.sgimath_mp for older versions
– man complib.sgimath for detailed description
TM
Computational Domain
Range of numbers (from /usr/include/limits.h):
FLT_DIG
FLT_MAX
FLT_MIN
6 /* decimal digits of precision of a float */
3.40282347E+38F
1.17549435E-38F
DBL_DIG
DBL_MAX
DBL_MIN
15 /* decimal digits of precision of a double */
1.7976931348623157E+308
2.2250738585072014E-308
LONGLONG_MIN
LONGLONG_MAX
ULONGLONG_MAX
-9223372036854775807LL-1LL
9223372036854775807LL
18446744073709551615LLU
The extended precision (REAL*16) is available and supported by the
compiler. But this mode of calculation is slow (by factor ~40)
TM
Underflow and Denormal Numbers
When de-normalized numbers emerge in a computation
(i.e. numbers x<DBL_MIN) they are flushed to zero by default:
Program denorm
real*8 a,b
a = 2.2250738585072014D-308
b = a/10.0D0
write(6,10) b
end
#include <sys/fpu.h>
void no_flush_()
{
union fpc_csr f;
f.fc_word = get_fpc_csr();
f.fc_struct.flush = 0;
set_fpc_csr(f.fc_word);
}
will print zero. To force IEEE-754 gradual underflow it is necessary to
manipulate status register on the R1x000 cpu.
Calling no_flush at the beginning of the program will print
0.22250738585072014D-308
Flush-to-zero property can lead to x-y=0, while xy .
Keeping de-normalized numbers in computations will avoid that condition, but will cause
fp exception, that must be processed in software.
It is a performance issue - not to manipulate the de-normalized
numbers in calculations.
TM
Overflow Example
Program example that generates overflows and underflows:
Parameter (N=20)
INCLUDE “/usr/include/limits.h”
Real*8 A(N),B(N)
Compile with: f77 -n32 -mips4 -O3
complex*16 C(N)
Note: Compilation with -r8 avoids the error.
do I=1,N
A(I) = (FLT_MAX/10)*I ! single precision range
B(I) = (FLT_MIN*10)/I ! will fit into double
enddo
C = CMPLX(A,B)
! Standard requires passing from base precision: real*4 !
write (0,’(I3,2(2G22.15/))’) (I,A(I),B(I),C(I),I=1,N)
Output with all exceptions ignored by default:
10
0.340282347000000E+39
0.340282346638529E+39
11
0.374310581700000E+39
Overflow!
Infinity
12
etc…
0.117549435000000E-37
0.117549435082229E-37
0.106863122727273E-37
0.000000000000000
A,B
Cr,Ci
Flush to zero!
setenv TRAP_FPE “UNDERFL=TRACE; OVERFL=TRACE“
will trap at Overflow and Underflow and produce traceback (Link -lfpe).
TM
Floating Point Exceptions
A fp status register flag is set when fpu is has an illegal condition:
•
•
•
•
•
division by zero
overflow
underflow
invalid
inexact
By default, all exceptions are ignored!
(e.g. for 1/0 NaN value is set and execution continues)
The status register can be programmed to raise a Floating Point Exception.
If an FPE occurs, the system can take a specified action:
• abort
• ignore the exception
• repair the illegal condition
You can manipulate the status register to select action:
• with calls to the FPE library, link with -lfpe
• with environment variable TRAP_FPE
see man handle_sigfpes
TM
Compiler-Generated Exceptions
•Compiler can do more optimizations if it is allowed to generate
code that could cause exceptions (-TENV:X=0..4)
X
X
X
X
X
=
=
=
=
=
0
1
2
3
4
no speculative code motion
IEEE-754 underflow and inexact FPE disabled (default -O0 and -O2)
all IEEE-754 exceptions disabled except 1/0 (default -O3)
all IEEE-754 exceptions are disabled
memory access exceptions are disabled
IF-conversion with conditional moves for Software Pipelining (with -O3):
Do i=1,N
if(a(i) .lt. eps) then
x = x + 1/eps
else
x = x + 1/a(i)
endif
enddo
#put eps in $f1 and (1/eps) in $f0
ldc1
c.lt.d
recip.d
movt.d
add.d
$f5,-8($2)
$fcc0,$f5,$f1
$f2,$f5
$f2,$f0,$fcc0
$f1,$f1,$f2
load a(i)
if(a(i) < eps)
1/a(i)
y = 1/a or 1/eps
x = x+ y
Removing IF(…) will cause divide by zero!
In this case this exception must be ignored
Note: transf applied already at -O3 & X=1!
TM
IEEE_754 Compliance
The MIPS4 instruction set contains IEEE-754 non-compliant instructions:
• recip.s/d
(reciprocal 1/x) instruction is accurate to 1 ulp
• rsqrt.s/d
(reciprocal-sqrt: 1/sqrt(x)) instruction to 2 ulp
-OPT:IEEE_arithmetic=X specify degree of non-compliance
and what to do with inf and NaN operands
X = 1 strict IEEE-754 compliance; does not use the recip and rsqrt instructions (-O1,2)
X = 2 optimize 0*x=0 and x/x=1 while x can be NaN (default at -O3)
X = 3 any mathematically valid transformation is allowed, including recip & rsqrt instr.
y_tmp = 1/y
Do i=1,n
-O3
do i=1,n
x = x + a(i)/y
-OPT:IEEE_arithmetic=3
x = x + a(I)*y_tmp
enddo
enddo
21 cycles/iteration; 4% peak
1 cycles/iteration; 100% peak
Note: X=3
is required!
TM
Rounding Accuracy
Rounding mode can be specified with -OPT:roundoff=X switch:
X = 0 no optimizations that affect fp behaviour (default at -O1 -O2)
X = 1 allows simple transformations with limited round-off and overflow differences
X = 2 allows reordering of reduction loops (default at -O3)
X = 3 any mathematically valid transformation is allowed
do i=1,n
x = x + a(i)
enddo
With -O3 -OPT:roundoff=1
2 cycles/iter; 25% peak
-O3 -OPT:roundoff=2
(default at -O3)
do i=1,n,8
x0 = x0 + a(i)
x1 = x1 + a(i+1)
…
enddo
x = x0 + x1 + …
1 cycles/iter; 50% peak
Recommendation: Your program should work correctly when
compiled with -O3 -OPT:IEEE_arithmetic=3:roundoff=3
TM
Summary
Compiler is the primary tool of program optimization
• Compilation is the process of lowering the code representation
from high level to low, I.e. processor level
• The MipsPro compiler targets the MIPS R1x000 processor and has
built in the features of the processor and Origin architecture
• A large number of options exist to steer the compilation process
– ABI, ISA and optimization options selections
– setting of assumptions about the program behaviour
• There are optimized and parallelized libraries of subroutines for
scientific computation
• When programming for a digital computer, it is important to
remember the limitations due to limited validity range of the
floating point calculations
Download