TM Compiler Techniques for Single Processor Tuning An Introduction TM The Compiler Compiler manages processor resources: • registers • integer/floating-point execution units • load/store/prefetch for data flow in/out of processor • the implementation details of processor and system architecture are built into the compiler User Program (C/C++/Fortran, etc.) – high level representation Compilation process – low level representation Machine instructions Solving: –data dependencies –control flow dependencies –parallelization –compactification of code –optimal scheduling of the code TM MIPSpro Compiler Components source Executable object InterProcedural Analyzer Linker F77/f90 cc/CC driver Macro preprocessor Front-end (source to WHIRL format) Global optimizer InterProcedural Analyzer Loop nest optimizer Code generator Parallel optimizer • There are no source-to-source optimizers or parallelizers • Source code is translated to WHIRL (Winning Hierarchical Intermediate Representation Language); – same IR for different levels of representation – whirl2f and whirl2c translates back into Fortran or C from IRs • Inter-Procedural analyzer requires final translation at link time TM Compiler Optimizations • Global Optimizer: – dead code elimination – copy propagation – loop normalization • stride one loops • single induction variable – memory alias analysis – strength reduction • Inter-Procedural Analyzer: – – – – – cross-file function inlining dead function elimination dead variable elimination padding of variables in common blocks inter-procedural constant propagation • Automatic Parallelizer – loop level work distribution • Loop Nest Optimizer: – – – – – – loop unrolling (outer) loop interchange loop fusion/fission loop blocking memory prefetch padding local variables • Code Generator: – – – – – – software pipelining inner loop unrolling if-conversion read/write optimization recurrence breaking instruction scheduling inside basic blocks TM SGI Architecture, ABI, Languages • Instruction Set Architecture (ISA): – -mips4 (R1x000, R8000, R5000 processors) – -mips3 (R4400) – -mips[1|2] (R3000, R4000 processors, invokes old ucode compiler) • ABI (Application Binary Interface): – -n32 (32 bit pointers, 4 byte integers, 4 byte real) – -64 (64 bit pointers, 4 byte integers, 4 byte real) Variable • Languages: – – – – C C++ Fortran 77 Fortran 90 C size[bit] -n32 -64 char/character 8 8 short 16 16 int/integer 32 32 long 32 64 long long 64 64 logical float/real 32 32 double 64 64 pointer 32 64 F size[bit] -n32 -64 8 8 32 32 32 32 64 32 32 64 TM Options: ABI & ISA Option -n32 -64 -o32/-32 -mips[1234] Functionality invoke the MIPSpro Compiler, use 32 bit addressing invoke the MIPSpro Compiler, use 64 bit addressing invoke the old ucode compiler, 32 bit addressing ISA; -mips[12] implies ucode compiler There are two more ways to define the ABI and ISA: • environment variable “SGI_ABI” can be set to -n32 or -64 • the ABI/ISA/Processor/optimization can be set in a file ~/compiler.defaults or /etc/compiler.defaults. In addition, the location of the file can be defined by “COMPILER_DEFAULTS_PATH” environment variable. The file should contain a line like: DEFAULT:abi=n32:isa=mips4:proc=r10000:arith=3:opt=O3 There is a way to find which compiler flags were used: dwarfdump -i file.o | grep DW_AT_producer TM Optimization Levels Compilation speed degrades with higher optimization • -O0 turn off all optimizations • -O1 only local optimizations • -O2 or -O extensive but conservative optimizations • -O3 aggressive optimizations, LNO, software pipelining • -ipa inter-procedural analysis (only at -O2 and -O3) • -apo automatic parallelization option (same as -pfa) • -g[0|3] debugging switch: -g0 forces -O0 -g3 to debug with -O3 TM Options: Performance Option -r10000 -r8000 Functionality Generate optimal instruction schedule for the R10000 proc Generate optimal instruction schedule for the R8000 proc -O[0|1|2|3] Set optimization Level to 0, 1, 2, 3 -Ofast=[ipXX] Select best optimization for the given architecture XX machine (output of the hinv -c processor command) 27 Origin2000 (all cpu frequencies and cache sizes) 35 Origin3000 (all cpu frequencies and cache sizes) optimizations may differ on the version of the compiler. Currently: -O3 -IPA -TARG:platform=ip27 -n32 -OPT:Olimit=0:roundoff=3:div_split=ON:alias=typed (thus -Ofast switch invokes the Interprocedural Analyzer) -mp -mpio -apo Enable multi-processing directives Support I/O from a parallel region Invoke automatic parallelization option TM Options: Porting Option -d8/d16 -r8 -i8 -static Functionality Double precision variables as 8 or 16 bytes Convert REAL to REAL*8 and COMPLEX to COMPLEX*16 (1) Convert INTEGER to INTEGER*8 and LOGICAL to 8 byte sizes (1) Local variables will be initialized in fixed locations on the heap (-static_threadprivate makes static variables private to each thread) -col[72|120] Source line is 72 or 120 columns -Dname -Idir -alignN -G0 -xgot -multigot Define name for the pre-processor Define include directory dir Assume alignment on the N=8,16,32,64,128 bit boundary Put all static data into indirect address area make big tables for static data and program addresses -version -show Show compiler version Put the compiler in verbose mode: all switches are displayed Automatic choice of table sizes for static variables and addresses (1) Note: explicit sizes are preserved, i.e. REAL*4 remains 32 bit TM Options: Debugging Option -g -DEBUG: Functionality Disable optimization and keep all symbol tables the DEBUG group option (man DEBUG_GROUP): • check_div=n n=1 (default) check integer divide by zero n=2 check integer overflow n=3 check integer divide by zero and overflow • subscript_check (default ON) to check for subscripts out of range C/C++: produces trap #8 f77: aborts run and dumps core f90: aborts run if setenv F90_BOUNDS_CHECK_ABORT • verbose_runtime (default OFF) to give source line number of failures • trap_uninitialized (default OFF) initialize all variables to 0xFFFA5A5 when used as pointer - access violation when used as fp values - NaN causes fp trap Example: f77 -n32 -mips4 -g file.f \ -DEBUG:subscript_check:verbose_runtime=ON \ -DEBUG:check_div=3 -DEBUG:trap_uninitialized=ON TM Compilation Examples 1. Produce executable a.out with default compilation options: f77 source.f cc source.f be aware of the defaults setting (e.g. /etc/compiler.defaults ) same flags for Fortran and C 2. Options for debugging: f77/cc -o prog -n32 -g -static source.f 2. Explicit setting of ABI/ISA/Processor, highest opt: f77/cc -o prog -n32 -mips4 -r10000 -O3 source.f 3. Detailed control of the optimization process with the group options : f77/cc -o prog -64 -mips4 -O3 -Ofast=ip27 -OPT:round=3:IEEE_arith=3 -IPA:dfe=on ... TM Fine Tuning Compiler Actions Compiler performs many sophisticated optimizations on the source code under certain assumption about the program. Typically: • • • • program data is large (does not fit into the cache) program does not violate language standard program is insensitive to roundoff errors all data in the program is alias-ed, unless it can be proved otherwise if one or more of these assumptions does not hold, compiler should be tuned to the program with the compiler options. Most important: • OPT for general optimizations assumptions • LNO for the Loop Nest optimizer options • IPA for the Inter-Procedural Analyzer options Additional options that help to tune the compiler properly: • TENV, TARG for the target machine and environment description -TENV:align_aggregates=x (bytes) • LIST, DEBUG for the listing and debugging options TM Group Options Compiler options can be set with the key=value expressions on the command line. These options are combined in logical groups. Multiple key=val expressions are colon separated; same group headings can be specified several times, the effects are cumulative: E.g.: -OPT:roundoff=2:alias=restrict -OPT:IEEE_arithmetic=3 etc. Group Heading Reference page Usage comments -OPT:key=val -TENV:key=val -TARG:key=val -FLIST/CLIST -LIST:key=val -DEBUG:key=val -IPA:key=val -INLINE:key=val -LNO:key=val -MP:key=val -LANG: -CG: -WOPT: cc(1) f77(1) opt(5) cc(1) f77(1) cc(1) f77(1) cc(1) f77(1) cc(1) f77(1) debug_group(5) ipa(5) ipa(5) lno(5) cc(1) f77(1) cc(1) f77(1) cc(1) f77(1) cc(1) f77(1) Optimizations Control target environment Control target architecture Listing control Options to control listing Debugging options Inter-Procedural Analyzer control Procedure inliner control Loop Nest Optimizer control Parallelization control language compatibility features code generation global optimizer TM Compiler man Pages • Primary man pages: man f77(1) f90(1) cc(1) CC(1) ld(1) • some of the compiler option groups are rather large and deserve their own man pages man opt(5) man lno(5) man ipa(5) man DEBUG_GROUP(5) man mp(3F) man pe_environ(5) man sigfpe(3C) TM The Run-Time Library Structure *.a, *.so ucode-compiler lib/ o32 nonshared/*.a *.a, *.so Cmplrs/mongoose-compiler nonshared/*.a /usr lib32/ mips3 mips4 *.a, *.so nonshared/*.a *.a, *.so nonshared/*.a n32 *.a, *.so lib64/ Cmplrs/mongoose-compiler nonshared/*.a Environment variables: LD_LIBRARY_PATH LD_LIBRARY32_PATH LD_LIBRARY64_PATH mips3 mips4 *.a, *.so nonshared/*.a *.a, *.so nonshared/*.a n64 TM The Scientific Libraries Standard scientific libraries containing: • Basic Linear Algebra operations and algorithms: – BLAS1, BLAS2, BLAS3 (see man intro_blas1,_blas2,_blas3) – LAPACK (see man intro_lapack) • Fast Fourier Transformations (FFT): – 1D, 2D, 3D, multiple 1D transformations (see man intro_fft) • Convolutions (Signal Processing, e.g. man SIIR2D) • Sparse Solvers (see man solvers; man PSLDLT) To use: – -lscs serial versions ( -lscs_i8, -lscs_i8_mp for long integers) – -lscs_mp -mp for parallel versions – man intro_scsl for detailed description – -lcomplib.sgimath or -lcomplib.sgimath_mp for older versions – man complib.sgimath for detailed description TM Computational Domain Range of numbers (from /usr/include/limits.h): FLT_DIG FLT_MAX FLT_MIN 6 /* decimal digits of precision of a float */ 3.40282347E+38F 1.17549435E-38F DBL_DIG DBL_MAX DBL_MIN 15 /* decimal digits of precision of a double */ 1.7976931348623157E+308 2.2250738585072014E-308 LONGLONG_MIN LONGLONG_MAX ULONGLONG_MAX -9223372036854775807LL-1LL 9223372036854775807LL 18446744073709551615LLU The extended precision (REAL*16) is available and supported by the compiler. But this mode of calculation is slow (by factor ~40) TM Underflow and Denormal Numbers When de-normalized numbers emerge in a computation (i.e. numbers x<DBL_MIN) they are flushed to zero by default: Program denorm real*8 a,b a = 2.2250738585072014D-308 b = a/10.0D0 write(6,10) b end #include <sys/fpu.h> void no_flush_() { union fpc_csr f; f.fc_word = get_fpc_csr(); f.fc_struct.flush = 0; set_fpc_csr(f.fc_word); } will print zero. To force IEEE-754 gradual underflow it is necessary to manipulate status register on the R1x000 cpu. Calling no_flush at the beginning of the program will print 0.22250738585072014D-308 Flush-to-zero property can lead to x-y=0, while xy . Keeping de-normalized numbers in computations will avoid that condition, but will cause fp exception, that must be processed in software. It is a performance issue - not to manipulate the de-normalized numbers in calculations. TM Overflow Example Program example that generates overflows and underflows: Parameter (N=20) INCLUDE “/usr/include/limits.h” Real*8 A(N),B(N) Compile with: f77 -n32 -mips4 -O3 complex*16 C(N) Note: Compilation with -r8 avoids the error. do I=1,N A(I) = (FLT_MAX/10)*I ! single precision range B(I) = (FLT_MIN*10)/I ! will fit into double enddo C = CMPLX(A,B) ! Standard requires passing from base precision: real*4 ! write (0,’(I3,2(2G22.15/))’) (I,A(I),B(I),C(I),I=1,N) Output with all exceptions ignored by default: 10 0.340282347000000E+39 0.340282346638529E+39 11 0.374310581700000E+39 Overflow! Infinity 12 etc… 0.117549435000000E-37 0.117549435082229E-37 0.106863122727273E-37 0.000000000000000 A,B Cr,Ci Flush to zero! setenv TRAP_FPE “UNDERFL=TRACE; OVERFL=TRACE“ will trap at Overflow and Underflow and produce traceback (Link -lfpe). TM Floating Point Exceptions A fp status register flag is set when fpu is has an illegal condition: • • • • • division by zero overflow underflow invalid inexact By default, all exceptions are ignored! (e.g. for 1/0 NaN value is set and execution continues) The status register can be programmed to raise a Floating Point Exception. If an FPE occurs, the system can take a specified action: • abort • ignore the exception • repair the illegal condition You can manipulate the status register to select action: • with calls to the FPE library, link with -lfpe • with environment variable TRAP_FPE see man handle_sigfpes TM Compiler-Generated Exceptions •Compiler can do more optimizations if it is allowed to generate code that could cause exceptions (-TENV:X=0..4) X X X X X = = = = = 0 1 2 3 4 no speculative code motion IEEE-754 underflow and inexact FPE disabled (default -O0 and -O2) all IEEE-754 exceptions disabled except 1/0 (default -O3) all IEEE-754 exceptions are disabled memory access exceptions are disabled IF-conversion with conditional moves for Software Pipelining (with -O3): Do i=1,N if(a(i) .lt. eps) then x = x + 1/eps else x = x + 1/a(i) endif enddo #put eps in $f1 and (1/eps) in $f0 ldc1 c.lt.d recip.d movt.d add.d $f5,-8($2) $fcc0,$f5,$f1 $f2,$f5 $f2,$f0,$fcc0 $f1,$f1,$f2 load a(i) if(a(i) < eps) 1/a(i) y = 1/a or 1/eps x = x+ y Removing IF(…) will cause divide by zero! In this case this exception must be ignored Note: transf applied already at -O3 & X=1! TM IEEE_754 Compliance The MIPS4 instruction set contains IEEE-754 non-compliant instructions: • recip.s/d (reciprocal 1/x) instruction is accurate to 1 ulp • rsqrt.s/d (reciprocal-sqrt: 1/sqrt(x)) instruction to 2 ulp -OPT:IEEE_arithmetic=X specify degree of non-compliance and what to do with inf and NaN operands X = 1 strict IEEE-754 compliance; does not use the recip and rsqrt instructions (-O1,2) X = 2 optimize 0*x=0 and x/x=1 while x can be NaN (default at -O3) X = 3 any mathematically valid transformation is allowed, including recip & rsqrt instr. y_tmp = 1/y Do i=1,n -O3 do i=1,n x = x + a(i)/y -OPT:IEEE_arithmetic=3 x = x + a(I)*y_tmp enddo enddo 21 cycles/iteration; 4% peak 1 cycles/iteration; 100% peak Note: X=3 is required! TM Rounding Accuracy Rounding mode can be specified with -OPT:roundoff=X switch: X = 0 no optimizations that affect fp behaviour (default at -O1 -O2) X = 1 allows simple transformations with limited round-off and overflow differences X = 2 allows reordering of reduction loops (default at -O3) X = 3 any mathematically valid transformation is allowed do i=1,n x = x + a(i) enddo With -O3 -OPT:roundoff=1 2 cycles/iter; 25% peak -O3 -OPT:roundoff=2 (default at -O3) do i=1,n,8 x0 = x0 + a(i) x1 = x1 + a(i+1) … enddo x = x0 + x1 + … 1 cycles/iter; 50% peak Recommendation: Your program should work correctly when compiled with -O3 -OPT:IEEE_arithmetic=3:roundoff=3 TM Summary Compiler is the primary tool of program optimization • Compilation is the process of lowering the code representation from high level to low, I.e. processor level • The MipsPro compiler targets the MIPS R1x000 processor and has built in the features of the processor and Origin architecture • A large number of options exist to steer the compilation process – ABI, ISA and optimization options selections – setting of assumptions about the program behaviour • There are optimized and parallelized libraries of subroutines for scientific computation • When programming for a digital computer, it is important to remember the limitations due to limited validity range of the floating point calculations