Compiler Techniques for Single Processor Tuning: An Introduction

The Compiler

The compiler manages processor resources:
• registers
• integer/floating-point execution units
• load/store/prefetch for data flow in/out of the processor
• the implementation details of the processor and system architecture are built into the compiler

User program (C/C++/Fortran, etc.): high-level representation
  -> compilation process ->
Machine instructions: low-level representation

The compilation process solves:
– data dependencies
– control flow dependencies
– parallelization
– compactification of code
– optimal scheduling of the code

MIPSpro Compiler Components

[Figure: compiler component diagram. Labels: source; f77/f90 and cc/CC driver; macro preprocessor; front-end (source to WHIRL format); Inter-Procedural Analyzer; loop nest optimizer; parallel optimizer; global optimizer; code generator; object; linker (with Inter-Procedural Analyzer); executable]

• There are no source-to-source optimizers or parallelizers
• Source code is translated to WHIRL (Winning Hierarchical Intermediate Representation Language):
  – same IR for different levels of representation
  – whirl2f and whirl2c translate back into Fortran or C from the IR
• The Inter-Procedural Analyzer requires final translation at link time

Compiler Optimizations

• Global Optimizer:
  – dead code elimination
  – copy propagation
  – loop normalization
    • stride one loops
    • single induction variable
  – memory alias analysis
  – strength reduction
• Inter-Procedural Analyzer:
  – cross-file function inlining
  – dead function elimination
  – dead variable elimination
  – padding of variables in common blocks
  – inter-procedural constant propagation
• Automatic Parallelizer:
  – loop level work distribution
• Loop Nest Optimizer:
  – loop unrolling (outer)
  – loop interchange
  – loop fusion/fission
  – loop blocking
  – memory prefetch
  – padding local variables
• Code Generator:
  – software pipelining
  – inner loop unrolling
  – if-conversion
  – read/write optimization
  – recurrence breaking
  – instruction scheduling inside basic blocks

SGI Architecture, ABI, Languages

• Instruction Set Architecture (ISA):
  – -mips4 (R1x000, R8000, R5000 processors)
  – -mips3 (R4400)
  – -mips[1|2] (R3000, R4000 processors, invokes old ucode compiler)
• ABI (Application Binary Interface):
  – -n32 (32 bit pointers, 4 byte integers, 4 byte real)
  – -64 (64 bit pointers, 4 byte integers, 4 byte real)
• Languages:
  – C
  – C++
  – Fortran 77
  – Fortran 90

Variable sizes [bit]:

  Variable          C -n32   C -64
  char/character       8        8
  short               16       16
  int/integer         32       32
  long                32       64
  long long           64       64
  logical              (no C entry)
  float/real          32       32
  double              64       64
  pointer             32       64

  Fortran sizes [bit] for -n32 and -64, as listed in the source (row alignment lost in extraction): 8 8 32 32 32 32 64 32 32 64

Options: ABI & ISA

  Option        Functionality
  -n32          invoke the MIPSpro compiler, use 32 bit addressing
  -64           invoke the MIPSpro compiler, use 64 bit addressing
  -o32/-32      invoke the old ucode compiler, 32 bit addressing
  -mips[1234]   ISA; -mips[12] implies the ucode compiler

There are two more ways to define the ABI and ISA:
• the environment variable “SGI_ABI” can be set to -n32 or -64
• the ABI/ISA/processor/optimization can be set in a file ~/compiler.defaults or /etc/compiler.defaults. In addition, the location of the file can be defined by the “COMPILER_DEFAULTS_PATH” environment variable.
The file should contain a line like:

  DEFAULT:abi=n32:isa=mips4:proc=r10000:arith=3:opt=O3

To find out which compiler flags were used to build an object file:

  dwarfdump -i file.o | grep DW_AT_producer

Optimization Levels

Compilation speed degrades with higher optimization.
• -O0        turn off all optimizations
• -O1        only local optimizations
• -O2 or -O  extensive but conservative optimizations
• -O3        aggressive optimizations, LNO, software pipelining
• -ipa       inter-procedural analysis (only at -O2 and -O3)
• -apo       automatic parallelization option (same as -pfa)
• -g[0|3]    debugging switch: -g0 forces -O0; -g3 to debug with -O3

Options: Performance

  Option          Functionality
  -r10000         Generate optimal instruction schedule for the R10000 processor
  -r8000          Generate optimal instruction schedule for the R8000 processor
  -O[0|1|2|3]     Set optimization level to 0, 1, 2, 3
  -Ofast=[ipXX]   Select best optimization for the given architecture;
                  XX is the machine type (output of the hinv -c processor command):
                    27  Origin2000 (all cpu frequencies and cache sizes)
                    35  Origin3000 (all cpu frequencies and cache sizes)
                  The optimizations selected may differ with the version of the compiler. Currently:
                    -O3 -IPA -TARG:platform=ip27 -n32
                    -OPT:Olimit=0:roundoff=3:div_split=ON:alias=typed
                  (thus the -Ofast switch invokes the Inter-Procedural Analyzer)
  -mp             Enable multi-processing directives
  -mpio           Support I/O from a parallel region
  -apo            Invoke automatic parallelization option

Options: Porting

  Option          Functionality
  -d8/d16         Double precision variables as 8 or 16 bytes
  -r8             Convert REAL to REAL*8 and COMPLEX to COMPLEX*16 (1)
  -i8             Convert INTEGER to INTEGER*8 and LOGICAL to 8 byte sizes (1)
  -static         Local variables will be initialized in fixed locations on the heap
                  (-static_threadprivate makes static variables private to each thread)
  -col[72|120]    Source line is 72 or 120 columns
  -Dname          Define name for the pre-processor
  -Idir           Define include directory dir
  -alignN         Assume alignment on the N=8,16,32,64,128 bit boundary
  -G0             Put all static data into indirect address area
  -xgot           Make big tables for static data and program addresses
  -multigot       Automatic choice of table sizes for static variables and addresses
  -version        Show compiler version
  -show           Put the compiler in verbose mode: all switches are displayed

(1) Note: explicit sizes are preserved, i.e. REAL*4 remains 32 bit

Options: Debugging

  Option    Functionality
  -g        Disable optimization and keep all symbol tables
  -DEBUG:   the DEBUG group option (man DEBUG_GROUP):
            • check_div=n
                n=1 (default) check integer divide by zero
                n=2 check integer overflow
                n=3 check integer divide by zero and overflow
            • subscript_check (default ON) to check for subscripts out of range
                C/C++: produces trap #8
                f77: aborts run and dumps core
                f90: aborts run if setenv F90_BOUNDS_CHECK_ABORT
            • verbose_runtime (default OFF) to give source line number of failures
            • trap_uninitialized (default OFF) initialize all variables to 0xFFFA5A5
                when used as pointer: access violation
                when used as fp values: NaN causes fp trap

Example:

  f77 -n32 -mips4 -g file.f \
      -DEBUG:subscript_check:verbose_runtime=ON \
      -DEBUG:check_div=3 -DEBUG:trap_uninitialized=ON

Compilation Examples

1. Produce executable a.out with default compilation options:
     f77 source.f
     cc source.c
   Be aware of the defaults setting (e.g. /etc/compiler.defaults). The same flags apply to Fortran and C.
2. Options for debugging:
     f77/cc -o prog -n32 -g -static source.f
3. Explicit setting of ABI/ISA/processor, highest optimization:
     f77/cc -o prog -n32 -mips4 -r10000 -O3 source.f
4. Detailed control of the optimization process with the group options:
     f77/cc -o prog -64 -mips4 -O3 -Ofast=ip27 -OPT:round=3:IEEE_arith=3 -IPA:dfe=on ...

Fine Tuning Compiler Actions

The compiler performs many sophisticated optimizations on the source code under certain assumptions about the program. Typically:
• program data is large (does not fit into the cache)
• program does not violate the language standard
• program is insensitive to roundoff errors
• all data in the program is aliased, unless it can be proved otherwise

If one or more of these assumptions does not hold, the compiler should be tuned to the program with compiler options.
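The aliasing assumption in the list above can be illustrated at source level. The following is a minimal sketch in C99, not MIPSpro-specific: the function names are hypothetical, and the `restrict` qualifier is used here only as the standard C way of promising the compiler that two pointers do not overlap. Both functions compute the same result when the arrays are distinct; the difference is in what the compiler is allowed to assume when it schedules loads and stores.

```c
#include <stddef.h>

/* Hypothetical helper: dst and src are plain pointers, so the compiler has
   to assume they may overlap and that a store to dst[i] can change a later
   src[j]. This limits reordering of loads and stores in the loop. */
void scaled_add_may_alias(double *dst, const double *src, double alpha, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] += alpha * src[i];
}

/* Same loop, but the C99 restrict qualifier is a promise by the programmer
   that dst and src never overlap. Calling this with overlapping arrays is
   undefined behaviour, so the promise must actually hold. */
void scaled_add_no_alias(double *restrict dst, const double *restrict src,
                         double alpha, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] += alpha * src[i];
}
```

When the no-overlap property holds for the whole program rather than for a single routine, a command-line assumption is the more natural place to state it than per-function qualifiers.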
The most important option groups are:
• OPT for general optimization assumptions
• LNO for the Loop Nest Optimizer options
• IPA for the Inter-Procedural Analyzer options

Additional options that help to tune the compiler properly:
• TENV, TARG for the target machine and environment description, e.g. -TENV:align_aggregates=x (bytes)
• LIST, DEBUG for the listing and debugging options

Group Options

Compiler options can be set with key=value expressions on the command line. These options are combined in logical groups. Multiple key=val expressions are colon separated; the same group heading can be specified several times, and the effects are cumulative. E.g.:

  -OPT:roundoff=2:alias=restrict -OPT:IEEE_arithmetic=3

  Group heading     Reference page          Usage comments
  -OPT:key=val      cc(1) f77(1) opt(5)     Optimizations
  -TENV:key=val     cc(1) f77(1)            Control target environment
  -TARG:key=val     cc(1) f77(1)            Control target architecture
  -FLIST/CLIST      cc(1) f77(1)            Listing control
  -LIST:key=val     cc(1) f77(1)            Options to control listing
  -DEBUG:key=val    debug_group(5)          Debugging options
  -IPA:key=val      ipa(5)                  Inter-Procedural Analyzer control
  -INLINE:key=val   ipa(5)                  Procedure inliner control
  -LNO:key=val      lno(5)                  Loop Nest Optimizer control
  -MP:key=val       cc(1) f77(1)            Parallelization control
  -LANG:            cc(1) f77(1)            Language compatibility features
  -CG:              cc(1) f77(1)            Code generation
  -WOPT:            cc(1) f77(1)            Global optimizer

Compiler man Pages

• Primary man pages:
    man f77(1) f90(1) cc(1) CC(1) ld(1)
• Some of the compiler option groups are rather large and deserve their own man pages:
    man opt(5)
    man lno(5)
    man ipa(5)
    man DEBUG_GROUP(5)
    man mp(3F)
    man pe_environ(5)
    man sigfpe(3C)

The Run-Time Library Structure

[Figure: run-time library directory tree]

  /usr
    lib/     o32, ucode compiler
               *.a, *.so
               nonshared/*.a
    lib32/   n32, Cmplrs/mongoose compiler
               *.a, *.so
               nonshared/*.a
               mips3/   *.a, *.so, nonshared/*.a
               mips4/   *.a, *.so, nonshared/*.a
    lib64/   n64, Cmplrs/mongoose compiler
               *.a, *.so
               nonshared/*.a
               mips3/   *.a, *.so, nonshared/*.a
               mips4/   *.a, *.so, nonshared/*.a

Environment variables: LD_LIBRARY_PATH, LD_LIBRARY32_PATH, LD_LIBRARY64_PATH

The Scientific Libraries

Standard scientific libraries containing:
• Basic Linear Algebra operations and algorithms:
  – BLAS1, BLAS2, BLAS3 (see man intro_blas1,_blas2,_blas3)
  – LAPACK (see man intro_lapack)
• Fast Fourier Transformations (FFT):
  – 1D, 2D, 3D, multiple 1D transformations (see man intro_fft)
• Convolutions (Signal Processing, e.g. man SIIR2D)
• Sparse Solvers (see man solvers; man PSLDLT)

To use:
– -lscs for serial versions (-lscs_i8, -lscs_i8_mp for long integers)
– -lscs_mp -mp for parallel versions
– man intro_scsl for a detailed description
– -lcomplib.sgimath or -lcomplib.sgimath_mp for older versions
– man complib.sgimath for a detailed description

Computational Domain

Range of numbers (from /usr/include/limits.h):

  FLT_DIG         6      /* decimal digits of precision of a float */
  FLT_MAX         3.40282347E+38F
  FLT_MIN         1.17549435E-38F

  DBL_DIG         15     /* decimal digits of precision of a double */
  DBL_MAX         1.7976931348623157E+308
  DBL_MIN         2.2250738585072014E-308

  LONGLONG_MIN    -9223372036854775807LL-1LL
  LONGLONG_MAX    9223372036854775807LL
  ULONGLONG_MAX   18446744073709551615LLU

Extended precision (REAL*16) is available and supported by the compiler, but this mode of calculation is slow (by a factor of ~40).

Underflow and Denormal Numbers

When denormalized numbers emerge in a computation (i.e. numbers x<DBL_MIN) they are flushed to zero by default. The program

      program denorm
      real*8 a,b
      a = 2.2250738585072014D-308
      b = a/10.0D0
      write(6,10) b
      end

will print zero. To force IEEE-754 gradual underflow it is necessary to manipulate the status register on the R1x000 cpu:

  #include <sys/fpu.h>
  void no_flush_()
  {
      union fpc_csr f;
      f.fc_word = get_fpc_csr();
      f.fc_struct.flush = 0;
      set_fpc_csr(f.fc_word);
  }

With a call to no_flush at the beginning, the program will print 0.22250738585072014D-308.

The flush-to-zero property can lead to x-y=0 while x≠y. Keeping denormalized numbers in computations will avoid that condition, but will cause an fp exception that must be processed in software.
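The x-y=0 while x≠y condition can be checked on any host. The following is a minimal sketch in portable C, assuming a host that uses IEEE-754 gradual underflow (the usual default on current desktop systems, unlike the flush-to-zero default described above); the function names are hypothetical. With flush-to-zero active, both computed values would be zero instead.

```c
#include <float.h>

/* Returns 1 if x is nonzero but smaller in magnitude than DBL_MIN,
   i.e. a denormalized number. */
int is_denormal(double x)
{
    double mag = x < 0.0 ? -x : x;
    return mag != 0.0 && mag < DBL_MIN;
}

/* DBL_MIN / 10, the same computation as in the Fortran program above.
   volatile forces the division to happen at run time. */
double dbl_min_over_ten(void)
{
    volatile double a = DBL_MIN;
    volatile double ten = 10.0;
    return a / ten;
}

/* x - y for two different values close to DBL_MIN. The exact difference,
   0.5 * DBL_MIN, is below DBL_MIN, so it is either kept as a denormal
   (gradual underflow) or flushed to zero. */
double gap_above_dbl_min(void)
{
    volatile double x = 1.5 * DBL_MIN;
    volatile double y = DBL_MIN;
    return x - y;
}
```

Under gradual underflow, `is_denormal(dbl_min_over_ten())` is true and `gap_above_dbl_min()` is nonzero.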
Keeping denormalized numbers out of calculations is therefore a performance issue.

Overflow Example

Program example that generates overflows and underflows:

      Parameter (N=20)
      INCLUDE "/usr/include/limits.h"
      Real*8 A(N),B(N)
      complex*16 C(N)
      do I=1,N
        A(I) = (FLT_MAX/10)*I     ! single precision range
        B(I) = (FLT_MIN*10)/I     ! will fit into double
      enddo
      C = CMPLX(A,B)   ! Standard requires passing from base precision: real*4 !
      write (0,'(I3,2(2G22.15/))') (I,A(I),B(I),C(I),I=1,N)

Compile with: f77 -n32 -mips4 -O3
Note: Compilation with -r8 avoids the error.

Output with all exceptions ignored by default:

   10  0.340282347000000E+39   0.117549435000000E-37     A,B
       0.340282346638529E+39   0.117549435082229E-37     Cr,Ci
   11  0.374310581700000E+39   0.106863122727273E-37     A,B
       Infinity                0.000000000000000         Cr,Ci   (Overflow! Flush to zero!)
   12  etc…

  setenv TRAP_FPE "UNDERFL=TRACE; OVERFL=TRACE"

will trap at overflow and underflow and produce a traceback (link with -lfpe).

Floating Point Exceptions

A flag in the fp status register is set when the FPU encounters an illegal condition:
• division by zero
• overflow
• underflow
• invalid
• inexact

By default, all exceptions are ignored! (E.g. for 1/0 the result is set to Infinity, for 0/0 to NaN, and execution continues.)

The status register can be programmed to raise a Floating Point Exception (FPE).
If an FPE occurs, the system can take a specified action:
• abort
• ignore the exception
• repair the illegal condition

You can manipulate the status register to select the action:
• with calls to the FPE library, link with -lfpe
• with the environment variable TRAP_FPE
See man handle_sigfpes.

Compiler-Generated Exceptions

• The compiler can do more optimizations if it is allowed to generate code that could cause exceptions (-TENV:X=0..4):

  X = 0   no speculative code motion
  X = 1   IEEE-754 underflow and inexact FPE disabled (default at -O0 and -O2)
  X = 2   all IEEE-754 exceptions disabled except 1/0 (default at -O3)
  X = 3   all IEEE-754 exceptions are disabled
  X = 4   memory access exceptions are disabled

IF-conversion with conditional moves for software pipelining (with -O3):

      do i=1,N
        if (a(i) .lt. eps) then
          x = x + 1/eps
        else
          x = x + 1/a(i)
        endif
      enddo

  # put eps in $f1 and (1/eps) in $f0
  ldc1     $f5,-8($2)          load a(i)
  c.lt.d   $fcc0,$f5,$f1       if (a(i) < eps)
  recip.d  $f2,$f5             1/a(i)
  movt.d   $f2,$f0,$fcc0       y = 1/a or 1/eps
  add.d    $f1,$f1,$f2         x = x + y

Removing the IF(…) means 1/a(i) is computed unconditionally, which will cause a divide by zero whenever a(i) is zero. In this case the exception must be ignored.
Note: this transformation is already applied at -O3 with X=1!

IEEE_754 Compliance

The MIPS4 instruction set contains IEEE-754 non-compliant instructions:
• recip.s/d (reciprocal 1/x) instruction is accurate to 1 ulp
• rsqrt.s/d (reciprocal square root 1/sqrt(x)) instruction is accurate to 2 ulp

-OPT:IEEE_arithmetic=X specifies the degree of non-compliance and what to do with inf and NaN operands:

  X = 1   strict IEEE-754 compliance; does not use the recip and rsqrt instructions (-O1,2)
  X = 2   optimize 0*x=0 and x/x=1 while x can be NaN (default at -O3)
  X = 3   any mathematically valid transformation is allowed, including the recip and rsqrt instructions

Example:

  -O3:
      do i=1,n
        x = x + a(i)/y
      enddo
    21 cycles/iteration; 4% peak

  -O3 -OPT:IEEE_arithmetic=3:
      y_tmp = 1/y
      do i=1,n
        x = x + a(i)*y_tmp
      enddo
    1 cycles/iteration; 100% peak

Note: X=3 is required!
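The division-to-reciprocal transformation above can be written out by hand to see why it is not strictly IEEE-754 compliant. The following is a minimal sketch in portable C with hypothetical function names; it says nothing about the cycle counts, which are specific to the R1x000. The second form rounds twice per term (once for 1/y, once for the product), so its result can differ from the first in the last bits unless 1/y is exactly representable.

```c
#include <stddef.h>

/* Reference form: one correctly rounded division per iteration. */
double sum_divide(const double *a, size_t n, double y)
{
    double x = 0.0;
    for (size_t i = 0; i < n; i++)
        x += a[i] / y;
    return x;
}

/* Transformed form: the division is hoisted out of the loop and replaced
   by a multiplication with the precomputed reciprocal y_tmp. */
double sum_times_reciprocal(const double *a, size_t n, double y)
{
    double y_tmp = 1.0 / y;
    double x = 0.0;
    for (size_t i = 0; i < n; i++)
        x += a[i] * y_tmp;
    return x;
}
```

For y = 4.0 the reciprocal is exact and both functions agree bit for bit; for y = 3.0 they are only guaranteed to agree to within rounding error, which is the kind of difference a program must tolerate before it is compiled with IEEE_arithmetic=3.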
Rounding Accuracy

The rounding mode can be specified with the -OPT:roundoff=X switch:

  X = 0   no optimizations that affect fp behaviour (default at -O1 and -O2)
  X = 1   allows simple transformations with limited round-off and overflow differences
  X = 2   allows reordering of reduction loops (default at -O3)
  X = 3   any mathematically valid transformation is allowed

Example:

      do i=1,n
        x = x + a(i)
      enddo

  With -O3 -OPT:roundoff=1 the loop is kept as written:
    2 cycles/iter; 25% peak

  With -O3 -OPT:roundoff=2 (default at -O3) the reduction is reordered:
      do i=1,n,8
        x0 = x0 + a(i)
        x1 = x1 + a(i+1)
        …
      enddo
      x = x0 + x1 + …
    1 cycles/iter; 50% peak

Recommendation: Your program should work correctly when compiled with -O3 -OPT:IEEE_arithmetic=3:roundoff=3

Summary

The compiler is the primary tool of program optimization.
• Compilation is the process of lowering the code representation from high level to low, i.e. processor, level
• The MIPSpro compiler targets the MIPS R1x000 processor and has the features of the processor and the Origin architecture built in
• A large number of options exist to steer the compilation process:
  – selection of ABI, ISA and optimization options
  – setting of assumptions about the program behaviour
• There are optimized and parallelized libraries of subroutines for scientific computation
• When programming for a digital computer, it is important to remember the limitations due to the limited validity range of floating point calculations
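The reduction reordering described under Rounding Accuracy can also be written by hand, which makes the floating-point caveat from the summary concrete. The following is a minimal sketch in portable C with hypothetical function names; it uses four partial sums rather than the eight shown in the slide, purely to keep the example short. Because floating-point addition is not associative, the two functions can return slightly different values for data of mixed magnitude, and they agree exactly only when every intermediate sum is representable.

```c
#include <stddef.h>

/* Reduction in source order: each addition depends on the previous one. */
double sum_in_order(const double *a, size_t n)
{
    double x = 0.0;
    for (size_t i = 0; i < n; i++)
        x += a[i];
    return x;
}

/* Reordered reduction: four independent partial sums that can be
   overlapped in the pipeline, combined once after the loop. */
double sum_four_accumulators(const double *a, size_t n)
{
    double x0 = 0.0, x1 = 0.0, x2 = 0.0, x3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        x0 += a[i];
        x1 += a[i + 1];
        x2 += a[i + 2];
        x3 += a[i + 3];
    }
    for (; i < n; i++)      /* remainder when n is not a multiple of 4 */
        x0 += a[i];
    return (x0 + x1) + (x2 + x3);
}
```

A program that produces acceptable results with both functions is insensitive to this kind of reordering, which is what the recommendation to test with roundoff=3 is checking for.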
