Parallel Computing Explained Porting Issues Slides Prepared from the CI-Tutor Courses at NCSA http://ci-tutor.ncsa.uiuc.edu/ By S. Masoud Sadjadi School of Computing and Information Sciences Florida International University March 2009 Agenda 1 Parallel Computing Overview 2 How to Parallelize a Code 3 Porting Issues 3.1 Recompile 3.2 Word Length 3.3 Compiler Options for Debugging 3.4 Standards Violations 3.5 IEEE Arithmetic Differences 3.6 Math Library Differences 3.7 Compute Order Related Differences 3.8 Optimization Level Too High 3.9 Diagnostic Listings 3.10 Further Information Porting Issues In order to run a computer program that presently runs on a workstation, a mainframe, a vector computer, or another parallel computer, on a new parallel computer you must first "port" the code. After porting the code, it is important to have some benchmark results you can use for comparison. To do this, run the original program on a well-defined dataset, and save the results from the old or “baseline” computer. Then run the ported code on the new computer and compare the results. If the results are different, don't automatically assume that the new results are wrong – they may actually be better. There are several reasons why this might be true, including: Precision Differences - the new results may actually be more accurate than the baseline results. Code Flaws - porting your code to a new computer may have uncovered a hidden flaw in the code that was already there. Detection methods for finding code flaws, solutions, and workarounds are provided in this lecture. Recompile Some codes just need to be recompiled to get accurate results. The compilers available on the NCSA computer platforms are shown in the following table: Language SGI Origin2000 MIPSpro Portland Group IA-32 Linux Intel GNU Portland Group Intel GNU g77 pgf77 ifort g77 pgf90 ifort Fortran 77 f77 ifort Fortran 90 f90 ifort Fortran 90 f95 ifort High Performance Fortran C C++ IA-64 Linux ifort pghpf cc CC pghpf icc icpc gcc g++ pgcc pgCC icc icpc gcc g++ Word Length Code flaws can occur when you are porting your code to a different word length computer. For C, the size of an integer variable differs depending on the machine and how the variable is generated. On the IA32 and IA64 Linux clusters, the size of an integer variable is 4 and 8 bytes, respectively. On the SGI Origin2000, the corresponding value is 4 bytes if the code is compiled with the –n32 flag, and 8 bytes if compiled without any flags or explicitly with the –64 flag. For Fortran, the SGI MIPSpro and Intel compilers contain the following flags to set default variable size. -in where n is a number: set the default INTEGER to INTEGER*n. The value of n can be 4 or 8 on SGI, and 2, 4, or 8 on the Linux clusters. -rn where n is a number: set the default REAL to REAL*n. The value of n can be 4 or 8 on SGI, and 4, 8, or 16 on the Linux clusters. Compiler Options for Debugging On the SGI Origin2000, the MIPSpro compilers include debugging options via the –DEBUG:group. The syntax is as follows: -DEBUG:option1[=value1]:option2[=value2]... Two examples are: Array-bound checking: check for subscripts out of range at runtime. -DEBUG:subscript_check=ON Force all un-initialized stack, automatic and dynamically allocated variables to be initialized. -DEBUG:trap_uninitialized=ON Compiler Options for Debugging On the IA32 Linux cluster, the Fortran compiler is equipped with the following –C flags for runtime diagnostics: -CA: pointers and allocatable references -CB: array and subscript bounds -CS: consistent shape of intrinsic procedure -CU: use of uninitialized variables -CV: correspondence between dummy and actual arguments Standards Violations Code flaws can occur when the program has non-ANSI standard Fortran coding. ANSI standard Fortran is a set of rules for compiler writers that specify, for example, the value of the do loop index upon exit from the do loop. Standards Violations Detection To detect standards violations on the SGI Origin2000 computer use the -ansi flag. This option generates a listing of warning messages for the use of non-ANSI standard coding. On the Linux clusters, the -ansi[-] flag enables/disables assumption of ANSI conformance. IEEE Arithmetic Differences Code flaws occur when the baseline computer conforms to the IEEE arithmetic standard and the new computer does not. The IEEE Arithmetic Standard is a set of rules governing arithmetic roundoff and overflow behavior. For example, it prohibits the compiler writer from replacing x/y with x *recip (y) since the two results may differ slightly for some operands.You can make your program strictly conform to the IEEE standard. To make your program conform to the IEEE Arithmetic Standards on the SGI Origin2000 computer use: f90 -OPT:IEEEarithmetic=n ... prog.f where n is 1, 2, or 3. This option specifies the level of conformance to the IEEE standard where 1 is the most stringent and 3 is the most liberal. On the Linux clusters, the Intel compilers can achieve conformance to IEEE standard at a stringent level with the –mp flag, or a slightly relaxed level with the –mp1 flag. Math Library Differences Most high-performance parallel computers are equipped with vendor-supplied math libraries. On the SGI Origin2000 platform, there are SGI/Cray Scientific Library (SCSL) and Complib.sgimath. SCSL contains Level 1, 2, and 3 Basic Linear Algebra Subprograms (BLAS), LAPACK and Fast Fourier Transform (FFT) routines. SCSL can be linked with –lscs for the serial version, or –mp – lscs_mp for the parallel version. The complib library can be linked with –lcomplib.sgimath for the serial version, or –mp –lcomplib.sgimath_mp for the parallel version. The Intel Math Kernel Library (MKL) contains the complete set of functions from BLAS, the extended BLAS (sparse), the complete set of LAPACK routines, and Fast Fourier Transform (FFT) routines. Math Library Differences On the IA32 Linux cluster, the libraries to link to are: For BLAS: -L/usr/local/intel/mkl/lib/32 -lmkl -lguide – lpthread For LAPACK: -L/usr/local/intel/mkl/lib/32 –lmkl_lapack - lmkl -lguide –lpthread When calling MKL routines from C/C++ programs, you also need to link with –lF90. On the IA64 Linux cluster, the corresponding libraries are: For BLAS: -L/usr/local/intel/mkl/lib/64 –lmkl_itp – lpthread For LAPACK: -L/usr/local/intel/mkl/lib/64 –lmkl_lapack – lmkl_itp –lpthread When calling MKL routines from C/C++ programs, you also need to link with -lPEPCF90 –lCEPCF90 –lF90 -lintrins Compute Order Related Differences Code flaws can occur because of the non-deterministic computation of data elements on a parallel computer. The compute order in which the threads will run cannot be guaranteed. For example, in a data parallel program, the 50th index of a do loop may be computed before the 10th index of the loop. Furthermore, the threads may run in one order on the first run, and in another order on the next run of the program. Note: : If your algorithm depends on data being compared in a specific order, your code is inappropriate for a parallel computer. Use the following method to detect compute order related differences: If your loop looks like change it to 1, -1 The results should not change if the iterations are DO I = 1, N DO I = N, independent Optimization Level Too High Code flaws can occur when the optimization level has been set too high thus trading speed for accuracy. The compiler reorders and optimizes your code based on assumptions it makes about your program. This can sometimes cause answers to change at higher optimization level. Setting the Optimization Level Both SGI Origin2000 computer and IBM Linux clusters provide Level 0 (no optimization) to Level 3 (most aggressive) optimization, using the –O{0,1,2, or 3} flag. One should bear in mind that Level 3 optimization may carry out loop transformations that affect the correctness of calculations. Checking correctness and precision of calculation is highly recommended when –O3 is used. For example on the Origin 2000 f90 -O0 … prog.f turns off all optimizations. Optimization Level Too High Isolating Optimization Level Problems You can sometimes isolate optimization level problems using the method of binary chop. To do this, divide your program prog.f into halves. Name them prog1.f and prog2.f. Compile the first half with -O0 and the second half with -O3 f90 -c -O0 prog1.f f90 -c -O3 prog2.f f90 prog1.o prog2.o a.out > results If the results are correct, the optimization problem lies in prog1.f Next divide prog1.f into halves. Name them prog1a.f and prog1b.f Compile prog1a.f with -O0 and prog1b.f with -O3 f90 -c -O0 prog1a.f f90 -c -O3 prog1b.f f90 prog1a.o prog1b.o prog2.o a.out > results Continue in this manner until you have isolated the section of code that is producing incorrect results. Diagnostic Listings The SGI Origin 2000 compiler will generate all kinds of diagnostic warnings and messages, but not always by default. Some useful listing options are: f90 f90 f90 f90 f90 -listing ... -fullwarn ... -showdefaults ... -version ... -help ... Further Information SGI man f77/f90/cc man debug_group man math man complib.sgimath MIPSpro 64-Bit Porting and Transition Guide Online Manuals Linux clusters pages (IA32, IA64, Intel64) Intel Fortran Compiler for Linux Intel C/C++ Compiler for Linux ifort/icc/icpc –help Agenda 1 Parallel Computing Overview 2 How to Parallelize a Code 3 Porting Issues 4 Scalar Tuning 4.1 Aggressive Compiler Options 4.2 Compiler Optimizations 4.3 Vendor Tuned Code 4.4 Further Information Scalar Tuning If you are not satisfied with the performance of your program on the new computer, you can tune the scalar code to decrease its runtime. This chapter describes many of these techniques: The use of the most aggressive compiler options The improvement of loop unrolling The use of subroutine inlining The use of vendor supplied tuned code The detection of cache problems, and their solution are presented in the Cache Tuning chapter. Aggressive Compiler Options For the SGI Origin2000 Linux clusters the main optimization switch is -On where n ranges from 0 to 3. -O0 turns off all optimizations. -O1 and -O2 do beneficial optimizations that will not effect the accuracy of results. -O3 specifies the most aggressive optimizations. It takes the most compile time, may produce changes in accuracy, and turns on software pipelining. Aggressive Compiler Options It should be noted that –O3 might carry out loop transformations that produce incorrect results in some codes. It is recommended that one compare the answer obtained from Level 3 optimization with one obtained from a lower-level optimization. On the SGI Origin2000 and the Linux clusters, –O3 can be used together with –OPT:IEEE_arithmetic=n (n=1,2, or 3) and –mp (or –mp1), respectively, to enforce operation conformance to IEEE standard at different levels. On the SGI Origin2000, the option -Ofast = ip27 is also available. This option specifies the most aggressive optimizations that are specifically tuned for the Origin2000 computer.