
Parallel Computing Explained
Porting Issues
Slides Prepared from the CI-Tutor Courses at NCSA
http://ci-tutor.ncsa.uiuc.edu/
By
S. Masoud Sadjadi
School of Computing and Information Sciences
Florida International University
March 2009
Agenda
1 Parallel Computing Overview
2 How to Parallelize a Code
3 Porting Issues
3.1 Recompile
3.2 Word Length
3.3 Compiler Options for Debugging
3.4 Standards Violations
3.5 IEEE Arithmetic Differences
3.6 Math Library Differences
3.7 Compute Order Related Differences
3.8 Optimization Level Too High
3.9 Diagnostic Listings
3.10 Further Information
Porting Issues
 To run, on a new parallel computer, a program that currently runs on a
workstation, a mainframe, a vector computer, or another parallel
computer, you must first "port" the code.
 After porting the code, it is important to have some benchmark results
you can use for comparison.
 To do this, run the original program on a well-defined dataset, and save the
results from the old or “baseline” computer.
 Then run the ported code on the new computer and compare the results.
 If the results are different, don't automatically assume that the new results
are wrong – they may actually be better. There are several reasons why this
might be true, including:
 Precision Differences - the new results may actually be more accurate than the baseline
results.
 Code Flaws - porting your code to a new computer may have uncovered a flaw
that was already present in the code.
 Detection methods for finding code flaws, solutions, and workarounds
are provided in this lecture.
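A minimal sketch of such a comparison (illustrative, not from the original slides; it assumes both runs wrote one number per line to files named baseline.dat and ported.dat, hypothetical names) reports the largest relative difference rather than demanding bit-identical output:
  program compare_results
    implicit none
    double precision :: a, b, maxdiff
    integer :: ios
    maxdiff = 0.0d0
    open(10, file='baseline.dat')
    open(11, file='ported.dat')
    do
      read(10, *, iostat=ios) a        ! baseline value
      if (ios /= 0) exit
      read(11, *) b                    ! corresponding ported value
      maxdiff = max(maxdiff, abs(a - b) / max(abs(a), tiny(a)))
    end do
    print *, 'largest relative difference:', maxdiff
  end program compare_results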
Recompile
 Some codes just need to be recompiled to get accurate results.
 The compilers available on the NCSA computer platforms are
shown in the following table:
Language                   SGI Origin2000     IA-32 Linux                      IA-64 Linux
                           (MIPSpro)          (Portland Group / Intel / GNU)   (Portland Group / Intel / GNU)
Fortran 77                 f77                pgf77 / ifort / g77              - / ifort / g77
Fortran 90                 f90                pgf90 / ifort / -                - / ifort / -
Fortran 95                 f95                - / ifort / -                    - / ifort / -
High Performance Fortran   -                  pghpf / - / -                    pghpf / - / -
C                          cc                 pgcc / icc / gcc                 - / icc / gcc
C++                        CC                 pgCC / icpc / g++                - / icpc / g++
Word Length
 Code flaws can occur when you are porting your code to a
different word length computer.
 For C, the size of an integer variable differs depending on the
machine and how the variable is generated. On the IA-32 and IA-64
Linux clusters, the size of an integer variable is 4 and 8 bytes,
respectively. On the SGI Origin2000, the corresponding value is 4
bytes if the code is compiled with the -n32 flag, and 8 bytes if
compiled without any flags or explicitly with the -64 flag.
 For Fortran, the SGI MIPSpro and Intel compilers contain the
following flags to set default variable size.
 -in where n is a number: set the default INTEGER to INTEGER*n.
The value of n can be 4 or 8 on SGI, and 2, 4, or 8 on the Linux
clusters.
 -rn where n is a number: set the default REAL to REAL*n. The value
of n can be 4 or 8 on SGI, and 4, 8, or 16 on the Linux clusters.
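As a quick check after porting, a minimal sketch like the following (illustrative, not from the original slides) prints the default kinds so the baseline and the new machine can be compared; compile it once with the default settings and once with, for example, -i8 or -r8:
  program word_length
    implicit none
    integer :: i
    real    :: r
    ! bit_size is standard Fortran 90 and reports the bits in a default INTEGER
    print *, 'default INTEGER bits:', bit_size(i)
    ! kind numbers usually, but not always, equal the size in bytes
    print *, 'default INTEGER kind:', kind(i), '  default REAL kind:', kind(r)
  end program word_length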
Compiler Options for Debugging
 On the SGI Origin2000, the MIPSpro compilers include
debugging options via the -DEBUG option group. The syntax is as
follows:
-DEBUG:option1[=value1]:option2[=value2]...
 Two examples are:
 Array-bound checking: check for subscripts out of range at
runtime.
-DEBUG:subscript_check=ON
 Force all uninitialized stack, automatic, and dynamically
allocated variables to be initialized.
-DEBUG:trap_uninitialized=ON
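For instance, a toy program like the one below (illustrative, not from the original slides) would be flagged at run time when built with both options:
-DEBUG:subscript_check=ON:trap_uninitialized=ON
  program bug
    implicit none
    integer :: a(10), i
    real    :: x, y
    i = 11
    a(i) = 1        ! out-of-range subscript; caught by subscript_check=ON
    y = 2.0 * x     ! x is never set; trap_uninitialized=ON exposes the read
    print *, a(1), y
  end program bug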
Compiler Options for Debugging
 On the IA-32 Linux cluster, the Fortran compiler is
equipped with the following -C flags for runtime
diagnostics:
 -CA: pointers and allocatable references
 -CB: array and subscript bounds
 -CS: consistent shape of intrinsic procedure
 -CU: use of uninitialized variables
 -CV: correspondence between dummy and actual
arguments
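For example, an illustrative debugging build on the Linux clusters might combine several of these checks (prog.f90 is a placeholder file name):
  ifort -CB -CU prog.f90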
Standards Violations
 Code flaws can occur when the program has non-ANSI
standard Fortran coding.
 ANSI standard Fortran is a set of rules for compiler writers that
specify, for example, the value of the do loop index upon exit
from the do loop.
 Standards Violations Detection
 To detect standards violations on the SGI Origin2000 computer
use the -ansi flag.
 This option generates a listing of warning messages for the use
of non-ANSI standard coding.
 On the Linux clusters, the -ansi[-] flag enables/disables
assumption of ANSI conformance.
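As an illustration of the kind of rule involved (not from the original slides): standard Fortran defines the DO index on exit as one increment past the last value it took inside the loop, so a legacy code that expects the final in-loop value is relying on non-standard behavior.
  program do_index
    implicit none
    integer :: i
    do i = 1, 10
    end do
    print *, i    ! a standard-conforming compiler prints 11, not 10
  end program do_index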
IEEE Arithmetic Differences
 Code flaws occur when the baseline computer conforms to the
IEEE arithmetic standard and the new computer does not.
 The IEEE Arithmetic Standard is a set of rules governing arithmetic
roundoff and overflow behavior.
 For example, it prohibits the compiler writer from replacing x/y
with x*recip(y), since the two results may differ slightly for some
operands. You can make your program strictly conform to the IEEE
standard.
 To make your program conform to the IEEE Arithmetic Standard
on the SGI Origin2000 computer use:
f90 -OPT:IEEE_arithmetic=n ... prog.f where n is 1, 2, or 3.
 This option specifies the level of conformance to the IEEE
standard where 1 is the most stringent and 3 is the most liberal.
 On the Linux clusters, the Intel compilers can achieve
conformance to the IEEE standard at a stringent level with the -mp
flag, or at a slightly relaxed level with the -mp1 flag.
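A minimal sketch of why this matters (illustrative, not from the original slides): it scans a few operand pairs and reports the largest difference between x/y and x*(1/y), the reciprocal-multiply form an aggressive optimizer might substitute.
  program ieee_diff
    implicit none
    double precision :: x, y, d
    integer :: i
    d = 0.0d0
    do i = 1, 1000
      x = 1.0d0 + i * 1.0d-3
      y = 3.0d0 + i * 1.0d-3
      d = max(d, abs(x / y - x * (1.0d0 / y)))
    end do
    print *, 'largest |x/y - x*(1/y)| difference:', d
  end program ieee_diff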
Math Library Differences
 Most high-performance parallel computers are equipped with
vendor-supplied math libraries.
 On the SGI Origin2000 platform, there are the SGI/Cray Scientific
Library (SCSL) and Complib.sgimath.
 SCSL contains Level 1, 2, and 3 Basic Linear Algebra Subprograms
(BLAS), LAPACK, and Fast Fourier Transform (FFT) routines.
 SCSL can be linked with -lscs for the serial version, or
-mp -lscs_mp for the parallel version.
 The complib library can be linked with -lcomplib.sgimath for the
serial version, or -mp -lcomplib.sgimath_mp for the parallel
version.
 The Intel Math Kernel Library (MKL) contains the complete set of
functions from BLAS, the extended BLAS (sparse), the complete
set of LAPACK routines, and Fast Fourier Transform (FFT)
routines.
Math Library Differences
 On the IA-32 Linux cluster, the libraries to link to are:
 For BLAS: -L/usr/local/intel/mkl/lib/32 -lmkl -lguide -lpthread
 For LAPACK: -L/usr/local/intel/mkl/lib/32 -lmkl_lapack -lmkl -lguide -lpthread
 When calling MKL routines from C/C++ programs, you also
need to link with -lF90.
 On the IA-64 Linux cluster, the corresponding libraries are:
 For BLAS: -L/usr/local/intel/mkl/lib/64 -lmkl_itp -lpthread
 For LAPACK: -L/usr/local/intel/mkl/lib/64 -lmkl_lapack -lmkl_itp -lpthread
 When calling MKL routines from C/C++ programs, you also
need to link with -lPEPCF90 -lCEPCF90 -lF90 -lintrins
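As a small end-to-end check of the math library link line, a sketch like the following (illustrative; DDOT is a standard Level 1 BLAS routine) can be built against MKL on the IA-32 cluster with, for example:
  ifort ddot_test.f90 -L/usr/local/intel/mkl/lib/32 -lmkl -lguide -lpthread
  program ddot_test
    implicit none
    double precision :: x(3), y(3)
    double precision, external :: ddot
    x = (/ 1.0d0, 2.0d0, 3.0d0 /)
    y = (/ 4.0d0, 5.0d0, 6.0d0 /)
    ! DDOT(n, dx, incx, dy, incy) returns the dot product of x and y
    print *, 'x . y =', ddot(3, x, 1, y, 1)    ! expect 32.0
  end program ddot_test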
Compute Order Related Differences
 Code flaws can occur because of the non-deterministic computation of
data elements on a parallel computer. The compute order in which the
threads will run cannot be guaranteed.
 For example, in a data parallel program, the 50th index of a do loop may be
computed before the 10th index of the loop. Furthermore, the threads may
run in one order on the first run, and in another order on the next run of the
program.
 Note: If your algorithm depends on data being compared in a specific order,
your code is inappropriate for a parallel computer.
 Use the following method to detect compute order related differences:
 If your loop looks like
DO I = 1, N
change it to
DO I = N, 1, -1
 The results should not change if the iterations are independent.
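A minimal sketch of this test (illustrative, not from the original slides): summing the same terms forward and backward gives answers that can differ in the last bits, because floating-point addition is not associative; a large discrepancy usually signals a genuine order dependence in the algorithm.
  program order_test
    implicit none
    double precision :: s_fwd, s_rev
    integer :: i, n
    n = 1000000
    s_fwd = 0.0d0
    do i = 1, n                       ! original iteration order
      s_fwd = s_fwd + 1.0d0 / dble(i)
    end do
    s_rev = 0.0d0
    do i = n, 1, -1                   ! reversed iteration order
      s_rev = s_rev + 1.0d0 / dble(i)
    end do
    print *, 'forward - reverse =', s_fwd - s_rev
  end program order_test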
Optimization Level Too High
 Code flaws can occur when the optimization level has been set too
high thus trading speed for accuracy.
 The compiler reorders and optimizes your code based on
assumptions it makes about your program. This can sometimes cause
answers to change at higher optimization level.
 Setting the Optimization Level
 Both the SGI Origin2000 computer and the IBM Linux clusters provide
Level 0 (no optimization) through Level 3 (most aggressive) optimization,
using the -O{0,1,2,3} flag. One should bear in mind that Level 3
optimization may carry out loop transformations that affect the
correctness of calculations. Checking the correctness and precision of
the calculation is highly recommended when -O3 is used.
 For example, on the Origin2000
 f90 -O0 … prog.f
turns off all optimizations.
Optimization Level Too High
 Isolating Optimization Level Problems
 You can sometimes isolate optimization level problems using the
method of binary chop.
 To do this, divide your program prog.f into halves. Name them prog1.f and
prog2.f.
 Compile the first half with -O0 and the second half with -O3
f90 -c -O0 prog1.f
f90 -c -O3 prog2.f
f90 prog1.o prog2.o
a.out > results
 If the results are correct, the optimization problem lies in prog1.f
 Next divide prog1.f into halves. Name them prog1a.f and prog1b.f
 Compile prog1a.f with -O0 and prog1b.f with -O3
f90 -c -O0 prog1a.f
f90 -c -O3 prog1b.f
f90 prog1a.o prog1b.o prog2.o
a.out > results
 Continue in this manner until you have isolated the section of code that is
producing incorrect results.
Diagnostic Listings
 The SGI Origin2000 compiler will generate all
kinds of diagnostic warnings and messages, but
not always by default. Some useful listing options
are:
f90 -listing ...
f90 -fullwarn ...
f90 -showdefaults ...
f90 -version ...
f90 -help ...
Further Information
 SGI
 man f77/f90/cc
 man debug_group
 man math
 man complib.sgimath
 MIPSpro 64-Bit Porting and Transition Guide
 Online Manuals
 Linux clusters pages
(IA32, IA64, Intel64)
 Intel Fortran Compiler for Linux
 Intel C/C++ Compiler for Linux
 ifort/icc/icpc -help
Agenda
 1 Parallel Computing Overview
 2 How to Parallelize a Code
 3 Porting Issues
 4 Scalar Tuning
 4.1 Aggressive Compiler Options
 4.2 Compiler Optimizations
 4.3 Vendor Tuned Code
 4.4 Further Information
Scalar Tuning
 If you are not satisfied with the performance of your
program on the new computer, you can tune the scalar code
to decrease its runtime.
 This chapter describes many of these techniques:
 The use of the most aggressive compiler options
 The use of loop unrolling
 The use of subroutine inlining
 The use of vendor-supplied tuned code
 The detection of cache problems, and their solutions, is
presented in the Cache Tuning chapter.
Aggressive Compiler Options
 For the SGI Origin2000 and the Linux clusters, the main
optimization switch is
-On where n ranges from 0 to 3.
-O0 turns off all optimizations.
-O1 and -O2 do beneficial optimizations that will not
affect the accuracy of results.
-O3 specifies the most aggressive optimizations. It takes
the most compile time, may produce changes in
accuracy, and turns on software pipelining.
Aggressive Compiler Options
 It should be noted that -O3 might carry out loop
transformations that produce incorrect results in some codes.
 It is recommended that one compare the answer obtained from
Level 3 optimization with one obtained from a lower-level
optimization.
 On the SGI Origin2000 and the Linux clusters, -O3 can be
used together with -OPT:IEEE_arithmetic=n (n=1, 2, or
3) and -mp (or -mp1), respectively, to enforce operation
conformance to the IEEE standard at different levels.
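For example (illustrative command lines combining the flags named above; prog.f and prog.f90 are placeholder file names):
  f90 -O3 -OPT:IEEE_arithmetic=1 prog.f          (Origin2000)
  ifort -O3 -mp prog.f90                         (Linux clusters)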
 On the SGI Origin2000, the option
-Ofast=ip27
is also available. This option specifies the most aggressive
optimizations that are specifically tuned for the Origin2000
computer.