
Optimizing Applications on Blue Waters
Blue Waters Institute June 5, 2014
Victor Anisimov
NCSA Science and Engineering Applications Support
Overview
• Hardware
• Topology Aware Scheduling
• Balanced Injection: Sharing the Network
• Compilers
• Libraries
• Profiling: Finding Hot Spots in the Code
• File System and I/O Performance Optimization
• Training materials: cp -pr /u/staff/anisimov/training_bw .
Hardware Overview
Blue Waters – Configuration
• GPU: NVIDIA K20X (Kepler GK110)
• CPU: AMD 6276 Interlagos (2.3 – 2.6 GHz)
• 3D Torus Network Topology
• 22,640 XE6 nodes (Dual CPU), 64 GB RAM each
• 4,224 XK7 nodes (CPU+GPU), 32 GB RAM each
• File system 26.4 PB, Aggregate I/O bandwidth 1 TB/s
Cray XE6 Blade has 2 Compute nodes
Node Characteristics:
• Number of Cores: 32 cores (2 AMD 6276 Interlagos)
• Peak Performance: 313 Gflops/sec
• Memory Size: 64 GB per node
• Memory Bandwidth (Peak): 102.4 GB/sec
AMD 6276 Interlagos Processor
• Compute node contains
  • XE6: 2 processors
• Processor has 2 NUMA nodes
• NUMA node has 4 Bulldozer modules
• Each Bulldozer module has a single FP unit and 2 integer cores
[Figure: processor socket diagram showing the two NUMA nodes, each with a memory controller, shared L3 cache, and NB/HT links]
Job Submitting Performance Options: XE6
BW node: 64 GB RAM, 16 Bulldozer cores, 32 cores, 4 NUMA domains
• aprun enumerates cores from 0 to 31; each pair represents a BD core
  • A BD core consists of 2 compute cores sharing a single FP unit
  • Default task placement: fill up from 0 to 31
  • -N16 will use cores 0,1,…,15, utilizing only 8 FP units
  • -N16 -d2 will utilize cores 0,2,4,6,8,10,12,14,…,24,26,28,30 and 16 FP units
• aprun options to specify a particular number of tasks per XE6 node (example script below)
  • -N1           1 process
  • -N2 -cc 0,16  2 processes
  • -N4 -d8       4 processes, one per NUMA node (cache optimal)
  • -N8 -d4       8 processes, two per NUMA node
  • -N16 -d2      16 processes, interleaved, 16 BD cores
  • -N32          32 processes (may improve performance over -N16)
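For illustration, a minimal batch script that requests one XE6 node and places 16 MPI ranks on distinct BD cores, one task per FP unit (the walltime, node count, and executable name are placeholders, not part of the training materials):

  #PBS -l nodes=1:ppn=32
  #PBS -l walltime=00:10:00
  cd $PBS_O_WORKDIR
  # 16 tasks with depth 2: ranks land on cores 0,2,4,...,30, so each rank owns its own FP unit
  aprun -n16 -N16 -d2 ./a.out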
Test 01: Optimal MPI task placement
• Question:
  • When can 16 cores do a better job than 32 cores?
  • How do we optimally use fewer than 32 cores per XE6 node?
• Running the test:
  • Show task placement: export MPICH_CPUMASK_DISPLAY=1
  • cc app.c; qsub run
  • aprun -n32     ./a.out   (Time = 7.613)
  • aprun -n16     ./a.out   (Time = 5.995)
  • aprun -n16 -d2 ./a.out   (Time = 5.640)
• Take home lesson: Contention will degrade performance.
  Test different task placements and choose the optimal one
  before starting the production computations.
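The contention comes from the two integer cores of a Bulldozer module sharing one FP unit. A minimal sketch of an FP-bound MPI kernel of the kind this test exercises (hypothetical; not the actual app.c from the training materials):

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv) {
      MPI_Init(&argc, &argv);
      int rank;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      /* Floating-point-bound loop: when two ranks share one FP unit,
         their divides and adds compete and the wall time goes up. */
      double s = 0.0, t0 = MPI_Wtime();
      for (long i = 1; i <= 200000000L; i++)
          s += 1.0 / (double)i;

      double dt = MPI_Wtime() - t0, tmax;
      MPI_Reduce(&dt, &tmax, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
      if (rank == 0) printf("Time = %.3f (sum %.6f)\n", tmax, s);
      MPI_Finalize();
      return 0;
  }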
3D Torus in XY plane
Cray 3D Torus Topology
VMD image of 3D torus
• Torus is a periodic box in 3D
• XK nodes – red
• Service nodes – blue
• Compute nodes – gray
• Slower – Y, X, Z – Faster
• Routing X, then Y, then Z
• Routing path depends on application placement on the 3D torus
Reference: Bob Fiedler, Cray Inc.:
https://bluewaters.ncsa.illinois.edu/documents/10157/12008/AdvancedFeatures_PRAC_WS_2013-02-27.pdf
Why Does My Job Run Slow - Performance Variation
• Dedicated-machine performance
• Performance variation due to job-job interaction
• Yellow – defragmented 1000-node job, red – XK nodes, blue – service nodes, other jobs not shown
• Example of defragmented node allocation
Moab: Nodesets
Available Shapes and Sizes (number of
nodes in them):
Sheets: 1100, 2200, 6700, 8200, 12300
Bars: 6700
Cubes: 3300
Yellow: Job placement in a cube nodeset
Specifying a nodeset in PBS script:
#PBS -l nodes=3360:ppn=32
#PBS -l nodeset=ONEOF:FEATURE:c1_3300n:c2_3300n:c3_3300n:
c5_3300n:c6_3300n:c7_3300n
Moab: Hostlist
Yellow – Job placement in 3D torus
#PBS -l nodes=1000:ppn=32
#PBS -l hostlist=1152+1153+1154+1155+1156+…+8207^
Test 02: Optimal job placement on 3D torus
• Challenge: Choose a pair of adjacent nodes on the 3D torus. Find
  the application performance on 0-1, Y-Y, and Z-Z links.
• Running the test:
  • Show node ids and X,Y,Z coordinates: getnodexyz.sh
  • cc app.c; qsub run
  • Example of 0-1, Z-Z, Y-Y links:
    • aprun -n64 -L 24124,24125 ./a.out
    • aprun -n64 -L 24125,24126 ./a.out
    • aprun -n64 -L 24108,24178 ./a.out
    • (non-local) aprun -n64 -L 23074,24102 ./a.out
  • Try your own node ids to test your understanding
• Take home lesson: Different links have different network
  bandwidth: 0-1 > Z-Z = X-X > Y-Y
Congestion Protection
• Network congestion is a condition that
occurs when the volume of traffic on the
high-speed network (HSN) exceeds the
capacity to handle it.
• To "protect" the network from data loss,
congestion protection (CP) globally
“throttles” injection bandwidth per-node.
• If CP happens often, application
performance degrades.
• At job completion you might see the following message reported to stdout:
Application 61435 network throttled: 4459 nodes throttled, 25:31:21 node-seconds
Application 61435 balanced injection 100, after throttle 63
• Throttling disrupts the work on the entire machine.
Types of congestion events
• There are two main forms of congestion: many-to-one and long-path.
The former is easy to detect and correct. The latter is harder to detect
and may not be correctable.
• Many-to-one congestion occurs in some algorithms and can be
  corrected. See “Modifying Your Application to Avoid Gemini Network
  Congestion Errors” in the balanced injection section of the portal.
• Long-path congestion is typically due to a combination of
communication pattern and node allocation. It can also be due to a
combination of jobs running on the system.
• We monitor for cases of congestion protection and try to determine
the most likely cause.
Congestion on a Shared, Torus Network
• HSN uses dimension-ordered routing (X-then-Y-then-Z) between two
  locations on the torus; note that the route A→B differs from B→A.
  [Figure: routes between nodes A and B on the torus]
• The shortest route can sometimes cause traffic to pass through
  geminis used by other jobs.
• Non-compact node allocations can have traffic that passes through
  geminis used by other jobs.
• I/O traffic can lead to network hot-spots.
• We are working with Adaptive and Cray on eliminating some of the
above causes of congestion with better node allocation: shape,
location, etc.
Balanced Injection
• Balanced Injection (BI) is a mechanism that reduces compute node
  injection bandwidth in order to prevent throttling; for certain
  communication patterns it can improve application performance.
• BI can be applied per-job using an environment variable or a
  user-accessible API.
  • export APRUN_BALANCED_INJECTION=63
  • Can be set from 1 to 100 (100 = no BI).
  • There is no linear relation between the BI value and application performance.
• MPI-based applications have “balanced injection” enabled in
  collective MPI calls that locally “throttle” injection bandwidth.
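As a usage sketch (the job size and executable name are assumptions, not from the training materials), BI can be enabled for a single job by exporting the variable before the aprun call in the batch script:

  #PBS -l nodes=128:ppn=32
  cd $PBS_O_WORKDIR
  # Throttle this job's injection bandwidth to 63%; 100 would mean no BI
  export APRUN_BALANCED_INJECTION=63
  aprun -n4096 ./a.out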
Compilers
Available Compilers
• Cray Compilers - Cray Compiling Environment (CCE)
• Fortran 2003, Co-arrays, UPC, PGAS, OpenACC
• GNU Compiler Collection (GCC)
• Portland Group Inc (PGI) Compilers
• OpenACC
• Intel Compilers (to be available soon)
• All compilers provide Fortran, C, C++, OpenMP support
• Use the cc, CC, ftn wrappers for C, C++, and Fortran
• So which compiler do I choose?
• Experiment with various compilers
• Work with BW staff
• Mixing libraries created by different compilers may cause issues
Compiler Choices – Relative Strength
• CCE – outstanding Fortran, very good C, and okay C++
• Very good vectorization
• Very good Fortran language support; only real choice for
coarrays
• C support is very good, with UPC support
• Very good scalar optimization and automatic parallelization
• Clean implementation of OpenMP 3.0 with tasks
• Cleanest integration with other Cray tools (Performance tools,
debuggers, upcoming productivity tools)
• No inline assembly support
• Excellent support from Cray (bugs, issues, performance etc)
Compiler Choices – Relative Strength
• PGI – very good Fortran, okay C and C++
• Good vectorization
• Good functional correctness with optimization
enabled
• Good automatic prefetch capabilities
• Company (NVIDIA) focused on HPC market
• Excellent working relationship with Cray
• Slow bug fixing
Compiler Choices – Relative Strength
• GNU – good Fortran, outstanding C and C++
• Obviously, the best gcc compatibility
• Scalable optimizer was recently rewritten and is very good
• Vectorization capabilities focus mostly on inline assembly
• A few releases have been incompatible with each other and
  require recompilation of modules (4.3, 4.4, 4.5)
• General-purpose focus, not necessarily HPC
Recommended CCE Compilation Options
• Use default optimization levels
  • It is the equivalent of most other compilers' -O3 or -fast
• Use -O3 -hfp3 (or some variation)
  • -O3 gives slightly more than -O2
  • -hfp3 gives a lot more floating point optimizations, especially 32-bit
• If an application is intolerant of floating point reassociation, try a lower
  hfp number: try -hfp1 first, and -hfp0 only if absolutely necessary
  • Might be needed for tests that require strict IEEE conformance
  • Or applications that have validated results from a different compiler
• Do not suggest using -Oipa5, -Oaggress and so on; higher
  numbers are not always correlated with better performance
• Compiler feedback: -rm (Fortran), -hlist=m (C)
• If you don't want OpenMP: -xomp or -Othread0 or -hnoomp
• Manpages: crayftn, craycc, crayCC
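A typical aggressive CCE build that also requests optimization feedback might look like the following (file names are placeholders; the flags are the ones listed above):

  ftn -O3 -hfp3 -rm -o app.x app.f90      # Fortran, with loopmark feedback (-rm)
  cc  -O3 -hfp3 -hlist=m -o app.x app.c   # C, with optimization listing (-hlist=m)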
Starting Point for PGI Compilers
• Suggested Option : -fast
• Interprocedural analysis allows the compiler to perform
whole program optimizations : -Mipa=fast(,safe)
• If you can be flexible with precision, also try -Mfprelaxed
• The option -Msmartalloc, which calls the subroutine mallopt in the
  main routine, can have a dramatic impact on the performance of
  programs that use dynamic memory allocation
• Compiler feedback : -Minfo=all, -Mneginfo
• Manpages : pgf90, pgcc, pgCC
Additional PGI Compiler Options
• -default64 : Fortran driver option for -i8 and -r8
• -i8, -r8 : Treats INTEGER and REAL variables in
Fortran as 8 bytes (use ftn -default64 option to
link the right libraries)
• -byteswapio : Reads big endian files in fortran
• -Mnomain : Uses ftn driver to link programs with
the main program (written in C or C++) and one
or more subroutines (written in fortran)
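For example, a possible PGI build combining the options from these two slides (the file and object names are placeholders):

  ftn -fast -Mipa=fast,safe -Minfo=all -byteswapio -o app.x app.f90
  ftn -Mnomain -o mixed.x cmain.o fsubs.o    # main() written in C, subroutines in Fortran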
Starting Point for GNU Compilers
• -O3 -ffast-math -funroll-loops
• Compiler feedback: -ftree-vectorizer-verbose=2
• Manpages : gfortran, gcc, g++
Test 03: Manual code optimization
• Challenge: Find and fix two programming blunders in the
  test code. Modify app2.c and app3.c so their runtimes satisfy
  T(app1) > T(app2) > T(app3).
• Running the test:
  • cc -hlist=m -o app1.x app1.c
  • cc -hlist=m -o app2.x app2.c
  • cc -hlist=m -o app3.x app3.c
  • qsub run
• The test code computes sum = Σ_{x=1}^{1000000} Σ_{y=1}^{1000} x/y
• Target: T(app1)=5.665s, T(app2)=0.242s, T(app3)=0.031s
Test 03: Solution
sum = Σ_{x=1}^{1000000} Σ_{y=1}^{1000} x/y
[Slide shows the source listings of app1.c, app2.c, and app3.c side by side]
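The actual listings appear only as images on the slide. Purely as an illustration of the kind of blunders and fixes involved (assuming the kernel sums x/y in a double loop; this is not the real app1/app2/app3 source), one plausible progression is:

  #include <stdio.h>

  int main(void) {
      /* app1-style: naive double loop, one FP divide per inner iteration */
      double sum1 = 0.0;
      for (long x = 1; x <= 1000000; x++)
          for (long y = 1; y <= 1000; y++)
              sum1 += (double)x / (double)y;

      /* app2-style: hoist the x-independent harmonic sum out of the inner loop */
      double h = 0.0;
      for (long y = 1; y <= 1000; y++) h += 1.0 / (double)y;
      double sum2 = 0.0;
      for (long x = 1; x <= 1000000; x++) sum2 += (double)x * h;

      /* app3-style: also use the closed form sum_x x = n(n+1)/2 */
      double sum3 = h * (1000000.0 * 1000001.0 / 2.0);

      printf("%.6e %.6e %.6e\n", sum1, sum2, sum3);
      return 0;
  }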
Libraries
Libraries: Where to Start
• Libraries motto: Reuse rather than reinvent
• Libraries are tailored to the Programming Environment
  • Choose programming environment: PrgEnv-[cray,gnu,pgi]
  • Load library module: module load <libname>
  • See actual path: module show <libname>
  • Location path: ls -l $CRAY_LIBSCI_PREFIX_DIR/lib
  • Use compiler wrappers: ftn, cc, CC
• For most applications, the default settings work very well
• OpenMP-threaded BLAS/LAPACK libraries are available
  • The serial version is used if OMP_NUM_THREADS is not set or is set to 1
PETSc (Argonne National Laboratory)
• Programmable, Extensible Toolkit for Scientific
Computing
• Widely-used collection of many different types of linear
and non-linear solvers
• Actively under development; very responsive team
• Can also interface with numerous optional external
packages (e.g., SLEPC, HYPRE, ParMETIS, …)
• Optimized version installed by Cray, along with many
external packages
• Use “module load petsc[/version]”
Other Numerical Libraries
• ACML (AMD Core Math Library)
• BLAS, LAPACK, FFT, Random Number Generators
• Trilinos (from Sandia National Laboratories)
• Somewhat similar to PETSc, interfaces to a large
collection of preconditioners, solvers, and other
computational tools
• GSL (GNU Scientific Library)
• Collection of numerous computational solvers and tools
for C and C++ programs
• See all available modules “module avail”
Cray Scientific Library (libsci)
• Contains optimized versions of several popular
scientific software routines
• Available by default; see available versions with
“module avail cray-libsci” and load particular version
“module load cray-libsci[/version]”
• BLAS, BLACS
• LAPACK, ScaLAPACK
• FFT, FFTW
• Unique to Cray (affects portability)
• CRAFFT, CASE, IRT
Cray Accelerated Libraries
Cray LibSci accelerated BLAS, LAPACK, and ScaLAPACK libraries
PrgEnv-cray or PrgEnv-gnu Programming Environment
module add craype-accel-nvidia35
call libsci_acc_init()
… (your code) …
call libsci_acc_finalize()
The Library interface automatically initiates appropriate execution mode (CPU, GPU, Hybrid).
When BLAS or LAPACK routines are called from applications built with the Cray or GNU
compilers, Cray LibSci automatically loads and links libsci_acc libraries upon execution if it
determines performance will be enhanced.
Execution control from source code:
  routine_name       invokes the automated method
  routine_name_cpu   executed on host CPU only
  routine_name_acc   executed on GPU only
See “man intro_libsci_acc” for more information.
Cray Accelerated Libraries, continued
Libsci_acc is not thread safe. It will fail when called concurrently from OpenMP threads.
ENVIRONMENT VARIABLES
CRAY_LIBSCI_ACC_MODE
  Specifies execution mode for libsci_acc routines:
  0  Use automated mode. Adds slight overhead. This is the default.
  1  Forces all supported auto-tuned routines to execute on the accelerator.
  2  Forces all supported routines to execute on the CPU if the data is located
     within host processor address space.
LIBSCI_ACC_BYPASS_FUNCTION
  Specifies the execution mode for FUNCTION.
  0  Use automated mode. Adds slight overhead. This is the default.
  1  Call the version that handles data addresses resident on the GPU.
  2  Call the version that handles data addresses resident on the CPU.
  3  For SGEMM, DGEMM, CGEMM, ZGEMM, this will call the accelerated version.
Warning: Passing a wrong CPU / GPU address will cause the program to crash.
Test 04: Speed up matrix multiplication by using
OpenMP-threaded AMD Core Math Library
Challenge: Speed up the application by using a threaded library
Running the test:
  module swap PrgEnv-cray PrgEnv-pgi
  module add acml
  cc -lacml -Wl,-ydgemm_ -o app1.x app.c
  /opt/acml/5.3.1/pgi64_fma4/lib/libacml.a(dgemm.o): definition of dgemm_
  cc -lacml -Wl,-ydgemm_ -mp=nonuma -o app2.x app.c
  /opt/acml/5.3.1/pgi64_fma4/lib/libacml.a(dgemm.o): definition of dgemm_
  export OMP_NUM_THREADS=32
  aprun -n1 -cc none -d32 ./app1.x
  aprun -n1 -cc none -d32 ./app2.x
  aprun -n1          -d32 ./app2.x
Output: Time = 2.938, Time = 0.213, Time = 0.302
Take home lesson: Reuse instead of reinvent. “-Wl,-ydgemm_” tells where
dgemm() was resolved from. “-cc none” allows OpenMP threads to migrate.
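The app.c used by the test is not reproduced on the slide. A minimal sketch of a program of this kind, calling the Fortran-interface dgemm_ resolved from ACML (the matrix size and timing details are assumptions):

  #include <stdio.h>
  #include <stdlib.h>
  #include <time.h>

  /* Fortran BLAS interface from libacml: all arguments are passed by reference */
  void dgemm_(const char *transa, const char *transb, const int *m, const int *n,
              const int *k, const double *alpha, const double *a, const int *lda,
              const double *b, const int *ldb, const double *beta,
              double *c, const int *ldc);

  int main(void) {
      int n = 4000;                         /* assumed problem size */
      double *a = malloc((size_t)n * n * sizeof *a);
      double *b = malloc((size_t)n * n * sizeof *b);
      double *c = malloc((size_t)n * n * sizeof *c);
      for (long i = 0; i < (long)n * n; i++) { a[i] = 1.0; b[i] = 2.0; c[i] = 0.0; }

      double alpha = 1.0, beta = 0.0;
      struct timespec t0, t1;
      clock_gettime(CLOCK_MONOTONIC, &t0);
      /* C = alpha*A*B + beta*C; runs threaded when OMP_NUM_THREADS > 1 and ACML's OpenMP build is linked */
      dgemm_("N", "N", &n, &n, &n, &alpha, a, &n, b, &n, &beta, c, &n);
      clock_gettime(CLOCK_MONOTONIC, &t1);

      printf("Time = %.3f\n", (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec));
      free(a); free(b); free(c);
      return 0;
  }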
Profiling
The Cray Performance Analysis Tools
Supports traditional post-mortem performance analysis
• Automatic identification of performance problems
o Indication of causes of problems
o Suggestions of modifications for performance improvement
• pat_build: provides automatic instrumentation
• CrayPAT run-time library collects measurements (transparent to the user)
• pat_report performs analysis and generates text reports
• pat_help: online help utility
To start working with CrayPAT:
o module load perftools
o http://docs.cray.com/books/S-2376-612/S-2376-612.pdf
Application Instrumentation with pat_build
• Supports two categories of experiments
− asynchronous experiments (sampling) capture values
from the call stack or the program counter at specified
intervals or when a specified counter overflows
− Event-based experiments (tracing) count some events
such as the number of times a specific system call is
executed
• While tracing provides the most detailed information, it can be
  very heavy if the application runs on a large number of cores
  for a long period of time
• Sampling (-S; default -Oapa = sampling + HWPC + MPI
  tracing) can be useful as a starting point, to provide a first
  overview of the work distribution
Example Runtime Environment Variables
• Optional timeline view of program available
• export PAT_RT_SUMMARY=0 (minimize volume of tracing data)
• Request hardware performance counter information:
• export PAT_RT_PERFCTR=1 (FLOP count)
Predefined Trace Wrappers (-g tracegroup)
• blas      Basic Linear Algebra subprograms
• caf       Co-Array Fortran (Cray CCE compiler only)
• hdf5      manages extremely large data collection
• heap      dynamic heap
• io        includes stdio and sysio groups
• lapack    Linear Algebra Package
• math      ANSI math
• mpi       MPI
• omp       OpenMP API
• pthreads  POSIX threads
• shmem     SHMEM
• sysio     I/O system calls
• system    system calls
• upc       Unified Parallel C (Cray CCE compiler only)
For a full list, please see pat_build(1) man page
Example Experiments
• > pat_build -O apa
  • Gets top time consuming routines
  • Least overhead
• > pat_build -u -g mpi ./my_program
  • Collects information about user functions and MPI
• > pat_build -w ./my_program
  • Collects information for MAIN
  • Lightest-weight tracing
• > pat_build -g netcdf,mpi ./my_program
  • Collects information about netcdf routines and MPI
Steps to Collecting Performance Data
• Access performance tools software
  % module load perftools
• Build application keeping .o files (CCE: -h keepfiles)
  % make clean ; make
• Instrument application for automatic profiling analysis
  (you should get an instrumented program a.out+pat)
  % pat_build a.out
• Run application to get top time consuming routines
  % aprun … a.out+pat (or qsub <pat script>)
Steps to Collecting Performance Data (2)
• You should get a performance file (“<sdatafile>.xf”) or
multiple files in a directory <sdatadir>
• Generate report
% pat_report <sdatafile>.xf > sampling_report
Example: HW counter data
PAPI_TLB_DM Data translation lookaside buffer misses
PAPI_L1_DCA Level 1 data cache accesses
PAPI_FP_OPS Average Rate of Floating point operations per single MPI task
MFLOPS (aggregate) Total FLOP rate of the entire job
FLOP rate shows the efficiency of hardware utilization
========================================================================
USER
------------------------------------------------------------------------
 Time%                                   98.3%
 Time                                4.434402 secs
 Imb.Time                                  -- secs
 Imb.Time%                                 --
 Calls                     0.001M/sec        4500.0 calls
 PAPI_L1_DCM              14.820M/sec      65712197 misses
 PAPI_TLB_DM               0.902M/sec       3998928 misses
 PAPI_L1_DCA             333.331M/sec    1477996162 refs
 PAPI_FP_OPS             445.571M/sec    1975672594 ops
 User time (approx)        4.434 secs   11971868993 cycles   100.0%Time
 Average Time per Call              0.000985 sec
 CrayPat Overhead : Time                 0.1%
 HW FP Ops / User time   445.571M/sec    1975672594 ops      4.1%peak(DP)
 HW FP Ops / WCT         445.533M/sec
 Computational intensity    0.17 ops/cycle     1.34 ops/ref
 MFLOPS (aggregate)      1782.28M/sec
 TLB utilization          369.60 refs/miss    0.722 avg uses
 D1 cache hit,miss ratios   95.6% hits         4.4% misses
 D1 cache utilization (misses)  22.49 refs/miss   2.811 avg hits
========================================================================
Example: Time spent in User and MPI functions
Table 1: Profile by Function

 Samp % | Samp | Imb. Samp | Imb. Samp % | Group
        |      |           |             |  Function
        |      |           |             |   PE='HIDE'

 100.0% |  775 |        -- |          -- | Total
|---------------------------------------------------------
|  94.2% |  730 |        -- |          -- | USER            (computation intensity)
||--------------------------------------------------------
||  43.4% |  336 |      8.75 |        2.6% | mlwxyz_
||  16.1% |  125 |      6.28 |        4.9% | half_
||   8.0% |   62 |      6.25 |        9.5% | full_
||   6.8% |   53 |      1.88 |        3.5% | artv_
||   4.9% |   38 |      1.34 |        3.6% | bnd_
||   3.6% |   28 |      2.00 |        6.9% | currenf_
||   2.2% |   17 |      1.50 |        8.6% | bndsf_
||   1.7% |   13 |      1.97 |       13.5% | model_
||   1.4% |   11 |      1.53 |       12.2% | cfl_
||   1.3% |   10 |      0.75 |        7.0% | currenh_
||   1.0% |    8 |      5.28 |       41.9% | bndbo_
||   1.0% |    8 |      8.28 |       53.4% | bndto_
||========================================================
|   5.4% |   42 |        -- |          -- | MPI             (communication intensity)
||--------------------------------------------------------
||   1.9% |   15 |      4.62 |       23.9% | mpi_sendrecv_
||   1.8% |   14 |     16.53 |       55.0% | mpi_bcast_
||   1.7% |   13 |      5.66 |       30.7% | mpi_barrier_
|==========================================================
Test 05: Obtain FLOP count and User/MPI time
• Challenge: What is the application FLOP count? Use CrayPat
  to perform a profiling experiment and compare the result with
  the manual count for sum = Σ_{x=1}^{1000000} Σ_{y=1}^{1000} x/y
• Running the test:
  • module add perftools
  • cc -c app.c                              // compile the application
  • cc -o app.x app.o                        // link the application
  • pat_build app.x                          // instrument the application
  • qsub run                                 // submit the job
  • aprun -n16 -d2 ./app.x+pat
  • pat_report app.x+pat*.xf > report.out
• Results:
  • MPI: 88.9%; USER: 11.1%; communication intensive job
  • FLOP count = 2 GF (hint – convert FLOP rate to FLOP count)
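To make the hint concrete: pat_report gives a FLOP rate (e.g. PAPI_FP_OPS or "HW FP Ops / User time" in M/sec) and the user time, so FLOP count = rate × time. The manual count, assuming one divide and one add per inner iteration of the double loop, is

  FLOPs = 10^6 × 10^3 iterations × 2 FLOPs/iteration = 2 × 10^9 = 2 GF,

which is the value the CrayPat-derived count should match.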
I/O optimization
[Figure: Blue Waters I/O software stack – Scientist → Application → I/O libraries and middleware (HDF5, PnetCDF, Adios, Damaris, MPI-IO, IOBUF) → parallel file system (Lustre), with Darshan as an I/O characterization utility]
Lustre File System: Striping
[Figure: physical vs. logical view of a file striped across OSTs]
• File striping: single files are distributed across a
series of OSTs
• File size can grow to the aggregate size of available
OSTs (rather than a single disk)
• Accessing multiple OSTs concurrently increases I/O
bandwidth
Performance Impact: Configuring File Striping
• lfs is the Lustre utility for viewing/setting file striping info
  • Stripe count – the number of OSTs across which the file can be striped
  • Stripe size – the size of the blocks that a file will be broken into
  • Stripe offset – the ID of an OST for Lustre to start with, when deciding which
    OSTs a file will be striped across (leave at default value)
  • Configurations should focus on stripe count/size
• Blue Waters defaults:
  $> touch test
  $> lfs getstripe test
  test
  lmm_stripe_count:   1
  lmm_stripe_size:    1048576
  lmm_stripe_offset:  708
        obdidx       objid        objid      group
           708     2161316     0x20faa4          0
Setting Striping Patterns
$> lfs setstripe -c 5 -s 32m test
$> lfs getstripe test
test
lmm_stripe_count:   5
lmm_stripe_size:    33554432
lmm_stripe_offset:  1259
      obdidx       objid        objid      group
        1259     2162557     0x20ff7d          0
        1403     2165796     0x210c24          0
         955     2163063     0x210177          0
        1139     2161496     0x20fb58          0
         699     2161171     0x20fa13          0
• Note: a file's striping pattern is permanent, and set upon creation
  • lfs setstripe creates a new, 0 byte file
  • The striping pattern can be changed for a directory; every new file or directory
    created within will inherit its striping pattern
  • Simple API available for configuring striping – portable to other Lustre systems
    (see the sketch below)
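As a sketch of that API (assuming the standard Lustre user-space library, liblustreapi, and its llapi_file_create call; compile and link with -llustreapi):

  #include <stdio.h>
  #include <lustre/lustreapi.h>

  int main(void) {
      /* Create "output.dat" with a 32 MB stripe size, default starting OST (-1),
         a stripe count of 5, and the default RAID0 stripe pattern (0). */
      int rc = llapi_file_create("output.dat", 32 * 1048576ULL, -1, 5, 0);
      if (rc) {
          fprintf(stderr, "llapi_file_create failed: %d\n", rc);
          return 1;
      }
      /* The file now exists with the requested striping; open and write to it as usual. */
      return 0;
  }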
IOBUF – I/O Buffering Library
• Optimize I/O performance with minimal effort
• Asynchronous prefetch
• Write back caching
• stdin, stdout, stderr disabled by default
• No code changes needed
• module load iobuf
• Recompile & relink the code
[Figure: I/O stack – Application → IOBUF → Linux I/O infrastructure → File Systems / Lustre]
• Ideal for sequential read or write operations
IOBUF – I/O Buffering Library
• Enable or disable globally by setting or unsetting IOBUF_PARAMS
• Fine-grained control
  • Control buffer size, count, synchronicity, prefetch
  • Disable iobuf per file
Example:
export IOBUF_PARAMS='*:verbose'
export IOBUF_PARAMS=
'*.in:count=4:size=32M,*.out:count=8:size=64M'
IOBUF – MPI-IO Sample Verbose Output
I/O Utility: Darshan
• Darshan was developed at Argonne
• It is “a scalable HPC I/O characterization tool…
designed to capture an accurate picture of application
I/O behavior… with minimum overhead”
• I/O Characterization
  • Sheds light on the intricacies of an application's I/O
  • Useful for application I/O debugging
  • Pinpointing causes of extremes
  • Analyzing/tuning hardware for optimizations
• http://www.mcs.anl.gov/research/projects/darshan/
Darshan Specifics
• Darshan collects per-process statistics (organized by
file)
• Counts I/O operations, e.g. unaligned and sequential
accesses
• Times for file operations, e.g. opens and writes
• Accumulates read/write bandwidth info
• Creates data for simple visual representation
• More
• Requires no code modification (only re-linking)
• Small memory footprint
• Includes a job summary tool
Summary Tool Example Output
Test 06: Use Darshan to perform I/O analysis
• Challenge: How much time does the application spend in write
  operations?
• Running the test:
  • module unload perftools
  • module load darshan
  • ftn app.f90                              // compile the application
  • qsub run                                 // submit the job
  • export DARSHAN_LOGPATH=./
  • aprun -n16 -d2 ./a.out input.psf input.pdb
  • darshan-job-summary.pl *.gz              // get job summary
  • darshan-parser *.gz > darshan.log        // get darshan log
• Results:
  • Calculation time = 10.6 seconds
  • I/O write time   =  6.0 seconds
Test 06: Three possible solutions
• Look in the Darshan job-summary PDF file (figure above): 6.2 sec
• Get it from the Darshan log for coor.xyz.0: CP_F_CLOSE_TIMESTAMP − CP_F_OPEN_TIMESTAMP = 1401918203.0 − 1401918197.5 = 5.5 sec
• Manually put timers in designated spots in the source code: 6.0 sec
Good I/O Practices
• Opening a file for writing/appending is expensive, so:
• If possible, open files as read-only
• Avoid large numbers of small writes:
    while (forever) {
        open("myfile");
        write(a_byte);
        close("myfile");
    }
• Be gentle with metadata (or suffer its wrath)
  • Limit the number of files in a single directory
    • Instead opt for a hierarchical directory structure
  • ls contacts the metadata server; ls -l communicates with every OST
    assigned to a file (for all files)
  • Avoid wildcards: rm -rf *; expanding them is expensive over many files
  • It may even be more efficient to pass metadata through MPI than have all
    processes hit the MDS (calling stat) – see the sketch below
  • Avoid updating the last access time for read-only operations (NO_ATIME)
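A minimal sketch of that last idea (the file name is hypothetical; only rank 0 queries the metadata server, then broadcasts the result to all other ranks):

  #include <mpi.h>
  #include <stdio.h>
  #include <sys/stat.h>

  int main(int argc, char **argv) {
      MPI_Init(&argc, &argv);
      int rank;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      long long size = -1;
      if (rank == 0) {                       /* only one process touches the MDS */
          struct stat sb;
          if (stat("input.dat", &sb) == 0) size = (long long)sb.st_size;
      }
      /* every other rank gets the metadata over the network instead of via stat() */
      MPI_Bcast(&size, 1, MPI_LONG_LONG, 0, MPI_COMM_WORLD);

      if (rank == 0) printf("input.dat size = %lld bytes\n", size);
      MPI_Finalize();
      return 0;
  }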
Acknowledgements
Thanks to
Kalyana Chadalavada, Robert
Brunner, Manisha Gajbe, Galen
Arnold
for contributing to this presentation