Optimisation progress for UM7.8
Ilia Bermous
7 April 2011
Thanks to Joerg Henrichs, Martin Dix and Mike Naughton for their
help and advice during this work
The Centre for Australian Weather and Climate Research
A partnership between CSIRO and the Bureau of Meteorology
Description of global forecast test job
Global model with N320L70 resolution
Based on Fabrizio’s xazje job (which came from the forecast step of
Chris’s APS1 ACCESS-G development suite)
24 hour integration with ~30GB output
~3100-3250 GCR iterations per run (3162 with 7.5 and 3223 with 7.8)
for 120 time steps.
Timing results are given as pairs of Elapsed CPU Time and Elapsed
Wallclock Time (in sec) from the internal UM model timers, as reported
in the UM job output for each run.
All runs used Mike’s version of the UM run script, which is simple and
flexible for these kinds of tasks.
UM7.5 best performance results on Solar
 Software used
 Intel11.0.083 compiler
 OpenMPI mpi/sun-8.2 library
 UM model environment settings and source change (a run-script sketch of the
environment setting and striping is given at the end of this slide)
 Joerg’s byte swapping procedure
 Q_POS_METHOD=5 for the improved QPOS algorithms
 Lustre file system striping
 Best elapsed times (in sec) with a decomposition of 20x24 =>
480 cores (i.e. under 500 cores)
 Full I/O: (475; 497)  (484; 513)  (492; 519)
 NO I/O:   (330; 335)  (306; 311)  (310; 315)
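A minimal sketch of how the environment setting and Lustre striping above might be applied in a ksh/bash UM run script (the output directory variable is hypothetical, the 8-way/4 MB striping values are taken from the later UM7.8 tests, and the exact lfs option letters depend on the Lustre version):

    # hedged example only - not the actual UM run script
    export Q_POS_METHOD=5            # select the improved QPOS algorithm
    # stripe the model output directory across 8 OSTs with a 4 MB stripe size
    # (older Lustre versions use -s instead of -S for the stripe size)
    lfs setstripe -c 8 -S 4m $UM_OUTPUT_DIR   # $UM_OUTPUT_DIR is a placeholder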
Major UM7.8 developments for performance
improvement
 Asynchronous parallel I/O (requires OpenMP)
 This new feature is activated at both the build and run stages
 It only works with OpenMP – requires the UMUI “Use OpenMP” option in the
“User Information and Submit Method => Job submission method” panel to be
selected
 PMSL revised algorithm (Jacobi algorithm)
 A revised algorithm based on a Jacobi solver is introduced and the number
of iterations is increased. Even with more iterations, the new method is cheaper
and scales to higher node counts.
 Optimisation for FILL_EXTERNAL_HALOS
 resulted in a ~5% reduction in the runtime cost of the fill_external_halos routine
 Improved QPOS algorithms
 All new versions are significantly quicker at scale (~10%) but require
scientific validation as the science is altered somewhat. The "level" method
has been validated in PS25 and is now being used in the global model at the
Met Office.
Summary of attempts made for UM7.8
 OpenMP usage with Intel compiler and 1 thread on Solar
 OpenMP usage with Intel compiler and 2 threads on Solar
 OpenMP usage with SunStudio compiler and 2 threads on Solar
 OpenMP usage with Intel compiler and 2 threads on NCI system
 OpenMP only for “io_services” library with Intel compiler and 2
threads on Solar
OpenMP usage with Intel compiler
and 1 thread on Solar
 Problems resolved and reported
 Found a number of cases of inconsistent usage of allocate/deallocate
statements and IF block logic in the sources => reported to the UM developers
 Significant performance impact if TMPDIR is used and modified in the
UM scripts => reported to the developers
 A couple of environment variables, such as OMP_NUM_THREADS and
OMP_STACKSIZE, are missing and should be set by the UMUI scripts if
multithreading is used (a sketch is given below)
 A run time crash with the Intel11.0.083 compiler is resolved by using
Intel11.1.073, the latest compiler available on our site
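A hedged sketch of the missing settings mentioned above, as they might be added to the run script for a single-threaded OpenMP build (the stack size value is an example only and should be tuned per job):

    # set explicitly until the UMUI scripts export these themselves
    export OMP_NUM_THREADS=1      # one thread per MPI process for these runs
    export OMP_STACKSIZE=2g       # example value; per-thread stack size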
OpenMP usage with Intel compiler
and 1 thread on Solar (cont #2)
 Performance results (20x24x1, Lustre striping, FLUME_IOS_NPROC=8)

               UM7.8                        UM7.5 (without multithreading)
   Full I/O       No I/O          Full I/O       No I/O
   370; 389       221; 225        484; 512       330; 335
   362; 389       205; 210        475; 497       306; 311
   389; 416       209; 212        492; 519       310; 315

Additional results from the runs used to compare performance section by
section (shown in red on the original slide): 228; 235 and 277; 285

Conclusions:
1. The full I/O case with UM7.8 runs over 20% faster than with UM7.5
2. Without I/O, UM7.8 runs ~1.5 times faster than UM7.5
3. There is no visible performance improvement in the I/O part itself: 179 sec vs 186 sec
OpenMP usage with Intel compiler
and 1 thread on Solar (cont #3)
 Performance comparison between the top 6 sections for UM7.8 and UM7.5
without usage of I/O (section elapsed times in sec)

   UM7.8                            UM7.5
   PE_Helmholtz      68.15          PE_Helmholtz      70.76
   SL_Full_wind      31.35          ATM_STEP          52.30
   ATM_STEP          26.46          SL_Full_wind      31.19
   SL_Thermo         21.52          SL_Thermo         27.20
   READDUMP          13.53          READDUMP          12.67
   Atmos_Physics2     9.55          NI_filter_Ctl     15.63

Conclusions:
1. Comparing the top sections, the major performance improvements come from
ATM_STEP (25.74 sec), NI_filter_Ctl (9.59 sec) and SL_Thermo (5.68 sec),
giving a total of 41.01 sec
UM7.8 performance comparison: full I/O,
20x24 decomposition and Lustre striping
EXE1 – Intel11.1.073, -openmp, 1 thread, FLUME_IOS_NPROC=8,
buffer_size=6000 (results from the previous slide)
EXE2 – standard UMUI build procedure using Intel11.0.083 with the
“safe” level of optimisation (UMUI build job xbauk)
EXE3 – Intel11.1.073, bld.cfg is based on Imtiaz’s version without
“-WB -warn all -warn nointerfaces -align all” (due to Intel compiler problems)
   EXE1           EXE2           EXE3
   370; 389       381; 411       370; 396
   362; 389       362; 395       369; 397
   389; 416       366; 392       382; 406

Conclusion:
1. Usage of OpenMP with a single thread and parallel I/O with
FLUME_IOS_NPROC=8 does NOT provide any performance advantage
OpenMP usage with Intel compiler
and 2 threads on Solar
 The same slow performance issue that was found and investigated in
detail for UM7.5, and reported in August 2010, still exists and
has not been addressed by Intel
 Monitoring execution of the model shows that the job sometimes runs
fast for the first 10-15 steps and then slows down significantly;
sometimes the slowdown happens from the start of a run
Elapsed times for a 14x18 decomposition with 2 threads per MPI
process (504 cores) and without I/O are (4915 sec; 4921 sec),
in comparison with (360 sec; 366 sec) using 14x18x1 without I/O
 Conclusions:
 This long-standing problem must be addressed by Intel
 At the moment it is not the most critical issue in getting the
asynchronous parallel I/O functionality working with UM7.8
OpenMP usage with SunStudio compiler
and 2 threads on Solar
 Problems resolved and reported
 Usage of the INTENT attribute on POINTER arguments, which is not supported
by the Fortran standard => a workaround was used and the problem was reported
to the UM development team
 Multithreading performance results
 UM performance using 2 threads per MPI process is better than without
OpenMP, but scaling is very poor (Lustre striping was not used):
14x18x2threads + FLUME_IOS_NPROC=4 => 508 cores => 753sec
14x18x1thread + FLUME_IOS_NPROC=0 => 252 cores => 794sec
Notes:
- The date command output was used to calculate elapsed times
- Several runs using different configurations, such as 20x24x1 and 16x32x1,
crashed; the nature of these problems has not been investigated
- Usage of different optimisation options, such as
  -O3, -O5, -xtarget=native, -xarch=native, -dalign, -g
  does not make any visible impact on the performance results
OpenMP usage with Intel compiler
and 2 threads on NCI system
 The same slow performance issue as on Solar also exists on the NCI
system using the Intel11.1.073 compiler and the openmpi1.4.3 library:
(3565 sec; 3703 sec)
 Monitoring execution of the model showed that the job ran slowly from
the first time step
 Usage of the latest Intel12.0.084 compiler
 Compilation crashes for one file; the workaround of using “-O0” instead of
“-O2”, recommended by Martin, fixes the problem
 Execution with a 14x18 decomposition using a single thread crashes;
this run time problem has not been investigated
OpenMP only for “io_services” library with Intel
compiler and 2 threads on Solar
Due to the slow performance issue with multithreading in the computational
part, the main idea of this approach is (see the sketch after this list)
 to compile all UM7.8 sources, excluding the “io_services” library, without
the “-openmp” compilation option
 to compile the “io_services” library with multithreading, using the
“-openmp” compilation option
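A schematic illustration of this split, using literal ifort command lines rather than the actual UM/FCM build configuration (file names and flags are placeholders only):

    # io_services library: built with multithreading enabled
    ifort -O2 -openmp -c io_services/*.f90
    # all remaining UM7.8 sources: built without OpenMP
    ifort -O2 -c <other UM sources>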
 UM7.8 major terms in relation to asynchronous parallel I/O (a hypothetical
configuration sketch follows this list):
 FLUME_IOS_NPROC – number of MPI tasks allocated to act as IO servers
 IOS_Spacing – the gap between IO servers in MPI_COMM_WORLD (for
optimal performance a node has no more than one IO server)
 buffer_size – amount of data (MB) that each IO server can have outstanding
 IOS_use_async_stash – use asynchronous communications to accelerate
diagnostics output
 IOS_use_async_dump – asynchronous DUMP output not currently available
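A hypothetical configuration sketch combining the terms above, written as run-script environment settings (in practice some of these may be UMUI or namelist entries; the values are taken from the runs on the following slides or are examples only):

    export FLUME_IOS_NPROC=8         # 8 MPI tasks act as IO servers
    export IOS_Spacing=8             # one IO server per 8-task node
    export buffer_size=6000          # MB of outstanding data per IO server
    export IOS_use_async_stash=true  # asynchronous diagnostics (STASH) output
    # IOS_use_async_dump is left unset: asynchronous dump output is not yet available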
OpenMP only for “io_services” library with Intel
compiler and 2 threads on Solar (cont #2)
 Found problems/issues
 Using the 14x18x2 configuration with FLUME_IOS_NPROC=8 and a spacing of
8 (8 nodes are overcommitted), a run time error was produced:
forrtl: severe (40): recursive I/O operation, unit 6, file unknown
Workaround: several write statements that produce similar diagnostic output
have been commented out (as per Joerg’s message, Peter Kerney has
reported this problem to Intel Support)
 The main asynchronous parallel IO functionality from the latest model
development is available only if the MPI library allows multiple threads
to call MPI with no restrictions (MPI_THREAD_MULTIPLE). Unfortunately, our
MPI library (OpenMPI) provides only single threaded support
(MPI_THREAD_SINGLE). This is checked by an MPI_QUERY_THREAD call, which
returns the current level of thread support (a quick check of the library
itself is sketched below)
Comment: this is another example of an obstacle that arises when the user
is on a different platform from the one used by the developer
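One hedged way to check which thread level the installed OpenMPI build can provide before attempting a run (the exact wording of the ompi_info output differs between OpenMPI versions):

    # look for the "Thread support" line in the library's build summary
    ompi_info | grep -i thread
    # if MPI_THREAD_MULTIPLE is not reported as supported, the asynchronous
    # IO path described above cannot be used with this library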
OpenMP only for “io_services” library with Intel
compiler and 2 threads on Solar (cont #3)
With the current version of the MPI library, to be able to use some parts of
the implemented UM7.8 functionality, Joerg suggested overriding the
UM7.8 setting of MPI_THREAD_SINGLE with MPI_THREAD_FUNNELED
(The process can be multi-threaded, but only the main thread will make MPI
calls; all MPI calls are funneled to the main thread.)
Results (in sec) using a 20x24 decomposition with Lustre file system striping
(4 MB, 8 ways) and buffer_size=6000

   488 cores (FLUME_IOS_NPROC=8),    560 cores (FLUME_IOS_NPROC=10),
   8 MPI processes per node          7 MPI processes per node
   431; 436                          390; 395
   425; 429                          404; 413
   423; 428                          406; 414
OpenMP only for “io_services” library with Intel
compiler and 2 threads on Solar (cont #4)
 Conclusions
 Usage of MPI_THREAD_FUNNELED does not provide a visible
performance improvement in comparison with the results
achieved using a single thread only
 Not overcommitting the nodes on which multithreading is not used
gives slightly better performance results, similar to those
obtained with a single thread only
 The number of wasted cores in the second configuration, where only
7 MPI processes per node are used, can be reduced to 0 by using
the functionality provided by Joerg’s mprun.py script in its
explicit form, which takes 10-15 lines of text for a single run
command
Next steps for future work
 Merge Joerg’s byte swapping procedure from UM7.5 into UM7.8
(Joerg agreed to do this task)
 Follow up with Solar Help the request that a thread-multiple version of
the MPI library be provided to our site, so that the asynchronous
functionality can be used with UM7.8
 Validation of the numerical results produced with UM7.8
 Provide the information just presented to Paul Selwood and ask him:
 What are the main IOS parameter settings used at the UKMO site
with UM7.8?
 What kind of performance improvement is produced in
comparison with UM7.5 for the I/O part?
 What parameter settings can be recommended for our
case?