Optimisation progress for UM7.8
Ilia Bermous, 7 April 2011
Thanks to Joerg Henrichs, Martin Dix and Mike Naughton for their help and advice during this work.
The Centre for Australian Weather and Climate Research, a partnership between CSIRO and the Bureau of Meteorology

Description of the global forecast test job
- Global model at N320L70 resolution
- Based on Fabrizio's xazje job (which came from the forecast step of Chris's APS1 ACCESS-G development suite)
- 24-hour integration with ~30 GB of output
- ~3100-3250 GCR iterations per run (3162 with UM7.5, 3223 with UM7.8) over 120 time steps
- Timing results are given in terms of Elapsed CPU Time and Elapsed Wallclock Time from the internal UM model timers, as reported in the UM job output for each run
- All runs used Mike's version of the UM run script; this script is very simple and flexible for these kinds of tasks

UM7.5 best performance results on Solar
Software used:
- Intel 11.0.083 compiler
- OpenMPI mpi/sun-8.2 library
UM model environment settings and source changes:
- Joerg's byte-swapping procedure
- Q_POS_METHOD=5 for the improved QPOS algorithms
- Lustre file system striping
Best elapsed times in seconds (Elapsed CPU; Elapsed Wallclock) with a 20x24 decomposition => 480 cores (i.e. under 500 cores):
- Full I/O: (475; 497), (484; 513), (492; 519)
- No I/O: (330; 335), (306; 311), (310; 315)

Major UM7.8 developments for performance improvement
- Asynchronous parallel I/O (requires OpenMP)
  - This new feature is activated at both the build and run stages
  - Works only with OpenMP: the "Use OpenMP" option in the UMUI "User Information and Submit Method => Job submission method" panel must be selected
- PMSL revised algorithm
  - A revised algorithm based on a Jacobi solver is introduced and the number of iterations is increased. Even with more iterations, the new method is cheaper and scales better at higher node counts.
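To illustrate why a Jacobi-style scheme scales well, here is a minimal sketch of a generic Jacobi iteration for a 2-D Laplace-type problem in pure Python. This is only an illustration of the numerical idea, not the UM7.8 PMSL code; the function names and grid setup are invented for the example.

```python
# Minimal Jacobi iteration for a 2-D Laplace-type problem.
# Illustrative sketch only; the actual UM7.8 PMSL solver is more elaborate.

def jacobi_step(grid):
    """One Jacobi sweep: each interior point becomes the average of its
    four neighbours; boundary values are held fixed."""
    n, m = len(grid), len(grid[0])
    new = [row[:] for row in grid]
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            new[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j] +
                                grid[i][j - 1] + grid[i][j + 1])
    return new

def jacobi_solve(grid, iterations):
    """Run a fixed number of Jacobi sweeps. Each sweep reads only the
    previous iterate, so all points are updated independently and the
    sweep parallelises trivially across processors."""
    for _ in range(iterations):
        grid = jacobi_step(grid)
    return grid
```

Because each sweep depends only on the previous iterate, the updates within a sweep are fully independent, which is why the method can remain cheaper at high core counts even when it needs more iterations than a sequential-dependency solver.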
- Optimisation for FILL_EXTERNAL_HALOS
  - Resulted in a ~5% reduction in the runtime cost of the fill_external_halos routine
- Improved QPOS algorithms
  - All new versions are significantly quicker at scale (~10%) but require scientific validation, as the science is altered somewhat
  - The "level" method has been validated in PS25 and is now being used in the global model at the Met Office

Summary of attempts made for UM7.8
- OpenMP usage with the Intel compiler and 1 thread on Solar
- OpenMP usage with the Intel compiler and 2 threads on Solar
- OpenMP usage with the SunStudio compiler and 2 threads on Solar
- OpenMP usage with the Intel compiler and 2 threads on the NCI system
- OpenMP only for the "io_services" library with the Intel compiler and 2 threads on Solar

OpenMP usage with Intel compiler and 1 thread on Solar
Problems resolved and reported:
- A number of cases of inconsistent usage of allocate/deallocate statements and IF-block logic were found in the sources => reported to the UM developers
- Significant impact on performance if TMPDIR is used and modified in the UM scripts => reported to the developers
- A couple of environment variables, such as OMP_NUM_THREADS and OMP_STACKSIZE, are missing and should be set by the UMUI scripts if multithreading is used
- A run-time crash with the Intel 11.0.083 compiler was resolved by using Intel 11.1.073, the latest compiler available on our site

OpenMP usage with Intel compiler and 1 thread on Solar (cont #2)
Performance results (20x24x1, Lustre striping, FLUME_IOS_NPROC=8); times in seconds (Elapsed CPU; Elapsed Wallclock):

            UM7.8                     UM7.5 (without multithreading)
  Full I/O      No I/O        Full I/O      No I/O
  370; 389      221; 225      484; 512      330; 335
  362; 389      205; 210      475; 497      306; 311
  389; 416      209; 212      492; 519      310; 315
                228; 235                    277; 285

The last row (shown in red in the original slide) gives results from runs used to compare performance section by section.

Conclusions:
1. The full I/O case with UM7.8 runs over 20% faster than with UM7.5
2. Without I/O, UM7.8 runs ~1.5 times faster than UM7.5
3. There is no visible performance improvement in the I/O part itself: 179 sec vs 186 sec

OpenMP usage with Intel compiler and 1 thread on Solar (cont #3)
Performance comparison between the top 6 sections for UM7.8 and UM7.5 without usage of I/O (times in seconds):

  UM7.8                        UM7.5
  PE_Helmholtz      68.15      PE_Helmholtz    70.76
  SL_Full_wind      31.35      ATM_STEP        52.30
  ATM_STEP          26.46      SL_Full_wind    31.19
  SL_Thermo         21.52      SL_Thermo       27.20
  READDUMP          13.53      READDUMP        12.67
  Atmos_Physics2     9.55      NI_filter_Ctl   15.63

Conclusion: comparing the top sections, the major performance improvements come from ATM_STEP (25.74 sec), NI_filter_Ctl (9.59 sec) and SL_Thermo (5.68 sec), giving 41.01 sec in total

UM7.8 performance comparison: full I/O, 20x24 decomposition and Lustre striping
- EXE1: Intel 11.1.073, -openmp, 1 thread, FLUME_IOS_NPROC=8, buffer_size=6000 (results from the previous slide)
- EXE2: UMUI standard build procedure using Intel 11.0.083 with the "safe" level of optimisation (UMUI build job xbauk)
- EXE3: Intel 11.1.073, bld.cfg based on Imtiaz's version without "-WB -warn all -warn nointerfaces -align all" (due to Intel compiler problems)

  EXE1          EXE2          EXE3
  370; 389      381; 411      370; 396
  362; 389      362; 395      369; 397
  389; 416      366; 392      382; 406

Conclusion: usage of OpenMP with a single thread and parallel I/O with FLUME_IOS_NPROC=8 does NOT provide any performance advantage

OpenMP usage with Intel compiler and 2 threads on Solar
- The same slow-performance issue found and investigated in detail for UM7.5, and reported in August 2010, still exists; the issue has not been addressed by Intel at all
- Monitoring execution of the model: sometimes the job runs fast for the first 10-15 steps and then slows down significantly; sometimes this happens from the start of a run
- Elapsed times for a 14x18 decomposition with 2 threads per MPI process (504 cores, without I/O) are (4915 sec; 4921 sec), compared with (360 sec; 366 sec) for 14x18x1 without I/O
Conclusions: this long-standing problem must be addressed by Intel; at the moment it is not the most critical issue in getting the asynchronous parallel I/O functionality working with UM7.8

OpenMP usage with SunStudio compiler and 2 threads on Solar
Problems resolved and reported:
- Usage of POINTER INTENT attributes, which is not supported by the Fortran standard => used a workaround and reported the problem to the UM development team
Multithreading performance results:
- UM performance using 2 threads per MPI process is better than without OpenMP, but scaling is very poor (Lustre striping was not used):
  - 14x18 x 2 threads + FLUME_IOS_NPROC=4 => 508 cores => 753 sec
  - 14x18 x 1 thread + FLUME_IOS_NPROC=0 => 252 cores => 794 sec
Notes:
- The date command output was used to calculate elapsed times
- Several runs using other configurations, such as 20x24x1 and 16x32x1, crashed; the nature of these problems has not been investigated
- Different optimisation options such as -O3, -O5, -xtarget=native, -xarch=native, -dalign and -g do not make any visible impact on the performance results

OpenMP usage with Intel compiler and 2 threads on NCI system
- The same slow-performance issue as on Solar also exists on the NCI system using the Intel 11.1.073 compiler and the openmpi 1.4.3 library: (3565 sec;
3703 sec)
- Monitoring execution of the model: the job started to run slowly from the first time step
Usage of the latest Intel 12.0.084 compiler:
- Compilation crashes for one file; a workaround recommended by Martin, using "-O0" instead of "-O2", fixes the problem
- Execution with a 14x18 decomposition using a single thread crashes; this run-time problem has not been investigated

OpenMP only for "io_services" library with Intel compiler and 2 threads on Solar
Because of the slow multithreading performance in the computational part, the main idea of this approach is:
- to compile all UM7.8 sources except the "io_services" library without the "-openmp" compilation option
- to compile the "io_services" library with multithreading, using the "-openmp" compilation option
UM7.8 major terms in relation to asynchronous parallel I/O:
- FLUME_IOS_NPROC: number of MPI tasks allocated to act as I/O servers
- IOS_Spacing: the gap between I/O servers in MPI_COMM_WORLD (for optimal performance a node has no more than one I/O server)
- buffer_size: amount of data (MB) that each I/O server can have outstanding
- IOS_use_async_stash: use asynchronous communications to accelerate diagnostics output
- IOS_use_async_dump: asynchronous DUMP output is not currently available

OpenMP only for "io_services" library with Intel compiler and 2 threads on Solar (cont #2)
Problems/issues found:
- Using a 14x18x2 configuration with FLUME_IOS_NPROC=8 and a spacing of 8 (8 nodes overcommitted), a run-time error was produced:
  forrtl: severe (40): recursive I/O operation, unit 6, file unknown
  Workaround: several write statements producing similar diagnostic output have been commented out (as per Joerg's message, Peter Kerney has reported this problem to Intel Support)
- The main asynchronous parallel I/O functionality in the latest model development is available only if the MPI library allows multiple threads to call MPI with no restrictions (MPI_THREAD_MULTIPLE); unfortunately, only single-threaded support (MPI_THREAD_SINGLE) is provided by our MPI library (OpenMPI). This is checked by an MPI_QUERY_THREAD call, which returns the current level of thread support.
- Comment: this is another example of an obstacle that arises when the user is on a different platform from the one used by the developer

OpenMP only for "io_services" library with Intel compiler and 2 threads on Solar (cont #3)
- With the current version of the MPI library, to be able to use some parts of the implemented UM7.8 functionality, Joerg suggested overriding the UM7.8 setting of MPI_THREAD_SINGLE with MPI_THREAD_FUNNELED (the task can be multi-threaded, but only the main thread makes MPI calls; all MPI calls are funneled to the main thread)
- Results (in sec) using a 20x24 decomposition with Lustre file system striping (4 MB, 8 ways) and buffer_size=6000:

  488 cores (FLUME_IOS_NPROC=8),    560 cores (FLUME_IOS_NPROC=10),
  8 MPI processes per node          7 MPI processes per node
  431; 436                          390; 395
  425; 429                          404; 413
  423; 428                          406; 414

OpenMP only for "io_services" library with Intel compiler and 2 threads on Solar (cont #4)
Conclusions:
- Usage of MPI_THREAD_FUNNELED does not provide a visible performance improvement in comparison with the results achieved using a single thread only
- Not overcommitting the nodes on which multithreading is not used gives slightly better performance results, similar to those obtained using a single thread only
- The number of wasted cores in the second configuration, where only 7 MPI processes per node are used, can be reduced to 0 using the functionality provided by Joerg's mprun.py script in its explicit form, which takes 10-15 lines of text for a single run command

Next steps for future work
- Merge Joerg's byte-swapping procedure from UM7.5 into UM7.8 (Joerg agreed to do this task)
- Ask Solar Help whether a thread-multiple version of the MPI library could be provided to our site, to enable the asynchronous functionality with UM7.8
- Validate the numerical results produced with UM7.8
- Provide the information presented here to Paul Selwood and ask him:
  - What are the main IOS parameter settings used at the UKMO site with UM7.8?
  - What kind of performance improvement is produced in comparison with UM7.5 for the I/O part?
  - What parameter settings can be recommended for our case?
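The byte-swapping procedure mentioned above concerns endianness conversion of fixed-width binary words (e.g. 64-bit reals in model dump files) when files move between big-endian and little-endian machines. As a generic illustration of the idea (this is not Joerg's actual UM code; the function name is invented for the example):

```python
# Generic byte swapping for a buffer of fixed-width binary words.
# Illustrative sketch only, not the UM byte-swapping procedure.
import struct

def byteswap_words(data: bytes, word_size: int = 8) -> bytes:
    """Reverse the byte order of every word_size-byte word in data,
    converting each word between big- and little-endian layout."""
    if len(data) % word_size != 0:
        raise ValueError("buffer length is not a multiple of the word size")
    swapped = bytearray(len(data))
    for i in range(0, len(data), word_size):
        swapped[i:i + word_size] = data[i:i + word_size][::-1]
    return bytes(swapped)

# A big-endian 64-bit float becomes readable as little-endian:
big = struct.pack(">d", 1.5)      # big-endian bytes of 1.5
little = byteswap_words(big, 8)   # reverse each 8-byte word
assert struct.unpack("<d", little)[0] == 1.5
```

Reversing each complete word converts it exactly between the two byte orders, which is why such a conversion step is needed (or a compiler/runtime equivalent) when exchanging unformatted binary files across platforms of different endianness.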