Understanding and Tuning I/O Performance
ATPESC 2019
Glenn K. Lockwood, National Energy Research Scientific Computing Center, Lawrence Berkeley National Laboratory
Q Center, St. Charles, IL (USA), July 28 – August 9, 2019
exascaleproject.org
Hands-on exercises: https://xgitlab.cels.anl.gov/ATPESC-IO/hands-on

Parallel file systems in principle
• A parallel file system (PFS) spreads files across multiple servers, and therefore across many NICs and drives
• You and your application see one big file
• The PFS driver on your compute nodes sees a collection of chunks
• The PFS servers see individual chunks
[Diagram: a 4 MiB file striped in 1 MiB chunks across server0–server3]

The speed of light still applies
• One compute node can't talk to every storage server at full speed
• One storage server can't talk to every compute node at full speed
• Accidental imbalance can be caused by a server failure, which leaves the surviving servers with unequal loads
• Accidental imbalance can also be caused by misalignment: chunks no longer line up with server boundaries, so a single chunk's data lands on two servers (made concrete in the sketch below)
[Diagrams: node0–node3 exchanging chunk0–chunk3 with server0–server3 under balanced, failed-server, and misaligned layouts]

Overall goals when doing I/O to a PFS:
• Each client and server handles the same data volume
• Work around gotchas specific to the PFS implementation
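To make the alignment and balance goals concrete, here is a minimal sketch of round-robin striping, the placement scheme in the diagrams above. The 1 MiB stripe size, four servers, and the server_for_offset helper are illustrative assumptions of this sketch; real parallel file systems layer object mappings and redundancy on top of this idea.

#include <stdint.h>
#include <stdio.h>

#define STRIPE_SIZE (1 << 20)   /* 1 MiB stripes, as in the diagrams above */
#define NUM_SERVERS 4

/* Round-robin striping: which server stores the byte at this offset?
 * (Hypothetical helper; real PFS layouts are more elaborate.) */
static int server_for_offset(uint64_t offset)
{
    return (int)((offset / STRIPE_SIZE) % NUM_SERVERS);
}

int main(void)
{
    /* Aligned case: four 1 MiB writes land on four different servers. */
    for (int i = 0; i < 4; i++) {
        uint64_t start = (uint64_t)i * STRIPE_SIZE;
        printf("aligned write %d    -> server %d\n", i, server_for_offset(start));
    }

    /* Misaligned case: a 40 KiB header shifts every subsequent 1 MiB write
     * so that it straddles a stripe boundary and touches two servers,
     * doubling the number of servers involved per write. */
    uint64_t header = 40 * 1024;
    for (int i = 0; i < 4; i++) {
        uint64_t start = header + (uint64_t)i * STRIPE_SIZE;
        uint64_t end = start + STRIPE_SIZE - 1;
        printf("misaligned write %d -> servers %d and %d\n",
               i, server_for_offset(start), server_for_offset(end));
    }
    return 0;
}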
Understanding I/O performance and behavior using Darshan

Job-Level Performance Analysis
• Darshan provides insight into the I/O behavior and performance of a job
• darshan-job-summary.pl creates a PDF file summarizing various aspects of I/O performance:
  – Percent of runtime spent in I/O
  – Operation counts
  – Access size histogram
  – Access type histogram
  – File usage

Example: Is your app doing what you think it's doing?
• The app opens 129 files (one "control" file and 128 data files)
• The user expected one ~40 KiB header per data file
• Darshan showed 512 headers being written
• Code bug: the header was written 4x per file

Example: When doing the right thing goes wrong
• A large-scale astrophysics application using MPI-IO had consumed 94,000,000 CPU hours at NERSC since 2015
• It wrote 17 GiB using 4 KiB transfers, and 40% of walltime was spent doing I/O even with MPI-IO
• Collective I/O was being disabled by the middleware due to a type mismatch in the application code
• After the type-mismatch bug was fixed, collective I/O gave a 40x speedup: 756 GiB written in 1 MiB transfers, with only 6% of walltime spent doing I/O
• Case study courtesy of Jialin Liu and Quincey Koziol (NERSC)

Example: Redundant read traffic
• Applications sometimes read more bytes from a file than the file's size
  – can cause disruptive I/O network traffic and storage contention
  – good candidate for aggregation, collective I/O, or burst buffering (a read-once-and-broadcast sketch follows below)
• Common pattern in emerging AI/ML workloads
• Example: 6,138 processes running for 6.5 hours, with an average I/O time per process of 27 minutes, read 500+ TiB against only 1.3 TiB of file data!

Example: Small writes to shared files
• Scenario: small writes can contribute to poor performance
  – particularly when writing to shared files
  – candidates for collective I/O or batching/buffering of write operations (see the MPI-IO sketch below)
• Example: an application issued 5.7 billion writes to shared files, each less than 100 bytes in size, and averaged just over 1 MiB/s per process during its shared-write phase

Example: Excessive metadata overhead
• Scenario: most I/O time is spent on metadata operations (open, stat, etc.)
  – the apparent cost of close() can be misleading due to write-behind cache flushing!
  – remedy: coalesce files to eliminate extra metadata calls
• Example: at a scale of 40,960 processes, more than 20% of runtime was spent in I/O, 99% of which went to metadata operations; the app generated 200,000+ files with 600,000+ write and 600,000+ stat calls
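The small-writes and type-mismatch case studies above both point to collective I/O as the fix. Below is a minimal sketch of a collective shared-file write with MPI-IO; it is a generic illustration, not the application code from the case studies, and the file name and sizes are arbitrary assumptions.

#include <mpi.h>
#include <stdlib.h>
#include <string.h>

#define RECORD_SIZE 64           /* small records, like the <100-byte writes above */
#define RECORDS_PER_RANK 1024

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char *buf = malloc(RECORD_SIZE * RECORDS_PER_RANK);
    memset(buf, rank, RECORD_SIZE * RECORDS_PER_RANK);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "shared.out",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* One collective call per rank: the MPI-IO layer can aggregate all
     * ranks' buffers and issue large, well-formed writes on their behalf.
     * Contrast with RECORDS_PER_RANK independent MPI_File_write_at calls,
     * which would hit the file system as a storm of tiny requests. */
    MPI_Offset offset = (MPI_Offset)rank * RECORD_SIZE * RECORDS_PER_RANK;
    MPI_File_write_at_all(fh, offset, buf, RECORD_SIZE * RECORDS_PER_RANK,
                          MPI_BYTE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(buf);
    MPI_Finalize();
    return 0;
}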
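For the redundant-read pattern, a common remedy is to read shared input once and distribute it over the interconnect instead of letting every process re-read the same file. A minimal sketch, assuming the input fits in rank 0's memory and is under 2 GiB; the file name is arbitrary and error handling is omitted.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    long size = 0;
    char *data = NULL;

    /* Only rank 0 touches the file system... */
    if (rank == 0) {
        FILE *fp = fopen("input.dat", "rb");
        fseek(fp, 0, SEEK_END);
        size = ftell(fp);
        rewind(fp);
        data = malloc(size);
        fread(data, 1, size, fp);
        fclose(fp);
    }

    /* ...then the interconnect, which is far faster than contended
     * storage, distributes the contents to everyone else. */
    MPI_Bcast(&size, 1, MPI_LONG, 0, MPI_COMM_WORLD);
    if (rank != 0)
        data = malloc(size);
    MPI_Bcast(data, (int)size, MPI_BYTE, 0, MPI_COMM_WORLD);

    /* Every rank now holds the input; total bytes read from the file
     * system: size, not size * nprocs. */
    free(data);
    MPI_Finalize();
    return 0;
}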
Available Darshan Analysis Tools
• Docs: http://www.mcs.anl.gov/research/projects/darshan/docs/darshan-util.html
• Officially supported tools
  – darshan-job-summary.pl: creates a PDF with graphs for initial analysis
  – darshan-summary-per-file.sh: similar to the above, but produces a separate PDF summary for every file opened by the application
  – darshan-parser: dumps all information into text format
• Third-party tools
  – darshan-ruby: Ruby bindings for the darshan-util C library (https://xgitlab.cels.anl.gov/darshan/darshan-ruby)
  – HArshaD: easily find and compare Darshan logs (https://kaust-ksl.github.io/HArshaD/)
  – pytokio: detect slow Lustre OSTs, create Darshan scoreboards, etc. (https://pytokio.readthedocs.io/)

Measuring I/O performance

Fine-tuning performance requires benchmarking
• Darshan tells you what your application is trying to do
• A lot more is happening underneath that Darshan cannot (and is not allowed to) see
• Benchmarking your I/O pattern is often a necessary part of optimization
• Source: Thomas Krenn, Linux Storage Stack Diagram (https://www.thomas-krenn.com/en/wiki/Linux_Storage_Stack_Diagram), CC BY-SA 3.0

IOR: a tool for measuring storage system performance
• An MPI application benchmark that reads and writes data in configurable ways
• The I/O pattern can be interleaved or random
• Input: desired transfer sizes, block sizes, and segment counts
• Output: bandwidth and IOPS
• Configurable backends:
  – POSIX, STDIO, MPI-IO
  – HDF5, PnetCDF, rados
• https://github.com/hpc/ior/releases

First attempt at benchmarking an I/O pattern
• On a 700 GB/sec Lustre file system (Cori), using 4 nodes with 16 processes per node
• The performance makes no sense: write performance is awful and read performance is phenomenal

$ mpirun -n 64 ./ior -t 1m -b 16m -s 256
...
Max Write: 450.54 MiB/sec (472.43 MB/sec)
Max Read:  27982.41 MiB/sec (29341.68 MB/sec)

Try breaking up output into multiple files
• IOR provides the -F option to make each rank read/write its own file
  – reduces lock contention within a file
  – can cause significant metadata load at scale
• Problem: 250 GB/sec from 4 nodes is faster than light

$ mpirun -n 64 ./ior -t 1m -b 1m -s 256 -F
...
Max Write: 13887.92 MiB/sec (14562.54 MB/sec)
Max Read:  259730.43 MiB/sec (272347.09 MB/sec)

Effect of page cache when measuring I/O bandwidth
• The page cache uses otherwise-unused compute node memory to store file contents
  – as a write-back cache, for small-write performance
  – as a read cache, for re-read data
• It is easy to accidentally measure memory bandwidth rather than file system bandwidth!

Avoid reading back what you just wrote using shifts
• IOR provides -C to shift MPI ranks by one node before reading back
• Read performance now looks reasonable
• But what about the write cache?

$ mpirun -n 64 ./ior -t 1m -b 1m -s 256 -F -C
...
Max Write: 13398.16 MiB/sec (14048.98 MB/sec)
Max Read:  6950.81 MiB/sec (7288.45 MB/sec)

Flushing write-back cache to include time to persistence
• IOR provides the -e option to force fsync(2) and write back all dirty pages
• This measures the time to write data to durable media, not just to page cache
• Without fsync, the close(2) operation may include hidden sync time (see the timing sketch below)

$ mpirun -n 64 ./ior -t 1m -b 1m -s 256 -F -C -e
...
Max Write: 12289.16 MiB/sec (12886.11 MB/sec)
Max Read:  6274.31 MiB/sec (6579.09 MB/sec)
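The effect that IOR's -e option accounts for is easy to reproduce directly: a buffered write() can complete at memory speed while the data sits only in page cache, and fsync() then absorbs the real cost of reaching durable media. A minimal sketch, assuming a POSIX system; the file name and 256 MiB size are arbitrary assumptions.

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

static double now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
    size_t nbytes = 256 << 20;            /* 256 MiB */
    char *buf = malloc(nbytes);
    memset(buf, 0xab, nbytes);

    int fd = open("testfile.dat", O_CREAT | O_WRONLY | O_TRUNC, 0644);

    double t0 = now();
    write(fd, buf, nbytes);   /* may land in page cache only; a single
                               * write() is a simplification -- production
                               * code should loop on short writes */
    double t1 = now();
    fsync(fd);                /* force dirty pages out to durable media */
    double t2 = now();
    close(fd);

    printf("write() alone:   %.0f MiB/s\n", nbytes / (t1 - t0) / (1 << 20));
    printf("write() + fsync: %.0f MiB/s\n", nbytes / (t2 - t0) / (1 << 20));
    free(buf);
    return 0;
}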
How well does single-file I/O perform now?

$ mpirun -n 64 ./ior -t 1m -b 1m -s 256 -C -e
...
Max Write: 455.49 MiB/sec (477.62 MB/sec)
Max Read:  2317.75 MiB/sec (2430.33 MB/sec)

Don't forget to set Lustre striping!

$ lfs setstripe -c 8 .
$ mpirun -n 64 ./ior -t 1m -b 1m -s 256 -C -e
...
Max Write: 2234.91 MiB/sec (2343.47 MB/sec)
Max Read:  4451.02 MiB/sec (4667.24 MB/sec)

Corollaries of I/O benchmarking
• Understand cache effects
  – caching can correct I/O problems at small scale
  – but it tends not to behave well at scale
• Utilize caches when it makes sense
  – for small but sequential writes
  – for read-then-re-read workloads (e.g., the BLAST bioinformatics app)

Effect of misaligned I/O
• It is common to write a small, fixed-size header followed by big blobs of data
• This pattern does not map well to parallel file systems
[Diagram: a header written at offset 0 shifts each rank's 1 MiB block so that it straddles two OSTs]
• Parallel HDF5 and PnetCDF allow you to adjust alignment:
  – PnetCDF: the nc_header_align_size and nc_var_align_size hints
  – HDF5: the H5Pset_alignment operation (see the sketch below)
  – consult the user docs for the best alignment values
[Diagram: unaligned vs. aligned layouts across OST0–OST2, with padding absorbing the header]

$ mpirun -n 128 ./ior -t 1m -b 1m -s 256 -a HDF5 -w
...
Max Write: 785.43 MiB/sec (823.58 MB/sec)

$ mpirun -n 128 ./ior -t 1m -b 1m -s 256 -a HDF5 -w -O setAlignment=1M
...
Max Write: 1318.53 MiB/sec (1382.58 MB/sec)
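The alignment that IOR sets with -O setAlignment=1M corresponds to the H5Pset_alignment property mentioned above, which applications can set on the file access property list themselves. A minimal sketch, assuming a parallel HDF5 build; the 64 KiB threshold, 1 MiB alignment, and file name are example values that should be matched to the file system's stripe configuration.

#include <hdf5.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* File access property list: use the MPI-IO driver, and align every
     * object larger than 64 KiB on a 1 MiB boundary (the stripe size). */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    H5Pset_alignment(fapl, 64 * 1024, 1024 * 1024);

    hid_t file = H5Fcreate("aligned.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* ... create datasets and write collectively as usual ... */

    H5Fclose(file);
    H5Pclose(fapl);
    MPI_Finalize();
    return 0;
}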
Understanding I/O Behavior and Performance in Complex Storage Systems

"I observed performance XYZ. Now what?"
• A climate vs. weather analogy: it is snowing outside. Is that normal?
• You need context to know:
  – Does it ever snow there?
  – What time of year is it?
  – What was the temperature yesterday?
  – Do your neighbors see snow too?
  – Should you look at it first hand?
• It is similarly difficult to understand a single application performance measurement without broader context: how do we differentiate the typical I/O climate from extreme I/O weather events?
• If Darshan doesn't flag any problems, what next? Darshan covers the compute nodes, but the I/O nodes, storage fabric, and storage servers remain blind spots
• You need a big-picture view; many component-level tools exist (LDMS, LMT, smartctl, UFM, mmperfmon, DDNtool, blktrace), but each uses its own data formats, resolutions, and scopes
• Total Knowledge of I/O (TOKIO) was designed to address this: integrate, correlate, and analyze I/O behavior from the system as a whole for a holistic understanding

TOKIO approach to I/O performance analysis
• Integrate existing instrumentation tools
• Index data sources in their native formats
  – tools to align and link data sets
  – connectors to produce coherent views on demand
• Develop analysis methods that integrate data
• Produce tools that share a common interface and data format
• https://www.nersc.gov/research-and-development/tokio/

Example: Unified Monitoring and Metrics Interface (UMAMI)
• A pluggable dashboard that displays the I/O performance of an application in the context of other system telemetry and historical records
• Historical samples for a given application are plotted over time
• Each metric is shown in a separate row
• Box plots relate the current values to the overall variance

UMAMI example: what made performance increase?
• Performance for the job is 2.5x higher than usual
• System background load is typical
• Server CPU load is low after a long-term steady climb
• This corresponds to a data purge that freed up disk blocks

Example: Identifying straggling OSTs
• A TOKIO utility generates heatmaps of parallel file system servers
• Most servers wrote all of their data between 3:42 and 3:44 at ~100 GiB/sec
• OST0015 didn't finish writing until 3:49 and caused a 3x slowdown!

Example: TOKIO darshan_bad_ost to find stragglers
• Combines file-to-OST mappings with file-to-performance mappings in Darshan logs
• Correlates slow files with slow OSTs over any number of Darshan logs

$ darshan_bad_ost ./glock_ior_id1149144_4-9-65382-13265564257186991065_1.darshan
OST Name   Correlation   P-Value
OST#21     -0.703        2.473e-305   <-- this OST looks unhealthy
OST#8      -0.102        4.039e-06
OST#7      -0.067        0.00246
...

• https://pytokio.readthedocs.io/en/latest/api/tokio.cli.darshan_bad_ost.html

I/O understanding in complex systems: takeaways
• Complex storage architectures require a holistic approach to performance analysis
  – I/O problems are sometimes out of users' control!
  – the more storage tiers there are, the more things can go wrong
• Software stacks to address this complexity are emerging:
  – Darshan for the application and runtime library
  – implementation-specific tools for the file system and storage devices
  – TOKIO to connect everything together
• For more information, see:
  – https://www.mcs.anl.gov/research/projects/darshan/
  – https://www.nersc.gov/research-and-development/tokio/

[Diagram: the I/O software stack: Application, Data Model Support, Transformations, Parallel File System, I/O Hardware]

Thank you!
exascaleproject.org