Understanding and Tuning I/O Performance
ATPESC 2019
Glenn K. Lockwood
National Energy Research Scientific Computing Center
Lawrence Berkeley National Laboratory
Q Center, St. Charles, IL (USA)
July 28 – August 9, 2019
exascaleproject.org
Hands on exercises: https://xgitlab.cels.anl.gov/ATPESC-IO/hands-on
Parallel file systems in principle
A file system that spreads files across multiple servers (and therefore across many NICs and drives)
[Figure: a 4 MiB file striped in 1 MiB chunks across server0–server3. You and your application see one big file; the PFS driver on your compute nodes sees a collection of chunks; each PFS server sees only its own chunks.]
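To make the mapping concrete: under simple round-robin striping, the server holding a given byte offset is plain integer arithmetic. A minimal sketch in shell, assuming a generic round-robin layout with a 1 MiB stripe size and 4 servers (real file systems add their own layout policies on top):

$ # server = (offset / stripe_size) mod num_servers; which server holds byte offset 5 MiB?
$ echo $(( (5 * 1024 * 1024) / (1024 * 1024) % 4 ))
1

That is, the chunk starting at offset 5 MiB wraps around and lands back on server1.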
The speed of light still applies
One compute node can't talk to every storage server at full speed.
[Figure: a single compute node (node0) sending chunk0–chunk3 to server0–server3.]
The speed of light still applies
One storage server can't talk to every compute node at full speed.
[Figure: node0–node3 writing chunk0–chunk3 to server0–server3.]
The speed of light still applies
Accidental imbalance can be caused by a server failure.
[Figure: node0–node3 writing chunk0–chunk3; with one server failed, the surviving servers absorb an uneven share of the chunks.]
The speed of light still applies
Accidental imbalance can be caused by misalignment.
[Figure: chunks that straddle stripe boundaries are split (c0/c1, c1/c2, c2/c3), so each server handles pieces of two different chunks.]
The speed of light still applies
[Figure: node0–node3 writing chunk0–chunk3, evenly balanced across server0–server3.]
Overall goals when doing I/O to a PFS:
• Each client and server handles the same data volume
• Work around gotchas specific to the PFS implementation
The speed of light still applies
[Figure: a 16-block file laid out so blocks 0–3 land on server0, 4–7 on server1, 8–11 on server2, and 12–15 on server3.]
Overall goals when doing I/O to a PFS:
• Each client and server handles the same data volume
• Work around gotchas specific to the PFS implementation
Understanding I/O performance and behavior using Darshan
Job-Level Performance Analysis
• Darshan provides insight into the I/O behavior and performance of a job
• darshan-job-summary.pl creates a PDF file summarizing various aspects of I/O performance:
  – Percent of runtime spent in I/O
  – Operation counts
  – Access size histogram
  – Access type histogram
  – File usage
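A typical invocation looks like the following; the log file name here is hypothetical, since real Darshan logs land in a site-configured directory with a name encoding the user, executable, and job ID:

$ darshan-job-summary.pl glock_myapp_id1149144.darshan
$ # writes a PDF summary in the current directory; see the darshan-util docs for options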
Example: Is your app doing what you think it's doing?
• App opens 129 files (one “control” file, 128 data files)
• User expected one ~40 KiB header per data file
• Darshan showed 512 headers being written
• Code bug: header was written 4× per file
Example: When doing the right thing goes wrong
• Large-scale astrophysics application using MPI-IO
• Used 94,000,000 CPU hours at NERSC since 2015
• Wrote 17 GiB using 4 KiB transfers
• 40% of walltime spent doing I/O, even with MPI-IO
(Case study courtesy of Jialin Liu and Quincey Koziol, NERSC)
Example: When doing the right thing goes wrong
• Collective I/O was being disabled by middleware due to type mismatch in app code
• After type mismatch bug fixed, collective I/O gave 40× speedup
• Wrote 756 GiB in 1 MiB transfers (>40× speedup)
• 6% of walltime spent doing I/O
Example: Redundant read traffic
• Applications sometimes read more bytes from a file than the file’s size
  – Can cause disruptive I/O network traffic and storage contention
  – Good candidate for aggregation, collective I/O, or burst buffering
• Common pattern in emerging AI/ML workloads
• Example:
  – Scale: 6,138 processes
  – Run time: 6.5 hours
  – Avg. I/O time per process: 27 minutes
  – 1.3 TiB of file data
  – 500+ TiB read!
Example: Small writes to shared files
• Scenario: small writes can contribute to poor performance
  – Particularly when writing to shared files
  – Candidates for collective I/O or batching/buffering of write operations
• Example:
  – Issued 5.7 billion writes to shared files, each less than 100 bytes in size
  – Averaged just over 1 MiB/s per process during the shared write phase
Example: Excessive metadata overhead
• Scenario: most I/O time spent on metadata ops (open, stat, etc.)
  – close() cost can be misleading due to write-behind cache flushing!
  – Remedy: coalesce files to eliminate extra metadata calls
• Example:
  – Scale: 40,960 processes, >20% of time spent in I/O
  – 99% of I/O time in metadata operations
  – Generated 200,000+ files with 600,000+ write and 600,000+ stat calls
Available Darshan Analysis Tools
• Docs: http://www.mcs.anl.gov/research/projects/darshan/docs/darshan-util.html
• Officially supported tools
  – darshan-job-summary.pl: creates a PDF with graphs for initial analysis
  – darshan-summary-per-file.sh: similar to the above, but produces a separate PDF summary for every file opened by the application
  – darshan-parser: dumps all information into text format (see the example after this list)
• Third-party tools
  – darshan-ruby: Ruby bindings for the darshan-util C library
    https://xgitlab.cels.anl.gov/darshan/darshan-ruby
  – HArshaD: easily find and compare Darshan logs
    https://kaust-ksl.github.io/HArshaD/
  – pytokio: detect slow Lustre OSTs, create Darshan scoreboards, etc.
    https://pytokio.readthedocs.io/
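As a quick sketch of darshan-parser (log name hypothetical; POSIX_BYTES_WRITTEN is one of Darshan's standard POSIX module counters):

$ darshan-parser glock_myapp_id1149144.darshan | grep POSIX_BYTES_WRITTEN
$ # prints that counter's value for each file record the app touched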
Measuring I/O performance
Fine-tuning performance requires benchmarking
• Darshan tells you what your application is trying to do
• A lot more is happening that Darshan cannot (is not allowed to) see
• Benchmarking your I/O pattern is often a necessary part of optimization
[Figure: the Linux Storage Stack Diagram. Source: Thomas Krenn, https://www.thomas-krenn.com/en/wiki/Linux_Storage_Stack_Diagram, CC BY-SA 3.0]
IOR – A tool for measuring storage system performance
• MPI application benchmark that reads and writes data in configurable ways
• I/O pattern can be interleaved or random
• Input: desired transfer sizes, block sizes, and segment counts
• Output: bandwidth and IOPS
• Configurable backends
  – POSIX, STDIO, MPI-IO
  – HDF5, PnetCDF, RADOS
https://github.com/hpc/ior/releases
First attempt at benchmarking an I/O pattern
• On a 700 GB/sec Lustre file system (Cori)
• 4 nodes, 16 ppn
• Performance makes no sense:
  – write performance is awful
  – read performance is phenomenal

$ mpirun -n 64 ./ior -t 1m -b 16m -s 256
...
Max Write: 450.54 MiB/sec (472.43 MB/sec)
Max Read: 27982.41 MiB/sec (29341.68 MB/sec)
Try breaking up output into multiple files
• IOR provides the -F option to make each rank read/write its own file
  – Reduces lock contention within the file
  – Can cause significant metadata load at scale
• Problem: 250 GB/sec from 4 nodes is faster than light

$ mpirun -n 64 ./ior -t 1m -b 1m -s 256 -F
...
Max Write: 13887.92 MiB/sec (14562.54 MB/sec)
Max Read: 259730.43 MiB/sec (272347.09 MB/sec)
Effect of page cache when measuring I/O bandwidth
• Linux uses otherwise-unused compute node memory to cache file contents
  – Write-back caching for small-write performance
  – Read caching for re-read data
• Easy to accidentally measure memory bandwidth, not file system bandwidth!
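On a dedicated test machine you can flush and drop the page cache between the write and read phases to sidestep this trap; note this requires root and is generally not available to users on shared HPC compute nodes:

$ sync                                        # flush dirty pages to storage
$ echo 3 | sudo tee /proc/sys/vm/drop_caches  # drop page cache, dentries, and inodes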
Avoid reading back what you just wrote using shifts
• IOR provides -C to shift MPI ranks by one node before reading back, so each rank reads data that a different node wrote
• Read performance now looks reasonable
• But what about the write cache?

$ mpirun -n 64 ./ior -t 1m -b 1m -s 256 -F -C
...
Max Write: 13398.16 MiB/sec (14048.98 MB/sec)
Max Read: 6950.81 MiB/sec (7288.45 MB/sec)
Flushing write-back cache to include time-to-persistence
• IOR provides the -e option to force fsync(2) and write back all dirty pages
• Measures time to write data to durable media, not just to the page cache
• Without fsync, the close(2) operation may include hidden sync time

$ mpirun -n 64 ./ior -t 1m -b 1m -s 256 -F -C -e
...
Max Write: 12289.16 MiB/sec (12886.11 MB/sec)
Max Read: 6274.31 MiB/sec (6579.09 MB/sec)
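The same effect is easy to demonstrate on a single node with plain dd (file name and sizes arbitrary):

$ dd if=/dev/zero of=testfile bs=1M count=1024             # may report near-memory speed
$ dd if=/dev/zero of=testfile bs=1M count=1024 conv=fsync  # fsyncs before exiting, so the flush is included in the reported rate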
How well does single-file I/O perform now?
$ mpirun -n 64 ./ior -t 1m -b 1m -s 256 -C -e
...
Max Write: 455.49 MiB/sec (477.62 MB/sec)
Max Read: 2317.75 MiB/sec (2430.33 MB/sec)

Don’t forget to set Lustre striping!

$ lfs setstripe -c 8 .
$ mpirun -n 64 ./ior -t 1m -b 1m -s 256 -C -e
...
Max Write: 2234.91 MiB/sec (2343.47 MB/sec)
Max Read: 4451.02 MiB/sec (4667.24 MB/sec)
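It is worth confirming the layout actually took effect; the exact output format varies by Lustre version:

$ lfs getstripe -d .   # show this directory's default stripe count and size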
Corollaries of I/O benchmarking
• Understand cache effects
  – caching can compensate for problematic I/O patterns at small scale
  – it tends not to behave well at large scale
• Utilize caches when it makes sense
  – small but sequential writes
  – read-then-re-read workloads (e.g., the BLAST bioinformatics app)
Effect of misaligned I/O
• Common to write a small, fixed-size header and big blobs of data
• Does not map well to parallel file systems
[Figure: Rank 0 writes a small header at offset 0, so every rank's 1 MiB block is shifted off the 1 MiB stripe boundaries and straddles two OSTs.]
Effect of misaligned I/O
• Parallel HDF5 and PnetCDF allow you to adjust alignment
  – PnetCDF: the nc_header_align_size and nc_var_align_size hints
  – HDF5: the H5Pset_alignment operation
• Consult the user docs for the best alignment (see the example below)
[Figure: unaligned vs. aligned layouts; padding after the header keeps each subsequent write within a single OST.]
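For PnetCDF, many builds also let you inject these hints from the environment without touching code. A sketch, assuming your PnetCDF version supports the PNETCDF_HINTS variable (check your version's documentation) and a hypothetical application binary:

$ export PNETCDF_HINTS="nc_header_align_size=1048576;nc_var_align_size=1048576"
$ mpirun -n 128 ./my_pnetcdf_app   # hints take effect when the file is created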
Effect of misaligned I/O
$ mpirun -n 128 ./ior -t 1m -b 1m -s 256 -a HDF5 -w
...
Max Write: 785.43 MiB/sec (823.58 MB/sec)

$ mpirun -n 128 ./ior -t 1m -b 1m -s 256 \
    -a HDF5 -w -O setAlignment=1M
...
Max Write: 1318.53 MiB/sec (1382.58 MB/sec)
Understanding I/O Behavior and Performance in Complex Storage Systems
“I observed performance XYZ. Now what?”
• A climate vs. weather analogy: it is snowing outside. Is that normal?
• You need context to know:
  – Does it ever snow there?
  – What time of year is it?
  – What was the temperature yesterday?
  – Do your neighbors see snow too?
  – Should you look at it first-hand?
• It is similarly difficult to understand a single application performance measurement without broader context: how do we differentiate the typical I/O climate from extreme I/O weather events?
“I observed performance XYZ. Now what?”
• If Darshan doesn't flag any problems, what next?
[Figure: Darshan covers the compute nodes (CN), but the I/O nodes (ION), storage fabric, and storage nodes (SN) are all question marks.]
“I observed performance XYZ. Now what?”
• If Darshan doesn't flag any problems, what next?
• Need a big-picture view
• Many component-level tools exist: LDMS, LMT, UFM, mmperfmon, DDNtool, smartctl, blktrace
• Each uses its own data formats, resolutions, and scopes
[Figure: component-level tools scattered across the compute nodes (CN), I/O nodes (ION), storage fabric, and storage nodes (SN).]
“I observed performance XYZ. Now what?”
• Integrate, correlate, and analyze I/O behavior from the system as a whole for holistic understanding
• Total Knowledge of I/O (TOKIO) is designed to address this
[Figure: Darshan application characterization plus LDMS, LMT, UFM, mmperfmon, DDNtool, smartctl, and blktrace feeds combined into one holistic I/O view spanning compute nodes, I/O nodes, the storage fabric, and storage nodes.]
TOKIO approach to I/O performance analysis
• Integrate existing instrumentation tools
• Index data sources in their native format
  – Tools to align and link data sets
  – Connectors to produce coherent views on demand
• Develop analysis methods that integrate data
• Produce tools that share a common interface and data format
https://www.nersc.gov/research-and-development/tokio/
Example: Unified Monitoring and Metrics Interface (UMAMI)
• Pluggable dashboard that displays the I/O performance of an app in the context of other system telemetry and historical records
[Figure: UMAMI dashboard. Each metric is shown in a separate row; historical samples for a given application are plotted over time; box plots relate the current values to the overall variance.]
UMAMI Example: What Made Performance Increase?
• Performance for this job is 2.5× higher than usual
• System background load is typical
• Server CPU load is low after a long-term steady climb
• Corresponds to a data purge that freed up disk blocks
Example: Identifying Straggling OSTs
• TOKIO utility to generate heatmaps of parallel file system servers
[Figure: per-OST write heatmap. Most servers wrote all their data between 3:42 and 3:44 at ~100 GiB/sec, but OST0015 didn't finish writing until 3:49 and caused a 3× slowdown!]
Example: TOKIO darshan_bad_ost to find stragglers
• Combine file:OST mappings with file:performance mappings in Darshan logs
• Correlate slow files with slow OSTs over any number of Darshan logs
$ darshan_bad_ost ./glock_ior_id1149144_4-9-65382-13265564257186991065_1.darshan
OST Name    Correlation    P-Value
OST#21      -0.703         2.473e-305   <-- this OST looks unhealthy
OST#8       -0.102         4.039e-06
OST#7       -0.067         0.00246
...
https://pytokio.readthedocs.io/en/latest/api/tokio.cli.darshan_bad_ost.html
I/O understanding in complex systems takeaway
• Complex storage architectures require a holistic approach to performance analysis
  – I/O problems are sometimes out of users’ control!
  – The more storage tiers, the more things that can go wrong
• Software stacks to address this complexity are emerging
  – Darshan for the application and runtime library
  – Implementation-specific tools for the file system and storage devices
  – TOKIO to connect everything together
• For more information, see:
  – https://www.mcs.anl.gov/research/projects/darshan/
  – https://www.nersc.gov/research-and-development/tokio/
[Figure: the I/O software stack: Application, Data Model Support, Transformations, Parallel File System, I/O Hardware.]
Thank you!
exascaleproject.org