xCAT HPC OE Lab

Linux Cluster Production Readiness
Egan Ford
IBM
egan@us.ibm.com
egan@sense.net
Agenda
• Production Readiness
• Diagnostics
• Benchmarks
• STAB
• Case Study
• SCAB
What is Production Readiness?
• Production readiness is a series of tests to
help determine if a system is ready for use.
• Production readiness falls into two
categories:
– diagnostic
– benchmark
• The purpose is to confirm that all hardware
is good and identical (per class).
• The search for consistency and
predictability.
What are diagnostics?
• Diagnostic tests are usually pass/fail and
include but are not limited to
– simple version checks
• OS, BIOS versions
– inventory checks
• Memory, CPU, etc…
– configuration checks
• Is HT off?
– vendor supplied diagnostics
• DOS on a CD
Why benchmark?
• Diagnostics are usually pass/fail.
– Thresholds may be undocumented.
– ‘Why’ is difficult to answer.
• Diagnostics may be incomplete.
– They may not test all subsystems.
• Other issues with diagnostics:
– False positives.
– Inconsistent from vendor to vendor.
– Do no real work, cannot check for accuracy.
– Usually hardware based.
• What about software?
• What about the user environment?
Why benchmark?
• Benchmarks can be checked for
accuracy.
• Benchmarks can stress all used
subsystems.
• Benchmarks can stress all used
software.
• Benchmarks can be measured and you
can determine the thresholds.
Benchmark or diagnostics?
• Do both.
• All diagnostics should pass first.
• Benchmarks will be inconsistent if
diagnostics fail.
WARNING!
• The following slides will contain the
word ‘statistics’.
• Statistics cannot prove anything.
• Exercise commonsense.
A few words on statistics
• Statistics increases human knowledge
through the use of empirical data.
• “There are three kinds of lies: lies,
damned lies and statistics.”
-- Benjamin Disraeli (1804-1881)
• “There are three kinds of lies: lies,
damned lies and linpack.”
What is STAB?
• STatistical Analysis of Benchmarks
• A systematic way of running a series of
increasingly complex benchmarks to find
avoidable inconsistencies.
• Avoidable inconsistencies may lead to
performance problems.
• GOAL: consistent, repeatable,
accurate results.
What is STAB?
• Each benchmark is run one or more times per
node, then the best representative of each node
(ignored for multi-node tests) is grouped together
and analyzed as a single population. The results
are not as interesting as the shape of the
distribution of the results. Empirical evidence for
all the benchmarks in the STAB HOWTO suggests
that they should all form a normal distribution.
• A normal distribution is the classic bell curve that
appears so frequently in statistics. It is the sum of
smaller, independent (may be unobservable),
identically-distributed variables or random events.
Uniform Distribution
• Plot below is of 20000 random dice.
Normal Distribution
• Sum of 5 dice thrown 10000 times.
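The two dice plots can be reproduced with a short simulation. The sketch below is not part of the STAB toolkit; the function names and the fixed seed are illustrative assumptions. It tallies 20000 single dice (roughly flat) and 10000 sums of 5 dice (roughly bell shaped) and prints each tally as a text histogram.

```python
import random
from collections import Counter


def tally(num_dice, throws, seed=0):
    """Throw `num_dice` dice `throws` times and count each resulting sum."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return Counter(
        sum(rng.randint(1, 6) for _ in range(num_dice)) for _ in range(throws)
    )


def text_histogram(counts, width=50):
    """Render a Counter as one bar of '#' characters per observed value."""
    peak = max(counts.values())
    lines = []
    for value in sorted(counts):
        bar = "#" * round(width * counts[value] / peak)
        lines.append(f"{value:3d} {counts[value]:6d} {bar}")
    return "\n".join(lines)


single = tally(1, 20000)   # uniform: every face is about equally likely
five = tally(5, 10000)     # sum of 5 dice: piles up around 17 and 18

print(text_histogram(single))
print()
print(text_histogram(five))
```

The single-die bars come out nearly equal in length, while the five-dice bars rise toward the middle of the 5 to 30 range and fall away at both ends.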
Normal Distribution
• Benchmarks also have many small independent
(may be unobservable) identically-distributed
variables that may affect performance, e.g.:
– Competing processes
– Context switching
– Hardware interrupts
– Software interrupts
– Memory management
– Process/Thread scheduling
– Cosmic rays
• The above may be unavoidable, but is in part the
source of a normal distribution.
Non-normal Distribution
• Benchmarks may also have non-identically-distributed observable variables
that may affect performance, e.g.:
– Memory configuration
– BIOS Version
– Processor speed
– Operating system
– Kernel type (e.g. NUMA vs SMP vs UNI)
– Kernel version
– Bad memory (e.g. excessive ECCs)
– Chipset revisions
– Hyper-Threading or SMT
– Non-uniform competing processes (e.g. httpd running on some nodes, but not others)
– Shared library versions
– Bad cables
– Bad administrators
– Users
• The above is avoidable and is the purpose of the STAB HOWTO. Avoidable
inconsistencies may lead to multimodal or non-normal distributions.
STAB Toolkit
• The STAB Tools are a collection of scripts to help
run selected benchmarks and to analyze their
results.
– Some of the tools are specific to a particular benchmark.
– Others are general and operate on the data collected by the
specific tools.
• Benchmark specific tools consist of benchmark
launch scripts, accuracy validation scripts,
miscellaneous utilities, and analysis scripts to
collect the data, report some basic descriptive
statistics, and create input files to be used with
general STAB tools for additional analysis.
STAB Toolkit
• With a goal of consistent, repeatable, accurate
results it is best to start with as few variables as
possible. Start with single node benchmarks, e.g.,
STREAM. If all machines have similar STREAM
results, then memory can be ruled out as a factor
with other benchmark anomalies. Next, work your
way up to processor and disk benchmarks, then
two node (parallel) benchmarks, then multi-node
(parallel) benchmarks. After each more
complicated benchmark, run a check for consistent,
repeatable, accurate results before continuing.
The STAB Benchmarks
• Single Node (serial) Benchmarks:
– STREAM (memory MB/s)
– NPB Serial (uni-processor FLOP/s and memory)
– NPB OpenMP (multi-processor FLOP/s and memory)
– HPL MPI Shared Memory (multi-processor FLOP/s and memory)
– IOzone (disk MB/s, memory, and processor)
• Parallel Benchmarks (for MPI systems only):
– Ping-Pong (interconnect µsec and MB/s)
– NAS Parallel (multi-node FLOP/s, memory, and
interconnect)
– HPL Parallel (multi-node FLOP/s, memory, and
interconnect)
Getting STAB
• http://sense.net/~egan/bench
– bench.tgz
• Code with source (all script)
– bench-oss.tgz
• OSS code (e.g. Gnuplot)
– bench-examples.tgz
• 1GB of collected data (all text, 186000+ files)
– stab.pdf (currently 150 pages)
• Documentation (WIP, check back before 11/30/2005)
Install STAB
• Extract bench*.tgz into home directory:
cd ~
tar zxvf bench.tgz
tar zxvf bench-oss.tgz
tar zxvf bench-examples.tgz
• Add STAB tools to PATH:
export PATH=~/bench/bin:$PATH
• Append to .bashrc:
export PATH=~/bench/bin:$PATH
Install STAB
• STAB requires Gnuplot 4 and it must be built a
specific way:
cd ~/bench/src
tar zxvf gnuplot-4.0.0.tar.gz
cd gnuplot-4.0.0
./configure --prefix=$HOME/bench --enable-thin-splines
make
make install
STAB Benchmark Tools
• Each benchmark supported in this document contains an anal (short for
analysis) script. This script is usually run from an output directory, e.g.:
cd ~/bench/benchmark/output
../anal
benchmark    nodes     low    high      %    mean  median  std dev
bt.A.i686        4  615.77  632.08   2.65  627.85  632.02     8.06
cg.A.i686        4  159.78  225.08  40.87  191.05  193.16    26.86
ep.A.i686        4   11.51   11.53   0.17   11.52   11.52     0.01
ft.A.i686        4  448.05  448.90   0.19  448.63  448.81     0.39
lu.A.i686        4  430.60  436.59   1.39  433.87  434.72     2.51
mg.A.i686        4  468.12  472.54   0.94  470.86  472.12     2.00
sp.A.i686        4  449.01  449.87   0.19  449.58  449.72     0.39
• The anal scripts produce statistics about the results to help find
anomalies. The theory is that if you have identical nodes then you
should be able to obtain identical results (not always true). The anal
scripts will also produce plot.* files for use by dplot to graphically
represent the distribution of the results, and by cplot to plot 2D
correlations.
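The columns in the anal output are ordinary descriptive statistics. The sketch below is not the anal script itself; it assumes the % column is the spread from low to high relative to low, which matches the figures in this document (615.77 to 632.08 gives 2.65), and it assumes a sample standard deviation. The per-node values are made up for illustration.

```python
import statistics


def describe(results):
    """Summarize one benchmark's per-node results in the style of the anal columns."""
    low, high = min(results), max(results)
    return {
        "nodes": len(results),
        "low": low,
        "high": high,
        "%": (high - low) / low * 100,   # assumed definition of the % column
        "mean": statistics.mean(results),
        "median": statistics.median(results),
        "std dev": statistics.stdev(results),  # sample std dev (assumption)
    }


# Hypothetical best-of-run results (MB/s) for four nodes.
per_node = [2031.4, 2069.0, 2077.0, 2148.0]
summary = describe(per_node)
print("  ".join(f"{name}={value:.2f}" for name, value in summary.items()))
```

A mean and median that sit close to one extreme, or a large % next to a small std dev, are hints that one or a few nodes are outliers.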
Rant: % vs. normal distribution
• % is good?
– % variability can tell you something about the
data with respect to itself without knowing
anything about the data
– It is non-dimensional with a range (usually 0-100)
that has meaning to anyone.
– IOW, management understands percentages.
• % is not good?
– It minimizes the amount of useful empirical data.
– It hides the truth.
% is not good, exhibit A
• Clearly this is a normal distribution, but the variability is 500%. This is an
extreme case where all the possible values exist for a predetermined range.
% is not good, exhibit B
• Low variability can hide a skewed distribution. Variability is low,
only 1.27%. But the distribution is clearly skewed to the right.
% is not good, exhibit C
• A 5.74% variability hides a bimodal distribution. Bimodal distributions are clear
indicators that there is an observable difference between two different sets of
nodes.
STAB General Analysis Tools
• dplot is for plotting distributions.
– All the graphical output used as illustrations in this document up to this
point was created with dplot.
– dplot provides a number of options for binning the data and analyzing the
distribution.
• cplot is for correlating the results between two different sets of results.
– E.g., does poor memory performance correlate to poor application
performance?
• danal is very similar to the output provided by the custom anal scripts
provided with each benchmark, but has additional output options.
– You can safely discard any anal screen output because it can be recreated
with danal and the resulting plot.benchmark file.
• Each script will require one or more plot.benchmark files.
– dplot and danal are less strict and will work with any file of numbers as
long as the numbers are in the first column; subsequent columns are
ignored.
– cplot however requires the 2nd column; it is impossible to correlate two
sets of results without an index.
dplot
• The first argument to dplot must be the number of bins,
auto, or whole. auto (or a) will use the square root of the
number of results to determine the bin sizes and is usually
the best place to start. whole (or w) should only be used if
your results are whole numbers and if the data contains all
possible values between low and high. This is only useful
for creating plots like the dice examples at the beginning of
this document.
• The second argument is the plotfile. The plotfile must
contain one value per line in the first column, subsequent
columns are ignored. The order of the data is unimportant.
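The auto binning rule is easy to try by hand. The sketch below is an illustration only, not dplot's code: it assumes the square root is rounded to the nearest whole number and that bins are of equal width between low and high. All function names are made up for the example.

```python
import math


def read_first_column(lines):
    """Take the first column of each non-empty line; later columns are ignored."""
    return [float(line.split()[0]) for line in lines if line.strip()]


def auto_bins(values):
    """Square-root rule: about sqrt(n) bins for n results (rounding is an assumption)."""
    return max(1, round(math.sqrt(len(values))))


def histogram(values, bins):
    """Count values into `bins` equal-width bins spanning low..high."""
    low, high = min(values), max(values)
    width = (high - low) / bins or 1.0   # avoid zero width if all values are equal
    counts = [0] * bins
    for v in values:
        index = min(int((v - low) / width), bins - 1)  # the high value lands in the last bin
        counts[index] += 1
    return low, width, counts


# Hypothetical plotfile lines: value first, node name second.
values = read_first_column(["2031.4 node001", "2069.0 node002",
                            "2077.0 node003", "2148.0 node004"])
low, width, counts = histogram(values, auto_bins(values))
for i, count in enumerate(counts):
    print(f"{low + i * width:8.1f} {count}")
```

With 484 results, one per node in the case study later in this document, the rule gives 22 bins.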
dplot a numbers.1000
dplot a numbers.1000 -n
dplot 19 numbers.1000 -n
dplot a plot.c.ppc64 -bi
dplot a plot.c.ppc64 -bi -std
dplot a plot.c.ppc64 -text
[Text-mode histogram of plot.c.ppc64: x-axis from 2023 to 2156, left axis from 0 to 108, right axis from 0.00 to 0.22]
GUI vs Text
dplot a plot.c_omp.ppc64 -n -chi
chi-squared and scale
Abusing chi-squared
$ findn plot.c_omp.ppc64
X^2: 26.75, scale: 0.43, bins: 21, normal distribution probability: 14.30%
X^2: 13.29, scale: 0.25, bins: 12, normal distribution probability: 27.50%
X^2: 24.34, scale: 0.45, bins: 22, normal distribution probability: 27.70%
X^2: 22.04, scale: 0.41, bins: 20, normal distribution probability: 28.20%
X^2: 4.65, scale: 0.12, bins: 6, normal distribution probability: 46.00%
X^2: 8.68, scale: 0.21, bins: 10, normal distribution probability: 46.70%
X^2: 16.79, scale: 0.37, bins: 18, normal distribution probability: 46.90%
X^2: 12.52, scale: 0.29, bins: 14, normal distribution probability: 48.50%
X^2: 16.77, scale: 0.39, bins: 19, normal distribution probability: 53.90%
X^2: 8.55, scale: 0.23, bins: 11, normal distribution probability: 57.50%
X^2: 12.33, scale: 0.31, bins: 15, normal distribution probability: 58.00%
X^2: 13.25, scale: 0.33, bins: 16, normal distribution probability: 58.30%
X^2: 2.84, scale: 0.1, bins: 5, normal distribution probability: 58.40%
X^2: 10.22, scale: 0.27, bins: 13, normal distribution probability: 59.70%
X^2: 6.27, scale: 0.19, bins: 9, normal distribution probability: 61.70%
X^2: 1.36, scale: 0.08, bins: 4, normal distribution probability: 71.60%
X^2: 11.28, scale: 0.35, bins: 17, normal distribution probability: 79.20%
X^2: 3.36, scale: 0.17, bins: 8, normal distribution probability: 85.00%
X^2: 2.27, scale: 0.14, bins: 7, normal distribution probability: 89.30%
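The listing shows how strongly a chi-squared result depends on the number of bins chosen. The sketch below shows one common way to compute such a statistic against a fitted normal curve. It is not the findn or dplot implementation: the equal-width bins, the open-ended first and last bins, and the function names are assumptions, and it stops at the statistic without converting it to a probability. It needs data with some spread (at least two distinct values).

```python
import math
import random
import statistics


def normal_cdf(x, mu, sigma):
    """Cumulative probability of a normal distribution at x."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def chi_squared_vs_normal(values, bins):
    """Chi-squared statistic of binned values against a normal fit.

    The normal curve uses the sample mean and sample std dev, so the usual
    degrees of freedom would be bins - 3.
    """
    n = len(values)
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)
    low, high = min(values), max(values)
    width = (high - low) / bins
    observed = [0] * bins
    for v in values:
        observed[min(int((v - low) / width), bins - 1)] += 1
    chi2 = 0.0
    for i, obs in enumerate(observed):
        # The outer bins are open ended so the expected counts add up to n.
        left = -math.inf if i == 0 else low + i * width
        right = math.inf if i == bins - 1 else low + (i + 1) * width
        expected = n * (normal_cdf(right, mu, sigma) - normal_cdf(left, mu, sigma))
        if expected > 0:
            chi2 += (obs - expected) ** 2 / expected
        elif obs:
            return math.inf  # data where the fitted curve says none can be
    return chi2


# Synthetic, roughly normal sample; the same data gives a different
# statistic for every bin count.
rng = random.Random(1)
sample = [rng.gauss(2050.0, 23.0) for _ in range(484)]
for bins in (5, 10, 22):
    print(bins, round(chi_squared_vs_normal(sample, bins), 2))
```

Trying many bin counts and reporting only the most flattering one is exactly the abuse the slide title warns about.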
Abusing chi-squared
cplot
• cplot or correlation plot is a perl front-end to Gnuplot to graphically
represent the correlation between any two sets of indexed numbers.
• Correlation measures the relationship between two sets of results, e.g.
processor performance and memory throughput.
• Correlations are often expressed as a correlation coefficient; a
numerical value with a range from -1 to +1.
• A positive correlation would indicate that if one set of results increased,
the other set would increase, e.g. better memory throughput increases
processor performance.
• A negative correlation would indicate that if one set of results
increases, the other set would decrease, e.g. better processor
performance decreases latency.
• A correlation of zero would indicate that there is no relationship at all,
IOW, they are independent.
• Any two sets of results with a non-zero correlation are considered
dependent; however, a check should be performed to determine if a
dependent set of results is statistically significant.
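As a rough illustration of the bullets above, the sketch below joins two indexed result sets on their index (the second column) and computes the Pearson correlation coefficient, plus a t statistic that could feed a significance check. It is not cplot's code; the function names and the sample data are made up, and how the STAB tools test significance is not assumed here.

```python
import math


def read_indexed(lines):
    """Parse 'value index' lines into {index: value}."""
    table = {}
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:
            table[fields[1]] = float(fields[0])
    return table


def pearson(xs, ys):
    """Pearson correlation coefficient, in the range -1 to +1."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def correlate(a, b):
    """Correlate two indexed sets on their shared indexes only."""
    common = sorted(set(a) & set(b))
    r = pearson([a[k] for k in common], [b[k] for k in common])
    n = len(common)
    remainder = 1.0 - r * r
    # t statistic with n - 2 degrees of freedom; infinite for a perfect fit.
    t = math.copysign(math.inf, r) if remainder <= 0 else r * math.sqrt((n - 2) / remainder)
    return r, t, n


# Hypothetical memory and application results for the same nodes.
memory = read_indexed(["2031.4 node001", "2069.0 node002",
                       "2077.0 node003", "2148.0 node004"])
application = read_indexed(["40.9 node001", "41.4 node002",
                            "41.9 node003", "45.3 node004", "99.0 node999"])
r, t, n = correlate(memory, application)
print(f"r={r:.2f} t={t:.2f} n={n}")
```

The index matters: node999 appears in only one set, so it is dropped rather than paired with an unrelated value.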
cplot
• A strong correlation between two sets of results
should produce more questions, not quick
answers.
• It is possible for two unrelated results to have a
strong correlation because they share something
in common.
– E.g. You can show a positive correlation with the sales of
skis and snowboards. It is unlikely that increased ski sales
increased snowboard sales; the most likely cause is an
increase in the snow depth (or a decrease in temperature) at
your local resort, i.e., something that is in common. The
correlation is valid, but it does not prove the cause of the
correlation.
cplot plot.c.ppc64 plot.cg.B.ppc64
cplot plot.c.ppc64 plot.mg.B.ppc64
Correlation of temperature to memory
performance
Correlation of 100 random numbers
Statistical Significance
Statistical Significance
Case Study
• 484 JS20 blades
– dual PPC970
– 2GB RAM
• Myrinet D
– Full Bisection Switch
• Cisco GigE
– 14:1 oversubscribed
Diagnostics
• Vendor supplied (passed)
• BIOS versions (failed)
• Inventory
– Number of CPUs (passed)
– Total Memory (failed)
• OS/Kernel Versions (passed)
BIOS Versions (failed)
• All nodes but node443 have BIOS dated
10/21/2004. node443 is dated 09/02/2004.
• Inconsistent BIOS versions can affect
performance.
Command output:
# rinv compute all | tee /tmp/foo
# cat /tmp/foo | grep BIOS | awk '{print $4}' | sort | uniq
09/02/2004
10/21/2004
# cat /tmp/foo | grep BIOS | grep 09/02/2004
node433: VPD BIOS: 09/02/2004
Memory quantity (failed)
• All nodes except node224 have 2GB
RAM.
Command output:
# psh compute free | grep Mem | awk '{print $3}' | sort | uniq
1460116
1977204
1977208
# psh compute free | grep Mem | grep 1460116
node224: Mem: 1460116 ...
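Both inventory checks follow the same pattern: collect one value per node, find the most common value, and list the nodes that differ. The sketch below shows that pattern. It assumes each input line looks like the node433 line shown earlier (node name, colon, then fields with the value last); every node other than node433 is made up for the example.

```python
from collections import Counter


def parse_line(line):
    """Split 'node: ... value' into (node, value), taking the last field as the value."""
    node, _, rest = line.partition(":")
    return node.strip(), rest.split()[-1]


def minority_nodes(records):
    """Return the most common value and the (node, value) pairs that differ from it."""
    records = list(records)
    counts = Counter(value for _, value in records)
    majority = counts.most_common(1)[0][0]
    outliers = sorted((node, value) for node, value in records if value != majority)
    return majority, outliers


# Hypothetical inventory lines, except for the node433 line from the output above.
inventory = [
    "node001: VPD BIOS: 10/21/2004",
    "node002: VPD BIOS: 10/21/2004",
    "node433: VPD BIOS: 09/02/2004",
    "node484: VPD BIOS: 10/21/2004",
]
majority, outliers = minority_nodes(parse_line(line) for line in inventory)
print("expected:", majority)
for node, value in outliers:
    print("check:", node, value)
```

An exact match suits version strings. For memory totals, where the output above shows both 1977204 and 1977208 on healthy nodes, a tolerance would be needed before flagging a node.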
STREAM
• The STREAM benchmark is a simple
synthetic benchmark program that measures
sustainable memory bandwidth (in MB/s)
and the corresponding computation rate for
simple vector kernels.
• STREAM C, FORTRAN, and C OMP are run
10 times on each node, then the best result
from each node is taken to be used to
compare consistency. Each result is also
tested for accuracy.
STREAM validation results
• node483 failed the accuracy check on OMP
run 3 of 10. Try replacing memory,
processors, and then system board in that
order.
Command output:
# cd ~/bench/stream/output.raw
# ../checkresults
checking stream_c_omp.ppc64.node483.3...failed
STREAM consistency results
# cd ~/bench/stream/output
# ../anal
stream results
benchmark    nodes      low     high     %     mean   median  std dev
c.ppc64        484  2031.43  2147.98  5.74  2077.03  2069.02    23.20
c_omp.ppc64    484  1993.49  2124.24  6.56  2050.00  2050.51    22.86
f.ppc64        484  2007.16  2092.68  4.26  2039.20  2034.63    17.87
NAS Serial
• The NAS Parallel Benchmarks (NPB) are a small set of
programs designed to help evaluate the performance of
parallel supercomputers. The benchmarks, which are
derived from computational fluid dynamics (CFD)
applications, consist of five kernels and three pseudo-applications.
• The NAS Serial Benchmarks are the same as the NAS
Parallel Benchmarks except that MPI calls have been taken
out and they run on one processor.
• bt.B, cg.B, ep.B, ft.B, lu.B, mg.B, and sp.B are run 5 times
on each node, then the best result from each node is taken
to be used to compare consistency. Each result is also
tested for accuracy.
NAS Serial validation results
• node483 failed a number of tests. Try replacing memory,
processors, and then system board in that order.
Command output:
# cd ~/bench/NPB3.2/NPB3.2-SER/output.raw
# ../checkresults
checking bt.B.ppc64.node483.1...failed
checking bt.B.ppc64.node483.2...failed
checking bt.B.ppc64.node483.3...failed
checking bt.B.ppc64.node483.4...failed
checking bt.B.ppc64.node483.5...failed
checking cg.B.ppc64.node483.4...failed
checking ep.B.ppc64.node483.3...failed
checking ft.B.ppc64.node483.1...failed
checking ft.B.ppc64.node483.2...failed
checking ft.B.ppc64.node483.3...failed
checking ft.B.ppc64.node483.4...failed
checking lu.B.ppc64.node483.1...failed
checking mg.B.ppc64.node483.1...failed
checking mg.B.ppc64.node483.3...failed
checking sp.B.ppc64.node483.1...failed
checking sp.B.ppc64.node483.2...failed
checking sp.B.ppc64.node483.3...failed
checking sp.B.ppc64.node483.4...failed
checking sp.B.ppc64.node483.5...failed
NAS Serial consistency results
# cd ~/bench/NPB3.2/NPB3.2-SER/output
# ../anal
NPB Serial
benchmark   nodes      low     high      %     mean   median  std dev
bt.B.ppc64    484  1077.69  1099.28   2.00  1087.60  1087.67     4.67
cg.B.ppc64    484    40.93    45.30  10.68    41.94    41.38     1.31
ep.B.ppc64    484     9.88    10.07   1.92     9.96     9.96     0.04
ft.B.ppc64    484   480.87   503.33   4.67   487.07   486.23     3.71
lu.B.ppc64    484   516.88   579.25  12.07   543.08   542.88    12.46
mg.B.ppc64    484   618.16   654.23   5.84   638.31   638.85     6.76
sp.B.ppc64    484   530.48   556.67   4.94   541.01   540.77     3.99
How does memory correlate to
performance?
Statistically significant?
• Command output:
$ findc plot* | grep plot.c.ppc64
0.13   0.13  00  plot.bt.B.ppc64 plot.c.ppc64
0.62   0.62  00  plot.c.ppc64 plot.c_omp.ppc64
0.93   0.93  00  plot.c.ppc64 plot.cg.B.ppc64
0.19   0.19  00  plot.c.ppc64 plot.ep.B.ppc64
0.89   0.89  00  plot.c.ppc64 plot.f.ppc64
0.17   0.17  00  plot.c.ppc64 plot.ft.B.ppc64
0.11   0.11  02  plot.c.ppc64 plot.lu.B.ppc64
0.50  -0.50  00  plot.c.ppc64 plot.mg.B.ppc64
0.05  -0.05  27  plot.c.ppc64 plot.sp.B.ppc64
NAS OMP
• The NAS OpenMP Benchmarks are the
same as the NAS Parallel Benchmarks
except that the MPI calls have been
replaced with OpenMP calls to run on
multiple processors on a shared memory
system (SMP).
• bt.B, cg.B, ep.B, ft.B, lu.B, mg.B, and sp.B
are run 5 times on each node, then the best
result from each node is taken to be used to
compare consistency. Each result is also
tested for accuracy.
NAS OMP validation results
• node483 failed a number of tests. Try replacing
memory, processors, and then system board in that order.
Command output:
# cd ~/bench/NPB3.2/NPB3.2-OMP/output.raw
# ../checkresults
checking bt.B.ppc64.node483.1...failed
checking bt.B.ppc64.node483.2...failed
checking bt.B.ppc64.node483.3...failed
checking bt.B.ppc64.node483.4...failed
checking bt.B.ppc64.node483.5...failed
checking ft.B.ppc64.node483.1...failed
checking ft.B.ppc64.node483.2...failed
checking ft.B.ppc64.node483.3...failed
checking ft.B.ppc64.node483.4...failed
checking ft.B.ppc64.node483.5...failed
checking lu.B.ppc64.node483.1...failed
checking lu.B.ppc64.node483.3...failed
checking lu.B.ppc64.node483.4...failed
checking mg.B.ppc64.node483.1...failed
checking mg.B.ppc64.node483.2...failed
checking mg.B.ppc64.node483.3...failed
checking mg.B.ppc64.node483.4...failed
checking mg.B.ppc64.node483.5...failed
checking sp.B.ppc64.node483.1...failed
checking sp.B.ppc64.node483.2...failed
checking sp.B.ppc64.node483.3...failed
checking sp.B.ppc64.node483.4...failed
checking sp.B.ppc64.node483.5...failed
NAS OMP consistency results
# cd ~/bench/NPB3.2/NPB3.2-OMP/output
# ../anal
NPB OpenMP
benchmark   nodes      low     high      %     mean   median
bt.B.ppc64    484  1850.99  1898.65   2.57  1871.41  1870.45
cg.B.ppc64    484    67.31    73.30   8.90    68.96    68.44
ep.B.ppc64    484    19.69    20.36   3.40    19.88    19.88
ft.B.ppc64    484   593.39   615.77   3.77   604.74   604.61
lu.B.ppc64    484   739.30   820.71  11.01   773.09   772.05
mg.B.ppc64    484   751.40   819.38   9.05   792.03   797.10
sp.B.ppc64    484   722.73   824.39  14.07   745.99   747.33
How does memory correlate to
performance?
Statistically significant?
• Command output:
$ findc plot* | grep plot.f.ppc64
0.37   0.37  00  plot.bt.B.ppc64 plot.f.ppc64
0.89   0.89  00  plot.c.ppc64 plot.f.ppc64
0.64   0.64  00  plot.c_omp.ppc64 plot.f.ppc64
0.77   0.77  00  plot.cg.B.ppc64 plot.f.ppc64
0.07  -0.07  12  plot.ep.B.ppc64 plot.f.ppc64
0.20  -0.20  00  plot.f.ppc64 plot.ft.B.ppc64
0.29  -0.29  00  plot.f.ppc64 plot.lu.B.ppc64
0.81  -0.81  00  plot.f.ppc64 plot.mg.B.ppc64
0.65  -0.65  00  plot.f.ppc64 plot.sp.B.ppc64
$ findc plot* | grep plot.c_omp.ppc64
0.29   0.29  00  plot.bt.B.ppc64 plot.c_omp.ppc64
0.62   0.62  00  plot.c.ppc64 plot.c_omp.ppc64
0.54   0.54  00  plot.c_omp.ppc64 plot.cg.B.ppc64
0.03  -0.03  51  plot.c_omp.ppc64 plot.ep.B.ppc64
0.64   0.64  00  plot.c_omp.ppc64 plot.f.ppc64
0.06  -0.06  19  plot.c_omp.ppc64 plot.ft.B.ppc64
0.20  -0.20  00  plot.c_omp.ppc64 plot.lu.B.ppc64
0.56  -0.56  00  plot.c_omp.ppc64 plot.mg.B.ppc64
0.44  -0.44  00  plot.c_omp.ppc64 plot.sp.B.ppc64
HPL
• HPL is a software package that solves a (random)
dense linear system in double precision (64 bits)
arithmetic on distributed-memory computers. It can
thus be regarded as a portable as well as freely
available implementation of the High Performance
Computing Linpack Benchmark.
• xhpl is run 10 times on each node, then the best
result from each node is taken to be used to
compare consistency. Each result is also tested for
accuracy.
• NOTE: nodes 215 and 224 were excluded from
this test. node215 would not boot up. node224
only had 1.5GB of RAM. This test used 1.8GB
RAM.
HPL validation test
• node483 failed to pass any test. Try replacing
memory, processors, and then system board in
that order.
• Command output:
# cd ~/bench/hpl/output.raw.single
# ../checkresults
checking xhpl.ppc64.node483.1...failed
checking xhpl.ppc64.node483.10...failed
checking xhpl.ppc64.node483.2...failed
checking xhpl.ppc64.node483.3...failed
checking xhpl.ppc64.node483.4...failed
checking xhpl.ppc64.node483.5...failed
checking xhpl.ppc64.node483.6...failed
checking xhpl.ppc64.node483.7...failed
checking xhpl.ppc64.node483.8...failed
checking xhpl.ppc64.node483.9...failed
HPL consistency and correlation
# cd ~/bench/hpl/output
# ../anal
HPL results
benchmark   nodes    low   high     %   mean  median
xhpl.ppc64    482  11.62  12.04  3.61  11.89   11.89
Ping-Pong
• Ping-Pong is a simple benchmark that measures latency
and bandwidth for different message sizes.
• Ping-Pong benchmarks should be run for each network
(e.g. Myrinet and GigE). First run the serial Ping-Pongs
and then the parallel Ping-Pongs. The purpose of the
serial benchmarks is to find any single node or set of nodes
that is not performing as well as the other nodes. The
purpose of the parallel benchmarks is to help calculate
bisectional bandwidth and test that system wide MPI jobs
can be run.
• There are four patterns, 3 deterministic and 1 random. The
purpose for all four is to help isolate poor performing nodes
and possibly poor performing routes or trunks (e.g. bad
uplink cable).
Ping-Pong
• Sorted
Ping-Pong
• Cut
Ping-Pong
• Fold
Myrinet consistency check
# cd ~/bench/PMB2.2.1/output.gm
# ../anal spp sort bw
spp sort bw results
bytes    pairs    low    high       %    mean  median  std dev
1          242   0.08    0.11   37.50    0.11    0.11     0.00
...
4194304    242  87.62  234.93  168.12  232.49  233.43     9.38
# ../anal spp cut bw
...
4194304    242  87.13  234.99  169.70  232.16  233.15     9.40
# ../anal spp fold bw
...
4194304    242  87.17  235.04  169.63  232.13  233.16     9.39
# ../anal spp shuffle bw
...
4194304    242  87.61  234.77  167.97  232.14  232.70     9.36
For the 4194304-byte results, the mean and median are very close together
and also close to the high, indicating one or a few nodes with poor
performance.
Myrinet consistency
# head -5 plot.spp.*.bw.4194304
==> plot.spp.cut.bw.4194304 <==
87.13 node164-node406
230.95 node107-node349
231.36 node147-node389
231.41 node091-node333
231.43 node045-node287
==> plot.spp.fold.bw.4194304 <==
87.17 node079-node406
227.58 node214-node271
229.34 node010-node475
231.40 node091-node394
231.48 node177-node308
==> plot.spp.shuffle.bw.4194304 <==
87.61 node024-node406
231.47 node091-node166
231.51 node227-node003
231.55 node110-node293
231.57 node013-node231
==> plot.spp.sort.bw.4194304 <==
87.62 node405-node406
228.64 node039-node040
231.64 node231-node232
231.66 node091-node092
231.66 node481-node482
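The slowest pair in each of the four patterns shares one node. A small count makes that explicit; the sketch below uses the four slowest pairs from the output above, and the variable names are illustrative.

```python
from collections import Counter

# Slowest pair per pattern, taken from the head output above.
slowest = {
    "cut": "node164-node406",
    "fold": "node079-node406",
    "shuffle": "node024-node406",
    "sort": "node405-node406",
}

# Count how often each node appears across the slowest pairs.
counts = Counter(node for pair in slowest.values() for node in pair.split("-"))
suspect, hits = counts.most_common(1)[0]
print(f"{suspect} appears in {hits} of {len(slowest)} slowest pairs")
```

A node that shows up with a different partner in every pattern is far more likely to be the problem than any of its partners.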
Bisectional Bandwidth
ppp cut bw results
bytes    pairs    low    high       %    mean  median  std dev
4194304    242  60.28  233.44  287.26  138.94  137.92    36.87
Demonstrated BW = 242 * 138.94 = 33623.48 MB/s ~= 32.8 GB/s (262.4 Gb/s)
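The demonstrated bandwidth figure is the number of concurrent pairs times the mean per-pair bandwidth. The sketch below repeats that arithmetic, assuming 1024 MB per GB as the slide's conversion implies. The function name is made up for the example.

```python
def demonstrated_bandwidth(pairs, mean_mb_per_s):
    """Aggregate bandwidth of all pairs running at once, in MB/s, GB/s and Gb/s."""
    total_mb = pairs * mean_mb_per_s
    total_gb = total_mb / 1024       # MB/s to GB/s, as on the slide
    total_gbit = total_gb * 8        # bytes to bits
    return total_mb, total_gb, total_gbit


# Myrinet: 242 pairs at a mean of 138.94 MB/s.
mb, gb, gbit = demonstrated_bandwidth(242, 138.94)
print(f"Myrinet: {mb:.2f} MB/s = {gb:.1f} GB/s = {gbit:.1f} Gb/s")
```

The slide's Gb/s figure is slightly lower (262.4 instead of about 262.7) because it multiplies the already rounded 32.8 GB/s by 8.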
IP consistency check
# cd ~/bench/PMB2.2.1/output.ip
# ../anal spp sort bw
spp sort bw results
bytes    pairs    low    high       %   mean  median  std dev
1          241   0.01    0.01    0.00   0.01    0.01     0.00
...
4194304    241  60.76  101.76   67.48  99.91  100.26     3.53
# ../anal spp cut bw
...
4194304    241  45.54   89.88   97.36  86.96   88.60     6.58
# ../anal spp fold bw
...
4194304    241  50.91  100.60   97.60  87.33   88.48     6.30
# ../anal spp shuffle bw
...
4194304    241  49.31  100.71  104.24  87.26   88.53     6.72
IP consistency check
• The sorted pair output will be easiest to analyze for problems since each
pair is restricted to a single switch within each BladeCenter. The
other tests run across the network and may have higher variability.
• Running the following command reveals that the slowest pairs, at the top of
the list, performed poorly:
# head -5 plot.spp.sort.bw.4194304
==> plot.spp.sort.bw.4194304 <==
60.76 node025-node026
68.97 node023-node024
79.97 node325-node326
98.83 node067-node068
98.85 node071-node072
98.94 node337-node338
98.98 node175-node176
99.02 node031-node032
99.11 node401-node402
99.16 node085-node086
• This may or may not be a problem. The uplink performance will be less than
60 MB/s/node because BC can at best provide an average of 35 MB/s
per blade (with a 4 cable trunk). Many Myrinet-based clusters only use
GigE for management and NFS; both have greater bottlenecks
elsewhere.
• You may want to check the switch logs and consider reseating the
switches and blades.
IP consistency check
Running the following command reveals that there may be an uplink problem with
nodes in BC #2, i.e. node015-node028.
# head -20 plot.spp.cut.bw.4194304 plot.spp.fold.bw.4194304 plot.spp.shuffle.bw.4194304
==> plot.spp.cut.bw.4194304 <==
45.54 node025-node268
50.47 node026-node269
54.85 node024-node267
56.27 node002-node245
57.08 node022-node265
58.50 node023-node266
62.74 node020-node263
69.37 node016-node259
69.48 node015-node258
69.56 node021-node264
69.73 node018-node261
71.06 node028-node271
71.42 node019-node262
71.45 node042-node285
72.06 node027-node270
72.31 node017-node260
84.69 node224-node465
86.40 node225-node466
87.10 node001-node244
87.54 node084-node327
IP consistency check
==> plot.spp.fold.bw.4194304 <==
50.91 node026-node459
51.72 node023-node462
55.32 node002-node483
58.39 node025-node460
60.24 node024-node461
65.66 node018-node467
68.09 node022-node463
68.28 node020-node465
69.96 node021-node464
70.23 node015-node470
70.27 node016-node469
70.61 node019-node466
71.12 node027-node458
71.50 node017-node468
74.35 node028-node457
84.75 node235-node252
85.02 node236-node251
85.79 node237-node250
85.94 node238-node249
87.19 node118-node367
IP consistency check
==> plot.spp.shuffle.bw.4194304 <==
49.31 node001-node126
49.46 node029-node026
51.25 node024-node063
56.34 node274-node025
58.14 node023-node100
68.00 node019-node248
68.67 node443-node015
68.88 node018-node228
69.29 node020-node091
69.38 node028-node240
70.68 node022-node102
70.80 node027-node106
71.63 node021-node423
71.96 node291-node017
72.52 node460-node411
72.66 node016-node040
78.61 node031-node011
83.85 node041-node050
84.82 node407-node393
85.08 node420-node399
The cut, fold, and shuffle tests run from BC to BC, and the nodes in BC #2 repeatedly
show up. Consider checking the uplink cables, ports, and the BC switch.
Bisectional Bandwidth
ppp cut bw results
bytes    pairs   low   high       %  mean  median  std dev
4194304    241  6.18  17.36  180.91  7.95    7.28     1.82
Demonstrated BW = 241 * 7.95 = 1915.95 MB/s ~= 1.87 GB/s (14.96 Gb/s)
NAS MPI (8 node, 2ppn)
• The NAS Parallel Benchmarks (NPB) are a small set of programs
designed to help evaluate the performance of parallel supercomputers.
The benchmarks, which are derived from computational fluid dynamics
(CFD) applications, consist of five kernels and three pseudo-applications.
• bt.B, cg.B, ep.B, ft.B, is.B, lu.B, mg.B, and sp.B are run 10 times on
each set of 8 unique nodes using 2 different node set methods: sorted
and shuffle.
– Sorted. Sets of 8 nodes are selected from a sorted list and assigned
adjacently, e.g. node001-node008, node009-node016, etc. This is used
to find consistency within the same set of nodes.
– Shuffle. Sets of 8 nodes are selected from a shuffled list. Nodes are
reshuffled between runs.
• Both sorted and shuffle sets are run in parallel, i.e. all the sorted sets of
8 are run at the same time, then all the shuffle sets are run at the same
time.
• NOTE: node215 and node446 were not included in the shuffle and
sorted tests. node215 failed to boot, node446 failed to start up Myrinet.
NAS MPI verification
Verification command output:
# cd ~/bench/NPB3.2/NPB3.2-MPI/output.raw.shuffle
This command will find the failed results and place the names of the results filenames into the
file ../failed:
# ../checkresults ../failed
This command will find the common nodes in all failed results in the file ../failed and sort them
by number of occurrences (occurrences are counted by processor, not node):
# xcommon ../failed | tail
node395 12
node440 12
node056 12
node464 12
node043 12
node429 14
node297 14
node391 20
node174 22
node483 96
NAS MPI Consistency check
• Consistency check command output:
# cd ~/bench/NPB3.2/NPB3.2-MPI/output.raw.shuffle
# ../analm
NPB MPI
benchmark  runs      low      high       %      mean    median  std dev
bt.B.16     600  9089.46  10415.15   14.58  10204.94  10217.94   143.14
cg.B.16     600  1095.60   1685.61   53.85   1570.48   1575.38    57.70
ep.B.16     600   155.81    160.64    3.10    158.48    158.37     0.59
ft.B.16     600  2102.39   3232.49   53.75   3052.71   3066.45   130.37
is.B.16     600    87.06    185.29  112.83    155.97    154.39    12.94
lu.B.16     600  5069.36   5892.62   16.24   5529.00   5531.17   111.84
mg.B.16     600  3265.89   3898.99   19.39   3737.80   3739.77    74.91
sp.B.16     600  2156.46   2404.05   11.48   2340.00   2340.05    26.89
NAS MPI Consistency
The leading cause of variability for a stable system is switch contention. The only
way to determine what is normal is to run the same set of benchmarks multiple
times on an isolated set of stable nodes (nodes that passed single node tests)
with the rest of the switch not in use. I did not have time to run the parallel
tests serially (one node set at a time), but this is close:
# cd ~/bench/NPB3.2/NPB3.2-MPI/output.raw.sort
# ../analm $(nr -l node001-node080)
NPB MPI
benchmark  runs       low      high      %      mean    median  std dev
bt.B.16     100  10025.30  10266.00   2.40  10129.42  10120.54    44.30
cg.B.16     100   1678.27   1787.76   6.52   1714.04   1712.43    15.39
ep.B.16     100    150.45    160.02   6.36    158.49    158.38     1.03
ft.B.16     100   3248.41   3694.40  13.73   3563.50   3575.43    81.22
is.B.16     100    159.31    168.14   5.54    163.91    164.22     1.98
lu.B.16     100   5156.19   5522.79   7.11   5346.95   5350.06    87.51
mg.B.16     100   3491.76   3685.78   5.56   3613.65   3614.44    37.25
sp.B.16     100   2259.08   2308.16   2.17   2289.66   2290.30     9.55
The above results are from the first 80 nodes run sorted. Each set of 8 nodes
was isolated to a single Myrinet line card, reducing switch contention (however,
every 2 sets of nodes did share a single line card). Also, to avoid possible
variability because of memory performance, I limited the report to the first 80
nodes.
NAS MPI Distribution
NAS MPI Correlation BT BW vs. Perf
NAS MPI Distribution w/o node406
NAS MPI Correlation BT STREAM vs.
Perf
NAS MPI Correlation BT STREAM vs.
Perf
$ CPLOTOPTS="-dy ," findc plot* | grep plot.c.ppc64
0.09  -0.09  05   plot.c.ppc64  plot.cg.B.16
0.00   0.00  100  plot.c.ppc64  plot.ep.B.16
0.14  -0.14  00   plot.c.ppc64  plot.ft.B.16
0.22  -0.22  00   plot.c.ppc64  plot.is.B.16
0.21  -0.21  00   plot.c.ppc64  plot.lu.B.16
0.41  -0.41  00   plot.c.ppc64  plot.mg.B.16
0.42  -0.42  00   plot.c.ppc64  plot.sp.B.16
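The slides do not say how findc computes its coefficients. Assuming it reports an ordinary Pearson correlation between two per-node data sets (here STREAM memory bandwidth and a benchmark result), the calculation can be sketched as follows. The function name `pearson` and all data values are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical per-node values; not data from the slides.
stream_mb_s = [2100.0, 2150.0, 2080.0, 2200.0, 2120.0]
bench_mops = [158.1, 158.6, 157.9, 158.9, 158.4]
r = pearson(stream_mb_s, bench_mops)
print(f"r = {r:+.2f}")
```

Read this way, values near 0 (cg, ep) would suggest memory bandwidth explains little of the variation, while the 0.41 and 0.42 for mg and sp would suggest a moderate relationship.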
HPL MPI
• HPL is a software package that solves a (random) dense linear system
in double precision (64 bits) arithmetic on distributed-memory
computers. It can thus be regarded as a portable as well as freely
available implementation of the High Performance Computing Linpack
Benchmark.
• xhpl is run 10 times (15 times for sorted) on each set of 8 unique nodes
using 2 different node set methods: sorted and shuffle.
– Sorted. Sets of 8 nodes are selected from a sorted list and assigned
adjacently, e.g. node001-node008, node009-node016, etc. This is used
to find consistency within the same set of nodes.
– Shuffle. Sets of 8 nodes are selected from a shuffled list. Nodes are
reshuffled between runs.
• Both sorted and shuffle sets are run in parallel, i.e. all the sorted sets of
8 are run at the same time, then all the shuffle sets are run at the same
time.
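The two node set methods can be illustrated with a short Python sketch. This is not the STAB tooling itself: the function `node_sets`, the 24-node list, and the choice to drop a trailing partial set are all assumptions made for the example.

```python
import random

def node_sets(nodes, size=8, shuffle=False, rng=None):
    """Split nodes into sets of `size`, from a sorted or shuffled list."""
    ordered = sorted(nodes)
    if shuffle:
        # Reshuffle on every call, as the shuffle method does between runs.
        (rng or random).shuffle(ordered)
    # Assumption: drop a trailing partial set so every run uses `size` nodes.
    usable = len(ordered) - len(ordered) % size
    return [ordered[i:i + size] for i in range(0, usable, size)]

# Hypothetical 24-node cluster.
nodes = [f"node{i:03d}" for i in range(1, 25)]
sorted_sets = node_sets(nodes)
shuffled_sets = node_sets(nodes, shuffle=True, rng=random.Random(1))

print(sorted_sets[0][0], "...", sorted_sets[0][-1])
print(shuffled_sets[0])
```

Sorted sets keep repeating the same groups, which exposes a consistently bad group; shuffled sets move each node through many groups, which helps isolate a single bad node.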
HPL MPI verification
# cd ~/bench/hpl/output.raw.shuffle
This command finds the failed results and writes the names of their result
files to the file ../failed:
# ../checkresults ../failed
This command finds the nodes common to the failed results listed in the file
../failed and sorts them by number of occurrences (occurrences are counted
by processor, not node):
# xcommon ../failed | tail
node073 2
node121 2
node090 2
node406 2
node308 2
node276 2
node103 2
node199 2
node435 4
node483 20
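The counting step can be sketched in Python. This is an illustration of the idea, not xcommon's actual implementation or input format: the function `common_nodes`, the run names, and the per-processor node lists are hypothetical. Since the xhpl.16 runs use sets of 8 nodes, each node presumably contributes two processor entries per failed run, which would explain why every count above is even.

```python
from collections import Counter

def common_nodes(failed_runs):
    """Count how often each node appears across failed runs.

    failed_runs maps a result name to its list of nodes, with one entry
    per processor (so a dual-processor node appears twice per run).
    """
    counts = Counter()
    for processors in failed_runs.values():
        counts.update(processors)
    # Ascending by count so the worst offenders come last, as with `| tail`.
    return sorted(counts.items(), key=lambda item: (item[1], item[0]))

# Hypothetical failed results; not data from the slides.
failed_runs = {
    "run-a": ["node483", "node483", "node010", "node010"],
    "run-b": ["node483", "node483", "node435", "node435"],
}
for node, count in common_nodes(failed_runs):
    print(node, count)
```

A node that appears in nearly every failed run, as node483 does above with 20 occurrences, is the first candidate for replacement or closer diagnostics.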
HPL MPI consistency
# cd ~/bench/hpl/output.raw.shuffle
# ../analm
HPL results
benchmark       runs    low   high      %   mean  median  std dev
xhpl.16.15000    600  51.14  60.66  18.62  59.31   59.48     1.00
xhpl.16.30000    600  69.34  78.48  13.18  77.16   77.35     1.08
HPL MPI correlations
Summary
• node483 has accuracy issues.
• node406 has weak Myrinet performance.
• BC2 has a switch or uplink issue.
• Nodes 1-84 have a different memory
configuration that does correlate to
application performance.
• Applications at large scales may experience
no performance anomalies.
What is SCAB?
• SCalability Analysis of Benchmarks
• The purpose of the SCAB HOWTO is to
verify that the cluster you just built can
actually do work at scale. This can be
accomplished by running a few
industry-accepted benchmarks.
• The STAB/SCAB tools can plot
scalability for visual analysis.
• The STAB HOWTO should be completed
first to rule out any inconsistencies that may
appear as scaling issues.
The Benchmarks
• PMB (Pallas MPI Benchmark)
• NPB (NAS Parallel Benchmark)
• HPL (High Performance Linpack)
PMB
• The Pallas MPI Benchmark (PMB) provides a concise set
of benchmarks targeted at measuring the most important
MPI functions.
• NOTE: Pallas has been acquired by Intel. Intel has
released the IMB (Intel MPI Benchmark). The IMB is a
minor update of the PMB. The IMB was not used because
it failed to execute properly for all MPI implementations
that I tested.
• IMPORTANT: Consistent PMB Ping-Ping should be
achieved before running this benchmark (STAB
Lab). Unresolved inconsistencies in the interconnect may
appear as scaling issues.
• The main purpose of this test is as a diagnostic to answer
the following questions:
– Are my MPI implementation's basic functions complete?
– Does my MPI implementation scale?
PMB
• Example plot from larger BC cluster.
• Very impressive. For the Sendrecv benchmark this cluster scales from
2 nodes to 240! Could this be a non-blocking GigE
configuration? Another benchmark can help answer that question.
PMB
• Example plot from larger BC cluster.
• Quite revealing. The sorted benchmark has the 4M message size
performing at ~115MB/s for all node counts, but shuffled it falls
gradually to ~10MB/s as the number of nodes increases. Why?
PMB
• This cluster is partitioned into 14 nodes per BladeCenter
chassis. Each chassis has a GigE switch with only 4 uplinks; 3
of the 4 uplinks are bonded together to form a single 3Gbit uplink
to a stacked SMC GigE core switch. Assuming no blocking within
the core switch, this solution blocks at 14:3.
• The Sendrecv benchmark is based on MPI_Sendrecv; the
processes form a periodic communication chain. Each process
sends to the right and receives from the left neighbor in the
chain.
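The 14:3 figure is simple arithmetic: 14 nodes with 1Gbit links share a 3Gbit bonded uplink. A minimal Python sketch, with a hypothetical helper name, makes the ratio and the worst-case per-node share explicit. The per-node share assumes all 14 nodes send off-chassis at the same time and split the uplink evenly, which is an idealization.

```python
from fractions import Fraction

def blocking_ratio(nodes_per_chassis, node_link_gbit, uplink_gbit):
    """Ratio of total node bandwidth to uplink bandwidth for one chassis."""
    return Fraction(nodes_per_chassis * node_link_gbit, uplink_gbit)

ratio = blocking_ratio(nodes_per_chassis=14, node_link_gbit=1, uplink_gbit=3)
print(f"{ratio.numerator}:{ratio.denominator} (about {float(ratio):.2f}:1)")

# Idealized fair share of the uplink if every node sends off-chassis at once.
share_mbit = 3 * 1000 / 14
print(f"per-node share of uplink: about {share_mbit:.0f} Mbit/s")
```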
PMB
• Based on the previous illustration it is easy to see
why the sorted list performed so well. Most of the
traffic was isolated to well-performing local
switches, and the jump from chassis to chassis
through the SMC core switch only requires the
bandwidth of a single link (1Gb full duplex).
• With the shuffled list, the odds are small that a node's left
neighbor (receive from) and right neighbor
(send to) will be on the same switch. This was
illustrated in the second plot.
• Moral of the story:
– Don’t trust interconnect vendors that do not provide the node
list.
– Ask for sorted and shuffled benchmarks.
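The difference between the sorted and shuffled node lists can be made concrete by counting how many neighbor links in the Sendrecv chain cross a chassis boundary. The Python sketch below is a simplified model, not a measurement: it assumes 240 nodes numbered consecutively, 14 nodes per chassis, and a ring where each rank sends to the next rank in the list. The function name and the fixed random seed are hypothetical choices for the example.

```python
import random

def cross_chassis_links(order, nodes_per_chassis=14):
    """Count ring links whose two endpoints sit in different chassis."""
    chassis = [node // nodes_per_chassis for node in order]
    n = len(order)
    # Rank i sends to rank (i + 1) % n, forming a periodic chain.
    return sum(chassis[i] != chassis[(i + 1) % n] for i in range(n))

nodes = list(range(240))  # node indices 0..239, 14 per chassis
sorted_crossings = cross_chassis_links(nodes)

shuffled = nodes[:]
random.Random(0).shuffle(shuffled)
shuffled_crossings = cross_chassis_links(shuffled)

print("sorted:  ", sorted_crossings, "of", len(nodes), "links leave a chassis")
print("shuffled:", shuffled_crossings, "of", len(nodes), "links leave a chassis")
```

In the sorted case each chassis has only one outgoing and one incoming link through its uplink, so the 3Gbit bond is never stressed. In the shuffled case nearly every link leaves its chassis, so all 14 nodes in a chassis compete for the same uplink.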
PMB Myrinet GM
PMB Myrinet GM
PMB Myrinet MX
PMB Myrinet MX
PMB IB
PMB IB
Questions w/ Answers
• Egan Ford, IBM
egan@us.ibm.com
egan@sense.net