Thirteen modern ways to fool the masses with performance results on parallel computers
Georg Hager
Erlangen Regional Computing Center (RRZE), University of Erlangen-Nuremberg
6th Erlangen International High End Computing Symposium
RRZE, 04.06.2010
1991
David H. Bailey: "Twelve Ways to Fool the Masses When Giving Performance Results on Parallel Computers", Supercomputing Review, August 1991, p. 54-55
1. Quote only 32-bit performance results, not 64-bit results.
2. Present performance figures for an inner kernel, and then represent these figures as the performance of the entire application.
3. Quietly employ assembly code and other low-level language constructs.
4. Scale up the problem size with the number of processors, but omit any mention of this fact.
5. Quote performance results projected to a full system.
6. Compare your results against scalar, unoptimized code on Crays.
7. When direct run time comparisons are required, compare with an old code on an obsolete system.
8. If MFLOPS rates must be quoted, base the operation count on the parallel implementation, not on the best sequential implementation.
9. Quote performance in terms of processor utilization, parallel speedups or MFLOPS per dollar.
10. Mutilate the algorithm used in the parallel implementation to match the architecture.
11. Measure parallel run times on a dedicated system, but measure conventional run times in a busy environment.
12. If all else fails, show pretty pictures and animated videos, and don't talk about performance.
1991 …
"If you were plowing a field, which would you rather use? Two strong oxen or 1024 chickens?"
(Attributed to Seymour Cray)
1991 …
• Strong I/O facilities
• 32-bit vs. 64-bit FP arithmetic
• SIMD/MIMD parallelism
• Vectorization
• No parallelization standards
• System-specific optimizations
Today we have…
• Multicore processors with shared/separate caches and shared data paths
• Hybrid, hierarchical systems with multi-socket, multi-core, ccNUMA, heterogeneous networks
Today we have…
Ants all over the place
Cell, Clearspeed, GPUs...
Today we have…
Commodity everywhere
x86-type processors, cost-effective interconnects, GNU/Linux
The landscape of High Performance Computing and the way we think about HPC has changed over the last 19 years, and we need an update!
Still, many of Bailey's points are valid without change.
Stunt 1
Report scalability, not absolute performance.

Speedup with N workers:

S(N) = \frac{\text{work/time with } N \text{ workers}}{\text{work/time with 1 worker}}

"Good" scalability ↔ S(N) ≈ N, but there is no mention of how fast you can solve your problem!

Consequence: Comparing different systems is much easier when using scalability instead of work/time directly.
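For illustration (numbers assumed, not from the talk): a system that needs 10 s on one worker and 1 s on 32 workers has S(32) = 10; a system that needs 100 s on one worker and 5 s on 32 workers has S(32) = 20. The second one "scales" twice as well – and solves the problem five times slower.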
Stunt 1: Scalability vs. performance
And… instant success!
[Figure: speedup (left) and performance, i.e. work/time (right), vs. # CPUs or nodes for an NEC system and a commodity cluster.]
Stunt 2
Slow down code execution.
This is useful whenever there is some noticeable "non-execution" overhead.

Parallel speedup with work ∝ N^α (α=0: strong scaling, α=1: weak scaling):

S(N) = \frac{s + (1-s)N^\alpha}{s + (1-s)N^{\alpha-1} + c_\alpha(N)}

Now let's slow down execution by a factor of μ > 1 (for strong scaling):

S_\mu(N) = \frac{\mu}{\mu\,(s + (1-s)/N) + c(N)} = \frac{1}{s + (1-s)/N + c(N)/\mu}

I.e., if there is overhead, the slow code/machine scales better:

S_\mu(N) > S_{\mu=1}(N) \quad \text{if } c(N) > 0
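A minimal sketch of this effect (all values assumed for illustration, not data from the talk), evaluating the strong-scaling speedup for a fast (μ=1) and a slowed-down (μ=2) code:

/* Sketch with assumed values: strong-scaling speedup
 *   S_mu(N) = mu / ( mu*(s + (1-s)/N) + c(N) )
 * for a serial fraction s and a communication overhead c(N).
 * Any c(N) > 0 makes the slowed-down code (mu > 1) scale better. */
#include <stdio.h>
#include <math.h>

static double speedup(double mu, double s, int N, double cN)
{
    return mu / (mu * (s + (1.0 - s) / N) + cN);
}

int main(void)
{
    const double s = 0.01;                       /* assumed serial fraction */
    for (int N = 1; N <= 64; N *= 2) {
        /* assumed overhead model, shaped like the c(N) ~ N^(-2/3) used below */
        double cN = (N == 1) ? 0.0 : 0.3 * pow((double)N, -2.0 / 3.0);
        printf("N=%3d  fast (mu=1): S=%6.2f   slow (mu=2): S=%6.2f\n",
               N, speedup(1.0, s, N, cN), speedup(2.0, s, N, cN));
    }
    return 0;
}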
Stunt 2: Slow computing
Corollaries:
1. Do not use high compiler optimization levels or the latest compiler
versions.
2. If scalability is still bad, parallelize some short loops with OpenMP. That way you can get some extra bonus for a scalable hybrid code (a minimal sketch follows below).
If someone asks for time to solution, answer that if you had a bigger machine, you could get the solution as fast as you want. This is of course due to the superior scalability of your code.
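A minimal sketch of corollary 2 (hypothetical code, not from the talk): the loop below is far too short for threading to pay off, so the OpenMP fork/join overhead dominates and slows the code down – which, per stunt 2, makes the "hybrid" code scale beautifully.

/* "Hybridize" a tiny loop: the parallel region's startup and implicit
 * barrier cost easily exceed the few hundred multiplications inside.
 * Compile with OpenMP enabled, e.g. -fopenmp. */
void tiny_update(double *x, int n)     /* n assumed to be ~100 */
{
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        x[i] *= 1.0001;
}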
Stunt 2: Slow computing
“Slowness” has some surprises in store…
Let's look at μ=2:

[Diagram: execution time of the fast machine with N=4 workers vs. the machine slowed down by μ=2 with N=8 workers, each bar split into calculation and communication; T_f and T_s denote the respective run times.]

Is T_s < T_f? This happens if c'(N) < 0 at s = 0.
Is 1/T_f < 2/T_s? This happens if c(N) > c(μN)/μ at s = 0.

What's the catch here?
Stunt 2: Slow computing
Example for μ=4 and c(N) ~ N^(-2/3) at strong scaling:

[Figure: performance vs. # nodes for the fast and the slow machine.]

• The performance is better with μN slow CPUs than with N fast CPUs.
• "Slow computing" can effectively lessen the impact of communication overhead.
• We assume that the network is the same in both machines.
Stunt 3 (The power of obfuscation, part I)
If scalability doesn’t look good enough, use a logarithmic scale to
drive your point home.
Everything looks OK if you plot it the right way!
1. Linear plot: bad scaling, strange things at N=32
2. Log-log plot: better scaling, but still the N=32 problem
3. Log-linear plot: N=32 problem gone
4. … and remove the ideal scaling line to make it perfect!
[Figure: the same speedup data plotted with linear, log-log, and log-linear axes, with and without the ideal scaling line.]
Stunt 3: Log scale
[Figure: Top500 data on a logarithmic scale. © Top500, 2008]
Stunt 4
If you must report actual performance, quietly employ weak scaling to show off.
It's all in that bloody denominator…

S(N) = \frac{s + (1-s)N^\alpha}{s + (1-s)N^{\alpha-1} + c_\alpha(N)}

At α=1 the world looks so much nicer:

S(N) = \frac{s + (1-s)N}{1 + c(N)}
… but keep in mind: Do not mention the term “weak scaling” or you will be
asked nasty questions about parallel efficiency.
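For illustration (assumed numbers): with s = 0.1 and a constant overhead c(N) = 0.25, the α=1 formula gives S(N) = (0.1 + 0.9N)/1.25 – a perfectly straight line with slope 0.72, which is exactly the parallel efficiency you would rather not be asked about.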
Stunt 4: Weak scaling
But weak scaling gives us much more than just a “straight” graph. It gives
us perfect scaling if we choose the right metric to look at!
Assumption: Weak scaling with parallel efficiency ε = S(N)/N << 1 and no other overhead
→ S(N) = s + (1−s)N has a small slope.
But: If we choose a metric for work that is applicable to the parallel part alone, work/time scales linearly. So all you need to do is plot Mflop/s, MLUP/s, or anything that doesn't happen in the serial part, and you can even show real performance numbers!
→ See also stunt #10
Stunt 5 (The power of obfuscation, part II)
Instead of performance, plot absolute runtime vs. CPU count.
Very, very popular indeed! Nobody will be able to tell whether your code actually scales.

Corollary: CPU time per core is even better because it omits most overheads…

[Figure: runtime and CPU time per core vs. # CPUs.]
Stunt 6 (The power of obfuscation, part III)
Compare different systems by showing the log of parallel efficiency vs. CPU count.

Remember: Legends can be any size you like!
Unusual ways of putting data together surprise and confuse your audience.

[Figure: parallel efficiency on a logarithmic scale vs. # nodes/CPUs for the cluster and the NEC system.]
Stunt 7
Emphasize the quality of your shiny accelerator code by comparing it with scalar, unoptimized code on a single core of an old standard CPU.
And use GCC 2.7.2. Anything else is a waste of time.
And besides, don't the compiler guys always say that they're "multi-core enabled"?
Corollary:
Use single precision on the GPU but double precision on the CPU. This will cut the effective bandwidth, cache size, and peak performance of the latter and let the former shine.
Stunt 8
Always quote GFlops, MIps, Watts per Flop or any other irrelevant
interesting metric instead of inverse time to solution.
Flops are so cool it hurts:
for(i=1; i<N-1; ++i)
  for(j=1; j<N-1; ++j)                       /* 4 flops per update */
    b[i][j] = 0.25*(a[i-1][j]+a[i+1][j]+a[i][j-1]+a[i][j+1]);

for(i=1; i<N-1; ++i)
  for(j=1; j<N-1; ++j)                       /* 7 flops per update, same result */
    b[i][j] = 0.25*a[i-1][j]+0.25*a[i+1][j]
             +0.25*a[i][j-1]+0.25*a[i][j+1];
“Floptimization”
Watts/Flop are an ingenious fallback – who would dare question a truly "green" application/system? Except maybe some investors…
Stunt 9
Ignore affinity and topology issues. Real scientists are not bothered
by such details.
Multi-core, cache groups, ccNUMA, SMT, network hierarchies etc. are just parts of a vicious plot to take the fun out of computing. Ignoring those issues will make them go away. If people ask specific questions about it, answer that it's the OS's or the compiler's job.
[Diagram: a two-socket multicore node (cores, caches, memory interfaces, memory), annotated with the issues being ignored: OpenMP overhead, shared cache re-use, OS buffer cache, bandwidth contention, intra-node MPI, ccNUMA page placement.]
Stunt 9: Affinity issues
• Re-using shared cache on multi-core CPUs? More cores mean more performance, do they not?
[Diagram: two cores (core0, core1) sharing a cache, working on the array x(:,:,:) with a temporary array tmp(:,:,3); y- and z-directions and the memory path are indicated.]
Stunt 9: Affinity issues
• Memory bandwidth saturation? ccNUMA effects? Shouldn't the OS put the threads and pages where they are supposed to be?
[Figure: parallel STREAM performance.]
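For those who do care: Linux typically places memory pages by a first-touch policy, i.e. a page ends up in the NUMA domain of the thread that first writes to it. A minimal sketch (illustrative, not from the talk) of ccNUMA-aware initialization for a STREAM-triad-like loop:

/* Sketch: with first-touch page placement, the initialization loop must use
 * the same (static) work distribution as the compute loop, and the threads
 * must be pinned, so that each thread later accesses memory in its own
 * NUMA domain. Compile with OpenMP enabled, e.g. -fopenmp. */
void triad(double *a, double *b, double *c, double s, long n)
{
    #pragma omp parallel
    {
        #pragma omp for schedule(static)   /* pages are placed here (first touch) */
        for (long i = 0; i < n; ++i) {
            a[i] = 0.0;
            b[i] = 1.0;
            c[i] = 2.0;
        }
        #pragma omp for schedule(static)   /* same distribution -> local accesses */
        for (long i = 0; i < n; ++i)
            a[i] = b[i] + s * c[i];
    }
}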
Stunt 9: Affinity issues
• Intra-node MPI is infinitely fast! Look at those latencies!
MPI intra-node and inter-node latencies on Cray XT5
[Figure: latency bars with a node-topology sketch; inter-node ≈ 7.4 µs, inter-socket ≈ 0.63 µs, intra-socket ≈ 0.49 µs.]
Stunt 9: Affinity issues
• Intra-node MPI is infinitely fast! Low-level benchmarking is unreliable!
[Figure: intra-node vs. inter-node MPI performance: between two cores of one socket (shared cache advantage), between two sockets of one node (cache effects eliminated), and between two nodes via the interconnect fabric.]
Stunt 9: Affinity issues
Why should you reverse engineer the overcomplicated cache
topology of those modern systems?
Xeon E5420, 2 threads       shared L2    same socket    different socket
pthreads_barrier_wait            5863          27032               27647
omp barrier (icc 11.0)            576            760                1269
Spin loop                         259            485               11602

Nehalem, 2 threads          shared SMT threads    shared L3    different socket
pthreads_barrier_wait                    23352         4796               49237
omp barrier (icc 11.0)                    2761          479                1206
Spin loop                                17388          267                 787
Stunt 9: Affinity – if you still insist…
• Command line tools for Linux:
  • easy to install
  • work with the standard Linux 2.6 kernel
  • simple and clear to use
  • support Intel and AMD CPUs
• Current tools:
  • likwid-topology: print thread and cache topology
  • likwid-perfCtr: measure performance counters
  • likwid-features: view and enable/disable hardware prefetchers
  • likwid-pin: pin a threaded application without touching its code
Open source project (GPL v2):
http://code.google.com/p/likwid/
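As a usage sketch (flag syntax taken from the likwid documentation; details may vary between versions), pinning a four-thread OpenMP run to the first four cores could look like:
OMP_NUM_THREADS=4 likwid-pin -c 0-3 ./a.out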
Stunt 10
If you really can’t reduce communication overhead, argue in favor of
“reliable inefficiency.”
And fill any machine.
Even if you spend 80% of the time communicating, that's OK if the ratio stays constant – it means you can scale to any size!

[Figure: fraction of runtime spent in computation and communication vs. # nodes/CPUs; the efficiency becomes constant for large N.]
Stunt 11 (The power of obfuscation, part IV)
Performance modeling is for wimps. Show real data. Plenty.
And then some.
If nasty questions pop up, say your code is so complex that no model can describe it.

Don't try to make sense of your data by fitting it to a model. Instead, show at least 8 graphs per plot, all in bright pastel colors, with different symbols.

[Figure: performance vs. # nodes/CPUs for Machine 1 through Machine 8.]
Stunt 12
If they get you cornered, blame it all on OS jitter.
They will understand and nod knowingly.
Corollary:
Depending on the audience, TLB misses may work just as well.
Stunt 13
If all else fails, show pretty pictures and animated videos, and don’t
talk about performance.
In four decades of supercomputing, this was always the best-selling plan, and it will stay that way forever.
THANK YOU