Thirteen modern ways to fool the masses with performance results on parallel computers

Georg Hager
Erlangen Regional Computing Center (RRZE)
University of Erlangen-Nuremberg
6th Erlangen International High End Computing Symposium, RRZE, 04.06.2010

1991

David H. Bailey, Supercomputing Review, August 1991, p. 54-55:
"Twelve Ways to Fool the Masses When Giving Performance Results on Parallel Computers"

1. Quote only 32-bit performance results, not 64-bit results.
2. Present performance figures for an inner kernel, and then represent these figures as the performance of the entire application.
3. Quietly employ assembly code and other low-level language constructs.
4. Scale up the problem size with the number of processors, but omit any mention of this fact.
5. Quote performance results projected to a full system.
6. Compare your results against scalar, unoptimized code on Crays.
7. When direct run time comparisons are required, compare with an old code on an obsolete system.
8. If MFLOPS rates must be quoted, base the operation count on the parallel implementation, not on the best sequential implementation.
9. Quote performance in terms of processor utilization, parallel speedups, or MFLOPS per dollar.
10. Mutilate the algorithm used in the parallel implementation to match the architecture.
11. Measure parallel run times on a dedicated system, but measure conventional run times in a busy environment.
12. If all else fails, show pretty pictures and animated videos, and don't talk about performance.

1991 …

"If you were plowing a field, which would you rather use? Two strong oxen or 1024 chickens?"
(Attributed to Seymour Cray)

1991 …

Strong I/O facilities
32-bit vs. 64-bit FP arithmetic
SIMD/MIMD parallelism
Vectorization
No parallelization standards
System-specific optimizations

Today we have…

Multicore processors with shared/separate caches and shared data paths
Hybrid, hierarchical systems: multi-socket, multi-core, ccNUMA, heterogeneous networks

Today we have…

Ants all over the place: Cell, Clearspeed, GPUs...

Today we have…

Commodity everywhere: x86-type processors, cost-effective interconnects, GNU/Linux

The landscape of High Performance Computing and the way we think about HPC has changed over the last 19 years, and we need an update! Still, many of Bailey's points are valid without change.

Stunt 1

Report scalability, not absolute performance.

Speedup: S(N) = (work/time with N workers) / (work/time with 1 worker)

"Good" scalability ↔ S(N) ≈ N, but there is no mention of how fast you can solve your problem!

Consequence: Comparing different systems is much easier when using scalability instead of work/time directly.

Stunt 1: Scalability vs. performance

And… instant success!

[Figure: speedup (left) and performance, i.e. work/time (right), vs. number of CPUs or nodes (up to 70) for the NEC system and the cluster.]
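Stunt 1 is easy to try at home. The following is a minimal toy model in C (not from the talk; the per-worker speeds, serial fraction, and constant overhead are made-up assumptions) that prints speedup and absolute performance (work/time) for a "fast" and a "slow" machine: the slow machine wins the speedup comparison while delivering less work per second.

/* Toy illustration of Stunt 1 (not from the talk). Model: Amdahl-type
 * runtime for a fixed amount of work, plus a constant parallel overhead.
 * All parameter values below are assumptions chosen for the sketch. */
#include <stdio.h>

/* wall-clock time for work W = 1 on N workers with per-worker speed p,
 * serial fraction s, and constant overhead c */
static double runtime(double p, double s, double c, int N)
{
    return (s + (1.0 - s) / N) / p + c;
}

int main(void)
{
    double pA = 4.0;   /* machine A: 4x faster per worker (assumed) */
    double pB = 1.0;   /* machine B: slow per worker */
    double s  = 0.01;  /* serial fraction (assumed) */
    double c  = 0.05;  /* constant overhead, same on both (assumed) */

    for (int N = 1; N <= 64; N *= 2) {
        double tA = runtime(pA, s, c, N);
        double tB = runtime(pB, s, c, N);
        printf("N=%2d  S_A=%5.1f  S_B=%5.1f   perf_A=%6.2f  perf_B=%6.2f\n",
               N,
               runtime(pA, s, c, 1) / tA,   /* speedup, machine A */
               runtime(pB, s, c, 1) / tB,   /* speedup, machine B */
               1.0 / tA, 1.0 / tB);         /* absolute performance */
    }
    return 0;  /* B "scales" better, yet A solves the problem faster */
}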
Stunt 2

Slow down code execution.

This is useful whenever there is some noticeable "non-execution" overhead.

Parallel speedup with work ~ N^α (α = 0: strong scaling, α = 1: weak scaling):

S(N) = (s + (1−s)·N^α) / (s + (1−s)·N^(α−1) + c_α(N))

Now let's slow down execution by a factor of μ > 1 (for strong scaling):

S_μ(N) = μ / (μ·(s + (1−s)/N) + c(N)) = 1 / (s + (1−s)/N + c(N)/μ)

I.e., if there is overhead, the slow code/machine scales better:

S_μ(N) > S_{μ=1}(N) if c(N) > 0

Stunt 2: Slow computing

Corollaries:

1. Do not use high compiler optimization levels or the latest compiler versions.
2. If scalability is still bad, parallelize some short loops with OpenMP. That way you can get some extra bonus for a scalable hybrid code.

If someone asks for time to solution, answer that if you had a bigger machine, you could get the solution as fast as you want. This is of course due to the superior scalability of your code.

Stunt 2: Slow computing

"Slowness" has some surprises in store… Let's look at μ = 2:

[Figure: runtime bars, split into calculation and communication, for the fast machine with N = 4 workers (runtime Tf) and the slow machine with N = 8 workers (runtime Ts). Ts < Tf? This happens if c′(N) < 0 at s = 0. 1/Tf < 2/Ts? This happens if c(μN) > c(N)/μ at s = 0. What's the catch here?]

Stunt 2: Slow computing

Example for μ = 4 and c(N) ~ N^(−2/3) at strong scaling:

[Figure: performance vs. number of nodes (up to 70) for the fast and the slow machine.]

The performance is better with μN slow CPUs than with N fast CPUs. "Slow computing" can effectively lessen the impact of communication overhead. We assume that the network is the same in both machines.

Stunt 3 (The power of obfuscation, part I)

If scalability doesn't look good enough, use a logarithmic scale to drive your point home.

Everything looks OK if you plot it the right way!

1. Linear plot: bad scaling, strange things at N = 32.
2. Log-log plot: better scaling, but still the N = 32 problem.
3. Log-linear plot: N = 32 problem gone.
4. … and remove the ideal scaling line to make it perfect!

[Figure: the same speedup data plotted on linear, log-log, and log-linear axes, with and finally without the ideal-scaling line.]

Stunt 3: Log scale

[Figure: Top500 data plotted on a logarithmic scale. © Top500, 2008.]

Stunt 4

If you must report actual performance, quietly employ weak scaling to show off.

It's all in that bloody denominator…

S(N) = (s + (1−s)·N^α) / (s + (1−s)·N^(α−1) + c_α(N))

At α = 1 the world looks so much nicer:

S(N) = (s + (1−s)·N) / (1 + c(N))

… but keep in mind: Do not mention the term "weak scaling" or you will be asked nasty questions about parallel efficiency.

Stunt 4: Weak scaling

But weak scaling gives us much more than just a "straight" graph. It gives us perfect scaling if we choose the right metric to look at!

Assumption: weak scaling with parallel efficiency ε = S(N)/N << 1 and no other overhead
→ S(N) = s + (1−s)·N has a small slope.

But: if we choose a metric for work that is applicable to the parallel part alone, work/time scales linearly. So all you need to do is plot Mflop/s, MLUP/s, or anything that doesn't happen in the serial part, and you can even show real performance numbers!

→ See also stunt #10
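A short worked example makes the trick concrete (the serial fraction s = 0.8 is an assumed number, not from the talk). Under weak scaling with no other overhead the wall-clock time stays constant: the serial part still takes s·T(1), and the parallel part's work grows N-fold but is spread over N workers, so T(N) = T(1). The speedup is therefore S(N) = s + (1−s)·N = 0.8 + 0.2·N, and the parallel efficiency ε(N) = S(N)/N approaches 0.2. But if the chosen metric (Mflop/s, MLUP/s, …) counts only work done in the parallel part, that part performs N times its N = 1 workload in the same time:

metric(N) = N · W_parallel(1) / T(1) ∝ N

A perfectly straight line through the origin, even though 80% of every run's wall-clock time is spent in serial code.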
Stunt 5 (The power of obfuscation, part II)

Instead of performance, plot absolute runtime vs. CPU count.

Very, very popular indeed! Nobody will be able to tell whether your code actually scales.

Corollary: CPU time per core is even better because it omits most overheads…

[Figure: runtime and CPU time per core vs. number of CPUs (up to 70).]

Stunt 6 (The power of obfuscation, part III)

Compare different systems by showing the log of parallel efficiency vs. CPU count.

Unusual ways of putting data together surprise and confuse your audience.

Remember: Legends can be any size you like!

[Figure: parallel efficiency of the cluster and the NEC system vs. number of nodes/CPUs (up to 70), on a logarithmic scale from 0.01 to 1.]

Stunt 7

Emphasize the quality of your shiny accelerator code by comparing it with scalar, unoptimized code on a single core of an old standard CPU. And use GCC 2.7.2.

Anything else is a waste of time. And besides, don't the compiler guys always say that they're "multi-core enabled"?

Corollary: Use single precision on the GPU but double precision on the CPU. This will cut the effective bandwidth, cache size, and peak performance of the latter and let the former shine.

Stunt 8

Always quote GFlops, MIPS, Watts per Flop, or any other irrelevant (sorry: interesting) metric instead of inverse time to solution.

Flops are so cool it hurts:

for(i=0; i<N; ++i)
  for(j=0; j<N; ++j)
    b[i][j] = 0.25*(a[i-1][j]+a[i+1][j]+a[i][j-1]+a[i][j+1]);

"Floptimization" (same result, more flops):

for(i=0; i<N; ++i)
  for(j=0; j<N; ++j)
    b[i][j] = 0.25*a[i-1][j]+0.25*a[i+1][j]+0.25*a[i][j-1]+0.25*a[i][j+1];

Watts/Flop are an ingenious fallback – who would dare question a truly "green" application/system? Except maybe some investors…

Stunt 9

Ignore affinity and topology issues. Real scientists are not bothered by such details.

Multi-core, cache groups, ccNUMA, SMT, network hierarchies etc. are just parts of a vicious plot to take the fun out of computing. Ignoring those issues will make them go away. If people ask specific questions about it, answer that it's the OS's or the compiler's job.

[Diagram: a two-socket multi-core node with caches, memory interfaces, and memory, annotated with the issues to be ignored: OpenMP overhead, shared cache re-use, OS buffer cache, bandwidth contention, intra-node MPI, ccNUMA page placement.]

Stunt 9: Affinity issues

Re-using shared cache on multi-core CPUs? More cores mean more performance, do they not?

[Figure: core0 and core1 accessing the arrays tmp(:,:,3) and x(:,:,:) from memory, with the y- and z-directions indicated.]

Stunt 9: Affinity issues

Memory bandwidth saturation? ccNUMA effects? Shouldn't the OS put the threads and pages where they are supposed to be?

[Figure: parallel STREAM performance.]

Stunt 9: Affinity issues

Intra-node MPI is infinitely fast! Look at those latencies!

MPI intra-node and inter-node latencies on Cray XT5: inter-node 7.4 µs, inter-socket 0.63 µs, intra-socket 0.49 µs.

Stunt 9: Affinity issues

Intra-node MPI is infinitely fast! Low-level benchmarking is unreliable!

[Figure: intra-node vs. inter-node MPI performance: between two cores of one socket (shared cache advantage), between two sockets of one node (cache effects eliminated), and between two nodes via the interconnect fabric.]
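Stunt 9 works precisely because these details do matter. As a counterexample, here is a minimal sketch (not from the talk) of the standard ccNUMA "first touch" idiom in C/OpenMP: pages are mapped into the NUMA domain of the thread that first writes them, so arrays should be initialized with the same parallel loop structure that later computes on them, and the threads must be pinned (e.g. with likwid-pin, see below).

/* Minimal sketch of ccNUMA-aware "first touch" placement (not from the
 * talk). Initialization and compute loop use the same static schedule,
 * so each thread first writes, and later reads, pages in its own NUMA
 * domain. Thread pinning (e.g. via likwid-pin) is assumed. */
#include <stdio.h>
#include <stdlib.h>

#define N 20000000L

int main(void)
{
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    if (!a || !b) return 1;

    /* parallel first touch: places pages near the threads that use them */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; ++i) {
        a[i] = 0.0;
        b[i] = 0.5 * i;
    }

    /* compute loop with the same schedule: mostly local memory accesses */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; ++i)
        a[i] = 2.0 * b[i] + 1.0;

    printf("a[N-1] = %.1f\n", a[N - 1]);
    free(a);
    free(b);
    return 0;
}

If the initialization were serial (or a memset) instead, all pages would land in one NUMA domain and the compute loop would be throttled by a single memory interface, which is exactly the kind of effect the stunt recommends blaming on the OS.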
Stunt 9: Affinity issues

Why should you reverse engineer the overcomplicated cache topology of those modern systems?

Barrier overhead for 2 threads (CPU cycles):

Xeon E5420 (2 threads)        shared L2   same socket   different socket
  pthreads_barrier_wait            5863         27032              27647
  omp barrier (icc 11.0)            576           760               1269
  Spin loop                         259           485              11602

Nehalem (2 threads)      shared SMT threads   shared L3   different socket
  pthreads_barrier_wait               23352        4796              49237
  omp barrier (icc 11.0)               2761         479               1206
  Spin loop                           17388         267                787

Stunt 9: Affinity – if you still insist…

Command line tools for Linux:
easy to install
works with the standard Linux 2.6 kernel
simple and clear to use
supports Intel and AMD CPUs

Current tools:
likwid-topology: print thread and cache topology
likwid-perfCtr: measure performance counters
likwid-features: view and enable/disable hardware prefetchers
likwid-pin: pin a threaded application without touching its code

Open source project (GPL v2): http://code.google.com/p/likwid/

Stunt 10

If you really can't reduce communication overhead, argue in favor of "reliable inefficiency." And fill any machine.

Even if you spend 80% of the time communicating, that's OK if the ratio stays constant – it means you can scale to any size!

[Figure: fraction of runtime spent in calculation and communication vs. number of nodes/CPUs (1 to 1000); the efficiency is constant for large N.]

Stunt 11 (The power of obfuscation, part IV)

Performance modeling is for wimps. Show real data. Plenty. And then some.

Don't try to make sense of your data by fitting it to a model. Instead, show at least 8 graphs per plot, all in bright pastel colors, with different symbols. If nasty questions pop up, say your code is so complex that no model can describe it.

[Figure: performance vs. number of nodes/CPUs (up to 600) for Machines 1 through 8.]

Stunt 12

If they get you cornered, blame it all on OS jitter. They will understand and nod knowingly.

Corollary: Depending on the audience, TLB misses may work just as well.

Stunt 13

If all else fails, show pretty pictures and animated videos, and don't talk about performance.

In four decades of supercomputing, this was always the best-selling plan, and it will stay that way forever.

THANK YOU