Performance Benefits on HPCx from Power5 chips and SMT
HPCx User Group Meeting, 28 June 2006
Alan Gray, EPCC, University of Edinburgh

Contents
• Introduction and System Overview
• Benchmark Results
  – Synthetic
  – Applications
• Simultaneous Multithreading
• Conclusions

Introduction and System Overview

Introduction
• HPCx underwent an upgrade in November 2005 from Power4 to Power5 technology.
• The new system features Simultaneous Multithreading (SMT).
• We compare the new and old systems via benchmark results, both synthetic and from real applications representing typical use of the system.
• The use of SMT is also investigated.
• Results from EPCC's Blue Gene/L system are also included for comparison.

Systems for comparison
• Previous HPCx (Phase 2): 50 IBM e-Server p690+ nodes
  – SMP cluster, 32 Power4 1.7 GHz processors per node
  – 32 GB of RAM per node
  – Federation interconnect
  – 6.2 TFLOP/s Linpack
• HPCx (Phase 2a): 96 IBM e-Server p575 nodes
  – SMP cluster, 16 Power5 1.5 GHz processors per node
  – Power5 has an improved memory architecture compared with Power4
  – 32 GB of RAM per node (twice as much per processor as Phase 2)
  – Federation interconnect (same as Phase 2)
  – 7.4 TFLOP/s Linpack, No. 46 on the Top500 list
• BlueSky: single e-Server Blue Gene frame
  – 1024 dual-core chips, i.e. 2048 PowerPC 440 processors at 700 MHz
  – 512 MB of RAM per chip (distributed-memory system), shared between the two cores
  – 4.7 TFLOP/s Linpack, joint No. 73 on the Top500 list

Benchmark Results: Synthetic

Synthetic benchmarks: Intel MPI suite – Ping Pong
• Two processes communicate, either over the switch or within a node via shared memory (a minimal sketch of the ping-pong pattern is given after the Streams results below).
• Switch communication: insignificant difference (not surprising – same switch).
• Intra-node communication: Phase 2a has better asymptotic bandwidth but slightly higher latency.

Synthetic benchmarks: Intel MPI suite – Multi Ping Pong
• All available processors are utilised.
• Modified to ensure that all communications utilise the switch.
• No difference between Phase 2 and Phase 2a (not surprising – same switch).

Streams performance (scale)
• The Streams benchmark gives a measure of memory bandwidth (a sketch of a Scale-style kernel is also given below).
• The hardware limit is two load/store operations per cycle.
• The cache levels can be seen clearly.
• Phase 2a is significantly better than Phase 2 at all memory levels.
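To make the ping-pong measurement concrete, the sketch below shows the basic pattern in C with MPI: two ranks bounce a fixed-size message back and forth and time the round trips to estimate latency and bandwidth. This is a minimal illustration only, not the Intel MPI suite source; the message size and repetition count are arbitrary choices.

  /* Minimal MPI ping-pong sketch (illustrative only, not the Intel
     benchmark code): ranks 0 and 1 bounce a message and time the
     round trips to estimate latency and bandwidth. */
  #include <mpi.h>
  #include <stdio.h>
  #include <stdlib.h>

  int main(int argc, char **argv)
  {
      const int nbytes = 1 << 20;   /* message size: 1 MB (vary to sweep sizes) */
      const int reps   = 100;       /* round trips to average over */
      int rank, i;
      char *buf = malloc(nbytes);

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      MPI_Barrier(MPI_COMM_WORLD);
      double t0 = MPI_Wtime();
      for (i = 0; i < reps; i++) {
          if (rank == 0) {
              MPI_Send(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
              MPI_Recv(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
          } else if (rank == 1) {
              MPI_Recv(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
              MPI_Send(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
          }
      }
      double t1 = MPI_Wtime();

      if (rank == 0) {
          double one_way = (t1 - t0) / (2.0 * reps);   /* seconds per one-way transfer */
          printf("%d bytes: %.2f us one-way, %.2f MB/s\n",
                 nbytes, one_way * 1e6, nbytes / one_way / 1e6);
      }
      free(buf);
      MPI_Finalize();
      return 0;
  }

Whether this exercises the switch or the shared-memory path depends on where the two tasks are placed: two tasks on different nodes measure the switch, two tasks on the same node measure intra-node communication, which is the distinction drawn in the results above.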
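The Streams measurement referred to above can likewise be sketched in a few lines of C. Again this is only a rough illustration under assumed array sizes, not the official STREAM benchmark code: it times the Scale kernel b(i) = q*c(i) over arrays much larger than cache, so the rate measured reflects main-memory bandwidth; shrinking the arrays until they fit in the L2 or L3 cache would expose the cache levels visible in the plot.

  /* Rough Scale-kernel sketch (illustrative, not the official STREAM code):
     b(i) = q*c(i) moves two doubles per element (one load, one store), so
     bandwidth = 2*N*sizeof(double)/time.  N is chosen so the arrays are far
     larger than any cache and main-memory bandwidth is what gets measured. */
  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/time.h>

  #define N 20000000L   /* ~160 MB per double array */

  static double wtime(void)
  {
      struct timeval tv;
      gettimeofday(&tv, NULL);
      return tv.tv_sec + tv.tv_usec * 1e-6;
  }

  int main(void)
  {
      double *b = malloc(N * sizeof(double));
      double *c = malloc(N * sizeof(double));
      const double q = 3.0;
      long i;

      for (i = 0; i < N; i++) { b[i] = 0.0; c[i] = 1.0; }   /* touch memory before timing */

      double t0 = wtime();
      for (i = 0; i < N; i++)
          b[i] = q * c[i];                  /* the Scale kernel */
      double t1 = wtime();

      double bytes = 2.0 * N * sizeof(double);   /* read c, write b */
      printf("Scale: %.2f GB/s\n", bytes / (t1 - t0) / 1e9);

      free(b); free(c);
      return 0;
  }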
Benchmark Results: Applications

CASTEP: Al2O3
• Density functional theory application (Payne et al., 2002; Segall et al., 2002).
• Widely used in the UK; the largest single user of HPCx.
• Benchmark: a 270-atom Al2O3 slab sampled with 2 k-points.
• Phase 2a is around 1.3 times faster than Phase 2
  – even though the clock frequency is lower
  – the code is taking advantage of the improved memory bandwidth.

H2MOL
• Solves the time-dependent Schrödinger equation for the laser-driven dissociation of H2 molecules.
• Refines the grid as the processor count increases, hence constant work per processor.
• Phase 2a is almost a factor of 2 faster than Phase 2.
• The writing of intermediate results shows up the poor I/O performance of Blue Gene.

PCHAN
• Finite-difference code for turbulent flow: shock/boundary-layer interaction (SBLI).
• Communications: halo exchanges between adjacent computational sub-domains.
• Phase 2a is around 2 times faster than Phase 2.
• Very good scaling on all systems – HPCx superscales.

AIMPRO
• Ab Initio Modelling PROgram.
• Determines atomic structures using the Born-Oppenheimer approximation.
• Benchmark: DS-4 – 433 atoms, 12124 basis functions and 4 k-points.
• Phase 2a outperforms Phase 2 by around a factor of 1.2.

MDCASK
• MDCASK: classical molecular dynamics code used to study radiation damage in metals.
• Benchmark used: 1,372,000 atoms in a Ti lattice.
• Performance is worse on Phase 2a than on Phase 2
  – by a factor larger than the clock frequency ratio
  – scaling is also worse on Phase 2a.
• Classical molecular dynamics codes are characterised by many strided memory accesses
  – the degradation could be due to sensitivity to increased latency in some part of the memory subsystem (a small illustration of why strided access is latency-bound follows the application results below).

LAMMPS
• Classical molecular dynamics code that can simulate a wide range of materials.
• Rhodopsin benchmark: 2,048,000 atoms.
• Performance degradation again on the new system.
• The slowdown factor matches the clock ratio at low processor counts, but scaling is worse on Phase 2a.

NAMD 2.6b1: ApoA1, 92,224 atoms
• NAMD: classical molecular dynamics code designed for high-performance simulation of large biomolecular systems.
• ApoA1 benchmark: 92,224 atoms.
• Like the other classical molecular dynamics codes, it performs worse on Phase 2a.

DL_POLY3: Gramicidin, 792,960 atoms
• DL_POLY is a general-purpose molecular dynamics package.
• DL_POLY3 uses a distributed domain decomposition model.
• The benchmark: a system of eight Gramicidin-A species (792,960 atoms).
• Performs slightly better on Phase 2a, but does not improve as much as some of the other codes.
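As an aside on the strided-access point made for MDCASK above, the sketch below contrasts a unit-stride traversal of a large array with a strided traversal of the same data. It is a generic illustration in C (not code from MDCASK, LAMMPS or NAMD, and the stride is an arbitrary choice): with a large stride almost every load touches a new cache line and pays close to the full memory latency, whereas the unit-stride loop is limited by bandwidth. This is the kind of behaviour that would make a code sensitive to changes in memory-subsystem latency.

  /* Illustrative sketch: sum a large array once with unit stride and once
     with a large stride.  Same number of loads, but the strided walk hits a
     new cache line on almost every access, so it is latency-bound. */
  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/time.h>

  #define N (1L << 24)           /* 16M doubles, ~128 MB: far larger than cache */

  static double wtime(void)
  {
      struct timeval tv;
      gettimeofday(&tv, NULL);
      return tv.tv_sec + tv.tv_usec * 1e-6;
  }

  int main(void)
  {
      double *x = malloc(N * sizeof(double));
      long i, s;
      for (i = 0; i < N; i++) x[i] = 1.0;

      /* Unit-stride traversal: streams through memory, bandwidth-bound. */
      double t0 = wtime(), sum = 0.0;
      for (i = 0; i < N; i++) sum += x[i];
      double t_unit = wtime() - t0;

      /* Strided traversal: each load lands in a different cache line and
         pays (close to) the full memory latency. */
      const long stride = 1021;
      double sum2 = 0.0;
      t0 = wtime();
      for (s = 0; s < stride; s++)
          for (i = s; i < N; i += stride) sum2 += x[i];
      double t_strided = wtime() - t0;

      printf("unit stride: %.3f s   stride %ld: %.3f s   (sums %.1f %.1f)\n",
             t_unit, stride, t_strided, sum, sum2);
      free(x);
      return 0;
  }

With SMT, a second thread can issue loads while the first is stalled on such latencies, which is consistent with the SMT gains reported for these codes in the next section.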
Simultaneous Multithreading

Simultaneous Multithreading (SMT)
• The theoretical peak floating-point performance of microprocessors has risen steadily in recent years.
• The actual performance of applications, relative to theoretical peak, has dropped substantially
  – i.e. the number of cycles for which the floating-point units are idle is rising.
• This is due to the latencies involved in processor operations.
• The compiler attempts to schedule instructions to minimise wasted cycles
  – but its effectiveness is limited by a lack of independent instructions.
• SMT: multiple threads can issue instructions to the functional units in each cycle
  – the number of independent instructions increases and the number of idle cycles decreases.

Simultaneous Multithreading (SMT) on HPCx
• The Power5 processors on HPCx have 2 floating-point units and support SMT with 2 threads.
• Hence there are 2 virtual processors (MPI tasks or OpenMP threads) running per physical processor.
• No SMT:
  #@ tasks_per_node = 16
• With SMT:
  #@ tasks_per_node = 32
  #@ requirements = (Feature == "SMT")
• Disadvantages:
  – more communication
  – the memory limit per task is halved.

SMT: Streams
• Compare open squares (no SMT) with open circles (SMT).
• With SMT there are twice as many tasks per node; for a direct comparison, the SMT results have been multiplied by a factor of 2.
• No difference in memory bandwidth is observed with SMT.
• Of course, the caches are effectively halved in size.
• Therefore, any improvements in applications must be due to reduced memory latency (as expected).

SMT: Classical Molecular Dynamics
• Reminder: the classical molecular dynamics codes did worse than expected on Phase 2a, likely due to sensitivity to increased memory latency.
• Such codes benefit from SMT: it seems that the latencies are successfully hidden.
• The benefit is limited to lower processor counts; at high counts the large amount of communication takes over.
• For NAMD, up to a factor of 1.4 improvement; the crossover point is around 512 processors.
• For MDCASK, up to a factor of 1.4 improvement; the crossover point is around 256 processors.
• For DL_POLY, up to a factor of 1.2 improvement; the crossover point is around 64 processors.

SMT: CASTEP and H2MOL
• Reminder: the performance of CASTEP and H2MOL improved on the new system.
  – No performance benefit is seen with SMT.
  – SMT degrades performance in certain situations.

Conclusions
• HPCx was recently upgraded from Power4 to Power5 technology.
• Although the new chips have a slightly lower clock frequency, significant improvements are observed for the majority of applications
  – due to the better memory bandwidth.
• Some types of application, in particular classical molecular dynamics, have not performed as well as expected on the new system.
  – These applications are characterised by many strided memory accesses.
  – Sensitivity to increased latency could be to blame.
• Performance benefits from the use of SMT have been observed in certain situations
  – in particular for those codes which did not do as well as expected.
  – Users should benchmark their own codes.