HPPAC 2012, Monday, May 21st
LLNL-PRES-552151
This work has been authored by Lawrence Livermore National Security, LLC under contract DE-AC52-07NA27344 with the U.S. Department of Energy. Accordingly, the United States Government retains and the publisher, by accepting this work for dissemination, acknowledges that the United States Government retains a nonexclusive, paid-up, irrevocable, world-wide license to publish or reproduce the disseminated form of this work, or allow others to do so, for United States Government purposes.

Traditional
• All components can operate at the highest power level simultaneously
• Power is provisioned for the "worst case"
• Users are happily oblivious (about power)
• Few if any applications are limited by power

Exascale (if not sooner)
• Not all components can operate at the highest power level simultaneously
• Power provisioning is best effort
• Users must tune power for performance
• Nearly every application is limited by power

Traditional
• Utilization is measured in node-hours
• Weak-scaling jobs perform best using as many nodes as possible
• Running all components as fast as possible reliably leads to top performance

Exascale (if not sooner)
• Utilization is measured in kilowatt-hours
• Weak-scaling jobs may perform optimally with fewer, faster nodes
• Running all components as fast as possible cannot be done; running most components at identical speeds is suboptimal

Average Processor Power Bound
[Figure: exascale and rzmerl (early April, mid(?) April); per-processor Linpack power (Watts) at the non-turbo frequency (2.6 GHz) and at max turbo (3.3 GHz), with lost performance relative to the average processor power bound]
• Each processor uses some amount of power; the sum of processor power draw divided by processor count must be at or below the bound
• With Intel Turbo Boost, total Linpack power divided by processor count should be less than the bound, so capping below turbo costs performance
• Short-term solution: disable Turbo Boost
• Mid-term solution: buy more power (this does not scale)
• Long-term solution: schedule power globally to optimize performance

Running Average Power Limit (RAPL)
• Measures cumulative joules (power × time)
• Three separate power meters
• Clamping on package and DRAM power
• Turbo suppression
• Effective frequency
• libmsr currently under development

• Introduced on Sandy Bridge processors
• Onboard energy meters measure accumulated joules; divide by time to get average power
• Can place a user-specified limit on average power over a user-specified time window
Source: Intel 64 and IA-32 Software Developer's Manual, Volume 3B

• Limits are ignored until the enable bits are set
• Setting LOCK fixes the power limits until reboot
• The power limit is enforced as average watts over a user-specified time window
• Two windows allow tweaking both peak and average power: a higher bound with a smaller window (peak power) and a lower bound with a wider window (average power)
• Power granularity: 0.125 W; minimum power bound: 51 W
• Time window resolution: ~1 ms; maximum window: ~46 ms
Source: Intel 64 and IA-32 Software Developer's Manual, Volume 3B
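To make the register layout concrete, below is a minimal sketch of programming the package power limit through the Linux msr driver. The MSR names and bit fields (MSR_RAPL_POWER_UNIT, MSR_PKG_POWER_LIMIT, the enable, clamp, and LOCK bits) follow the Intel SDM cited above for Sandy Bridge; the /dev/cpu/<cpu>/msr access path, the 60 W example bound, and the time-window encoding are illustrative assumptions, and this is not the libmsr interface (which was still under development at the time).

/* Sketch: set a package power limit via MSR_PKG_POWER_LIMIT (Intel SDM, Vol. 3B).
 * Assumes the Linux msr driver (/dev/cpu/<cpu>/msr) and root privileges.
 * Illustrative only -- not the libmsr interface. */
#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>

#define MSR_RAPL_POWER_UNIT 0x606
#define MSR_PKG_POWER_LIMIT 0x610

static uint64_t rdmsr(int fd, uint32_t reg) {
    uint64_t v = 0;
    pread(fd, &v, sizeof(v), reg);      /* msr driver uses the register as the offset */
    return v;
}

int main(void) {
    int fd = open("/dev/cpu/0/msr", O_RDWR);    /* first core of package 0 */
    if (fd < 0) { perror("open msr"); return 1; }

    /* Power unit is 1/2^n watts; n is typically 3, i.e. 0.125 W granularity. */
    double watt_unit = 1.0 / (double)(1ULL << (rdmsr(fd, MSR_RAPL_POWER_UNIT) & 0xf));

    double bound_watts = 60.0;                  /* example average power bound */
    uint64_t bound = (uint64_t)(bound_watts / watt_unit) & 0x7fff;

    /* Read-modify-write window #1 (bits 0-23): power limit, enable (bit 15),
     * clamp (bit 16), and an illustrative time-window encoding (bits 17-23).
     * Window #2 and the LOCK bit (63) are left untouched; once LOCK is set,
     * further writes are ignored until reboot. */
    uint64_t msr = rdmsr(fd, MSR_PKG_POWER_LIMIT);
    msr = (msr & ~0xffffffULL) | bound | (1ULL << 15) | (1ULL << 16) | (0x0aULL << 17);

    if (pwrite(fd, &msr, sizeof(msr), MSR_PKG_POWER_LIMIT) != sizeof(msr))
        perror("write MSR_PKG_POWER_LIMIT");
    close(fd);
    return 0;
}

Once the enable bit is set, the hardware enforces the bound as average watts over the programmed window, consistent with the 0.125 W granularity and 51 W minimum noted above.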
Similar interface for DRAM power control
• Only one power limit is supported
Source: Intel 64 and IA-32 Software Developer's Manual, Volume 3B

rzzin, mg.C.8, 64 processors, 34 power bounds
[Figure: per-processor time and power with no power bound and with a 51 W power bound]
• With no power bound, processors take the same time but show significant variation in power
• Under a power bound, processors require similar amounts of power, but performance becomes heterogeneous: where should the hot processors go?
• Individual processor efficiency has not changed; power variation is expected and acceptable, but under a power bound efficiency variation manifests as performance variation
• Is it worth paying a premium for efficient processors?

rzmerl, NPB C.8, 234 processors
[Figure: average watts per processor, ordered by cg.C.8 average PKG power]
• Wide variation in power consumption across applications
• Provisioning power for the most power-hungry application leaves the remaining applications node-bound, not power-bound

rzmerl, NPB C.8, 234 processors
[Figure: average watts per processor, ordered by cg.C.8 average PKG power]
• Memory power is substantially lower than package power

Overprovision hardware
• Processors are cheap and plentiful; power is not

Measure performance at max power consumption
• May require turning off nodes
• Running out of nodes before running out of power means the application is not power-bound

Expect heterogeneous processor performance
• Put the most-efficient nodes on the critical path if possible
• Put the least-efficient nodes where they will do the least harm
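The per-processor average PKG watts used to order the plots above come from the RAPL energy counters described earlier: sample the accumulated joules twice and divide by the elapsed time. Below is a minimal sketch, assuming the Linux msr driver and the Sandy Bridge MSR layout from the Intel SDM; the one-second sampling interval and file path are illustrative, and this is not the libmsr interface.

/* Sketch: average package power from the RAPL energy counter (Intel SDM, Vol. 3B).
 * Assumes /dev/cpu/0/msr is readable. Illustrative only -- not the libmsr API. */
#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>

#define MSR_RAPL_POWER_UNIT   0x606
#define MSR_PKG_ENERGY_STATUS 0x611

static uint64_t rdmsr(int fd, uint32_t reg) {
    uint64_t v = 0;
    pread(fd, &v, sizeof(v), reg);      /* msr driver uses the register as the offset */
    return v;
}

int main(void) {
    int fd = open("/dev/cpu/0/msr", O_RDONLY);
    if (fd < 0) { perror("open msr"); return 1; }

    /* Energy unit is 1/2^n joules; n is typically 16 (~15.3 microjoules per count). */
    double joule_unit = 1.0 / (double)(1ULL << ((rdmsr(fd, MSR_RAPL_POWER_UNIT) >> 8) & 0x1f));

    const double interval_s = 1.0;              /* illustrative sampling interval */
    uint32_t e0 = (uint32_t)rdmsr(fd, MSR_PKG_ENERGY_STATUS);
    sleep((unsigned)interval_s);
    uint32_t e1 = (uint32_t)rdmsr(fd, MSR_PKG_ENERGY_STATUS);

    /* The 32-bit energy counter wraps; unsigned subtraction tolerates one wrap. */
    double joules = (double)(uint32_t)(e1 - e0) * joule_unit;
    printf("average package power over %.0f s: %.2f W\n", interval_s, joules / interval_s);

    close(fd);
    return 0;
}

Sampling each socket of a job this way yields the per-processor average PKG power distributions plotted above; the DRAM domain exposes an analogous energy counter for measuring memory power.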