Exascale: Power, Cooling, Reliability, and Future Arithmetic
John Gustafson
HPC User Forum, Seattle, September 2010


The First Cluster: The “Cosmic Cube”
• 64 Intel 8086-8087 nodes; 700 watts and 6 cubic feet in total.
• Chuck Seitz and Geoffrey Fox developed it in 1984 as a $50,000 alternative to Cray vector mainframes.
• Motivated by QCD physics problems, but soon found useful for a very wide range of apps.

But Fox & Seitz did not invent the Cosmic Cube. Stan Lee did.
From Wikipedia: “A device created by a secret society of scientists to further their ultimate goal of world conquest.” (Tales of Suspense #79, July 1966)


Terascale Power Use Today (Not to Scale)
A 1-Tflop/s machine today draws about 2 kW for the machine itself, and about 5 kW in total:

  Compute              200 W   1 Tflop/s @ 200 pJ per flop
  Control             1500 W   0.2 instructions per flop @ 7.5 nJ per instruction
  Memory               150 W   0.1 byte per flop @ 1.5 nJ per byte
  Communication        100 W   100 pJ of communication per flop
  Disk                 100 W   10 TB of disk @ 10 W per TB
  Power supply loss    950 W   19% of total (81%-efficient power supplies)
  Heat removal        2000 W   40% of total power consumed (all levels, chip to facility)

Derived from data from S. Borkar, Intel Fellow.

Let’s See that Drawn to Scale…
A SIMD accelerator approach gives up Control to reduce watts per Tflop/s. That can work for some applications that are very regular and SIMD-like (vectorizable with long vectors).


Energy Cost by Operation Type

  Operation                   Approximate energy consumed today
  64-bit multiply-add           200 pJ
  Read 64 bits from cache       800 pJ
  Move 64 bits across chip     2000 pJ
  Execute an instruction       7500 pJ
  Read 64 bits from DRAM      12000 pJ

Notice that 12000 pJ @ 3 GHz = 36 watts! (A quick sanity-check calculation of these figures appears at the end of the data-center discussion below.)
SiCortex’s solution was to drop the memory speed, but the performance dropped proportionately. Larger caches actually reduce power consumption.


Energy Cost of a Future HPC Cluster

  Exaflop     20 MW    Data center
  Petaflop    20 kW    Cabinet
  Teraflop    20 W     Chip/module

No cost for interconnect? Hmm…

But while we’ve been building HPC clusters, Google and Microsoft have been very, very busy…


Cloud computing has already eclipsed HPC for sheer scale
• Cloud computing means using a remote data center to manage scalable, reliable, on-demand access to applications.
• It provides applications and infrastructure over the Internet.
• “Scalable” here means:
  – possibly millions of simultaneous users of the app;
  – exploiting thousand-fold parallelism in the app.
From Tony Hey, Microsoft.


Mega-Data Center Economy of Scale
• A 50,000-server facility is 6–7x more cost-effective than a 1,000-server facility in key respects.
• Over 18 million square feet. Each.
• Don’t expect a TOP500 score soon. Secrecy? Or… not enough interconnect?

  Technology      Cost in small data center         Cost in large data center         Ratio
  Network         $95 per Mbps per month            $13 per Mbps per month            7.1
  Storage         $2.20 per GB per month            $0.40 per GB per month            5.7
  Administration  ~140 servers per administrator    >1000 servers per administrator   7.1

Data courtesy of James Hamilton. Each data center is the size of 11.5 football fields.


Computing by the truckload
• Build racks, cooling, and communication together in a “container.”
• Hookups: power, cooling, and interconnect.
• But: designed for capacity computing, not capability computing.
• I estimate each center is already over 70 megawatts… and 20 petaflops total!
From Tony Hey, Microsoft.


Arming for search engine warfare
It’s starting to look like… a steel mill!
(This work is licensed under a Creative Commons Attribution 3.0 U.S. License.)


A steel mill takes ~500 megawatts
• Half the steel mills in the US are abandoned.
• Self-contained power plant.
• Maybe some should be converted to data centers!
• Is this where “economy of scale” will top out for clusters as well?
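
Before moving on from power: a quick sanity check on the per-operation energies and the terascale budget quoted earlier in this deck. The minimal C sketch below simply multiplies energy per operation by operation rate; every constant is the approximate figure from those slides, while the variable names and printout are mine, not anything from the original sources.

```c
/* Sanity check on the energy figures quoted earlier in this deck.
 * Power (W) = energy per operation (J) * operation rate (ops/s).
 * All per-operation energies are the approximate values from the
 * "Energy Cost by Operation Type" and "Terascale Power Use Today"
 * slides; this is a back-of-the-envelope recomputation, not a measurement.
 */
#include <stdio.h>

int main(void)
{
    /* 12000 pJ per 64-bit DRAM read, one read per cycle at 3 GHz -> 36 W */
    double dram_read_J = 12000e-12;
    double rate_Hz     = 3e9;
    printf("DRAM reads at 3 GHz: %.0f W\n", dram_read_J * rate_Hz);

    /* Terascale node budget at 1 Tflop/s sustained. */
    double flops   = 1e12;
    double compute = 200e-12 * flops;       /* 200 pJ per flop               */
    double control = 7.5e-9 * 0.2 * flops;  /* 7.5 nJ/instr, 0.2 instr/flop  */
    double memory  = 1.5e-9 * 0.1 * flops;  /* 1.5 nJ/byte, 0.1 byte/flop    */
    double comm    = 100e-12 * flops;       /* 100 pJ of comm. per flop      */
    double disk    = 10.0 * 10.0;           /* 10 TB at 10 W per TB          */
    double machine = compute + control + memory + comm + disk;

    printf("Compute %4.0f W  Control %4.0f W  Memory %4.0f W  "
           "Comm %4.0f W  Disk %4.0f W\n",
           compute, control, memory, comm, disk);
    printf("Machine subtotal: %.0f W\n", machine);
    return 0;
}
```

Compiled with any C compiler, this reproduces the ~36 W DRAM-read figure and the ~2 kW machine subtotal; the remaining ~3 kW of the 5 kW total is power-supply loss and heat removal.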
“With great power comes great responsibility.” —Uncle Ben
“Yes, and also some really big heat sinks.” —John G.


An unpleasant math surprise lurks…
64-bit precision is looking long in the tooth. (Gulp!)

[Chart: floating-point word size in bits versus year, 1940–2010 — Zuse, 22 bits; Univac and IBM, 36 bits; CDC, 60 bits; Cray, 64 bits; x86, 80 bits (stack only); most vendors today, 64 bits.]

At 1 exaflop/s (10^18 flop/s), 15 decimals don’t last long.


It’s unlikely a code uses the best precision
• Too few bits gives unacceptable errors.
• Too many bits wastes memory, bandwidth, and joules.
• This goes for integers as well as floating point.

[Chart: the optimum precision, in bits, of the floating-point operations in an application, compared with the fixed 64 bits of IEEE 754 double precision.]


Ways out of the dilemma…
• Better hardware support for 128-bit, if only for use as a check.
• Interval arithmetic has promise, if programmers can learn to use it properly (not just apply it to point-arithmetic methods).
• Increasing precision automatically with the memory hierarchy might even allow a return to 32-bit.
• Maybe it’s time to restore Numerical Analysis to the standard curriculum of computer scientists?


The hierarchical precision idea
Moving up the hierarchy trades capacity for lower latency, more precision, and more range:

  Level       Capacity              Latency      Precision                  Range
  Registers   16 FP values          1 cycle      69 significant decimals    10^-2525221 to 10^+2525221
  L1 cache    4096 FP values        6 cycles     33 significant decimals    10^-9864 to 10^+9864
  L2 cache    2 million FP values   30 cycles    15 significant decimals    10^-307 to 10^+307
  Local RAM   2 billion FP values   200 cycles   7 significant decimals     10^-38 to 10^+38

(A short sketch at the end of this deck shows how these figures follow from the bit layouts.)


Another unpleasant surprise lurking… No more automatic caches?
• Hardware cache policies are designed to minimize miss rates, at the expense of low cache utilization (typically around 20%).
• Memory transfers will soon be half the power consumed by a computer, and computing is already power-constrained.
• Software will need to manage memory hierarchies explicitly, and source codes need to expose memory moves, not hide them.


Summary
• Mega-data-center clusters have eclipsed HPC clusters, but HPC can learn a lot from their methods in getting to exascale.
• Clusters may grow to the size of steel mills, dictated by economies of scale.
• We may have to rethink the use of 64-bit flops everywhere, for a variety of reasons.
• Speculative data motion (like automatic caching) reduces operations per watt… it’s on the way out.
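
A closing note on the hierarchical-precision table above: its decimals-and-range figures follow from how many bits each format spends on significand versus exponent. The sketch below (C, compile with -lm) checks this. The 32- and 64-bit rows correspond to IEEE 754 single and double precision; the 16- and 24-bit exponent widths assumed for the 128- and 256-bit formats are my own guesses, chosen because they land within a digit or two of the slide’s figures, not layouts taken from the talk or from any standard.

```c
/* How the precision/range figures in the hierarchical-precision table
 * follow from a floating-point format's bit layout:
 *   significant decimals  ~ floor(p * log10(2)),        p = significand bits
 *   max decimal exponent  ~ (2^(e-1) - 1) * log10(2),   e = exponent bits
 * The 128- and 256-bit exponent widths below are assumptions, chosen to
 * roughly reproduce the slide's numbers; they are not published formats.
 */
#include <math.h>
#include <stdio.h>

static void show(const char *level, int total_bits, int exp_bits)
{
    /* Significand bits counting the implicit leading 1: the sign bit's
       slot is offset by that implicit bit, so p = total - exponent bits. */
    int p = total_bits - exp_bits;
    double decimals    = floor(p * log10(2.0));
    double max_dec_exp = (pow(2.0, exp_bits - 1) - 1.0) * log10(2.0);
    printf("%-9s %3d-bit word: %2.0f significant decimals, range ~10^+/-%.0f\n",
           level, total_bits, decimals, max_dec_exp);
}

int main(void)
{
    show("RAM",        32,  8);  /* IEEE single:  ~7 decimals, ~10^+/-38        */
    show("L2 cache",   64, 11);  /* IEEE double: ~15 decimals, ~10^+/-308       */
    show("L1 cache",  128, 16);  /* assumed:     ~33 decimals, ~10^+/-9865      */
    show("Registers", 256, 24);  /* assumed:     ~69 decimals, ~10^+/-2,525,223 */
    return 0;
}
```

Its output lands within rounding of the table above (for example 10^+/-308 versus the table’s 10^+/-307, depending on where one counts the exponent bias).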