The Past, Present, and Future of Green Computing
Kirk W. Cameron
SCAPE Laboratory, Virginia Tech
1

Enough About Me
• Associate Professor, Virginia Tech
• Co-founder, Green500
• Co-founder, MiserWare
• Founding member, SpecPower
• Consultant for EPA Energy Star for Servers
• IEEE Computer "Green IT" columnist
• Over $4M in federally funded "green" research
• SystemG supercomputer
2

What is SCAPE?
• Scalable Performance Laboratory
  – Founded in 2001 by Cameron
• Vision
  – Improve the efficiency of high-end systems
• Approach
  – Exploit/create technologies for high-end systems
  – Conduct quality research to solve important problems
  – When appropriate, commercialize technologies
  – Educate and train the next generation of HPC computer scientists
3

The Big Picture (Today)
• Past: Challenges
  – Need to measure and correlate power data
  – Save energy while maintaining performance
• Present
  – Software/hardware infrastructure for power measurement
  – Intelligent power management (CPU MISER, Memory MISER)
  – Integration with other toolkits (PAPI, Prophesy)
• Future: Research + Commercialization
  – Management Infrastructure for Energy Reduction
  – MiserWare, Inc.
  – Holistic power management
4

1882 - 2001
5

Prehistory (1882 - 2001)
• Embedded systems
• General-purpose microarchitecture
  – Circa 1999, power becomes a disruptive technology
  – Moore's Law + the clock-frequency arms race
  – Simulators emerge (e.g., Princeton's Wattch)
  – Related work continues today (CMPs, SMT, etc.)
6

2002
7

Server Power (2002)
• IBM Austin
  – Energy-aware commercial servers [Keller et al.]
• LANL
  – Green Destiny [Feng et al.]
• Observations
  – IBM targets commercial apps
  – Feng et al. achieve power savings in exchange for performance loss
8

HPC Power (2002)
• My observations
  – Power will become disruptive to HPC
  – Laptops are outselling PCs
  – Commercial power-aware techniques are not appropriate for HPC
• At $800,000 per year per megawatt:
  – TM CM-5: 0.005 megawatts, $4,000/yr
  – Residential A/C: 0.015 megawatts, $12,000/yr
  – Intel ASCI Red: 0.850 megawatts, $680,000/yr
  – High-speed train: 10 megawatts, $8 million/yr
  – Earth Simulator: 12 megawatts, $9.6 million/yr
  – Conventional power plant: 300 megawatts
9

HPPAC Emerges (2002)
• SCAPE project
  – High-performance, power-aware computing
  – Two initial goals
    • Measurement tools
    • Power/energy savings
  – Big goals… no funding (risked all startup funds)
10

2003 - 2004
11

Cluster Power (2003 - 2004)
• IBM Austin
  – "On Evaluating Request-Distribution Schemes for Saving Energy in Server Clusters," ISPASS '03 [Lefurgy et al.]
  – "Improving Server Performance on Transaction Processing Workloads by Enhanced Data Placement," SBAC-PAD '04 [Rubio et al.]
• Rutgers
  – "Energy Conservation Techniques for Disk Array-Based Servers," ICS '04 [Bianchini et al.]
• SCAPE
  – "High-Performance, Power-Aware Computing," SC04
  – Power measurement + power/energy savings
12

PowerPack Measurement (2003 - 2004)
Scalable, synchronized, and accurate.
[Diagram: the PowerPack measurement framework. Hardware power/energy profiling: a Baytech management unit and Baytech power strips log AC power from the wall outlet, while multi-meters log DC power from the power supply of a single node, driven by multi-meter control threads. Data collection feeds a data log, a data repository, and data analysis. Software power/energy control: PowerPack libraries (profile/control) with DVS control threads drive applications and microbenchmarks on a high-performance, power-aware cluster.]
13

After frying multiple components…
14

PowerPack Framework (DC Power Profiling)

    if (node .eq. root) then
        call pmeter_init(xmhost, xmport)
        call pmeter_log(pmlog, NEW_LOG)
    endif

    <CODE SEGMENT>

    if (node .eq. root) then
        call pmeter_start_session(pm_label)
    endif

    <CODE SEGMENT>

    if (node .eq. root) then
        call pmeter_pause()
        call pmeter_log(pmlog, CLOSE_LOG)
        call pmeter_finalize()
    endif

Multi-meters + 32-node Beowulf cluster
15

Power Profiles – Single Node
• System idle (system power: 39 W): power supply 33%, fans 23%, CPU 14%, disk 11%, memory 10%, chipset/other 8%, NIC 1%
• Memory-bound workload, 171.swim (system power: 59 W): CPU 35%, power supply 21%, memory 16%, fans 15%, disk 7%, chipset/other 5%, NIC 1%
• The CPU is typically the largest consumer of power under load
16

Power Profiles – Single Node
[Chart: power consumption (0-40 W) of the CPU, memory, disk, and NIC for various workloads: idle, memory-bound (171.swim), CPU-bound (164.gzip), disk-bound (cp), and network-bound (scp). Note: only power consumed by the CPU, memory, disk, and NIC is considered here.]
17

NAS PB FT – Performance Profiling
[Chart: execution timeline alternating compute, reduce (comm), compute, and all-to-all (comm) phases.]
About 50% of the time is spent in communication.
18

Power Profile of the FT Benchmark (class B, NP=4)
[Chart: CPU, memory, disk, and NIC power (Watts) over time (0-200 seconds), annotated with initialization, startup, and iterations 1-3.]
Power profiles reflect performance profiles.
19
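To make the link between code regions and profiles like these concrete, the sketch below shows one way the root rank can emit timestamped region markers that are later aligned with the logged meter samples. This is an illustrative C analogue of the Fortran instrumentation above, not the actual PowerPack API; the helper functions, file name, and region label are assumptions made for the example.

    /* Illustrative sketch only (not the PowerPack API): the root rank writes
     * timestamped BEGIN/END markers for a code region; the marker stream can
     * later be merged with multi-meter samples to attribute power to phases.
     * Helper names, file name, and label below are hypothetical. */
    #include <mpi.h>
    #include <stdio.h>

    static FILE *region_log = NULL;

    static void region_log_open(const char *path) { region_log = fopen(path, "w"); }

    static void region_mark(const char *label, const char *event)
    {
        /* MPI_Wtime() timestamps allow merging with the meter log,
           provided the two clocks are synchronized. */
        if (region_log)
            fprintf(region_log, "%.6f %s %s\n", MPI_Wtime(), event, label);
    }

    static void region_log_close(void) { if (region_log) fclose(region_log); }

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) region_log_open("ft.regions.log");

        if (rank == 0) region_mark("fft_phase", "BEGIN");
        /* ... compute + all-to-all communication phase goes here ... */
        if (rank == 0) region_mark("fft_phase", "END");

        if (rank == 0) region_log_close();
        MPI_Finalize();
        return 0;
    }

Merging such a marker stream with the per-component power log is what produces phase-annotated profiles like the FT profile above.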
One FFT Iteration
[Chart: CPU power and memory power (Watts) across one iteration (110-150 seconds), annotated with the evolve, fft, cffts1, cffts2, transpose_x_yz, transpose_local, transpose_finish, mpi_all-to-all, send-recv, and wait phases.]
20

2005 - Present
21

Intuition Confirmed (2005 - Present)
22

HPPAC Tool Progress (2005 - Present)
• PowerPack
  – Modularized PowerPack and SysteMISER
  – Extended analytics for applicability
  – Extended to support thermals
• SysteMISER
  – Improved analytics to weigh tradeoffs at runtime
  – Automated cluster-wide DVS scheduling
  – Support for automated power-aware memory
23

Predicting CPU Power (2005 - Present)
[Chart: estimated vs. measured CPU power (Watts) over time (0-100 seconds).]
24

Predicting Memory Power (2005 - Present)
[Chart: estimated vs. measured memory power (Watts) over time (0-100 seconds).]
25

Correlating Thermals: BT (2005 - Present)
26

Correlating Thermals: MG (2005 - Present)
27

Tempest Results: FT (2005 - Present)
28

SysteMISER (2005 - Present)
• Our software approach to reducing energy
  – Management Infrastructure for Energy Reduction
• Power/performance
  – Measurement
  – Prediction
  – Control
[Image: The Heat Miser.]
29

Power-Aware DVS Scheduling Strategies (2005 - Present)

CPUSPEED daemon:
    [example]$ start_cpuspeed
    [example]$ mpirun -np 16 ft.B.16

Internal scheduling:
    MPI_Init();
    <CODE SEGMENT>
    setspeed(600);
    <CODE SEGMENT>
    setspeed(1400);
    <CODE SEGMENT>
    MPI_Finalize();

External scheduling:
    [example]$ psetcpuspeed 600
    [example]$ mpirun -np 16 ft.B.16

NEMO & PowerPack framework for saving energy
30

CPU MISER Scheduling (FT) (2005 - Present)
[Chart: normalized energy and delay for FT.C.8 at fixed frequencies (600-1400 MHz), under the auto setting, and under CPU MISER.]
CPU MISER: 36% energy savings, less than 1% performance loss.
See the SC2004 and SC2005 publications.
31

Where Else Can We Save Energy? (2005 - Present)
• Processor
  – DVS: where everyone starts
• NIC
  – A very small portion of system power
• Disk
  – A good choice (our future work)
• Power supply
  – A very good choice (for an EE or ME)
• Memory
  – Only 20-30% of system power, but…
32

The Power of Memory (2005 - Present)
[Chart: effects of increased memory on system power (90 W CPU, 9 W per 4 GB DIMM): the percentage of system power consumed by memory vs. by CPUs as memory per processor grows from 0 to 256 GB.]
33

Memory Management Policies (2005 - Present)
[Figure: default, static, and dynamic memory management policies.]
Memory MISER = page allocation shaping + allocation prediction + dynamic control
34

Memory MISER Evaluation of Prediction and Control (2005 - Present)
[Chart: memory online vs. memory demand (GB) over time (0-35,000 seconds).]
Prediction/control looks good, but are we guaranteeing performance?
35

Memory MISER Evaluation of Prediction and Control (2005 - Present)
[Chart: memory online vs. memory demand (GB) over time (22,850-22,950 seconds).]
Stable, accurate prediction using a PID controller (sketched below). But what about big (capacity) spikes?
36

Memory MISER Evaluation of Prediction and Control (2005 - Present)
[Chart: memory online vs. memory used (GB) over time (16,940-17,060 seconds).]
Memory MISER guarantees performance in the "worst" conditions.
37
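The "stable PID control" noted in these evaluations can be illustrated with a small control loop. The sketch below is a generic PID controller that keeps online memory capacity slightly above observed demand; the gains, the 1 GB device granularity, and the fixed headroom term are illustrative assumptions, not Memory MISER's actual parameters.

    /* Generic PID sketch in the spirit of Memory MISER's prediction/control:
     * choose how many 1 GB memory devices to keep online, tracking demand
     * with a little headroom. Gains and granularity are illustrative. */
    #include <stdio.h>

    struct mem_pid { double kp, ki, kd, integral, prev_err; };

    /* One control step: returns the target number of 1 GB devices online. */
    static int pid_target_devices(struct mem_pid *c, double demand_gb, double online_gb)
    {
        double err = demand_gb - online_gb;       /* positive when short on memory */
        double deriv = err - c->prev_err;
        c->integral += err;
        c->prev_err = err;

        double adjust = c->kp * err + c->ki * c->integral + c->kd * deriv;
        double target = online_gb + adjust + 1.0; /* +1 GB headroom for spikes */
        if (target < 1.0) target = 1.0;           /* never power everything off */
        return (int)(target + 0.5);               /* round to whole devices */
    }

    int main(void)
    {
        struct mem_pid ctl = { 0.5, 0.05, 0.1, 0.0, 0.0 };
        double demand[] = { 1.2, 1.3, 3.8, 3.9, 1.1, 1.0 };  /* GB, synthetic trace */
        double online = 2.0;

        for (int t = 0; t < 6; t++) {
            int target = pid_target_devices(&ctl, demand[t], online);
            printf("t=%d demand=%.1f GB, online=%.0f GB -> target %d devices\n",
                   t, demand[t], online, target);
            online = target;  /* assume pages are shaped onto the online devices */
        }
        return 0;
    }

In Memory MISER the analogous decision is coupled with page allocation shaping, so pages can actually be vacated from devices chosen to go offline.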
Memory MISER Evaluation: Energy Reduction (2005 - Present)
[Chart: FLASH memory demand (GB) and devices online over time, annotated with periods of stable PID control, high-frequency cyclic allocation/deallocation, tiered increases in memory allocation, and pinned OS pages that decrease efficiency.]
30% total system energy savings, less than 1% performance loss.
38

Present - 2012
39

SystemG Supercomputer @ VT

SystemG stats:
• 325 Mac Pro compute nodes, each with two 4-core 2.8 GHz Intel Xeon processors.
• Each node has 8 GB of RAM; each core has 6 MB of cache.
• Mellanox 40 Gb/s end-to-end InfiniBand adapters and switches.
• LINPACK result: 22.8 TFLOPS (trillion floating-point operations per second).
• Over 10,000 power and thermal sensors.
• Variable power modes: DVFS control (2.4 and 2.8 GHz), fan-speed control, concurrency throttling, etc. (Check /sys/devices/system/cpu/cpuX/cpufreq/scaling_available_frequencies; a short cpufreq sketch follows the "The Future…" slide.)
• Intelligent power distribution unit: Dominion PX (remotely controls the servers and network devices; also monitors current, voltage, power, and temperature through Raritan's KVM switches and secure console servers).

Deployment details:
• 13 racks total, with 24 nodes per rack and 8 nodes per layer.
• 5 PDUs per rack (Raritan PDU model DPCS12-20). Each PDU in SystemG has a unique IP address; users can use IPMI to retrieve information from the PDUs and control them, e.g., remotely shutting down and restarting machines and recording system AC power.
• There are two types of switch: 1) Ethernet: 1 Gb/s Ethernet switches, with 36 nodes sharing one switch. 2) InfiniBand: 40 Gb/s InfiniBand switches, with 24 nodes (one rack) sharing one switch.

Data Collection System and LabVIEW
[Figure: sample diagram and corresponding front panel from LabVIEW.]

A Power Profile for the HPCC Benchmark Suite
[Figure: power profile of the HPCC benchmark suite.]

Published Papers and Useful Links
Papers:
1. Rong Ge, Xizhou Feng, Shuaiwen Song, Hung-Ching Chang, Dong Li, Kirk W. Cameron, "PowerPack: Energy Profiling and Analysis of High-Performance Systems and Applications," IEEE Transactions on Parallel and Distributed Systems, Apr. 2009.
2. Shuaiwen Song, Rong Ge, Xizhou Feng, Kirk W. Cameron, "Energy Profiling and Analysis of the HPC Challenge Benchmarks," The International Journal of High Performance Computing Applications, Vol. 23, No. 3, pp. 265-276, 2009.
NI system details:
http://sine.ni.com/nips/cds/view/p/lang/en/nid/202545
http://sine.ni.com/nips/cds/view/p/lang/en/nid/202571

The Future… (Present - 2012)
• PowerPack
  – Streaming sensor data from any source
• PAPI integration
  – Correlated to various systems and applications
• Prophesy integration
  – Analytics to provide a unified interface
• SysteMISER
  – Study the effects of power-aware disks and NICs
  – Study the effects of emergent architectures (CMT, SMT, etc.)
  – Coschedule power modes for energy savings
46
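As a concrete illustration of the per-core DVFS power modes listed in the SystemG description above (exposed through the Linux cpufreq sysfs interface), the sketch below lists the available frequencies for one core and requests the lowest one via the userspace governor. This is a generic cpufreq example, not CPU MISER or SysteMISER; it assumes the userspace governor is available and requires root privileges.

    /* Generic Linux cpufreq sketch (not CPU MISER): list the available
     * frequencies for cpu0 and request the lowest one via the userspace
     * governor. Requires root and a driver exposing these sysfs files. */
    #include <stdio.h>
    #include <string.h>

    #define CPU0 "/sys/devices/system/cpu/cpu0/cpufreq/"

    static int write_str(const char *path, const char *val)
    {
        FILE *f = fopen(path, "w");
        if (!f) return -1;
        fprintf(f, "%s\n", val);
        return fclose(f);
    }

    int main(void)
    {
        char freqs[512] = "";
        FILE *f = fopen(CPU0 "scaling_available_frequencies", "r");
        if (!f || !fgets(freqs, sizeof freqs, f)) {
            fprintf(stderr, "cannot read available frequencies\n");
            return 1;
        }
        fclose(f);
        printf("available (kHz): %s", freqs);

        /* Frequencies are listed in kHz; take the last (typically lowest) entry. */
        char *last = NULL, *tok = strtok(freqs, " \n");
        while (tok) { last = tok; tok = strtok(NULL, " \n"); }
        if (!last) return 1;

        /* Switch to the userspace governor, then request that frequency. */
        if (write_str(CPU0 "scaling_governor", "userspace") != 0 ||
            write_str(CPU0 "scaling_setspeed", last) != 0) {
            fprintf(stderr, "setting frequency failed (root required?)\n");
            return 1;
        }
        printf("requested %s kHz on cpu0\n", last);
        return 0;
    }

A cluster-wide scheduler such as CPU MISER applies the same per-core mechanism, but drives the frequency choice from runtime measurement and prediction rather than a fixed setting.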
Outreach
• See http://green500.org
• See http://thegreengrid.org
• See http://www.spec.org/specpower/
• See http://hppac.cs.vt.edu
48

Acknowledgements
• My SCAPE team
  – Dr. Xizhou Feng (PhD 2006)
  – Dr. Rong Ge (PhD 2008)
  – Dr. Matt Tolentino (PhD 2009)
  – Mr. Dong Li (PhD student, expected 2010)
  – Mr. Song Shuaiwen (PhD student, expected 2010)
  – Mr. Chun-Yi Su, Mr. Hung-Ching Chang
• Funding sources
  – National Science Foundation (CISE: CCF, CNS)
  – Department of Energy (SC)
  – Intel
49

Thank you very much.
http://scape.cs.vt.edu
cameron@cs.vt.edu
Thanks to our sponsors: NSF (CAREER, CCF, CNS), DOE (SC), Intel
50