Zehan Cui, Yan Zhu, Yungang Bao, Mingyu Chen
Institute of Computing Technology, Chinese Academy of Sciences
July 28, 2011

Outline
◦ Motivation
◦ Design & Implementation
◦ Experiments
◦ Conclusion & Work in Progress

Motivation

[Figure: Watts/Server. Source: The Problem of Power Consumption in Servers, Intel, 2009]
The CPU no longer dominates system power. [Source: Barroso et al., The Datacenter as a Computer, 2009]

Measurement is the basis.
[Diagram labels: Hardware, Software, model, Low power, measurement, accuracy]

Component-Level: ATX-based method
◦ Directly powered through ATX wires.
◦ Modern motherboards mostly have dedicated ATX wires for the processor.
◦ VRM (Voltage Regulation Module) loss.
◦ Usually deduced from multiple ATX wires.
◦ Platform dependent.

Design & Implementation

Disk & CPU
◦ Similar to other ATX-based methods
Memory & Add-in Card Devices
◦ Wrapper-based methods
Advantages
◦ Accurate: direct measurement
◦ Easy-to-use: no deduction needed
◦ Portable: multi-platform
[Diagram labels: Power Supply, Current Sensor]

Prototype
◦ Disk power
◦ CPU power
◦ Memory power

Component          Count  Description
Wrapper Card       1      Memory power measurement. Supports DDR2-400 DIMMs.
Intermediate Card  1      8 channels. Each channel converts one current into a voltage.
DMM                2      Agilent 34411A. One channel each. Max speed: 50K samples per second. LAN interface.
Collector          1      PC. Collects data from the DMMs.

Experiments

Component  Detail
CPU        Intel Core2 Duo E4500. # of cores: 2. Clock speed: 2.2GHz. L2 cache: 2MB. FSB speed: 800MHz.
Memory     DDR2-400 2GB UDIMM. Frequency: 200MHz. Max bandwidth: 3.2GB/s.
Disk       640GB SATA

[Figure: power of components (CPU, Memory, Disk; unit: Watt) versus time from beginning (unit: Second) for 401.bzip2 from SPEC CPU2006]

The more frequently we measure power, the more detail we can see.
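The effect of sampling frequency on visible detail can be illustrated with a small sketch. Everything here is an assumption made for illustration: the trace is synthetic (not measured data), the burst length and power levels are made up, and `decimate` is a hypothetical helper that emulates a slower sampler by keeping every n-th sample of a 50K samples/s trace.

```python
# Synthetic illustration only: the trace below is made up, not measured data.

def decimate(samples, factor):
    """Keep every `factor`-th sample, emulating a sampler `factor` times slower."""
    return samples[::factor]


BASE_RATE_HZ = 50_000  # the DMM's stated maximum sampling speed

# One second of a hypothetical trace: 10 W baseline with a 0.2 ms burst at 30 W.
trace = [10.0] * BASE_RATE_HZ
for i in range(25_003, 25_013):  # 10 samples at 50K samples/s = 0.2 ms
    trace[i] = 30.0

at_5k = decimate(trace, 10)    # emulates 5,000 samples/s
at_50 = decimate(trace, 1000)  # emulates 50 samples/s

peak_5k = max(at_5k)  # the burst is still visible at this rate
peak_50 = max(at_50)  # the burst falls between samples and is missed
```

Whether a short burst is caught at a low rate depends on how it lines up with the sample instants; the sketch only shows that a slower sampler can miss it entirely.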
Observation: 5,000 samples/s is an appropriate sampling frequency at the component level.

[Figure annotations: higher BW but lower power; lower BW but higher power]

Malloc 512MB, access in different strides
Two causes
◦ Row conflicts
◦ Many TLB misses
Time: 6.5 times longer
Power: slightly lower
Energy: 5.9 times higher
◦ Increase the row buffer hit rate
◦ Large pages may be more efficient

What is the relationship between performance and power?
64MB memory
◦ Random vs. sequential
◦ Jump at least 64B: eliminates cache hits
◦ Large pages (2MB): eliminate TLB misses
Load/Store Unit % = LSU_stall_time / CPU_Cycle

Observation: It seems that DRAM power is already proportional to bandwidth. But the fact is that …

Use different SEEDs to generate different random access patterns; power varies by less than 1.1%.

Observation: DRAM power is highly correlated with two factors
• Load/Store Unit utilization
• Sequential / random access
We can build memory power models based on these two factors rather than on bandwidth.

Conclusion & Work in Progress

We use a hybrid approach
◦ ATX-based: CPU/Disk
◦ Wrapper card: DRAM/…
5 kHz is an appropriate sampling frequency to disclose fine-grained power behavior.
DRAM power is highly correlated with Load/Store Unit utilization rather than bandwidth.

Upgrade the current system
◦ Support DDR3
◦ Support large memory capacity
◦ Support 40 simultaneous measuring channels
Use an FPGA to collect measured data
Correlate the measured power data with high-level semantic information

Thanks! & Questions?

Backup

A wrapper card already exists; we made only a few small modifications.
[Diagram labels: Current Sensor, Power Supply, Signals]
Normal: DIMM (Dual-Inline Memory Module), DIMM slot, Motherboard
With our initial wrapper card: Wrapper Card, DIMM, DIMM slot, Motherboard

[Source: H. David et al., Memory Power Management via Dynamic Voltage/Frequency Scaling, ICAC, 2011]
I/O Circuitry: runs at bus speed
• Clock sync/distribution
• Bus drivers and receivers
• Buffering/queueing
Banks
• Independent arrays
• Asynchronous: independent of memory bus speed
On-Die Termination
• Required by bus electrical characteristics for reliable operation
• Resistive element that dissipates power when the bus is active
[Diagram labels: Bank 0, Row Decoder, Column Decoder, Sense Amps, Drivers, Receivers, Write FIFO, Registers, ODT]

DRAM power can be approximately divided into
◦ Background power: considered to be stable
◦ Bank power: activate/precharge; related to the frequency of row operations
◦ I/O power: bursts; proportional to bandwidth
◦ Termination power: termination resistors; proportional to bandwidth

P = U * I
Doesn't fluctuate too much: less than 2% on our platform.
[Diagram labels: DC Voltage, DC Current, CSA (Current-Sense Amplifier) or DMM, DC Voltage, ADC, Data Collector (PC)]

Possible reason for the non-proportional random-access power in slide 17:
◦ When bandwidth is low, auto-precharge (caused by refresh) means every access needs an ACTIVE command; bank power is proportional to bandwidth.
◦ When bandwidth is high, some accesses may hit in the row buffer and need fewer ACTIVE commands; bank power rises with a lower slope than before.
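
The relation P = U * I and the stride experiment's figures (time 6.5 times longer, energy 5.9 times higher) can be tied together with a little arithmetic. This is a minimal sketch, not the authors' tooling: the function names are hypothetical, and it assumes evenly spaced samples so that energy is the sum of the power samples divided by the sampling rate.

```python
def power_trace(voltages, currents):
    """Per-sample power in watts: P = U * I."""
    return [u * i for u, i in zip(voltages, currents)]


def energy_joules(powers, sample_rate_hz):
    """Energy of evenly spaced power samples: sum(P) * dt, with dt = 1 / rate."""
    return sum(powers) / sample_rate_hz


def implied_power_ratio(energy_ratio, time_ratio):
    """Since E = P_avg * t, the average-power ratio is E_ratio / t_ratio."""
    return energy_ratio / time_ratio


# Hypothetical samples for illustration: a 1.8 V rail drawing 1 A, then 2 A.
example_power = power_trace([1.8, 1.8], [1.0, 2.0])

# One second of a constant 2 W load sampled at 5,000 samples/s.
example_energy = energy_joules([2.0] * 5000, 5000)

# Ratios reported for the stride experiment: energy 5.9x, time 6.5x.
stride_power_ratio = implied_power_ratio(5.9, 6.5)
```

With the reported ratios, `stride_power_ratio` comes out at roughly 0.91, that is, average power about 9% lower, which agrees with the slide's "Power: slightly lower".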