PRACTICAL DYNAMIC THERMAL MANAGEMENT ON INTEL DESKTOP COMPUTER
Guanglei Liu
Department of Electrical and Computer Engineering
Florida International University
July 12, 2012
Major Professor: Dr. Gang Quan

Thermal Design Challenges
Number of transistors keeps increasing
• Nearly 40 billion transistors are integrated into a single die [Mizunuma, ICCAD 2009]
More complicated architectures are built
• An 80-core single-chip processor has been demonstrated by Intel [Vangal, ISSCC 2007]
[Figure: from Intel Microprocessor Technology Lab, 2011]
High transistor density increases power density
Electric bill
• U.S. data centers: 120 billion kilowatt-hours in 2012
• 9 billion dollars, 15% of all energy in the U.S.
Environmental concerns
• In the U.S., 46% of electricity is generated from fossil fuels. Source: Environmental Protection Agency (EPA) report
High power density raises on-chip temperatures and causes thermal issues

Thermal Issues
Increase package/cooling costs
• 1-3 dollars per watt [Skadron, ISCA 2003]
• In data centers, each watt spent on computing requires ½ to 1 watt for cooling [Brill, 2007]

Computing system cooling solutions
Mechanical cooling solutions
• Air-cooling (e.g.
fan + heat sink)
  • Cooling accounts for 51% of the overall server power budget [Lefurgy, COM 2003]
  • Noise level increases by 10 dB when fan speed increases by 50% [Lyon, STMMS 2004]
  • High cooling cost
• Liquid-cooling
  • High-density liquid absorbs 3500 times more heat than air [Chu, DMR 2004]

Thermal Issues (continued)
Affect reliability
• As much as a 50% reduction in device life span for every 10°C increase [Yeo, DAC 2008]
Degrade performance
• 10-15% more circuit delay for each 15°C increase [Santarini, EDN 2005]
Increase leakage power consumption
• Raising the temperature from 65°C to 110°C can increase leakage power by 38% in integrated circuits [Santarini, EDN 2005]
Crash the computing system
• The processor's self-protection mechanism automatically shuts the processor down to avoid physical damage [Rohou, WFDO 1999]

Dynamic Thermal Management (DTM)
• Dynamic voltage and frequency scaling (DVFS) technique [Kim, HPCA 2008]
  Sacrifices system performance [Gunther, ITJ 2001]
• Task migration [Lim, QED 2002]
• Clock gating
• Fetch toggling [Brooks, HPCA 2001]

Related Theoretical Work
Thermal-aware throughput maximization
• [Chantem et al., ISLPED 2009] [Zhang et al., ICCAD 2007] [Chatha et al., DAC 2010]
Overall energy reduction under peak temperature constraints
• [Bao et al., DATE 2010] [Andrei et al., DAC 2009] [Huang et al., DATE 2011]
Peak temperature minimization
• [Chaturvedi et al., ASPDAC 2011] [Liu et al., RTAS 2010] [Qiu et al., ICESS 2010]
Real-time guarantee under peak temperature constraint
• [Chaturvedi et al., CIT 2010] [Wang et al., RTS 2006] [Huang et al., RTSS 2009]
These theoretical works are derived from simplified mathematical thermal models and idealized assumptions.

Our Research Goal
To develop a practical hardware platform that enables us to investigate the limitations of the existing theoretical work, and to develop practical and effective DTM techniques that accommodate those limitations.

Major contributions
Practical hardware platform
• Intel i5 quad core
• Linux operating system
[SouthEast 2011]
Thermal management validation
• DTM
techniques vs. air-cooling
• DTM vs. DPM algorithm
• Fundamental DTM principles validation
[SUSCOM 2012]
Reactive DTM (single-core)
• Limitations of theoretical works
• Non-constant sampling period
• Thermal profiling analysis
[GreenCom 2012]
Proactive DTM algorithm (multi-core)
• Neighbor-aware temperature prediction
• Algorithm for multicore with task migration
[DATE 2012] [ASP 2012]

Practical Hardware Platform
Platform
• Dell Precision T1500 workstation
• Intel i5 quad core
• Linux kernel version 2.6.23
Workload
• SPEC CPU2000 benchmark: integer and floating-point operations
Task migration
• CPU_affinity module: migrates processes between cores
DVFS technique
• Cpufreq module: 12 different speed levels
Power measurement
• Fluke current clamp, multimeter: cooling/CPU power consumption
Temperature capturing
• CoreTemp driver: reads the on-chip thermal sensor
Fan speed control
• Fancontrol shell script: manually adjust fan speed
Lm-sensors tool
• Computing system hardware monitoring tool
• Monitors system information: temperature value, fan speed, voltage value

Our Approach
Enhanced reactive DTM (ERDTM)
Buffer zone and safe region
• Maximum possible temperature increment: 4°C
• Buffer zone: [formula missing in source]
• Safe region: [formula missing in source]
Offline thermal profiling analysis
• Build a temperature vs.
speed lookup table
• Run benchmarks at different speed levels
• Collect the corresponding peak temperatures
[Figure: temperature vs. time, marking T_THRESHOLD, the buffer zone, T_safe, and the safe region]

Experimental results
Experiment setup
• Four identical tasks assigned to four cores to simulate a single-core environment
• Temperature threshold is 55°C
• Frequency lookup table: constructed offline
• DTM algorithms compared: FSDTM, VS-DTM, ERDTM
Performance evaluation
[Figure: throughput of FSDTM, VS-DTM, and ERDTM on the SPEC CPU2000 benchmarks galgel, ammp, lucas, equake, vpr, gcc, parser, and crafty]
• ERDTM average throughput improvement is 8.1%
Number of violations
• FSDTM algorithm: 87
• VS-DTM algorithm: 12
• ERDTM algorithm: 0

Neighbor-aware temperature prediction
Our neighbor-aware prediction
• [Prediction equation missing in source], where the two weights are obtained by collecting training data
• Individual increment factor: processor temperature increment
• Neighbor increment factor: heat transfer from the neighbor processor
Training process
• Obtained offline
• Run the tasks and record temperature information
• Apply least-squares estimation

Neighbor-aware Task Migration
NADTM Algorithm
Conventional approach: always migrate the task from the hottest core to the coolest core.
Our migration strategy
• Predict thermal emergency
• Choose the migration candidate with the minimum [quantity missing in source]
• Migrate task
• DVFS technique
Heat factor: evaluates the processor hotness
Increasing factor: evaluates the temperature increment

Performance analysis
[Figure: temperature (Celsius) vs. time (seconds, 0 to 200) for NADTM and the OS default, with the threshold marked]
• The NADTM algorithm effectively keeps the temperature under the threshold
• It has a small temperature oscillation of 1°C
• Multiple tasks: an average of 5.8% overall throughput improvement
• Single task: an average of 3.6% overall throughput improvement

Journals
1. Guanglei Liu, M. Fan, G. Quan, M.
Qiu, “On-Line Predictive Thermal Management under Peak Temperature Constraints for Practical Multi-core Platforms”, Journal of Low Power Electronics (ASP) (under review), 2012.
2. Guanglei Liu, G. Quan, M. Qiu, “Practical Dynamic Thermal Management on an Intel Desktop Computer”, Embedded Software Design, Journal of Sustainable Computing (SUSCOM) (under review), 2012.
3. H. Huang, V. Chaturvedi, Guanglei Liu, G. Quan, “Leakage Aware Scheduling On Maximum Temperature Minimization For Periodic Hard Real-Time Systems”, Journal of Low Power Electronics (ASP), 2012.

Peer Reviewed Conferences
1. Guanglei Liu, M. Fan, G. Quan, “Neighbor-Aware Dynamic Thermal Management for Multi-core Platform”, The 15th Design, Automation, and Test in Europe (DATE 2012), Dresden, Germany, March 12-16, 2012.
2. Guanglei Liu, G. Quan, M. Qiu, “The Practical On-line Scheduling for Throughput Maximization on Intel Desktop Platform under the Maximum Temperature Constraint”, The 2011 IEEE/ACM Green Computing and Communications (GreenCom 2011), Sichuan, China, August 4-5, 2011.
3. Guanglei Liu, G. Quan, “Thermal Aware Scheduling on an Intel Desktop Computer”, IEEE SouthEast Conference (SouthEast 2011), Nashville, Tennessee, March 17-20, 2011.
4. Guanglei Liu, J. Fan, “Framework for Statistical Analysis of Homogeneous Multi-core Power Grid Networks”, IEEE 8th International Conference on ASIC (ASICON 2009), Changsha, China, October 20-23, 2009.
5. C. Liu, J. Tan, R. Chen, Guanglei Liu, J. Fan, “Thermal Aware Clocktree Optimization in Nanometer VLSI Systems Considering Temperature Variations”, IEEE 40th Southeastern Symposium on System Theory (SSST 2008), New Orleans, LA, March 17-18, 2008.

Thank You for Your Attention!
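
Backup: Offline Lookup Table Sketch
The offline thermal profiling step (run benchmarks at different speed levels, collect the peak temperatures, build a temperature vs. speed lookup table) can be illustrated with a minimal sketch. This is not the implementation from the talk: the function names, the speed values, and the temperature samples are all hypothetical, and the buffer-zone logic of ERDTM is left out. Only the 55°C threshold comes from the experiment setup.

```python
from typing import Dict, Iterable, Optional, Sequence, Tuple

# Hypothetical profiling data: (speed in GHz, temperature samples in Celsius).
# These numbers are invented for illustration, not measured results.
PROFILING_RUNS = [
    (1.2, [38.0, 41.5, 43.0]),
    (2.0, [42.0, 47.5, 50.0]),
    (2.0, [43.0, 51.0, 49.5]),
    (2.66, [45.0, 54.0, 58.5]),
]


def build_lookup_table(
    runs: Iterable[Tuple[float, Sequence[float]]]
) -> Dict[float, float]:
    """Map each speed level to the highest temperature seen in any of its runs."""
    table: Dict[float, float] = {}
    for speed_ghz, temps_c in runs:
        peak = max(temps_c)
        # Keep the worst case when a speed level was profiled more than once.
        table[speed_ghz] = max(peak, table.get(speed_ghz, peak))
    return table


def fastest_safe_speed(
    table: Dict[float, float], threshold_c: float
) -> Optional[float]:
    """Return the highest speed whose profiled peak stays at or below the threshold."""
    safe = [speed for speed, peak in table.items() if peak <= threshold_c]
    return max(safe) if safe else None


table = build_lookup_table(PROFILING_RUNS)
print(table)
print(fastest_safe_speed(table, threshold_c=55.0))
```

With the invented data above, the 2.66 GHz level peaks above 55°C, so the lookup returns 2.0 GHz. Using the worst-case peak per speed level is a conservative choice; a real table would be built from the SPEC CPU2000 runs on the target machine.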