Towards Optimal Sensor Placement for Hot Server Detection in Data Centers
Xiaodong Wang, Xiaorui Wang, Guoliang Xing, Jinzhu Chen, Cheng-Xian Lin, and Yixin Chen

Outline
    Introduction
    Related work
    Hot server detection problem
    CFD-guided sensor placement
    Evaluation
    Summary

Introduction
    Thermal monitoring is important in data center operation:
        Overheating is harmful to a data center.
            Hardware components malfunction.
            Servers shut down.
        Excessive cooling energy is consumed.
            Cooling systems do not operate efficiently enough.
            Overcooling requires excessive energy.
    To achieve precise hot server detection:
        Precise hot server detection can guide the air conditioning system.
        Thermal dynamics in data centers need to be better studied.
        Place more sensors to increase thermal visibility.

Related Work
    Studies of thermal profiles:
        [Choi et al., HPCA ’07] studied the thermal profile of a rack.
        [Patel et al., IPACK ’01] studied the air temperature specification of a data center under normal conditions.
        Not used to guide sensor placement.
    Improving thermal monitoring with sensor networks:
        [Liang et al., SenSys ’09] deployed sensor networks in a data center to achieve high-fidelity thermal monitoring.
        [Moore et al., USENIX ’05] and [Bash et al., USENIX ’07] proposed allocating server jobs and workload based on thermal readings from sensor networks.
    How to effectively place sensors?

Hot Server Detection Problem
    Problem to solve:
        Intelligently place sensors to maximize the hot server detection probability.
    Problem formulation:
        Given M locations to monitor and N (N < M) sensors to use:
        $$\max \; \frac{1}{M} \sum_{i=1}^{M} P_{Di}$$
        subject to the constraint:
        $$P_{Fi} \le \alpha, \quad 1 \le i \le M$$
        P_{Di}: detection probability of overheating at monitored location i
        P_{Fi}: false alarm rate of overheating at monitored location i
        \alpha: upper bound on the false alarm rate (symbol assumed)

Problem Solving Architecture
    Overheating data center analysis.
        Analyze the data center in overheating conditions.
        Obtain the temperature distribution for overheating cases.
    Find the sensor placement solution.
        Sensor readings are usually corrupted by noise.
        Sensors need to collaboratively make the hot server detection decision (data fusion).
    [Diagram: Overheating Analysis (CFD Modeling, Temperature Interpolation); Sensor Placement (Data fusion & placement algorithm)]

Overheating Data Center Analysis
    Computational Fluid Dynamics (CFD) model for an overheating data center:
        A finite volume method.
    [Diagram labels: Datacenter physical model, Power consumption, CRAC settings, ……, CFD Modeling, temperature distribution, Temperature Interpolation]
    [Example figure: server room layout with A/C In and A/C Out locations]

Overheating Data Center Analysis (cont’d)
    Spatial temperature interpolation:
        Results from CFD are discrete in location.
        The granularity of CFD modeling is a tradeoff between accuracy and computational complexity.
        Inverse Distance Weighting (IDW) interpolation: a weighted average of the available temperature data.
    Optimize sensor placement based on the overheating analysis:
        To maximize the average overheating server detection probability.

CFD-Guided Sensor Placement
    Sensor placement with an existing solver:
        Decide the x, y, and z variables of each sensor location.
        Constrained Simulated Annealing (CSA):
            An existing solver with 3N variables.
            Computational time increases exponentially.
    Lightweight Sensor Placement (LSP):
        Searches for placement solutions only in areas with clustered racks.
        A greedy algorithm that adds sensors one by one.
        Search space and computational time are significantly reduced.

Simulation Setup
    Experiment environment setup:
        CFD software packages: Gambit and Fluent.
        Server room size: 32 m x 7 m x 3 m.
        13 racks in the server room.
        4 monitored locations per rack (52 locations in total).
        14,400 watts of power consumption for each overheating rack.
        CRAC settings are collected by an external sensor.

Simulation Results
    Different sensor numbers:
        Baselines:
            Uniformly random, current practice.
            CFD+proportional.
        Using more sensors increases the detection probability.
        CFD+LSP (our solution) is closest to the optimal solution.

Simulation Results (cont’d)
    Different temperature thresholds:
        Detection probability decreases when the temperature threshold increases.
    Different fusion ranges:
        A proper fusion range can increase the detection probability.

Hardware Experiment in a Server Room
    Setup:
        A small cluster of two racks is used.
        Overheating is created by a heater.
    Results:

Summary
    We place sensors intelligently in data centers:
        To reach the maximum hot server detection probability.
    Various overheating conditions are studied to guide sensor placement:
        CFD is used to analyze data centers under overheating conditions.
    Future considerations:
        Integrate with thermal control approaches.
        More detailed CFD modeling.

Q&A
    Thank You!
    Acknowledgement
        NSF CAREER Award CNS-0845390
        NSF under Grants CNS-0720663, CNS-0915959, CCF-1017336, and CNS-0954039
        Microsoft Research under a Power-Aware Computing Award

Appendix A
    Sensor readings are usually corrupted by noise:
    $$T_m(x_i, y_i, z_i) = T_r(x_i, y_i, z_i) + N_i^2$$
    An overheating scenario is detected when the measured temperature is larger than the threshold:
    $$P_D = P\left( \frac{1}{n} \sum_{i=1}^{n} \left( T_r(x_i, y_i, z_i) + N_i^2 \right) > C \right)$$
    A false alarm happens when the overheating detection is triggered by noise only:
    $$P_F = P\left( \frac{1}{n} \sum_{i=1}^{n} N_i^2 > C \right)$$

Appendix B
    Rack clustering:
        The smallest distance between two monitored locations in two different clusters is larger than 2R.
    Inverse Distance Weighting (IDW) interpolation:
    $$T(l_0) = \sum_{i=1}^{n} \lambda_i \, T(l_i), \qquad \lambda_i = \frac{1 / d_i^{p}}{\sum_{j=1}^{n} 1 / d_j^{p}}$$
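
The IDW interpolation in Appendix B can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `idw_interpolate`, the sample CFD points, and the default exponent p = 2 are all hypothetical choices made for the example.

```python
import math


def idw_interpolate(known_points, query, power=2.0):
    """Estimate the temperature at `query` by Inverse Distance Weighting.

    known_points: list of ((x, y, z), temperature) pairs, e.g. CFD grid output.
    query:        (x, y, z) location where the temperature is wanted.
    power:        the exponent p applied to the distance (assumed default: 2).
    """
    if not known_points:
        raise ValueError("at least one known point is required")

    weights = []
    temperatures = []
    for location, temperature in known_points:
        distance = math.dist(location, query)
        if distance == 0.0:
            # The query coincides with a known point, so use its value directly
            # and avoid dividing by zero.
            return temperature
        weights.append(1.0 / distance ** power)
        temperatures.append(temperature)

    # Normalized weighted average: sum(w_i * T_i) / sum(w_i).
    total_weight = sum(weights)
    return sum(w * t for w, t in zip(weights, temperatures)) / total_weight


# Hypothetical CFD samples: (x, y, z) in meters, temperature in degrees Celsius.
cfd_samples = [
    ((0.0, 0.0, 1.0), 30.0),
    ((2.0, 0.0, 1.0), 40.0),
]

# A query halfway between the two samples gets equal weights from both.
print(idw_interpolate(cfd_samples, (1.0, 0.0, 1.0)))
```

A query point closer to one sample is pulled toward that sample's temperature, and a larger `power` makes nearby samples dominate more strongly.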