Australian Government Data Centre Strategy 2010-2025
Better Practice Guide: Data Centre Cooling
November 2013

Contents
1. Introduction
   Purpose
   Scope
   Policy Framework
   Related documents
2. Discussion
   Overview
   Fundamentals
   Common Concepts and Definitions
   Common Cooling Problems
   Improving Cooling Efficiency
   Potential Work Health Safety Issues
   Capacity Planning for Cooling Systems
   Operations Effectiveness and Continuous Improvement
   Optimising Cooling Systems
   Maintenance
   Sustainability
   Trends
   Conclusion
3. Better Practices
   Operations
   Planning
4. Conclusion
   Summary of Better Practices

1. Introduction

The purpose of this guide is to advise Australian Government agencies on ways to improve operations relating to data centre cooling. Many government functions are critically dependent upon information and communication technology (ICT) systems based in data centres. The principal purpose of the data centre's cooling systems is to provide conditioned air to the ICT equipment with the optimum mix of temperature, humidity and pressure. In larger, purpose-built data centres, the cooling system also cools the support equipment and the general office area.

Cooling is typically a lower priority for management attention. The ICT equipment and power are given more attention because when they fail, the impacts are more immediate and disruptive. However, extended cooling system failures can be more damaging than power failures. Further, inefficient or ineffective cooling is a major cause of waste in data centre operations, and contributes to increased hardware failures. Historically, making cooling systems efficient has reduced data centre operating costs by up to 50 per cent. Ineffective cooling results in some ICT equipment running at higher temperatures and so failing sooner. Sound operations practices are essential to efficient and effective data centre cooling. This guide on cooling forms part of a set of better practice guides for data centres.

Purpose

The cooling system is a major cost driver for data centres, and interacts with many other data centre systems. The intent of this guide is to assist managers to assess how well a cooling system meets their agency's needs, and to reduce the capital and operating costs relating to data centre cooling.

Scope

This guide addresses the operations and processes required to achieve better results from data centre cooling technology. It does not consider data centre cooling technology in detail, as information at this level is widely available, subject to rapid change, contentious and specialised. Industry will be able to supply agencies with advice and cooling technology to meet their specific needs. The discussion is restricted to cooling technology and operations that are relevant to data centres used by Australian Government agencies.

Policy Framework

The guide has been developed within the context of the Australian public sector's data centre policy framework. This framework applies to agencies subject to the Financial Management and Accountability Act 1997 (FMA Act). The data centre policy framework seeks financial, technical and environmental outcomes. The Australian Government Data Centre Strategy 2010-2025 (data centre strategy) describes actions that will avoid $1 billion in future data centre costs.
The data centre facilities panel, established under the coordinated procurement policy, provides agencies with leased data centre facilities. The Australian Government ICT Sustainability Plan 2010-2015 describes actions that agencies are to take to improve environmental outcomes. The ICT sustainability plan sets goals for power consumption in data centres, which is the key factor driving the need for cooling.

The National Construction Code was created in 2011 by combining the Building Code of Australia and the Plumbing Code of Australia. The National Construction Code controls building design in Australia, and may be further modified by state government and council regulations.

Technical Committee 9.9 (TC9.9) of the American Society of Heating, Refrigerating and Air-Conditioning Engineers (ASHRAE) publishes guidance on data centre cooling.[1]

[1] The reports of Technical Committee 9.9 are widely referenced. See http://tc99.ashraetcs.org/documents/ASHRAE%20Networking%20Thermal%20Guidelines.pdf and http://tc99.ashraetcs.org/documents/ASHRAE%202012%20IT%20Equipment%20Thermal%20Management%20and%20Controls_V1.0.pdf

The Australian Refrigeration Council (ARC) is the organisation appointed by the Australian Government to accredit individuals to handle refrigerants and related chemicals. The Australian Institute of Refrigeration, Air conditioning and Heating (AIRAH) is a specialist membership association for air conditioning, refrigeration, heating and ventilation professionals. AIRAH provides continuing professional development, accreditation programs and technical publications.

Related documents

Information about the data centre strategy, the Data Centre Optimisation Target (DCOT) and related guidance can be obtained from the Data Centre section (datacentres@finance.gov.au). The data centre better practice guides also cover:
- Power: the data centre infrastructure supplying power safely, reliably and efficiently to the ICT equipment and the supporting systems.
- Structure: the physical building design, which provides for movement of people and equipment through the site, floor loading capacity, and reticulation of cable, air and water. The design also complements the fire protection and security better practices.
- Data Centre Infrastructure Management: the system that monitors and reports the state of the data centre. Also known as the building management system.
- Fire protection: the detection and suppression systems that minimise the effect of fire on people and the equipment in the data centre.
- Security: the physical security arrangements for the data centre. This includes access controls, surveillance and logging throughout the building, as well as perimeter protection.
- Equipment racks: this guide brings together aspects of power, cooling, cabling, monitoring, fire protection, security and structural design to achieve optimum performance for the ICT equipment.
- Environment: this guide examines data centre sustainability, including packaging, electronic waste, water use and reducing greenhouse gas generation.

2. Discussion

Overview

This section discusses key concepts and practices relating to data centre cooling. The focus is on operations, as this is essential for efficient, reliable and effective performance of the information and communication technology (ICT) in the data centre. If more background is needed on cooling systems design, there is a vast amount of publicly available information, from the general to the very detailed.
AIRAH[2] provides material relating to Australia's general refrigeration industry. The Green Grid[3] and ASHRAE TC9.9 are industry bodies that provide vendor-independent information relating to data centre cooling.

[2] www.airah.org.au
[3] www.thegreengrid.org

The refrigeration industry has been operating for over 150 years, and the air conditioning industry for over 100 years. As data centres are comparatively recent, designers first adapted other air conditioning systems to suit data centre needs. In the last 15 years, purpose-built equipment has been developed. The result is diversity and innovation: there is a range of designs, a choice of equipment makes and models, and constant innovation in cooling designs and technology.

Cooling a data centre is a continuing challenge. Data centres have dynamic thermal environments, which require regular monitoring and adjustment due to interactions between the ICT, power and cooling systems. External factors, such as the weather, time of day and seasons, also contribute to the challenge. The better practice is to ensure that the overhead costs due to cooling are minimised over the data centre's life.

Fundamentals

The basic physics of a data centre is that electricity is converted to heat and noise. The cooling system must transfer enough heat quickly enough from the ICT equipment to prevent equipment failures.

There are two types of air conditioning systems: comfort and precision. Comfort systems are designed for human use while precision systems are designed for ICT equipment. Comfort systems have a lower capital cost, but a higher operating cost, as they are less efficient than precision systems in cooling data centres.

Units

The widespread use of archaic units of measurement is due to the longevity of the refrigeration industry. As the core issue the cooling system is intended to handle is energy, this guide uses the SI unit, the watt. This allows comparison between the power consumed in the data centre and the cooling required. Agencies are encouraged to use the watt as the basic unit.

Heating, Cooling and Humidity

In a simplified model of data centre cooling there are three major elements.

Figure 1: A Simplified Model for Data Centre Cooling (supply, which creates cooling and rejects heat; transport, which delivers cool air and carries waste heat away; and demand, the ICT and other sources where supplied power becomes heat)

Supply: this creates the cooling. Common technologies are refrigeration, chillers, and use of the ambient temperature. Cooled air is very commonly used, due to overall cost and current ICT designs. Liquids are much more effective (10 to 80 times). Typically, smaller cooling systems use air only (with a sealed refrigerant) while larger cooling systems use a combination of air and liquid.

Transport: ensures that enough cooling is delivered soon enough, and enough heat is removed quickly enough, to maintain the ICT equipment at the correct temperature. Typically, air is used to deliver the cooling and carry away the heat at the equipment rack, and fans are used to move the air.

Demand: the sources that create the heat, mostly the ICT equipment. Other sources include the people who work in the data centre, the external ambient heat that transfers into the data centre, and the other support systems. The cooling system itself generates heat. The uninterruptible power supply (UPS) also generates heat in standby and in operation, in particular from its batteries.
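Because nearly all of the electrical power supplied becomes heat, supply and demand can be compared directly when both are expressed in watts, as the Units discussion above recommends. The following minimal sketch (with hypothetical figures) converts cooling capacities quoted in legacy units, such as BTU per hour or tons of refrigeration, into kilowatts for that comparison; the conversion factors are standard, and everything else is illustrative.

```python
# Minimal sketch: expressing cooling capacity and heat load in the same SI unit (watts).
# The conversion factors are standard; the example loads are hypothetical.

WATTS_PER_BTU_PER_HOUR = 0.29307        # 1 BTU/h is approximately 0.293 W
WATTS_PER_TON_REFRIGERATION = 3516.85   # 1 ton of refrigeration = 12,000 BTU/h

def btu_per_hour_to_kw(btu_h: float) -> float:
    """Convert a cooling rating quoted in BTU/h to kilowatts."""
    return btu_h * WATTS_PER_BTU_PER_HOUR / 1000.0

def tons_to_kw(tons: float) -> float:
    """Convert a cooling rating quoted in tons of refrigeration to kilowatts."""
    return tons * WATTS_PER_TON_REFRIGERATION / 1000.0

if __name__ == "__main__":
    # Hypothetical example: a cooling unit rated at 30 tons versus a measured ICT load of 95 kW.
    capacity_kw = tons_to_kw(30)     # roughly 105.5 kW of cooling
    ict_load_kw = 95.0               # measured at the switchboard; almost all of it becomes heat
    print(f"Cooling capacity: {capacity_kw:.1f} kW")
    print(f"ICT heat load:    {ict_load_kw:.1f} kW")
    print(f"Headroom:         {capacity_kw - ict_load_kw:.1f} kW")
```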
The cooling system is also required to adjust the humidity of the air and to remove particles. Depending upon the climate at a data centre's location, moisture may need to be added or removed. Similarly, the types and amount of particles to be removed from the air are determined by the location and external events.

ICT Equipment Heat Generation

Each item of ICT hardware in the data centre needs an adequate supply of cooling air, and needs the heated air removed. Different types of equipment have different needs. Server and network ICT equipment depend on high performance computer chips. Most computer chips are able to generate enough heat to damage themselves and nearby components. Chips are designed to operate reliably at between 55°C and 70°C, with fans blowing cooling air over the chips to remove heat. Manufacturers specify the range of inlet air temperature needed for the equipment to operate reliably.

Other ICT devices, such as disk storage and tape drives, consume much less electricity and generate much less heat. Although these types of devices also have chips, they are designed to operate at lower temperatures.

Common Concepts and Definitions

This section illustrates commonly used terms and concepts in data centres operated by APS agencies. Note that this is not a comprehensive or definitive list; there are many other ways to implement cooling systems in data centres.

Figure 2 shows one type of data centre cooling system. A chiller receives water warmed in the data centre, chills the water and pumps it back to the data hall. The Computer Room Air Handler (CRAH) uses the chilled water to create cool air. The cool air is pushed into the under-floor plenum to the ICT racks. The ICT equipment is cooled by the air. The warm air is drawn back to the CRAH, to be cooled again.

Figure 2: Data centre cooling using a CRAH and chilled water (chiller, chilled water loop, CRAH, under-floor plenum and ICT rack)

Figure 3 shows another common type of cooling system. The Computer Room Air Conditioner (CRAC) creates cool air and pushes this into the under-floor plenum to the ICT racks. The ICT equipment is cooled by the air. The warm air is drawn back to the CRAC, to be cooled again.

Figure 3: Cooling system using a CRAC (exchanger, economizer, CRAC, under-floor plenum and ICT rack)

CRACs may use refrigerants (as shown in Figure 3) or chilled water (similar to Figure 2) to cool the air. The refrigerant is pumped to the exchanger to remove the excess heat, before being returned to the CRAC. A common technology for the exchanger is the cooling tower, which uses water vapour to cool and humidify external air as it is drawn in, while warm air is expelled to the outside air.

Free air cooling is a technique of bringing air that has a suitably low temperature into the data centre, either directly to the ICT rack or indirectly to chill the water. This is also known as an 'economizer'.

Figure 4 shows a direct expansion (DX) cooling system. The DX unit operates in the same way as a CRAC, and uses refrigerants for cooling. DX units may also have free air cooling.

Figure 4: DX cooling system (exchanger, DX unit, under-floor plenum and ICT rack)

Dedicated chilled water systems generate more cost-effective cooling for larger data centres. Smaller data centres will typically use refrigeration or share chilled water with the building air conditioning system.
The cooling system must also manage humidity, the amount of water vapour in the air. The preferred range is 20% to 80% relative humidity. Below 20%, there is a risk of static electricity causing ICT equipment failure. Above 80%, there is a risk of condensation.[4]

[4] Large data centres can have issues: http://www.theregister.co.uk/2013/06/08/facebook_cloud_versus_cloud/

A blanking panel is a solid plate put in a rack to prevent cool and warm air from mixing. Blanking panels are placed as required, and provide a 1 to 5 per cent improvement in efficiency.

The inlet air temperature is the temperature measured at the point where air enters a piece of ICT equipment. Knowing the inlet air temperature across all equipment is important when maximising the data centre's efficiency.

The examples in this section show an under-floor plenum, to reflect the design most commonly used in APS data centres. Data halls first began using raised floors (under-floor plenums) in 1965. Many recently built data centres use solid concrete slabs rather than raised floors. There are merits in both approaches.

Common Cooling Problems

The figure below illustrates many common problems in data centre cooling. The cooling air rises from the under-floor plenum for racks 1 and 2. However, rack 3 is drawing the warmed air from rack 2 directly into its equipment. This increases the rate of equipment failure in rack 3. Before entering rack 2, the cooling air mixes with warmed air from rack 1, so rack 2 is not cooled as effectively. In both racks 1 and 2, the warm air is moving back from right to left inside the racks. Blanking would stop this air mixing. As well as the increased rate of equipment failure, the energy used to cool the air has been wasted by allowing warm air leaving a rack to mix with the cool air before it enters a rack.

Figure 5: Common cooling problems (cold, warm and hot air flows around racks 1 to 3)

Another common problem occurs when two or more CRACs have misaligned target states. For example, if one CRAC is set for a relative humidity of 60% and another CRAC is set to 40%, then both 'battle' to reach their preferred humidity level. As both CRACs will operate for far longer than necessary, electricity and water will be wasted.

Older CRACs use belts to drive the fans. The belts wear and shed dust throughout the data hall. This dust passes into the ICT equipment's cooling systems, reducing efficiency slightly. It is cost effective to replace belt drives, as newer drive designs have lower operating costs due to lower power use and lower maintenance requirements.

Noise is acoustic energy, and can be used as a proxy measure of data centre efficiency. Higher levels of noise mean that more energy is being lost from various data centre components and converted to acoustic energy. Two key sources are fans and air-flow blockages. Fans are essential to air movement. However, lowering the inlet air temperature means the fans need to operate less frequently, or at lower power.

Another common issue is blockages in transporting the cool or hot air. Blockages mean that more pressure is needed to transport the air. Pressure is generated by the fans (for example, in the CRACs or CRAHs), which must use more power to create the greater pressure. Bundles of cables in the under-floor plenum, or cables forming a curtain in the equipment racks, are very common faults that are easily remedied. Ductwork with several left and right turns is another common issue.
Smooth curves minimise turbulence and allow for more efficient air movement.

Improving Cooling Efficiency

There are several simple, inexpensive techniques that have consistently improved cooling system performance by 10 to 40 per cent.

Hot Aisle / Cold Aisle

Figure 6: Hot aisle configuration (cold and hot air flows)

Hot aisle / cold aisle means aligning the ICT equipment in the racks so that all the cold air is drawn in from one side of the rack and expelled from the other side. The racks are then aligned in rows so that the hot air from two rows blows into a shared aisle, while in the alternate aisle cold air is drawn into the racks on either side. Changing from randomly arranged equipment to this arrangement reduces cooling costs by 15% to 25%.

Hot / Cold Aisle Containment

Enclosing either the hot or the cold aisles gives greater efficiencies by further preventing hot and cold air from mixing. Cooling costs are reduced by another 10% over hot / cold aisle alignment.[5]

[5] Moving from random placement to hot / cold aisle containment means 25% to 35% reductions in cooling costs.

Figure 7: Cold or hot aisle containment

Hot or cold aisle containment is nearly always cost effective in data centres that have not implemented hot / cold aisle alignment, and purpose-built containment solutions are available. Containment can be retrofitted into existing data halls, using an inexpensive material such as plywood, MDF or heavy plastic. However, due care is necessary, for example to ensure that the fire suppression system will still operate as designed.

Raising the Temperature

Since 2004, ASHRAE TC 9.9 has published guidance on the appropriate temperature of a data centre. Many organisations have reported major reductions in cooling costs by operating the data centre at temperatures in the mid-twenties. The most recent advice was released in October 2011, and subsequently incorporated into a book in 2012. The following summarises the recommended and allowable conditions for the different classes of ICT equipment (A1 to A4, B and C).

Recommended (applies to all A classes; individual data centres can choose to expand this range based upon the analysis described in ASHRAE documents): dry-bulb temperature 18 to 27°C; humidity from a 5.5°C dew point to 60% relative humidity and a 15°C dew point.

Allowable conditions for product operations:
- Class A1: 15 to 32°C dry bulb; 20% to 80% RH; maximum dew point 17°C; maximum elevation 3050 m; maximum rate of change 5/20°C per hour.
- Class A2: 10 to 35°C; 20% to 80% RH; maximum dew point 21°C; 3050 m; 5/20°C per hour.
- Class A3: 5 to 40°C; -12°C dew point and 8% RH to 85% RH; maximum dew point 24°C; 3050 m; 5/20°C per hour.
- Class A4: 5 to 45°C; -12°C dew point and 8% RH to 90% RH; maximum dew point 24°C; 3050 m; 5/20°C per hour.
- Class B: 5 to 35°C; 8% RH to 80% RH; maximum dew point 28°C; 3050 m; rate of change not specified.
- Class C: 5 to 40°C; 8% RH to 80% RH; maximum dew point 28°C; 3050 m; rate of change not specified.

Allowable conditions with product power off, classes A1 to A4: 5 to 45°C; 8% to 80% RH; maximum dew point 27°C. Classes B and C: 5 to 45°C; 8% to 80% RH; maximum dew point 29°C.
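As an illustration of how the recommended envelope above can be applied in practice, the following minimal sketch (with hypothetical sensor readings) checks a measured inlet temperature and relative humidity against the recommended range, using the Magnus approximation for dew point. The function names and example readings are illustrative only.

```python
import math

def dew_point_c(dry_bulb_c: float, rel_humidity_pct: float) -> float:
    """Approximate dew point (Magnus formula) from dry-bulb temperature and relative humidity."""
    b, c = 17.62, 243.12
    gamma = math.log(rel_humidity_pct / 100.0) + (b * dry_bulb_c) / (c + dry_bulb_c)
    return (c * gamma) / (b - gamma)

def within_ashrae_recommended(inlet_c: float, rel_humidity_pct: float) -> bool:
    """Check a measured inlet condition against the ASHRAE recommended range
    (18 to 27 C dry bulb; 5.5 C dew point to 60% RH and 15 C dew point)."""
    dp = dew_point_c(inlet_c, rel_humidity_pct)
    return (18.0 <= inlet_c <= 27.0) and (5.5 <= dp <= 15.0) and (rel_humidity_pct <= 60.0)

if __name__ == "__main__":
    # Hypothetical readings from a rack inlet sensor.
    for temp, rh in [(24.0, 45.0), (29.0, 40.0), (22.0, 75.0)]:
        status = "recommended" if within_ashrae_recommended(temp, rh) else "outside recommended"
        print(f"{temp:.1f} C at {rh:.0f}% RH -> {status} (dew point {dew_point_c(temp, rh):.1f} C)")
```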
ASHRAE offers two cautions, around noise and operating life. It is not enough to raise the temperature of the air entering the data hall. The facilities and ICT staff must know the temperature of the air entering the ICT equipment, and how the equipment will respond. Beyond a certain temperature, any savings made in the cooling system will be lost as fans in the ICT equipment work harder and longer. ASHRAE also advises that raising the temperature does reduce operating life. However, for many types of ICT equipment the operating life is significantly longer than the economic life. Agencies should monitor the life of their ICT assets, including failure rates. Agencies should also discuss their plans with the ICT equipment manufacturer.

Potential Work Health Safety Issues

The cooling system can pose health risks to staff and members of the public. With planning, these risks can be treated and managed. The common risks are noise, bacteria and heat.

Heat is a possible risk once the data centre design includes zones in which the temperature is intended to reach over 35°C. Hot aisle containment systems can routinely have temperatures over 40°C. Staff must follow procedures to monitor the temperature of their environment, including the duration of their exposure. Hydration, among other mitigation steps, will be required.

A typical data centre is noisy, and this is a potential risk that must be managed by data centre management, staff and visitors.[6] Australian work health and safety regulations identify two types of harmful noise.[7] Harmful noise can cause gradual hearing loss over a period of time, or be so loud that it causes immediate hearing loss. Hearing loss is permanent. The exposure standard for noise is defined as an LAeq,8h of 85 dB(A) or an LC,peak of 140 dB(C).

LAeq,8h means the eight hour equivalent continuous A-weighted sound pressure level in decibels, referenced to 20 micropascals, determined in accordance with AS/NZS 1269.1. This is related to the total amount of noise energy a person is exposed to in the course of their working day. An unacceptable risk of hearing loss occurs at LAeq,8h values above 85 dB(A).

LC,peak means the C-weighted peak sound pressure level in decibels, referenced to 20 micropascals, determined in accordance with AS/NZS 1269.1. It usually relates to loud, sudden noises such as a gunshot or hammering. LC,peak values above 140 dB(C) can cause immediate damage to hearing.

Guidance and professional services are available to manage risks due to noise.[8]

[6] Safe Work Australia, "Managing Noise and Preventing Hearing Loss in the Workplace: Code of Practice", http://www.safeworkaustralia.gov.au/sites/SWA/about/Publications/Documents/627/Managing_Noise_and_Preventing_Hearing_Loss_at_Work.pdf
[7] SafeWork SA, "Noise in the Workplace", http://www.safework.sa.gov.au/uploaded_files/Noise.pdf
[8] Australian Hearing, "Protecting Your Hearing", http://www.hearing.com.au/digitalAssets/9611_1256528082241_NFR2157-October-09.pdf
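As a simplified illustration of the exposure standard described above (not a substitute for measurement in accordance with AS/NZS 1269.1), the sketch below combines several hypothetical exposure periods into an eight-hour equivalent continuous level and compares it with the 85 dB(A) exposure standard.

```python
import math

def laeq_8h(exposures):
    """Combine (level_dBA, hours) pairs into an eight-hour equivalent continuous level:
    LAeq,8h = 10 * log10( (1/8) * sum(t_i * 10^(L_i / 10)) )."""
    energy = sum(hours * 10 ** (level / 10.0) for level, hours in exposures)
    return 10.0 * math.log10(energy / 8.0)

if __name__ == "__main__":
    # Hypothetical working day for a technician: 2 h at 92 dB(A) in a hot aisle,
    # 1 h at 88 dB(A) near the CRACs, 5 h at 70 dB(A) elsewhere.
    day = [(92.0, 2.0), (88.0, 1.0), (70.0, 5.0)]
    level = laeq_8h(day)
    print(f"LAeq,8h = {level:.1f} dB(A)")
    print("Exceeds the 85 dB(A) exposure standard" if level > 85.0
          else "Within the 85 dB(A) exposure standard")
```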
Capacity Planning for Cooling Systems

Planning for cooling capacity requires taking several perspectives of the data centre. The data centre can be broken down into a set of volumes of space. The first volume is the ICT equipment in the rack. The second is groups of racks, or pods (if used). The third is the data hall, and the last is the whole data centre. For each volume, the key questions are:
- How much heat is being generated?
- How is the heat distributed in that volume of space?
- How much cooling can be delivered to that volume, and at what rate?

Figure 8 shows a simplified representation of the data centre power systems. These are the components that create the heat that requires cooling. Each component subsystem has its own cooling needs and characteristics.

Figure 8: Conceptual view of data centre power systems (local distribution lines to the building, main switchboard, backup generator, UPS, distribution boards, equipment racks and ICT equipment, office area, fire protection system, HVAC system, security system and Data Centre Infrastructure Management)

A simple approach to determine the total demand (illustrated in the sketch at the end of this section) is:
- Measure the total data centre power used at the main switchboard. Most of this power will be converted to heat.
- If the UPS uses chemical batteries, then these create heat throughout their life, even if the UPS is not supplying power. This heat must be added to the demand.
- The backup generator will require cooling when operating. This must be included in the total demand. If a substantial amount of the ICT equipment is turned off when operating on backup power, then the reduced heat load in that mode should be taken into account.
- If the office area uses the data centre cooling, then this must be included. People generate heat, roughly 100 watts each.
- Add a margin for peak demand, such as hot weather.
- Add headroom for growth in demand. To control capital expenditure, the headroom should be decided more on the length of time needed for a capacity upgrade, and not the life of the data centre.

Using the data centre power as measured at the main switchboard is preferable to adding up all the name plate power ratings of the ICT equipment. The ICT equipment name plate describes the maximum amount of cooling required. As the ICT equipment is usually operating at less than maximum power, using the sum of all the name plate ratings of the equipment will lead to over-provisioning of the cooling system. Headroom can then be created, based on expected growth.

The total cooling demand can then be allocated to the various halls and rooms. The number of CRACs (or other types of technology) to supply this cooling should be enough so that at least one CRAC can be shut down for maintenance while the full cooling capacity is delivered to the data hall. (This is known as N+1 redundancy.)

The equipment racks will house different classes of ICT equipment, and so the cooling needs will vary between racks. Some racks may use 20 kW of power, and so need 20 kW of cooling, while other racks use 1 kW. It is necessary to consider the upper and lower cooling needs to ensure the cooling is distributed appropriately.
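The sketch below is a minimal worked example of the simple approach listed above, using hypothetical figures. The margin and headroom percentages, and the per-person heat figure of roughly 100 watts, are assumptions for illustration; agencies should substitute their own measurements.

```python
def cooling_demand_kw(
    switchboard_kw: float,          # total data centre power measured at the main switchboard
    ups_battery_heat_kw: float,     # heat from UPS batteries, present even on standby
    people: int,                    # staff typically present in the cooled space
    ambient_gain_kw: float,         # heat transferred in from outside (walls, roof, doors)
    peak_margin: float = 0.10,      # margin for peak demand such as hot weather (assumed)
    growth_headroom: float = 0.20,  # headroom sized to the lead time for a capacity upgrade (assumed)
) -> float:
    """Rough estimate of total cooling demand in kW. A sketch only; measured loads are
    preferred over name plate ratings, as discussed above."""
    base = (
        switchboard_kw              # almost all electrical power becomes heat
        + ups_battery_heat_kw
        + people * 0.1              # roughly 100 W of heat per person (assumed)
        + ambient_gain_kw
    )
    return base * (1 + peak_margin) * (1 + growth_headroom)

if __name__ == "__main__":
    # Hypothetical small data hall.
    demand = cooling_demand_kw(switchboard_kw=180, ups_battery_heat_kw=5,
                               people=4, ambient_gain_kw=8)
    print(f"Estimated cooling demand: {demand:.0f} kW")   # roughly 255 kW in this example
```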
Operations Effectiveness and Continuous Improvement

A program to consistently improve operational effectiveness requires measurement and standard reporting. The level of investment is influenced by the data centre power bill and the state of the data centre. Data centres that follow no better practices may reduce their power bill by over 50 per cent. Measurement and reporting will achieve these savings sooner.

Measurement

There are many options and possible measurement points in a data centre. Many types of data centre equipment now include thermal sensors and reporting. The precision and accuracy of these sensors is variable and should be checked.

Agencies may choose to sample at various points in the data centre, and extrapolate for similar points. This approach reduces costs, at the expense of accuracy. Sampling may be unsuitable in data centres with a high rate of change, or when trialling higher data hall temperatures. The following points should be considered for reporting:
- Inlet air temperature for each device.
- Exit point from the cooling unit.
- Base of the rack.
- Top of the rack.
- Return to the cooling unit.
- External ambient temperature.
- Chilled water loop.

For liquid cooling, this needs to be extended to the transfer points from the liquid to the ICT equipment and back again.

A FLIR (thermal imaging) camera can be useful in identifying air temperature and movement. The camera captures small temperature gradients. This information can be used to find hot spots, as the basis for efficiency improvements and for removing causes of faults.

Data centres over medium size, or ones supplying critical services, should use an automated data collection and reporting product. Readings should be recorded at least every 15 minutes, and ideally every 5 minutes. There must be two thermal alarms: a warning for temperatures approaching the point at which equipment may fail, and a second for when temperatures exceed equipment operating thresholds.

NABERS and PUE

Agencies that are planning to control data centre costs should use a consistent metric. The APS Data Centre Optimisation Target policy specifies the use of the Power Usage Effectiveness (PUE) metric, and sets a target range of 1.7 to 1.9. The National Australian Built Environment Rating System (NABERS) energy for data centres metric was launched in February 2013. NABERS should, over time, replace PUE for APS data centres. A rating of 3 to 3.5 stars is equivalent to DCOT's PUE target.

A key difference between NABERS and PUE is the ability to compare different data centres. NABERS is explicitly designed for the purpose of comparing different data centres. PUE is intended to be a metric for improving the efficiency of an individual data centre, not for comparing data centres.

Optimising Cooling Systems

The Plan, Do, Check, Act approach (Deming Cycle) is suitable for optimising cooling systems. The measurement and reporting systems can establish the baseline and report on the effect of the changes (a simple PUE comparison is sketched after the list below). Typical changes include:
- Reduce the heat to be removed by reducing the electricity used by the ICT equipment. There are many actions that can be taken, including virtualisation, consolidation, using modern equipment, and turning off idle hardware.
- Reduce the amount of energy needed to cool the air, by using the external environment. Free air cooling can be retrofitted to most existing data centres. Common techniques are to draw in external air, and to pipe the chilled water through the external environment. Rarer examples include using rivers, seas and underground pipes.
- Reduce the energy used to move the air to and from the ICT equipment. One approach is to prevent cold and hot air from mixing, using blanking panels, containment and hot / cold aisle alignment. Another approach is to remove barriers, allowing the air to move more freely. A third approach is to use the fans less. Possible actions include using variable speed fans (replacing fixed speed fans), using a cycle of pushing very cold air into the data hall then turning the fans off and letting the hall warm up, and using larger, more efficient fans.
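As a minimal illustration of establishing a baseline and checking the effect of a change, the sketch below compares PUE before and after a hypothetical containment trial; the meter readings are invented for the example.

```python
def pue(total_facility_kwh: float, ict_kwh: float) -> float:
    """Power Usage Effectiveness: total facility energy divided by ICT equipment energy."""
    return total_facility_kwh / ict_kwh

if __name__ == "__main__":
    # Hypothetical monthly meter readings before and after a containment trial.
    ict_kwh = 120_000
    baseline_total_kwh = 210_000
    trial_total_kwh = 196_000

    print(f"Baseline PUE: {pue(baseline_total_kwh, ict_kwh):.2f}")   # 1.75
    print(f"Trial PUE:    {pue(trial_total_kwh, ict_kwh):.2f}")      # 1.63
    print(f"Overhead energy avoided: {baseline_total_kwh - trial_total_kwh} kWh")
```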
Once the effect of the trial has been measured, any beneficial changes can be made, the new baseline established and the next range of actions planned.

Maintenance

The efficiency and performance of the cooling system is tied to the maintenance regime. Skimping on routine maintenance usually incurs higher running costs and reduces the reliability of the data centre. At minimum, the routine maintenance regime should follow the manufacturer's specifications. Most modern air conditioning units provide historical information which is useful in monitoring overall performance, and when optimising cooling system performance.

Larger data centres, and those with more stringent reliability needs, will find that preventive and active maintenance are likely to be necessary. Preventive (or predictive) maintenance is the scheduled replacement of components and units before they are expected to fail. This type of maintenance relies on advice from the manufacturer (which may change from time to time based on field experience) and on the performance of the systems in the data centre.

Active maintenance involves replacing equipment once any warning signs are noticed. Active maintenance relies on detailed monitoring of all components, and being able to set up parameters for 'normal' and 'abnormal' operation.

Managers should note anecdotal evidence of 'car park servicing', a form of fraud in which maintenance work is claimed to have been done but has not. This risk can be minimised by escorting service providers, and by monitoring the operating history.

Sustainability

Two considerations for sustainable cooling operations are refrigerants and water.

Most refrigeration and air-conditioning equipment uses either ozone depleting or synthetic greenhouse gases, which are legally controlled in Australia. The Ozone Protection and Synthetic Greenhouse Gas Management Act 1989 (the Act) controls the manufacture, import and export of a range of ozone depleting substances and synthetic greenhouse gases. The import, export and manufacture of these 'controlled substances', and the import and manufacture of certain products containing or designed to contain some of these substances, is prohibited in Australia unless the correct licence or exemption is held. More information is available at http://www.environment.gov.au/atmosphere/ozone/index.html.

The Australian Government ICT Sustainability Plan 2010-2015 describes a control process which should be used for reducing water use in cooling systems. There is no specific target for water use (see http://www.environment.gov.au/sustainability/government/ictplan/index.html).

Some cooling technology uses significant quantities of water[9], and its use may be banned under extreme drought conditions. The Green Grid has developed the Water Usage Effectiveness (WUE) metric[10] to assist data centre operators to develop controls to manage their water use. Agencies should note that there are cooling system designs with closed water loops, which need only tens of litres of top-up water. Some data centres also capture rainfall to improve their sustainability.

[9] Evaporative cooling towers can use millions of litres per year. http://www.airah.org.au/imis15_prod/content_files/bestpracticeguides/bpg_cooling_towers.pdf
[10] The Green Grid: http://www.thegreengrid.org/en/Global/Content/white-papers/WUE
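The Green Grid defines WUE as annual site water use in litres divided by annual ICT equipment energy in kilowatt hours. A minimal sketch with hypothetical annual figures:

```python
def wue(annual_water_litres: float, annual_ict_kwh: float) -> float:
    """Water Usage Effectiveness (The Green Grid): litres of water per kWh of ICT energy."""
    return annual_water_litres / annual_ict_kwh

if __name__ == "__main__":
    # Hypothetical annual figures for a site using evaporative cooling towers.
    print(f"WUE = {wue(annual_water_litres=2_500_000, annual_ict_kwh=1_400_000):.2f} L/kWh")
```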
Trends

There are several trends that are likely to affect data centre cooling systems:
- Power efficiency: there is a steady reduction in the amount of power used in all classes of ICT equipment, even as price/performance improves. In some racks the amount of power needed will fall, meaning less cooling is required. LAN switches with copper interfaces are an example of this.
- Densification: there is a steady reduction in size for some classes of ICT equipment, notably servers and storage. Racks with servers are likely to consume more power: as the servers become physically smaller, more servers will fit into a rack, and while each server uses less power, the greater number of servers means more power is needed.
- Cloud computing: this is likely to slow down the rate of expansion of data centre ICT capacity, and may significantly reduce the ICT systems in the data centre.

Conclusion

Cooling is an essential overhead once the ICT equipment uses more than about 10 kW of power. Ideally, the cooling systems keep the temperature and humidity within a range that preserves the life of the ICT equipment. As cooling is typically the largest overhead, agencies should concentrate on making the cooling efficient. However, this challenging and complex work is a lower priority activity than ensuring the power supply and managing ICT moves and changes in the data centre. Key points are:

Design and operate for the specific and the whole. Consider all aspects of the data centre when upgrading or tuning the cooling systems. Ensure that the consequences of changes are considered for other data centre equipment, not only the ICT equipment. The design must consider the whole data centre, and not a subset. Extrapolating from the likely performance of the cooling system at a single rack is likely to produce errors. Instead, model the likely air flow for the entire data centre. Then be sure to measure it.

Air inlet temperature is a key metric: this is the temperature of the air cooling the ICT equipment. Being able to measure the air temperature at this point is central to understanding how well the cooling system is working.

Change the ICT equipment, change the air flow. The cooling system behaviour will change as the hardware, racks, cables and so on change. This means monitoring and tuning the cooling system is a continual task.

Raise the temperature, cautiously. Raising the data centre temperature to more than 24°C has proven effective in many sites for reducing energy use and saving money. However, the ASHRAE guidance clearly advises taking care when doing so. As the temperature rises, there may be changes in the air flow, resulting in new hot spots. As well, different makes and models of ICT equipment may require different temperature and humidity ranges to operate reliably. Agencies must confirm these details against the equipment manufacturer's specifications.

When things go wrong. The operations, disaster recovery and business continuity plans need to explicitly consider minor and major failures, and the time needed to restart. In a major cooling failure, the ICT equipment can continue to warm the data centre air. Time will be needed to remove this additional heat once the cooling system restarts.

Safety. Noise, heat and bacteria are all potential issues with a data centre cooling system. Good operating procedures and training will address these risks.

3. Better Practices

Operations
- The data hall has been arranged in hot and cold aisles, or in a containment solution. All obstructions to the air flow are removed. In particular, cables do not share air ducts or lie across the path that air is intended to follow. This includes to and from the data hall, within the racks, and under the raised floor.
- The temperature of the essential equipment in the data centre is monitored and recorded.
- The humidity of the data centre is monitored and recorded. All deviations from 'Recommended' to 'Allowable' (or worse) are analysed, corrected and reported. The actions to keep the temperature and humidity at recommended levels are documented and practised.
- There is a routinely applied process for finding, investigating and, if needed, removing hot spots from the data centre. A FLIR camera may assist in this process.
- The noise levels in the data centre are monitored and reported. A noise management plan is operating.
- The agency has an energy efficiency target that involves the data centre's cooling system. The target may be based on PUE or NABERS. Progress towards this target is being tracked and reported to the agency's executive monthly.
- There is routine maintenance of all cooling system elements, as per the manufacturer's requirements. The maintenance includes pipes and ducts as well as major equipment. There is a control plan in place to ensure that the maintenance has been performed adequately.
- The data centre is monitored for dust and other particles. Sources of particles, including unfiltered outside air, wearing belts and older floor tiles, are removed over time. Filter paper may be used as an interim measure to improve equipment reliability by removing particles.
- The disaster recovery and business continuity plans include the impacts and controls for the partial or complete failure of the cooling system. There are rehearsals of the limitation, bypass and recovery activities for cooling system operations.
- There is a plan for managing leaks and spills in the data centre. This plan is rehearsed from time to time.
- There is a method for ensuring that procedures are followed and documentation is maintained. This method may be based on ISO 9000 or another framework. There are training and/or communications processes in place to ensure staff know and follow procedures.

Planning

A plan is being followed for:
- Cooling systems asset replacement.
- Altering the capacity and distribution of the cooling.
- Noise management.

A plan exists for the actions following the failure of the cooling system. The length of time to restore the operating temperature is known and updated from time to time. Agencies with larger data centres may conduct a computational fluid dynamics analysis of the data centre from time to time.

All planning work involves the ICT, facilities and property teams. The plans are approved by the senior responsible officer, and included in agency funding.

Fundamental

⊠ Measuring and reporting the power consumption of the cooling system.
⊠ Measuring and reporting the inlet temperature of ICT equipment.
⊠ Monitoring the outlet temperature of ICT equipment.
⊠ Maintaining hot and cold aisle alignment for racks and ICT equipment.
⊠ If hot/cold aisle containment has not been implemented, evaluate the business case for containment.
⊠ The temperature of the inlet air has been raised to at least 22°C.
⊠ There are active operations processes to raise the data centre temperature to reduce energy costs. The temperature range in which the cooling systems and the ICT systems use the least amount of energy has been identified.
⊠ The water use by the cooling system is measured and reported. There are active operations processes to minimise water use.
⊠ The work health and safety plans include noise management. There are plans to reduce noise. The relationship between raising temperature and noise levels is measured and used in operations and capacity planning.
⊠ There is a capacity plan for the cooling system. Options for upgrading or replacing parts of the cooling system are documented.
⊠ The cooling equipment is maintained according to manufacturers' specifications. All works are inspected and verified as being conducted as required. The operating hours of key equipment (e.g. CRACs) are tracked.
⊠ The cooling system is cleaned according to manufacturer's specifications and government regulations.
⊠ There is a plan to manage leaks and spills in the data centre.
⊠ There is a plan, endorsed by senior management, for changing the cooling system's capacity.

4. Conclusion

Agencies that use better practices in their data centres can expect lower costs, better reliability and improved safety than they would otherwise. Implementing the better practices will give managers more information about data centre cooling, enabling better decisions. Overall, the data centre will be better aligned to the agency's strategic objectives and the total cost of ownership will be lower. Agencies will also find it simpler and easier to report against the mandatory objectives of the data centre strategy. The key metric is avoided costs, that is, the costs that agencies did not incur as a result of improvements in their data centres. Capturing avoided costs is most effective when done by an agency in the context of a completed project that has validated the original business case.

Summary of Better Practices

Cooling is an essential overhead once the ICT equipment uses more than about 10 kW of power. Cooling is typically the largest data centre overhead and agencies should ensure that the cooling system is efficient. Key points are:

Design and operate for the specific and the whole. The design must consider the whole data centre, and how each piece of equipment is cooled. The likely air flow through the data centre should be modelled and measured routinely.

Air inlet temperature is a key metric. Being able to measure the air temperature at this point is central to understanding how well the cooling system is working.

Change the ICT equipment, change the air flow. The cooling system behaviour will change as the data centre configuration changes. Monitoring and tuning the cooling system is a continual task.

Raise the temperature, cautiously. Raising the data centre temperature has proven effective in many sites for reducing energy use and so saving money. The ASHRAE guidance clearly advises taking care when doing so.

When things go wrong. In a major cooling failure, the ICT equipment can continue to warm the data centre air. Time will be needed to remove this additional heat once the cooling system restarts.

Safety. Noise, heat and bacteria are all potential issues with a data centre cooling system.