A White Paper from the experts in Business-Critical Continuity™

Responding to major power outages
by Mike O’Keeffe

Abstract

With the increased reliance on server availability and the demand for data centres to remain operational throughout the year, it is imperative that they are 100% resilient and can withstand any deficiencies in the mains power supply.

The resilience of the critical power infrastructure, including its operational management, is often established at the design phase and does not evolve with the installation. As a result, shortfalls in the original design philosophy can develop and, in extreme circumstances, lead to unnecessary supply interruptions.

In many cases, a complete loss of power is the result of multiple failings within the power infrastructure and/or its management. In any operation it is imperative that the response to these outages is well managed and that the outcome is a more resilient power supply and operational support mechanism.

The responses required to major power outages can be categorised into a number of areas:
• Initial response to incident (level 1 response)
• Fault identification and repair
• Fault analysis
• Lessons learnt and post fault actions

The objective in all of these areas is threefold:
1. Contain any health and safety (H&S) or environmental risks.
2. Restore the critical power infrastructure to the designed topology as soon as possible. The site topology is designed to provide:
   • Redundancy
   • Fault tolerance of system
   • Maintainability
3. Learn lessons to prevent similar incidents happening again.

Initial response to incident (level 1 response)

The initial response to any site incident will be performed by either site-based staff or the facilities management staff responsible for the location.
The associated timeframes in the Service Level Agreement are usually less than one hour for major sites or two hours for locations where the impact is less severe.

The initial response (level 1 response) must include the containment of the incident, in particular any health and safety or environmental issues that may have arisen (equipment faults, defective batteries, fuel leaks, etc.), as well as preventing further escalation of the problem encountered. Suitably trained site-based staff will have a detailed knowledge of the site and a broad understanding of the variety of technologies employed within the critical power system to enable this containment to take place.

Where specialist equipment is utilised, such as UPS, static switches, or site-specific standby or generator configurations, the support of the maintainer, either the OEM or a specialist support contractor, is essential in ensuring the following:
• Site engineers or facility managers are trained to switch equipment and read its alarms
• Support is available from a permanently manned, 24/7 technical support help desk
• Remote monitoring of installed equipment allows detailed interrogation of equipment to verify site findings and provide important data for fault identification
• A reliable contractual call-out response from the OEM or specialist support contractor is achieved, so that effective resolution of the problem can commence

It is important at the initial stages of the fault that not only is the problem contained but that the gathering of data, to enable the investigation into the event, is undertaken in a controlled manner. On many occasions specialist repairs have been undertaken before the root cause of the fault was established, or without vital equipment status data being captured at the time of the incident and immediately after the event.
This has resulted in repeated faults and continuing disruption to operations.

Specialist support engineers are trained not only in maintenance and repair of equipment, but also in fault finding and investigation. Field engineers’ reports provide vital information for the fault investigation. Further reliable, time-stamped information to assist the fault investigation can be gathered from manufacturers’ remote monitoring systems, which can provide a back-up to any site-based control or recording system. This enables detailed verification of events and analysis of recorded disturbances to be undertaken to confirm the nature of the incident.

Key suppliers should be part of the incident management process. Establishing the contractual support arrangements required to achieve all these actions is not something that should be assumed or left until an incident takes place. To ensure that the above occurs, sites must partner with suppliers who actively manage their reactive service, typically having the following available:
• Multiple (regionally based) callout engineers and managed spares availability
• 24/7 technical help desk
• Management escalation process
• Remote system monitoring
• In-country training facilities for site engineers on installed equipment
• Availability of critical power equipment (UPS and generators) to hire in emergency situations to contain the problem

In any incident, fault containment should be the first priority, followed by the accurate gathering of data to enable the fault investigation process to be continued:
• Fault identification and repair
• Fault analysis
• Lessons learnt and post fault action

Fault identification and repair

When a major incident occurs within a critical power infrastructure, the correct initial response is to assess any environmental or health and safety risks. Once these have been cleared, the investigation can be launched.
The most common occurrence is that the site’s critical power system has provided the correct level of fault tolerance and the load is still fully protected. It is also possible that power is still available to the load, either supplied from the UPS system reserve supply (unconditioned mains) or supported by the site’s standby generators. In either scenario, it is desirable to return the system to the designed level of redundancy in the shortest possible time frame.

Deciding the next course of action requires an understanding of the failure mode. This can vary significantly and will often include process, system and equipment failures. Information gathered manually on the incident will need to be verified against site data acquisition and BMS systems, and ideally against the manufacturer’s remote monitoring systems.

The availability of a manufacturer’s remote monitoring system can significantly reduce the time involved in the decision making process by providing accurate and rapid information on the events encountered. Access to information recorded before and after the event (verification of equipment status, supply status, fault type and timing information on any supply interruptions, pre-fault loadings, battery status, temperature conditions in the equipment locality and switching actions) significantly aids the level of understanding.

The availability of accurate data, technical support and a variety of support services will ensure that not only are correct decisions made, but that these can be rapidly put into action.

[Figure: UPS remote monitoring system - Events control]

The fault type will dictate the response required. Some failures can be rectified quickly by switching actions, whilst others will require specialist repairs. The table below highlights the variety of problems that can be encountered, together with typical resolutions.
Fault type: Operational restriction exceeded
• Overload
• Operator error
Required response: Verify actions taken on site or conditions present before system fault. The potential for a repeat fault needs to be removed before the appropriate actions are taken to restore the system.

Fault type: External influence
• Foreign body
• Environmental influence
  - Physical (temperature, humidity, etc.)
  - Electrical
Required response: Requires a detailed understanding of the situation on site. This can be formulated from a combination of physical investigations and a review of available data. A resolution can be as simple as restoring climate control facilities or as detailed as investigating supply tolerance problems.

Fault type: Critical power equipment failure
• Component failure
• System malfunction
Required response: Where equipment failures or system malfunctions have occurred, specialist support will be required to ensure restoration is achieved in a reasonable time frame.

Where severe physical damage has occurred, or the root cause of the problem cannot be established with confidence, alternative supply arrangements will need to be made or the load temporarily transferred. In extreme circumstances temporary hire generators or UPS systems may need to be installed. The site’s risk management strategy will ideally have highlighted where these are to be physically located and electrically connected. Dedicated connection points are often designed in, or retrofitted into, system designs to accommodate these possibilities.

The availability of the specialist maintainer / OEM, which can provide trained engineers and the required parts within a rapid time frame, is essential if repairs are to be completed and the disruption of providing temporary supplies avoided.

The availability and management of spares is something that the maintenance support company is best placed to manage, particularly if the support is provided by the original equipment manufacturer.
A number of issues, such as the following, can make spares holdings on the local site redundant within a relatively short time frame:
• Version control (software / firmware and hardware)
• Component obsolescence management
• Component shelf life (capacitors and batteries)
• Secure and accessible storage

A competent specialist maintainer or support company will provide a regionalised managed spares service to ensure repair times are not delayed unnecessarily and, wherever possible, a first time fix can be achieved. In conjunction with this managed spares service, having parts readily available can reduce the time to repair more common faults such as fan failures.

Post Fault Analysis

Once a major fault incident has been correctly identified, the designed level of redundancy will have been restored via switching, repair or the connection of temporary UPS / generator systems. With this restored, and any health and safety or environmental risks contained, a full analysis of the incident can then be undertaken.

To confirm that the fault identified initially is correct and to provide detailed analysis for lessons learnt, it is important that collected data is accurate and delivered in a timely manner. Typical data sources are as follows:
• Time-stamped event records from site management systems
• Event reports from manufacturers
• Site monitoring of electrical parameters
• Manufacturers’ remote monitoring reports
• Detailed technical fault reports from manufacturers and specialist support companies, to component level if required
• Operator and sub-contractor interviews

Reports can then be utilised to verify the events experienced and actions taken, and the confirmed events can be supplemented with a detailed fault analysis report.
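The cross-checking of time-stamped records from different sources can be illustrated with a minimal sketch. It assumes that each source (for example a site management system and a manufacturer's remote monitoring system) can export events as a timestamp plus a short description. The source labels, the five second matching tolerance and the sample events are all hypothetical and used for illustration only. The sketch merges the records into one chronological timeline and lists events that one source recorded but the other did not, which would merit follow-up during the analysis.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Event:
    timestamp: datetime
    source: str        # hypothetical label, e.g. "site_bms" or "oem_remote"
    description: str


def build_timeline(*sources):
    """Merge event lists from several sources into one list sorted by time."""
    merged = [event for events in sources for event in events]
    return sorted(merged, key=lambda e: e.timestamp)


def unconfirmed(events, others, tolerance=timedelta(seconds=5)):
    """Return events with no matching description in `others` within `tolerance`."""
    return [
        e for e in events
        if not any(
            o.description == e.description
            and abs(o.timestamp - e.timestamp) <= tolerance
            for o in others
        )
    ]


# Illustrative sample data only, not taken from any real incident.
site_bms = [
    Event(datetime(2011, 1, 1, 10, 0, 0), "site_bms", "mains failure"),
    Event(datetime(2011, 1, 1, 10, 0, 12), "site_bms", "generator start"),
]
oem_remote = [
    Event(datetime(2011, 1, 1, 10, 0, 1), "oem_remote", "mains failure"),
    Event(datetime(2011, 1, 1, 10, 0, 3), "oem_remote", "UPS on battery"),
]

timeline = build_timeline(site_bms, oem_remote)
for e in timeline:
    print(e.timestamp.isoformat(), e.source, e.description)

# Events seen by remote monitoring that the site system did not record.
only_remote = unconfirmed(oem_remote, site_bms)
```

In practice the matching rule would need to reflect how each system names its events and how well the clocks of the systems are synchronised; exact description matching is used here only to keep the sketch short.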
Fault types can be classified into two broad categories, defined by the root cause of the problem:

Operation outside design parameters: This can be verified by reviewing recorded measurements, switching events, staff interviews and information from the manufacturer’s remote monitoring systems. Whilst the impact of these events can be as severe as an equipment failure, they can be readily identified, and the operational restrictions or practices that contributed to the fault can be easily corrected. With the fault repaired, the system can be returned to normal operation.

Equipment failure: If a component failure has occurred, this could be due to a weakness in the component (software / firmware or hardware), an age-related failure or an external influence.

Working with a reputable manufacturer who has full access to the manufacturing records of the installed equipment will provide ready access to failure rates and previously experienced fault conditions. The leading manufacturers operate internal failure reporting, analysis and corrective action systems (FRACAS), in which all faults are tracked and recommendations are made on the management of the installed base of equipment, based on the trending information received.

As part of the investigation, the installed system should be checked to make sure it contains the latest software / firmware and hardware board revisions. If it does not, this raises the question of whether this was a contributory factor to the fault.

In addition, a lot of equipment in the critical power environment contains parts considered as “consumable”, as defined by their normal life expectancy. These include batteries, capacitors and fans, as well as standby generator fluids, filters, hoses and belts. As part of the support service provided, these should be regularly replaced in line with the manufacturer’s recommendations.
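Tracking consumables against their recommended replacement intervals lends itself to a simple automated check. The sketch below is an illustration under stated assumptions: the service-life figures are placeholders and not manufacturer data, and the asset names and fitting dates are hypothetical. Real intervals must be taken from the recommendations for the installed equipment.

```python
from datetime import date

# Placeholder service lives in years. These are NOT manufacturer figures;
# substitute the values recommended for the installed equipment.
SERVICE_LIFE_YEARS = {"battery": 5, "capacitor": 7, "fan": 4}


def overdue_consumables(installed, today):
    """Return the names of consumables whose age has reached the service life."""
    overdue = []
    for name, (kind, fitted) in installed.items():
        age_years = (today - fitted).days / 365.25
        if age_years >= SERVICE_LIFE_YEARS[kind]:
            overdue.append(name)
    return sorted(overdue)


# Hypothetical asset register: name -> (consumable type, date fitted).
installed = {
    "UPS-A battery string": ("battery", date(2005, 3, 1)),
    "UPS-A DC capacitors": ("capacitor", date(2008, 6, 1)),
    "UPS-A cooling fans": ("fan", date(2009, 1, 15)),
}

print(overdue_consumables(installed, today=date(2011, 6, 1)))
```

A check of this kind could also be run during a fault investigation to confirm whether any consumable in the failed equipment was beyond its recommended life at the time of the incident.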
Where the equipment fault is due to an external influence (foreign body, environmental or supply tolerances), the root cause can be more difficult to establish, as the resulting damage is often significant.

Where a foreign body is suspected of being the cause of a fault in an item of electronic equipment (UPS, transfer switch, PDU, etc.), it is important to establish the origin of the foreign body. These have been known to derive from a number of different sources, typically:
• Original site installation
• Post commissioning site works
• Other site sources

Whilst it could be considered unlikely that conductive materials large enough to cause short circuits within equipment could be transported by moving air, the level of cooling provided within densely populated power rooms should be considered. Individual UPS equipment could have a large number of powerful fans drawing air from cooling systems, resulting in significant airflow within the room. Sources of conductive material that could be transported by moving air are typically:
• Lagging from cooling ducts
• Residue from galvanised cable tray manufacturing (see below)

[Figure: Flakes of metal from galvanised cable trays]

Where the origin of the foreign body cannot be readily identified, but the evidence of a short circuit is clear, the investigation may need to involve shorted components being sent for micro-structural examination to collect further data. The fault shown below clearly indicates where arcing occurred on the isolating switch.

Fault Analysis: Incident Management Process

Typical power system events to be reviewed:
• Did the power system respond to the fault as expected?
  - This can include static switch transfer times
• Did parallel UPS systems take up the load on failure?
• Did generator auto-changeover panels operate correctly?
• Did the power system protection co-ordination work correctly?
Typical incident management process review:
• Did the site-based / facilities management staff undertake the correct actions during the incident?
• Was the system restored in the correct manner?
• Was the response from the maintainer as per the contractual agreement?
• Were parts and trained engineers available to undertake any required repairs in a timely manner?

The more detailed the understanding of the fault incident, the more comprehensive the action plan and, ultimately, the lower the likelihood of a repeat fault.

[Figure: Isolator with clear signs of flash over between phases]

Analysis of the debris in the area of the fault can provide information as to the make-up of the foreign body. In this case, significant traces of aluminium were found, although aluminium was not used within the manufactured product.

In addition to understanding the experienced failure, it is also important to understand whether the wider power system and incident management process operated correctly.

Quickly understanding the incident and its potential causes enables correct decisions to be made and rapidly implemented to restore designed levels of redundancy. With the site design redundancy successfully restored, a full analysis of the incident can be undertaken, enabling any lessons learnt from the event to be put into a post fault action plan. As faults often involve multiple failures, of process and/or equipment, this phase can be lengthy. However, it is important that the review is thorough and that all items are followed up to close out.

To minimise the potential for ongoing or repeat disruption to data centre loads it is essential that any major incidents are followed up with a thorough investigation and a post incident action plan.
Whether the fault has resulted in an unexpected loss of redundancy, exposed a design weakness to certain fault categories or has simply exposed a requirement for training of local staff, the improvement requirements need to be captured and appropriate actions initiated.

In major incidents, it is likely that the root cause failure analysis will have highlighted a number of failures in the site processes and in the critical power infrastructure, whether in the total system or in individual equipment. The lessons learnt could lead to a number of changes in the management of the site and/or the equipment, as well as potential system enhancements.

Analysing events in a logical and structured manner will ensure that all areas are given due consideration. Typical areas that will need to be covered are detailed below, but these will vary depending on the fault experienced, the site, and the installed critical power system.

Process Failures

• Did the incident management process operate correctly?
• Review of contractual support agreements in place
• Review of site and sub-contractor performance during the incident. Were service level agreements achieved?
• Was data for incident analysis readily available?
• Site system monitoring
• Enhanced supplier remote system monitoring
• Local staff’s ability to analyse alarm and fault conditions
• Was sufficient data available to rapidly analyse the fault and restore supplies?

Change in operating practices
• Capacity management process - how is equipment loaded?
• Review of switching processes - were these undertaken correctly?
• Are adequate change control and sub-contractor management controls in place?

Review of staff competencies
• Technical capability
• Any identified training requirements?

System Failures

Review of system operating performance
• Redundancy
• Fault tolerance performance

Can system resilience be improved?
• Enhancement of site topology / system redundancy
• Power supply configurations / use of parallel feeds to loads

Has system testing been regularly undertaken?
• Generator testing
• Load balancing of parallel UPS
• Full load tests for generators and UPS systems
• Building Integrated System Testing and Black Building tests

Did the system protection co-ordination operate as expected?

Review of system cooling performance

Equipment Failures

Are any changes in the planned maintenance regime required?
• Increased programme of maintenance
• Lengthening timeframes for maintenance works to ensure thoroughness
• Inclusion of a regular deep clean of the local environment
• Re-calibration of equipment

Changes in equipment management practices
• Spares availability and management
• Implementation of manufacturers’ recommended equipment upgrades
• Replacement of consumables (generator fluids, UPS batteries, capacitors and fans) as per manufacturers’ recommendations

Equipment life cycle analysis
• Is sufficient planning in place for equipment refurbishment / replacement?

Establishing the contractual support arrangements required to achieve the correct response to a critical power system failure is not something that should either be assumed or left until an incident occurs. To ensure the correct response, sites must partner with suppliers who actively manage their reactive service and typically make vital services available.
Chloride Services of Emerson Network Power provides the following to support customers when most required:
• A permanently staffed technical support centre, 24/7
• Comprehensive remote monitoring offering for UPS and other items of critical power plant
• Strategically managed spares holdings
• Large team of trained critical power UPS, cooling and generator engineers
• Dedicated UK and factory technical support engineers
• Technical and management escalation process
• Availability of hire generators and UPS

About the author

Mike O'Keeffe is the General Manager of the Chloride Service Business in the United Kingdom. Mike graduated from North Staffordshire University with a Degree in Electrical & Electronic Engineering and has spent more than 20 years in the Electrical Supply Industry. Mike and his team are responsible for the delivery of all elements of Service activity in the UK to over 3,500 customers.

Emerson Network Power
Chloride Products & Services
George Curl Way
Southampton
Hampshire SO18 2RY
T +44 (0)23 8061 0311
F +44 (0)23 8061 0852
E uk.enquiries.chloride@emerson.com

Emerson Network Power. The global leader in enabling Business-Critical Continuity™.
EmersonNetworkPower.com

Emerson, Business-Critical Continuity and Emerson Network Power are trademarks of Emerson Electric Co. or one of its affiliated companies. ©2011 Emerson Electric Co.