Responding to major power outages
A White Paper from the experts in Business-Critical Continuity™
by Mike O’Keeffe
Abstract
With the increased reliance on server availability and the demand for data centres to remain
operational throughout the year, it is imperative that they are 100% resilient and can withstand
any deficiencies in the mains power supply.
The resilience of the critical power infrastructure, including its operational management, is
often established at the design phase and does not evolve with the installation. This can allow
shortfalls in the original design philosophy to develop and, in extreme circumstances, can
result in unnecessary supply interruptions.
In many cases, where a complete loss of power has been experienced, this is the result of
multiple failings within the power infrastructure and/or its management. In any operation it is
imperative that the response to these outages is well managed and the outcome is a more
resilient power supply and operational support mechanism.
The responses required to major power outages can be categorised into a number of areas:
• Initial response to incident (level 1 response)
• Fault identification and repair
• Fault analysis
• Lessons learnt and post fault actions
The objective in all of these areas is threefold:
1. Ensure any H&S/environmental risks are contained
2. Restore the critical power infrastructure to its designed topology as soon as possible.
   The site topology is designed to provide:
   • Redundancy
   • Fault tolerance of the system
   • Maintainability
3. Ensure lessons are learnt to prevent similar incidents happening again
Initial response to incident (level 1 response)
The initial response to any site incident will be performed by
either site-based staff or the facilities management staff
responsible for the location. The response timeframes set out in the Service Level Agreement
are usually less than one hour for major sites, or two hours for locations where the impact is
less severe.
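As a simple illustration, the sketch below (a hypothetical Python example; only the one-hour and two-hour targets are taken from this paper) shows how a level 1 response could be checked against its SLA window.

```python
from datetime import datetime, timedelta

# SLA windows quoted above: under one hour for major sites,
# two hours for locations where the impact is less severe.
SLA_WINDOWS = {
    "major": timedelta(hours=1),
    "standard": timedelta(hours=2),
}

def level1_response_met_sla(site_category: str,
                            incident_raised: datetime,
                            responder_on_site: datetime) -> bool:
    """Return True if the level 1 response arrived within the SLA window."""
    return (responder_on_site - incident_raised) <= SLA_WINDOWS[site_category]

# Example: a major-site incident raised at 02:15 with an engineer on site at 03:05
print(level1_response_met_sla("major",
                              datetime(2011, 3, 1, 2, 15),
                              datetime(2011, 3, 1, 3, 5)))   # True (50 minutes)
```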
The initial response (level 1 response) must include the
containment of the incident, in particular any health and safety
or environmental issues that may have arisen (equipment faults,
defective batteries, fuel leaks, etc.), as well as preventing the
further escalation of the problem encountered.
Suitably trained site-based staff will have a detailed knowledge
of the site and a broad understanding of the variety of
technologies employed within the critical power system to
enable this containment to take place. Where specialist
equipment is utilised, such as UPS, static switches, site specific
standby or generator configurations, the support of the
maintainer, either the OEM or specialist support contractor, is
essential in ensuring the following:
• Site engineers or facility managers are trained to perform
switching operations and to read equipment alarms
• Support is available from a permanently manned, 24/7
technical support help desk
• Remote monitoring of installed equipment allows detailed
interrogation of equipment to verify site findings and provide
important data for fault identification
• Reliable contractual call out response from the OEM or
specialist support contractor is achieved to enable the
effective resolution of the problem to commence
It is important at the initial stages of the fault that not only is the
problem contained but that the gathering of data, to enable the
investigation into the event, is undertaken in a controlled
manner. On many occasions specialist repairs have been
undertaken before the root cause of the fault has been
established, or without vital equipment status data being
captured at the time of the incident and immediately after the
event. This has resulted in repeated faults and continuing
disruption to operations.
Specialist support engineers are trained not only in maintenance
and repair of equipment, but also fault finding and investigation.
Field engineers’ reports provide vital information into the fault
investigation.
Further time-stamped, reliable information to assist the fault
investigation can be gathered from manufacturers’ remote
monitoring systems, which can provide a backup to any site-based
control or recording system. This enables detailed
verification of events and analysis of recorded disturbances to be
undertaken to confirm the nature of the incident.
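To illustrate the cross-checking this enables, the hypothetical Python sketch below merges site records with the manufacturer's remote monitoring feed into a single time-ordered view (the field names and event texts are assumptions, not from any particular product).

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List

@dataclass
class Event:
    timestamp: datetime   # time-stamped record
    source: str           # e.g. "site BMS" or "remote monitoring"
    description: str      # alarm or disturbance text

def merged_timeline(site_events: List[Event],
                    remote_events: List[Event]) -> List[Event]:
    """Interleave site and remote-monitoring records into one ordered timeline
    so that each site finding can be verified against the manufacturer's data."""
    return sorted(site_events + remote_events, key=lambda e: e.timestamp)

# Example: a mains disturbance seen by both systems
site = [Event(datetime(2011, 3, 1, 2, 14, 5), "site BMS", "UPS on battery")]
remote = [Event(datetime(2011, 3, 1, 2, 14, 3), "remote monitoring",
                "input supply out of tolerance")]
for ev in merged_timeline(site, remote):
    print(ev.timestamp.isoformat(), ev.source, "-", ev.description)
```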
Key suppliers should be part of the incident management
process. Establishing the contractual support arrangements
required to achieve all these actions is not something that
should be assumed or left until an incident takes place.
To ensure that the above occurs, sites must ensure they
partner with suppliers who actively manage their reactive
service, typically having the following available:
• Multiple (regionally based) callout engineers and managed
spares availability
• 24/7 technical help desk
• Management escalation process
• Remote system monitoring
• In-country training facilities for site engineers on installed
equipment
• Availability of critical power equipment (UPS and Generators)
to hire in emergency situations to contain the problem
In any incident, fault containment should be the first
priority, followed by the accurate gathering of data to
enable the fault investigation process to be continued:
• Fault identification and repair
• Fault analysis
• Lessons learnt and post fault actions
Fault identification and repair
When a major incident within a critical power infrastructure
occurs, the correct initial response is to assess any environmental
or health and safety risks. Once these have been cleared, the
investigation can be launched.
The most common occurrence is that the site’s critical power
system has provided the correct level of fault tolerance and the
load is still fully protected. It is also possible that power is still
available to the load - either supplied from the UPS system
reserve supply (unconditioned mains) or supported by the site’s
standby generators. In either scenario, it is desirable to return
the system to the designed level of redundancy in the shortest
possible time frame.
Determining the next course of action requires an
understanding of the failure mode. This can vary significantly
and will often include process, system and equipment failures.
Information gathered manually on the incident will need to be
verified with site data acquisition and BMS systems, and ideally
verified with the manufacturer’s remote monitoring systems.
The availability of a manufacturer’s remote monitoring system
can significantly reduce the time involved in the decision
making process by providing accurate and rapid information on
the events encountered.
Access to information captured before and after the event
(verification of equipment status, supply status, fault type and
timing information on any supply interruptions, pre-fault
loadings, battery status, temperature conditions in the
equipment locality and switching actions) significantly aids the
level of understanding.
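As a minimal sketch of what such a pre/post-event record might contain, the hypothetical Python example below captures the parameters listed above; the field names and values are illustrative assumptions.

```python
from dataclasses import dataclass, asdict
from datetime import datetime
from typing import Optional
import json

@dataclass
class StatusSnapshot:
    """One pre- or post-event record of the parameters listed above.
    Field names are illustrative, not taken from any particular product."""
    taken_at: datetime
    equipment_id: str
    supply_status: str              # e.g. "mains", "reserve", "generator"
    load_percent: float             # pre-fault loading
    battery_status: str             # e.g. "healthy", "discharging", "fault"
    ambient_temp_c: float           # temperature in the equipment locality
    last_switching_action: Optional[str] = None

def to_report_line(snapshot: StatusSnapshot) -> str:
    """Serialise a snapshot for inclusion in the fault investigation pack."""
    record = asdict(snapshot)
    record["taken_at"] = snapshot.taken_at.isoformat()
    return json.dumps(record, sort_keys=True)

print(to_report_line(StatusSnapshot(
    taken_at=datetime(2011, 3, 1, 2, 13, 58),
    equipment_id="UPS-A1",
    supply_status="reserve",
    load_percent=62.0,
    battery_status="discharging",
    ambient_temp_c=24.5,
    last_switching_action="static switch transfer to reserve",
)))
```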
The availability of accurate data, technical support and a variety
of support services will ensure that not only are correct
decisions made, but that these can be rapidly put into action.
Figure: UPS remote monitoring system - events control
The fault type will dictate the response required. Some failures
can be rectified quickly by switching actions, whilst others will
require specialist repairs. The table below highlights the variety
of problems that can be encountered, together with typical
resolutions.
Fault type: Operational restriction exceeded
• Overload
• Operator error
Required response: Verify actions taken on site or conditions present before the system
fault. The potential for a repeat fault needs to be removed before the appropriate actions
are taken to restore the system.

Fault type: External influence
• Foreign body
• Environmental influence - physical (temperature, humidity, etc.) or electrical
Required response: Requires a detailed understanding of the situation on site. This can be
formulated from a combination of physical investigations and a review of available data. A
resolution can be as simple as restoring climate control facilities or as detailed as
investigating supply tolerance problems.

Fault type: Critical power equipment failure
• Component failure
• System malfunction
Required response: Where equipment failures or system malfunctions have occurred,
specialist support will be required to ensure restoration is achieved in a reasonable time
frame.
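The classification above lends itself to a simple lookup; the hypothetical Python sketch below maps each fault category from the table to a condensed summary of the required response (the structure itself is an assumption, not part of the paper).

```python
from enum import Enum, auto

class FaultType(Enum):
    OPERATIONAL_RESTRICTION_EXCEEDED = auto()   # overload, operator error
    EXTERNAL_INFLUENCE = auto()                 # foreign body, environment
    EQUIPMENT_FAILURE = auto()                  # component or system fault

# Condensed summary of the required responses from the table above.
REQUIRED_RESPONSE = {
    FaultType.OPERATIONAL_RESTRICTION_EXCEEDED:
        "Verify on-site actions and conditions; remove the cause of a repeat "
        "fault before restoring the system.",
    FaultType.EXTERNAL_INFLUENCE:
        "Combine physical investigation with a review of available data; the "
        "fix may range from restoring climate control to investigating supply "
        "tolerances.",
    FaultType.EQUIPMENT_FAILURE:
        "Engage specialist (OEM) support to restore the system in a "
        "reasonable time frame.",
}

def required_response(fault: FaultType) -> str:
    """Return the summary response path for a classified fault."""
    return REQUIRED_RESPONSE[fault]

print(required_response(FaultType.EXTERNAL_INFLUENCE))
```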
Where severe physical damage has occurred, or the root cause
of the problem cannot be established with confidence,
alternative supply arrangements will need to be made or the
load temporarily transferred.
In extreme circumstances temporary hire generators or UPS
systems may need to be installed. Ideally, the site’s risk
management strategy will have identified where these can be
physically located and electrically connected. Dedicated
connection points are often designed in, or retrofitted into,
system designs to accommodate these possibilities.
The availability of a specialist maintainer or OEM who can
provide trained engineers and the required parts within a rapid
time frame is essential if repairs are to be completed and the
disruption of providing temporary supplies avoided. The
availability and management of spares is something that the
maintenance support company is best placed to manage,
particularly if the support is provided by the original equipment
manufacturer.
A number of issues, such as the following, can make spares held
on the local site redundant after a relatively short time frame:
• Version control (software/firmware and hardware)
• Component obsolescence management
• Component shelf life (capacitors and batteries)
• Secure and accessible storage
A competent specialist maintainer or support company will
provide a regionalised managed spares service to ensure repair
times are not delayed unnecessarily and, wherever possible, a
first-time fix can be achieved. In conjunction with this managed
spares service, having parts readily available can reduce the time
to repair more common faults such as fan failures.
Post Fault Analysis
Once a major fault incident has been correctly identified, the
designed level of redundancy will have been restored via
switching, repair or the connection of temporary UPS/generator
systems. With redundancy restored and any health and safety or
environmental risks contained, a full analysis of the incident can
be undertaken.
To confirm the fault identified initially is correct and to
provide detailed analysis for lessons learnt, it is important
that collected data is accurate and delivered in a timely
manner. Typical data sources are as follows:
• Time-stamped event records from site management systems
• Event reports from manufacturers
• Site monitoring of electrical parameters
• Manufacturers’ remote monitoring reports
• Detailed technical fault reports from manufacturers and
specialist support companies - to component level if required
• Operator and sub-contractor interviews
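A hypothetical sketch of how the completeness of this data pack might be checked before analysis begins is shown below; the source labels are illustrative placeholders paraphrasing the list above, not terms from the paper.

```python
# Expected data sources for a post-fault analysis pack (labels are illustrative).
EXPECTED_SOURCES = {
    "site event records",          # time-stamped site management system logs
    "manufacturer event report",
    "site electrical monitoring",
    "remote monitoring report",
    "technical fault report",      # manufacturer / specialist support company
    "operator interviews",
}

def missing_sources(collected: set) -> set:
    """Return the data sources still outstanding before full analysis starts."""
    return EXPECTED_SOURCES - collected

collected_so_far = {"site event records", "remote monitoring report"}
print(sorted(missing_sources(collected_so_far)))
```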
Reports can then be utilised to verify the events experienced
and actions taken, and the confirmed events can be
supplemented with a detailed fault analysis report.
Fault types can be classified into two broad categories, defined
by the root cause of the problem:
Operation outside design parameters:
This can be verified by reviewing recorded measurements,
switching events, staff interviews and information from the
manufacturer’s remote monitoring systems. Whilst the impact
of these events can be as severe as an equipment failure, they
can be readily identified and operational restrictions or
practices that contributed to the fault can be easily corrected.
With the fault repaired, the system can be returned to normal
operation.
Equipment Failure:
If a component failure has occurred this could be either a
weakness in the component (software / firmware or hardware),
an age related failure or a result of an external influence.
Working with a reputable manufacturer who has full access to
the manufacturing records of the installed equipment will
provide ready access to failure rates and previously experienced
fault conditions.
The leading manufacturers will operate internal fault reporting
and corrective action systems (FRACAS), in which all faults are
tracked and, based on the trending information received,
recommendations are made on the management of the installed
base of equipment.
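As a minimal sketch of the kind of trending such a system performs (the record fields and counting approach are assumptions, not a description of any vendor's FRACAS), consider:

```python
from collections import Counter
from dataclasses import dataclass
from typing import List

@dataclass
class FaultRecord:
    """One entry in a fault reporting and corrective action system (FRACAS).
    Fields are illustrative assumptions, not a specific vendor's schema."""
    equipment_model: str
    component: str
    corrective_action: str

def trend_by_component(records: List[FaultRecord], model: str) -> Counter:
    """Count faults per component for one equipment model, so recurring failures
    across the installed base can drive fleet-wide recommendations."""
    return Counter(r.component for r in records if r.equipment_model == model)

history = [
    FaultRecord("UPS-X", "dc capacitor", "replace capacitor bank"),
    FaultRecord("UPS-X", "fan", "replace fan"),
    FaultRecord("UPS-X", "dc capacitor", "replace capacitor bank"),
]
print(trend_by_component(history, "UPS-X").most_common())
# [('dc capacitor', 2), ('fan', 1)]
```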
As part of the investigation, the installed system should be
checked to confirm it contains the latest software/firmware and
hardware board revisions. If it does not, the question arises as to
whether this was a contributory factor to the fault.
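A hypothetical revision check of this kind might look like the following (the item names and revision labels are invented for illustration):

```python
# Latest revisions released by the manufacturer (assumed example values).
LATEST_REVISIONS = {"firmware": "4.20", "control board": "C"}

def out_of_date_items(installed: dict) -> dict:
    """Return items whose installed revision differs from the latest released,
    flagging a possible contributory factor to the fault."""
    return {item: (revision, LATEST_REVISIONS[item])
            for item, revision in installed.items()
            if item in LATEST_REVISIONS and LATEST_REVISIONS[item] != revision}

print(out_of_date_items({"firmware": "4.10", "control board": "C"}))
# {'firmware': ('4.10', '4.20')}
```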
In addition, a lot of equipment in the critical power environment
contains parts considered as “consumable”, as defined by their
normal life expectancy. These include batteries, capacitors, and
fans, as well as standby generator fluids, filters, hoses and belts.
As part of the support service provided these should be
regularly replaced in line with the manufacturer’s
recommendations.
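A simple way to track such consumables is sketched below; the replacement intervals shown are illustrative assumptions, and in practice the manufacturer's recommendations must be used.

```python
from datetime import date, timedelta

# Illustrative (assumed) replacement intervals in years for common consumables;
# real values must come from the manufacturer's recommendations.
ASSUMED_INTERVAL_YEARS = {"fan": 5, "dc capacitor": 7, "vrla battery": 8}

def replacement_due(item: str, installed: date, today: date) -> bool:
    """True if the consumable has exceeded its assumed replacement interval."""
    interval = timedelta(days=365 * ASSUMED_INTERVAL_YEARS[item])
    return today - installed >= interval

print(replacement_due("fan", date(2005, 6, 1), date(2011, 9, 1)))           # True
print(replacement_due("vrla battery", date(2008, 1, 1), date(2011, 9, 1)))  # False
```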
Where the equipment fault is due to an external influence -
foreign body, environmental or supply tolerances - the root
cause can be more difficult to establish, as the resulting damage
is often significant.
Where a foreign body is suspected of being the cause of a fault
of an item of electronic equipment (UPS, Transfer Switch, PDU
etc), it is important to establish the origin of the foreign body.
These have been known to derive from a number of different
sources, typically:
• Original site installation
• Post commissioning site works
• Other site sources
Whilst it could be considered unlikely that conductive materials
(large enough to cause short circuits within equipment) could be
transported by moving air, the level of cooling provided within
densely populated power rooms should be considered. Individual
UPS units can have a large number of powerful fans drawing air
from the cooling systems, resulting in significant airflow within
the room.
Sources of conductive material that could be transported by
moving air are typically:
• Lagging from cooling ducts
• Residue from galvanised cable tray manufacturing (see below)
Figure: Flakes of metal from galvanised cable trays

Where the origin of the foreign body cannot be readily
identified, but the evidence of a short circuit is clear, the
investigation may need to involve shorted components being
sent for micro-structural examination to collect further data.
The fault pictured below clearly shows where the arcing fault
occurred on the isolating switch.

Figure: Isolator with clear signs of flash-over between phases

Analysis of the debris in the area of the fault can provide
information as to the make-up of the foreign body. In this case,
significant traces of aluminium were located where this material
was not utilised within the manufactured product.

In addition to understanding the failure experienced, it is also
important to understand whether the wider power system and
the incident management process operated correctly.

Fault analysis: incident management process
Typical power system events to be reviewed:
• Did the power system respond to the fault as expected?
  (This can include static switch transfer times.)
• Did parallel UPS systems take up the load on failure?
• Did generator auto-changeover panels operate correctly?
• Did the power system protection co-ordination work correctly?

Typical incident management process review:
• Did the site-based / facilities management staff undertake the
  correct actions during the incident?
• Was the system restored in the correct manner?
• Was the response from the maintainer as per the contractual
  agreement?
• Were parts and trained engineers available to undertake any
  required repairs in a timely manner?

The more detailed the understanding of the fault incident, the
more comprehensive the action plan and, ultimately, the lower
the likelihood of a repeat fault.
Quickly understanding the incident and its potential causes
enables correct decisions to be made and rapidly implemented
to restore designed levels of redundancy. With the site design
redundancy successfully restored, a full analysis of the incident
can be undertaken, enabling any lessons learnt from the event
to be put into a post fault action plan. As faults often involve
multiple failures, of process and/or equipment, this phase can
be lengthy. However, it is important that the review is thorough
and all items are followed up to close out.
Lessons learnt and post fault actions
To minimise the potential for ongoing or repeat disruption to
data centre loads it is essential that any major incidents are
followed up with a thorough investigation and a post incident
action plan.
Whether the fault has resulted in an unexpected loss of
redundancy, exposed a design weakness to certain fault
categories or has simply exposed a requirement for training of
local staff, the improvement requirements need to be captured
and appropriate actions initiated.
In major incidents, it is likely that the root cause failure analysis
has highlighted a number of failures in the site processes and/or
in the critical power infrastructure, whether in the total system
or in individual items of equipment. The lessons learnt could lead
to a number of changes in the management of the site and/or the
equipment, as well as potential system enhancements.
Analysing events in a logical and structured manner will ensure
that all areas are given due consideration. Typical areas that will
need to be covered are detailed below, but these will vary
depending on the fault experienced, the site, and the installed
critical power system.
Process Failures
• Did the incident management process operate correctly?
• Review of contractual support agreements in place
• Review of site and sub-contractor performance during the
incident. Were service level agreements achieved?
• Was data for incident analysis readily available?
  - Site system monitoring
  - Enhanced supplier remote system monitoring
  - Local staff’s ability to analyse alarm and fault conditions
  - Sufficient data available to rapidly analyse the fault and
    restore supplies?
Change in operating practices
• Capacity management process - how is equipment loaded?
• Review of switching processes - were these undertaken
correctly?
• Are adequate change control and subcontractor
management controls in place?
Review of staff competencies
• Technical capability
• Any identified training requirements?
System Failures
Review of system operating performance
• Redundancy
• Fault tolerance performance
Can system resilience be improved?
• Enhancement of site topology / system redundancy
• Power supply configurations / use of parallel feeds to loads
Has system testing been regularly undertaken?
• Generator testing
• Load balancing of parallel UPS
• Full load tests for Generators and UPS systems
• Building Integrated System Testing and Black Building tests
Did the system protection co-ordination operate as
expected?
Review of system cooling performance
Equipment Failures
Are any changes in the planned maintenance regime
required?
• Increased program of maintenance
• Lengthening timeframes for maintenance works to ensure
thoroughness
• Inclusion of a regular deep clean of the local environment
• Re-calibration of equipment
Changes in equipment management practices
• Spares availability and management
• Implementation of manufacturer-recommended equipment
upgrades
• Replacement of consumables (generator fluids, UPS
batteries, capacitors and fans) as per manufacturers’
recommendations
Equipment life cycle analysis
• Is sufficient planning in place for equipment
refurbishment/replacement?
Establishing the contractual support arrangements required to
achieve the correct response to a critical power system failure is
not something that should be assumed or left until an incident
occurs. To ensure the above happens, sites must partner with
suppliers who actively manage their reactive service and make
vital services readily available.
Chloride Services of Emerson Network Power provides the
following to support customers when most required:
• A permanently staffed technical support centre, 24/7
• Comprehensive remote monitoring offering for UPS and
other items of critical power plant
• Strategically managed spares holdings
• Large team of trained critical power UPS, cooling and
generator engineers
• Dedicated UK and factory technical support engineers
• Technical and management escalation process
• Availability of hire generators and UPS
About the author
Mike O'Keeffe
Mike O'Keeffe is the General Manager of the Chloride
Service Business in the United Kingdom. Mike graduated
from North Staffordshire University with a Degree in
Electrical & Electronic Engineering and has spent more
than 20 years in the Electrical Supply Industry. Mike and
his team are responsible for the delivery of all elements of
Service activity in the UK to over 3,500 customers.
Emerson Network Power
Chloride Products & Services
George Curl Way
Southampton
Hampshire
SO18 2RY
T +44 (0)23 8061 0311
F +44 (0)23 8061 0852
E uk.enquiries.chloride@emerson.com
Emerson Network Power.
The global leader in enabling Business-Critical Continuity™.
EmersonNetworkPower.com
Emerson, Business-Critical Continuity and Emerson Network Power are trademarks of Emerson Electric Co. or one of its affiliated companies. ©2011 Emerson Electric Co.