RELIABILITY & AVAILABILITY Within The Facility
presented by Geoff Cope, PE

- Reliability & Availability for Mission Critical Design
- Reliability & Availability Basics
- Reliability & Availability Modeling
- Practical Examples
- Conclusions
- Questions & Answers

Reliability & Availability for Mission Critical Design

A reliability/availability study is used for making comparisons:
- Industry standards
- Infrastructure architecture (making design decisions)
- Infrastructure improvements (before and after)
- Identifying the most important component (the weakest link)
- Helping with financial decisions (reliability vs. cost)

Industry Standard
- The Uptime Institute developed the Tier Classifications.
- Tier I and Tier II may have single points of failure (99% available).
- Tier III may have some single points of failure (99.9% available).
- Tier IV has no single points of failure (99.99% available).
- The white paper was recently modified to include electrical distribution concepts.
- Tier III and Tier IV are provided with dual-corded mechanical systems and redundant pathways.

Infrastructure Architecture
- System Architecture
- UPS Systems
- Stored Energy Systems
- Mechanical Systems
- Single Point of Failure (SPOF)

System Architecture
- Single system (N)
- Multi-unit systems
  - Parallel system (N)
  - Parallel redundant (N+1, N+2, etc.)
  - 2N, or system + system
  - 2(N+1): two parallel redundant systems
  - Isolated redundant system
  - Catcher system
  - Distributed redundant system
- These configurations greatly affect the reliability of a system.

UPS Systems
- Double conversion
- Rotary line interactive
- Off-line
- Low voltage
- Medium voltage
- UPS types do not generally affect the reliability analysis for Tier III and Tier IV systems.
- Reliability does not equal power quality.
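
To give a feel for how much the system architectures listed above (N, N+1, N+2) can change the result, here is a minimal sketch rather than the presenter's model. It assumes identical, independent modules, ignores any shared single point of failure, and uses a made-up module availability of 0.99 with two modules needed to carry the load. The function name `k_out_of_n` and both constants are hypothetical.

```python
from math import comb

# Assumed values for illustration only; they do not come from the presentation.
MODULE_AVAILABILITY = 0.99  # chance that a single module is working
MODULES_NEEDED = 2          # "N": modules required to carry the load


def k_out_of_n(k: int, n: int, a: float) -> float:
    """Probability that at least k of n identical, independent modules work."""
    return sum(comb(n, i) * a**i * (1 - a) ** (n - i) for i in range(k, n + 1))


if __name__ == "__main__":
    for spares in (0, 1, 2):
        label = "N" if spares == 0 else f"N+{spares}"
        value = k_out_of_n(MODULES_NEEDED, MODULES_NEEDED + spares, MODULE_AVAILABILITY)
        print(f"{label}: {value:.6f}")
```

Under these assumptions each added spare module raises the figure noticeably, but the sketch will overstate real systems because paralleled modules usually share components such as a common bus.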

Electrical System Stored Energy
- Chemical batteries (wet cell, dry cell)
- Flywheels
- Super capacitors
- Fuel cells

Mechanical System Stored Energy
- Ice storage
- Chilled water storage

Stored energy approaches may have a significant impact.

Mechanical Systems
- Reliability modeling is not typically used.
- Data is not as readily available for mechanical components.
- Useful models can still be made.
- The reliability level of the mechanical system is significantly lower than that of the electrical system.

A single point of failure (SPOF) is a point within a system whose failure will bring the system's process to a halt. Common SPOFs include:
- EPO systems
- Single-module systems
- Parallel redundant systems
- Automatic transfer switches
- Shared infrastructure space

A facility's reliability depends on the number of single points of failure within the system. If your facility can pass the following tests, then you have a highly reliable facility:
- Shotgun
- Chainsaw
- Fire bomb
- Hand grenade
Eliminate single points of failure to improve reliability.

The only way to survive a catastrophic facility failure is to have:
- Redundant staff
- Data mirroring capabilities
- Geo-redundancy (a redundant facility)

Reliability & Availability Basics

Reliability vs. Availability
757 example: A Boeing 757 is a very reliable aircraft. It may fly halfway around the world two or three times before it is placed in a maintenance shop for one to two weeks. The aircraft is very reliable, but because of the amount of maintenance it requires, it has low availability.

The bathtub curve shows failure behavior over the various regions of a unit's life.
It is a function describing the variation of the hazard rate with time (the life of a unit).

- Reliability: the probability that an item will perform its function under given conditions for a stated time interval. It is not a guarantee!
- Availability: the probability that an item will perform its required function under given conditions at a stated instant in time.
- Steady-state availability: $A = \frac{MTBF}{MTBF + MTTR}$, or equivalently $A = \frac{\mu}{\lambda + \mu}$.

MTTR & MTBF
- MTTR = Mean Time To Repair ($1/\mu$): the mean time to restore the system to an operating condition.
- MTBF = Mean Time Between Failures ($1/\lambda$).
- $\lambda$ = failure rate: the probability, per unit time, that an item will fail in the interval $(t_0, t_0 + \Delta t)$, given that the item is operable at time $t_0$.

MTBF
- MTBF is also the statistical point at which 63% of a large homogeneous population of items will have failed (for a constant failure rate, $1 - e^{-1} \approx 0.63$).
- Example: a homogeneous population of UPSs has an MTBF of 250,000 h. This means that after about 28 years (250,000 h / 8,760 h per year), 63% of the UPSs will have failed to perform their function.
- MTBF has nothing to do with lifetime. MTBF can significantly exceed the lifetime.
- Example: the current world average life expectancy is 66 years. The U.S. has a failure (death) rate of 9.6 per 1,000 persons per year, which corresponds to an MTBF of about 104 years (1,000 / 9.6).

Reliability vs. Availability
- Reliability is time dependent: the longer the time, the lower the reliability, regardless of the system design.
- The availability of a system is always calculated the same way, regardless of past or future events.
- An availability of 99.9% could mean that a facility was down for a single 8.76-hour outage in a year, or that it suffered 525 one-minute outages in a year.

Reliability: Not a Commonly Used Metric for Data Center Discussions
- People are used to hearing about 5-9s (five nines) of availability.
- Reliability numbers are much lower (2-9s).
- Reliability numbers always refer to a stated period of time.
- Reliability numbers will indicate how often a facility's infrastructure requires maintenance.

Availability: Common Misuses of the Availability Metric
- Extracting the mean downtime (MDT) from inherent availability.
- Equating availability to hours of downtime per year. If a facility has 99.999% availability, this implies that its downtime may be 5.26 minutes per year, but a better assumption is that it may suffer one 63-minute outage over the 12-year life expectancy of the facility infrastructure.

System Reliability
- A system is a collection of components arranged according to a certain structure.
- Subsystem: a unit or device that is part of a larger system.
- Reliability modeling:
  - Analyzing components and their relationships.
  - Relating component states (working or failed) to the system state.
- System reliability analysis tools:
  - Reliability Block Diagram (RBD)
  - Fault Tree Analysis (FTA)

Other Factors to Consider
- Maintainability: the probability that an item can be repaired in the interval (0, t). If all single points of failure are eliminated, in most cases the system will be maintainable.
- Scalability: a system should be designed so that its capacity can be increased and the infrastructure can grow with the load.
- Flexibility: design your facility so that the electrical and mechanical systems can be easily reconfigured to adapt to changing technologies.
- Simplicity: keep it simple. The more complex a system is, the less reliable it becomes.

Inherent Availability is used in lieu of Operational Availability
- Considers only downtime for repair of failures
- Excludes unscheduled and scheduled maintenance
- Excludes logistics time: part and technician availability
- Excludes delays related to scheduling
- Is useful as a metric for analyzing the system design, because uncontrolled parameters need to be kept the same.
- By eliminating all of these variations, it is possible to calculate the inherent availability of a system regardless of logistic matters.

Reliability & Availability Modeling

Assumptions: Binary Reliability Analysis
- The system has n components.
- A component may be in only two possible states, working or failed. X_i indicates the state of component i.
- The system may be in only two possible states, working or failed. Φ(X) denotes the state of the system (the structure function).
- The state of the system is completely determined by the states of its components.

Typical System Reliability Models
- Series systems: the system fails if any one component fails.
- Parallel systems: the system works if at least one component is working. Paralleled systems always share a single point of failure.
- Series-parallel systems
- Parallel-series systems
- K-out-of-N systems (NODES). Example: UPS set

Modeling Examples

Existing infrastructure
- 2N UPS
- 2N generator plant
- 2N UPS distribution
- N+2 mechanical system
- Mechanical equipment single corded
- System availability = 99.99%

- Identified single points of failure
- Identified the most important component
- Made recommendations

Proposed infrastructure
- Eliminate single points of failure
- Eliminate the most important component, "ATS 4"
- Dual cord the mechanical equipment
- System availability = 99.995%

Model redundant utilities
- Small improvement in availability
- System availability = 99.997%
- Expensive to implement
- Recommended not to implement

Program requirements
- 2N utility
- 2(N+1) UPS
- N+2 generator plant
- 2N UPS distribution
- N+2 mechanical system
- Dual-corded mechanical equipment
- Modeled system availability = 99.9995%
- Cost: $20,000,000

Distributed redundant option
- 2N utility
- N+1 UPS
- N+2 generator plant
- 2N UPS distribution
- N+2 mechanical system
- Dual-corded mechanical equipment
- Modeled system availability = 99.9992%
- Cost: $16,000,000

Medium voltage design
- 2N utility
- 2N UPS
- N+2 generator plant
- 2N UPS distribution
- N+2 mechanical system
- Dual-corded mechanical equipment
- Modeled system availability = 99.9994%
- Cost: $18,000,000

Mechanical system
- 2N mechanical system
- Modeled system availability = 99.97%

Mechanical & electrical system
- 2N mechanical system with 2N electrical system
- Modeled system availability = 99.92%

RELIABILITY VS COST

Conclusion

Questions & Answers
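
As a rough numerical companion to the reliability-versus-cost comparison, the sketch below sets the modeled availability of the three costed options from the modeling examples against their quoted costs. The availability and cost figures are taken from the slides; the option labels, function names, and the conversion to average minutes of downtime per year are illustrative assumptions. As cautioned earlier in the deck, such a figure is a long-run average, not a prediction of what will happen in any given year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

# (label, modeled availability in percent, cost in USD) as quoted in the
# modeling examples; the labels themselves are shorthand chosen for this sketch.
OPTIONS = [
    ("Program requirements, 2(N+1) UPS", 99.9995, 20_000_000),
    ("Distributed redundant, N+1 UPS", 99.9992, 16_000_000),
    ("Medium voltage, 2N UPS", 99.9994, 18_000_000),
]


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Long-run average downtime implied by an availability figure."""
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


def compare(options):
    """Return (label, cost, average downtime) rows sorted by increasing cost."""
    rows = [(name, cost, downtime_minutes_per_year(avail))
            for name, avail, cost in options]
    return sorted(rows, key=lambda row: row[1])


if __name__ == "__main__":
    for name, cost, downtime in compare(OPTIONS):
        print(f"{name}: ${cost:,} -> {downtime:.2f} min/year on average")
```

Laid out this way, the three options differ by only a minute or two of average annual downtime while their costs differ by millions of dollars, which is the kind of trade-off a reliability study is meant to expose.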