RELIABILITY & AVAILABILITY Within The Facility
presented by Geoff Cope, PE

• Reliability & Availability for Mission Critical Design
• Reliability & Availability Basics
• Reliability & Availability Modeling
• Practical Examples
• Conclusions
• Questions & Answers

Reliability & Availability for Mission Critical Design

A reliability/availability study is used for making comparisons:
• Industry standards
• Infrastructure architecture (making design decisions)
• Infrastructure improvements (before and after)
• Identifying the most important component (weakest link)
• Helping with financial decisions (reliability vs. cost)

Industry Standard: “Uptime Institute”
• Developed the Tier Classifications
• Tier I & Tier II may have single points of failure (99% available)
• Tier III may have some single points of failure (99.9% available)
• Tier IV has no single points of failure (99.99% available)
• Recently modified its white paper to include electrical distribution concepts
• Tier III & Tier IV are provided with dual-corded mechanical systems and redundant pathways

Infrastructure Architecture
• System Architecture
• UPS Systems
• Stored Energy Systems
• Mechanical Systems
• Single Point of Failure (SPOF)

System Architecture
• Single System (N)
• Multi-unit systems
• Parallel System (N)
• Parallel Redundant (N+1, N+2, etc.)
• 2N or System + System
• 2(N+1): two parallel redundant systems
• Isolated redundant system
• Catcher System
• Distributed Redundant System
These configurations greatly affect the reliability of a system.

UPS Systems
• Double Conversion
• Rotary
• Line Interactive
• Off-Line
• Low Voltage
• Medium Voltage
UPS types do not generally affect the reliability analysis for Tier III & Tier IV systems. Reliability does not equal power quality.
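To put the Tier availability figures quoted on the Industry Standard slide on a more tangible scale, the sketch below converts each percentage into average hours of downtime per year. It assumes an 8,760-hour year and treats the result as a long-run average, not a prediction for any single year; the function and variable names are illustrative only.

```python
HOURS_PER_YEAR = 8760  # 365 days; leap years ignored (assumption)


def annual_downtime_hours(availability_percent):
    """Average downtime per year implied by a steady-state availability."""
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


# Availability figures as quoted on the Tier slide.
tiers = {"Tier I/II": 99.0, "Tier III": 99.9, "Tier IV": 99.99}
downtime = {name: annual_downtime_hours(a) for name, a in tiers.items()}

for name, hours in downtime.items():
    print(f"{name}: about {hours:.2f} h/year on average")
```

Each added "nine" cuts the average downtime by a factor of ten, which is why small differences in the percentage matter when comparing designs.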
Electrical System Stored Energy
• Chemical Batteries (wet cell, dry cell)
• Flywheels
• Super Capacitors
• Fuel Cells

Mechanical System Stored Energy
• Ice Storage
• Chilled Water Storage
Stored energy approaches may have a significant impact.

Mechanical Systems
• Reliability modeling is not typically used
• Data is not as readily available for mechanical components
• Useful models can be made
• The reliability level of the mechanical system is significantly lower than that of the electrical system

• A single point of failure (SPOF) is a point within a system whose failure will bring the system’s process to a halt.
• Common SPOFs include:
  - EPO systems
  - Single module systems
  - Parallel redundant systems
  - Automatic transfer switches
  - Shared infrastructure space

• A facility’s reliability depends on the number of single points of failure within a system.
• If your facility can pass the following tests, then you have a highly reliable facility:
  - Shotgun
  - Chainsaw
  - Fire bomb
  - Hand grenade
• Eliminate single points of failure to improve reliability.

• The only way to survive a catastrophic facility failure is to have:
  - Redundant Staff
  - Data Mirroring Capabilities
  - Geo-Redundancy (a redundant facility)

Reliability & Availability Basics

Reliability vs. Availability
• 757 example: A Boeing 757 is a very reliable aircraft. The jet may fly halfway around the world two or three times and is then placed in a maintenance shop for one to two weeks. The aircraft is very reliable, but because of the amount of required maintenance it has a low availability.

• The bathtub curve shows the failure rate over the various regions of a unit's life.
It is a function describing the variation of the hazard rate with time (the life of a unit).

• Reliability: the probability that an item will perform its function under given conditions for a stated time interval. It is not a guarantee!!
• Availability: the probability that an item will perform its required function under given conditions at a stated instant in time.

Steady-State Availability:
A = MTBF / (MTBF + MTTR)
or:
A = μ / (λ + μ)

MTTR & MTBF
• MTTR = Mean Time To Repair (1/μ): the mean time to restore the system to an operating condition
• MTBF = Mean Time Between Failures (1/λ)
• λ = failure rate (the probability that an item will fail in the interval (t0, t0+Δt), given that the item is operable at time t0)

MTBF
• MTBF is also the statistical point at which 63% of a large homogeneous population of items will have failed
• Example: a homogeneous population of UPSs has an MTBF of 250,000 h. This means that in about 28 years, 63% of the UPSs will have failed to perform their function
• MTBF has nothing to do with lifetime. MTBF significantly exceeds the lifetime.
• Example: The current world average life expectancy is 66 years. The U.S. has a failure (death) rate of 9.6 per 1,000 persons per year, and therefore an MTBF of 104 years!!

Reliability vs. Availability
• Reliability is time dependent
• The longer the time, the lower the reliability, regardless of the system design
• The availability of a system is always calculated the same way, regardless of past or future events
• An availability of 99.9% could mean that a facility was down for 8.76 hours in a year, or that it suffered 525 one-minute outages in a year

Reliability: Not a Commonly Used Metric for Data Center Discussions
• People are used to hearing about 5-9s of availability
• Reliability numbers are much lower (2-9s)
• Reliability numbers depend directly on the period of time considered
• Reliability numbers indicate how often a facility’s infrastructure requires maintenance

Availability: Common Misuses of the Availability Metric
• Extracting the mean downtime (MDT) from inherent availability
• Equating availability with hours of downtime per year. If a facility has 99.999% availability, this implies that the facility's downtime may be 5.26 minutes per year, but a better assumption would be that it may suffer one 63-minute outage over the 12-year life expectancy of the facility infrastructure.

System Reliability
• A system is a collection of components arranged according to a certain structure
• Subsystem: a unit or device that is part of a larger system
• Reliability modeling:
  - Analyzing components and their relationships
  - Relating component states (working or failed) to the system state
• System reliability analysis tools:
  - Reliability Block Diagram (RBD)
  - Fault Tree Analysis (FTA)

Other Factors to Consider
• Maintainability: the probability that an item can be repaired in the interval (0, t). If all single points of failure are eliminated, in most cases the system will be maintainable.
• Scalability: a system should be designed so that its capacity can be increased and the infrastructure can grow with the load.
• Flexibility: design your facility so that the electrical and mechanical systems can be easily reconfigured to adapt to changing technologies.
• Simplicity: keep it simple. The more complex a system is, the more unreliable it will become.
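The arithmetic behind the MTBF and availability slides can be checked with a short sketch. It assumes a constant failure rate (the flat middle of the bathtub curve), so that reliability follows R(t) = exp(-t / MTBF); the 8-hour MTTR is a hypothetical value chosen only to exercise the steady-state formula, and all names are illustrative.

```python
import math

HOURS_PER_YEAR = 8760  # 365 days; leap years ignored (assumption)


def reliability(t_hours, mtbf_hours):
    """Probability of running t_hours without failure, assuming a constant failure rate."""
    return math.exp(-t_hours / mtbf_hours)


def steady_state_availability(mtbf_hours, mttr_hours):
    """A = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# UPS example from the slides: MTBF of 250,000 h.
ups_mtbf = 250_000
ups_mtbf_years = ups_mtbf / HOURS_PER_YEAR            # about 28.5 years
failed_at_mtbf = 1 - reliability(ups_mtbf, ups_mtbf)  # 1 - 1/e, about 63%

# Population example: 9.6 deaths per 1,000 persons per year.
population_mtbf_years = 1 / (9.6 / 1000)              # about 104 years

# Hypothetical 8 h MTTR (not from the slides) with the UPS MTBF.
ups_availability = steady_state_availability(ups_mtbf, 8)

# 99.999% availability accumulated over a 12-year infrastructure life.
downtime_minutes_12y = (1 - 0.99999) * 12 * HOURS_PER_YEAR * 60  # about 63 minutes

print(f"{failed_at_mtbf:.1%} failed by t = MTBF ({ups_mtbf_years:.1f} years)")
print(f"Population MTBF: {population_mtbf_years:.0f} years")
print(f"UPS availability with 8 h MTTR: {ups_availability:.6f}")
print(f"Downtime over 12 years at 99.999%: {downtime_minutes_12y:.0f} minutes")
```

The 63% figure is simply 1 - 1/e, which is why it holds for any MTBF under the constant-failure-rate assumption.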
• Inherent availability is used in lieu of operational availability
• Considers only downtime for repair of failures
• Excludes unscheduled and scheduled maintenance
• Excludes logistics time: part and technician availability
• Excludes delays related to schedule
• Is useful as a metric for analyzing the system design, as uncontrolled parameters need to be kept the same
• By eliminating all of these variations, it is possible to calculate the inherent availability of a system regardless of all logistics matters

Reliability & Availability Modeling

Assumptions: Binary Reliability Analysis
• The system has n components.
• A component may only be in two possible states, working or failed. Xi indicates the state of component i.
• The system may only be in two possible states, working or failed. Φ(X) denotes the state of the system (the structure function). The state of the system is completely determined by the states of the components.

Typical System Reliability Models
• Series systems: the system fails if any one component fails.
• Parallel systems: the system works if at least one component is working. Paralleled systems always share a single point of failure.
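The series and parallel models above can be sketched in a few lines. The structure functions take a list of binary component states (1 = working, 0 = failed); the availability functions additionally assume the components fail independently, which is an assumption of this sketch and not a claim from the slides. The numeric values are hypothetical.

```python
from math import prod


def series_state(x):
    """Structure function of a series system: works only if every component works."""
    return int(all(x))


def parallel_state(x):
    """Structure function of a parallel system: works if at least one component works."""
    return int(any(x))


def series_availability(avails):
    """Availability of independent components in series."""
    return prod(avails)


def parallel_availability(avails):
    """Availability of independent components in parallel."""
    return 1 - prod(1 - a for a in avails)


# Hypothetical numbers: two 99%-available modules in parallel,
# feeding one shared 99.9%-available output bus (the shared single point of failure).
modules = parallel_availability([0.99, 0.99])    # about 0.9999
with_bus = series_availability([modules, 0.999])  # about 0.9989

print(series_state([1, 1, 0]), parallel_state([1, 1, 0]))
print(f"Parallel pair: {modules:.4f}, with shared bus: {with_bus:.4f}")
```

In this example the shared bus dominates the result, which illustrates why a paralleled system is only as good as the single point of failure it shares.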
Typical System Reliability Models
• Series-parallel systems
• Parallel-series systems

Typical System Reliability Models
• K-out-of-N systems (NODES)
• Example: UPS Set

Modeling Examples

Existing infrastructure
• 2N UPS
• 2N Generator Plant
• 2N UPS Distribution
• N+2 Mechanical System
• Mechanical equipment single corded
• System Availability = 99.99%

• Identified single points of failure
• Identified most important component
• Made recommendations

Proposed infrastructure
• Eliminate single points of failure
• Eliminate most important component “ATS 4”
• Dual cord the mechanical equipment
• System Availability = 99.995%

Model Redundant Utilities
• Small improvement in availability
• System Availability = 99.997%
• Expensive to implement
• Recommended not to implement

Program Requirements
• 2N Utility
• 2(N+1) UPS
• N+2 Generator Plant
• 2N UPS Distribution
• N+2 Mechanical System
• Dual Corded Mechanical Equipment
• Modeled System Availability = 99.9995%
• Cost: $20,000,000

Distributed Redundant Option
• 2N Utility
• N+1 UPS
• N+2 Generator Plant
• 2N UPS Distribution
• N+2 Mechanical System
• Dual Corded Mechanical Equipment
• Modeled System Availability = 99.9992%
• Cost: $16,000,000

Medium Voltage Design
• 2N Utility
• 2N UPS
• N+2 Generator Plant
• 2N UPS Distribution
• N+2 Mechanical System
• Dual Corded Mechanical Equipment
• Modeled System Availability = 99.9994%
• Cost: $18,000,000

Mechanical System
• 2N Mechanical system
• Modeled System Availability = 99.97%

Mech. & Elec. System
• 2N Mechanical system with 2N Electrical system
• Modeled System Availability = 99.92%

RELIABILITY VS COST

Conclusion

Questions & Answers
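As a rough aid to the reliability-vs-cost comparison, the three costed options in the Modeling Examples can be placed side by side by converting each modeled availability into average downtime per year. The availability and cost figures are those quoted in the slides; the 8,760-hour year and all names in the code are assumptions of this sketch, and the downtime is a long-run average, not a forecast.

```python
MINUTES_PER_YEAR = 8760 * 60  # 365 days; leap years ignored (assumption)

# (modeled availability in percent, cost in USD) as quoted in the Modeling Examples.
options = {
    "Program Requirements": (99.9995, 20_000_000),
    "Distributed Redundant Option": (99.9992, 16_000_000),
    "Medium Voltage Design": (99.9994, 18_000_000),
}


def downtime_minutes_per_year(availability_percent):
    """Average minutes of downtime per year implied by an availability."""
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


summary = {
    name: (downtime_minutes_per_year(avail), cost)
    for name, (avail, cost) in options.items()
}

# List the options from cheapest to most expensive.
for name, (minutes, cost) in sorted(summary.items(), key=lambda kv: kv[1][1]):
    print(f"{name}: about {minutes:.1f} min/year average downtime, ${cost:,}")
```

Seen this way, the $4,000,000 difference between the cheapest and most expensive option buys roughly a minute and a half less average downtime per year.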
