ME402 MAINTENANCE ENGINEERING MODULE 6 LECTURE NOTES 2 Reliability Engineering 101: Definition, Goals, Techniques How do you evaluate the quality of the products you buy? Traditional quality control in a factory will consist of performing predefined checks and tests. If the product satisfies set requirements, it is deemed good to go. However, you will never say that you bought a quality product if you had to go through the reclamation process two or more times before the warranty period expired. Reliability and reliability engineering help us quantify product quality by adding the dimension of time to the quality equation. In other words, we no longer just want to know if a product can perform its intended function at the moment of purchase. Instead, we want to make sure that the product works without major malfunctions under normal conditions for as long as possible. Reliability engineering does not only help organizations produce more reliable products, but it also informs maintenance teams on how to maintain them to increase MTBF (mean time between failures) and asset lifespan. If you’re interested in learning more, in continuation of this article, we will talk about: • • • • the concept of reliability core principles of reliability engineering the basics of reliability assessment and ways in which reliability engineers can improve equipment reliability What is reliability? Reliability is a term used to describe the ability of a component or system to meet certain performance standards over a certain period of time, assuming normal operating conditions. To put it in another perspective, if we have two systems that operate under the same conditions, the one that works longer with less major hiccups is the more reliable one. Since no one can predict the future and guarantee that a product won’t fail for exactly X hours of use, calculating reliability (see page 8)comes with a dose of uncertainty that is expressed in the form of probability. Among other things, we can use reliability calculation to estimate what is the chance that a system will work properly after x hours or days of use. Naturally, the reliability of any system will be high in the beginning and decline over time. Reliability is often confused with durability, quality, and availability. While the concepts are similar, they should not be used interchangeably. Here’s a short explanation for each. 1|Page ME402 MAINTENANCE ENGINEERING MODULE 6 LECTURE NOTES 2 Reliability vs durability Durability can be defined as the ability of a physical product to remain functional, without requiring excessive maintenance or repair, when faced with the challenges of normal operation over its design lifetime (definition stolen from Tim Cooper). The main difference between reliability and durability is that durability is mostly concerned with how long a product can last despite the breakdowns it survives, while reliability is trying to reduce the overall number and frequency of those breakdowns. Moreover, the durability component is used to describe a characteristic of physical items, while reliability can be used for virtual systems too. Depending on the product and its field of application, durability can be expressed in hours of use, the number of operational cycles, or years of existence. Reliability vs quality Quality is a concept that is hard to define. One popular way to describe it is by looking at the factors that affect product quality. This leads us to the concept of eight dimensions of quality. This is actually an easy way to differentiate between reliability and quality as we can just consider reliability (and durability if you look closer) to be one dimension of quality. If we take reliability as a standalone concept, another way to look at their relationship is by saying that a reliable system is one that keeps his quality over time. 2|Page ME402 MAINTENANCE ENGINEERING MODULE 6 LECTURE NOTES 2 Reliability vs availability Availability shows the percentage of time that a system is available (fully operational) to perform what it is designed to do. The concept is very often used in IT to describe the availability of cloud infrastructure. Systems with the highest availability are in the 99.99% range (which means that a service/system is not available for only ~52 minutes out of the whole year; often just to perform scheduled maintenance). Availability is impacted by reliability and maintainability. More reliable systems will experience fewer failures which will improve their availability. Similarly, the faster you perform scheduled maintenance, the less downtime you will have, which again leads to increased availability. What is reliability engineering? Reliability engineering refers to the systematic application of best engineering practices and techniques to make more reliable products in a cost-effective manner. Reliability engineering methodology can be applied across the product lifecycle: from design and manufacturing to operation and maintenance. That being said, the main value of reliability engineering lies in the early detection of possible reliability issues. If we catch a reliability issue at an early stage of the product lifecycle like the design stage, we can greatly minimize future costs (i.e. by eliminating the need for a significant product redesign after it is already in the market). This idea is represented in the graph below. Image source The goals of reliability engineering are as follows: 3|Page ME402 MAINTENANCE ENGINEERING MODULE 6 LECTURE NOTES 2 1. To use engineering knowledge and techniques to prevent certain failure modes and to reduce the likelihood and frequency of failures. 2. To identify and correct the causes of failures that do occur, despite the efforts to prevent them. 3. To determine ways of dealing with failures that do occur, if their causes have not been corrected. 4. To apply methods for estimating the likely reliability of new designs and for analyzing reliability data. If you look at the list more closely, you will see that the goals are ordered in a way that follows the natural progress of the application of different reliability methods. There is no sense in trying to add redundancies for all identified failures if some of them can be prevented with simple design changes. In other words, the above list represents steps that should be followed in sequential order to ensure reliability practices are applied costeffectively. The basics of reliability assessment The end goal of reliability assessment is to have a robust set of qualitative and quantitative evidence that the use of our component/system will not come with an unacceptable level of risk. It is an integral part of reliability engineering. In this context, risk can be defined as the combination of probability of failure (how likely it is that failure will happen) and failure severity (what is the fallout of the failure; can include safety risk, potential secondary damage, cost of spare parts and labor, production losses, etc.). Understanding failure mechanisms and failure modes It is not always easy to draw the line between cause and failure. If that wasn’t the case, there would be little need for reliability engineers and failure analysis. To understand failure modes and failure mechanisms well enough to address them efficiently, complex systems need to be “broken down” into components. This way you can analyze them on an individual level, as well as based on how they interact with one another. In addition to everything said, the way the system interacts with its user and the environment is another element to add to the list of things that need to be considered as both misuse and poor working conditions can reduce product reliability. 4|Page ME402 MAINTENANCE ENGINEERING MODULE 6 LECTURE NOTES 2 Common tasks and techniques used in reliability engineering Depending on how complex the system is and the type of the system we are looking at, there are a variety of techniques and tasks that can be applied as a part of our reliability engineering efforts: • • • • • • • • • • • • • • • Root cause analysis (RCA) Reliability centered maintenance (RCM) FMEA and FMECA Design FMEA and Process FMEA Physics of failure (PoF) Built-in self-test Reliability block analysis Field data analysis Fault tree analysis Eliminating single point of failure (SPOF) Human error analysis Operational hazard analysis Looking at maintenance history to analyze failure rates and collect failure data All kinds of data collection tests that measure how system/component perform under stress … By using all of these measures, we can find weak points of our system and see what are the chances that these weaknesses might result in malfunctions. If the perceived risk is high enough, we have to deal with them through corrective action. Common solutions come in the form of design changes (e.g. adding redundancy), detection control, maintenance guidelines, and user training. Quantifying reliability As we mentioned in the intro of this article, reliability is often the game of chances (probability). Since you are dealing with percentages and statistical data to define risk, it is very important that the whole team is on the same page and agrees about the acceptable levels of risk that they are trying to achieve. This is why it is very important to use a precise language when describing problems and proposing solutions. Moreover, because of incomplete statistical data and other uncertainties, some reliability professionals recommend focusing on solutions rather than failure chances. 5|Page ME402 MAINTENANCE ENGINEERING MODULE 6 LECTURE NOTES 2 ″ For part/system failures, reliability engineers should concentrate more on the “why and how”, rather than predicting “when”. Understanding “why” a failure has occurred (e.g. due to over-stressed components or manufacturing issues) is far more likely to lead to improvement in the designs and processes used than quantifying “when” a failure is likely to occur (e.g. via determining MTBF). To do this, first the reliability hazards relating to the part/system need to be classified and ordered (based on some form of qualitative and quantitative logic if possible) to allow for more efficient assessment and eventual improvement. “ O’Connor, Patrick D. T. (2002), Practical Reliability Engineering How can reliability engineers improve equipment reliability at their facility? There are several ways in which reliability engineers can help to improve and optimize maintenance processes at their facility that will ultimately result in increased equipment reliability. We discuss a few of them below. Helping with the design and development of spare parts Wear and tear that comes with daily use doesn’t discriminate. Most assets will need to be fitted with spare parts on a regular basis to continue operating in an efficient manner. Companies that have the right resources might opt-in to use CNC machines or 3-D printing to create their own parts instead of constantly restocking their spare parts inventory. Furthermore, they might have an old machine with spare parts that are no longer sold or have to deal with a nasty breakdown that requires a custom part. In these scenarios, reliability engineers can work closely with the maintenance team to design, test, and produce quality replacement parts that will improve the reliability of onsite assets. Performing root cause analysis One thing reliability engineers should be very good at is identifying and understanding failure causes. Because of that, they can be tasked with performing root cause analysis (RCA). They can examine OEM manuals, maintenance practices, equipment maintenance logs, and other documentation to find the reasons why specific machines are failing and suggest how to eliminate and/or mitigate each of the found failure causes. One way to address potential causes is by applying RCM practices. 6|Page ME402 MAINTENANCE ENGINEERING MODULE 6 LECTURE NOTES 2 Making sure maintenance actions address the right failure modes This is an extension of the previous point. Since the last point was concentrated on finding what you are not doing (which failure modes you are not addressing), let’s focus here on what you might be doing wrong. Most companies will find themselves in a situation where they are performing regular maintenance on an asset and that asset is still experiencing breakdowns. While there can be many reasons for that, one of them is that maintenance technicians are doing something wrong – like not addressing the right failure modes. This is where referring to RCA analysis can be very helpful. Similarly, reliability engineers can occasionally check how different maintenance practices are executed and how they can be improved. They can check if the maintenance team is using outdated practices and doing preventive maintenance tasks that add value and address the right problems. All of these should be easily accessible in a good CMMS system. To learn more about CMMS you can check out our guide What is a CMMS and how does it work. Last but not least, reliability engineers can also help with choosing the right condition monitoring sensors and equipment for the implementation of advanced maintenance strategies like Condition-based maintenance and Predictive maintenance. Final thoughts Serious reliability engineering efforts bring serious results. With the right knowledge, reliability techniques can be implemented regardless of the size of your company. Going forward, we hope that organizations will continue to invest in reliability as it helps everyone involved. Production companies benefit from producing better quality products, maintenance teams have less trouble maintaining them, and users have fewer performance issues over the lifespan of their product. It’s a win-win-win situation. Are you a reliability engineer or a maintenance professional and think we missed an important point? Share your thoughts in the comments below. 7|Page ME402 MAINTENANCE ENGINEERING MODULE 6 LECTURE NOTES 2 System Reliability & Availability Calculations A business imperative for companies of all sizes, cloud computing allows organizations to consume IT services on a usage-based subscription model. To evaluate the dependability of a system, the promise of cloud computing depends on two viral metrics: • • Service reliability Service vailability Vendors offer service level agreements (SLAs) to meet specific standards of reliability and availability. An SLA breach not only incurs cost penalty to the vendor but also compromises end-user experience of apps and solutions running on the cloud network. Though reliability and availability are often used interchangeably, they are different concepts in the engineering domain. Let’s explore the distinction between reliability and availability, then move into how both are calculated. What is reliability? Reliability is the probability that a system performs correctly during a specific time duration. During this correct operation: o No repair is required or performed o The system adequately follows the defined performance specifications Reliability follows an exponential failure law, which means that it reduces as the time duration considered for reliability calculations elapses. In other words, reliability of a system 8|Page ME402 MAINTENANCE ENGINEERING MODULE 6 LECTURE NOTES 2 will be high at its initial state of operation and gradually reduce to its lowest magnitude over time. What is availability? Availability refers to the probability that a system performs correctly at a specific time instance (not duration). Interruptions may occur before or after the time instance for which the system’s availability is calculated. The service must: o Be operational o Adequately satisfy the defined specifications at the time of its usage Availability is measured at its steady state, accounting for potential downtime incidents that can (and will) render a service unavailable during its projected usage duration. For example, a 99.999% (Five-9’s) availability refers to 5 minutes and 15 seconds of downtime per year. (Learn more about availability metrics and the 9s of availability.) Incident & service metrics Before discussing how reliability and availability are calculated, let’s understand the incident service metrics used in these calculations. These metrics are computed through extensive experimentation, experience, or industrial standards; they are not observed directly. Therefore, the resulting calculations only provide relatively accurate understanding of system reliability and availability. The graphic, below, and following sections outline the most relevant incident and service metrics: 9|Page ME402 MAINTENANCE ENGINEERING MODULE 6 LECTURE NOTES 2 Failure rate The frequency of component failure per unit time. It is usually denoted by the Greek letter λ (Lambda) and is used to calculate the metrics specified later in this post. In reliability engineering calculations, failure rate is considered as forecasted failure intensity given that the component is fully operational in its initial condition. The formula is given for repairable and non-repairable systems respectively as follows: Repair rate The frequency of successful repair operations performed on a failed component per unit time. It is usually denoted by the Greek letter μ (Mu) and is used to calculate the metrics specified later in this post. Repair rate is defined mathematically as follows: Mean time to failure (MTTF) The average time duration before a non-repairable system component fails. The following formula calculates MTTF: Mean time between failure (MTBF) The average time duration between inherent failures of a repairable system component. The following formulae are used to calculate MTBF: 10 | P a g e ME402 MAINTENANCE ENGINEERING MODULE 6 LECTURE NOTES 2 Mean time to recovery (MTTR) The average time duration to fix a failed component and return to operational state. This metric includes the time spent during the alert and diagnostic process before repair activities are initiated. (The average time solely spent on the repair process is called mean time to repair.) Mean time to detection (MTTD) The average time elapsed between the occurrence of a component failure and its detection. Reliability and availability calculations The calculations below are computed for reliability and availability attributes of an individual component. The failure rate can be used interchangeably with MTTF and MTBF as per calculations described earlier. Reliability is calculated as an exponentially decaying probability function which depends on the failure rate. Since failure rate may not remain constant over the operational lifecycle of a component, the average time-based quantities such as MTTF or MTBF can also be used to calculate Reliability. The mathematical function is specified as: Availability determines the instantaneous performance of a component at any given time based on time duration between its failure and recovery. Availability is calculated using the following formula: 11 | P a g e ME402 MAINTENANCE ENGINEERING MODULE 6 LECTURE NOTES 2 Calculation of multi-component systems IT systems contain multiple components connected as a complex architectural. The effective reliability and availability of the system depends on the specifications of individual components, network configurations, and redundancy models. The configuration can be series, parallel, or a hybrid of series and parallel connections between system components. Redundancy models can account for failures of internal system components and therefore change the effective system reliability and availability performance. A reliability block diagram (RBD) may be used to demonstrate the interconnection between individual components. Alternatively, analytical methods can also be used to perform these calculations for large scale and complex networks. RBD demonstrating a hybrid mix of series and parallel connections between system components is provided: The basics of an RBD methodology are highlighted below. Using failure rates The effective failure rates are used to compute reliability and availability of the system using these formulae: • For series connected components, the effective failure rate is determined as the sum of failure rates of each component. o For N series-connected components: • For parallel connected components, MTTF is determined as the reciprocal sum of failure rates of each system component. o For N parallel-connected components: 12 | P a g e ME402 MAINTENANCE ENGINEERING MODULE 6 LECTURE NOTES 2 For hybrid systems, the connections may be reduced to series or parallel configurations first. Using availability and reliability specifications Calculate reliability and availability of each component individually. For series connected components, compute the product of all component values. For N series-connected components. • For parallel connected components, use the formula: o For N parallel-connected components. • For hybrid connected components, reduce the calculations to series or parallel configurations first. It can be observed that the reliability and availability of a series-connected network of components is lower than the specifications of individual components. For example, two components with 99% availability connect in series to yield 98.01% availability. The converse is true for parallel combination model. If one component has 99% availability specifications, then two components combine in parallel to yield 99.99% availability; and four components in parallel connection yield 99.9999% availability. Adding redundant components to the network further increases the reliability and availability performance. It’s important to note a few caveats regarding these incident metrics and the associated reliability and availability calculations. These metrics may be perceived in relative terms. Failure may be defined differently for the same components in different applications, use cases, and organizations. 13 | P a g e ME402 MAINTENANCE ENGINEERING MODULE 6 LECTURE NOTES 2 The value of metrics such as MTTF, MTTR, MTBF, and MTTD are averages observed in experimentation under controlled or specific environments. These measurements may not hold consistently in real-world applications. Organizations should therefore map system reliability and availability calculations to business value and end-user experience. Decisions may require strategic trade-offs with cost, performance and, security, and decision makers will need to ask questions beyond the system dependability metrics and specifications followed by IT departments. Additional resources The following literature was referenced for system reliability and availability calculations described in this article: Johnson, Barry. (1988). Design & analysis of fault tolerant digital systems. Chapters 1-4. Johnson, Barry. (1996). An introduction to the design and analysis of fault-tolerant systems. 1-87. ______________________________________________________________ About the author Muhammad Raza Muhammad Raza is a Stockholm-based technology consultant working with leading startups and Fortune 500 firms on thought leadership branding projects across DevOps, Cloud, Security and IoT. 14 | P a g e