SENG 637 Dependability, Dependability Reliability & Testing of Software Systems Ch t 3: Chapter 3 System S t R Reliability li bilit Department of Electrical & Computer Engineering, University of Calgary B.H. Far (far@ucalgary.ca) http://www.enel.ucalgary.ca/People/far/Lectures/SENG637/ far@ucalgary.ca 1 Dependability Models Dependability models capture conditions that make a system fail in terms of the structural relationships between the system components Dependability models Reliability Graph Fault Tree model Reliability Block Diagram far@ucalgary.ca 2 System Reliability /1 A system usually consists of components Each component consists of sub-components Components may have Different reliability Different dependencies between them System reliability is a function of the reliabilities of the (sub-) (sub ) components and of the relationships between the components far@ucalgary.ca 3 Complexity Concerns For a system consisting of n components Every component can be in one of the two conditions: working or failed How many possible combinations of the status of these n components? How do you calculate the reliability of the system given the probability of each componentt being b i working ki (or ( failed)? f il d)? Reliability Block Diagram can help! 4 Reliability Block Diagram (RBD) Reliability Block Diagram (RBD) is a graphical representation p of how the components p of a system y are connected from reliability point of view. RBD is used to model the various series-parallel and complex block combinations (paths) that result in system success. success Reliability of the system is derived in terms of reliabilities of its individual components. The most common configurations of an RBD are the series and parallel configurations. A system is usually composed of combinations of serial and parallel configurations. configurations RBD analysis is essential for determining reliability, availability and down time of the system. far@ucalgary.ca 5 Reliability Block Diagram (RBD) Reliability Block Diagram helps reliability analysis using a functional diagram to portray and analyze the reliability relationship of components in a system. Each element of a system is represented by a block that is in some way interconnected with or through the other blocks of the system at a desired level of assembly. The basic relationship between components are depicted as lines that may be: Serial: if a single failure results in entire assembly or system failure P ll l if all Parallel: ll redundant d d units i failure f il causes system failure f il Or a combination of these two SENG637 (Winter 2008) far@ucalgary.ca 6 Reliability Block Diagram (RBD) In a serial system configuration, the elements must all ll workk for f the th system t to t workk andd the th system t fails f il if one of the components fails. The overall reliability of a serial system is lower than the reliability of its individual components. In parallel configuration, the components are considered to be redundant and the system will still cease to work if all the parallel components fail. The overall reliability of a parallel system is higher than the reliability of its individual components. far@ucalgary.ca 7 Reliability Block Diagram (RBD) In developing the reliability block diagram, a complete understanding of a system system’ss mission definition and service use profile is required. In addition, the system is broken down into subsystems/ functions that are necessary for mission success. Each essential function is represented as a separate block in the diagram, diagram and blocks are grouped according to series/parallel combinations. Lines are drawn connecting the blocks in a logical order for mission success. RBD can be used to Calculate system reliability Possible single point of failure (SPF) far@ucalgary.ca 8 RBD: Process Steps 1. 2. 3. 4. Define boundary of the system for analysis Break system into functional components Determine serial serial-parallel parallel combinations Represent each components as a separate block in the diagram 5. Draw lines connecting the blocks in a logical order d ffor mission i i success far@ucalgary.ca 9 No redundancy ALL component of the system y are needed to make the system function properly If any one of the components fails, the system fails Example far@ucalgary.ca System block b diagrram Serial System Reliability /1 Power Unit Processor Monitor 10 Serial System Reliability /2 Reliability Block Diagram S t is System i composedd off n independent i d d t serially i ll connected components. Failure of any component has a cross system effect, effect i.e., results in failure of the whole system. R1/1 R2 /2 ... Example Power Unit R1 /1 Processor R2 /2 far@ucalgary.ca Monitor R3 /3 11 Combining Reliabilities /1 Serial system reliability can be calculated from component reliabilities, if the components fail independently of each other. For serial systems: Qp Qp number of components k 1 Rk component p reliabilityy R Rk Components reliabilities (Rk) must be expressed with respect to a common interval. A serial system has always smaller reliability than its components (because Rk 1). far@ucalgary.ca 12 Example: Serial System A system is composed of 3 independent serially connected components R1 = 0.95 R2 = 0.87 R3 = 0.82 Note: all Rs must be given for a common duration, e.g., 10 hours of operation. Rsystem = 0.95 0.87 0.82 = 0.6777 Serial system reliability is smaller than any individual reliability of the components far@ucalgary.ca 14 Parallel System Reliability /1 Syste w System with t Redundancy edu da cy Only ONE out of N ((identical)) components p is needed to make the system function properly System block diagram If ALL of the components fail, the Server 1 Server 2 system fails f il Example far@ucalgary.ca 15 Parallel System Reliability /2 System is composed of n independent components connected in parallel. Failure of all components p results in the failure of the whole system (principle of active redundancy). R1 /1 n Unreliabil ityy Fp((t)) Fi((t)) i1 n i1 i1 Reliabilit y Rp(t) 1 Fi(t) 1 (1 - Ri(t)) far@ucalgary.ca R2 /2 ... n 16 Parallel System Example The system is composed of 2 identical servers connected in parallel R1 = 0.6777 R2 = 0.6777 Server 1 Rs1 /s1 Server 2 Rs2 /s2 Rsystem = 1 – ((1 – 0.6777) (1 – 0.6777)) = 0.8961 Parallel system reliability is greater than any individual reliability of the components far@ucalgary.ca 17 Sequential System Reliability When one component fails, the next one is assigned t fulfill to f lfill the th job j b (e.g., ( t l h telephone switching it hi circuits) i it ) This is done in sequence until no components left li bili ffor such h system: Reliability C1 C2 Ci n 1 R sys (t) i0 λ i t i i! e λ it Cn far@ucalgary.ca 18 Mixed System Reliability /1 Syste w System with t Redundancy edu da cy Only ONE out of N set of components p is needed System block diagram to make the system function properly Po e U Power Unit it Po e Unit Power U it If ALL sets of components fail, the Processor Processor system fails f il Example Monitor far@ucalgary.ca Monitor 19 Mixed System Example Reliability Block Diagram System level Server 1 Rs1 /s1 Server 2 Rs2 /s2 Component level Power Unit R1 //1 Processor R2 //2 Monitor R3 //3 Power Unit R1 /1 Processor R2 /2 Monitor R3 /3 far@ucalgary.ca 20 Various Configurations /1 Parallel-Series configuration Power Unit Power Unit Processor Processor Monitor Monitor far@ucalgary.ca 21 Various Configurations /2 Reliability Block Diagram for Parallel-Series configuration Power Unit R1 //1 Processor R2 //2 Monitor R3 //3 Power Unit R1 /1 Processor R2 /2 Monitor R3 /3 far@ucalgary.ca 22 Parallel--Series System Parallel R11 R12 R1j R1n R21 R22 R2j R2n Ri1 Ri2 Rij Rin Rm1 Rm2 Rmj Rmn path n Path reliabilit y Ri Rij j 1 m m n i1 i1 j 1 System reliabilit y R 1 (1 - Ri) 1 (1 - Rij) far@ucalgary.ca 23 Various Configurations /3 Series-Parallel configuration Power o e Unit U t1 Processor ocesso 1 Monitor o to 2 Bus 1 B 2 Bus Power Unit 2 Processor 1 far@ucalgary.ca Monitor 2 24 Various Configurations /4 Reliability Block Diagram for Series-Parallel configuration Power Unit R1 /1 Processor R2 /2 Monitor R3 /3 Power Unit R1 /1 Processor R2 /2 Monitor R3 /3 far@ucalgary.ca 25 Series--Parallel System Series R11 R12 R1j R1n R21 R22 R2j R2n Ri1 Ri2 Rij Rin Rm1 Rm2 Rmj Rmn subsystem m Subsystem reliabilit y Rj 1 (1 - Rij) i1 m System reliability R R j 1 (1 - Rij) j 1 j 1 i1 n far@ucalgary.ca n 26 Active Redundancy Employs parallel systems. All components are active at the same time. Each component is able to meet the functional requirements of the system. Each component satisfies the minimum reliability condition for the system. Only one component is required to meet the functional requirements of the system. System only fails if all components fail. far@ucalgary.ca 28 m – out of – n System System has n components. At least m components need to work correctly for the system to function properly (m n). m=n: serial i l system m=1: parallel system R1 R2 m/n Ri e.g.: airplane p ew with 4 eengines g es can c fly with only 2 engines. Rn n R sys (t) 1 R(t)i (1 R(t)) ni i0 i m 1 Assumption: All components have the same reliability. far@ucalgary.ca n n! i i! (n i)! 29 RBD: Example /1 Possible single point of failure (SPF) far@ucalgary.ca 30 RBD: Example /2 Possible over redundancy far@ucalgary.ca 31 SPF: How to Avoid? Adopt redundancy Use dissimilar methods – consider id common-cause vulnerability l bilit Adopt a fundamental design change Use equipment i which hi h is i extremely reliable/robust li bl / b Perform frequent Preventive Maintenance/ R l Replacement t repair i before b f failure f il happens h Reduce or eliminate service and/or environmental stresses extreme e treme stress leads to failure fail re far@ucalgary.ca 32 RBD: When and How When to use RBD? How to use RBD? When reliability Wh li bilit off a complex l system t mustt be b calculated l l t d and the reliability-wise weaknesses of the system must be identified. Draw the RDB diagram. Calculate the system reliability using the RBD diagram. diagram Perform calculation such as availability and downtime. There e e are a e a number u be oof auto automated ated tools, too s, integrated teg ated with the other methods, such as Fault Tree Analysis (FTA), to generate the diagram and to analyze it. far@ucalgary.ca 33 RBD: Benefits & Limitations RBD is the simplest way of visualizing the reliability of the complex systems. The benefits of the RBD are: Establishes reliability goals; Evaluates component failure impact on overall system safety; Provides a basis for “what what if if” analysis; Allocates component reliability by calculating system MTBF; Provides cost savings in large system trouble-shooting; Estimates system reliability; Analyzes various system configurations in trade-off studies; Identifies potential design problems; Determines system sensitivity to component failures Disadvantages are that some complex constructs, such as standby, db branching b hi andd load l d sharing, h i etc., cannot be b clearly l l represented using the traditional RBD constructs. far@ucalgary.ca 34 RBD: Conclusion Restricting assumptions: Statistical independence Failures i d independence d Repairs i d independence d far@ucalgary.ca 35 Example 1 In reliability prediction for an assembled system we usually use a “bottoms-up” p approach pp byy estimatingg the failure rate for each subsystem and then combining the failure rates for the entire assembly. The following figure illustrates a system in which the subsystems A, B, C, and D are in a serial configuration Each subsystem is composed of several parts configuration. which are connected as serial or parallel as shown. far@ucalgary.ca 36 Example 1 (cont’d) (cont d) The failure rates of serial parts 1, 2, and 3 are 0.1, 0.3, and 0.5 per hour, respectively. Determine the failure rate and MTTF for subsystem A. A 1 2 3 A 0.1 0.3 0.5 0.9 failures per hour 1 1 MTTFA 1.11 hours 0.9 far@ucalgary.ca 37 Example 1 (cont’d) (cont d) The failure rates of parallel parts 4, 5, and 6 in subsystem B are 0.2, 0.4, and 0.25 per hour, respectively. Determine the failure rate and MTTF for subsystem B. RB e Bt 1 (1 R4 )(1 R5 )(1 R6 ) 0 20 0 40 0 25 RB e Bt 1 (1 e 0.20 )(1 e 0.40 )(1 e 0.25 ) RB e Bt 0.9868 B = 0.0133 failures per hour MTTFB 1 B 75.1879 hours far@ucalgary.ca 38 Example 1 (cont’d) (cont d) The failure rate of parallel parts 7 is constant and is 0.2 per hour. Determine the failure rate and MTTF for subsystem C in which at least 2 out of 3 items must be working. n n i RC 1 R7 i 1 R7 R7 e 0.2 0.8187 i 0 i 3 0 3 1 3 2 RC 1 R7 (1 R7 ) R7 (1 R7 ) 1 (1 R7 )3 3R7 (1 R7 ) 2 1 0 m 1 RC 1 ((1 0.8187))3 3 0.8187(1 ( 0.8187)) 2 =0.9133 RC e t 0.9133 MTTFC 1 C C =1.004 failures per hour 0.9960 hours far@ucalgary.ca 39 Example 1 (cont’d) (cont d) The failure rates of parallel parts 8 and 9 in subsystem b t D are 00.25 25 and d 0.2 0 2 per hhour, respectively. Determine the failure rate and MTTF for subsystem D. D RD e Dt 1 (1 R8 )(1 R9 ) RD e Dt 1 (1 e 0.25 )(1 e 0.20 ) RD e Dt 0.9599 D = 0.0409 failures per hour MTTFD 1 D 24.4498 24 4498 hours h far@ucalgary.ca 40 Example 1 (cont’d) (cont d) Determine the overall failure rate and MTTF for the whole system. system A B C D 0.9 0.0133 1.004 0.0409 system 1.9582 failures per hour y MTTFsystem 1 system y 0.5107 hours far@ucalgary.ca 41 Example 2 For the mixed mode system composed of serial and parallel components shown below, calculate the overall system reliability. li bilit The Th numbers are reliability for each module. SENG521 (Fall 2004) far@ucalgary.ca 42 Example 2 (cont’d) (cont d) R1 = 1 – (1 – 0.9)(1 – 0.8)(1 – 0.9) = 0.998 R2 = 1 – (1 – 0.9)(1 – 0.95) = 0.995 R3 = 1 – (1 – 0.9)(1 – 0.9) = 0.990 R4 = 0.998 0 998 x 0.995 0 995 x 0.99 0 99 = 0.983 0 983 R5 = 0.95 x 0.99 x 0.95 = 0.893 Rsystem = 1 – (1 – 0.983)(1 – 0.893) = 0.998 SENG521 (Fall 2004) far@ucalgary.ca 43 Example 3 The fastening of two mechanical parts in an automobile brake system should be reliable. It is done by means of two flanges which are pressed together with 4 bolt and nut pairs E1 to E4 placed 90 degrees to each other. Experience shows that the fastening holds when at least two opposing bolt and nut pairs work. E3 E2 E1 E4 Example 3 (cont’d) (cont d) a) Draw the reliability block diagram (RBD) for this fi ti off the fixation th system. t Example 3 (cont’d) (cont d) b) Compute reliability of the fixation described in (a) if all bolt and nut pairs have the reliability of R=0.90. R1 = R2 = 0.9 × 0.9 = 0.81 Rsystem = 1 – (1 – R1) (1 – R2) = 1 – (1 – 0.81) 0 81) (1 – 0.81) 0 81) = 0.9639 0 9639 Example 3 Q1 (cont’d) (cont d) c) Name the redundancy type if only any two of the bolt and nuts were sufficient for the required function. Two-out-of-four redundancy. Example 3 (cont’d) (cont d) d) Compute reliability for case (c) if all bolt and nut pairs have the reliability of R=0.90. 4 4 0 4 1 3 R 1 0.9 1 0.9 0.9 1 0.9 1 0 4 3 1 0.1 3.6 0.1 0.9963 Hazard Analysis “A common mistake in engineering, is to put too much confidence in software. There seems to be a feeling among non-software professionals that software will not or cannot fail, fail which leads to complacency and over reliance on computer functions.” Nancy G. Leveson Leveson, N.G., Safeware – System Safety and Computers, Addison-Wesley, 1995. far@ucalgary.ca 49 Hazard Analysis Goal Identify events that may eventually lead to accidents Identify individual elements/operations within a system that render it vulnerable vulnerable…(Single (Single Point Failures) Determine impact on system Techniques FMEA: Failure Modes and Effects Analysis FMECA: Failure Modes, Effects and Criticality Analysis ETA: Event Tree Analysis FTA: Fault Tree Analysis HAZOP: HAZard and OPerability studies far@ucalgary.ca 50 FMEA FMEA is a technique to identify and prioritize how items fail, fail and identify the effects of failure. failure FMEA is used when Designing products or processes, to identify and avoid failure-prone designs. Investigating why existing systems have failed and to identify possible causes. causes Investigating possible solutions, to help select one with an acceptable risk. Planning actions, in order to identify risks in the plan and hence identify countermeasures. far@ucalgary.ca 51 FMEA: Usage Industries that frequently use FMEA: Consumer products: toys/home appliances/home electronics Automotive industry Aero/space: Boeing, NASA Defence industry: DoD Process industries, e.g., oil and gas, chemical plants industrial control systems Software industry? not quite popular yet far@ucalgary.ca 52 FMEA: When? FMEA cannot be done until design has proceeded to the point that system elements have been selected at the level the analysis is to explore in software system after software architecture is finalized (requirement volatility kills FMEA!) FMEA can bbe ddone anytime ti in i the th system t lifetime, from initial design onward far@ucalgary.ca 53 FMEA: Process /1 Define the system to be analyzed, and obtain necessary drawings charts drawings, charts, descriptions descriptions, diagrams diagrams, component lists lists. Break the system down into convenient and logical elements. Know exactly what you’re analyzing; is it an area, activity, equipment? – all of it, or part of it? What targets are to be considered? What mission phases are included? System breakdown can be either functional (according to what the System elements “do”), or geographic/architectural (i.e., according to where the system elements “are”), or both (i.e., functional within the geographic, or vice versa). Establish a coding system to identify system elements. Analyze (FMEA) the elements. far@ucalgary.ca 54 FMEA: Process /2 FMEA Outputs – Spreadsheet documenting each failure mode possible causes possible poss b e co consequences seque ces possible remedies – Usually computer records kept far@ucalgary.ca 55 FMEA Questions Identification: How ( i.e., in what ways) can this element fail (failure modes)? Ramification: What will happen pp to the system and its environment if this element does fail in each of the ways available to it (failure effects)? Prevention: What needs to be done to prevent or mitigate the problem? far@ucalgary.ca 56 FMEA: How to /1 Interface matrix Components Reflects how components of the system are connected & function together Components far@ucalgary.ca 57 FMEA: How to /2 far@ucalgary.ca 58 FMEA Template Sample FMEA template N o . Part Name Part No. Function 1 Position Controller Receive a demand position Failure Mode Mechanisms & Causes of Failure Effect(s) of Failure •Loose cable connection •Incorrect demand signal •Wear and tear •Operator error •Motor fails to move •Position controller breakdown in a long-run Current Control P.R.A. P S D R 2 4 4 4 1 3 8 48 Recommended Corrective Action •Replace faulty wire •QC checked •Intensive training for operators. P = Probabilities (chance) of occurrences S = Seriousness of failure D = Likelihood that the defect will reach the customer R = Risk priority measure (P x S x D) 1 = very low or none 2 = low or minor 3 = moderate or significant 4 = high 5 = very high or catastrophic far@ucalgary.ca 59 Action Taken FMEA: Benefits Discover potential single-point failures. Assesses risk (FMECA) for potential, single-element failures for each identified target, within each mission phase. phase Knowing these things helps to: Optimize O ti i reliability, li bilit hence h mission i i accomplishment. li h t Guide design evaluation and improvement. Guide design of system to “fail fail safe safe” or crash softly softly. Guide design of system to operate satisfactorily using equipment of “low” reliability. Guide component/manufacturer selection. far@ucalgary.ca 60 FMEA: Limitations FMEA will find and summarize system vulnerability i terms in t off SPFs, SPF andd it requires i lots l t off time, ti money, and effort to do so. What if several SPFs are identified? Developing SPF paranoia leading to total system redundification! But SPFs are abundant and we encounter them daily, yyet continue to function. Most system nastiness failures come from complex threats, not from SPFs don’t ignore SPFs, just keep them in perspective. far@ucalgary.ca 61 FMEA: Summary Method: from known or predicted failure modes of components, t determine d t i possible ibl effects ff t on system t Good for hazard identification early in development, by considering possible failures of system functions: loss of function (omission failure) function performed incorrectly function performed when not required (commission failure) No good for concurrent multiple failures No ggood when requirements q change g far@ucalgary.ca 62 Fault Tree Analysis (FTA) Fault tree analysis is a graphical representation of the major faults or critical failures associated with a pproduct, the causes for the faults, and potential countermeasures. FTA helps identify areas of concern for new system design or for improvement of existing systems. It also helps identify corrective actions to correct or mitigate problems. problems FTA can also be defined as a graphic “model” of the pathways within a system that can lead to a foreseeable, undesirable event event. The pathways interconnect contributory events and conditions, using standard logic symbols. Fault tree analysis is useful both in designing new systems, products or services or in dealing with identified problems in existing products/services. As part of process improvement, it can be used to help identify root causes of trouble and to g remedies and countermeasures. design far@ucalgary.ca 63 Steps in FTA FMEA (Failure Modes and Effects Analysis) determines the failure modes that are likely to cause failure events. Then it determines what single i l or multiple l i l point i failures f il could ld produce d those h top level l l events. FMEA asks the question, “What can go wrong?” even if the product meets specification. In order to perform FTA one first needs to perform FMEA. far@ucalgary.ca 65 FTA: Example 1 far@ucalgary.ca 66 FTA: Example 2 Tank overflow AND Inlet open Outlet closed Inlet Valve B OR Inlet valve failed X Controller Outlet Valve A Controller failed Y far@ucalgary.ca Wrong control t inlet to i l t valve l OR AND Sensor X fails Sensor Y fails 67 FTA: Benefits & Limitations Benefits: The primary advantages of fault tree analyses l are the th meaningful i f l data d t they th produce d which hi h allow evaluation and improvement of the overall reliability of the system. system It also evaluates the effectiveness of and need for redundancy. Limitation: The undesired event evaluated must be foreseen and all significant contributors to the failure must be anticipated. This effort may be very time consuming and expensive. Finally, the overall success of the process depends on the skill of the analyst l t involved. i l d far@ucalgary.ca 68 FTA: Summary Method: trace faults stepwise back through system design to possible causes a tree with a top event at the root logic gates at branches, linking each event with its “immediate” causes initiating faults at leaves (eventually) Good for tracing system hazards to component failures, and allocating safety requirements Good for systems with multiple failures Good for checking completeness of safety requirements Can be difficult, time-consuming, hard to maintain far@ucalgary.ca 69 far@ucalgary.ca 70