Aviation reliability: Programs & calculation Chapter 19 text March 16 393SYS 1 Reliability Definition (in statistical term): ‘the probability of failure free operation of an item in a specified environment for a specified amount of time’ Examples: If eight delays and cancellations are experienced in 200 flights, that means 96% of flights dispatched on time for the airline. Effective February 15, 2007, the FAA ruled that US-registered ETOPS-207 operators can fly over most of the world provided that the IFSD rate is 1 in 100,000 engine hours. This limit is more stringent than ETOPS-180 (2 in 100,000 engine hours). March 16 393SYS 2 Two main approaches of reliability in the aviation industry First approach is the overall airline reliability, essential means the dispatch reliability, that is, how often the airline achieves an on-time departure of its scheduled flights. The reasons of delay are categorized as maintenance, procedures, personnel, flight operations, air traffic control (ATC). etc. Second approach is to consider reliability as programs specifically designed to address the problems of maintenancewhether or not they cause delays and provide analysis of and corrective actions for those items to provide the overall reliability of equipment. This contributes to the dispatch reliability as well as the overall operation. March 16 393SYS 3 Reliability Program (for maintenance) A set of rules and practices for managing and controlling a maintenance program. The main function is to monitor the performance of the vehicles and their associated equipment and call attention to any need for corrective action. Additional functions: Monitor the effectiveness of those corrective actions Provide data to justify adjusting the maintenance interval or maintenance program procedure as appropriate March 16 393SYS 4 Maintenance programs have four types of reliability Statistical reliability Historical reliability Event-oriented reliability Dispatch reliability March 16 393SYS 5 Statistical reliability Based upon collection and analysis of ‘events’ such as failure, removal, and repair rates of systems or components. March 16 393SYS 6 Historical reliability Comparison of current event rates with those of past experience. Commonly used when new equipment is introduced and no established statistic is available. March 16 393SYS 7 Event-oriented reliability Events like bird strikes, hard landing, in-flight shutdowns (IFSD), lighting strikes or other accidents that do not occur on a regular basis and therefore produce no useable statistical or historical data. In ETOPS, FAA designated certain events to be tracked as ‘event-oriented reliability program’. Each occurrence of the events must be investigated to determinate the cause to prevent recurrence. IFSD causes; for example: due to flameout, internal failure, crew-initiated shutoff, foreign object ingestion, icing, inability to obtain and/or control desired thrust. March 16 393SYS 8 Dispatch reliability Measurement of an airline operation respect to on-line departure. It receives considerable attention from regulatory authorities(e.g. FAA), airlines and passengers. Actually, it is just a special form of the event-oriented reliability approach. March 16 393SYS 9 Danger of misinterpreted reliability data (1) A pilot experienced a rudder control problem and called in two hours from arriving an airport. He writes up the problem in the aircraft logbook and reports it by radio to the flight operation unit at the airport. Upon arrival, the maintenance crew check the log and find the write-up and begin troubleshooting. The repair actions take a little longer then scheduled turnaround time and cause delay. Since maintenance is at work and rudder is the problem, the delay is charged to the maintenance department. If the pilot and the flight operation unit knew the problem and informed the maintenance two hours before landing, the maintenance people can spent the time prior to landing to perform troubleshooting analysis and the delay could have been prevented. So, an alter in airline procedure can avoid the delay. A good reliability program should avoided same delay in the future by altering the procedure, not regardless of who or what is to blame. March 16 393SYS 10 Danger of misinterpreted reliability data (2) If there were 12 write-ups of rudder problems during the month and only one of them caused a delay, there is actually two problems to investigate. 1. The delay, which may/or may not be caused by rudder the problems 2. The 12 rudder write-ups that may ,in fact, be related to an underlying maintenance problem. Dispatch delay constitutes one problem and the rudder system malfunction constitutes another. They may overlap but they are two different problems. Delay is a event-oriented reliability that must be investigated on its own; the 12 rudder problems should be addressed by the statistical (or historical) reliability problem separately. March 16 393SYS 11 Elements of a Reliability Program 1. 2. 3. 4. 5. 6. 7. March 16 Data collection Problem area alerting Data display Data analysis Corrective actions Follow-up analysis Monthly report 393SYS 12 Data Collection: allows operator to compare present performance with the past, typical data type are: 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. Flight time and cycle for each aircraft Cancellations and delays over 15 minutes Unscheduled component removals Unscheduled engine removals In-flight shutdowns of engines Pilot reports or logbook write-ups Cabin logbook write-up Component failures (shop maintenance) Maintenance check package findings Critical failures March 16 393SYS 13 Problem detection: alerting systems alerting systems for quick identify areas where performance is significantly different from normal so that possible problems can be investigated. Standards for event rates are set according to past performance. March 16 393SYS 14 Problem detection 2: setting & adjusting alert levels alert levels recalculation (yearly) and filtering of false alarms March 16 393SYS 15 Quality Control Charts and the Seven Run Rule A control chart is a graphic display of data that illustrates the results of a process over time. It helps prevent defects and allows you to determine whether a process is in control or out of control The seven run rule states that if seven data points in a row are all below the mean, above the mean, or increasing or decreasing, then the process needs to be examined for non-random problems March 16 393SYS 16 Control Chart of 12” ruler March 16 393SYS 17 Control Chart contiu. The output of a production process will fluctuate. The causes of fluctuation can just be random or non-random due to desirable/undesirable process change. Control charts graph and measure process data against control limits. Control charts can distinguish the random variation from assignable causes or nonrandom causes. We cannot adjust random variation out of a process. Process adjustments for random variation are neither necessary nor desirable. This is over-adjustment or tempering, and it makes the process worse. We can and must investigate assignable causes (or non-random causes). Points outside the control limits are evidence of process problems. Analyst must investigate every out of control point for an assignable cause. They must record their findings and any corrective actions. For example, a tool adjustment, or change in Formal Technical Review format or worn tooling, may correct the problem. March 16 393SYS 18 Pattern analyzing of Control Chart 7-Run rule 7-run-rule is used to filter out the random variation in a production process. shows the ‘trends’ that are caused by the ‘assignable causes’ or non-random causes that required investigation and possible corrective action to be taken. 7-run-rule pattern: seven points above mean value; seven points below mean value; seven points or all increasing ; or seven points all decreasing the patterns are indicators of non-random problems which can be symptom of process out of control. March 16 393SYS 19 To develop a Control Chart to determine project stability Plot individual metric values on a chart. Compute the mean value for the metrics value and plot the line. Plot the Upper Control Limit and Lower Control Limit. Compute a standard deviation as (Upper-control-limit mean)/3. Plot lines one and two standard deviation above and below Am. If any of the standard deviation lines is less than 0.0, it need not be plotted unless the metric being evaluated takes on values that are less than 0.0. The Std Dev.# is then plotted on the control chart. March 16 393SYS 20 Other Pattern analyzing of Control Chart 7 run rule is just another method to filter false alarms. Other pattern analyzing methods include but not limited to : metric value lay outside UCL or LCL 2 out of 3 successive metrics values lay more than 2 standard deviations away from the mean; 4 out of 5 successive metrics values lay more than 1 standard deviations away from the mean; others… March 16 393SYS 21 Reliability : Basic Calculation & Application March 16 393SYS 22 CHARACTERIZING FAILURE OCCURRENCES IN TIME Four general ways: 1. time of failure 2. time interval between failures 3. cumulative failures experienced up to a given time 4. failures experienced in a time interval March 16 393SYS 23 Time-based failure specification Failure number March 16 Failure time (sec) Failure interval (sec) 1 10 2 19 9 3 32 13 4 43 11 10 393SYS 24 Failure-based failure specification Time (sec) March 16 Cumulative failures Failure in interval 30 2 2 60 5 3 90 7 2 120 8 1 393SYS 25 TABLE 3 Typical probability distribution of failures Value of random variable (failures in time period) Probability 0 0.10 0 1 0.18 0.18 2 0.22 0.44 3 0.16 0.48 4 0.11 0.44 5 0.08 0.40 6 0.05 0.30 7 0.04 0.28 8 0.03 0.24 9 0.02 0.18 10 0.01 0.1 3.04 Mean failures March 16 Product of value and probability 393SYS 26 TIME VARIATION Mean value function - represents the average cumulative failures associated with each time point. Failure intensity function - is the rate of change of the mean value function or the no. of failures per unit time. March 16 393SYS 27 Probability distributions at times tA and tB Value of random variable (failures in time period) Probability Elapsed time ta = 1 hr Elapsed time tB = 5 hr 0 0.10 0.01 1 0.18 0.02 2 0.22 0.03 3 0.16 0.04 4 0.11 0.05 5 0.08 0.07 6 0.05 0.09 7 0.04 0.12 8 0.03 0.16 9 0.02 0.13 10 0.01 0.10 11 0 0.07 12 0 0.05 13 0 0.03 14 0 0.02 15 0 0.01 Mean failures March 16 3.04 393SYS 7.77 28 Mean Failure function F(t) 5 failure intensity function f(t) Time tA = 1 Time tB =5 10 Failure intensity (failures/hr) Mean Failures 10 Time Figure showing the Mean value function and failure intensity function March 16 393SYS 29 f(x) -- probability density function F(x) -- probability cumulative distribution function eg : When we toss three coins, there will be eight events have: If we set x is the number of coins which head side face up, we Event HHH HHT x 3 2 HTH 2 THH HTT THT TTH TTT 1 1 1 0 2 H = Head side up; T = Tail side up : X 0 1 2 3 f(x) 1/8 3/8 3/8 1/8 F(x) 1/8 1/2 7/8 1 March 16 393SYS 30 DISCRETE FAILURE FUNCTION f(t), the failure density function over a time interval [t1, t2] and is defined as the ratio of the number of failures occurring in the interval to the size of the original population, divided by the length of the time interval: f (t ) [n( t 1) n( t 2 )] / N ( t 2 t 1) Where n(t) is the number of the fault survivors at time t The f(t) can measure the overall speed at which failures are occurring March 16 393SYS 31 DISCRETE HAZARD FUNCTION Z(t), the failure rate, or the Hazard function is the probability that a failure occurs in some time interval [t1, t2], given that the system has survived up to time t. It is the ratio of the number of failures occurring in the interval to the size of the original population, divided by the length of the time interval: [n(t1) n(t2)]/ n(t1) Z(t) (t2 t1) The Z(t) can measure the instantaneous speed of failure March 16 393SYS 32 FAILURE CURVES March 16 393SYS 33 PROBABILITY OF SUCCESS F(t) is the probability of failure (= Cumulative Distribution) R(t) is the probability of success (= Reliability) F(t) + R(t) = 1 March 16 393SYS 34 DISCRETE FUNCTION EXAMPLE Failure data for 10 hypothetical electrical components Failure Number Operating time, h 1 8 2 20 3 34 4 46 5 63 6 86 7 111 8 141 9 186 10 266 March 16 393SYS 35 [n ( t 1) n ( t 2)] / N f (t ) (t 2 Z(t) Interval t1 to t2 Failure density/hr f(t) [10-9]/10 / (8-0) = 1//10/8 = 0.0125 0-8 Hazard rate/hr Z(t) [10-9]/10 / (8-0) = 1//10/8 = 0.0125 F(t) 1/10 [n(t1) n(t2)]/ n(t1) (t2 t1) R(t) MTTF 9/10 t 1) 8*10 =80 Overall MTTF 8/1 *10 = 80 8-20 [9-8]/10/(20-8) = 1/10/12 = 0.0083 [9-8]/9/(20-8) = 1/9/12 = 0.093 2/10 8/10 12 *9 =108 20/2 *10 = 100 20-34 [8-7]/10/(34-20) = 1/10/14 = 0.0071 March 16 [8-7]/8/(34-20) = 1/8/14 = 0.089 393SYS 3/10 7/10 14*8 34/3 *10 =112 = 113 36 Achieving a reliable system ref. Ian Summerville, 7e Ch20 Three basic strategies to achieve reliability Fault Avoidance Fault Tolerance Build fault-free systems from the start Build facilities into the system to let the system continue when faults cause system failures Fault Detection Use software validation techniques to discover faults prior to the system being put into operation For most systems, fault avoidance and fault detection suffice to provide the required level of reliability March 16 393SYS 37 Implementing Fault Avoidance Availability of a formal and unambiguous system specification Adoption of a quality philosophy by developers. Developers should be expected to write bug-free systems … March 16 393SYS 38 Implementing Fault Tolerance Even if somehow we build a fault-free system, we still need fault-tolerance in critical systems Fault-free does not mean failure-free Fault-free means that the system correctly meets its specifications Specifications may be incomplete or faulty or unaware of a requirement of the environment Can never conclusively prove that a system is fault-free March 16 393SYS 39 Aspects of Fault Tolerance Failure Detection Damage Assessment System must detect what damage the system failure has caused Fault Recovery System must be able to detect that the current state of the system has caused a failure or will cause a failure System must change the state of the system to a known “safe” state Can correct the damaged state (forward error recovery - harder) Can restore to a previous known “safe” state (backwards error recovery - easier) Fault Repair March 16 Modifying the system so that the failure does not recur Many software failures are transient and need no repair and normal processing can resume after fault recovery 393SYS 40 Implementing Fault Tolerance Hardware - Triple-Modular Redundancy (TMR) Hardware unit is replicated three (or more) times Output is compared from three units If one unit fails, its output is ignored Space Shuttle is a classic example Machine 1 Output Comparator Machine 2 Machine 3 March 16 393SYS 41 Implementing Fault Tolerance (2) Using Software N-Version programming Have multiple teams build different versions of the software and then execute them in parallel Assumes teams are unlikely to make the same mistakes Not necessarily a valid assumption, if teams all work from the same specification … March 16 393SYS 42 N-Version Programming Commonly used approach in railway signaling, aircraft systems & reactor protection system March 16 393SYS 43 System Configuration for Failure Event Diagram Divide system into a hierarchy set of components. The reliability of the components should be known or is easy to estimate or measure. Each component represented as a switch. If component is functioning, the switch is viewed as CLOSED and if not functioning, as OPEN. System success occurs if there is a continuous path through the configuration. The components are described as combination of 2 types - AND & OR configuration with independent failures representation. Express the reliability relationship between the components with Failure Even diagram. March 16 393SYS 44 EVENT DIAGRAM A B D E C Event diagram for AND-OR configuration March 16 393SYS 45 EVENT EXPRESSION RS = (A + B + C) * D * E where RS = Reliability of system = Probability of System success March 16 393SYS 46 TRUTH TABLE [ OR ] [ AND ] A B A+B A*B 1 1 1 1 1 0 1 0 0 1 1 0 0 0 0 0 March 16 393SYS 47 .AND. CONFIGURATION Rs = Reliability of the system. = R1XR2 where R1 & R2 are the reliability of components C1 & C2. R1 R2 Rs Rs with n components arranged in logical .AND. then, Rs = R1 X R2 X … X Rn March 16 393SYS 48 .OR. CONFIGURATION Rs = Reliability of the system. R1, R2, ..Rn = Reliability of component 1, 2, ..n. in this case, it is easier to calculate by the probability of failure F Fs = (1 - Rs) F1 = (1 - R1); F2 = (1 - R2) Fs = F1 * F2 = (1 - R1) * (1 - R2) Rs = 1 - Fs = 1 - [(1 - R1) * (1 - R2) * … (1 - Rn)] for n components arranged in logical .OR. March 16 393SYS R1 R 2 Rs 49 Reliability Acronyms MTBF - Mean Time Between Failures MTTF - Mean Time To Failure MTTR - Mean Time To Repair MTBF = MFFT + MTTR Many people consider it to be far more useful than measuring fault rate per LOC March 16 393SYS 50 Difference btw. Reliability and Availability Reliability - means no failures in the interval, say, 0 to t. Availability - means only that the system is up at time t. March 16 393SYS 51 AVAILABILITY A(T) Time (up) = -----------------------------------------Time(up) + Time(down) MTTF = -------------------------MTTF + MTTR = MTTF -------------MTBF Where MTTF = Mean Time To Fail MTTR = Mean Time To Repair MTBF = MTTF + MTTR Where MTBF = Mean Time Between Failure March 16 393SYS 53 Failure rate and Reliability during the random failure period Rs = e –t/m = e -tl where: Rs = e = t = m = l = Probability of failure free operation for period >=t 2.718 specified period of failure free operation mean time between failure (MTBF) failure rate = 1/m March 16 393SYS 54 SYSTEM CONFIGURATION Suppose we have Qp components with constant failure intensities. Their reliabilities are measured over a common time period. Assume that all must function successfully for system success. Then the system failure intensity l is given by Qp l lk k1 Where lk are the component failure intensities March 16 393SYS 55 Confusion surrounds MTBF : 1.5 Exp. failure Intuitive 1 0.5 3 2. 25 2. 5 2. 75 2 1. 25 1. 5 1. 75 1 0. 25 0. 50 0. 75 0 0 MTBF If the failure rate is constant, the probability that a product will operate without failure for a time equal to or greater than its MTBF is only 37%. This is based on the exponential distribution and it is contrary to the intuitive feeling that there is a 50-50 chance of exceeding an MTBF. March 16 393SYS 56 Computer harddisk with MTBF rating of 1 million hour mean that the average unit will run for 114 years before it fails? But harddisk has not been invented for 114 years, how can we know that it can be used for that long? MTBF of a drive is obtained by multiplying a large quantity of the drives with the number of hours running before experiencing a failure in the batch. For example, when a disk manufacturer batch tested 1500 units of hard disk and achieved an average of 30 days operation out of the batch between each individual unit failure, then MTBF of the disk is 1500 x 30 x 24 hours = 1 million hours.(for the period of testing time) March 16 393SYS 57