Reliability Centered Maintenance (RCM)

Evolution of Maintenance

At the very beginning, Maintenance was an appendix to Operations / Production: it existed only to fix failures when they happened. These were the days of absolute Corrective Maintenance.

As time went by, it was noticed that many failures follow an almost regular pattern, occurring after a roughly constant average period. One could therefore choose regular intervals at which to fix the equipment BEFORE the failure: Preventive Maintenance, also known as Time Based Maintenance.

However, failures very often happen at irregular intervals. To avoid an unwanted failure, the Preventive Maintenance intervals are then shortened. If the condition of the equipment were known, maintenance could be carried out later. Technological development made it possible to identify failure symptoms: Predictive Maintenance, also known as Condition Based Maintenance.

Many pieces of equipment operate only sporadically (alarms, standby equipment, etc.), yet we must be sure that they are ready to run. Their failures are "hidden faults". Detecting and preventing hidden failures is called Detective Maintenance.

Because failure modes differ, there is no single approach (Corrective, Preventive or Predictive) that fits every case. The correct balance between these programs yields better equipment reliability, hence the name: Reliability Centered Maintenance.

"Remember, my kid, prevention is better than cure..."
"Take it easy, grandma, not always!"

John Moubray (1949-2004)

After graduating as a mechanical engineer in 1971, John Moubray worked for two years as a maintenance planner in a packaging plant and for one year as a commercial field engineer for a major oil company. In 1974, he joined a large multi-disciplinary management consulting company. He worked for this company for twelve years, specializing in the development and implementation of manual and computerized maintenance management systems for a wide variety of clients in the mining, manufacturing and electric utility sectors. He began working on RCM in 1981 and from 1986 dedicated himself to it full time, founding Aladon LCC, which he led until his premature death in 2004. Today the name John Moubray is synonymous with RCM.

Its origins

What about a failure rate of 0.00006 per event? Quite good, no? This was the average failure rate for commercial flight takeoffs in the 1950s, with two thirds of the failures caused by equipment. Today, this would mean 2 accidents per day involving planes with more than 100 passengers! That is why Reliability Centered Maintenance began in aeronautical engineering. Pretty soon the nuclear, military, and oil & gas industries also began to adopt RCM concepts and implement them in their facilities.

Reliability and Availability

Reliability
Reliability is a broad term that focuses on the ability of a product to perform its intended function. Mathematically speaking, reliability can be defined as the probability that an item will continue to perform its intended function without failure for a specified period of time under stated conditions. Reliability is a performance expectation. It is usually defined at design.

Availability
Availability depends on the operating uptime and the operating cycle. Availability is a performance result: the equipment history tells us the availability.
Bibliography: Kardec, Alan and Nascif, Julio - Manutenção - Função Estratégica, Editora Qualitymark

Reliability and Availability

MTBF = Mean Time Between Failures
MTTR = Mean Time To Repair

A first definition:

$$\text{Availability} = \frac{MTBF}{MTBF + MTTR}$$

Availability definitions

MTBF = Mean Time Between Failures
MTTR = Mean Time To Repair
MTBM = Mean Time Between Maintenance actions
M = Mean Maintenance Downtime (including preventive and planned corrective downtime)

• Inherent Availability: considers only corrective downtime
• Achieved Availability: considers corrective and preventive maintenance
• Operational Availability: ratio of the system uptime to the total time

$$\text{Inherent Availability} = \frac{MTBF}{MTBF + MTTR}$$

$$\text{Achieved Availability} = \frac{MTBM}{MTBM + M}$$

$$\text{Operational Availability} = \frac{\text{Uptime}}{\text{Operating cycle}}$$

Reliability and Availability: two worked examples

First example (947 days in total): four running periods of 250, 360, 200 and 120 days, with three downtimes of 9, 6 and 2 days.

MTBF = (250 + 360 + 200 + 120) / 4 = 232.5 days
MTTR = (9 + 6 + 2) / 3 = 5.67 days
Availability = 232.5 / (232.5 + 5.67) = 97.62 %

Second example (947 days in total): four running periods of 180, 400, 120 and 233 days, with three downtimes of 7, 4 and 3 days.

MTBF = (180 + 400 + 120 + 233) / 4 = 233.25 days
MTTR = (7 + 4 + 3) / 3 = 4.67 days
Availability = 233.25 / (233.25 + 4.67) = 98.04 %

Improving Availability

Achieved Availability ↑ = MTBM ↑ / (MTBM + M ↓)

To improve Availability:

Improve MTBM:
• Reduce Preventive Programs to a minimum, or define the Preventive intervals as precisely as possible
• Use Predictive techniques whenever possible
• Implement Maintenance Engineering (RCM, TPM...)

Minimize M:
• Implement Maintenance Engineering (Planning, Logistics...)
• Improve the technical skills of personnel (training)
• Develop Integrated Planning (Maintenance + Operations + HSE + Inspection + ...)
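The two worked availability examples above can be reproduced with a few lines of Python. This is only a minimal sketch: the function name `availability` and its argument names are illustrative choices, not something taken from the slides, and it implements only the first definition, MTBF / (MTBF + MTTR).

```python
from statistics import mean


def availability(run_periods, downtimes):
    """Return (MTBF, MTTR, availability) from lists of durations in days.

    Uses the first definition from the slides:
    availability = MTBF / (MTBF + MTTR).
    """
    mtbf = mean(run_periods)   # mean time between failures
    mttr = mean(downtimes)     # mean time to repair
    return mtbf, mttr, mtbf / (mtbf + mttr)


# First example: 250, 360, 200 and 120 days running; 9, 6 and 2 days down.
mtbf_1, mttr_1, a_1 = availability([250, 360, 200, 120], [9, 6, 2])
print(f"{mtbf_1:.2f} {mttr_1:.2f} {a_1:.2%}")   # 232.50 5.67 97.62%

# Second example: 180, 400, 120 and 233 days running; 7, 4 and 3 days down.
mtbf_2, mttr_2, a_2 = availability([180, 400, 120, 233], [7, 4, 3])
print(f"{mtbf_2:.2f} {mttr_2:.2f} {a_2:.2%}")   # 233.25 4.67 98.04%
```

Both histories cover the same 947 days, yet the shorter repairs in the second one are enough to lift the availability by about 0.4 percentage points.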
Improving Productivity

Productivity Improvement Factors:
• Detailed work planning
• Delivering equipment to Maintenance as clean as possible
• Checklist at the end of Maintenance activities
• Complete and comprehensive equipment data available
• Supplies available on the job site
• Skilled personnel

(Bibliography: Kardec and Nascif, Manutenção - Função Estratégica.)

[Figure: Availability benchmark]

Translating percentages into daily routine...

| Availability % | Downtime per year | Downtime per month* | Downtime per week |
|---|---|---|---|
| 90% | 36.5 days | 72 hours | 16.8 hours |
| 95% | 18.25 days | 36 hours | 8.4 hours |
| 98% | 7.30 days | 14.4 hours | 3.36 hours |
| 99% | 3.65 days | 7.20 hours | 1.68 hours |
| 99.5% | 1.83 days | 3.60 hours | 50.4 min |
| 99.8% | 17.52 hours | 86.4 min | 20.16 min |
| 99.9% ("three nines") | 8.76 hours | 43.2 min | 10.1 min |
| 99.95% | 4.38 hours | 21.6 min | 5.04 min |
| 99.99% ("four nines") | 52.6 min | 4.32 min | 1.01 min |
| 99.999% ("five nines") | 5.26 min | 25.9 s | 6.05 s |
| 99.9999% ("six nines") | 31.5 s | 2.59 s | 0.605 s |

* Monthly figures are based on a 30-day month.

Maintenance Program costs

| Maintenance Program | Cost (US$/HP/year) |
|---|---|
| Corrective (unplanned) | 17 to 18 |
| Preventive | 11 to 13 |
| Predictive / Planned Corrective | 7 to 9 |

Source: NMW Chicago

Benchmarking the balance between Maintenance programs

| Maintenance activities | % |
|---|---|
| Corrective actions | 28 |
| Preventive actions | 36 |
| Predictive actions | 19 |
| Maintenance studies | 17 |

Source: NMW Chicago

Definitions

Failure rate (λ)
The failure rate is defined as the reciprocal of MTBF:

$$\lambda(t) = \frac{1}{MTBF}$$

Reliability: R(t)
Let P(t) be the probability of failure between 0 and t. Reliability is defined as:

$$R(t) = 1 - P(t)$$

Bibliography: Lafraia, João Ricardo - Manual de Confiabilidade, Mantenabilidade e Disponibilidade, Editora Qualitymark

Some math...

If the failure rate λ is constant, it can be proven (check at www.weibull.com) that R(t), the probability of having operated without failure until instant t, is given by:

$$R(t) = e^{-\lambda t}$$

This reinforces the idea that reliability is a function of time, not a single fixed number. So it is incorrect to say: "This equipment has a 0.97 reliability factor...". We should rather say: "This equipment has 97% reliability for running, let's say, 240 days..."

Tricks and tips...

Historically, a piece of equipment has 4 failures per year. What is its reliability for a 100-day run?

λ = 4/365 = 0.011/day
$R(100) = e^{-0.011 \times 100} = e^{-1.1} = 0.333 = 33.3\%$

The probability of having no failure in 100 days is 33.3%.

Some upgrades have been made, so the failure rate is now 2 per year (meaning that MTBF has doubled). What is the reliability for a 100-day run?

λ = 2/365 = 0.0055/day
$R(100) = e^{-0.0055 \times 100} = e^{-0.55} = 0.577 = 57.7\%$

The probability of having no failure in 100 days is 57.7%. As seen, doubling MTBF does not double reliability.

Historically, a piece of equipment has MTBF = 200 days. To improve its reliability for a 100-day run by 10%, by what percentage should MTBF be improved?

λ = 1/200 = 0.005/day
$R(100) = e^{-0.005 \times 100} = e^{-0.5} = 0.607 = 60.7\%$

To improve this reliability by 10%, the new reliability should be:

$R'(100) = 1.1 \times 0.607 = 0.668 = e^{-\lambda' \times 100}$
$\ln 0.668 = -\lambda' \times 100$
$-0.403 = -\lambda' \times 100$
λ' = 0.00403/day
1/MTBF' = 0.00403, so MTBF' = 248 days
248/200 = 1.24

MTBF should improve by 24%.

According to the manufacturer, a piece of equipment has a 90% reliability for running one year. If you want 95% confidence that it will not fail, how long can it run before it undergoes Preventive maintenance or some predictive technique?
$0.9 = e^{-\lambda \times 365}$
$\ln 0.9 = -\lambda \times 365$
$-0.1054 = -\lambda \times 365$
$\lambda = 2.89 \times 10^{-4}$/day

$0.95 = e^{-\lambda t}$
$\ln 0.95 = -\lambda t$
$-0.0513 = -2.89 \times 10^{-4} \times t$
t = 177.5 days

For practical purposes, this equipment could be placed in a six-monthly preventive / predictive program.

[Figure: Reliability and MTBF. Reliability (0 to 1) versus days (1 to 351) for MTBF = 50, 100, 150, 200, 250, 300 and 365 days. Every curve is marked at R = 0.368, the reliability reached when t = MTBF, since $e^{-1} = 0.368$.]

System in series

[Diagram: components 1, 2 and 3 connected in series]

Let P1 = 5%, P2 = 10% and P3 = 20% be the failure probability of each component of this system in a certain period. What is the reliability of this system, in series?

This system will run provided that ALL its components run. So, their reliabilities are multiplied.

R1 = 1 - P1 = 1 - 0.05 = 0.95
R2 = 1 - P2 = 1 - 0.10 = 0.90
R3 = 1 - P3 = 1 - 0.20 = 0.80
R = R1 x R2 x R3 = 0.95 x 0.90 x 0.80 = 0.6840 = 68.4%
System failure probability: 31.6%

The system failure probability is higher than that of any individual component. The system reliability is lower than that of any component.

(Bibliography: Lafraia, Manual de Confiabilidade, Mantenabilidade e Disponibilidade.)

System in parallel

[Diagram: components 1, 2 and 3 connected in parallel]

Let P1 = 5%, P2 = 10% and P3 = 20% be the failure probability of each component of this system, in parallel, in a given period. What is the reliability of the system, in parallel?

This system will run until ALL components fail. In this case, the failure probabilities are multiplied.

P = P1 x P2 x P3 = 0.05 x 0.10 x 0.20 = 0.0010
R = 1 - P = 0.999 = 99.9%
System failure probability: 0.1%

The system failure probability is lower than that of any component. The system reliability is higher than that of any component.
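The constant-failure-rate calculations and the series and parallel rules above can be sketched in Python as follows. The function names (`reliability`, `time_to_reliability`, `series`, `parallel`) are illustrative and do not come from the slides. The sketch assumes a constant failure rate and independent component failures, and because it does not round intermediate values, its results can differ slightly in the last digit from the figures shown on the slides.

```python
import math


def reliability(failure_rate, t):
    """R(t) = exp(-lambda * t) for a constant failure rate."""
    return math.exp(-failure_rate * t)


def time_to_reliability(failure_rate, target):
    """Running time after which reliability has dropped to `target`."""
    return -math.log(target) / failure_rate


def series(reliabilities):
    """All components must run: reliabilities are multiplied."""
    return math.prod(reliabilities)


def parallel(reliabilities):
    """The system fails only if all components fail:
    failure probabilities are multiplied."""
    return 1.0 - math.prod(1.0 - r for r in reliabilities)


# 4 failures per year, 100-day run (the slide rounds lambda and gets 33.3%).
r_100 = reliability(4 / 365, 100)

# 90% reliability over one year; how long until reliability falls to 95%?
lam = -math.log(0.90) / 365
t_95 = time_to_reliability(lam, 0.95)   # a little under 178 days

# Failure probabilities of 5%, 10% and 20%.
r_series = series([0.95, 0.90, 0.80])       # 0.684
r_parallel = parallel([0.95, 0.90, 0.80])   # 0.999
```

Keeping `series` and `parallel` as separate small functions makes it easy to nest them when a system combines both arrangements.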
Mixed systems

[Diagram: components 1, 2 and 3 in series on one branch; components 4 and 5 in series on a second branch; the two branches in parallel]

If P1 = 10%, P2 = 5%, P3 = 15%, P4 = 2% and P5 = 20%, what is the system reliability?

Branch 1-2-3:
R1 = 1 - 0.10 = 0.90
R2 = 1 - 0.05 = 0.95
R3 = 1 - 0.15 = 0.85
R123 = 0.90 x 0.95 x 0.85 = 0.7268
P123 = 0.2733

Branch 4-5:
R4 = 1 - 0.02 = 0.98
R5 = 1 - 0.20 = 0.80
R45 = 0.98 x 0.80 = 0.7840
P45 = 0.2160

System:
Psystem = P123 x P45 = 0.2733 x 0.2160 = 0.0590
Rsystem = 1 - 0.0590 = 0.941 = 94.1%

Redundancy

[Diagram: pumps A, B and C in parallel]

Pumps A, B and C are the feed pumps of a plant. To operate at full capacity, at least two of these three pumps must be running. The failure probability of each one is 10%. What is the reliability of running this plant at full production?

The failure probability is P = 0.1 (10%), and the reliability is R = 1 - 0.1 = 0.9 (90%).

With three pumps in parallel, every possible combination is a term of the binomial expansion:

$$(R + P)^3 = R^3 + 3R^2P + 3RP^2 + P^3 = 0.9^3 + 3 \times 0.9^2 \times 0.1 + 3 \times 0.9 \times 0.1^2 + 0.1^3$$

$$(R + P)^3 = 0.729 + 0.243 + 0.027 + 0.001$$

• Three running: 0.729
• Two running and one off: 0.243
• One running and two off: 0.027
• None running: 0.001

Reliability (at least two running) = 0.729 + 0.243 = 0.972 = 97.2%
No full production = 0.027 + 0.001 = 0.028 = 2.8%

Redundancy with different pumps

Pumps A, B and C are the feed pumps of a plant. The flow rate of pump A is 2,000 gpm, that of pump B is 1,800 gpm and that of pump C is 1,700 gpm. To operate, the plant needs a feed rate of at least 3,600 gpm. The reliabilities are RA = 0.95, RB = 0.90 and RC = 0.85. What is the plant reliability?
As the plant needs at least 3,600 gpm, only the following cases can supply it (B and C together deliver just 3,500 gpm, so pump A must always be running):

| Case | Calculation | Probability |
|---|---|---|
| A ∩ B ∩ C | 0.95 x 0.90 x 0.85 | 0.72675 |
| A ∩ B ∩ not C | 0.95 x 0.90 x (1 - 0.85) | 0.12825 |
| A ∩ not B ∩ C | 0.95 x (1 - 0.90) x 0.85 | 0.08075 |

Plant reliability = 0.93575 = 93.6%

[Figure: Systems in series. System reliability (0 to 1) versus component reliability (1 down to 0.5) for 1, 2, 3, 4 and 10 components.]

[Figure: Systems in parallel. System reliability (0 to 1) versus component reliability (1 down to 0.5) for 1, 2, 3, 4 and 10 components.]

System and Component Redundancy

[Diagram: Component Redundancy: A in parallel with A', followed in series by B in parallel with B'. System Redundancy: the chain A-B in parallel with the chain A'-B'.]

Which of these systems would have the better overall reliability (let's assume all components have the same reliability R)?

Component redundancy:
Reliability of subsystems AA' and BB': $1 - (1-R)^2 = 1 - 1 + 2R - R^2 = 2R - R^2$
System reliability: $R_{\text{comp red}} = (2R - R^2)^2$

System redundancy:
Reliability of subsystems AB and A'B': $R^2$
System reliability: $R_{\text{syst red}} = 1 - (1 - R^2)^2 = 1 - 1 + 2R^2 - R^4 = 2R^2 - R^4$

Comparison:

$$R_{\text{comp red}} - R_{\text{syst red}} = (2R - R^2)^2 - (2R^2 - R^4) = 4R^2 - 4R^3 + R^4 - 2R^2 + R^4$$

$$R_{\text{comp red}} - R_{\text{syst red}} = 2R^4 - 4R^3 + 2R^2 = 2R^2(R^2 - 2R + 1) = 2R^2(R - 1)^2 \geq 0$$

$$R_{\text{comp red}} \geq R_{\text{syst red}}$$

Active and Passive Redundancy

[Diagram: units A and B in parallel]

Active Redundancy: both units operate at the same time, sharing the load. If one fails, the other one carries the load alone.

Passive Redundancy: one unit operates while the other is on standby. The standby unit starts operating after the first one fails, which depends on a switching system.

Getting closer to real world...

In systems with active redundancy all redundant components are in operation and share the load with the main component. Upon failure of one component, the surviving components carry the load, and as a result the failure rate of the surviving components may increase.

The reliability of an active, shared-load, parallel system of two units can be calculated as follows:

$$R(t) = e^{-2\lambda_1 t} + \frac{2\lambda_1}{2\lambda_1 - \lambda_2}\left(e^{-\lambda_2 t} - e^{-2\lambda_1 t}\right)$$

where λ1 is the failure rate of each unit when both are working and λ2 is the failure rate of the surviving unit when the other one has failed. If 2λ1 = λ2, then:

$$R(t) = e^{-2\lambda_1 t} + 2\lambda_1 t \, e^{-\lambda_2 t}$$

In a system with active redundancy, the reliability of each of the two components for 100 days is R = 0.96 when sharing the load. If one component fails, the failure rate of the surviving one increases by 50%. What is the system reliability for 100 days?

$R(100) = 0.96 = e^{-\lambda \times 100}$
$\ln 0.96 = -100\lambda$
λ1 = 0.00041
λ2 = 1.5 x λ1 = 0.000615

$$R(100) = e^{-2 \times 0.00041 \times 100} + \frac{2 \times 0.00041}{2 \times 0.00041 - 0.000615}\left(e^{-0.000615 \times 100} - e^{-2 \times 0.00041 \times 100}\right)$$

$$R(100) = e^{-0.082} + 4 \times \left(e^{-0.0615} - e^{-0.082}\right)$$

$$R(100) = 0.9213 + 4 \times (0.9404 - 0.9213)$$

$$R(100) = 0.9977$$

If there were no increase in failure rate, the system reliability would be 0.9984. The difference looks like nothing, but if each result is converted into an equivalent constant failure rate (λ = -ln R / t), it means a 30.5% decrease in system MTBF!

The redundant or back-up components in passive or standby systems start operating only when one or more of the operating components fail. The back-up components remain dormant until needed.
For two identical components (primary and back-up) the formula, considering a perfect switch, is:

$$R(t) = e^{-\lambda t}(1 + \lambda t)$$

If the reliability of the switch is less than one, the reliability of the system is affected by the switching mechanism and is reduced accordingly:

$$R(t) = e^{-\lambda t}(1 + R_{sw}\lambda t)$$

where $R_{sw}$ is the switch reliability.

The reliability of a standby system consisting of one primary component with constant failure rate λ1 and a backup component with constant failure rate λ2 (λ1 ≠ λ2, perfect switch) is given by:

$$R(t) = e^{-\lambda_1 t} + \frac{\lambda_1}{\lambda_1 - \lambda_2}\left(e^{-\lambda_2 t} - e^{-\lambda_1 t}\right)$$

Two feed pumps in a nuclear power plant are connected in standby mode: one is active and one is on standby. The power plant will have to shut down if both feed pumps fail. If the time between failures of each pump has an exponential distribution with MTBF = 28,000 hours, and the failure rate of the switching mechanism is $\lambda_{sw} = 10^{-6}$ per hour, what is the probability that the power plant will not have to shut down due to a pump failure in 10,000 hours?

$$R(t) = e^{-\lambda t}(1 + R_{sw}\lambda t)$$

Switch reliability:

$$R_{sw} = e^{-10^{-6} \times 10^{4}} = e^{-10^{-2}} = e^{-0.01} = 0.9900$$

With λ = 1/MTBF:

$$R(10000) = e^{-\frac{1}{28000} \times 10000} \times \left(1 + 0.9900 \times \frac{1}{28000} \times 10000\right)$$

$$R(10000) = e^{-0.3571} \times (1 + 0.3536)$$

$$R(10000) = 0.6997 \times 1.3536$$

$$R(10000) = 0.9471$$

Bathtub Curve

Early Life (burn-in, infant mortality)
• a large number of new-component failures, which decreases with time

Useful Life
• a small number of apparently random failures during working life (λ constant)

Wear-out
• an increasing number of failures with time as components wear out

Bathtub Curve: the three periods

Early Life:
• sub-standard materials
• often caused by poor / variable manufacturing and poor quality control
• prevented by effective quality control, burn-in, run-in and debugging techniques
• weak components are eventually replaced by good ones
• probabilistic treatment is less important

Useful Life:
• random or chance failures
• may be caused by unpredictable, sudden stress accumulations, outside and inside the components, beyond the design strength
• over sufficiently long periods the frequency of occurrence (λ) is approximately constant
• the failure rate is used extensively in Safety & Reliability analyses

Wear-out period:
• a symptom of component ageing
• prediction is important for replacement and maintenance policy

Different bathtub curves

[Figure: different bathtub curves, with the percentage of items that follow each pattern]

These statistics are from the aeronautical industry. In a process plant, like a refinery, do you think the percentage for each pattern would be about the same?

Which of these curves would be applicable to:
• A pump?
• An electronic instrument?
• A tire?

Failure modes

Common sense says that the best way to optimize plant availability is to implement some Preventive maintenance. Preventive maintenance means fixing or replacing pieces of equipment and/or components at fixed intervals. The useful lifespan of equipment may be calculated with Failure Statistical Analysis, enabling the Maintenance Department to implement Preventive Programs.

This is true for some simple pieces of equipment and components, which may have a prevailing failure mode. Many components in contact with process fluids have a regular lifespan, as does cyclic equipment, due to fatigue and corrosion.

But for many pieces of equipment there is no connection between reliability and time. Furthermore, as seen in the reliability curves, defining the optimum interval for Preventive maintenance may be a hard task. Besides, fixing or even replacing the equipment may take you back to the Infant Mortality period...

[Figure: failure rate λ versus time, with the annotations "Here begins wear-out period. Failures are likely to happen...", "Let's define Preventive maintenance here...", "Preventive maintenance may cause failures earlier..." and "The failure likelihood is earlier!"]
Turnarounds

Turnarounds are often seen by Operations as a unique opportunity to have all problems solved and all equipment fixed... For Maintenance, meanwhile, a Turnaround (TAR) is a huge event that consumes time, resources and money, in which ONLY what CANNOT be done on the run, during normal operation, should be done.

Frequently, Maintenance is asked to perform General Maintenance on ALL the rotating equipment of a Unit during its Turnaround. As a matter of fact, if this equipment has spares, the General Maintenance should be done outside the TAR.

Why does Operations want everything to be done during the TAR?

1) Because Operations doesn't have enough confidence that it will be done during routine maintenance.
2) Because they don't feel comfortable running a piece of equipment that is momentarily without a spare... the same way that, when we have a flat tire, we drive on the spare tire only far enough to reach the tire repair shop...

1) Operations doesn't have enough confidence that it will be done during routine maintenance.

To improve TAR results by reversing the vicious cycle below, Maintenance management has to improve Routine Maintenance!

Vicious cycle: many equipment items left to the TAR → too much to be done during the TAR → the TAR won't be able to perform all that has to be done

Virtuous cycle: good routine maintenance → many equipment items left to Routine Maintenance → not too much equipment to be done during the TAR → the TAR will carry out all the services needed → unit running well

2) Because they don't feel comfortable running a piece of equipment that is momentarily without a spare... the same way that, when we have a flat tire, we drive on the spare tire only far enough to reach the tire repair shop...

Consider these two pumps in Passive Redundancy (one will be on standby).
Assume that during the first 100 h after a General Maintenance such a pump has a 70% reliability, and that after this, over a one-year period, it runs with 97% reliability (which are reasonable assumptions!).

If General Maintenance is performed within a Preventive or Predictive Program during normal operation, the unit will depend on a single pump during the repair time, with a 97% reliability.

If both pumps undergo General Maintenance during the TAR, the system reliability during the first 100 hours (considering a perfect switch) would be about 95% (using the formula $R(t) = e^{-\lambda t}(1 + \lambda t)$ with $e^{-\lambda t} = 0.70$).

So the unit would run for a period of time with two available pumps, but with an overall reliability below what it would have running with only one pump!

RCM Implementation Flowchart

1. Will the failure directly affect Health, Safety or Environment?
   Yes: go to 4. No: go to 2.
2. Will the failure adversely affect the Mission, Vision and Core Values of the Company?
   Yes: go to 4. No: go to 3.
3. Will the failure cause major economic losses (harm to systems and/or machines)?
   Yes: go to 4. No: Run-to-fail?
4. Is there some cost-effective monitoring technology available?
   Yes: deploy monitoring techniques (Predictive Maintenance). No: go to 5.
5. Are there regular failure patterns (time intervals)?
   Yes: Preventive Maintenance. No: re-design the system, accept the failure risk, or install redundancy.

Another RCM Implementation Flowchart

1. If this thing breaks, will it be noticed?
   Yes: go to 2. No: can preventing it breaking reduce the likelihood of multiple failures?
      Yes: prevent it breaking. No: check to see if it is broken.
2. If this thing breaks, will it hurt someone or the environment?
   No: go to 3. Yes: can preventing it breaking reduce the risk to the environment and safety?
      Yes: prevent it breaking. No: re-design it.
3. If this thing breaks, will it slow or stop production?
   Yes: is it cheaper to prevent it breaking than the loss of production?
      Yes: prevent it breaking. No: let it break.
   No: is it cheaper to prevent it breaking than to fix it?
      Yes: prevent it breaking. No: let it break.
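The second flowchart is a sequence of yes/no questions, so it can be sketched as a small decision function. This is only one reading of the flowchart: the wiring of the branches is an assumption, and the function and parameter names (`rcm_decision`, `noticed`, `hurts` and so on) are hypothetical labels chosen for illustration.

```python
def rcm_decision(
    *,
    noticed,                       # if it breaks, will it be noticed?
    hurts,                         # will it hurt someone or the environment?
    slows_production,              # will it slow or stop production?
    reduces_multiple_failures=False,   # can prevention reduce multiple failures?
    reduces_risk=False,                # can prevention reduce safety/environment risk?
    cheaper_than_lost_production=False,
    cheaper_than_repair=False,
):
    """Return the recommended action under one assumed reading
    of the simplified RCM flowchart."""
    if not noticed:
        # Hidden failure branch.
        if reduces_multiple_failures:
            return "Prevent it breaking"
        return "Check to see if it is broken"
    if hurts:
        # Safety / environmental branch.
        return "Prevent it breaking" if reduces_risk else "Re-design it"
    if slows_production:
        # Operational branch: compare prevention with lost production.
        if cheaper_than_lost_production:
            return "Prevent it breaking"
        return "Let it break"
    # Non-operational branch: compare prevention with the repair cost.
    return "Prevent it breaking" if cheaper_than_repair else "Let it break"


# A standby alarm whose failure would go unnoticed and cannot be prevented:
standby_alarm = rcm_decision(noticed=False, hurts=False, slows_production=False)
print(standby_alarm)   # Check to see if it is broken
```

Keyword-only arguments are used so that each call spells out which question is being answered, which keeps the example close to the wording of the flowchart.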