Failure Prevention and Recovery
Chapter 19

Summary
• What is failure?
• Why do failures happen?
• How do we measure failures?
• Detection and analysis of failures.
• How can operations improve their reliability?
• How should operations recover from failures?

What is failure?
At its simplest, ‘failure’ is when something does not work as it should. If the shop assistant who sells you an item of clothing ‘fails’ to inform you that it should be dry cleaned, that is technically a failure. Yet in operations management we usually use the term to denote a more dramatic event: something ceasing to do what it should do. So a piece of material fails, or a process fails.

Why do operations fail?
There are various reasons for operations failures:
I. Design failures
II. Facilities failures
III. People failures
IV. Supplier failures
V. Customer failures
VI. Environmental failures

A. Design failures
A design may look fine on paper, but its limitations become clear in real circumstances. Design failures arise in two situations:
1. A characteristic of demand was miscalculated or overlooked, so the process cannot cope with the demand placed on it. For example, a process is designed to manufacture 3 televisions per hour, but demand is for 7 televisions per hour.
2. Unexpected circumstances, for example the size specified in the product design turns out to differ from the size demanded.

[Figure: Why systems fail. Labels: failures inside the operation, supply failures, design failures, facilities failures, staff failures, customer failures.]

B. Facilities failures
Failures of machines, equipment, buildings and fittings.

C. People failures
People failures come in two types, errors and violations:
• Errors are mistakes in judgement (for example, riding a motorbike on reserve petrol).
• Violations are acts that are contrary to the operating procedure (for example, a driver who avoids changing the engine oil, causing major problems to the engine).

D. Supplier failures
Failures in the delivery or quality of the goods and services supplied to the operation (for example, a band booked by a hotel fails to turn up).

E. Customer failures
Misuse by customers of the products and services the operation has produced.

F. Environmental disruption-related failures
All causes of failure that lie outside the operation, for example hurricanes, floods, lightning, temperature, fire, crime, theft and terrorism.

Measuring failures
Most failures can be traced back to some kind of human failure. For example:
1. A machine failure may be caused by poor design or poor maintenance.
2. A delivery failure may be caused by someone's error in managing the supply schedule.
3. A customer's mistake may occur because no one instructed the customer.
So failures can be controlled to an extent, and an organization can learn from them. For this reason failures can be treated as opportunities.

There are three main ways of measuring failure:
1. Failure rate: how often a failure occurs.
2. Reliability: the chances of a failure occurring.
3. Availability: the amount of available useful operating time.

Failure rate (FR)
The failure rate is the number of failures occurring over a period of time. For example, the failure rate of an airport security system can be measured by counting security breaches over a period of time. When products are tested, the failure rate can be expressed as a percentage:

FR = (number of failures / total number of products tested) × 100

Failure over time: the ‘bath-tub’ curve
Failure is a function of time: the probability of failure differs at different stages of the life of a product or process. The curve that describes failure probability is called the ‘bath-tub’ curve. According to this curve, the probability of failure is high at the beginning and at the end of the life cycle. There are three distinct stages:
1. The ‘infant-mortality’ or ‘early-life’ stage, where early failures are caused by defective parts or improper use.
2. The ‘normal-life’ stage, where the failure rate is usually low and constant.
3. The ‘wear-out’ stage, where the failure rate increases as the item approaches the end of its working life.
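
The failure rate (FR) measure described in this section can be illustrated with a short calculation. The sketch below is only an illustration: the function names and the sample figures (4 failures among 50 products tested, 2000 operating hours) are hypothetical and do not come from the chapter. It shows the percentage form of FR and, as a second reading of "failures over a period of time", the number of failures per operating hour.

```python
def failure_rate_percent(failures, units_tested):
    """FR as a percentage of the products tested."""
    if units_tested <= 0:
        raise ValueError("units_tested must be positive")
    return 100 * failures / units_tested


def failure_rate_per_hour(failures, operating_hours):
    """FR as the number of failures per hour of operating time."""
    if operating_hours <= 0:
        raise ValueError("operating_hours must be positive")
    return failures / operating_hours


# Hypothetical test data: 50 products tested, 4 of them failed,
# over a total of 2000 operating hours.
print(failure_rate_percent(4, 50))     # 8.0
print(failure_rate_per_hour(4, 2000))  # 0.002
```

Which form is used depends on what is being counted: the percentage form suits batch testing of products, while the time-based form suits equipment or processes that run continuously.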
[Figure: How failure is measured. The bath-tub curve: failure rate plotted against time, showing the ‘infant-mortality’ stage, the normal-life stage and the wear-out stage.]

Reliability
Reliability measures the ability of a system, product or service to perform as expected over time.

Rs = R1 × R2 × R3 × ... × Rn

where Rs is the reliability of the system and R1 to Rn are the reliabilities of its individual components.
Here we assume that a failure in any single component causes the whole system to fail. So the more components a system has, the lower its reliability will be.

Mean time between failures (MTBF)

MTBF = operating hours / number of failures

Availability
Availability is the degree to which the operation is ready to work. An operation is not available if it has either failed or is being repaired following a failure.

Failure prevention and recovery
There are three sets of activities which relate to failure:
1. Understanding what failures are occurring in the operation and why they are occurring.
2. Examining ways to reduce the chance of failure and to minimize the consequences of failure.
3. Making plans and procedures that help the operation recover from failures when they occur.

[Figure: The three tasks of failure prevention and recovery. Failure detection and analysis: finding out what is going wrong and why. Improving system reliability: stopping things going wrong. Recovery: coping when things do go wrong.]

Mechanisms to detect failure
There are six techniques for detecting failure:
1. In-process checks: employees check that the service is acceptable during the process itself (for example, in restaurants).
2. Machine diagnostic checks: a machine is tested by putting it through a sequence of activities (for example, computer servicing).
3. Point-of-departure interviews: staff formally or informally check that the service has been satisfactory.
4. Phone surveys: used to solicit opinions about products or services.
5. Focus groups: groups of customers are brought together to discover problems with, or attitudes towards, products or services.
6. Complaint cards, feedback cards and questionnaires: many organizations use them to collect views about products or services.

Failure analysis
Failure analysis aims to understand why a failure has occurred.
1. Accident investigation: specifically trained staff analyze the causes of an accident (for example, aeroplane or road accidents).
2. Failure traceability: making sure that a failure can be traced back to its origin (finding proof or evidence).
3. Complaint analysis: analyzing the complaints received.
4. Critical incident technique (CIT), or critical incident analysis: finding out from customers which factors satisfied them and which did not.

How failure is detected and analyzed
Failure detection mechanisms include:
• in-process checks
• machine-diagnostic checks
• point-of-departure interviews
Failure analysis procedures include:
• accident investigation
• failure mode and effect analysis
• fault-tree analysis

Failure mode and effect analysis (FMEA)
FMEA identifies the features of a product, service or process that are important in determining the effects of failure. It is a way of identifying failures before they happen, by providing a checklist procedure built around three questions:
1. What is the likelihood that a failure will occur?
2. What would the consequence of the failure be?
3. How likely is such a failure to be detected before it affects the customer?
Based on an assessment of these three questions, a risk priority number (RPN) is calculated for each potential cause of failure. There are seven steps involved in this (see page 629).

[Figure: Failure mode and effect analysis. Labels: normal operation, failure, probability of failure, effect on customer, degree of severity, severity of consequence, likelihood of detection, risk priority number.]

Fault-tree analysis
Fault-tree analysis is a logical procedure that starts with a failure or potential failure and works backwards to identify all of its possible causes and origins.
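
Working backwards from a failure through AND and OR nodes can be sketched in a few lines of code. The example below borrows the event names from the below-temperature food example used in this chapter, but the gate structure (an AND node at the top, OR nodes beneath it) and the tuple representation of the tree are assumptions made for illustration, not a reproduction of the original diagram.

```python
def evaluate(node, basic_events):
    """Return True if the event described by `node` occurs.

    A node is either the name of a basic event (a string) or a tuple
    (gate, children), where gate is "AND" or "OR".
    """
    if isinstance(node, str):
        return basic_events.get(node, False)
    gate, children = node
    results = [evaluate(child, basic_events) for child in children]
    if gate == "AND":
        return all(results)
    if gate == "OR":
        return any(results)
    raise ValueError(f"unknown gate: {gate}")


# Assumed structure: the food reaches the customer below temperature
# only if the food is cold AND the plate is cold.
food_is_cold = ("OR", ["oven malfunction",
                       "timing error by chef",
                       "ingredients not defrosted"])
plate_is_cold = ("OR", ["plate warmer malfunction",
                        "plate taken too early from warmer",
                        "cold plate used"])
food_below_temperature = ("AND", [food_is_cold, plate_is_cold])

# An oven malfunction alone does not trigger the top event here.
print(evaluate(food_below_temperature, {"oven malfunction": True}))  # False

# An oven malfunction together with a cold plate does.
print(evaluate(food_below_temperature,
               {"oven malfunction": True, "cold plate used": True}))  # True
```

Under these assumptions, the tree makes it easy to see which combinations of basic causes are enough to produce the failure, which is where prevention effort can be focused.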
[Figure: Fault-tree analysis for below-temperature food being served to customers. Events shown: food served to customer is below temperature; food is cold; plate is cold; oven malfunction; timing error by chef; ingredients not defrosted; plate warmer malfunction; plate taken too early from warmer; cold plate used. Key: AND node, OR node.]

Improving process reliability
Once failures are understood, the responsibility of operations managers is to prevent them. There are four ways of doing this:
1. Designing out fail points
2. Building in redundancy
3. Fail-safeing
4. Maintenance

a. Designing out fail points
Fail points can be designed out through proper product/service design, quality planning and control, and process control.

b. Redundancy
Building redundancy into an operation means having a back-up system (for example, back-up systems on an aeroplane, the second kidney, or the two red lights on a car).

c. Fail-safeing
The concept comes from Japanese methods of operations improvement. It is known in Japan as poka-yoke, which means preventing mistakes. Poka-yokes are devices used to guard against failures.

Poka-yoke (fail-safeing) examples:
• A 3.5 inch diskette cannot be inserted unless it is oriented correctly; an upside-down disk goes in only part of the way. This feature, along with the fact that the diskette is not square, prevents incorrect orientation. It is a control method.
• Warning lights and chimes alert the driver to potential problems. These devices employ a control method and a warning method.
• Filing cabinets can fall over if too many drawers are pulled out. In some filing cabinets, opening one drawer locks all the rest, reducing the chance of the cabinet tipping. It is a control method.
• The window in an envelope is not only a labour-saving device. It also prevents the contents intended for one person being inserted in an envelope addressed to another. It is a control method.
For further examples of poka-yoke techniques, see page 633.

d. Maintenance
Maintenance is how organizations try to avoid failure by taking care of their physical facilities.

Benefits of maintenance:
1. It enhances safety.
2. It enhances reliability.
3. It enhances quality.
4. Lower operating costs.
5. Longer life.
6. Higher end value (facilities can be used second-hand).

Three basic approaches to maintenance
1. Run to breakdown (RTB): operate until something fails, and only then carry out maintenance.
2. Preventive maintenance: eliminate or reduce the chances of failure by servicing the facilities.
3. Condition-based maintenance: perform maintenance only when the facilities require it. It is applicable to expensive facilities.

Mix of maintenance approaches
A mixture of maintenance approaches is often used, in a motor car, for example.
[Figure: Mix of maintenance approaches in a motor car. Labels: use preventive maintenance; use run-to-breakdown maintenance; use condition-based monitoring maintenance.]

Total productive maintenance (TPM)
TPM means productive maintenance carried out by all employees through small-group activities. In this sense TPM is a form of maintenance management. For the five goals of TPM, see page 538, paragraph 2.

Reliability-centered maintenance
This is another approach to maintenance, in which different types of maintenance are used for different parts of a process. One part in one process can have several different failure modes, each of which requires a different approach.
[Figure: Shredding process. Failures plotted against time for the cutters, showing the cutter ‘wear out’ failure pattern and the cutter ‘shake loose’ failure pattern.]

Recovery
The activities designed to cope with failures once they have occurred are known as recovery.

Failure planning
The procedures which allow the operation to recover from failure are called failure planning.

The stages in failure planning:
1. Discover: what has happened, and what the consequences are.
2. Act: inform, contain, follow up.
3. Learn: find the root cause, engineer it out.
4. Plan: analyze the failure, plan the recovery.

Business continuity procedures
Business continuity procedures aim to avoid or recover from failures and keep the business going (see page 643).