Safety
• Examples
• Terms and Concepts
• Safety Architectures
• Safe Design Process
• Software Specific Stuff
• Sources
  • Hard Time by Bruce Powell Douglass, which references Safeware by Nancy Leveson

CSE 466 – Fall 2000 - Introduction - 1

What is a Safe System?
[Figure: block diagram with Brake Pedal, Pedal Sensor, Processor, Bus, Brake w/ local controller, Engine w/ local controller]
• Is it safe?
• What does “safe” mean?
• How can we make it safe?

Terms and Concepts
• Reliability of component i can be expressed as the probability that component i is still functioning at some time t.
  • $P_i(t)$ = probability of being operational at time t
  • Low failure rate means nearly constant probability
  • 1/(failure rate) = MTBF
[Figure: plot of $P_i(t)$ against time, with the burn-in period marked]
• Is system reliability $P_s(t) = \prod_i P_i(t)$?
• Assuming that all components have the same component reliability, is a system w/ fewer components always more reliable?
• Does component failure imply system failure?

A Safe System
• A system is safe if its deployment involves assuming an acceptable amount of risk… acceptable to whom?
• Risk factors
  • Probability of something bad happening
  • Consequences of something bad happening (severity)
• Examples
  • Airplane travel: high severity, low probability
  • Electric shock from battery powered devices: high probability, low severity
[Figure: risk chart with probability and severity axes, a safe zone and a danger zone; plotted examples: mp3 player, Desktop PC?, airplane, nuclear power plant. Note: we don’t all have the same risk tolerance!]

More Precise Terminology
• Accident or Mishap: (unintended) damage to property or harm to persons.
  • Release of energy
  • Release of toxins
  • Interference with life support functions
  • Damage to a credit rating (does this fit the definition?)
• Hazard: a state of the system that will inevitably lead to an accident or mishap.
  • Hydrogen leak in fuel cell system
  • Backup battery dead on ventilator
  • Supplying misleading information to safety personnel or control systems. This is the desktop PC nightmare scenario.
    • Bad information
    • Alarm bell broken

Faults
• A fault is an “unsatisfactory system condition or state”. A fault is not necessarily a hazard. In fact, assessments of safety are based on the notion of fault tolerance.
  • Example: main H2 valve stuck, backup H2 valve working
• Systemic faults
  • Design errors (including process errors such as failure to test or failure to apply a safety design process)
  • Faults due to software bugs are systemic
  • Security breach
• Random faults
  • Random events that can cause permanent or temporary damage to the system. Includes EMI and radiation, component failure, power supply problems, wear and tear.

Component v. System
• Reliability is a component issue
• Safety and availability are system issues
• A system can be safe even if it is unreliable!
• If a system has lots of redundancy, the likelihood of a component failure (a fault) increases, but the safety and availability of that system may increase as well.
• Safety and availability are different and sometimes at odds. Safety may require the shutdown of a system that may still be able to perform its function.
  • A backup system that can fully operate a nuclear power plant might always shut it down in the event of failure of the primary system.
  • The plant could remain available, but it is unsafe to continue operation.

Single Fault Tolerance (for safety)
• The existence of any single fault does not result in a hazard
• Single fault tolerant systems are generally considered to be safe, but more stringent requirements may apply to high risk cases: airplanes, power plants, etc.
[Figure: Backup H2 Valve Control and Main H2 Valve Control linked by a watchdog protocol. Assume perfectly reliable valves.]
• If the handshake fails, then either one or both can shut off the gas supply.
• Is this a single fault tolerant system?

Is This?
[Figure: Backup H2 Valve Control and Main H2 Valve Control linked by a watchdog handshake; annotation: common mode failures]

Now Safe?
[Figure: Backup H2 Valve Control and Main H2 Valve Control linked by a watchdog handshake]
• Separate Clock Source
• Power Fail-Safe (non-latching) Valves
• Does it ever end?

Time is a Factor
• The TUV Fault Assessment Flow Chart
  • T1: fault tolerance time of the first failure
  • T2: time after which a second fault is likely (based on MTBF data)
  • Captures time, and the notion of “latent faults”
• Safety requires that $T_{test} < T_1 < T_2$
[Figure: flow chart with decision nodes “Hazard after T1?”, “Fault Detected After T2?” and “hazard?” (yes/no branches), event nodes First Fault and 2nd Fault, and outcomes System Unsafe and System Safe]

Latent Faults
• Any fault that is not detectable by the system during operation has a probability of 1 and doesn’t count in single fault tolerance assessment
[Figure: Backup H2 Valve Control and Main H2 Valve Control linked by a watchdog handshake]
• Stuck valves could be latent if the controllers cannot check their state. May as well assume that they are stuck!
• Detection might not mean diagnosis. If the system can detect a secondary effect of device failure before a hazard arises, then this could be considered safe.

Design for Safety
1. Hazard Identification and Fault Tree Analysis, FMEA
2. Risk Assessment
3. Define Safety Measures
4. Create Safe Requirements
5. Implement Safety
6. Assure Safety Process
7. Test, Test, Test, Test, Test

1. Hazard Identification: Ventilator Example
Human in Loop

Mishap              Severity  Tolerance time  Fault example                          Likelihood  Detection time  Mechanism                                             Exposure time
Hypoventilation     Severe    5 min           Vent fails (no pressure in reservoir)  Rare        30 sec          Indep. pressure sensor w/ alarm                       40 sec
                                              Esophageal intubation                  Medium      30 sec          CO2 sensor alarm                                      40 sec
                                              User misattaches breathing hoses       Never       N/A             Different mechanical fittings for intake and exhaust  N/A
Overpressurization  Severe    0.05 sec        Release valve failure                  Rare        0.01 sec        Secondary valve opens                                 0.01 sec
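
The timing columns of the ventilator table can be checked mechanically. Below is a minimal Python sketch, not part of the original slides. It assumes that a fault is adequately handled when its exposure time is shorter than the tolerance time of the mishap it can cause, that N/A means the fault has been designed out, and that the three faults listed after Hypoventilation all belong to that mishap. The names FaultRow, within_tolerance and unsafe_rows are made up for illustration; all times are converted to seconds.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FaultRow:
    """One row of the hazard table (hypothetical structure for illustration)."""
    mishap: str
    fault: str
    tolerance_s: float           # tolerance time of the mishap, in seconds
    exposure_s: Optional[float]  # exposure time in seconds; None where the table says N/A


# Values copied from the ventilator table (5 min = 300 s).
# Assumption: the second and third faults belong to Hypoventilation.
ROWS = [
    FaultRow("Hypoventilation", "Vent fails (no pressure in reservoir)", 300.0, 40.0),
    FaultRow("Hypoventilation", "Esophageal intubation", 300.0, 40.0),
    FaultRow("Hypoventilation", "User misattaches breathing hoses", 300.0, None),
    FaultRow("Overpressurization", "Release valve failure", 0.05, 0.01),
]


def within_tolerance(row: FaultRow) -> bool:
    """Return True if the fault is handled before the mishap can occur."""
    if row.exposure_s is None:
        # Assumption: N/A means the fault cannot occur by design.
        return True
    return row.exposure_s < row.tolerance_s


def unsafe_rows(rows: List[FaultRow]) -> List[FaultRow]:
    """Collect every row whose exposure time is not below its tolerance time."""
    return [row for row in rows if not within_tolerance(row)]


if __name__ == "__main__":
    for row in unsafe_rows(ROWS):
        print(f"UNSAFE: {row.mishap} / {row.fault}")
```

With the values from the table, unsafe_rows(ROWS) returns an empty list. Raising the exposure time of the release valve row above 0.05 sec would flag that row.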