TDDB47 Real Time Systems
Lecture 4: Dependability and fault tolerance

Mikael Asplund
Real-Time Systems Laboratory
Department of Computer and Information Science
Linköping University, Sweden

The lecture notes are partly based on lecture notes by Calin Curescu, Simin Nadjm-Tehrani, Jörgen Hansson, Anders Törne. They also loosely follow Burns and Wellings' book “Real-Time Systems and Programming Languages”. These lecture notes should only be used for internal teaching purposes at Linköping University.

Accidents/Mishaps
• Therac 25
  – Six patients in the USA and Canada received very high doses of radiation and severe burns from the cancer treatment system Therac 25.
  – Doses as high as 15,000-20,000 radiation units were given, compared with normal levels of about 200 units. Five died.
• Ariane 5
  – Software reuse
  – ~$500 million lost

Exercise

What is dependability?
• Property of a computing system which allows reliance to be justifiably placed on the service it delivers. [Avizienis et al.]

Attributes of dependability [Sv. Pålitlighet]
• Safety – non-occurrence of catastrophic consequences on the environment
• Availability – readiness for usage
• Integrity – non-occurrence of unauthorized alteration of information
• Reliability – continuity of correct service
• Confidentiality – absence of unauthorized disclosure of information
• Maintainability – ability to undergo repairs and modifications

Faults, Errors & Failures
• Fault [sv. felkälla] – a defect within the system or a situation that can lead to failure
• Error [sv. felyttring] – manifestation (symptom) of the fault, an unexpected behaviour
• Failure [sv. misslyckande] – system not performing its intended function

Transitions
• failure → fault → error → failure → fault
• Systems are composed of components which are themselves systems, hence
  – a failure in a subcomponent is a fault in the component

What is a bug?

Fault, Error, Failure
• Goal of system verification and validation is to eliminate faults
• Goal of safety/risk analysis is to focus on important faults
• Goal of fault tolerance is to reduce effects of errors if they appear

Measuring Reliability and Availability
• Time from an initial instant to the next failure event
• Typical measures:
  – MTTF – mean time to failure
  – MTBF – mean time between failures
  – MTTR – mean time to repair
  – In order to eliminate or delay failures
• Availability
  – Ratio of service time to elapsed time
  – MTTF/(MTTF+MTTR)

Dependability: how to avoid failures?
• Fault prevention
• Fault tolerance
• Fault removal
• Fault forecasting
(IFIP WG 10.4)

External factors
Movie.

Time and Duration of Faults
• Transient fault
  – E.g. hardware components which have an adverse reaction to radioactivity.
• Permanent fault
  – E.g. a broken wire or a software design error.
• Intermittent fault
  – E.g. a hardware component that is heat sensitive: it works for a time, stops working, cools down and then starts to work again.
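
Returning to the availability ratio MTTF/(MTTF+MTTR) from the measures above, a small worked sketch may help make the numbers concrete. The function names, the use of hours as the time unit, and the 8760-hour year are assumptions made for illustration only; they do not come from the lecture.

```python
# Illustrative sketch: availability as the ratio MTTF / (MTTF + MTTR).
# Units (hours) and helper names are assumptions, not part of the lecture.

HOURS_PER_YEAR = 24 * 365  # 8760 hours, ignoring leap years (assumption)


def availability(mttf: float, mttr: float) -> float:
    """Return MTTF / (MTTF + MTTR); both arguments use the same time unit."""
    if mttf < 0 or mttr < 0:
        raise ValueError("MTTF and MTTR must be non-negative")
    if mttf + mttr == 0:
        raise ValueError("MTTF + MTTR must be positive")
    return mttf / (mttf + mttr)


def expected_downtime_per_year(mttf: float, mttr: float) -> float:
    """Expected hours per year during which the service is not available."""
    return (1.0 - availability(mttf, mttr)) * HOURS_PER_YEAR


if __name__ == "__main__":
    a = availability(999.0, 1.0)
    d = expected_downtime_per_year(999.0, 1.0)
    print(f"availability = {a:.4f}")
    print(f"expected downtime = {d:.2f} hours/year")
```

With an MTTF of 999 hours and an MTTR of 1 hour the ratio is 0.999, which corresponds to roughly 8.76 hours of downtime per year. Halving the repair time improves availability about as much as doubling the time to failure, which is one reason repair time matters as much as failure rate.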

Failure types
• Value domain
• Timing domain
  – Early
  – Late
  – Omission
• Commission
• Arbitrary

Fault models
• Node failures
  – Crash
  – Omission
  – Byzantine
• Channel failures
  – Crash (and potential partitions)
  – Message loss
  – Erroneous/arbitrary messages

Redundancy
• Necessary!
• Increases overall complexity
• Static
  – Error masking properties
  – Hardware – N-modular redundancy
  – Software – N-version programming
  – Influence on timeliness?
• Dynamic
  – Error detecting properties
  – Software – recovery methods
  – Hardware – switching to back-up or recovery module
  – Data – self-correcting codes
  – Time – re-computing a result

N-Version
• From D. Lardner: Edinburgh Review, year 1834:
  – “The most certain and effectual check upon errors which arise in the process of computation is to cause the same computations to be made by separate and independent computers*; and this check is rendered still more decisive if their computations are carried out by different methods.”
* people who compute

N-Version Programming
• Design diversity
• Independent generation of N (N > 2) functionally equivalent programs from the same initial specification
• No interactions between groups.
• The programs execute
  – concurrently
  – with the same inputs
• Their results are compared by a driver process
  – Results should be identical
  – If different, the consensus result is taken to be correct.

Ultimate N-Version
• N-version specification
• N-version development teams
• N-version programming languages
• N-version hardware
• Ultra-reliable hardware voting
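
The driver process in N-version programming can be sketched as a majority voter. This is a minimal illustration under stated assumptions: the names `majority_vote`, `run_n_versions` and `NoConsensusError` are hypothetical, the versions are plain functions that always return a value, results are compared for exact equality, and the versions run one after another rather than concurrently as the slide describes.

```python
# Illustrative sketch of an N-version driver with exact majority voting.
# All names are hypothetical; versions are run sequentially for simplicity,
# whereas the scheme in the lecture executes them concurrently.
from collections import Counter
from typing import Callable, Hashable, Sequence


class NoConsensusError(Exception):
    """Raised when no result is produced by a strict majority of versions."""


def majority_vote(results: Sequence[Hashable]) -> Hashable:
    """Return the value produced by more than half of the versions."""
    if not results:
        raise NoConsensusError("no results to vote on")
    value, count = Counter(results).most_common(1)[0]
    if 2 * count <= len(results):
        raise NoConsensusError(f"no strict majority among {list(results)}")
    return value


def run_n_versions(versions: Sequence[Callable[[int], int]], x: int) -> int:
    """Feed the same input to every version and vote on the outputs."""
    results = [version(x) for version in versions]
    return majority_vote(results)


# Three "independently developed" versions of: sum of the integers 1..n.
def version_a(n: int) -> int:
    return n * (n + 1) // 2          # closed-form formula


def version_b(n: int) -> int:
    return sum(range(n + 1))         # explicit summation


def version_c(n: int) -> int:
    return n * n // 2                # contains a deliberate design fault


if __name__ == "__main__":
    print(run_n_versions([version_a, version_b, version_c], 10))
```

For the input 10 the three versions return 55, 55 and 50, so the faulty version is outvoted and the error is masked. Exact equality works here because the results are integers; results computed with real numbers generally need a tolerance, which is where voting becomes harder.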

Consistent comparison problem
• To what extent can votes be compared?
• Text or integer arithmetic produces identical results
• Real numbers usually produce different values

Software Dynamic Redundancy
• Error detection
  – Environmental
  – Application
• Damage confinement and assessment
• Error recovery
  – Forward vs. backward error recovery
• Fault treatment and continued service

Forward error recovery
• Continues from an erroneous state by making selective corrections to the system state
• Examples:
  – redundant pointers in data structures
  – the use of self-correcting codes such as Hamming codes

Backward error recovery
• Relies on restoring the system to a previous safe state and executing an alternative section of the program
• Advantages
  – the erroneous state is cleared and it does not rely on finding the location or cause of the fault
  – can be used to recover from unanticipated faults including design errors
• Disadvantages
  – Takes time
  – Might not be acceptable

Dynamic redundancy and exceptions
• An exception can be defined as the occurrence of an error
• Exception handling is a forward error recovery mechanism
• An exception handling facility can, however, also be used to provide backward error recovery

Reading
• Chapters 5, 6.5 and 12.8 of Burns & Wellings
• Dependable Systems: IFIP terminology as described in [Avizienis et al. 2004]
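
As an illustration of how an exception handling facility can provide backward error recovery, the following sketch saves a recovery point, tries a primary routine, and on an error restores the saved state before trying an alternative. The structure (a checkpoint, an ordered list of alternatives, and an acceptance test) is one possible arrangement chosen for this example, not something prescribed by the slides, and all names are hypothetical.

```python
# Illustrative sketch: backward error recovery built on exceptions.
# The checkpoint/alternatives/acceptance-test structure and all names are
# assumptions made for this example.
import copy
from typing import Callable, Dict, Sequence

State = Dict[str, object]


class AcceptanceTestFailed(Exception):
    """Raised when an alternative produces a state that fails the check."""


def run_with_recovery(
    state: State,
    alternatives: Sequence[Callable[[State], None]],
    acceptance_test: Callable[[State], bool],
) -> State:
    checkpoint = copy.deepcopy(state)        # establish the recovery point
    for alternative in alternatives:
        try:
            alternative(state)               # may corrupt state or raise
            if not acceptance_test(state):
                raise AcceptanceTestFailed(alternative.__name__)
            return state                     # error-free result accepted
        except Exception:
            # Backward recovery: discard the erroneous state and restore
            # the previous safe state before trying the next alternative.
            state.clear()
            state.update(copy.deepcopy(checkpoint))
    raise RuntimeError("all alternatives failed; state has been restored")


def primary_sort(state: State) -> None:
    state["data"] = sorted(state["data"])[1:]   # design fault: drops an item


def alternate_sort(state: State) -> None:
    state["data"] = sorted(state["data"])


def is_sorted_and_complete(state: State) -> bool:
    data = state["data"]
    return len(data) == state["n"] and data == sorted(data)


if __name__ == "__main__":
    s = {"data": [3, 1, 2], "n": 3}
    print(run_with_recovery(s, [primary_sort, alternate_sort],
                            is_sorted_and_complete)["data"])
```

The handler never inspects what went wrong; it only restores the checkpoint. That mirrors the advantage listed for backward recovery (no need to locate the fault), and the deep copies hint at the cost in time and memory.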