TDDB47 Real Time Systems
Lecture 3: Dependability and fault tolerance
Calin Curescu
Real-Time Systems Laboratory, Department of Computer and Information Science, Linköping University, Sweden

The lecture notes are partly based on lecture notes by Simin Nadjm-Tehrani, Jörgen Hansson and Anders Törne. They also loosely follow Burns and Wellings' book “Real-Time Systems and Programming Languages”. These lecture notes should only be used for internal teaching purposes at Linköping University.

Dependability
• How can we produce systems that do their job, and how can we measure how well they do it?
  – How do things go wrong, and why?
  – What can we do about it?
• This lecture: a basic overview of fault-tolerant systems

Accidents/Mishaps
• 1987
  – Six patients in the USA and Canada received very high doses of radiation and severe burns from the cancer treatment system Therac 25.
  – Doses as high as 15,000-20,000 radiation units were given, compared with normal levels of about 200 units. Three patients died.
• 1994
  – TCAS is a system designed to avoid mid-air collisions between passenger planes.
  – Two commercial aircraft came as close as 1.6 km to each other while flying over Oregon, USA.
    • The minimum safety rule is about 4.8 km.

What is dependability?
• Property of a computing system which allows reliance to be justifiably placed on the service it delivers. [Avizienis et al.]

Dependability [Swedish: pålitlighet]
• IFIP WG 10.4

Attributes of dependability
• Safety – non-occurrence of catastrophic consequences on the environment
• Availability – readiness for usage
• Integrity – non-occurrence of unauthorized alteration of information
• Reliability – continuity of correct service
• Confidentiality – absence of unauthorized disclosure of information
• Maintainability – ability to undergo repairs and modifications

Reliability [Swedish: tillförlitlighet]
• Means that the system (functionally) behaves as specified, and does so continually over measured intervals of time.
  – Typical measure in aerospace: 10^-9 failures/hour

Faults, Errors & Failures
• Fault [Swedish: felkälla] – a defect within the system, or a situation that can lead to failure
• Error [Swedish: felyttring] – manifestation (symptom) of the fault: an unexpected behaviour
• Failure [Swedish: misslyckande] – the system not performing its intended function
• System state point of view
  – Failure – an unspecified external state
  – Error – an unspecified internal state
  – Fault – the reason for the illegal state transition

Transitions
• Failure → fault → error → failure → fault
• Systems are composed of components which are themselves systems, hence
  – a failure in a subcomponent is a fault in the component

What is a bug?
Fault sources
• Four sources of faults which can result in system failure
  – Inadequate specification
  – Design errors in software
  – Hardware component failure
  – Interference in the communication subsystem

Examples

Fault Types
• Transient fault
  – Starts at a particular time, remains in the system for some period and then disappears.
    • E.g. hardware components which have an adverse reaction to radioactivity.
  – Many faults in communication systems are transient.
• Permanent fault
  – Remains in the system until repaired.
    • E.g. a broken wire or a software design error.
• Intermittent fault
  – Is a transient fault that occurs repeatedly.
    • E.g. a heat-sensitive hardware component: it works for a time, stops working, cools down and then starts to work again.

Failure types
• Value domain
  – Value error
  – Constraint error
• Timing domain
  – Early
  – Late
    • Fail late
  – Omission
    • Fail silent
    • Fail stop
    • Fail controlled
  – Impromptu/Commission
• Arbitrary/Fail uncontrolled

Another failure classification
• Node failures
  – Crash
  – Omission
  – Byzantine
• Channel failures
  – Crash (and potential partitions)
  – Message loss
  – Erroneous/arbitrary messages

Fault, Error, Failure
• The goal of system verification and validation is to eliminate faults
• The goal of safety/risk analysis is to focus on important faults
• The goal of fault tolerance is to reduce the effects of errors if they appear
  – In order to eliminate or delay failures

Measuring Reliability and Availability
• Time from an initial instant to the next failure event
• Typical measures:
  – MTTF – mean time to failure
  – MTBF – mean time between failures
  – MTTR – mean time to repair
• Availability
  – Ratio of service time to elapsed time
    • MTTF/(MTTF+MTTR)
• On classifying failures: it is hard to differentiate an arbitrary failure from a commission failure followed by an omission failure

Achieving Reliability
• Fault prevention – attempts to eliminate any possibility of faults creeping into a system before it goes operational.
  – Stage 1: Fault avoidance techniques
  – Stage 2: Fault removal techniques
• Fault tolerance – enables a system to continue functioning even in the presence of faults.
• Both approaches attempt to produce systems which have desirable failure modes.

Fault Avoidance
• Limit the introduction of faults during system construction
• Hardware
  – use of the most reliable components within the given cost and performance constraints
  – use of thoroughly refined techniques for interconnection of components and assembly of subsystems
  – packaging the hardware to screen out expected forms of interference
• Software
  – rigorous, if not formal, specification of requirements
  – use of proven design methodologies
  – use of languages with facilities for data abstraction and modularity
  – use of software engineering environments to help manipulate software components and thereby manage complexity

Fault Removal
• In spite of fault avoidance, design errors in both hardware and software components will exist.
• Fault removal techniques
  – design reviews
  – program verification
  – code inspections
  – system testing
• Limitations of fault removal
  – A test can only show that a fault exists, not that no faults exist
  – It is sometimes impossible to test under realistic conditions
  – Errors that were introduced at the requirements stage may go undetected

Fault tolerance
• Means that a system provides a degraded (but acceptable) function
  – Even in the presence of faults
  – For a period defined by certain model assumptions
• Needed due to
  – Incomplete fault removal
  – Hardware failure
    • Inaccessible or infeasible repair and maintenance

Redundancy
• Redundant elements are so called because they are not required in a perfect system
  – They detect and recover from faults
  – They are used by all fault-tolerant techniques
• They increase overall complexity
  – This can lead to less reliable systems.
• Static vs. dynamic redundancy
• Open question: can redundancy address unforeseen faults?

Static redundancy
• Used in any case
• Error masking properties
• Hardware – N-modular redundancy
• Software – N-version programming

Dynamic redundancy
• Used only in case of error appearance
• Error detecting properties
• Software – recovery methods
• Hardware – switching to a back-up or recovery module
• Data – self-correcting codes
• Time – re-computing a result
• Open question: what is the influence on timeliness?

N-Version
• From D. Lardner, Edinburgh Review, 1834:
  – “The most certain and effectual check upon errors which arise in the process of computation is to cause the same computations to be made by separate and independent computers*; and this check is rendered still more decisive if their computations are carried out by different methods.”
  – * computers: people who compute

N-Version Programming
• Design diversity
• Independent generation of N (N > 2) functionally equivalent programs from the same initial specification
  – No interactions between groups.
• The programs execute
  – concurrently
  – with the same inputs
• Their results are compared by a driver process
  – Results should be identical
  – If they differ, the consensus result is taken to be correct.

Ultimate N-Version
• N-version specification
• N-version development teams
• N-version programming languages
• N-version hardware
• Ultra-reliable hardware voting

Consistent comparison problem
• To what extent can votes be compared?
• Text or integer arithmetic produces identical results
• Real numbers usually produce different values
• Need for inexact voting techniques
  – Temperature threshold example

Software Dynamic Redundancy
• Phases
  – Error detection
    • No fault tolerance scheme can be utilised until the associated error is detected.
  – Damage confinement and assessment
    • To what extent has the system been corrupted?
    • May depend on the detection delay
  – Error recovery
    • Get the system to a state from which it can continue its normal operation (perhaps with degraded functionality).
    • Forward vs. backward error recovery
  – Fault treatment and continued service
    • An error is a symptom of a fault; what should be done about the fault?

Error Detection
• Environmental detection (run-time system or hardware)
  – Hardware: e.g. illegal instruction
  – O.S.: e.g. null pointer
• Application detection
  – Replication checks, e.g. 2-version difference check
  – Timing checks
  – Watchdog timers: the application resets the timer periodically, otherwise an error is signalled
  – Reversal checks, e.g. checking a square root by squaring it
  – Coding checks, e.g. checksums
  – Reasonableness checks, e.g. range of variables
  – Structural checks, e.g. redundant pointers in data structures
  – Dynamic reasonableness checks

Forward error recovery
• Continues from an erroneous state by making selective corrections to the system state
  – May include making safe the controlled environment, which may be hazardous or damaged because of the error
  – It is system specific and depends on accurate predictions of the location and cause of errors (i.e. damage assessment)
• Examples: redundant pointers in data structures and the use of self-correcting codes such as Hamming codes

Backward Error Recovery
• Relies on restoring the system to a previous safe state and executing an alternative section of the program
  – The alternative has the same functionality but uses a different algorithm (cf. N-Version Programming)
  – Recovery point & checkpointing
    • Saving appropriate system state
• Advantage: the erroneous state is cleared, and recovery does not rely on finding the location or cause of the fault
• Can be used to recover from unanticipated faults, including design errors

Static (NV) vs. Dynamic (RB) Redundancy
(NV: N-version programming, RB: recovery blocks)
• Design overheads
  – Both require alternative algorithms; NV requires a driver, RB requires an acceptance test
• Runtime overheads
  – NV requires N * resources
  – RB requires establishing recovery points
• Diversity of design
  – Both are susceptible to errors in requirements
• Error detection
  – Vote comparison (NV)
  – Acceptance test (RB)
• Atomicity
  – NV votes before it outputs to the environment
  – RB must be instructed to output only after passing an acceptance test.
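
The NV/RB comparison above can be made concrete with a small sketch. It is an illustration under assumptions, not code from the lecture: the names `inexact_vote`, `n_version` and `recovery_block`, the square-root versions, the tolerance value, and the use of a dictionary as "system state" are all hypothetical. The NV part shows a driver doing inexact majority voting on real-valued results; the RB part shows a recovery point, an acceptance test (here a reversal check) and backward recovery before the alternate runs.

```python
import copy
import math
from typing import Callable, Dict, Sequence

State = Dict[str, float]  # hypothetical stand-in for the system state


def inexact_vote(results: Sequence[float], tolerance: float) -> float:
    """Return the mean of the first group of results that forms a strict
    majority within `tolerance` of one candidate; raise if there is none."""
    for candidate in results:
        agreeing = [r for r in results if abs(r - candidate) <= tolerance]
        if 2 * len(agreeing) > len(results):
            return sum(agreeing) / len(agreeing)
    raise RuntimeError("NV: no majority among the versions")


def n_version(versions: Sequence[Callable[[float], float]], x: float,
              tolerance: float = 1e-9) -> float:
    """Static redundancy: every version always runs, then the driver votes."""
    return inexact_vote([version(x) for version in versions], tolerance)


def recovery_block(state: State,
                   alternates: Sequence[Callable[[State], None]],
                   acceptance_test: Callable[[State], bool]) -> State:
    """Dynamic redundancy: try the alternates in order, rolling back on failure."""
    checkpoint = copy.deepcopy(state)            # establish the recovery point
    for alternate in alternates:
        try:
            alternate(state)
            if acceptance_test(state):
                return state                     # release output only after the test
        except Exception:
            pass                                 # an exception is a detected error
        state.clear()                            # backward recovery: drop bad state
        state.update(copy.deepcopy(checkpoint))
    raise RuntimeError("RB: every alternate failed the acceptance test")


# Hypothetical diverse versions of a square-root routine.
def sqrt_newton(x: float) -> float:
    guess = x if x > 1 else 1.0
    for _ in range(30):                          # fixed number of Newton steps
        guess = 0.5 * (guess + x / guess)
    return guess


def sqrt_faulty(x: float) -> float:
    return x / 2                                 # deliberate design error


# Hypothetical recovery-block alternates working on a shared state.
def primary_faulty(state: State) -> None:
    state["root"] = state["x"] / 2               # wrong result
    state["x"] = 0.0                             # and it corrupts its input


def alternate_sqrt(state: State) -> None:
    state["root"] = math.sqrt(state["x"])


def reversal_check(state: State) -> bool:
    return abs(state["root"] ** 2 - state["x"]) <= 1e-9


if __name__ == "__main__":
    print(n_version([math.sqrt, sqrt_newton, sqrt_faulty], 16.0))
    print(recovery_block({"x": 16.0}, [primary_faulty, alternate_sqrt],
                         reversal_check))
```

In this sketch the NV driver always pays for all N versions and masks the faulty one by voting, while the recovery block runs the alternate only after the acceptance test rejects the primary, at the cost of saving and restoring the checkpoint. Both still depend on the same specification of what a correct square root is.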