Dependability and real-time TDDD07/TDDC47 Real-time Systems/Real-time and Concurrent programming • If a system is to produce results within time constraints, it needs to produce results at all! Lecture 8/9 • How to justify and measure how well computer systems meet their targets in presence of faults? Dependability & Fault tolerance Simin Nadjm-Tehrani Real-time Systems Laboratory Department of Computer and Information Science Linköping University Undergraduate course on Real-time Systems Linköping University 50 pages Autumn 2010 Dependable systems • How do things go wrong and why? • What can we do about it? – This lecture: Basic notions of dependability and redundancy in fault-tolerant systems Undergraduate course on Real-time Systems Linköping University 2 of 50 Autumn 2010 Early computer systems • 1944: Real-time computer system in the Whirlwind project at MIT, used in a military air traffic control system 1951 • Short life of vaccum tubes gave mean ti time to t failure f il off 20 minutes i t 3 of 50 Autumn 2010 Engineers: Fool me once, shame on you – fool me twice, shame on me Undergraduate course on Real-time Systems Linköping University Undergraduate course on Real-time Systems Linköping University 5 of 50 Autumn 2010 Undergraduate course on Real-time Systems Linköping University 4 of 50 Autumn 2010 Software developers: Fool me N times, who cares, this is complex and anyway no one expects software to work... Undergraduate course on Real-time Systems Linköping University 6 of 50 Autumn 2010 1 FT - June 16, 2004 • "If you have a problem with your Volkswagen the likelihood that it was a software problem is very high. Software technology is not something that we as car manufacturers feel comfortable with.” Bernd Pischetsrieder, chief executive of Volkswagen Undergraduate course on Real-time Systems Linköping University 7 of 50 Autumn 2010 October 2005 • “Automaker Toyota announced a recall of 160,000 of its Prius hybrid vehicles following reports of vehicle warning lights illuminating for no reason, and cars' gasoline engines stalling unexpectedly.” Wired 05-11-08 • The problem was found to be an embedded software bug Undergraduate course on Real-time Systems Linköping University February 2, 2004 • Angel Eck, driving a 1997 Pontiac Sunfire found her car racing at high speed and accelerating on Interstate 70 for 45 minutes, heading toward Denver • ... with no effect from trying the brakes, shifting to neutral, and shutting off the ignition. Driver support: Volvo cars 1984 ABS Anti-lock 2004 Blind Spot Information system ((BLIS)) 1998 Dynamic Stability 2006 Active Bi-Xenon lights 2002 Roll Stability Control 2006 Adaptive Cruise Control (ACC) 2003 Intelligent Driver Collision warning system with brake support Braking System and Traction Control (DSTC) (RSC) Information System (IDIS) Undergraduate course on Real-time Systems Linköping University 9 of 50 Autumn 2010 Undergraduate course on Real-time Systems Linköping University Early space and avionics • During 1955, 18 air carrier accidents in the USA (when only 20% of the public was willing to fly!) 8 of 50 Autumn 2010 2006 Source: Volvo Cars 10 of 50 Autumn 2010 Airbus 380 • Integrated modular avionics (IMA), with safety-critical digital components, e.g. – Power-by-wire: complementing the hydraulic powered flight control surfaces • Today’s complexity many times higher – Cabin pressure control (implemented with a TTP operated bus) Undergraduate course on Real-time Systems Linköping University 11 of 50 Autumn 2010 Undergraduate course on Real-time Systems Linköping University 12 of 50 Autumn 2010 2 Dependability What is dependability? Property of a computing system which allows reliance to be justifiably placed on the service it delivers. [Avizienis et al.] • How do things go wrong? • Why? • What can we do about it? • Basic notions in dependable systems and replication for fault tolerance The ability to avoid service failures that are more frequent or more severe than is acceptable. (sv. Pålitlighet) (sv. Pålitliga datorsystem) Undergraduate course on Real-time Systems Linköping University 13 of 50 Autumn 2010 Undergraduate course on Real-time Systems Linköping University Attributes of dependability IFIP WG 10.4 definitions: • Safety: absence of harm to people and environment • Availability: the readiness for correct service • Integrity: absence of improper system alterations • Reliability: continuity of correct service • Maintainability: ability to undergo modifications and repairs Undergraduate course on Real-time Systems Linköping University 15 of 50 Autumn 2010 Reliability [Sv. Tillförlitlighet] • Means that the system (functionally) behaves as specified, and does it continually over measured intervals of time • Measured as probability of failure, e.g. 10-9 • Another way of putting it: MTTF – In commercial flight systems - One failure in 109 flight hours Undergraduate course on Real-time Systems Linköping University Faults, Errors & Failures • Fault: a defect within the system or a situation that can lead to failure • Error: manifestation (symptom) of the fault - an unexpected behaviour • Failure: system not performing its intended function 14 of 50 Autumn 2010 16 of 50 Autumn 2010 Examples • Year 2000 bug • Bit flips in hardware due to cosmic radiation in space • Loose wire • Air craft retracting its landing gear while on ground Effects in time: Permanent/ transient/ intermittent Undergraduate course on Real-time Systems Linköping University 17 of 50 Autumn 2010 Undergraduate course on Real-time Systems Linköping University 18 of 50 Autumn 2010 3 Fault ⇒ Error ⇒ Failure • Goal of system verification and validation is to eliminate faults Four approaches [IFIP 10.4]: Some will remain… 1. 2. 3. 4. • Goal of hazard/risk analysis is to focus on important faults • Goal of fault tolerance is to reduce effects of errors if they appear eliminate or delay failures Undergraduate course on Real-time Systems Linköping University Dependability techniques Fault Fault Fault Fault p prevention removal tolerance forecasting Reading: Section 5.1 & 5.3 of article TDDD07: lecture 7 Let’s look at an example! 19 of 50 Autumn 2010 Undergraduate course on Real-time Systems Linköping University Google’s 100 min outage September 2, 2009: A small fraction of Gmail’s servers were taken offline to perform routine upgrades. “We had slightly underestimated the load which some recent changes (ironically, some designed to improve service availability) placed on the request routers .” 20 of 50 Autumn 2010 Dependability techniques Four approaches [IFIP 10.4]: 1. 2. 3. 4. Fault Fault Fault Fault prevention removal tolerance forecasting Fault forecasting was applicable! Undergraduate course on Real-time Systems Linköping University 21 of 50 Autumn 2010 Undergraduate course on Real-time Systems Linköping University Fault tolerance • Means that a system provides a degraded (but acceptable) function – Even in presence of faults – During a period defined by certain model assumptions 22 of 50 Autumn 2010 External factors The film... • Foreseen or unforeseen? – Fault model describes the foreseen faults Undergraduate course on Real-time Systems Linköping University 23 of 50 Autumn 2010 Undergraduate course on Real-time Systems Linköping University 24 of 50 Autumn 2010 4 Fault models • Leading to Node failures – – – – • Fault/error detection – By program or its environment • Fault containment by architectural choices • Fault tolerance using redundancy – software – hardware – Data Crash Omission Timing Byzantine • In distributed systems: can also have Channel failures – – – – Crash (and potential partitions) Message loss Message delay Erroneous/arbitrary messages Undergraduate course on Real-time Systems Linköping University On-line fault management 25 of 50 Autumn 2010 Undergraduate course on Real-time Systems Linköping University Redundancy 26 of 50 Autumn 2010 Static Redundancy From D. Lardner: Edinburgh Review, year 1824: ”The most certain and effectual check upon errors which arise in the process of computation is to cause the same computations to be made by separate and independent computers*; and this check is rendered still more decisive if their computations are carried out by different methods.” Used all the time (whether an error has appeared or not), just in case… – SW: N-version programming – HW: Voting systems – Data: parity bits, checksums * people who compute Undergraduate course on Real-time Systems Linköping University 27 of 50 Autumn 2010 Undergraduate course on Real-time Systems Linköping University Dynamic Redundancy Used when error appears and specifically aids the treatment 28 of 50 Autumn 2010 Server replication models • Passive replication • Active replication – SW: Rollback recovery, Exceptions – HW: Switching to back-up module – Data: Self-correcting codes – Time: Re-computing a result X : Denotes a replica group Undergraduate course on Real-time Systems Linköping University 29 of 50 Autumn 2010 Undergraduate course on Real-time Systems Linköping University 30 of 50 Autumn 2010 5 Underlying mechanisms • Active replication – Replicas must be deterministic in their computations (same output on same sequence of inputs) – Relies on group membership protocol: who is up, who is down? For high availability • In both cases we need to – Respect message ordering • Implicit agreement among replicas • Passive replication – Primary – backup: brings the secondary server up to date – After each “write”, or periodically? Undergraduate course on Real-time Systems Linköping University 31 of 50 Autumn 2010 Relating states 32 of 50 Autumn 2010 The consensus problem Assume that we have a reliable broadcast • Needs a notion of time (or event ordering) in distributed systems • TDDC47: Recall nodes in TTP share a global clock • Processes p1,,…,p ,pn take p part in a decision • Each pi proposes a value vi – e.g. application state info • All non-faulty processes decide on a common value v that is equal to one of the proposed values • TDDD07: Recall from lecture 4 – Synchronised model – Asynchronous systems: • Abstract time via order of events • Notion of snapshot Undergraduate course on Real-time Systems Linköping University Undergraduate course on Real-time Systems Linköping University 33 of 50 Autumn 2010 Desired properties • Every non-faulty process eventually decides (Termination) • No two non-faulty processes decide differently (Agreement) process decides v then the value v was • If a p proposed by some process (Validity) Undergraduate course on Real-time Systems Linköping University 34 of 50 Autumn 2010 Basic impossibility result [Fischer, Lynch and Paterson 1985] There is no deterministic algorithm solving the consensus problem in an asynchronous distributed system with a single crash failure In presence of given failure model Algorithms for consensus have to be proven to have these properties Undergraduate course on Real-time Systems Linköping University 35 of 50 Autumn 2010 Undergraduate course on Real-time Systems Linköping University 36 of 50 Autumn 2010 6 Synchrony helps... • Computations at each node proceed in rounds initiated by pulses • Pulses implemented using local physical clocks, synchronised assuming bounded message delays Undergraduate course on Real-time Systems Linköping University Byzantine agreement Problem: • Nodes need to agree on a decision but some nodes are faulty • Faulty nodes can act in an arbitrary way (can be malicious) Undergraduate course on Real-time Systems Linköping University 37 of 50 Autumn 2010 Byzantine agreement protocol 38 of 50 Autumn 2010 Result from 1980 • Theorem: There is an upper bound t for the number of byzantine node failures compared to the size of the network N, N ≥ 3t+1 • Gives a t+1 round algorithm for solving consensus in a synchronous network • Solved consensus in presence of synchrony • Proposed by Pease, Shostak and Lamport [1980] How many faulty nodes can the algorithm tolerate? • Here: – We just demonstrate that t nodes would not be enough! Undergraduate course on Real-time Systems Linköping University Undergraduate course on Real-time Systems Linköping University 39 of 50 Autumn 2010 40 of 50 Autumn 2010 Scenario 1 • G and L1 are correct, L2 is faulty • G and L2 are correct, L1 is faulty G G 0 0 1 0 0 1 0 1 G G 1 1 L1 Scenario 2 1 G said 1 L2 Undergraduate course on Real-time Systems Linköping University L1 G said 0 L2 41 of 50 Autumn 2010 L1 0 G said 1 L2 Undergraduate course on Real-time Systems Linköping University L1 G said 0 L2 42 of 50 Autumn 2010 7 Scenario 3 … does not work with t=1, N=3! • The general is faulty! G G 0 1 1 L1 • Seen from L1, scenario 1 and 3 are identical, so if L1 decides 1 in scenario 1 it will decide 1 in scenario 3 0 1 • Similarly for L2, if it decides 0 in scenario 2 it decides 0 in scenario 3 G said 1 L2 0 L1 2-round algorithm G said 0 Undergraduate course on Real-time Systems Linköping University L2 43 of 50 Autumn 2010 Summary • Dependable systems need to justify why services can be relied upon • To tolerate faults we normally deploy replication • Replication needs (some kind of) agreement on state • Agreement problems are hard to solve & prove correct in distributed systems • Require assumptions on faults, and (some kind of) synchrony Undergraduate course on Real-time Systems Linköping University 45 of 50 Autumn 2010 • L1 and L2 do not agree in scenario 3! Undergraduate course on Real-time Systems Linköping University Real-time communication • Remember in a TTP bus: – Nodes have synchronised clocks – The communication infrastructure (collection of CCs and the TTP bus communication) guarantees a bounded time for a new data appearing on comm. interface of a node (its CNI) – Byzantine faults in nodes can be detected by other nodes! • Exercise: Which faults are detectable in a system connected with a CAN bus? Undergraduate course on Real-time Systems Linköping University Timing and fault tolerance • Implementing fault tolerance often relies on support for timing guarantees 46 of 50 Autumn 2010 Timing overhead • Does fault tolerance cost time? • Two examples: – Fault-tolerant scheduling g – Replicated server • How do fault tolerance mechanisms affect application timeliness? Undergraduate course on Real-time Systems Linköping University 44 of 50 Autumn 2010 47 of 50 Autumn 2010 Undergraduate course on Real-time Systems Linköping University 48 of 50 Autumn 2010 8 Fault-tolerant scheduling • Assume we are running RMS • How can we deal with transient faults that lead to a process crash? To the client Client request Checkpointing request • We can reschedule the process before its deadline! – Redundancy in time... Undergraduate course on Real-time Systems Linköping University Highly available servers Logging Application Server Server Request replies • Optimal checkpointing interval for achieving highest availability 49 of 50 Autumn 2010 Undergraduate course on Real-time Systems Linköping University 50 of 50 Autumn 2010 9