TDDD82 Secure Mobile Systems
Lecture 5: Dependability

Mikael Asplund
Real-time Systems Laboratory
Department of Computer and Information Science
Linköping University

Based on slides by Simin Nadjm-Tehrani

– "Technical problems are something you can never answer for. They are usually related to computer problems or to traffic. It can also be updates and the like that cause trouble. But that is nothing I can answer," says Stefan Gustafsson, press spokesperson for the police in Region Väst.

Toyota

"Toyota settles acceleration lawsuit after $3-million verdict. Toyota heads off punitive damages after a $3-million jury verdict pointed to software defects in a fatal crash. The case could fuel other sudden acceleration lawsuits." [LA Times, October 26, 2013]

Expert witness conclusions:
● Toyota's electronic throttle control system (ETCS) source code is of unreasonable quality.
● Toyota's source code is defective and contains bugs, including bugs that can cause unintended acceleration (UA).
● Code-quality metrics predict the presence of additional bugs.
● Toyota's fail-safes are defective and inadequate (referring to them as a "house of cards" safety architecture).
● Misbehaviors of Toyota's ETCS are a cause of UA.

Dependability

Property of a computing system which allows reliance to be justifiably placed on the service it delivers. [Avizienis et al.]

The ability to avoid service failures that are more frequent or more severe than is acceptable.

Dependability taxonomy

Fault-tolerant Distributed Systems

Redundancy
● Necessary for fault tolerance!
● Increases overall complexity
● Static
  – Error-masking properties
● Dynamic
  – Error-detecting properties

N-version

From D.
Lardner: Edinburgh Review, 1824:

"The most certain and effectual check upon errors which arise in the process of computation is to cause the same computations to be made by separate and independent computers*; and this check is rendered still more decisive if their computations are carried out by different methods."

* people who compute

Dependability & Distribution
• Making systems fault-tolerant typically uses redundancy
  – Redundancy in space leads to distribution
  – But distributed systems are not necessarily fault-tolerant!

Replication
• Passive replication
  – Primary–backup
  – Cold/Warm/Hot
• Active replication
  – Group membership

Consistency
● Linearizability
  – Every write is atomic and instantaneous
● One-copy-serializability
  – Every data item appears to all actors as being in a single location, and concurrent transactions are executed as if in some serial order (isolation)
● Replica consistency
  – Every data item appears to all actors as being in a single location
● Eventual consistency
  – If no writes occur for some period of time, the replicas will eventually converge to a common state

Implementing replica consistency
• Message ordering
  – Use the happened-before relation (e.g., by using Lamport clocks)
• Agreement
  – For passive replication
    • Controlled by the master
    • Still requires agreement on when the primary is down...
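Deciding when the primary is down is itself non-trivial in an asynchronous system. A minimal timeout-based detector (a hypothetical sketch, not part of the lecture material; class and method names are invented) might look like this:

```python
import time


class TimeoutFailureDetector:
    """Suspect the primary if no heartbeat arrives within `timeout` seconds.

    In a purely asynchronous system such a detector is unreliable: a slow
    but alive primary can be wrongly suspected, which is why failure
    detectors are treated as an (unreliable) abstraction."""

    def __init__(self, timeout: float) -> None:
        self.timeout = timeout
        self.last_heartbeat = time.monotonic()

    def heartbeat(self) -> None:
        # Called whenever a heartbeat message from the primary arrives.
        self.last_heartbeat = time.monotonic()

    def primary_suspected(self) -> bool:
        # Suspect the primary once the silence exceeds the timeout.
        return time.monotonic() - self.last_heartbeat > self.timeout
```

Note that choosing the timeout is exactly the synchrony assumption discussed below: too short and a live primary is suspected, too long and failover is delayed.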
  – For active replication
    • Agreement for every operation

Agreement is not just for replication

The consensus problem
• Processes p1, ..., pn take part in a decision
  – Each pi proposes a value vi
  – All correct processes decide on a common value v that is equal to one of the proposed values
• Desired properties
  – Termination: every correct process eventually decides
  – Agreement: no two (correct) processes decide differently
  – Validity: if a process decides v, then the value v was proposed by some process

Fault model
[figure with regions: normality, tolerated faults, non-tolerated faults]

Recall from previous lecture
● Node/channel failures
  – Crash
  – Omission
  – Timing
  – Byzantine/arbitrary
● System model
  – Synchronous
  – Asynchronous

Basic impossibility result [Fischer, Lynch and Paterson, 1985]
• There is no deterministic algorithm solving the consensus problem in an asynchronous distributed system with a single crash failure.

Naïve approaches
● Wait for all to agree
  – Node crash?
● Wait for a majority to agree
  – What about conflicts? When to move on?

Assume synchrony
● If a node does not respond within time t, it will not respond at time t+d
● Partial synchrony
  – Bounds exist but are not known
● Powerful abstraction:
  – Unreliable failure detectors

Paxos
● Solves the consensus problem in the asynchronous model
  – Agreement
  – Validity
  – Termination is guaranteed under partial synchrony
● Standard protocol
  – E.g., Google Chubby

Network partitions
● Network is split into multiple parts
  – Link failures
  – Mobility
● Classical approaches:
  – Stop until healed
  – Let a majority continue
  – Optimistically continue and then reconcile

Two generals

Theorem: There is no deterministic protocol which guarantees timed agreement and progress for an unreliable communication channel.

For the project
● Passive replication
● Need to think carefully about your fault model!
  – Nodes/channels
  – Frequency
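The passive replication recommended for the project can be sketched as follows. This is a minimal, hypothetical hot-backup variant (class and method names are invented): the primary executes each update and pushes the resulting state change to every backup before acknowledging the client.

```python
class Replica:
    """A backup: holds state and applies updates forwarded by the primary."""

    def __init__(self) -> None:
        self.state: dict = {}

    def apply(self, key, value) -> None:
        self.state[key] = value


class Primary(Replica):
    """Primary in a primary-backup (passive) replication scheme."""

    def __init__(self, backups: list) -> None:
        super().__init__()
        self.backups = backups

    def write(self, key, value) -> str:
        # 1. Execute the update on the primary's own state.
        self.apply(key, value)
        # 2. Forward the state change to every backup (a "hot" backup;
        #    cold/warm schemes would forward less eagerly).
        for backup in self.backups:
            backup.apply(key, value)
        # 3. Acknowledge the client only once all backups are up to date.
        return "ack"
```

If the primary crashes, any backup that has applied all acknowledged updates can take over; but as the agreement slides above show, the group still has to agree on *when* the primary is actually down before promoting a backup.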