TDDD82 Secure Mobile Systems Lecture 5: Dependability

Mikael Asplund
Real-time Systems Laboratory
Department of Computer and Information Science
Linköping University
Based on slides by Simin Nadjm-Tehrani
– You can never answer for technical problems.
They are usually connected to computer problems or
traffic. It can also be updates and the like
that cause trouble. But that is
nothing I can answer, says Stefan
Gustafsson, press spokesperson for the police in
Region Väst.
Toyota
“Toyota settles acceleration lawsuit after $3 million verdict
Toyota heads off punitive damages after a $3 million jury verdict pointed to software defects
in a fatal crash. The case could fuel other
sudden acceleration lawsuits.” [LA Times,
October 26, 2013]
Expert witness conclusions:
● Toyota’s electronic throttle control system (ETCS) source code is of unreasonable quality.
● Toyota’s source code is defective and contains bugs, including bugs that can cause unintended acceleration (UA).
● Code-quality metrics predict presence of additional bugs.
● Toyota’s fail safes are defective and inadequate (referring to them as a “house of cards” safety architecture).
● Misbehaviors of Toyota’s ETCS are a cause of UA.
Dependability
Property of a computing system which allows
reliance to be justifiably placed on the service
it delivers.
[Avizienis et al.]
The ability to avoid service failures that
are more frequent or more severe than is
acceptable.
Dependability taxonomy
Fault-tolerant Distributed Systems
Redundancy
● Necessary for fault-tolerance!
● Increases overall complexity
● Static
– Error masking properties
● Dynamic
– Error detecting properties
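The difference between static and dynamic redundancy can be sketched in a few lines. This is an illustration only, not taken from the slides: the function names `mask_by_vote` and `detect_by_comparison` are hypothetical, and replica outputs are assumed to be directly comparable values.

```python
from collections import Counter


def mask_by_vote(outputs):
    """Static redundancy: return the strict-majority value, masking minority errors.

    Raises ValueError when no value has a strict majority.
    """
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise ValueError("no majority among replica outputs")
    return value


def detect_by_comparison(primary_output, shadow_output):
    """Dynamic redundancy: a mismatch reveals an error but does not say who is right."""
    return primary_output != shadow_output


# Hypothetical outputs: one of three replicas is faulty.
masked = mask_by_vote([42, 42, 17])
error_detected = detect_by_comparison(42, 17)
```

With three replicas the voter hides a single faulty output, whereas the two-replica comparison only signals that recovery is needed.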
N-version
From D. Lardner: Edinburgh Review, year 1824:
”The most certain and effectual check upon errors which
arise in the process of computation is to cause the same
computations to be made by separate and independent
computers*; and this check is rendered still more decisive if
their computations are carried out by different methods.”
* people who compute
Dependability & Distribution
• Making systems fault-tolerant typically uses redundancy
– Redundancy in space leads to distribution
– But distributed systems are not necessarily fault-tolerant!
Replication
• Passive replication
– Primary – backup
– Cold/Warm/Hot
• Active replication
– Group membership
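A minimal in-memory sketch of passive (primary-backup) replication with hot backups. The classes `Replica` and `PrimaryBackupGroup` are hypothetical, and the sketch assumes perfect crash detection through the `alive` flag, which a real system does not have.

```python
class Replica:
    """A key-value store replica (hypothetical, in-memory)."""

    def __init__(self, name):
        self.name = name
        self.store = {}
        self.alive = True

    def apply(self, key, value):
        self.store[key] = value


class PrimaryBackupGroup:
    """Passive replication: the primary executes writes and forwards them to backups."""

    def __init__(self, replicas):
        self.replicas = list(replicas)

    @property
    def primary(self):
        # The first live replica in a fixed order acts as primary.
        for replica in self.replicas:
            if replica.alive:
                return replica
        raise RuntimeError("no live replica")

    def write(self, key, value):
        primary = self.primary
        primary.apply(key, value)
        # Hot backups: every live backup is updated before the write returns.
        for replica in self.replicas:
            if replica.alive and replica is not primary:
                replica.apply(key, value)
        return primary.name

    def read(self, key):
        return self.primary.store.get(key)


group = PrimaryBackupGroup([Replica("a"), Replica("b")])
group.write("x", 1)
group.replicas[0].alive = False  # the primary crashes
value_after_failover = group.read("x")
new_primary = group.primary.name
```

Because the backup was updated before the write returned, the value survives the failover to replica "b".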
Consistency
● Linearizability
– Every write is atomic and instantaneous
● One-copy-serializability
– Every data item appears to all actors as being in a single location and concurrent transactions are executed as if in some serial order (isolation)
● Replica consistency
– Every data item appears to all actors as being in a single location
● Eventual consistency
– If no writes occur for some period of time the replicas will eventually converge to a common state
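Eventual consistency can be illustrated with a grow-only set, where merging is set union. This is one possible example and not a technique named in the slides; `GSetReplica` is a hypothetical class.

```python
class GSetReplica:
    """Grow-only set replica: merge is set union, so merge order does not matter."""

    def __init__(self):
        self.items = set()

    def add(self, item):
        self.items.add(item)

    def merge(self, other):
        # Union is commutative, associative and idempotent.
        self.items |= other.items


a, b = GSetReplica(), GSetReplica()
a.add("x")  # write accepted at replica a
b.add("y")  # concurrent write accepted at replica b
diverged = a.items != b.items

# Writes stop; the replicas exchange state with each other.
a.merge(b)
b.merge(a)
converged = a.items == b.items
```

While writes are arriving the replicas may disagree; once writes stop and state has been exchanged, both hold the same set.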
Implementing replica consistency
• Message ordering
– Use the happened-before relation (e.g., by using Lamport clocks)
• Agreement
– For passive replication
• Controlled by the primary
• Still requires agreement on when the primary is down...
– Active replication
• Agreement for every operation
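Message ordering with Lamport clocks can be sketched as below. The class `LamportClock` and its method names are illustrative, and message transport is left out: the timestamp is simply passed from sender to receiver.

```python
class LamportClock:
    """Logical clock: if event e happened before event f, then time(e) < time(f)."""

    def __init__(self):
        self.time = 0

    def tick(self):
        """Local event or send: advance the clock and return the timestamp."""
        self.time += 1
        return self.time

    def receive(self, msg_time):
        """Receive: move past both the local clock and the sender's timestamp."""
        self.time = max(self.time, msg_time) + 1
        return self.time


p, q = LamportClock(), LamportClock()
t_send = p.tick()           # p sends a message stamped 1
q.tick()                    # q has a local event (clock 1)
q.tick()                    # q has another local event (clock 2)
t_recv = q.receive(t_send)  # max(2, 1) + 1 = 3
```

The receive timestamp is always larger than the send timestamp. Ties between unrelated events on different processes are usually broken by process identifier to obtain a total order.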
Agreement is not just for replication
The consensus problem
• Processes p1,…, pn take part in a decision
– Each pi proposes a value vi
– All correct processes decide on a common value v that is equal to
one of the proposed values
• Desired properties
– Termination: Every correct process eventually decides
– Agreement: No two (correct) processes decide differently
– Validity: If a process decides v then the value v was proposed by
some process
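The three properties can be written as checks over one finished run. This is a sketch: `check_consensus` and its arguments are hypothetical, and termination can only be checked here as "every correct process has decided by the end of the run".

```python
def check_consensus(proposals, decisions, correct):
    """Check one finished run against the three properties.

    proposals: dict process -> proposed value
    decisions: dict process -> decided value (absent if the process never decided)
    correct:   set of processes that did not fail
    """
    termination = all(p in decisions for p in correct)
    agreement = len({decisions[p] for p in correct if p in decisions}) <= 1
    validity = all(v in proposals.values() for v in decisions.values())
    return {"termination": termination, "agreement": agreement, "validity": validity}


# Hypothetical run: p3 crashes before deciding, p1 and p2 decide "b".
good_run = check_consensus(
    proposals={"p1": "a", "p2": "b", "p3": "c"},
    decisions={"p1": "b", "p2": "b"},
    correct={"p1", "p2"},
)

# Hypothetical faulty run: p1 and p2 decide differently.
bad_run = check_consensus(
    proposals={"p1": "a", "p2": "b"},
    decisions={"p1": "a", "p2": "b"},
    correct={"p1", "p2"},
)
```

The first run satisfies all three properties even though p3 never decides, since termination only concerns correct processes. The second run violates agreement.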
Fault model
[Figure: regions labelled Non-tolerated faults, Tolerated faults, Normality]
Recall from previous lecture
● Node/Channel failures
– Crash
– Omission
– Timing
– Byzantine/arbitrary
● System model
– Synchronous
– Asynchronous
Basic impossibility result
[Fischer, Lynch and Paterson 1985]
• There is no deterministic algorithm solving the consensus
problem in an asynchronous distributed system with a
single crash failure.
Naïve approaches
● Wait for all to agree
– Node crash
● Wait for a majority to agree
– What about conflicts?
● When to move on?
Assume synchrony
● If a node does not respond within time t, it will not respond at time t+d
● Partial synchrony
– Bounds exist but are not known
● Powerful abstraction:
– Unreliable failure detectors
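A timeout-based failure detector can be sketched as follows. This is an assumption-laden illustration: `HeartbeatDetector` is a hypothetical class, time is passed in explicitly as `now`, and doubling the timeout after a false suspicion is just one simple way to adapt to an unknown bound.

```python
class HeartbeatDetector:
    """Suspects a node when no heartbeat has arrived within `timeout` time units."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}

    def heartbeat(self, node, now):
        if node in self.last_seen and now - self.last_seen[node] > self.timeout:
            # The node was suspected but is alive: the bound was too tight, relax it.
            self.timeout *= 2
        self.last_seen[node] = now

    def suspects(self, node, now):
        # A node that has never sent a heartbeat is suspected.
        return now - self.last_seen.get(node, float("-inf")) > self.timeout


fd = HeartbeatDetector(timeout=3)
fd.heartbeat("n1", now=0)
ok_at_2 = fd.suspects("n1", now=2)           # 2 time units of silence, not suspected
suspected_at_5 = fd.suspects("n1", now=5)    # 5 time units of silence, suspected
fd.heartbeat("n1", now=6)                    # n1 was only slow; timeout becomes 6
suspected_at_10 = fd.suspects("n1", now=10)  # 4 time units of silence, not suspected
```

The detector is unreliable in the sense that it can wrongly suspect a slow node. Under partial synchrony the growing timeout eventually exceeds the unknown bound, after which a live node is no longer falsely suspected.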
Paxos
● Solves the consensus problem in the asynchronous model
– Agreement
– Validity
– Termination is guaranteed under partial synchrony
● Standard protocol
– E.g., used by Google's Chubby lock service
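A heavily simplified sketch of single-decree Paxos, using direct method calls instead of messages. The names `Acceptor` and `propose` are illustrative, ballots are assumed to be unique per proposer, and unreachable acceptors are modelled by leaving them out of the `reachable` list. Learners, retries and leader election are omitted.

```python
class Acceptor:
    def __init__(self):
        self.promised = -1    # highest ballot promised so far
        self.accepted = None  # (ballot, value) last accepted, or None

    def prepare(self, ballot):
        """Phase 1: promise to ignore lower ballots; report any accepted value."""
        if ballot > self.promised:
            self.promised = ballot
            return True, self.accepted
        return False, None

    def accept(self, ballot, value):
        """Phase 2: accept unless a higher ballot has been promised."""
        if ballot >= self.promised:
            self.promised = ballot
            self.accepted = (ballot, value)
            return True
        return False


def propose(reachable, n_acceptors, ballot, value):
    """Run one ballot; return the chosen value, or None if the ballot fails."""
    majority = n_acceptors // 2 + 1

    # Phase 1: collect promises.
    promises = []
    for acceptor in reachable:
        ok, accepted = acceptor.prepare(ballot)
        if ok:
            promises.append(accepted)
    if len(promises) < majority:
        return None

    # If any acceptor already accepted a value, adopt the one with the highest ballot.
    prior = [acc for acc in promises if acc is not None]
    if prior:
        value = max(prior, key=lambda acc: acc[0])[1]

    # Phase 2: ask the acceptors to accept the value.
    votes = sum(1 for acceptor in reachable if acceptor.accept(ballot, value))
    return value if votes >= majority else None


acceptors = [Acceptor() for _ in range(3)]
# Proposer 1 reaches acceptors 0 and 1 (a majority) and proposes "x".
first = propose(acceptors[:2], 3, ballot=1, value="x")
# Proposer 2 reaches acceptors 1 and 2 and proposes "y" with a higher ballot.
second = propose(acceptors[1:], 3, ballot=2, value="y")
# A late proposer with a stale ballot gets no promises.
stale = propose(acceptors, 3, ballot=1, value="z")
```

The second proposer learns from acceptor 1 that "x" was already accepted and must adopt it, so both ballots choose the same value. That is the agreement property at work.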
Network partitions
● Network is split in multiple parts
– Link failures
– Mobility
● Classical approaches:
– Stop until healed
– Let a majority continue
– Optimistically continue and then reconcile
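The "let a majority continue" approach reduces to a quorum check. A minimal sketch, assuming the full set of nodes is known in advance; `may_continue` is a hypothetical helper.

```python
def may_continue(partition, all_nodes):
    """A partition may continue only if it holds a strict majority of all nodes."""
    members = set(partition) & set(all_nodes)
    return len(members) * 2 > len(set(all_nodes))


nodes = {"a", "b", "c", "d", "e"}
side_1, side_2 = {"a", "b", "c"}, {"d", "e"}
decisions = (may_continue(side_1, nodes), may_continue(side_2, nodes))

# With an even split neither side has a majority, so both must stop.
even_split = (
    may_continue({"a", "b"}, {"a", "b", "c", "d"}),
    may_continue({"c", "d"}, {"a", "b", "c", "d"}),
)
```

At most one side can hold a strict majority, so two partitions never make progress independently. The price is that an even split blocks everyone.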
Two generals
Theorem
There is no deterministic protocol which
guarantees timed agreement and progress
for an unreliable communication channel.
For the project
● Passive replication
● Need to think carefully about your fault model!
– Nodes/Channels
– Frequency