Fault Tolerance: Byzantine FT (TvS:Ch7)

System Reliability and Fault Tolerance
Reliable Communication
Byzantine Fault Tolerance
Recap: Replication
 Write is handled only by the remote primary server, and the backups are
updated accordingly; reads are performed locally.
 Replicated services:
– Sun Network Info Service (NIS, formerly Yellow Pages)
– FAB: Building distributed enterprise disk arrays from commodity
components, ASPLOS’04
Reliable Point-to-Point Comm.
 Failure Models
– Process failure: sender vs receiver
» Fail-stop: a process crash that can be detected by other processes
» How to detect such a crash? A timeout can indicate only that a process is
not responding
– Comm. failure
» send failure: A process completes a send, but the msg is not put in its
outgoing msg buffer
» receive failure: A msg is in incoming buf, but is not received by a
process
» Channel failure: fail while msg is transmitted from outgoing buf to
incoming buf
– Arbitrary failure (Byzantine failure)
» Any type of error may occur, e.g., returning a wrong value.
 Reliable comm:
– Validity: Any msg in the outgoing buf is eventually delivered to the
incoming buf
– Integrity: The msg received is identical to the one sent and no msgs
are delivered twice.
RPC Failure Semantics
 Five possible failures:
– The client is unable to locate the server
» Server is down, or the stub mismatches the skeleton
» Throw UnknownHostException
– The request message is lost
» Start a timer with each request
» Retransmit the request message on timeout (see the sketch below)
– The server crashes after receiving a request
– The reply message is lost
– The client crashes after sending a request
 In Java, all remote methods must be prepared to catch RemoteException
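A minimal sketch of the timer-and-retransmit scheme for the lost-request case, assuming a hypothetical asynchronous RpcChannel (not part of java.rmi); retrying without duplicate filtering gives at-least-once semantics.

```java
// Hypothetical at-least-once RPC client: start a timer per request and
// retransmit until a reply arrives or retries are exhausted.
import java.rmi.RemoteException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class RetryingClient {
    interface RpcChannel {
        Future<byte[]> send(byte[] request);  // asynchronous send; future completes on reply
    }

    private final RpcChannel channel;
    private final long timeoutMs;
    private final int maxRetries;

    RetryingClient(RpcChannel channel, long timeoutMs, int maxRetries) {
        this.channel = channel;
        this.timeoutMs = timeoutMs;
        this.maxRetries = maxRetries;
    }

    byte[] call(byte[] request) throws Exception {
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            Future<byte[]> reply = channel.send(request);            // (re)transmit the request
            try {
                return reply.get(timeoutMs, TimeUnit.MILLISECONDS);  // start the timer
            } catch (TimeoutException e) {
                reply.cancel(true);  // request or reply presumably lost: retry
            }
        }
        throw new RemoteException("no reply after " + (maxRetries + 1) + " attempts");
    }
}
```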
Server Crashes
 A server in client-server communication:
a) Normal case
b) Crash after execution
c) Crash before execution
• At least once semantics
• At most once semantics
• Exactly once semantics
java.rmi.ServerRuntimeException
Server Crash (Cont’)
 Assume the client requests the server to print a msg
– Send a completion msg (M) before print (P), or
– Send a completion msg (M) after print (P)
 Combinations (C = crash; parenthesized events never happen)
– MPC: crash after ack and print
– MC(P): crash after ack, before print
– C(MP): crash before ack and print
– PMC: crash after print and ack
– PC(M): crash after print, before ack
– C(PM): crash before print and ack
When a crashed server recovers, the client can:
 never reissue a request (Never)
 always reissue a request (Always)
 reissue only if it received an ack (Only when ACKed)
 reissue only if it received no ack (Only when not ACKed)

Client reissue strategy   | Server strategy M → P  | Server strategy P → M
                          | MPC   MC(P)   C(MP)    | PMC   PC(M)   C(PM)
Always                    | DUP   OK      OK       | DUP   DUP     OK
Never                     | OK    ZERO    ZERO     | OK    OK      ZERO
Only when ACKed           | DUP   OK      ZERO     | DUP   OK      ZERO
Only when not ACKed       | OK    ZERO    OK       | OK    DUP     OK

OK: text is printed once; DUP: printed twice; ZERO: no printout
 Lost Reply Messages
– Some requests can safely be re-executed because they have no
side-effects (idempotent, e.g., read 1024 bytes at a given position in a
file); others cannot
– Solutions (see the sketch after this list):
» Structure requests in an idempotent way
» Assign each request a sequence number to be checked by the server
 Client crashes lead to orphan computations
– extermination: client-side logging of each RPC before it is issued;
the log is checked after a reboot and orphans are killed
– reincarnation: the client bcasts a new epoch number when it reboots;
servers detect orphan computations based on epochs
» kill orphan remote computations or locate their owners
– expiration: set a time quantum for each RPC request; if it cannot
finish in time, it must explicitly ask for more quanta
Reliable Multicast
 Basic properties:
– Validity: if a correct process multicasts message m, then it will
eventually deliver m.
– Integrity: a correct process delivers the msg at most once
 Atomic messages (aka agreement)
– A message is delivered to all members of a group, or to none
 Message ordering guarantees
– within group
– across groups
Message Ordering
 Different members may see messages in different orders
 Ordered group communication requires that all members agree about the
order of messages
 Within each group, assign a global ordering to messages
 Hold back messages that arrive out of order (delay their delivery)
(I) Unordered Multicasts
[Figure: P1 multicasts m1 and then m2; P2 delivers m1 before m2, while
P3 delivers m2 before m1.]
(II) FIFO-ordered Multicasts
If a process multicasts two msgs m and m’ in that order, then every
process in the group will deliver the msgs in that same order (a
hold-back sketch follows these definitions).
[Figure: P1 multicasts m1 then m2, and P4 multicasts m3 then m4; every
process delivers m1 before m2 and m3 before m4, though the interleaving
of the two senders’ msgs may differ from process to process.]
(III) Causally-ordered Multicasts
If mcast(g, m) → mcast(g, m’), then any process in the group should
deliver m before m’.
[Figure: A’s multicast causally precedes B’s; receivers delay (hold
back) B’s msg until A’s has been delivered.]
(IV) Totally-ordered Multicasts
If a process delivers msg m before m’, then any other process that
delivers m’ will also deliver m before m’.
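A minimal sketch of FIFO ordering with per-sender sequence numbers and a hold-back queue; all names (FifoReceiver, onReceive, deliver) are hypothetical, and transport reliability is assumed handled elsewhere.

```java
// Hypothetical FIFO-ordered receiver: each sender numbers its multicasts
// 1, 2, 3, ...; messages that arrive early are held back until the gap closes.
import java.util.HashMap;
import java.util.Map;

class FifoReceiver {
    // per sender: next sequence number expected for delivery
    private final Map<Integer, Integer> expected = new HashMap<>();
    // per sender: hold-back queue of early messages, keyed by sequence number
    private final Map<Integer, Map<Integer, String>> holdBack = new HashMap<>();

    void onReceive(int sender, int seq, String msg) {
        holdBack.computeIfAbsent(sender, s -> new HashMap<>()).put(seq, msg);
        int next = expected.getOrDefault(sender, 1);
        Map<Integer, String> q = holdBack.get(sender);
        while (q.containsKey(next)) {           // deliver any consecutive run now complete
            deliver(sender, q.remove(next));
            next++;
        }
        expected.put(sender, next);
    }

    void deliver(int sender, String msg) {
        System.out.println("deliver from " + sender + ": " + msg);
    }
}
```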
Centralized Impl of Total Ordering
 A central ordering server (sequencer) assigns global sequence numbers
 Hosts apply to the ordering server for numbers, or the ordering server
sends all messages itself
 Hold-back is easy, since sequence numbers are consecutive (see the
sketch below)
– Msgs remain in the hold-back queue until they can be delivered
according to their sequence numbers.
 Sequencer: bottleneck and single point of failure
– a tricky protocol is needed to deal with the case where the ordering
server fails
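A minimal sketch of the sequencer approach; Sequencer and TotalOrderReceiver are hypothetical names. Because the sequencer hands out consecutive numbers, the hold-back test is a simple comparison against the next number to deliver.

```java
// Hypothetical sequencer-based total order: global numbers from one
// server; receivers hold back until all earlier numbers are delivered.
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.concurrent.atomic.AtomicLong;

class Sequencer {
    private final AtomicLong next = new AtomicLong(1);

    long assign() {                             // global sequence number for one multicast
        return next.getAndIncrement();
    }
}

class TotalOrderReceiver {
    private long nextToDeliver = 1;
    private final SortedMap<Long, String> holdBack = new TreeMap<>();

    synchronized void onReceive(long globalSeq, String msg) {
        holdBack.put(globalSeq, msg);
        // consecutive numbers make hold-back easy: deliver while no gap remains
        while (!holdBack.isEmpty() && holdBack.firstKey() == nextToDeliver) {
            deliver(holdBack.remove(nextToDeliver));
            nextToDeliver++;
        }
    }

    void deliver(String msg) {
        System.out.println("deliver " + msg);
    }
}
```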
Atomic Messages
 Each recipient acks the message, and the sender retransmits if an ack
is not received
– The sender could crash before the msg is delivered to everyone!!
» Simple approach: if the sender crashes, a recipient volunteers to be
the “backup sender” for the message (see the sketch below)
» it re-sends the message to everybody and waits for acks
» use a simple algorithm to choose the volunteer
» apply the method again if the backup sender crashes
 No single best solution exists!
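A minimal sketch of the ack-tracking side of this scheme; the names are hypothetical, and failure detection (deciding that the sender has crashed) is left out. The backup sender reconstructs the same state from its copy of the message and reruns the same loop.

```java
// Hypothetical ack/retransmit multicast: retransmit to every recipient
// that has not acked; receivers filter duplicates by message id.
import java.util.HashSet;
import java.util.Set;

class AckedMulticast {
    interface Network {
        void send(int recipient, String msg);
    }

    private final Set<Integer> pending;         // recipients that have not acked yet
    private final String msg;

    AckedMulticast(Set<Integer> group, String msg) {
        this.pending = new HashSet<>(group);
        this.msg = msg;
    }

    void onAck(int recipient) {
        pending.remove(recipient);
    }

    boolean done() {
        return pending.isEmpty();
    }

    // Run periodically until done(); on a sender crash, the volunteer
    // "backup sender" runs this same loop for its copy of the message.
    void retransmitRound(Network net) {
        for (int r : pending) net.send(r, msg);
    }
}
```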
Reliability due to Replication
 Blocking update: wait until the backups are updated
– The blocking update of backup servers must be atomic so as to
implement sequential consistency: the primary serializes all incoming
writes (in order), and all processes see all writes in the same order
from any backup server.
 Total ordering comes from using the primary as a centralized sequencer
 Atomicity:
– What happens if some W4 acks are positive and some are negative (NACK)?
– Two-phase commit protocol (see the sketch below):
» W3: “prepare” msg from primary to other replicas
» W3+: ack to “prepare” (a replica in the prepared state preserves the
related objects in permanent storage and will eventually be able to
commit)
» W4: “commit” or “abort” msg
» W4+: ack to “commit/abort”
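A minimal sketch of the primary's 2PC round under the slide's W3/W4 numbering, with blocking calls standing in for the W3+/W4+ acks; all names are hypothetical.

```java
// Hypothetical 2PC coordinator: phase 1 collects prepare votes (W3/W3+),
// phase 2 distributes the commit-or-abort decision (W4/W4+).
import java.util.List;

class TwoPhaseCommitCoordinator {
    interface Replica {
        boolean prepare(long txId);             // W3; returns the W3+ vote
        void commit(long txId);                 // W4 "commit"
        void abort(long txId);                  // W4 "abort"
    }

    boolean run(long txId, List<Replica> replicas) {
        boolean allYes = true;
        for (Replica r : replicas) {
            try {
                if (!r.prepare(txId)) { allYes = false; break; }  // a NACK forces abort
            } catch (Exception timedOut) {      // ack to "prepare" timed out:
                allYes = false;                 // safe to abort, nobody has committed yet
                break;
            }
        }
        for (Replica r : replicas) {
            if (allYes) r.commit(txId); else r.abort(txId);
        }
        return allYes;
    }
}
```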
2PC Protocol in the Presence of Failures
 If the ack to the “prepare” msg times out, the primary can send
“abort” to the replicas and safely abort itself
 If a replica waiting for “commit” or “abort” times out (see the
participant sketch below),
– if its ack to “prepare” was negative, it simply aborts itself
– if its ack to “prepare” was positive, it can neither “commit” nor
“abort”: it blocks, waiting for primary or network recovery
 How to handle crash/reboot, particularly primary failure?
– Cannot back out of a commit once it is decided
– Semantics of failure: store, then commit; cannot commit before store
– Recovery protocol w/ non-volatile memory
 2PC causes a long wait if the primary fails after the “prepare” msg
is sent out
– Three-phase commit protocol: Pre-Prepare, Prepare, Commit
– A replica that times out waiting for the Commit msg will commit the
transaction
– 2PC: execute the transaction when everyone is willing to commit
– 3PC: execute the transaction when everyone knows it will commit
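A minimal sketch of the replica-side timeout rule just described; the names are hypothetical. Note the asymmetry: a no-vote allows a unilateral abort, while a yes-vote forces blocking.

```java
// Hypothetical 2PC participant: what a replica may do when its wait for
// the commit/abort decision times out.
class TwoPhaseCommitParticipant {
    enum State { INIT, PREPARED_YES, PREPARED_NO, COMMITTED, ABORTED }

    private State state = State.INIT;

    void onDecisionTimeout() {
        switch (state) {
            case PREPARED_NO:
                state = State.ABORTED;          // voted no: the outcome can only be abort
                break;
            case PREPARED_YES:
                // Can neither commit (primary may have decided abort) nor
                // abort (primary may have decided commit): stay blocked and
                // keep asking the primary / other replicas for the decision.
                break;
            default:
                break;
        }
    }
}
```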
Recovery from Primary Failure
 Need to pick a new primary, defining a new “view.” It can be chosen
by a human operator or autonomically
– Suppose the lowest-numbered live server is the primary
– Replicas need to ping each other
– Lost or delayed ping msgs may lead to more than one primary
 Paxos protocol for fault-tolerant consensus
– At most a single value is chosen
– Agreement is reached despite lost msgs and crashed nodes
– Paxos eventually succeeds if a majority of replicas are reachable
– See Lamport’98 (submitted to TOCS in ’90) and Chandra-Toueg’96 for
details
Handling Byzantine Failure
 Byzantine Failure
– Failed replicas are not necessarily fail-stop
– Failed replicas may generate arbitrary results!!

The Byzantine Generals Problem
Leslie Lamport, Robert Shostak, and Marshall Pease, 1982
Byzantine Generals Problem
 N divisions of Byzantine army surround city
– Each division commanded by a general
– Some of the N generals are traitors
 Generals communicate via messages
– Traitors can send different values to different generals
 Requirements:
– All loyal generals decide upon same plan of action
– A “small” number of traitors cannot cause loyal generals to
adopt a bad plan
– NOT required to identify traitors
Restricted BGP
 Restate problems as:
– 1 commanding general
– N-1 lieutenants
 Interactive consistency requirements
– IC1: All loyal lieutenants obey the same order
– IC2: If the commander is loyal, every loyal lieutenant obeys
his/her order
 If we can solve this problem…
– Original BGP problem reduces to N instances of this problem;
one instance per general acting as commander
3-General Impossibility Result
 Assume 2 loyal generals and 1 traitor (shaded in the figures)
– Two possible orders: ATTACK or RETREAT
 If Lt. 1 sees {“Attack” from the commander, “He said Retreat” from
Lt. 2}, what should it do?
– If Lt. 2 is the traitor (Fig. 1), Lt. 1 must attack to satisfy IC2
– If the commander is the traitor (Fig. 2), Lt. 1 and Lt. 2 must make
the same decision (always obeying the commander’s order over the
lieutenant’s relay violates IC1)
[Fig. 1: the loyal commander sends “Attack” to both lieutenants; the
traitorous Lt. 2 tells Lt. 1 “He said Retreat.”]
[Fig. 2: the traitorous commander sends “Attack” to Lt. 1 and “Retreat”
to Lt. 2; the loyal Lt. 2 truthfully relays “He said Retreat.”]
The two cases look identical to Lt. 1, so no protocol lets it satisfy
both IC1 and IC2.
General Impossibility Result
 In general, no solution exists with fewer than 3m+1 generals if there
are m traitors
 Proof by contradiction:
– Assume there is a solution for 3m “Albanian” generals, including m
traitors
– Let each Byzantine general simulate m Albanian generals
– The problem is then reduced to the 3-general problem, already shown
impossible
Solution Example
 With one faulty process: f = 1, N = 4
 1st round: the commander sends a value to each Lt
 2nd round: each Lt copies the value it received to all other Lts
 Each Lt then decides on the majority of the three values it holds
(see the sketch below)
[Fig. left: commander p1 is correct and sends 1:v to p2, p3, p4; the
faulty p3 relays wrong values (3:1:u, 3:1:w). p2 holds {v, u, v} and p4
holds {v, w, v}, so both decide v.]
[Fig. right: commander p1 is faulty and sends 1:u, 1:w, 1:v to the
three Lts; after relaying, every correct Lt holds {u, v, w}, finds no
majority, and falls back to the same default, so IC1 still holds.]
Faulty processes are shown coloured in the figures.
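A minimal sketch of the majority decision each lieutenant applies to the three values it holds after round 2; the class and method names are hypothetical, and RETREAT is assumed as the default order when no strict majority exists.

```java
// Hypothetical lieutenant decision rule for the f = 1, N = 4 example.
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ByzantineLieutenant {
    static String decide(List<String> values) {
        Map<String, Integer> count = new HashMap<>();
        for (String v : values) count.merge(v, 1, Integer::sum);
        String best = null;
        for (Map.Entry<String, Integer> e : count.entrySet())
            if (best == null || e.getValue() > count.get(best)) best = e.getKey();
        // a strict majority decides; otherwise fall back to the default order
        return count.get(best) > values.size() / 2 ? best : "RETREAT";
    }

    public static void main(String[] args) {
        // Left figure: p3 is faulty; p2 holds {v, u, v}, p4 holds {v, w, v}.
        System.out.println(decide(Arrays.asList("v", "u", "v")));   // v
        System.out.println(decide(Arrays.asList("v", "w", "v")));   // v
        // Right figure: the commander is faulty; every correct lieutenant
        // holds {u, v, w} and picks the same default, so IC1 still holds.
        System.out.println(decide(Arrays.asList("v", "u", "w")));   // RETREAT
    }
}
```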
Practical Byzantine Fault Tolerance
Miguel Castro and Barbara Liskov
OSDI’99
Assumptions
 Asynchronous distributed system
 Faulty nodes may behave arbitrarily
– Due to malicious attacks or software errors
 Independent node failures
 The network may fail to deliver msgs, delay them, duplicate them, or
deliver them out of order
 An adversary may coordinate faulty nodes, delay comm, or delay
correct nodes in order to cause the most damage to the service, BUT it
cannot delay correct nodes indefinitely, nor subvert the cryptographic
techniques
– Any network fault will eventually be repaired
– E.g., it cannot forge a valid signature of a non-faulty node
– E.g., it cannot find two msgs with the same digest
Objectives
 To be usable for implementing any deterministic replicated service
with a state and some operations
– Clients issue requests and block waiting for a reply
 Safety if no more than ⌊(n−1)/3⌋ replicas are faulty (i.e., to
tolerate f faulty nodes, at least n = 3f+1 replicas are needed)
– Safety: the replicated service satisfies linearizability
– It behaves like a centralized implementation that executes ops
atomically, one at a time
– Why is 3f+1 the optimal resiliency? A replica must be able to proceed
after hearing from n−f replicas (f may never answer), and up to f of
those n−f may be faulty, so the correct replies must outnumber the
faulty ones: n−2f > f, i.e., n > 3f.
 Liveness: clients eventually receive replies to their requests
– At most ⌊(n−1)/3⌋ faulty replicas
– Comm delay is bounded, with an unknown bound; delay is the latency
from the time a msg is first sent to the time it is received by its
destination
Algorithm in a Nutshell
[Figure: the client sends its request to the primary; the primary
multicasts it to the backups; all replicas execute the request and send
replies directly to the client, which accepts the result once f + 1
replies match (OK).]
Replicas and Views
 Set of replicas R: |R| ≥ 3f + 1, numbered R0, R1, R2, …, R|R|−1
 Execution proceeds through a sequence of numbered views (view 0,
view 1, …)
 For view v, the primary p is assigned such that p = v mod |R|
Normal Case Operation
 The client sends {REQUEST, o, t, c} to the primary
o – operation
t – timestamp
c – client
 Timestamps are totally ordered such that later requests have higher
timestamps than earlier ones
Normal Case Operation (Cont’)
 The state of each replica is stored in a message log
 When primary p receives a client request m, it starts a three-phase
protocol to atomically multicast the request to the replicas
– Pre-Prepare, Prepare, Commit
 The Pre-Prepare and Prepare phases totally order requests sent in the
same view, even when the primary is faulty
 The Prepare and Commit phases ensure that committed requests are
totally ordered across views
Pre-Prepare Phase
 The primary multicasts <<PRE-PREPARE, v, n, d>, m> to the backups
v – view number
n – sequence number
m – message (the client request)
d – digest of the message
Prepare Phase
 If replica i accepts the PRE-PREPARE message, it enters the prepare
phase by multicasting
<PREPARE, v, n, d, i>
to all other replicas, and it adds both messages to its log
 Otherwise it does nothing
 A replica accepts the PRE-PREPARE message provided (see the sketch
below):
– the signatures are valid and the digest d matches m
– it is in view v
– it has not accepted a PRE-PREPARE for the same v and n with a
different digest
– the sequence number is within accepted bounds
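A minimal sketch of the four acceptance checks as a predicate; all names are hypothetical and the signature/digest checks are stubbed out.

```java
// Hypothetical backup-side test applied to a PRE-PREPARE before the
// replica multicasts its PREPARE.
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

class PbftBackup {
    private final long view;                         // current view v
    private final long lowWaterMark, highWaterMark;  // bounds on sequence numbers
    private final Map<Long, byte[]> acceptedDigest = new HashMap<>(); // n -> d in this view

    PbftBackup(long view, long low, long high) {
        this.view = view;
        this.lowWaterMark = low;
        this.highWaterMark = high;
    }

    boolean acceptPrePrepare(long v, long n, byte[] d, byte[] m) {
        if (!signatureValid() || !Arrays.equals(d, digest(m))) return false; // check 1
        if (v != view) return false;                                         // check 2
        byte[] prior = acceptedDigest.get(n);
        if (prior != null && !Arrays.equals(prior, d)) return false;         // check 3
        if (n <= lowWaterMark || n > highWaterMark) return false;            // check 4
        acceptedDigest.put(n, d);
        return true;   // caller logs both msgs and multicasts <PREPARE, v, n, d, i>
    }

    private boolean signatureValid() { return true; } // stub: verify signatures
    private byte[] digest(byte[] m) { return m; }     // stub: compute D(m)
}
```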
Commit Phase
 When replica i has accepted 2f matching PREPARE msgs that agree with
the pre-prepare, it enters the commit phase by multicasting
<COMMIT, v, n, d, i> to the other replicas
 Replica i executes the requested operation after it has accepted 2f+1
matching COMMIT msgs from different replicas (see the quorum sketch
below)
 Replica i’s state must reflect the sequential execution of all
requests with lower sequence numbers; this ensures all non-faulty
replicas execute requests in the same order
 To guarantee exactly-once semantics, replicas discard requests whose
timestamp is lower than the timestamp in the last reply they sent to
the client
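A minimal sketch of the quorum counting behind the prepare and commit phases; the names are hypothetical, and matching on (v, n, d) is reduced to the sequence number n for brevity.

```java
// Hypothetical quorum tracker: 2f matching PREPAREs make a request
// "prepared"; 2f+1 matching COMMITs from different replicas allow execution.
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class PbftQuorum {
    private final int f;                        // number of faults tolerated
    private final Map<Long, Set<Integer>> prepares = new HashMap<>(); // n -> replica ids
    private final Map<Long, Set<Integer>> commits  = new HashMap<>();

    PbftQuorum(int f) { this.f = f; }

    // true once 2f matching PREPAREs have arrived: time to multicast COMMIT
    boolean onPrepare(long n, int replicaId) {
        prepares.computeIfAbsent(n, k -> new HashSet<>()).add(replicaId);
        return prepares.get(n).size() >= 2 * f;
    }

    // true once 2f+1 matching COMMITs from different replicas have arrived:
    // execute n after all lower sequence numbers have executed
    boolean onCommit(long n, int replicaId) {
        commits.computeIfAbsent(n, k -> new HashSet<>()).add(replicaId);
        return commits.get(n).size() >= 2 * f + 1;
    }
}
```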
Normal Operation Reply
 Each replica sends the reply <REPLY, v, t, c, i, r> directly to the
client
v = current view number
t = timestamp of the corresponding request
c = client
i = replica index
r = execution result
 The client waits for f+1 replies with valid signatures from different
replicas, with the same t and r, before accepting the result r (see the
sketch below)
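A minimal sketch of the client-side vote counting; the names are hypothetical. Replies vote under the key (t, r), and the result is accepted once f+1 distinct replicas agree, since at least one of them must be correct.

```java
// Hypothetical PBFT client: accept r once f+1 distinct replicas report
// the same (t, r) pair.
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class PbftClient {
    private final int f;
    private final Map<String, Set<Integer>> votes = new HashMap<>(); // (t, r) -> replica ids

    PbftClient(int f) { this.f = f; }

    // returns r once f+1 distinct replicas report the same (t, r), else null
    String onReply(long t, int replicaId, String r) {
        String key = t + ":" + r;               // replies must match on both t and r
        votes.computeIfAbsent(key, k -> new HashSet<>()).add(replicaId);
        return votes.get(key).size() >= f + 1 ? r : null;
    }
}
```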
Normal Case Operation: Summary
[Figure: client C sends the request to primary 0; replicas 0–3 run the
pre-prepare, prepare, and commit phases; the non-faulty replicas reply
to C. Replica 3 is faulty (marked X), and the protocol still completes.]
Safeguards
 If the client does not receive replies soon enough, it
broadcasts the request to all replicas
 If the request has already been processed, the replicas
simply re-send the reply
 If the replica is not the primary, it relays the request to
the primary
 If the primary does not multicast the request to the
group, it will eventually be suspected to be faulty by
enough replicas to cause a view change
View Changes
 A timer is set when a request is received, recording how long the
request has been waiting to execute
 If the timer of replica i expires in view v, the replica starts a
view change to move to view v+1 by (see the sketch below)
– stopping accepting pre-prepare/prepare/commit messages
– multicasting a VIEW-CHANGE message
<VIEW-CHANGE, v+1, n, C, P, i>
n = #seq of the last stable checkpoint s known to i
C = 2f+1 checkpoint msgs proving the correctness of s
P = {Pm : one entry for each request m prepared at i with #seq > n}
Pm = the pre-prepare and 2f matching prepare msgs for m
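A minimal sketch of the timer-driven trigger; the names are hypothetical, and the checkpoint proof C and prepared set P are carried as opaque payloads.

```java
// Hypothetical view-change trigger: on timer expiry in view v, stop
// normal-case processing and multicast VIEW-CHANGE for v+1.
class ViewChangeTrigger {
    private long view;                          // current view v
    private boolean acceptingNormalCase = true;

    // called when a request's timer expires in view v
    void onRequestTimerExpired(long lastStableCheckpointSeq, Object C, Object P) {
        acceptingNormalCase = false;            // stop accepting pre-prepare/prepare/commit
        multicastViewChange(view + 1, lastStableCheckpointSeq, C, P);
    }

    void multicastViewChange(long newView, long n, Object C, Object P) {
        // send <VIEW-CHANGE, v+1, n, C, P, i> to all replicas (transport stubbed)
    }
}
```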
New Primary
 When the primary p’ of view v+1 receives 2f valid VIEW-CHANGE
messages, it
– multicasts a <NEW-VIEW, v+1, V, O> message to all other replicas,
where
» V = the set of 2f valid VIEW-CHANGE messages plus its own
» O = the set of reissued PRE-PREPARE messages
– moves to view v+1
 A replica that accepts NEW-VIEW
– sends a PREPARE message for every pre-prepare in the set O
– moves to view v+1
References
 See OSDI’99 for
– Optimization, implementation, and evaluation
– 3% overhead in NFS daemon
 See tech report for
– Formal presentation of the algorithm in I/O
automation model
– Proof of safety and liveness
 See OSDI’00 for
– Proactive recovery
Further Readings
 In synchronous systems, assume msg exchanges take place in rounds and
processes can detect the absence of a msg through a timeout
– At least f+1 rounds of msgs are needed (Fischer-Lynch ’82)
 In async systems with unbounded delay, a crashed process becomes
indistinguishable from a slow one
– [Impossibility] No algorithm can guarantee to reach consensus in such
systems, even with a single process crash failure
(Fischer-Lynch-Paterson, J. ACM ’85)
 Approaches to working around the impossibility
– In partially async systems with bounded but unknown delay
» Practical Byzantine Fault Tolerance (Castro-Liskov ’99)
– Using failure detectors: treat an unresponsive process as failed and
discard its subsequent msgs
– Consensus can be reached, even with an unreliable failure detector,
if fewer than N/2 processes crash and comm is reliable
(Chandra-Toueg ’96)
– Statistical consensus: “no guarantee” doesn’t mean “cannot”
» Introduce an element of chance into processes’ behaviors so that the
adversary cannot exercise its thwarting strategy effectively