COMP 413
Fall 2002
• Models
– Synchronous vs. asynchronous systems
– Byzantine failure model
• Secure storage with self-certifying data
• Byzantine quorums
• Byzantine state machines
Synchronous system: bounded message delays (implies reliable network!)
Asynchronous system: message delays are unbounded
In practice (Internet): reasonable to assume that network failures are eventually fixed
(weak synchrony assumption).
• Data and services (state machines) can be replicated on a set of nodes R .
• Each node in R has an independent, identically distributed probability of failing
• Can specify a bound f on the number of nodes that can fail simultaneously
Byzantine failures
• no assumption about nature of fault
• failed nodes can behave in arbitrary ways
• may act as an intelligent adversary (compromised node), with full knowledge of the protocols
• failed nodes may conspire (act as one)
• Data is not self-certifying (multiple writers without shared keys)
• Idea: replicate data on a sufficient number of replicas (relative to f) to be able to rely on a majority vote
Representative problem: implement a read/write variable
Assuming no concurrent reads or writes, for now
Assuming trusted clients, for now
How many replicas do we need?
• clearly, need at least 2f+1, so we have a majority of good nodes
• write(x): send x to all replicas, wait for acknowledgments (must get at least f+1)
• read(x): request x from all replicas, wait for responses, take majority vote (if no concurrent writes, must get f+1 identical votes!)
Does this work? Yes, but only if
• system is synchronous (bounded msg delay)
• faulty nodes cannot forge messages (messages are authenticated!)
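As a rough sketch of this 2f+1 scheme (in Python, with a hypothetical in-process Replica class standing in for the networked replicas, and every replica responding, as the synchronous model lets us assume):

```python
from collections import Counter

F = 1                      # assumed bound on simultaneous failures
N = 2 * F + 1              # replica count in the synchronous case

class Replica:
    """Hypothetical in-memory stand-in for a remote replica holding x."""
    def __init__(self):
        self.x = None
    def write(self, value):
        self.x = value
        return "ack"
    def read(self):
        return self.x

replicas = [Replica() for _ in range(N)]

def write(value):
    # Send to all replicas; synchrony lets us wait for every acknowledgment,
    # at least F+1 of which come from good replicas.
    acks = [r.write(value) for r in replicas]
    assert acks.count("ack") >= F + 1

def read():
    # Request from all replicas and take a majority vote: with no concurrent
    # writes, the F+1 good replicas return identical, correct values.
    votes = Counter(r.read() for r in replicas)
    value, count = votes.most_common(1)[0]
    assert count >= F + 1, "need F+1 identical votes"
    return value

write(42)
assert read() == 42
```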
Now, assume
• Weak synchrony (network failures are fixed eventually)
• messages are authenticated (e.g., signed with sender’s private key)
Let’s try 3f+1 replicas (the known lower bound)
• write(x): send x to all replicas, wait for 2f+1 responses (must have at least f+1 good replicas with the correct value)
• read(x): request x from all replicas, wait for 2f+1 responses, take majority vote (if no concurrent writes, must get f+1 identical votes!? – no, it is possible that the f nodes that did not respond were good nodes!)
Let’s try 4f+1 replicas
• write(x): send x to all replicas, wait for 3f+1 responses (must have at least 2f+1 good replicas with the correct value)
• read(x): request x from all replicas, wait for 3f+1 responses, take majority vote (if no concurrent writes, must get f+1 identical votes!? – no, it is possible that the f faulty nodes vote with the good nodes that have an old value of x!)
Let’s try 5f+1 replicas
• write(x): send x to all replicas, wait for 4f+1 responses (must have at least 3f+1 good replicas with the correct value)
• read(x): request x from all replicas, wait for 4f+1 responses, take majority vote (if no concurrent writes, the correct value gets at least 2f+1 identical votes, more than the at most 2f votes any stale value can collect!)
• Actually, we can use only 5f replicas if data is written with monotonically increasing timestamps
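The quorum arithmetic behind the 3f+1, 4f+1, and 5f+1 attempts can be checked mechanically. The sketch below is illustrative only: it counts the worst-case vote split a reader sees after waiting for n - f responses, assuming the f faulty replicas vote together with the (up to f) good replicas that still hold the old value.

```python
def worst_case_read(n, f):
    """Worst-case vote split when the reader waits for n - f responses.

    The write also waited for only n - f acknowledgments, so up to f good
    replicas may still hold the old value; the f faulty replicas vote for
    that old value too, and both groups happen to be among the responders.
    """
    responses = n - f
    stale_votes = 2 * f                 # f faulty + f good-but-stale replicas
    current_votes = responses - stale_votes
    return current_votes, stale_votes

f = 2
for n in (3 * f + 1, 4 * f + 1, 5 * f + 1):
    current, stale = worst_case_read(n, f)
    verdict = "correct value wins" if current > stale else "old value can win"
    print(f"n = {n}: {current} current vs {stale} stale votes -> {verdict}")

# For f = 2 this prints:
# n = 7: 1 current vs 4 stale votes -> old value can win
# n = 9: 3 current vs 4 stale votes -> old value can win
# n = 11: 5 current vs 4 stale votes -> correct value wins
```

Only with n = 5f+1 do the current value's 2f+1 votes beat the at most 2f votes any stale value can collect.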
Still rely on trusted clients
• A malicious client could send different values to different replicas, or send a value to less than a full quorum
• To fix this, we need a Byzantine agreement protocol among the replicas
Still don’t handle concurrent accesses
Still don’t handle group changes
BFT (Castro, 2000)
• Can implement any service that behaves like a deterministic state machine
• Can tolerate malicious clients
• Safe with concurrent requests
• Requires 3f+1 replicas
• 5 rounds of messages
• Clients send requests to one replica
• Correct replicas execute all requests in same order
• Atomic multicast protocol among replicas ensures that all replicas receive and execute all requests in the same order
• Since all replicas start in same state, correct replicas produce identical result
• Client waits for f+1 identical results from different replicas
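A sketch of this client-side acceptance rule (a hypothetical helper; replies are assumed to have already passed signature checks):

```python
from collections import defaultdict

def accept_result(replies, f):
    """Return a result once f+1 distinct replicas report it, else None.

    With at most f faulty replicas, f+1 matching replies mean at least one
    correct replica vouches for the result, and since correct replicas
    execute all requests in the same order, their results agree.
    """
    voters = defaultdict(set)            # result -> set of replica ids
    for replica_id, result in replies:
        voters[result].add(replica_id)
        if len(voters[result]) >= f + 1:
            return result
    return None

# Example with f = 1 (so 3f+1 = 4 replicas); replica 3 is faulty and lies.
replies = [(0, "ok"), (3, "bogus"), (1, "ok")]
assert accept_result(replies, f=1) == "ok"
```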
• Client c sends m = <REQUEST, o, t, c>_σc to the primary. (o = operation, t = monotonic timestamp)
• Primary p assigns sequence number n to m and sends <PRE-PREPARE, v, n, m>_σp to the other replicas. (v = current view, i.e., replica set)
• If replica i accepts the message, it sends <PREPARE, v, n, d, i>_σi to the other replicas. (d is a hash of the request.) This signals that i agrees to assign n to m in v.
• Once replica i has a pre-prepare and 2f+1 matching prepare messages, it sends <COMMIT, v, n, d, i>_σi to the other replicas. At this point, correct replicas agree on an order of requests within a view.
• Once replica i has 2f+1 matching prepare and commit messages, it executes m, then sends <REPLY, v, t, c, i, r>_σi to the client. (The need for this last step has to do with view changes.)
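A minimal sketch of this normal-case message flow and its two quorum checks, with plain Python dictionaries standing in for signed messages (signatures, view changes, and the network are omitted; names follow the slides):

```python
# Illustrative only: quorum thresholds as stated above; no signatures or networking.
f = 1
QUORUM = 2 * f + 1                 # 2f+1 matching messages

def request(o, t, c):
    return {"type": "REQUEST", "op": o, "ts": t, "client": c}

def pre_prepare(v, n, m):          # sent by the primary of view v
    return {"type": "PRE-PREPARE", "view": v, "seq": n, "request": m}

def prepare(v, n, d, i):           # replica i agrees to assign seq n to digest d in view v
    return {"type": "PREPARE", "view": v, "seq": n, "digest": d, "replica": i}

def commit(v, n, d, i):
    return {"type": "COMMIT", "view": v, "seq": n, "digest": d, "replica": i}

def matching(msgs, v, n, d):
    """Count distinct replicas whose messages agree on (view, seq#, digest)."""
    return len({m["replica"] for m in msgs
                if (m["view"], m["seq"], m["digest"]) == (v, n, d)})

def prepared(pre, prepares, v, n, d):
    # A pre-prepare plus 2f+1 matching PREPAREs: safe to send COMMIT.
    return pre is not None and matching(prepares, v, n, d) >= QUORUM

def committed(prepares, commits, v, n, d):
    # 2f+1 matching PREPAREs and COMMITs: safe to execute and send REPLY.
    return (matching(prepares, v, n, d) >= QUORUM and
            matching(commits, v, n, d) >= QUORUM)

# Example: replicas 0, 1, 2 (out of 3f+1 = 4) prepare request m in view 0, seq 1.
m = request("op", t=1, c="client")
pre = pre_prepare(0, 1, m)
prepares = [prepare(0, 1, "digest(m)", i) for i in (0, 1, 2)]
assert prepared(pre, prepares, 0, 1, "digest(m)")
```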
• More complexity related to view changes and garbage collection of message logs
• Public-key crypto signatures are the bottleneck: a variation of the protocol uses symmetric crypto (MACs) to provide authenticated channels. (Not easy: MACs are less powerful, since they can't prove authenticity to a third party!)
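As an illustration of the building block in that MAC-based variant (a sketch using Python's standard hmac module; the pairwise keys and message encoding here are assumptions): the sender attaches one MAC per receiver, each under a key shared only with that receiver, which is exactly why a receiver cannot later prove the message's authenticity to anyone else.

```python
import hashlib
import hmac

# Assumed pairwise secret keys: keys[(i, j)] is known only to replicas i and j.
keys = {(i, j): bytes([i, j]) * 16 for i in range(4) for j in range(4) if i != j}

def authenticator(sender, receivers, payload):
    """One MAC per receiver (an 'authenticator') in place of one signature."""
    return {j: hmac.new(keys[(sender, j)], payload, hashlib.sha256).hexdigest()
            for j in receivers}

def verify(sender, receiver, payload, auth):
    expected = hmac.new(keys[(sender, receiver)], payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, auth[receiver])

auth = authenticator(0, receivers=[1, 2, 3], payload=b"PREPARE,v,n,d,0")
assert verify(0, 1, b"PREPARE,v,n,d,0", auth)
# Replica 1 cannot use this to convince replica 2: the MAC keyed under (0, 1)
# is something replica 1 could have computed itself, unlike a signature.
```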