A hybrid DHT and BFT approach for the
Adder Bulletin board distributed system.
Why?
Because a plain BFT system cannot scale. You can start with a minimum of 4 nodes to tolerate 1 faulty and 1
partitioned server, but all of them work in lockstep, without any kind of load balancing. If you add more servers,
either to scale or to tolerate more faulty nodes, the system performs at best the same and usually worse, because
more messages must be exchanged in the 3-phase synchronization protocol.
Besides, in WAN operation, network latency affects the system considerably, as every node communicates with all
other nodes.
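To make the scaling argument concrete, the sketch below gives a rough count of the messages exchanged among replicas for one request in the normal case of a PBFT-style 3-phase protocol. It assumes the usual all-to-all pattern (pre-prepare from the primary, then prepare and commit multicast by the replicas) and ignores client requests, replies, checkpoints and view changes. The function names are illustrative only.

```python
def max_faulty(n):
    """Largest f such that n >= 3f + 1 (the usual BFT bound)."""
    return (n - 1) // 3


def normal_case_messages(n):
    """Approximate replica-to-replica messages for one request.

    Assumes: the primary sends pre-prepare to the n-1 backups, each of
    the n-1 backups multicasts prepare to the other n-1 replicas, and
    all n replicas multicast commit to the other n-1 replicas.
    """
    pre_prepare = n - 1
    prepare = (n - 1) * (n - 1)
    commit = n * (n - 1)
    return pre_prepare + prepare + commit


if __name__ == "__main__":
    for n in (4, 7, 10, 13):
        print(n, max_faulty(n), normal_case_messages(n))
```

Under these assumptions the count grows quadratically with the number of nodes, while every node still processes every request.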
Additionally, Adder’s bulletin board does not need total ordering of messages per stage. In other words, although
stage switching must be performed (virtually) synchronously by all nodes, the intra-stage messages can be inserted
in any order. The single and most annoying exception to this relaxation of consistency requirements is that messages
from the same user have to be ordered relative to each other, i.e. message m+1 from user x should never appear before
message m in any replica that stores that user's messages.
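The per-user ordering requirement can be sketched as follows. This is a minimal illustration, not part of Adder: it assumes each client attaches a per-user sequence number starting at 0, and the `StageLog` class and its fields are hypothetical. Messages from different users interleave freely, while a message that arrives ahead of its predecessor is held back.

```python
from collections import defaultdict


class StageLog:
    """Hypothetical replica-side log for the messages of one stage."""

    def __init__(self):
        self.log = []                      # accepted (user, seq, payload)
        self.next_seq = defaultdict(int)   # next expected seq per user
        self.pending = defaultdict(dict)   # user -> {seq: payload}

    def receive(self, user, seq, payload):
        if seq < self.next_seq[user]:
            return  # already accepted; ignore the duplicate
        self.pending[user][seq] = payload
        # Append every message of this user that is now in sequence.
        while self.next_seq[user] in self.pending[user]:
            s = self.next_seq[user]
            self.log.append((user, s, self.pending[user].pop(s)))
            self.next_seq[user] = s + 1


if __name__ == "__main__":
    stage = StageLog()
    stage.receive("x", 1, "second")   # held back: message 0 is missing
    stage.receive("y", 0, "other")    # a different user, accepted at once
    stage.receive("x", 0, "first")    # releases both messages of x
    print(stage.log)
```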
What?
A DHT is a well-understood and efficient way to partition a range of keys deterministically. The central idea is that
the client program can use the hash function to determine the subset of servers responsible for the range that
includes the authenticated user, and submit all of that user's messages to this subset. A crucial assumption,
borrowed from the static membership model of BFT, is that the client knows all nodes of the system.
For the DHT approach, Chord (Stoica et al., 2001) will be used as a reference platform, along with ideas from CFS
(Dabek et al., 2001). For the BFT state machine replication, Practical BFT (Castro & Liskov, 1999) will be used as a
reference platform. This should not affect generality as no particular characteristics of these systems will be used
(???)
How?
A set of <n> servers participates in a single DHT. A parameter <k> defines the number of nodes that will replicate the
operations the users submit. <k> implies that a single node participates in <k> address ranges, hence the partitions
are not <n> but <n>/<k>.
A user signs on to the client software, which uses the user name to obtain the <key> of the partition the user
belongs to via the consistent hash function (e.g. SHA-1). Every message the user generates is transmitted to the set
of <k> successive servers, starting from the one responsible for the address range that <key> belongs to.
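A minimal sketch of this lookup, assuming Chord-style placement: server names and the user name are hashed with SHA-1 onto the same identifier ring, and the server responsible for a key is the first one whose identifier is greater than or equal to it, wrapping around. The function names and the server naming scheme are illustrative assumptions. Since the client is assumed to know all nodes, the lookup is purely local.

```python
import hashlib
from bisect import bisect_left


def ring_id(name):
    """Map a name onto the 160-bit identifier ring using SHA-1."""
    return int.from_bytes(hashlib.sha1(name.encode("utf-8")).digest(), "big")


def replica_group(user, servers, k):
    """Return the k successive servers that replicate this user's messages."""
    if not 0 < k <= len(servers):
        raise ValueError("k must be between 1 and the number of servers")
    ring = sorted(servers, key=ring_id)
    ids = [ring_id(s) for s in ring]
    # First server whose id is >= the key; wrap to index 0 past the end.
    start = bisect_left(ids, ring_id(user)) % len(ring)
    return [ring[(start + i) % len(ring)] for i in range(k)]


if __name__ == "__main__":
    servers = ["server-%d" % i for i in range(8)]
    print(replica_group("alice", servers, 4))
```

Because the result depends only on the hash values, every client that knows the same server list computes the same group for a given user.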
These <k> servers form a group that mimics the operation of a single BFT group. The important property to be
preserved is that messages arriving from a single user have to be ordered. The authority that enforces this ordering
can be the server responsible for the address range that includes <key>. The issue, however, is how a faulty server
in that role can be handled gracefully.
One solution can be the idea of the “primary” from the BFT approach. The group maintains “views”, and the server at
position (view number mod k) within the group is the primary for the range. This looks overly complicated though.
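The selection rule itself is small; the complexity lies in agreeing on the view number. As a sketch only, assuming the group is the ordered list of <k> servers for a range and that all correct members already agree on the current view (the hard part, which is not shown), the primary could be picked like this. `primary_for` is a hypothetical helper.

```python
def primary_for(group, view):
    """Pick the ordering authority for a range in the given view.

    group: ordered list of the k servers replicating the range.
    view:  non-negative view number agreed on by the group.
    """
    return group[view % len(group)]


if __name__ == "__main__":
    group = ["s0", "s1", "s2", "s3"]
    # Moving to the next view hands the role to the next server in order.
    for view in range(6):
        print(view, primary_for(group, view))
```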
Another can be the idea of a server leaving the Chord ring, so that its successor takes over the faulty server’s
address range. The key problems here are (a) how the clients will learn of that fact and (b) how the remaining
servers will allow the faulty server to rejoin the group once it is repaired. (a) does not look too difficult if the
client multicasts its requests to all <k> nodes. However, (b) needs some more thought.