High Throughput Byzantine Fault Tolerance
Ramakrishna Kotla, Mike Dahlin
Laboratory for Advanced Systems Research, The University of Texas at Austin

July 12, 2016 Department of Computer Sciences, UT Austin

Summary of the talk
  High throughput is achievable along with Byzantine fault tolerance
  Contributions
    High Throughput BFT Architecture
    CBASE : Generic Prototype
    CBASE-FS : High throughput replicated NFS

Outline
  Overview
  Architecture
  Implementation
  Evaluation
  Conclusion

Motivation
  Large scale Internet services
    High Availability : 24 X 7 service
    High Reliability : Correctness
    High Security : Data integrity/Confidentiality
    High Throughput : System load
  Challenges : Byzantine failures
    Malicious attacks
      • http://www.cert.org
    Software and operator errors
      • ROC@USITS03
    Network and hardware failures

BFT State Machine Replication

BFT state machine replication
  Byzantine Fault Tolerance Protocol
    Tolerates f Byzantine server failures using 3f+1 replicas
    Agreement : Order requests from clients
    Execution stage : Execute requests
  Provide high availability, reliability and security
    PBFT, Farsite, Oceanstore [OSDI99, OSDI01, SOSP01, SOSP03]
  [Figure: clients and four server replicas, each with an agreement stage and an execution stage]

BFT : Tradeoff throughput for fault tolerance ?
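As a small worked example of the replica bound stated on the BFT state machine replication slide (f Byzantine failures tolerated with 3f+1 replicas), the arithmetic can be sketched as below. The function names are hypothetical and chosen only for illustration; they are not part of PBFT, BASE, or CBASE.

```python
def replicas_needed(f):
    """Replicas required to tolerate f Byzantine failures (3f + 1)."""
    if f < 0:
        raise ValueError("f must be non-negative")
    return 3 * f + 1


def max_faults_tolerated(n):
    """Largest f such that 3f + 1 <= n, for a group of n replicas."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return (n - 1) // 3


if __name__ == "__main__":
    for f in range(4):
        print(f, replicas_needed(f))
```

Under this bound, extra replicas only raise the number of tolerated faults once a full group of three has been added: 4, 5 and 6 replicas all tolerate one Byzantine failure.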
Traditional BFT : Limitations
  Fails to provide high throughput
    Does not scale with hardware resources and application parallelism
  Reason
    Uses Generalized State Machine Replication
    Correctness conditions:
      • Agreement : Every non-faulty state machine replica receives every request
      • Order : Every non-faulty state machine replica processes the requests in the same relative order
    BFT State machine replication : Executes requests sequentially to ensure order

High Throughput BFT : Idea
  Modify Order without compromising consistency/safety
    Relaxed order : Every non-faulty replica executes dependent requests in the same relative order
    Dependent requests : Two requests are dependent if the read set or write set of one intersects with the write set of the other
    Requests that are not dependent can be executed concurrently
  Exploit application parallelism to provide high throughput
    Commercial applications such as web servers, file systems, and databases have inherent data parallelism

Outline
  Overview
  Architecture
  Implementation
  Evaluation
  Conclusion

HT BFT : Architecture
  Goals :
    Generic : Generic interface that exposes application parallelism
    Extensible : Easily extensible to support any application
    Modular : Support different fault models easily
    Reuse : Reuse existing agreement protocols
  [Figure: four server replicas, each with an agreement stage, a parallelizer, and an execution stage]

Parallelizer
  Application independent module
  Receives ordered requests from agreement
  Maintains/Updates dependency graph of requests
  2 level dependency analysis
  Concurrency matrix
  Schedules a request if it is not dependent on any outstanding requests (no outgoing edges at a request node)
  Requests that are not dependent are executed concurrently

Parallelizer : Concurrency Matrix
  Definition/Figure : Square matrix; rows/columns represent operations
    1 represents independent operations, 0 represents dependent operations
  Exports application level parallelism
  Statically defined
  Two matrices : Dependency also depends on objects
    Related objects
    Unrelated objects
  Table lookup
    Low overhead

Parallelizer : Dependence Analysis
  [Figure: parallelizer with agreement stage, input queue, dependency graph, and multithreaded execution stage]

Advantages/Limitations
  Advantages :
    Supports high throughput applications
    Simple : Minimal/No changes to client/agreement protocol/application
    Flexible : Supports different fault models easily
  Limitation :
    Defining the concurrency matrix requires knowledge of the inner workings of the application
    Conservative rules ensure correctness at the expense of performance
    Incrementally refine the rules to gain performance

Outline
  Overview
  Architecture
  Implementation
  Evaluation
  Conclusion

System Model
  Asynchronous system
    Nodes operate at arbitrarily different speeds
    Network may delay, drop or deliver messages out of order
    Assumption : Bounded fair links
  Fault Model : Byzantine Faults
    Faulty nodes may behave arbitrarily : crash, lose/alter data, send incorrect messages
  Adversary : Strong adversary
    Can coordinate faulty nodes in arbitrarily bad ways
    Assumption : Computationally limited

CBASE : Concurrent BASE
  Uses unmodified PBFT agreement protocol [OSDI 1999]
  Built upon BASE library [SOSP 2001]
  Agreement stage : Single thread
  Execution stage : Multithreaded
  Parallelizer : Producer/Consumer queue
  Figure ??
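The parallelizer described on the slides above (concurrency-matrix lookup, dependency graph, scheduling of requests with no outgoing edges) can be sketched roughly as follows. This is a minimal single-threaded sketch under stated assumptions, not the CBASE implementation: the operation names, the matrix entries, the rule that "related" means "same object name", and the done() method are all hypothetical. The method names insert() and next_request() are borrowed from the interface slide, but their semantics here are assumed; sync() is omitted because the slides do not describe what it does.

```python
from collections import deque

# Hypothetical operations and matrix entries: 1 = independent, 0 = dependent.
OPS = ["read", "write"]
# Matrix consulted when two requests touch related objects (assumed: same name).
RELATED = {
    ("read", "read"): 1, ("read", "write"): 0,
    ("write", "read"): 0, ("write", "write"): 0,
}
# Matrix consulted for unrelated objects: every pair is independent.
UNRELATED = {(a, b): 1 for a in OPS for b in OPS}


def dependent(req_a, req_b):
    """Two-level check: pick a matrix by object relation, then look up the op pair."""
    op_a, obj_a = req_a
    op_b, obj_b = req_b
    matrix = RELATED if obj_a == obj_b else UNRELATED
    return matrix[(op_a, op_b)] == 0


class Parallelizer:
    """Single-threaded sketch; a real execution stage would use worker threads."""

    def __init__(self):
        self._next_id = 0
        self._requests = {}    # id -> request, for every outstanding request
        self._blocked_on = {}  # id -> earlier outstanding ids it depends on
        self._ready = deque()  # ids with no outgoing edges, not yet handed out

    def insert(self, request):
        """Add a request in agreement order; edges point only to earlier requests."""
        rid = self._next_id
        self._next_id += 1
        deps = {o for o, r in self._requests.items() if dependent(request, r)}
        self._requests[rid] = request
        self._blocked_on[rid] = deps
        if not deps:
            self._ready.append(rid)
        return rid

    def next_request(self):
        """Return (id, request) for a schedulable request, or None if none is ready."""
        if not self._ready:
            return None
        rid = self._ready.popleft()
        return rid, self._requests[rid]

    def done(self, rid):
        """Hypothetical completion hook: drop the node and release its dependents."""
        del self._requests[rid]
        del self._blocked_on[rid]
        for other in sorted(self._blocked_on):
            deps = self._blocked_on[other]
            if rid in deps:
                deps.remove(rid)
                if not deps:
                    self._ready.append(other)


if __name__ == "__main__":
    p = Parallelizer()
    p.insert(("write", "f1"))
    p.insert(("read", "f1"))   # waits for the write to f1
    p.insert(("read", "f2"))   # independent, can run alongside the write
    print(p.next_request(), p.next_request(), p.next_request())
```

Because a new request only ever gains edges to requests that were inserted before it, dependent requests are released in the order agreement delivered them, while independent requests can be handed to different execution threads at the same time.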
Parallelizer : Interface
  Parallelizer.insert()
  Parallelizer.next_request()
  Parallelizer.sync()

CBASE-FS : BFT NFS
  Figure
  Brief description of NFS concurrency matrix rules
    Related objects : Same NFS handle
    Rules are conservative
    Refer to the paper for more details

Outline
  Overview
  Architecture
  Implementation
  Evaluation
  Conclusion

Evaluation
  4 server replicas that tolerate 1 Byzantine failure
    Each replica runs on a separate uniprocessor machine
    933 MHz P3, 256 MB RAM
  5 client machines
  Dedicated network with 100 Mbps Ethernet hub
  OS : Redhat Linux 7.2 with NFS 2.0
    Assumption : No correlated failures due to OS

Microbenchmark : Overhead
  BASE versus CBASE

Microbenchmark : Scalability
  Scalability with hardware resources
  Scalability with application level parallelism

Microbenchmark : CBASE-FS/BASE-FS/NFS
  Latency versus Throughput with no sleep
  Latency versus Throughput with 20 ms sleep
  Iozone results summary

Macrobenchmarks
  Postmark :
  Andrew :

Conclusions
  Commercial applications have parallelism
  High throughput BFT provides a simple/flexible solution to achieve high throughput

Questions ?
  Why don’t you have a parallelizer in the agreement stage to reduce agreement cost?