CONFIGURABLE CONSISTENCY FOR WIDE-AREA CACHING

by

Sai R. Susarla

A dissertation submitted to the faculty of The University of Utah in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science

School of Computing
The University of Utah
May 2007

Copyright (c) Sai R. Susarla 2007. All Rights Reserved.

THE UNIVERSITY OF UTAH GRADUATE SCHOOL

SUPERVISORY COMMITTEE APPROVAL

of a dissertation submitted by Sai R. Susarla

This dissertation has been read by each member of the following supervisory committee and by majority vote has been found to be satisfactory.

Chair: John B. Carter
Wilson Hsieh
Jay Lepreau
Gary Lindstrom
Edward R. Zayas

THE UNIVERSITY OF UTAH GRADUATE SCHOOL

FINAL READING APPROVAL

To the Graduate Council of the University of Utah:

I have read the dissertation of Sai R. Susarla in its final form and have found that (1) its format, citations, and bibliographic style are consistent and acceptable; (2) its illustrative materials including figures, tables, and charts are in place; and (3) the final manuscript is satisfactory to the Supervisory Committee and is ready for submission to The Graduate School.

Date                    John B. Carter, Chair, Supervisory Committee

Approved for the Major Department: Martin Berzins, Chair/Director
Approved for the Graduate Council: David S. Chapman, Dean of The Graduate School

ABSTRACT

Data caching is a well-understood technique for improving the performance and availability of wide-area distributed applications. The complexity of caching algorithms motivates the need for reusable middleware support to manage caching. To support diverse data sharing needs effectively, a caching middleware must provide a flexible consistency solution that (i) allows applications to express a broad variety of consistency needs, (ii) enforces consistency efficiently among WAN replicas while satisfying those needs, and (iii) employs application-independent mechanisms that facilitate reuse. Existing replication solutions either target specific sharing needs and lack flexibility, or leave a significant consistency management burden on the application programmer. As a result, they cannot effectively offload the complexity of caching from a broad set of applications.

In this dissertation, we show that a small set of customizable data coherence mechanisms can effectively support wide-area replication for distributed services with very diverse consistency requirements. Specifically, we present a novel flexible consistency framework called configurable consistency that enables a single middleware to effectively support three important classes of applications, namely, file sharing, shared database and directory services, and real-time collaboration. Instead of providing a few prepackaged consistency policies, our framework splits consistency management into design choices along five orthogonal aspects, namely, concurrency control, replica synchronization, failure handling, update visibility, and view isolation. Based on a detailed classification of application needs, these design choices can be combined to simultaneously enforce diverse consistency requirements for data access.

We have designed and prototyped a middleware file store called Swarm that provides network-efficient wide-area peer caching with configurable consistency. To demonstrate the practical effectiveness of the configurable consistency framework, we built four wide-area network services that store data with distinct consistency needs in Swarm.
They leverage its caching mechanisms by employing different sets of configurable consistency choices. The services are: (1) a wide-area file system, (2) a proxy-caching service for enterprise objects, (3) a database augmented with transparent read-write caching support, and (4) a real-time multicast service. Though existing middleware systems can individually support some of these services, none of them provides a consistency solution flexible enough to support all of the services efficiently. When these services employ caching with Swarm, they deliver more than 60% of the performance of custom-tuned implementations in terms of end-to-end latency, throughput, and network economy. Also, Swarm-based wide-area peer caching improves service performance by 300 to 500% relative to client-server caching and RPCs.

This dissertation is an offering to The Divine Mother of All Knowledge, Gaayatri.

CONTENTS

ABSTRACT
LIST OF FIGURES
LIST OF TABLES
GLOSSARY OF TERMS USED
ACKNOWLEDGMENTS

CHAPTERS

1. INTRODUCTION
   1.1 Consistency Needs of Distributed Applications
       1.1.1 Requirements of a Caching Middleware
       1.1.2 Existing Work
   1.2 Configurable Consistency
       1.2.1 Configurable Consistency Options
       1.2.2 Discussion
       1.2.3 Limitations of the Framework
       1.2.4 Swarm
   1.3 Evaluation
   1.4 Thesis
   1.5 Roadmap

2. CONFIGURABLE CONSISTENCY: RATIONALE
   2.1 Background
       2.1.1 Programming Distributed Applications
       2.1.2 Consistency
   2.2 Replication Needs of Applications: A Survey
   2.3 Representative Applications
       2.3.1 File Sharing
       2.3.2 Proxy-based Auction Service
       2.3.3 Resource Directory Service
       2.3.4 Chat Service
   2.4 Application Characteristics
   2.5 Consistency Requirements
       2.5.1 Update Stability
       2.5.2 View Isolation
       2.5.3 Concurrency Control
       2.5.4 Replica Synchronization
       2.5.5 Update Dependencies
       2.5.6 Failure-Handling
   2.6 Discussion
   2.7 Limitations of Existing Consistency Solutions
   2.8 Summary

3. CONFIGURABLE CONSISTENCY FRAMEWORK
   3.1 Framework Requirements
   3.2 Data Access Interface
   3.3 Configurable Consistency Options
       3.3.1 Concurrency Control
       3.3.2 Replica Synchronization
           3.3.2.1 Timeliness Guarantee
           3.3.2.2 Strength of Timeliness Guarantee
       3.3.3 Update Ordering Constraints
           3.3.3.1 Ordering Independent Updates
           3.3.3.2 Semantic Dependencies
       3.3.4 Failure Handling
       3.3.5 Visibility and Isolation
   3.4 Example Usage
   3.5 Relationship to Other Consistency Models
       3.5.1 Memory Consistency Models
       3.5.2 Session-oriented Consistency Models
       3.5.3 Session Guarantees for Mobile Data Access
       3.5.4 Transactions
       3.5.5 Flexible Consistency Schemes
   3.6 Discussion
       3.6.1 Ease of Use
       3.6.2 Orthogonality
       3.6.3 Handling Conflicting Consistency Semantics
   3.7 Limitations of the Framework
       3.7.1 Conflict Matrices for Abstract Data Types
       3.7.2 Application-defined Logical Views on Data
   3.8 Summary

4. IMPLEMENTING CONFIGURABLE CONSISTENCY
   4.1 Design Goals
       4.1.1 Scope
   4.2 Swarm Overview
       4.2.1 Swarm Interface
       4.2.2 Application Plugin Interface
   4.3 Using Swarm to Build Distributed Services
       4.3.1 Designing Applications to Use Swarm
       4.3.2 Application Design Examples
           4.3.2.1 Distributed File Service
           4.3.2.2 Music File Sharing
           4.3.2.3 Wide-area Enterprise Service Proxies
   4.4 Architectural Overview of Swarm
   4.5 File Naming and Location Tracking
   4.6 Replication
       4.6.1 Creating Replicas
       4.6.2 Custodians for Failure-resilience
       4.6.3 Retiring Replicas
       4.6.4 Node Membership Management
       4.6.5 Network Economy
       4.6.6 Failure Resilience
   4.7 Consistency Management
       4.7.1 Overview
           4.7.1.1 Privilege Vector
           4.7.1.2 Handling Data Access
           4.7.1.3 Enforcing a PV's Consistency Guarantee
           4.7.1.4 Handling Updates
           4.7.1.5 Leases
           4.7.1.6 Contention-aware Caching
       4.7.2 Core Consistency Protocol
       4.7.3 Pull Algorithm
       4.7.4 Replica Divergence Control
           4.7.4.1 Hard time bound (HT)
           4.7.4.2 Soft time bound
           4.7.4.3 Mod bound
       4.7.5 Refinements
           4.7.5.1 Deadlock Avoidance
           4.7.5.2 Parallelism for RD Mode Sessions
       4.7.6 Leases for Failure Resilience
       4.7.7 Contention-aware Caching
   4.8 Update Propagation
       4.8.1 Enforcing Semantic Dependencies
       4.8.2 Handling Concurrent Updates
       4.8.3 Enforcing Global Order
       4.8.4 Version Vectors
       4.8.5 Relative Versions
   4.9 Failure Resilience
       4.9.1 Node Churn
   4.10 Implementation
   4.11 Discussion
       4.11.1 Security
       4.11.2 Disaster Recovery
   4.12 Summary

5. EVALUATION
   5.1 Experimental Environment
   5.2 Swarmfs: A Flexible Wide-area File System
       5.2.1 Evaluation Overview
       5.2.2 Personal Workload
       5.2.3 Sequential File Sharing over WAN (Roaming)
       5.2.4 Simultaneous WAN Access (Shared RCS)
       5.2.5 Evaluation
       5.2.6 Summary
   5.3 SwarmProxy: Wide-area Service Proxy
       5.3.1 Evaluation Overview
       5.3.2 Workload
       5.3.3 Experiments
       5.3.4 Results
   5.4 SwarmDB: Replicated BerkeleyDB
       5.4.1 Evaluation Overview
       5.4.2 Diverse Consistency Semantics
       5.4.3 Failure-resilience and Network Economy at Scale
       5.4.4 Evaluation
   5.5 SwarmCast: Real-time Multicast Streaming
       5.5.1 Results Summary
       5.5.2 Implementation
       5.5.3 Experimental Setup
       5.5.4 Evaluation Metrics
       5.5.5 Results
   5.6 Discussion

6. RELATED WORK
   6.1 Domain-specific Consistency Solutions
   6.2 Flexible Consistency
   6.3 Wide-area Replication Algorithms
       6.3.1 Replica Networking
       6.3.2 Update Propagation
       6.3.3 Consistency Maintenance
       6.3.4 Failure Resilience
   6.4 Reusable Middleware

7. FUTURE WORK AND CONCLUSIONS
   7.1 Future Work
       7.1.1 Improving the Framework's Ease of Use
       7.1.2 Security and Authentication
       7.1.3 Applicability to Object-based Middleware
   7.2 Summary

REFERENCES

LIST OF FIGURES

2.1  A distributed application with five components A1..A5 employing the function-shipping model. Each component holds a fragment of overall state and operates on other fragments via messages to their holder.

2.2  A distributed application employing the data-shipping model. A1..A5 interact only via caching needed state locally. A1 holds fragments 1 and 2 of overall state. A2 caches fragments 1 and 3. A3 holds 3 and caches 5.

2.3  A distributed application employing a hybrid model. A1, A3, and A5 are in data-shipping mode. A2 is in client mode, and explicitly communicates with A1 and A3. A4 is in hybrid mode: it caches 3, holds 4, and contacts A5 for 5.

3.1  Pseudo-code for a query operation on a replicated database. The cc options are explained in Section 3.3.
3.2  Pseudo-code for an update operation on a replicated database.

3.3  Atomic sessions example: moving file fname from one directory to another. Updates u1 and u2 are applied atomically, everywhere.

3.4  Causal sessions example. Updates u1 and u2 happen concurrently and are independent. Clients 3 and 4 receive u1 before their sessions start. Hence u3 and u4 causally depend on u1, but are independent of each other. Though u2 and u4 are independent, they are tagged as totally ordered, and hence must be applied in the same order everywhere. Data item B's causal dependency at client 3 on the prior update to A has to be explicitly specified, while that of A at client 4 is implicitly inferred.

3.5  Update Visibility Options. Based on the visibility setting of the ongoing WR session at replica 1, it responds to the pull request from replica 2 by supplying the version (v0, v2, or v3) indicated by the curved arrows. When the WR session employs manual visibility, local version v3 is not yet made visible to be supplied to replica 2.

3.6  Isolation Options. When replica 2 receives updates u1 and u2 from replica 1, the RD session's isolation setting determines whether the updates are applied (i) immediately (solid curve), (ii) only when the session issues the next manual pull request (dashed curve), or (iii) only after the session ends (dotted curve).

4.1  A centralized enterprise service. Clients in remote campuses access the service via RPCs.

4.2  An enterprise application employing a Swarm-based proxy server. Clients in campus 2 access the local proxy server, while those in campus 3 invoke either server.

4.3  Control flow in an enterprise service replicated using Swarm.

4.4  Structure of a Swarm server and client process.

4.5  File replication in a Swarm network. Files F1 and F2 are replicated at Swarm servers N1..N6. Permanent copies are shown in darker shade. F1 has two custodians, N4 and N5, while F2 has only one, namely N5. Replica hierarchies are shown for F1 and F2, rooted at N4 and N5 respectively. Arrows indicate parent links.

4.6  Replica Hierarchy Construction in Swarm. (a) Nodes N1 and N3 cache file F2 from its home N5. (b) N2 and N4 cache it from N5; N1 reconnects to closer replica N3. (c) Both N2 and N4 reconnect to N3 as it is closer than N5. (d) Finally, N2 reconnects to N1 as it is closer than N3.

4.7  Consistency Privilege Vector.

4.8  Pseudo-code for opening and closing a session.

4.9  Consistency management actions in response to client access to file F2 of Figure 4.5(d). Each replica is labelled with its currentPV, and '-' denotes ∞.

4.10 Basic Consistency Management Algorithms.

4.11 Basic Pull Algorithm for configurable consistency.

4.12 Computation of relative PVs during pull operations.
The PVs labelling the edges show the PVin of a replica obtained from each of its neighbors, and '-' denotes ∞. For instance, in (b), after node N3 finishes pulling from N5, its N5.PVin = [wr, ∞, ∞, [10,4]] and N5.PVout = [rd, ∞, ∞, ∞]. Its currentPV becomes PVmin([wr, ∞, ∞, [10,4]], [rd, ∞, ∞, ∞], [wrlk]), which is [rd, ∞, ∞, ∞].

4.13 Basic Push Algorithm for Configurable Consistency.

4.14 Replication State for Adaptive Caching. Only salient fields are presented for readability.

4.15 Adaptive Caching Algorithms.

4.16 Master Election in Adaptive Caching. The algorithms presented here are simplified for readability and do not handle race conditions such as simultaneous master elections.

4.17 Adaptive Caching of file F2 of Figure 4.5 with hysteresis set to (low=3, high=5). In each figure, the replica in master mode is shown darkly shaded, peers are lightly shaded, and slaves are unshaded.

5.1  Network topology emulated by Emulab's delayed LAN configuration. The figure shows each node having a 1Mbps link to the Internet routing core, and a 40ms roundtrip delay to/from all other nodes.

5.2  Andrew-Tcl Performance on 100Mbps LAN.

5.3  Andrew-Tcl Details on 100Mbps LAN.

5.4  Andrew-Tcl results over 1Mbps, 40ms RTT link.

5.5  Network Topology for the Swarmfs Experiments described in Sections 5.2.3 and 5.2.4.

5.6  The replica hierarchy observed for a Swarmfs file in the network of Figure 5.5.

5.7  Roaming File Access: Swarmfs pulls source files from nearby replicas. Strong-mode Coda correctly compiles all files, but exhibits poor performance. Weak-mode Coda performs well, but generates incorrect results on the three nodes (T2, F1, F2) farthest from server U1.

5.8  Latency to fetch and modify a file sequentially at WAN nodes. Strong-mode Coda writes the modified file synchronously to the server.

5.9  Repository module access patterns during various phases of the RCS experiment. Graphs in Figures 5.10 and 5.11 show the measured file checkout latencies on Swarmfs-based RCS at the Univ. (U) and Turkey (T) sites relative to client-server RCS.

5.10 RCS on Swarmfs: Checkout Latencies near the home server on the "University" (U) LAN.

5.11 RCS on Swarmfs: Checkout Latencies at the "Turkey" (T) site, far away from the home server.

5.12 Network Architecture of the SwarmProxy service.

5.13 SwarmProxy aggregate throughput (ops/sec) with adaptive replication.
5.14 SwarmProxy aggregate throughput (ops/sec) with aggressive caching. The Y-axis in this graph is set to a smaller scale than in Figure 5.13 to show more detail.

5.15 SwarmProxy latencies for local objects at 40% locality.

5.16 SwarmProxy latencies for non-local objects at 40% locality.

5.17 SwarmDB throughput observed at each replica for reads (lookups and cursor-based scans). 'Swarm local' is much worse than 'local' because SwarmDB opens a session on the local server (incurring an inter-process RPC) for every read in our benchmark.

5.18 SwarmDB throughput observed at each replica for writes (insertions, deletions, and updates). Writes under master-slave replication perform slightly worse than RPC due to the update propagation overhead to slaves.

5.19 Emulated network topology for the large-scale file sharing experiment. Each oval denotes a 'campus LAN' with ten user nodes, each running a SwarmDB client and Swarm server. The labels denote the bandwidth (bits/sec) and one-way delay of network links (from node to hub).

5.20 New file access latencies in a Swarm-based peer file sharing network of 240 nodes under node churn.

5.21 The average incoming bandwidth consumed at a router's WAN link by file downloads. Swarm with proximity-aware replica networking consumes much less WAN link bandwidth than with random networking.

5.22 Data dissemination bandwidth of a single Swarm vs. a hand-coded relay for various numbers of subscribers across a 100Mbps switched LAN. With a few subscribers, the Swarm-based producer is CPU-bound, which limits its throughput. But with many subscribers, Swarm's pipelined I/O delivers 90% efficiency. Our hand-coded relay requires a significant redesign to achieve comparable efficiency.

5.23 End-to-end packet latency of a single relay with various numbers of subscribers across a 100Mbps LAN, shown to a log-log scale. The Swarm relay induces an order of magnitude longer delay as it is CPU-bound in our unoptimized implementation. The delay is due to packets getting queued in the network link to the Swarm server.

5.24 A SwarmCast producer's sustained send throughput (KBytes/sec) scales linearly as more Swarm servers are used for multicast (shown to log-log scale), regardless of their fanout.

5.25 Adding Swarm relays linearly scales multicast throughput on a 100Mbps switched LAN. The graph shows the throughput normalized to a single ideal relay that delivers the full bandwidth of a 100Mbps link. A network of 20 Swarm servers delivers 80% of the throughput of 20 ideal relays.

5.26 Swarm's multicast efficiency and throughput-scaling. The root Swarm server is able to process packets from the producer at only 60% of the ideal rate due to its unoptimized compute-bound implementation. But its efficient hierarchical propagation mechanism disseminates them at higher efficiency. The fanout of the Swarm network has a minor impact on throughput.

5.27 End-to-end propagation latency via the Swarm relay network.
Swarm servers are compute-bound for packet processing, and cause packet queuing delays. However, when multiple servers are deployed, individual servers are less loaded. This reduces the overall packet latency drastically. With a deeper relay tree (obtained with a relay fanout of 2), average packet latency increases marginally due to the extra hops through relays.

5.28 Packet delay contributed by various levels of the Swarm multicast tree (shown to a log-log scale). The two slanting lines indicate that the root relay and its first-level descendants are operating at their maximum capacity, and are clearly compute-bound. Packet queuing at those relays contributes the most to the overall packet latency.

6.1  The spectrum of available consistency management solutions. We categorize them based on the variety of application consistency semantics they cover and the effort required to employ them in application design.

LIST OF TABLES

1.1  Consistency options provided by the Configurable Consistency (CC) framework.

1.2  Configurable Consistency (CC) options for popular consistency flavors.

2.1  Characteristics of several classes of representative wide-area applications.

2.2  Consistency needs of several representative wide-area applications.

3.1  Concurrency matrix for Configurable Consistency.

3.2  Expressing the needs of representative applications using Configurable Consistency.

3.3  Expressing various consistency semantics using Configurable Consistency options.

3.4  Expressing other flexible models using Configurable Consistency options.

4.1  Swarm API.

4.2  Swarm interface to application plugins.

5.1  Consistency semantics employed for replicated BerkeleyDB (SwarmDB) and the CC options used to achieve them.

GLOSSARY OF TERMS USED

AFS: Andrew File System, a client-server distributed file system developed at CMU
Andrew: a file system benchmark created at CMU
CMU: Carnegie-Mellon University
CVS: Concurrent Versioning System, a more sophisticated version control software based on RCS
Caching: creating a temporary copy of remotely located data to speed up future retrievals nearby
Coda: an extension of the AFS file system that operates in weakly-connected network environments
LAN: Local-area Network
NFS: Network File System, developed by Sun Labs
NOP: null/no operation
NTP: Network Time Protocol, used to synchronize wall clock time among computers
RCS: Revision Control System, a public-domain file version control software
Replication: generally, keeping multiple copies of a data item for any reason, such as performance or redundancy
TTCP: a sockets-based Unix program to measure the network bandwidth between two machines
VFS: Virtual File System, an interface developed for an operating system kernel to interact with file system-specific logic in a portable way
WAN: Wide-area Network

ACKNOWLEDGMENTS

At the outset, I owe the successful pursuit of my Ph.D. degree to two great people: my advisor, John Carter, and my mentor, Jay Lepreau. I really admire John's immeasurable patience and benevolence in putting up with a ghost student like me who practically went away in the middle of the Ph.D. for four long years to work full-time (at Novell). I thank John for giving me complete freedom to explore territory that was new to both of us, while gently nudging me towards practicality with his wisdom. Thanks to John's able guidance, I could experience the thrill of systems research for which I joined graduate school - conceiving and building a really 'cool', practically valuable system of significant complexity all by myself, from scratch. Finally, I owe most of my writing skills to John. However, all the errors you may find in this thesis are mine.

I would like to thank Jay Lepreau for providing generous financial as well as moral support throughout my graduate study at Utah. Jay Lepreau and Bryan Ford taught me the baby steps of research. I thank Jay for allowing me to use his incredible tool, the Emulab network testbed, extensively, without which my work would have been impossible. I thank the members of the Flux research group at Utah, especially Mike Hibler, Eric Eide, Robert Ricci, and Kirk Webb, for patiently listening to my technical problems and ideas, and providing critical feedback that greatly helped my work. I really enjoyed the fun atmosphere in the Flux group at Utah.

I feel fortunate to have been guided by Wilson Hsieh, a star researcher. From him, I learned a great deal about how to clearly identify and articulate the novel aspects of my work that formed my thesis. The critical feedback I received from Wilson, Gary Lindstrom, and Ed Zayas shaped my understanding of doctoral research. The positive energy, enthusiasm, and joy that Ed Zayas radiates are contagious. I cherish every moment I spent with him at Novell and later at Network Appliance.

My graduate study has been a long journey during which I met many great friends of a lifetime - especially Anand, Gangu, Vamsi, and Sagar. Each is unique, inspiring, pleasant, and heart-warming in his own way. I am grateful to my family members, including my parents, grandparents, aunts and uncles (especially Aunt Ratna), and my wife Sarada, without whose self-giving love and encouragement I would not have achieved anything.

Finally, my utmost gratitude goes to the Veda Maataa, Gaayatrii (the Divine Mother of Knowledge), who nurtured me every moment in the form of all these people. She provided what I needed just when I needed it, and guides me towards the goal of Life - Eternal Existence (Sat), Infinite Knowledge and Power (Chit), and Inalienable Bliss (Ananda).

CHAPTER 1

INTRODUCTION

Modern Internet-based services increasingly operate over diverse networks and cater to geographically widespread users.
Caching of service state and user data at multiple locations is well understood as a technique to scale service capacity, to hide network variability from users, and to provide a responsive and highly available service. Caching of mutable data raises the issue of consistency management, which ensures that all replica contents converge to a common final value in spite of updates. A dynamic wide-area environment poses several unique challenges to caching algorithms, such as diverse network characteristics, diverse resource constraints, and machine failures. The complexity of such caching algorithms motivates the need for a reusable solution to support data caching and consistency in new distributed services.

In this dissertation, we address the following question: can a single middleware system provide cached data access effectively in diverse distributed services with a wide variety of sharing needs? To identify the core desirable features of a caching middleware, we surveyed the data sharing needs of a wide variety of distributed applications, ranging from personal file access (with little data sharing) to widespread real-time collaboration (with fine-grain synchronization). We found that although replicable data is prevalent in many applications, their data characteristics (e.g., the unit of data access, its mutability, and the frequency of read/write sharing) and consistency requirements vary widely. Supporting this diversity efficiently requires greater customizability of consistency mechanisms than existing solutions provide. Also, applications operate in diverse network environments ranging from well-connected corporate servers to intermittently connected mobile devices. The ability to promiscuously replicate data and synchronize with any available replica, called pervasive replication [62], greatly enhances application availability and performance in such environments. (In previous literature [62, 78, 22], the term replication refers both to keeping transient copies/replicas of data to improve access latency and availability, also called caching or second-class replication [38], and to maintaining multiple redundant copies of data to protect against the permanent loss of some of them, called first-class replication. In this dissertation, we use the unqualified term replication to refer to caching, unless stated otherwise.) Finally, we observed that certain core design choices recur in the consistency management of diverse applications, although different applications need to make different sets of choices. This commonality in the consistency enforcement options allows the development of a flexible consistency framework and its implementation in a caching middleware to support diverse sharing needs.

Based on the above observations, this dissertation presents a novel flexible consistency management framework called configurable consistency that supports efficient caching for diverse distributed applications running in non-uniform network environments (including WANs). The configurable consistency framework can express a broader mix of consistency semantics than existing models (ranging from strong to eventual consistency) by combining a small number of orthogonal design choices. The framework's choices allow an application to make different tradeoffs between consistency, availability, and performance over time. Its interface also allows different users to impose different consistency requirements on the same data simultaneously. Thus, the framework is highly customizable and adaptable to varying user requirements. Configurable consistency can be enforced efficiently among data replicas spread across non-uniform networks. If a data service or middleware adopts the configurable consistency protocol to synchronize peer replicas, it can support the read/write data sharing needs of a variety of applications efficiently. For clustered services, read-write caching with configurable consistency helps incrementally scale service capacity to handle client load. For wide-area services, it also improves end-to-end service latency and throughput, and reduces WAN usage.

To support our flexibility and efficiency claims, we present the design of a middleware data store called Swarm (Swarm stands for Scalable Wide Area Replication Middleware) that provides pervasive wide-area caching and supports diverse application needs by implementing configurable consistency. To demonstrate the flexibility of configurable consistency, we present four network services that store data with distinct consistency needs in Swarm and leverage its caching support with different configurable consistency choices. Though existing systems support some of these services, none of them has a consistency solution flexible enough to support all of the services efficiently. Under worst-case workloads, services using Swarm middleware for caching perform within 20% of equivalent hand-optimized implementations in terms of end-to-end latency, throughput, and network utilization. However, relative to traditional client-server implementations without caching, Swarm-based caching improves service performance by at least 500% on realistic workloads. Thus, layering these services on top of Swarm middleware incurs only a low worst-case penalty, but benefits them significantly in the common case.

Swarm accomplishes this by providing the following features: (i) a failure-resilient, proximity-aware replica management mechanism that organizes data replicas into an overlay hierarchy for scalable synchronization and adjusts it based on observed network characteristics and node accessibility; (ii) an implementation of the configurable consistency framework that lets applications customize the consistency semantics of shared data to match their diverse sharing and performance needs; and (iii) a contention-aware replication control mechanism that limits replica synchronization overhead by monitoring the contention among replica sites and adjusting the degree of replication accordingly. For P2P file sharing with close-to-open consistency semantics, proximity-aware replica management reduces data access latency and WAN bandwidth consumption to roughly one-fifth that of random replica networking. With configurable consistency, database applications can operate with diverse consistency requirements without redesign; relaxing consistency improves their throughput by an order of magnitude over strong consistency and read-only replication. For enterprise services employing WAN proxies, contention-aware replication control outperforms both aggressive caching and RPCs (i.e., no replication) at all levels of contention while still providing strong consistency.

In Section 1.1, we present three important classes of distributed applications that we target for replication support. We discuss the diversity of their sharing needs, the commonality of their consistency requirements, and the existing ways in which those requirements are satisfied, to motivate our thesis.
In Section 1.2, we outline the configurable consistency framework and present an overview of Swarm. Finally, in Section 1.3, we outline the four specific representative applications we built using Swarm and our evaluation of their efficiency relative to alternate implementations.

1.1 Consistency Needs of Distributed Applications

Previous research has revealed that distributed applications vary widely in their consistency requirements [77]. It is also well known that consistency, scalable performance, and high availability in the wide area are often conflicting goals [23]. Different applications need to make different tradeoffs based on application and network characteristics. To understand the diversity in replication needs, we studied three important and broad classes of distributed services: (i) file access, (ii) directory and database services, and (iii) real-time collaborative groupware. Though efficient custom replication solutions exist for many individual applications in all these categories, our aim is to see whether a more generic middleware solution is feasible, one that provides efficient support for a wide variety of applications.

File systems are used by myriad applications to store and share persistent data, but applications differ in the way files are accessed. Personal files are rarely write-shared. Software and multimedia are widely read-shared. Log files are concurrently appended. Shared calendars and address books are concurrently updated, but their results can often be merged automatically. Concurrent updates to version control files produce conflicts that are hard to resolve and must be prevented. Eventual consistency (i.e., propagating updates lazily) provides adequate semantics and high availability in the normal case where files are rarely write-shared. But during periods of close collaboration (e.g., an approaching deadline), users need tighter synchronization guarantees such as close-to-open consistency (to view the latest updates) or strong consistency (to prevent update conflicts) to facilitate productive fine-grained document sharing. Hence users need to make different tradeoffs between consistency and availability for files at different times. Currently, the only recourse for users during close collaboration over the wide area is to avoid distributed file systems and resort to manual synchronization at a central server (via ssh/rsync or email).

A directory service locates resources such as users, devices, and employee records based on their attributes. The consistency needs of a directory service depend on the applications using it and the resources being indexed. A music file index used for peer-to-peer music search, such as KaZaa, might require a very weak (e.g., eventual) consistency guarantee, but the index must scale to a large floating population of (thousands of) users frequently updating it. Some updates to an employee directory may need to be made effective "immediately" at all replicas (e.g., revoking a user's access privileges to sensitive data), while other updates can be performed with relaxed consistency. This requires support for multiple simultaneous accesses to the same data with different consistency requirements.

Enterprise data services (such as auctions, e-commerce, and inventory management) involve multi-user access to structured data. Their responsiveness to geographically spread users can be improved by deploying wide-area proxies that cache enterprise objects (e.g., sales and customer records).
For instance, the proxy at a call center could cache many client records locally, thereby speeding up response. However, enterprise services often need to enforce strong consistency and integrity constraints in spite of wide-area operation. Node and network failures must not degrade service availability when proxies are added. Also, since the popularity of objects may vary widely, caching decisions must be made based on the available locality. Widely caching data with poor locality leads to significant coherence traffic and hurts, rather than improves, performance. These applications typically have semantic dependencies among updates, such as atomicity and causality, that must be preserved to ensure data integrity.

Real-time collaboration involves producer-consumer style interaction among multiple users or application components in real time. The key requirement is to deliver data from producers to consumers with minimal latency while utilizing network bandwidth efficiently. Example applications include data logging for analysis, chat (i.e., many-to-many data multicast), stock/game updates, and multimedia streaming to wide-area subscribers. In the traditional organization of these applications, producers send their data to a central server that disseminates it to interested consumers. Replicating the central server's task among a multicast network of servers helps ease the load on the central server and thus has the potential to improve scalability. Such applications differ in the staleness of data tolerated by consumers.

1.1.1 Requirements of a Caching Middleware

In general, applications differ widely in several respects, including their data access locality, the frequency and extent of read and write sharing among replicas, the typical replication factor, semantic interdependencies among updates, the likelihood of conflicts among concurrent updates, and their amenability to automatic conflict resolution. Operating these applications in the wide area introduces several additional challenges. In a wide-area environment, network links typically have non-uniform delay and varying bandwidth due to congestion and cross-traffic. Both nodes and links may become intermittently unavailable. To support diverse application requirements in such a dynamic environment, we believe a caching solution must have the following features (a brief illustrative sketch follows the list):

• Customizable Consistency: Applications require a high degree of customizability of consistency mechanisms to achieve the right tradeoff between consistency semantics, availability, and performance. The same application might need to operate with different consistency semantics based on its changing resources and connectivity (e.g., file sharing).

• Pervasive Replication: Application components must be able to freely cache data and synchronize with any available replica to provide high availability in the wide area. Rigid communication topologies such as client-server or static hierarchies prevent efficient utilization of network resources and restrict availability.

• Network Economy: The caching and consistency mechanisms must use network capacity efficiently and hide variable delays to provide predictable response to users.

• Failure Resilience: For practical deployment in the wide area, a caching solution must continue to operate correctly in spite of the failure of some nodes and network links, i.e., it must not violate the consistency guarantees given to applications.
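To make the pervasive-replication and failure-resilience requirements concrete, the following minimal C sketch shows one way a client could direct an access to whichever well-connected replica is currently reachable, falling back to local (possibly stale) data otherwise. It is purely illustrative: the replica descriptors, the RTT values, and the function name pick_replica are hypothetical and are not part of Swarm's interface, which is introduced in Chapter 4.

    #include <stdio.h>
    #include <stdbool.h>

    /* Hypothetical replica descriptor: any reachable copy may serve an access
     * (pervasive replication), not just a fixed home server. */
    struct replica {
        const char *name;
        double      rtt_ms;     /* measured round-trip time to this replica     */
        bool        reachable;  /* false if the node or link is currently down  */
    };

    /* Pick the closest replica whose RTT falls under the caller's cutoff.
     * The cutoff acts as a per-access consistency/availability knob: a small
     * cutoff insists on well-connected replicas, a large cutoff favors
     * availability over freshness. Returns -1 if no replica qualifies. */
    static int pick_replica(const struct replica *r, int n, double rtt_cutoff_ms) {
        int best = -1;
        for (int i = 0; i < n; i++) {
            if (!r[i].reachable || r[i].rtt_ms >= rtt_cutoff_ms)
                continue;
            if (best < 0 || r[i].rtt_ms < r[best].rtt_ms)
                best = i;
        }
        return best;
    }

    int main(void) {
        struct replica replicas[] = {
            { "home-server",  250.0, true  },   /* far away              */
            { "campus-proxy",  40.0, true  },   /* nearby proxy          */
            { "peer-laptop",    5.0, false },   /* currently unreachable */
        };
        int n = sizeof(replicas) / sizeof(replicas[0]);

        int r = pick_replica(replicas, n, 100.0 /* ms */);
        if (r >= 0)
            printf("serving access via replica '%s' (%.0f ms)\n",
                   replicas[r].name, replicas[r].rtt_ms);
        else
            printf("no well-connected replica; serving local, possibly stale copy\n");
        return 0;
    }

The sketch is only meant to show the flavor of the decision a middleware must make on every access; Section 1.2 describes how the configurable consistency framework exposes this tradeoff to applications.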
1.1.2 Existing Work

Previous research efforts have proposed a number of efficient techniques to support pervasive wide-area replication [62, 56, 18], as well as consistency mechanisms that suit distinct data sharing patterns [38, 74, 2, 52, 41]. However, as we explain below, existing systems lack one or more of the listed features that we feel are essential to support the data sharing needs of the diverse application classes mentioned above.

Many systems [38, 62, 52, 41, 46, 61] handle the diversity in replication needs by devising packaged consistency policies tailored to applications with specific data and network characteristics. For example, the Coda file system [38] provides two modes of operation with distinct consistency policies, namely close-to-open and eventual consistency, based on the connectivity of its clients to servers. Fluid replication [37] provides three flavors of consistency on a per-file-access basis to support sharing of read-only files, rarely write-shared personal files, and files requiring serializability. Their approach adequately serves specific access patterns, but cannot provide slightly different semantics without a system redesign. For instance, they cannot enforce different policies for file reads and writes, or control replica synchronization frequency, which our evaluation shows can significantly improve application performance.

Several research efforts [39, 77] have recognized the need to allow tuning of consistency to specific application needs, and have developed consistency interfaces that provide options to customize consistency. TACT defines a continuous consistency model that provides continuous control over the degree of divergence of replica contents [77]. TACT's model provides three powerful orthogonal metrics in which to express the replica divergence requirements of applications: numerical error, order error, and staleness. However, to express important consistency constraints such as atomicity, causality, and isolation, applications such as file and database services need a session abstraction for grouping multiple data accesses into a unit, which the TACT model lacks. Also, TACT only supports strict enforcement of divergence bounds by blocking writers. Enforcing strict bounds is overkill for chat and directory services, and reduces their throughput and availability for update operations in the presence of failures.

Oceanstore [60] provides an Internet-scale persistent data store for use by arbitrary distributed applications and comes closest to our vision of a reusable data replication middleware. However, it takes an extreme approach of requiring applications to both define and enforce a consistency model. It only provides a semantic update model that treats updates as procedures guarded by predicates. An application or a consistency library must design the right set of predicates to associate with updates to achieve the desired consistency. This approach leaves unanswered our question about the feasibility of a flexible consistency solution for the application classes mentioned above. Hence, although Oceanstore can theoretically support a wide variety of applications, programmers incur the burden of implementing consistency management. Several other systems [12, 75] adopt a similar approach to flexible consistency.

In summary, the lack of a sufficiently flexible consistency solution remains a key hurdle to building a replication middleware that supports the wide-area data sharing needs of diverse applications.
The difficulty lies in developing a consistency interface that allows a broad variety of consistency semantics to be expressed, can be enforced efficiently by a small set of application-independent mechanisms in the wide area, and is customizable enough to enable a large set of applications to make the right tradeoffs based on their environment. This dissertation proposes such a consistency interface and shows how it can be implemented, along with the other three features, in a single middleware to support the diverse application classes mentioned above.

In addition to the features mentioned in Section 1.1.1, for practical deployment, a full-fledged data management middleware must also address several important issues including security and authentication, fault-tolerance for reliability against permanent loss of replicas, and long-term archival storage and retrieval for disaster recovery. For the purpose of this thesis, however, we limit our scope to flexible consistency management for pervasive replication, as it is an important enabler for reusable data management middleware.

1.2 Configurable Consistency

To aid us in designing a more flexible consistency interface, we surveyed a number of applications in the three classes mentioned in the previous section, looking for common traits in their diverse consistency needs. From the survey, described in Chapter 2, we found that a variety of consistency needs can be expressed in terms of a small number of design choices for enforcing consistency. Those choices can be classified into five mostly orthogonal dimensions:

• concurrency - the degree to which conflicting (read/write) accesses can be tolerated,

• replica synchronization - the degree to which replica divergence can be tolerated, including the types of interdependencies among updates that must be preserved when synchronizing replicas,

• failure handling - how data access should be handled when some replicas are unreachable or have poor connectivity,

• update visibility - the granularity at which the updates issued at a replica should be made visible globally, and

• view isolation - the duration for which the data accesses at a replica should be isolated from remote updates.

There are multiple reasonable options along each of these dimensions, which together create a multidimensional space for expressing the consistency requirements of applications. Based on this classification, we developed the novel configurable consistency framework, which provides the options listed in Table 1.1. When these options are combined in various ways, they yield a rich collection of consistency semantics for reads and updates to shared data, covering the needs of a broad mix of applications. For instance, this approach lets a proxy-based auction service employ strong consistency for updates across all replicas, while enabling peer proxies to answer queries with different levels of accuracy by relaxing consistency for reads to limit synchronization cost. A user can still get an accurate answer by specifying a stronger consistency requirement for queries, e.g., when placing a bid and the operation warrants incurring higher latency.

Table 1.1. Consistency options provided by the Configurable Consistency (CC) framework. Consistency semantics are expressed for an access session by choosing one of the alternative options in each row, which are mutually exclusive. Reasonable defaults that suit many applications are concurrent access modes (RD, WR), a hard timeliness bound with staleness = 0, no semantic dependencies, total update ordering, pessimistic failure handling, and session-grained visibility and isolation. In our discussion, when we leave an option unspecified, we assume its default value.
The configurable consistency framework assumes that applications access (i.e., read or write) their data in sessions, and that consistency can be enforced at session boundaries as well as before and after each read or write access within a session. The framework's definition of reads and writes is general and includes queries and updates of arbitrary complexity. In this framework, an application expresses its consistency requirements for each session as a vector of consistency options (one from each row of Table 1.1) covering several aspects of consistency management. Each row of the table indicates several mutually exclusive options available to control the aspect of consistency indicated in its first column. The table shows reasonable default options (marked with an asterisk), which together enforce the close-to-open consistency semantics provided by AFS [28] for coherent read-write file sharing. An application can select a different set of options for subsequent sessions on the same data item to meet dynamically varying consistency requirements. Also, different application instances can select different sets of options on the same data item simultaneously. In that case, the framework guarantees that all sessions achieve their desired semantics by providing mechanisms that serialize sessions with conflicting requirements. Thus, the framework provides applications a significant amount of customizability in consistency management.

1.2.1 Configurable Consistency Options

We now briefly describe the options supported by the configurable consistency framework, which are explained in detail in Chapter 3. The framework provides two flavors of access modes to control the parallelism among reads and writes. Concurrent modes (RD, WR) allow arbitrary interleaving of accesses across replicas, while the exclusive modes (RDLK, WRLK) provide traditional concurrent-read, exclusive-write semantics globally [48]. Divergence of replica contents (called timeliness in Table 1.1) can be controlled by limiting staleness in terms of time, the number of unseen remote updates, or both. The divergence bounds can be hard, i.e., strictly enforced by stalling writes if necessary (similar to TACT's model [77]), or soft, i.e., enforced in a best-effort fashion without stalling any accesses. Two types of semantic dependencies can be expressed among multiple writes (to the same or different data items), namely, causality and atomicity. When updates are issued independently at multiple replicas, our framework allows them to be applied (1) with no particular constraint on their ordering at various replicas (called 'none'), (2) in some arbitrary but common order everywhere (called 'total'), or (3) sequentially via serialization (called 'serial'). When not all replicas are equally well-connected or available, different consistency options can be imposed dynamically on different subsets of replicas based on their relative connectivity.
For this, the framework allows qualifying the options with a cutoff value for a link quality metric such as network latency. In that case, consistency options will be enforced only relative to replicas reachable via network links of higher quality (e.g., lower latency) than the cutoff. With this option, application instances using a replica can optimistically make progress with available data even when some replicas are unreachable due to node/network failures. 12 Finally, the framework provides control over how long a session is kept isolated from the updates of remote sessions, as well as when its updates are made ready to be visible to remote sessions. A session can be made to remain isolated entirely from remote updates (‘session’, ensuring a snapshot view of data), to apply remote updates on local copies immediately (‘per-update’, useful for log monitoring), or when explicitly requested via an API (‘manual’). Similarly, a session’s updates can be propagated as soon as they are issued (useful for chat), when the session ends (useful for file updates), or only upon explicit request (‘manual’). Table 1.2 shows how several popular consistency flavors can be expressed in terms of configurable consistency options. In the proxy-based auction example described above, strong consistency can be enforced for updates by employing exclusive write mode (WRLK) sessions to ensure data integrity, while queries employ concurrent read mode (RD) sessions with relaxed timeliness settings for high query throughput. A replicated chat service needs to employ per-update visibility and isolation to force user messages to be eagerly propagated among chat servers in real-time. On the other hand, updates to source files and documents are not stable until a write session ends. Hence they need to employ session visibility and isolation to ensure consistent contents. We discuss the expressive power of the framework in the context of existing consistency models in detail in Section 3.5. 1.2.2 Discussion At first glance, providing a large number of options (as our framework does) rather than a small set of hardwired protocols might appear to impose an extra design burden on application programmers. Programmers need to determine how the selection of a particular option along one dimension (e.g., optimistic failure handling) affects the semantics provided by the options chosen along other dimensions (e.g., exclusive write mode, i.e., WRLK). However, thanks to the orthogonality and composability of our framework’s options, their semantics are roughly additive; each option only restricts the applicability of the semantics of other options and does not alter them in unpredictable ways. For example, employing optimistic failure handling and WRLK mode together for a data 13 Table 1.2. Configurable Consistency (CC) options for popular consistency flavors. For options left unspecified, we assume a value from the options vector: [RD/WR, time=0, mod=∞, hard, no semantic deps, total order, pessimistic, session-grain visibility & isolation]. Consistency Semantics CC Options Sample Applications Existing Support Demo Apps. locking rd, wr (strong consistency) RDLK/WRLK DB, objects, file locking SwarmProxy, RCS, SwarmDB master-slave wr WR, serial shared queue close-to-open rd, wr bounded inconsistency MVCC rd time=0, hard collaborative file sharing airline reservation online shopping inventory queries personal file access Fluid Replication [16], DBMS, Objectstore [41] read-only repl. 
(mySQL) AFS, Coda[38] Swarmfs TACT [77] SwarmDB eventual / optimistic close-to-rd optimistic wr-to-rd append consistency time=x, mod=y, hard RD, time=0, soft, causal+atomic time=0, soft, optimistic, per-update isolation time=x, mod=y, soft, optimistic WR, time=0, soft, none/total/serial, per-update Directory, stock quotes logging, chat, SwarmDB Objectstore[41], Oracle Pangaea [62], Swarmfs, Ficus, Coda, Fluid, NFS Active Directory[46] WebFS [74], GoogleFS SwarmDB SwarmDB SwarmCast streaming, games access session guarantees exclusive write access for that session only within the replica’s partition (i.e., among replicas connected well to that replica). Thus by adopting our framework, programmers are not faced with a combinatorial increase in the complexity of semantics to understand. To ease the adoption of our framework for application design, we anticipate that middleware systems that adopt our framework will bundle popular combinations of options as defaults (e.g., ‘Unix file semantics’, ‘CODA semantics’, or ‘best effort streaming’) for object access, while allowing individual application components to refine their con- 14 sistency semantics when required. In those circumstances, programmers can customize individual options along some dimensions while retaining the other options from the default set. Although the options are largely orthogonal, a few of them implicitly imply others. For example, the exclusive access modes imply session-grain visibility and isolation, a hard most-current timeliness guarantee, and serial update ordering. Likewise, serial update ordering implicitly guarantees that updates are totally ordered as well. 1.2.3 Limitations of the Framework Although we have designed our framework to support the consistency semantics needed by a wide variety of distributed services, we have chosen to leave out semantics that are difficult to support in a scalable manner across the wide area or that require consistency protocols with application-specific knowledge. As a consequence, our framework has two specific limitations that may restrict application-level parallelism in some scenarios. First, it cannot support application-specific conflict matrices that specify the parallelism possible among application-level operations [5]. Instead, applications must map their operations into the read and write modes provided by our framework, which may restrict parallelism. For instance, a shared editing application cannot specify that structural changes to a document (e.g., adding/removing sections) can be safely allowed to proceed in parallel with changes to individual sections, although their results can be safely merged at the application level. Enforcing such application-level concurrency constraints requires the consistency protocol to track the application operations in progress at each replica site, which is hard to track efficiently in an application-independent manner. Second, our consistency framework allows replica divergence constraints to be imposed on data but not on application-defined views of data, unlike TACT’s conit concept [77]. For instance, a user sharing a replicated bulletin board might be more interested in tracking updates to specific threads of discussion than others, or messages posted by her friends than others, etc. To support such requirements, views should be defined on the bulletin board dynamically (e.g., all messages with subject ‘movie’) and updates should 15 be tracked by the effect they have on those views, instead of on the whole board. 
Our framework cannot express such views precisely. It is difficult to efficiently manage such dynamic views across a large number of replicas, because doing so requires frequent bookkeeping communication among replicas that may offset the parallelism gained by replication. An alternative solution is to split the bulletin board into multiple consistency units (threads) and manage consistency at a finer grain when this is required. Thus, our framework provides alternate ways to express both of these requirements that we believe are likely to be more efficient due to their simplicity and reduced bookkeeping. 1.2.4 Swarm To demonstrate the practicality of the configurable consistency framework, we have designed and prototyped the Swarm middleware file store. Swarm is organized as a collection of peer servers that provide coherent file access at variable granularity behind a traditional session-oriented file interface. Applications store their shared state in Swarm files and operate on their state via nearby Swarm servers. By designing applications this way, application writers are relieved of the burden of implementing their own data location, caching and consistency mechanisms. Swarm allows applications to access an entire file or individual file blocks within sessions. When opening a session, applications can specify the desired consistency semantics via configurable consistency options, which Swarm enforces for the duration of that session. Swarm files can be updated by overwriting previous contents (physical updates) or by invoking a semantic update procedure that Swarm later applies to all replicas (with the help of application plugins). Swarm supports three distinct paradigms of shared data access: (i) whole-file access with physical updates to support unstructured variable-length data such as files, (ii) page-grained access to support persistent objects and other heap-based data structures, and (iii) whole file access with semantic updates to support structured data such as databases and directories. Swarm builds failure-resilient overlay replica hierarchies dynamically per file to manage large numbers (thousands) of file replicas across non-uniform networks. By default, Swarm aggressively caches shared data on demand, but dynamically restricts caching 16 when it observes high contention. We refer to this technique as contention-aware replication control. Swarm’s caching mechanism dynamically monitors network quality between clients and replicas and reorganizes the replica hierarchy to minimize the use of slow links, thereby reducing latency and saving WAN bandwidth. We refer to this feature of Swarm as proximity-aware replica management. Finally, Swarm’s replica hierarchy mechanism is resilient to intermittent node and network failures that are common in dynamic environments. Swarm’s familiar file interface enables applications to be programmed for a location-transparent persistent data abstraction, and to achieve automatic replication. Applications can meet diverse consistency requirements by employing different sets of configurable consistency options. The design and implementation of Swarm and configurable consistency are described in detail in Chapter 4. 1.3 Evaluation To support our claim that configurable consistency can meet the data sharing needs of diverse applications in the wide area efficiently, we used the Swarm prototype to support four diverse network services. 
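Swarm's actual interface is described in Chapter 4. Purely to illustrate the session-oriented access style sketched above, the toy single-node stand-in below shows the shape of such an interaction: a session is opened with a set of consistency options, data is modified through physical or semantic updates, and updates become visible when the session closes. All names here (ToySwarmServer, session, semantic_update) are hypothetical placeholders, not Swarm's API, and no replication is performed.

```python
# Toy, single-node stand-in for a Swarm-like session interface, used only to
# illustrate the access pattern described above.  Real Swarm is a network of
# peer servers; none of these names are Swarm's actual API.
from contextlib import contextmanager

class ToySwarmServer:
    """Holds file contents locally; consistency options are accepted but, in
    this toy, only recorded, since there is a single replica."""

    def __init__(self):
        self._files = {}          # path -> bytes
        self._update_logs = {}    # path -> list of recorded semantic updates

    @contextmanager
    def session(self, path, **consistency_options):
        # A real implementation would synchronize the local replica here
        # according to the requested options (mode, staleness, etc.).
        handle = _Handle(self, path, consistency_options)
        try:
            yield handle
        finally:
            handle.close()        # session end: updates become visible

class _Handle:
    def __init__(self, server, path, options):
        self.server, self.path, self.options = server, path, options

    def read(self):
        return self.server._files.get(self.path, b"")

    def write(self, data: bytes):
        # physical update: overwrite previous contents
        self.server._files[self.path] = data

    def semantic_update(self, op_name: str, *args):
        # semantic update: record a procedure to re-apply at other replicas
        self.server._update_logs.setdefault(self.path, []).append((op_name, args))

    def close(self):
        pass

# Usage: an exclusive write session followed by a relaxed read session.
server = ToySwarmServer()
with server.session("/auction/item42", mode="WRLK") as f:
    f.write(b"current bid: 17")
    f.semantic_update("place_bid", "alice", 18)
with server.session("/auction/item42", mode="RD", staleness_secs=30,
                    strength="soft") as f:
    print(f.read().decode())
```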
With these services, we demonstrate three properties of our configurable consistency implementation that are important to support wide area replication effectively: (1) efficient enforcement of diverse consistency semantics, (2) network economy, and (3) failure-resilience at scale. The Swarm prototype used for our evaluation supports all of the configurable consistency options except causality, atomicity, and timeliness control via limiting unseen updates (the mod bound option). As the last column of Table 1.2 indicates, we evaluated the four services with a variety of consistency flavors. More details of the services and their evaluation are presented in Chapter 5. Our first service, called Swarmfs, is a wide area peer-to-peer file system with a decentralized but uniform hierarchical file name space. It provides ubiquitous file storage to mobile users via autonomously managed file servers [69]. With Swarmfs, multiple applications can simultaneously access the same or different files enforcing diverse consistency semantics such as the strong consistency of Sprite [48], the close-to-open consistency of AFS [28] and Coda, the weak eventual consistency flavors of Coda, 17 Pangaea, NFS, and Ficus file systems [38, 62, 57], and the append-only consistency of WebFS [74]. On a personal file access workload, Swarmfs delivers 90% of the performance of comparable network file systems (Coda and NFS) on a LAN and 80% of local linux file system performance across a WAN despite being layered on top of an unoptimized middleware. Unlike existing file systems, Swarmfs can exploit data locality even while providing the strong consistency required for file locking across nodes. For WAN sharing of version control files that require file locking (e.g., in a shared RCS/CVS repository), Swarmfs provides near-local latency for checkin and checkout operations when repository accesses are highly localized to a site, and less than one-third the latency of a traditional client-server version control system even in the presence of widespread sharing. Our second application, called SwarmProxy, models how the responsiveness of an online shopping service such as Amazon.com could be improved for clients in different geographic regions by deploying WAN proxies. By caching remote service objects such as inventory records locally, proxies can improve the response time, throughput and availability of the service. However, a shopping service tends to have stringent data integrity constraints for transactions such as inventory updates. Enforcing those constraints often requires strong consistency [41] among caches which can be expensive over a WAN, especially when clients contend for access to shared objects. Our evaluation of SwarmProxy demonstrates that proxy-caching with Swarm improves the aggregate throughput and the average latency of an enterprise service by several times relative to a remote centralized server even when clients exhibit significant contention (i.e., 40% locality). Swarm’s contention-aware replication control provides the low latency and high throughput of a LAN-based server under low contention (i.e., 60% locality or more), and more than 90% of a remote server’s performance even under extreme contention (i.e., less than 10% locality). In comparison, aggressive caching underperforms by a factor of 3.6 at all levels of contention due to the high cost of synchronization over a WAN. 
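Swarm's contention-aware replication control is presented in Chapter 4. Purely to make the idea concrete, the sketch below shows one plausible policy of this general kind, with hypothetical names and thresholds (ReplicationController, min_locality) that are not taken from Swarm: keep serving from a local cached replica while recently observed locality stays high, and fall back to forwarding operations to the home replica when remote contention rises.

```python
# One plausible (hypothetical) contention-aware policy: cache aggressively
# while locality is high, fall back to forwarding (function-shipping) when
# remote interleaving makes cache synchronization too expensive.
# This is an illustration, not Swarm's actual algorithm.
from collections import deque

class ReplicationController:
    def __init__(self, window=100, min_locality=0.4):
        # True = local access, False = an interleaved remote write was observed
        self.recent = deque(maxlen=window)
        self.min_locality = min_locality

    def record(self, local: bool):
        self.recent.append(local)

    def locality(self) -> float:
        return sum(self.recent) / len(self.recent) if self.recent else 1.0

    def should_cache(self) -> bool:
        # Replicate only while observed locality stays above the threshold;
        # otherwise ship the operation to the home replica.
        return self.locality() >= self.min_locality

ctl = ReplicationController()
for access in [True, True, False, True, False, False, True]:
    ctl.record(access)
print("cache locally" if ctl.should_cache() else "forward to home replica")
```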
Our third application, called SwarmDB, illustrates how Swarm enables read-write caching to be added transparently to an existing database while providing significant control over consistency to its applications, and the performance and engineering benefits 18 of this approach. We augmented the popular BerkeleyDB database library [67] with caching support by implementing a wrapper library named SwarmDB around unmodified BerkeleyDB. SwarmDB intercepts BerkeleyDB calls to operate on replicas. It hosts each BerkeleyDB database inside a Swarm file and uses Swarm’s semantic updates to synchronize replicas. Unlike BerkeleyDB’s existing master-slave replication that only supports read-only replicas, Swarm-based BerkeleyDB can support full-fledged readwrite replication with a wide range of consistency semantics. We evaluate SwarmDB’s query and update throughput under an update-intensive workload. We evaluate five distinct consistency flavors for queries and updates, ranging from strong (appropriate for a conventional database) to time-based eventual consistency (appropriate for many directory services), and compare the performance of the resulting system to BerkeleyDB’s client-server (RPC) implementation. We find that relaxing consistency requirements even slightly improves write throughput by an order of magnitude and scales beyond RPC and master-slave, enabling SwarmDB to be reused for diverse applications. We implemented a prototype peer-to-peer file sharing service using SwarmDB for attribute-based file search. We used this service to evaluate the network economy and failure-resilience of Swarm’s replication and lease-based consistency mechanisms under node churn (i.e., where nodes continually join and abruptly leave the network). In a 240node file sharing network, Swarm’s proximity-aware replica management mechanism reduces the latency and WAN bandwidth consumption for new file downloads and limits the impact of high node churn (5 node deaths/second) to roughly one-fifth that of random replica networking (employed by several P2P systems such as KaZaa [34]). Our final application, called SwarmCast, is an online streaming service that stresses the ability of Swarm to efficiently multicast information to a large number of sites in realtime. It models applications where a number of users or service components interact in publish-subscribe mode, such as online real-time chatting, Internet gaming, data logging, and content dissemination. These applications require events/data produced at a site (such as a message typed by a user to a chat room or a player’s move in a game) to be propagated to other affected sites in real-time. In SwarmCast, a single event producer sends event packets at full-speed to a relay server which in turn multicasts them to a 19 large number of subscribers. A centralized relay server can be quickly overwhelmed by the CPU and network load imposed due to a high rate of concurrent events. Swarm’s replica network can be leveraged as an overlay network of relay servers to scale the event-handling rate of SwarmCast, as follows. The producer appends event packets at full-speed to a shared log file stored in Swarm by sending semantic updates to a Swarm server. Subscribers open the log file for reading at one or more Swarm servers and register to receive log updates. In response, Swarm servers cache the log file, thereby forming a replica hierarchy to efficiently multicast the event packets. Thus, network I/O-handling is largely eliminated from SwarmCast application code. 
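As a rough illustration of the publish-subscribe pattern just described, the toy sketch below keeps the shared log in a single process and delivers each appended event to registered subscribers immediately, mirroring per-update visibility; in SwarmCast the log is a replicated Swarm file and events travel down the replica hierarchy. The names (SharedLog, subscribe, append) are hypothetical, not part of Swarm.

```python
# Toy illustration of the SwarmCast pattern: a producer appends events to a
# shared log and subscribers receive them via update notifications.  Here
# everything is in-process; in Swarm the log is a replicated file.
from typing import Callable, List

class SharedLog:
    def __init__(self):
        self._entries: List[bytes] = []
        self._subscribers: List[Callable[[bytes], None]] = []

    def append(self, event: bytes):
        # Corresponds to a semantic 'append' update with per-update
        # visibility: each event is propagated as soon as it is issued.
        self._entries.append(event)
        for notify in self._subscribers:
            notify(event)

    def subscribe(self, callback: Callable[[bytes], None]):
        self._subscribers.append(callback)

log = SharedLog()
log.subscribe(lambda e: print("subscriber A got:", e.decode()))
log.subscribe(lambda e: print("subscriber B got:", e.decode()))
log.append(b"player 3 moved north")
log.append(b"chat: hello, room")
```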
In spite of our unoptimized Swarm implementation, Swarm-based multicast pushes data at 60% of the rate possible with ideal multicast on a 100Mbps switched LAN with up to 20 relays and 120 subscribers. However, since our Swarm prototype is compute-bound, the end-toend packet latency through a tree of Swarm servers 5 levels deep is about 250ms on an 850MHz Pentium III CPU and 80ms on a 3GHz Pentium 4. Half of the latency is incurred on the first hop due to queuing before the root server. Each of these applications has different requirements for accessing cached data and achieves different consistency semantics using Swarm. Swarm successfully handles caching for each of these applications, offloading its complexity from the application logic. Swarm-based implementation of these applications provides comparable (60%) or better performance relative to implementations based on existing support. This shows that configurable consistency is powerful, i.e., supports a variety of useful consistency semantics, as well as practical and useful, i.e., effectively supports wide area replication for diverse distributed applications. 1.4 Thesis Therefore, we put forth the thesis that a small set of customizable data coherence mechanisms can support aggressive wide-area data caching effectively for distributed services with diverse consistency requirements. The major contributions of this dissertation research are the following: 20 1. We have studied the diverse data sharing needs of a number of important distributed applications belonging to three major classes, classified those needs in a way that exposes their commonality, and identified certain core design choices that recur in the consistency management of these applications. 2. Based on the commonalities identified by our application study, we have developed a novel flexible consistency framework that can support a broader set of consistency requirements of these applications than existing consistency solutions, with simple parameters along five orthogonal dimensions. 3. We showed that the framework can be implemented to support diverse application needs efficiently for caching in non-uniform networks. To do this, we implemented a caching middleware that adopts this framework, and demonstrated four applications that reuse the middleware and its flexible consistency in very different ways for effective wide area caching. In the remainder of this dissertation, we motivate and then demonstrate the truth of our thesis statement. 1.5 Roadmap In Chapter 2, we discuss our survey of the replication needs of diverse distributed applications and the rationale behind our configurable consistency management approach. In Chapter 3, we describe the configurable consistency framework and discuss the rich variety of useful consistency semantics that are possible in a system that adopts configurable consistency. In Chapter 4, we first describe how configurable consistency can be used in the design of distributed applications. We then describe the interface, design and implementation of the Swarm replication middleware, our research vehicle for evaluating the practicality of configurable consistency. In Chapter 5, we present our evaluation of Swarm in the context of the four services mentioned in Section 1.3. For each service, we start by describing how we implemented it on top of Swarm and then present a detailed evaluation of its performance and compare it with alternative implementations where applicable. 
In Chapter 6, we discuss Swarm in the context of the extensive existing 21 research on replication and consistency. Finally, in Chapter 7, we suggest some directions for future work and summarize our conclusions. CHAPTER 2 CONFIGURABLE CONSISTENCY: RATIONALE As mentioned in Chapter 1, our goal is to determine if middleware support for data caching can be provided in a reusable manner for a wide variety of distributed services. To address this question, we first surveyed a wide variety of distributed services and identified those that can benefit from data caching. Next, we examined their sharing needs to see if there is sufficient commonality in those needs for a reusable middleware to be useful. In this chapter, we describe our survey and our findings to motivate our decomposed approach to consistency management. We begin this chapter by defining what we mean by consistency and state the objective of our application survey in Section 2.1. We describe the classes of applications that we surveyed in Section 2.2. In particular, we describe four representative applications from these classes that we use to motivate and evaluate our work. In Section 2.4, we present our observations on the consistency needs of applications, identifying significant commonalities. In Section 2.6, we discuss how these commonalities enable the development of our configurable consistency framework. In Section 2.7 we discuss other approaches to flexible consistency management in the light of our survey of application requirements. 2.1 Background 2.1.1 Programming Distributed Applications The paradigms adopted by designers of distributed applications can be loosely classified as data-shipping, function-shipping or a hybrid. In all the three paradigms, a distributed application is structured in terms of distributed components that own and manage different portions of its overall state. Here, state refers to application data that can persist across multiple application-level computations (such as processing different client requests). In the function-shipping paradigm (commonly known as the client- 23 server paradigm), a component operates on remote state by sending messages to its owner. However in the data-shipping paradigm, a component operates on remote state by first obtaining it locally, thereby exploiting locality. A shared data abstraction can relieve the programming effort involved in data-shipping by providing the illusion that all application state is available locally to each component, and can be accessed as in a multithreaded single-machine environment. Data caching/replication is relevant to those portions of a distributed application that are designed for data-shipping. The structure of an application using function- and data-shipping is illustrated in Figures 2.1 and 2.2 respectively. For example, a chat service is traditionally split into two kinds of components with different roles: the client and the server. The client is responsible for managing the user interface and sending user messages to the server. The server keeps track of all application state, e.g., the chat transcript and the chat room membership, and communicates client messages to other clients. However, in a pure data-shipping model, both the client and the server directly access the application state, namely the chat transcript, thus blurring their distinction. They become peer service replicas. 
The job of tracking chat room membership and broadcasting messages between clients is then delegated to an underlying data management system that mediates access to application state. In a hybrid paradigm, illustrated in Figure 2.3, application components (also called service agents) can take on either role based on the processing, storage and network resources at their disposal. Thus, the chat service agent at a client site with a high bandwidth network link can take on the role of a peer chat server to ease load on other servers, while other service agents continue as clients. Thus, the hybrid paradigm allows considerable flexibility in structuring a distributed application in a way that allows incremental scaling with offered load. 2.1.2 Consistency Data-shipping requires coordinating concurrent accesses to application state to ensure its consistency in the face of distribution and replication. Traditionally, the consistency of shared data is defined in terms of a consistency model, which is a contract between 24 Figure 2.1. A distributed application with five components A1..A5 employing the function-shipping model. Each component holds a fragment of overall state and operates on other fragments via messages to their holder. an application and the underlying shared data management system regarding data access [70]. If the application adheres to system-imposed conventions for data access, such as informing the system before and after each access, the system guarantees to maintain certain invariants about the quality of data accessed, referred to as its consistency semantics. An example invariant could be “all components see updates in the same order”. Consistency mechanisms are the actions taken by the system’s components to enforce this contract. These actions include (i) controlling admission of access and update operations on data replicas and (ii) propagating updates among replicas to maintain desired data quality invariants. A consistency framework supports one or more consistency models for shared data access by providing an interface to express them as well as mechanisms to enforce them. Previous research has revealed that distributed applications vary widely in the consistency semantics they need [77]. It is also well-known that consistency, scalable performance, and availability in the wide area are often conflicting goals. Different applications need to make different tradeoffs based on application and network characteristics. Many existing systems [38, 62, 52, 41, 46, 61] handle this diversity by devising consistency models tailored to specific classes of applications. Several recent research efforts 25 Figure 2.2. A distributed application employing the data-shipping model. A1..A5 interact only via caching needed state locally. A1 holds fragments 1 and 2 of overall state. A2 caches fragments 1 and 3. A3 holds 3 and caches 5. (e.g., Fluid replication [37], Oceanstore[60] and TACT [77]) have developed consistency frameworks that cater to a broader set of applications by providing options to customize consistency. However, a complete reusable middleware solution that helps arbitrary wide area applications achieve the right balance does not exist. The difficulty lies in developing a consistency framework that can express a broad variety of consistency semantics, can be enforced efficiently by a small set of mechanisms, and is flexible and adaptable enough to enable applications to make the right tradeoffs based on their environment. 
The advantage of having such a framework is that it paves the way for designing a reusable middleware for replication management in distributed applications. 2.2 Replication Needs of Applications: A Survey To study the feasibility of a consistency framework that supports a broader set of replication needs than addressed by existing systems, we surveyed a wide variety of distributed applications looking for common traits in their diverse needs. We considered three broad categories of popular Internet-based applications that handle a significant amount of data and can benefit from replication: (i) file access, (ii) collaborative groupware, and (iii) enterprise data services. Efficient replication solutions exist for individual applications in many of these categories. However, our aim is to arrive at a systematic 26 Figure 2.3. Distributed application employing a hybrid model. A1, A3 and A5 are in data-shipping mode. A2 is in client mode, and explicitly communicates with A1 and A3. A4 is in hybrid mode. It caches 3, holds 4 and contacts A5 for 5. understanding of their consistency requirements to see if a more generic middleware solution is feasible. Our study, presented in this section, reveals that the consistency requirements of a wide variety of applications can be expressed along a small set of fairly orthogonal dimensions. As we argue in Section 2.7, to express the same set of requirements naturally, existing consistency interfaces need to be extended further. The first category of applications facilitates users accessing their files over a network and sharing them with other users for various collaborative tasks. Access to personal files, shared documents, media, calendars, address books, and other productivity tools falls under this category. Although specific instances of end user applications in this category use files in very different ways, they all exhibit a high degree of locality, making caching highly effective at improving file availability and access latency. Applications in the second category enable users to form online communities, exchange information and interact in real-time or otherwise. Examples include bulletin boards, instant messaging, Internet gaming, collaborative search and resource sharing. Though these applications do not always exhibit high degrees of locality, replication helps these applications scale to a large number of users by spreading the workload among multiple server sites. The third category involves multi-user access to structured data. In addition to supporting 27 concurrent read-write sharing, these applications also need to meet integrity constraints on the data. Example applications include enterprise databases (e.g., inventory, sales, and customer records), CAD environments, and directory services that index dynamic content such as online shopping catalogs, bulletin board messages, user profiles and network resources. These applications are typically centrally administered, and replication helps them provide responsive service to geographically widespread users by hiding wide area latencies and network failures. Table 2.1 summarizes the data characteristics and replication needs of several applications in each of the three categories. For each application, it lists why and where replication is beneficial, the typical replication factor, the available locality, the frequency and extent of read or write sharing among replicas, and the likelihood of conflicts among concurrent updates. 
2.3 Representative Applications For the remainder of this discussion, we will focus on four applications that are broadly representative of a variety of wide area services: a file service, a shared music index, a chat service, and an online auction service (highlighted in bold italics in the table). We refer to these four applications throughout this thesis to motivate the design decisions that we made. We describe our implementation and performance evaluation of these applications using our consistency management framework in Chapter 5. Before proceeding with our analysis in Section 2.4, we describe each of the four applications in turn. 2.3.1 File Sharing Our first focus application is a distributed file service. A file system is a convenient and popular way to access and share data over a network. File systems have been used to store and share persistent data by myriad applications, so a large number of file systems have been developed for various environments. Depending on their application, files are accessed and shared in different ways in a variety of network environments. Personal files are rarely write-shared by multiple users, whereas software, multimedia and other read- 28 Table 2.1. Characteristics of several classes of representative wide-area applications. A row in bold face indicates an application class with default characteristics. Other rows describe specific applications. An application’s property is the same as that of its class, unless specifically mentioned. Applications Unit File service file personal files version control ad-hoc doc. sharing media files web, content distrib. data/event logging Replication Benefit # Data Characteristics Locality Sharing Conflicts file file doc low latency, availability availability low latency low latency 1-10 100s 10s single user read, write read, write no yes yes file low b/w util. 1000s file low b/w util. 10000s log low b/w util. 100s write-once, read many one-writer, read many read, append no db mesg mailbox load balance low b/w util. load balance availability 1000s 100s 1-10 event info queue low latency 1000s game state load balance, 10s- real-time streaming low b/w util. 100s Enterprise Data shopping catalog airline reservation directory online auctions CAD sales, inventory low latency, availability Groupware search index bulletin board email stock, news updates producerconsumer apps, search, mining chat, games high low 100s read-write read, write read, write one-reader, write-many one writer, readmany read, append low real-time, readwrite variable read-write, structured data rare no no no no db 100s rare flight 10s yes db catalog object db, item low latency load balance low latency 100s 10s 100s 10s high write long sessions 29 only files tend to be widely read-shared. Collaboration environments involve multiple users cooperating to update files via version control systems to prevent update conflicts. Mail systems store messages in files that are mostly accessed by their recipient. Though files are used in many ways, as indicated by their diverse sharing characteristics in Table 2.1, they exhibit a high degree of locality in general. Hence caching files aggressively is highly effective at improving latency and availability in the wide area. Existing file systems support only a subset of the sharing patterns listed in Table 2.1. 2.3.2 Proxy-based Auction Service Our second focus application is an online auction service. 
It employs wide-area service proxies that cache important auction information locally to improve responsiveness to customers in diverse geographic regions. It is modeled after more sophisticated enterprise services that handle business-critical enterprise objects such as sales, inventory, or customer records. Many of these services benefit from deploying wide area proxies that cache enterprise objects locally to serve regional clients. For example, the administrators of a corporation’s database service can deploy proxies at customer service (or call) centers that serve different sets of clients. The proxy at a call center caches many of the client records locally, speeding up response. An online auction service such as EBay [20] could deploy proxies at different sites to exploit regional locality. In this case, different proxies handle the sale of different items based on how heavily the items are traded in their region. Even if trading does not exhibit much locality, proxies help spread the service load among multiple sites. An important challenge in designing a proxy-based auction service is the need to enforce strong consistency requirements and integrity constraints in spite of wide area operation. Node/network failures must not degrade service availability when proxies are added. Also since locality varies widely based on the popularity of sale items, caching decisions must be made based on the available locality. Widely caching data with poor locality leads to significant coherence traffic and hurts rather than improving performance. 30 2.3.3 Resource Directory Service Our third focus application is a resource directory service that maintains a global index of shared resources (e.g., network devices, music files, employee information and machine configuration) based on their attributes. It is modeled after sophisticated directory services that offer complex search capabilities [46]. Unlike file accesses, the locality of index queries and updates vary widely. The consistency needs of the directory service depend on the resources being indexed and the applications relying on them. A very weak (e.g., eventual) consistency guarantee is sufficient for a music file index used for peer-to-peer music sharing such as KaZaa, but it must scale to a large floating population of (thousands of) users frequently updating it. An employee directory may need some updates to be “immediately” made effective at all replicas (e.g., revoking a user’s access privileges to sensitive data), while others can be performed with relaxed consistency. 2.3.4 Chat Service Our final focus application is an online chat service. It models several applications that involve producer-consumer style interaction among a number of users or service components, such as instant messaging, Internet gaming, Internet search and data mining. The application requires that events/data produced at one or more sites, such as a message typed by a user to a chat room or a player’s move in a game, be propagated to other affected sites in real-time. In some cases, the events need to be delivered in certain order, e.g., total or causal order. For example, chat room messages require causal order for discussions to make sense (e.g., an answer appearing before its corresponding question is confusing), whereas players’ moves in a game must be delivered in the same order to all participants for fairness. A key issue for scaling such a service is the CPU and network load imposed on the server. 
The number of users who interact directly (e.g., participants in a chatroom or players active in a shared virtual space) is usually small (dozens). But the total number of passive subscribers to the service could be very large (thousands or more), e.g., listeners in a chat room. A centralized game or chat room server quickly saturates its outgoing 31 Internet link, because the bandwidth consumed for broadcast among N users through a single node grows quadratically with N (since each user’s message must be sent to N-1 other users). Replicating service state on proxy (chat or game) servers helps incrementally scale the overall service with load, as the replicas can be organized into an overlay network and synchronized via an efficient multicast mechanism. When replicated, the service must provide event ordering and/or real-time propagation guarantees, and scale to a large number of participants and a moderate to high rate of concurrent events. 2.4 Application Characteristics Table 2.1 reveals that in general applications widely differ in their natural unit of sharing, their data access characteristics and their replica scaling requirements. The natural unit of sharing could be a whole file or database, its individual pages, or the application objects embedded therein. Applications differ in their available locality and the likelihood of conflicts among concurrent updates. By locality, we mean the percentage of data accesses at a site that are not interleaved by updates to remote replicas. File accesses are known to exhibit high locality and hence aggressive caching improves their access latency and availability in the wide area, even when updates need to be serialized to prevent conflicts (as in version control). Collaborative groupware applications involve concurrent updates to shared data and exhibit low locality. However, since their updates rarely conflict, caching improves concurrency. Enterprise applications involve frequent updates that are more likely to conflict, and hence need to be serialized to preserve data integrity. They exhibit varying locality. Naively replicating enterprise data with insufficient locality could severely hurt performance and service availability. Hence unlike in the case of file access, these applications need to adapt between function-shipping and aggressive replication based on the amount of data access locality at various sites. Applications have different scaling requirements in terms of the number of replicas of a data item, ranging from a few dozens to thousands. The replication factor depends on the number of users that actively share a data item and the number of users that a single replica can support. 32 Table 2.1 shows the data characteristics of the applications surveyed. For the purpose of this survey, we classify data accesses into reads and writes, but define them broadly to include arbitrary queries and updates. Reads can be arbitrary queries that do not modify data. Writes access data as well as make arbitrary modifications either by directly overwriting existing contents or via update procedures that can be later re-executed at other replicas. The sharing patterns of applications range between little sharing (e.g., personal files), widespread read-only sharing (e.g., music files, software), and frequent read-write sharing (e.g., documents, databases). 2.5 Consistency Requirements Table 2.2 summarizes the consistency requirements of the surveyed applications. The columns classify each application’s required invariants along several aspects. 
We now discuss each of these aspects, illustrating them with the four target applications mentioned in Section 2.2. 2.5.1 Update Stability Applications differ in their notion of what constitutes a “stable” update to a data item, as shown in Column 2 of Table 2.2. Informally, a stable update is one that transforms data from one valid state to another valid state that can be made visible (or propagated) to remote replicas. Updates to source files and documents are typically considered stable only at the end of a write session, whereas user messages in a chat session are independent and can be sent to other users as soon as they are received. These notions of an application’s acceptable granularity of update visibility are referred to in Table 2.2 as session and per-update visibility respectively. 2.5.2 View Isolation Applications have different requirements for the extent to which ongoing data access at a replica site must be isolated from concurrent remote updates. Reading a source file or querying an auction service requires an unchanging snapshot view of data across multiple accesses during a file session or a database query session. On the other hand, to enable 33 Table 2.2. Consistency needs of several representative wide-area applications. Applications File service personal files version control ad hoc doc. sharing media files WWW, content distrib. data/event logging, Groupware shared music index bulletin board email stock, news updates chat games, streaming Enterprise Data shopping catalog airline reservation enterprise directory online auctions CAD sales, inventory Update View Stability Isolation Concurrency Replica synchronization control read write Timeli- Str- Update ness ength deps. Failure Handling optimistic session session session session weak weak current hard weak excl. hard causal none session session weak weak current hard causal session session session session soft soft append per-update single op update per-update weak weak time current soft weak weak none none total (time), serial soft manual pessimistic optimistic none msg op causal msg op update current soft time hard none total pessimistic post msg move current soft current hard causal total pessimistic hard update per-update weak weak reservation per-update weak weak mod update transaction weak weak varies bid, buy transaction weak excl. manual transaction manual transaction excl. excl. excl. excl. soft optimistic pessimistic hard varies atomic, serial atomic atomic pessimistic optimistic pessimistic 34 real-time interaction, a chat service requires that incoming updates be applied locally before every access. Column 3 refers to these requirements as session and per-update isolation respectively. In general, applications require the ability to group a set of individual data accesses into a session for expressing their update visibility and view isolation requirements. Traditionally, transactions [8] are one common way to express such requirements in the context of database applications. 2.5.3 Concurrency Control Applications that involve write-sharing differ in the extent to which they can handle concurrent reads and writes at multiple replicas. In general, allowing parallel reads and writes has two consequences. First, if replicas apply updates in different order from one another, their final results could diverge and the effect of some updates could be lost permanently. Such updates are said to conflict. 
Replicas can resolve conflicts by undoing and reapplying updates in a final order consistent at all replicas (called commit order), or by merging their results to avoid losing the effect of some updates. This leads to the second consequence: reads may provide an incorrect view of data if they observe the effect of uncommitted writes (i.e., those whose position in the final commit order is yet undetermined) that have to be undone later. An application can afford to perform parallel writes if their semantics allow conflicts to be resolved in a way that ensures a consistent final result at all replicas. Otherwise, the application must serialize writes to avoid conflicts. Similarly, the application must serialize its reads with writes to ensure that its users always view committed data. A broad variety of applications that we surveyed either require serialized accesses or can handle an arbitrary amount of parallelism provided replicas converge on a common final result. In Columns 3 and 4 of Table 2.2, we refer to accesses that must be serialized with other writes as ‘excl’ (for exclusive) and others as ‘weak’. For the auction service, conflicting concurrent writes at different replicas are unacceptable as they are hard to resolve automatically (e.g., selling the same item to different clients). A shared music directory 35 can perform parallel lookups, insertions and deletions as long as replica convergence can be assured. This is because conflicting insertions and deletions can at most cause some subsequent file lookups to fail temporarily. Most file accesses except version control exhibit little write-sharing, and hence need not be serialized. The chat service, data logging, and other publish-subscribe services perform a special type of write, namely, appending data to the end of a shared file or queue. Unlike raw writes to a file, appends can be executed concurrently, as they can be automatically reconciled by applying them sequentially without losing data. 2.5.4 Replica Synchronization Replicas must be synchronized in a timely manner to prevent their contents from diverging indefinitely. Some applications must always access the latest data, while others can tolerate ‘eventual’ replica convergence. We identify three degrees of acceptable replica divergence: current, time (time-bounded), and mod (modification-bounded), as shown by Column 5 labelled ‘Timeliness’ in Table 2.2. For example, users working on a shared document expect it to reflect the latest updates made by other users before each editing session (current). For stock-quote updates to be useful, their staleness must be bounded by a time interval (time). Finally, an airline reservations system must limit the extent to which a flight is overbooked. To do this accurately across multiple database replicas, it must enforce an upper limit on the number of unseen remote reservations at each replica (mod). Some applications need to hand-tune their replica synchronization strategy based on domain-specific knowledge. We refer to their requirement as a ‘manual’ divergence bound. For instance, consider a shared music file index, portions of which are cached by a large number of Internet users. An index replica needs to resynchronize with other replicas only to serve a local cache miss, i.e., if an entry being looked up is not found in the local index replica. Similarly, when a site advertises a new entry to its local index replica, it need not actively inform all other replicas about the new entry. 
However, when a site removes an entry from the index, it needs to propagate that deletion quickly to those replicas that got notified of its insertion, to force them to purge that entry from their 36 copies. With this scheme, each index replica eventually collects its working set of index entries, and avoids synchronization as long as its working set does not change, regardless of new additions elsewhere. Manual control over synchronization is also helpful to conserve battery power on mobile devices by exploiting batched sessions. Transmitting data in large chunks infrequently is known to consume less power than multiple frequent transmissions of small packets [31]. Batching updates also helps conserve bandwidth by removing self-canceling operations (such as addition and deletion of the same name from a directory) from the update stream. Applications differ regarding whether their timeliness bounds are ‘hard’, i.e., need to be strictly enforced, or ‘soft’, indicating that best effort suffices. Strictly enforcing a timeliness bound incurs synchronization delays and reduces an application’s availability when some replicas become inaccessible or weakly connected. Moreover, not all applications require hard bounds. It is unacceptable for applications such as chat as it severely limits their responsiveness (e.g., delaying the acceptance of a user’s next message until her previous message reaches everybody). Groupware applications such as chat, bulletin boards and resource directories can tolerate best-effort eventual convergence qualified by a soft bound in exchange for better responsiveness and service availability. Column 6 of Table 2.2 lists the strength of timeliness bound required by the applications surveyed as either hard or soft, and reveals the diversity in this aspect. 2.5.5 Update Dependencies Some updates are semantically dependent on other updates made earlier to the same or different data items. For application correctness, such dependencies must be taken into account when propagating updates among replicas. We identify two important types of dependencies that capture the needs of a wide variety of applications: causality, and atomicity. Applications also differ in the order in which they require concurrent independent updates to be applied at various replicas, which we classify as none, total order, and serial order. The ‘update deps.’ column of Table 2.2 shows these requirements. Causal dependency [18] means that if replica A sees B’s update and then makes an update, B’s update must be seen before A’s everywhere. For example, in a file system 37 directory, a file deletion followed by the creation of a new file with the same name must be performed in the same order everywhere [62]. Other collaborative applications such as chat service and email delivery require that causally dependent updates must be seen in causal order everywhere, but otherwise require no ordering. Some updates must be applied atomically (in an all-or-nothing fashion) at each replica to preserve data integrity despite partial failures, especially when updates span multiple objects or consistency units. Examples include the file rename operation in a file system and money transfer between bank accounts. Unordered update delivery (none) suffices for applications whose updates are independent and commutative. For instance, updates to different entries in a distributed file system directory or a music index are commutative and can be applied in any order at different directory replicas. 
When updates are independent but not commutative (e.g., when a data item is overwritten with a new value), they must be applied in the same (i.e., total) order everywhere to ensure replica convergence. For example, if concurrent conflicting changes to a file are propagated to different replicas in a different order, their contents could diverge permanently. Depending on the application, the criterion for the actual order could be arbitrary or based on update arrival times. Applications that rely on the chronology of distributed events need to impose a total update ordering that matches their time order. Examples include user moves in a multi-player game and real-time event monitoring. Finally, some updates that require total ordering cannot be undone, and hence cannot be reordered during propagation among multiple replicas. They must be globally serialized (e.g., by executing one after another). For example, consider a replicated queue. It is unacceptable for multiple clients that concurrently issue the dequeue operation at different replicas to obtain the same item, although multiple enqueue operations can be executed concurrently and reordered later to ensure the same order everywhere. Hence the dequeue operation requires ‘serial’ ordering, while enqueue requires ‘total’ ordering. 38 2.5.6 Failure-Handling Partial failures (such as some nodes or network links going down) are inevitable in the wide area due to independent failure modes of components. Applications differ in how they can handle partial failures. When desired consistency semantics cannot be guaranteed due to some replicas becoming inaccessible, some applications can be optimistic, i.e., continue to operate under degraded consistency so as to remain available to users until the failure heals. Others must pessimistically abort affected operations or delay them until the failure is repaired, as otherwise the application could enter an invalid state due to conflicting updates. The choice depends on the likelihood of conflicts and whether an application can automatically recover from the inconsistency. For example, an online auction service must be pessimistic when partitioned, as selling the same item to multiple clients is unacceptable behavior and cannot be undone. In contrast, a file service can optimistically allow file access when partitioned, because conflicting file and directory updates are either rare or could be reconciled by merging their results. Wide area networks often exhibit low network quality due to the presence of lowbandwidth links or due to transient conditions such as congestion. Even if network connectivity is not lost completely, maintaining normal consistency in such ‘weak connectivity’ scenarios could drastically degrade application performance. For some applications, e.g., a file service, better application performance might be achieved by treating weak connectivity as a failure condition and switching to degraded consistency. The Coda file system provides a weak connectivity mode in which a client does not maintain strong consistency with a server while the quality of its network link to the server drops below a quality threshold. An enterprise directory may need to maintain different levels of consistency for clients within a region and for clients in between campuses for good performance. However, a chat service must continue to provide real-time synchronization despite weak connectivity. 
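Before summarizing these requirements, it is worth noting that the causal-ordering requirement of Section 2.5.5 is commonly enforced with well-known mechanisms. As one concrete illustration, the sketch below shows the textbook vector-clock delivery test, using the chat example in which an answer must not appear before its question; it is a generic mechanism, not the protocol Swarm uses, and the function names are ours.

```python
# Standard causal-delivery test using vector clocks: an update from replica
# j carries j's vector clock; it may be applied at replica i only if it is
# the next update from j and i has already seen every update it causally
# depends on.  Generic textbook mechanism, not Swarm's protocol.

def can_deliver(local_vc: dict, update_vc: dict, sender: str) -> bool:
    replicas = set(local_vc) | set(update_vc)
    for r in replicas:
        if r == sender:
            # must be the very next update from the sender
            if update_vc.get(r, 0) != local_vc.get(r, 0) + 1:
                return False
        else:
            # must not depend on updates we have not seen yet
            if update_vc.get(r, 0) > local_vc.get(r, 0):
                return False
    return True

def deliver(local_vc: dict, update_vc: dict, sender: str) -> dict:
    assert can_deliver(local_vc, update_vc, sender)
    return {r: max(local_vc.get(r, 0), update_vc.get(r, 0))
            for r in set(local_vc) | set(update_vc)}

# Example: the "answer before its question" anomaly is prevented.
local = {"A": 0, "B": 0}
question = {"A": 1, "B": 0}             # sent by replica A
answer   = {"A": 1, "B": 1}             # sent by B after seeing the question
print(can_deliver(local, answer, "B"))  # False: would precede the question
local = deliver(local, question, "A")
print(can_deliver(local, answer, "B"))  # True: causal order preserved
```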
2.6 Discussion

In the previous section, we examined the characteristics and consistency needs of a broad mix of distributed applications. We have expressed their diverse consistency needs along several dimensions and summarized them in Table 2.2. This exercise reveals that although the surveyed applications differ widely in their consistency needs, those needs can be expressed in terms of a small set of options. In particular, we found that consistency requirements can be classified along five major dimensions:

• concurrency control - the degree to which concurrent (read/write) accesses can be tolerated,
• replica synchronization - the degree to which replica divergence can be tolerated, including the types of interdependencies among updates that must be preserved when synchronizing replicas,
• failure handling - how data access should be handled when some replicas are unreachable or have poor connectivity,
• update visibility - the granularity at which the updates issued at a replica should be made visible globally,
• view isolation - the duration for which the data accesses at a replica should be isolated from remote updates.

In effect, there is a multidimensional space for expressing the consistency requirements of applications. There are multiple reasonable options along each of these loosely orthogonal dimensions. Each option suits several applications, though the suitable combination of options varies from one application to another. Moreover, the five dimensions correspond to the five tasks that any wide area consistency management system must address. Hence expressing application requirements along these dimensions amounts to specifying the appropriate consistency mechanisms that suit those needs. Based on these observations, we developed a novel approach to structuring consistency management in distributed applications, called configurable consistency. Configurable consistency lets applications express consistency requirements individually along the above dimensions on a per-access basis instead of only supporting a few packaged combinations of semantics. We believe that this approach gives applications more flexibility in balancing consistency, availability, and performance by giving them direct control over the tradeoffs possible.

2.7 Limitations of Existing Consistency Solutions

A number of consistency solutions have been proposed to meet the replication needs of the application classes addressed by our survey. In this section, we discuss three existing solutions that are representative of distinct approaches to flexible consistency management.

Fluid replication [37] targets cached access to remote file systems. It provides three discrete flavors of consistency semantics, namely, last-writer, optimistic, and pessimistic semantics. The first two flavors allow concurrent conflicting updates with total ordering, while the last option prevents conflicts by serializing updates. The 'last-writer' semantics imposes a total order by arbitrarily choosing one among conflicting updates to overwrite others, whereas 'optimistic' semantics detects and resolves the conflict using application-specific knowledge. Fluid replication leaves several important issues open, including replica synchronization frequency and dependencies between updates.

Oceanstore [60] provides an Internet-scale persistent data store for use by arbitrary distributed applications.
It supports data replication, but lets applications define and enforce consistency semantics by providing a semantic update model that treats updates as procedures guarded by predicates. Each application must define its own set of predicates that serve as preconditions for applying updates. Oceanstore provides primary-copy replication and propagates updates along an application-level multicast tree of secondary replicas. In contrast to Oceanstore, our goal is to develop a consistency framework with a small set of predefined predicates that can express the consistency needs of a broad variety of applications. The TACT toolkit [77] defines a continuous consistency model that can be tuned for a variety of consistency needs. Its model provides three orthogonal metrics in which to express the consistency requirements of applications: numerical error, order error, and staleness. Numerical error bounds replica divergence in terms of the cumulative weight of unseen remote writes at a replica, while staleness expresses the tolerable 41 replica divergence in terms of elapsed time. These metrics capture the ‘mod’ and ‘time’ criteria for timeliness described in Section 2.5.4. TACT’s order error restricts concurrency among data accesses by bounding the number of uncommitted updates that can be held by a replica before it must serialize with remote updates. A value of zero enforces serializability (corresponding to the ‘excl’ requirement), whereas a value of infinity allows unrestricted parallelism (corresponding to ‘weak’ accesses). Though this metric provides continuous control over concurrency, we believe, based on our study, that the two extreme settings (‘excl’ and ‘weak’) are both intuitive and sufficient to capture the needs of many applications. If application designers require finer-grain control over conflicts among parallel updates, they can achieve it by controlling the frequency of replica synchronization, which is more intuitive to reason about in application terms. Finally, to naturally express atomicity, causality and isolation constraints, applications need a session abstraction. 2.8 Summary In this chapter, we studied the feasibility of providing reusable support for replicated data management by surveying a variety of popular distributed applications belonging to three distinct categories: (i) file access, (ii) collaborative groupware, and (iii) enterprise data services. We argued that although they have replicable data and their replication needs vary widely, they have significant common needs that can be satisfied by common consistency mechanisms. To this end, we classified their consistency requirements along several dimensions that correspond to various aspects of consistency management. We argued that expressing applications’ consistency requirements along these dimensions paves the way for a flexible consistency framework called configurable consistency. Finally we examined several existing solutions for flexible consistency to see how well they can support the surveyed application classes. In the next chapter, we describe the configurable consistency framework and discuss how it can be used in the design of distributed applications. CHAPTER 3 CONFIGURABLE CONSISTENCY FRAMEWORK In the previous chapter, our survey of a variety of applications revealed several common characteristics that affect consistency. 
We observed that the consistency needs of those applications can be expressed as combinations of a small number of requirements along five dimensions: concurrency control, replica synchronization, failure handling, update visibility and view isolation. Based on this insight, in this chapter we present a novel consistency framework called configurable consistency that can express a broad variety of consistency needs, including those of the applications we surveyed in Chapter 2.

In Section 3.1, we enumerate the specific objectives and requirements of our framework based on the conclusions of our application survey. In Section 3.2, we describe the data access model for configurable consistency. In Section 3.3, we present the consistency framework including the options it provides. In Section 3.4, we describe how these options can be employed in combination to meet the consistency needs of the four representative applications we described in Section 2.2. In Section 3.5, we discuss the framework's generality by showing how its interface allows expressing semantics provided by a variety of existing consistency models. Finally, in Section 3.6, we discuss situations in which our framework shows its limitations.

3.1 Framework Requirements

Our goal in developing the configurable consistency (CC) framework is to provide efficient replication support for diverse applications in the three classes surveyed in Section 2.2. Those application classes have a broader diversity of replication and scaling needs than supported by any existing consistency solution. Hence to support them, the configurable consistency framework must meet the following requirements:

Generality: It must be able to express a broad variety of consistency semantics, specifically those required by the applications listed in Table 2.2.

Practicality: It must be enforceable in a WAN replication environment via a small set of application-independent mechanisms that facilitate reuse. These mechanisms must satisfy the diverse consistency needs of the applications mentioned in Table 2.1 efficiently.

3.2 Data Access Interface

Since replication is relevant to applications designed for data-shipping, our consistency framework assumes a system organization based on shared data access as explained in Section 2.1 and illustrated in Figure 2.2. In that organization, a distributed application is structured as multiple distributed components, each of which holds a portion of the global application state and caches other portions locally as needed. A consistency management system (depicted as a cloud providing the shared state abstraction in Figure 2.2) mediates all accesses to application state/data at each site and enforces the configurable consistency interface. A data item that is the unit of sharing and consistency maintenance could be an entire database or a file, an individual page, or an application object embedded therein.

Our framework suits systems that provide a session-based interface for application components (e.g., A1..A5 in Figure 2.2, hereafter referred to as clients of the consistency management system) to access their local replicas of shared data. This data access interface allows clients to open a session for each consistency unit, to read and write the unit's data in the context of that session, and to close that session when the access has completed.
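To make this session-oriented interface concrete, the following header-style C sketch outlines one possible shape for it. The cc_-prefixed names, the option fields, and the exact signatures are our illustration for this chapter, not Swarm's actual programming interface; Figures 3.1 and 3.2 below show the corresponding pseudo-code usage.

    #include <stddef.h>
    #include <sys/types.h>

    /* One choice per dimension; the names mirror the options of Section 3.3,
     * but the C spellings here are ours.                                    */
    enum cc_access_mode { CC_RD, CC_WR, CC_RDLK, CC_WRLK };
    enum cc_strength    { CC_HARD, CC_SOFT };
    enum cc_ordering    { CC_ORDER_NONE, CC_ORDER_TOTAL, CC_ORDER_SERIAL };
    enum cc_deps        { CC_DEPS_NONE, CC_DEPS_CAUSAL, CC_DEPS_ATOMIC };
    enum cc_failure     { CC_OPTIMISTIC, CC_PESSIMISTIC };
    enum cc_granularity { CC_PER_UPDATE, CC_SESSION, CC_MANUAL };

    /* The per-session consistency option vector: one setting along each of
     * the five dimensions (replica synchronization contributes the strength,
     * timeliness, ordering, and dependency fields).                         */
    struct cc_options {
        enum cc_access_mode mode;         /* concurrency control          */
        enum cc_strength    strength;     /* hard or soft timeliness      */
        double              time_bound;   /* staleness bound in seconds   */
        int                 mod_bound;    /* unseen-write weight bound    */
        enum cc_ordering    ordering;     /* order of independent updates */
        enum cc_deps        deps;         /* causal/atomic dependencies   */
        enum cc_failure     failure;      /* failure handling             */
        enum cc_granularity visibility;   /* update visibility            */
        enum cc_granularity isolation;    /* view isolation               */
    };

    typedef int cc_session_t;             /* opaque session handle        */

    /* Session-based access interface assumed by the framework (sketch). */
    cc_session_t cc_open  (const char *item, const struct cc_options *opts);
    ssize_t      cc_read  (cc_session_t sid, void *buf, size_t len, off_t off);
    ssize_t      cc_write (cc_session_t sid, const void *buf, size_t len, off_t off);
    int          cc_update(cc_session_t sid, const void *op, size_t oplen);
    int          cc_close (cc_session_t sid);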
The definition of reads and writes assumed by our framework is general, and includes both raw accesses to individual bytes of data (e.g., for file access) as well as arbitrary queries and updates in the form of application-specific semantic operations (e.g., debit account by 100). One possible way for the system to support semantic update operations in an application-independent manner is to allow clients to register a plugin that interprets those operations at each site. When clients supply a semantic update operation as opaque data as part of their writes (e.g., via a special update interface), the system can invoke the plugin to apply the operation at various replicas to keep them synchronized. Also, when consistency semantics allow concurrent updates that can conflict, clients can supply routines via the plugin that detect and resolve those conflicts in an application-specific manner.

Given this data access interface to the consistency management system, our framework extends the interface in several ways to provide consistency guarantees to clients on a per-session basis. For instance, when opening a session at a replica site, each client needs to specify its required consistency semantics for the data item accessed during that session as a vector of configurable consistency options explained in Section 3.3. In response, the system guarantees that the data accessed during the session will satisfy those consistency requirements. If allowing a session operation (i.e., open, read, write, update or close) to proceed would violate the semantics guaranteed to its session or to other sessions anywhere, the semantics are said to conflict. In that case, the system delays the session operation until the conflict ceases to exist (i.e., until the session with conflicting semantics closes). As a result, if two sessions request conflicting consistency semantics (e.g., both request exclusive write access to data), the system delays one of them until the other session closes. If the system cannot ascertain whether a session's consistency semantics conflict with those of other sessions elsewhere (e.g., because some replicas are unreachable), the system fails that session's operations.

Figures 3.1 and 3.2 illustrate in pseudo-code how a replicated database can be queried and updated using our framework. With the consistency options specified, the query guarantees to return results that are not more than 5 seconds stale, and the update guarantees to operate on data not more than 1 second stale. A distributed application that adopts our framework can impose a default per-data-item consistency semantics to be specified globally for all of its sessions, or impose a per-replica semantics to be specified by default for all of its local sessions, and still allow individual sessions to override those default semantics if necessary. Thus, the framework's per-session consistency enforcement gives the application a significant degree of flexibility in managing data consistency.

    querydb(dbfile) {
        cc_options = [ access-mode=RD, staleness=5sec,
                       update ordering=TOTAL, failure handling=OPTIMISTIC,
                       isolation & visibility=PER_UPDATE ];
        sid = open(dbfile, cc_options);
        ret = read(sid, dbfile, &buf);
        ...
        close(sid);
    }

Figure 3.1. Pseudo-code for a query operation on a replicated database. The cc_options are explained in Section 3.3.

    updatedb(dbfile, op, params) {
        cc_options = [ access-mode=WR, staleness=1sec,
                       update ordering=TOTAL, failure handling=OPTIMISTIC,
                       isolation & visibility=SESSION ];
        sid = open(dbfile, WR, cc_options);
        updt_pkt = pack(op, params, plugin);
        ret = update(sid, updt_pkt);
        ...
        close(sid);
    }

Figure 3.2. Pseudo-code for an update operation on a replicated database.

3.3 Configurable Consistency Options

In this section, we describe the options provided by the configurable consistency framework.
Table 1.1 in Section 1.2 lists those options, which can be classified along five dimensions: concurrency control, replica synchronization, failure handling, update visibility and view isolation. Clients express their consistency semantics for each session as a vector of options, choosing one among the alternatives in each row of Table 1.1. We describe configurable consistency options along each of the five dimensions and refer to the application needs that motivated their inclusion in our framework's option set. When describing the options, we also illustrate their use in the context of the four representative applications we mentioned in Section 2.2.

3.3.1 Concurrency Control

By concurrency, we mean the parallelism allowed among read and write sessions. With our framework, this can be controlled by specifying an access mode as one of the consistency options when opening a session. Our survey revealed that two types of read and write accesses, namely, concurrent and exclusive accesses, adequately capture the needs of a wide variety of applications. Based on this observation, we support two distinct flavors of access mode for each of the reads and writes: concurrent (RD, WR) and exclusive (RDLK, WRLK) access modes. Concurrent modes allow arbitrary interleaving of accesses across replicas. As a result, reads could return stale data and replicas at multiple sites could be written independently, which may cause write conflicts. Exclusive access mode sessions are globally serialized to enforce traditional concurrent-read-exclusive-write (CREW) semantics for strong consistency [48].

Table 3.1 shows the concurrency matrix that indicates the concurrency possible among various modes. When different sessions request both flavors simultaneously, RD mode sessions proceed in parallel with all other sessions including exclusive sessions, while WR mode sessions are serialized with exclusive sessions, i.e., they can occur before a RDLK/WRLK session begins or are deferred until it ends. Finally, exclusive mode sessions can proceed in parallel with RD sessions, but must block until ongoing WR sessions finish everywhere.

Table 3.1. Concurrency matrix for Configurable Consistency. 'X' indicates that sessions with those modes are not allowed to proceed in parallel as they have 'conflicting' semantics.

                RD    RDLK    WR    WRLK
    RD
    RDLK                      X     X
    WR                X             X
    WRLK              X       X     X

Regardless of the access modes that clients employ for their sessions, individual write operations by those sessions arriving at a replica are always guaranteed to be applied atomically and serially (one after another), and each read is guaranteed to return the result of a previous completed write. Thus replica contents are never clobbered due to the interleaved or partial execution of writes. Our model allows application-writers to provide their own routines via a plugin to resolve conflicting updates (caused by parallel writes to the same version of data via WR mode sessions). A simple resolution policy is 'last-writer-wins' [62], where the latest update by logical modification time overrides all others.
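To illustrate how such a resolver could be supplied, the following self-contained C sketch registers a plugin with an 'apply' callback and a last-writer-wins 'resolve' callback and uses it to settle a conflict between two concurrent WR-mode writes. The cc_plugin and cc_update types and their fields are hypothetical, not Swarm's actual plugin interface.

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* An update record as a resolver plugin might see it: an opaque payload
     * plus a logical modification timestamp (hypothetical layout).         */
    struct cc_update {
        uint64_t    mod_time;   /* logical modification time */
        const char *payload;    /* opaque update data        */
    };

    /* Callbacks an application could register: 'apply' replays a semantic
     * update at a replica, 'resolve' picks a winner among conflicting
     * concurrent updates.                                                  */
    struct cc_plugin {
        void (*apply)(void *replica_state, const struct cc_update *u);
        const struct cc_update *(*resolve)(const struct cc_update *a,
                                           const struct cc_update *b);
    };

    /* Last-writer-wins: the update with the later logical time overrides. */
    static const struct cc_update *lww_resolve(const struct cc_update *a,
                                               const struct cc_update *b) {
        return (a->mod_time >= b->mod_time) ? a : b;
    }

    /* A trivial 'apply' that overwrites the whole item. */
    static void apply_overwrite(void *replica_state, const struct cc_update *u) {
        strcpy((char *)replica_state, u->payload);
    }

    int main(void) {
        struct cc_plugin plugin = { apply_overwrite, lww_resolve };
        char replica[64] = "old contents";

        struct cc_update u1 = { 10, "written at site A" };
        struct cc_update u2 = { 12, "written at site B" };

        /* Two concurrent WR-mode writes conflicted; apply the winner only. */
        plugin.apply(replica, plugin.resolve(&u1, &u2));
        printf("resolved replica contents: %s\n", replica);
        return 0;
    }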
Other resolution options include reexecuting the update (if it is a semantic procedure), merging results, or rejecting the update.

To ensure serializable transactions that operate on the latest data, our proxy-caching enterprise application described in Section 2.3.2 would need to access its data within exclusive mode sessions that provide strong consistency. However, our other focus applications (music search directory, file access and chat service) can employ concurrent modes to access their data, since their requirements are not as strict. In particular, the chat service described in Section 2.3.4 must allow multiple users to concurrently append to a shared transcript file. This can be supported by treating an append operation as a semantic update and executing it in WR mode sessions for parallelism. The chat service can resolve concurrent appends across replicas by providing a resolution routine that applies all appends one after another in some order without losing data.

3.3.2 Replica Synchronization

Replica synchronization controls the extent of divergence among replica contents by propagating updates in a timely fashion. Our framework lets applications express their replica synchronization requirements in terms of timeliness requirements and the ordering constraints to be imposed on updates during their propagation.

3.3.2.1 Timeliness Guarantee

Timeliness refers to how close data viewed by a particular client at any given time must be to a version that includes all updates. Based on the divergence needs revealed by our application survey, our framework allows clients to impose two flavors of divergence bounds in combination for a session: a time bound, and a modification bound. These flavors are applicable only to concurrent access modes (RD, WR) because exclusive mode sessions are always guaranteed to view the latest contents. The time and modification bounds are equivalent to the staleness and numeric error metrics provided by the TACT consistency model [77]. Relaxed timeliness bounds weaken consistency by reducing the frequency of replica synchronization for improved parallelism and performance. Most applications that involve real-time interaction among distributed users, including cooperative editing, chat, and online auctions, need the most-current guarantee, which can be expressed as a time or modification bound of zero.

A time bound ensures that a replica is not stale by more than a given time interval from a version that includes all updates. It provides a conceptually simple and intuitive way to weaken consistency, and ensures an upper limit on the synchronization cost regardless of update activity. However, the extent of replica divergence within a time interval varies based on the number of updates that happen in that interval. A modification bound specifies the maximum cumulative weight of unseen remote writes tolerated at a replica, when each write can be assigned a numeric weight by the application (the default being 1). Applications can use this to control the degree of replica divergence in application-specific terms regardless of the frequency of updates. (Although we present a design for enforcing modification bounds in Section 4.7, our prototype implementation of configurable consistency does not support this option.)

In addition, our framework allows a client to explicitly request synchronization at any time during a session (called manual timeliness) via a pull interface that obtains the latest remote updates, and a push interface that immediately propagates local updates to all replicas.
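As an illustration of how an application might drive these manual hooks, the following C sketch walks through the shared music index usage pattern of Section 2.5.4: insertions are left to lazy propagation, a search pulls first, and a deletion is pushed promptly. The cc_pull/cc_push stubs and the local_* helpers are placeholders for the framework's pull/push interface and the application's own index code, not actual Swarm functions.

    #include <stdio.h>

    /* Stubs standing in for the framework's manual synchronization hooks. */
    static void cc_pull(const char *item) { printf("pull remote updates for %s\n", item); }
    static void cc_push(const char *item) { printf("push local updates for %s\n", item); }

    /* Placeholders for the application's own, purely local index operations. */
    static void local_insert(const char *item, const char *entry) {
        printf("insert %s into local copy of %s (no synchronization)\n", entry, item);
    }
    static void local_remove(const char *item, const char *entry) {
        printf("remove %s from local copy of %s\n", entry, item);
    }
    static void local_search(const char *item, const char *key) {
        printf("search local copy of %s for %s\n", item, key);
    }

    int main(void) {
        const char *index = "/music/index";

        /* New entries need not be propagated eagerly: other sites pull them
         * only when a search actually asks for them.                       */
        local_insert(index, "song-a.mp3");

        /* A search that must not miss matching entries pulls first. */
        cc_pull(index);
        local_search(index, "song-a");

        /* Deletions are pushed promptly so that replicas that learned of the
         * entry purge it from their copies.                                 */
        local_remove(index, "song-a.mp3");
        cc_push(index);
        return 0;
    }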
Such manual control enables an application to hand-tune its replica synchronization strategy based on domain-specific knowledge and relax the timeliness of data for performance. As explained in Section 2.5.4, the shared music file index benefits from manual timeliness control.

3.3.2.2 Strength of Timeliness Guarantee

Our framework allows applications to indicate that their timeliness bounds are either hard or soft (referred to as the 'strength' option in Table 3.2). When a client imposes a hard timeliness bound on the data viewed by a session, replicas must be synchronized (i.e., via pulling or pushing updates) often enough, stalling session operations such as reads or writes if necessary, to ensure that the data provided during the session stays within the timeliness bound. For instance, enforcing a hard time bound of zero requires a replica to pull remote updates before every access. Enforcing a hard mod bound of zero at a replica requires that as soon as a remote session issues a write operation, the write does not return successfully until it is propagated and applied to the zero-bound replica. A session's hard manual push operation cannot succeed until local updates are propagated to and acknowledged by all replicas. In contrast, soft timeliness bounds are enforced by replicas asynchronously pushing updates as they arrive in a best-effort manner, i.e., without blocking for them to be applied elsewhere. Employing soft bounds weakens consistency guarantees, but increases concurrency and availability. File sharing and online auctions require hard timeliness bounds, while a soft bound is adequate for the chat service and music search index.

3.3.3 Update Ordering Constraints

Update ordering constraints refer to the order in which an application requires multiple updates to be viewed by sessions. Our framework allows two types of ordering constraints to be expressed: (i) constraints based on semantic dependencies among updates (referred to as 'semantic deps.' in Table 3.2), and (ii) constraints that impose an implicit order among concurrent independent updates (referred to as 'update ordering' in the table). Our framework supports two types of semantic dependencies: causality and atomicity, and three choices for ordering independent updates: none, total ordering

Table 3.2. Expressing the needs of representative applications using Configurable Consistency. The options in parentheses are implied by the WRLK option. 'Serial' order means that the write is linearizable with other ongoing writes. Focus Application File Service Personal files, read-mostly shared doc lock file: read lock, write shared log Music index Chat service Auction: browse bid, buy, sell Concur. control Replica Synch. Update Str- Timeli- Data Visibility ength ness Deps.
RD,WR soft latest none RD,WR RD WRLK RD,WR hard hard (hard) soft latest latest (latest) varies RD RD,WR RD WRLK soft time, manual soft latest soft latest (hard) (latest) View Isolation Failure Handling session optimistic total session none session (serial) (session) varies per-update session session (session) per-update optimistic optimistic pessimistic optimistic none per-update per-update optimistic causal per-update none session (serial) (session) per-update session (session) pessimistic pessimistic pessimistic session and serial ordering, because our application survey revealed that they capture the update ordering constraints of a variety of applications. Our framework guarantees that updates issued by sessions at the same site are always applied in the order their sessions issued them, everywhere. Otherwise, updates are considered to be independent unless clients explicitly specify their semantic dependencies. Applications can express their ordering constraints in our framework by specifying the dependency and ordering types for each session when opening the session. Each update inherits the ordering and dependency types of its issuing session. To help capture dependencies among multiple objects/sessions, the framework also provides a depends() interface. The depends() interface expresses the dependencies of an ongoing session with other sessions or data item updates. The depends() interface and the framework’s interface to express ordering constraints among updates are together powerful enough to support higher-level constructs such as transactions to ease application programming, as we explain in Section 3.5.4. 51 3.3.3.1 Ordering Independent Updates Updates whose ordering is of type ‘none’ could be applied in different order at different replicas. Updates tagged with ‘total’ order are eventually applied in the same final order with respect to each other everywhere, but may need to be reordered (by being undone and redone) before their position in the final order stabilizes. Updates tagged with ‘serial’ order are guaranteed never to be reordered, but applied only once at each replica, in their final order. ‘Serial’ order can be used for updates that cannot be undone, such as results sent to a user. 3.3.3.2 Semantic Dependencies Given its interface to express dependencies, our framework provides the following guarantees to sessions, which we illustrate with examples in Figure 3.3 and 3.4: 1. Updates belonging to sessions grouped as ‘atomic’ (via the depends() interface) are applied atomically everywhere. 2. Updates by a ‘causal’ session are treated as causally dependent on previous causal and totally ordered updates viewed by the session as well as its causally preceding sessions (specified via the depends() interface); they are applied in an order that preserves this causal order. Figure 3.3 illustrates how update atomicity can be expressed using our framework, by giving the pseudo-code for a file move operation in a distributed file system from mv file(srcdir, dstdir, fname) { sid1 = open(srcdir, WR, ATOMIC); sid2 = open(dstdir, WR, TOTAL); depends(sid2, sid1, ATOMIC); update(sid1, u1=’del fname’); update(sid2, u2=’add fname’); close(sid1); close(sid2); } Figure 3.3. Atomic sessions example: moving file fname from one directory to another. Updates u1,u2 are applied atomically, everywhere. 
52 Client 1 Client 2 sid1 = open(A, CAUSAL); update(sid1, u1); close(sid1); u1 Client 3 sid3 = open(B, CAUSAL); depends(sid3, A, CAUSAL); update(sid3, u3); close(sid3); sid2 = open(A, TOTAL | CAUSAL); update(sid2, u2); close(sid2); u1 Client 4 sid4 = open(A, TOTAL | CAUSAL); update(sid4, u4); close(sid4); Figure 3.4. Causal sessions example. Updates u1 and u2 happen concurrently and are independent. Clients 3 and 4 receive u1 before their sessions start. Hence u3 and u4 causally depend on u1, but are independent of each other. Though u2 and u4 are independent, they are tagged as totally ordered, and hence must be applied in the same order everywhere. Data item B’s causal dependency at client 3 on prior update to A has to be explicitly specified, while that of A at client 4 is implicitly inferred. one directory to another. Updates u1 and u2 are applied atomically because they are issued by sessions grouped as atomic. Figure 3.4 illustrates how causal dependencies can be expressed, using a scenario where clients at four replicas operate via sessions on two objects A and B. Client 1 performs update u1 and pushes it to clients 3 and 4. Subsequently, clients 3 and 4 issue an update each. Independently, client 2 performs update u2. In this case, u1 causally precedes u3 and u4 but not u2. Although updates u2 and u4 are independently issued at clients 2 and 4, since they are performed within totally ordered sessions, they will be applied in the same order everywhere, while preserving other dependencies. However, no ordering or dependency exists between updates u3 and u4. Hence they can be applied in different order at different replicas. Allowing an application to specify ordering categories for each of its sessions enables it to simultaneously enforce distinct ordering requirements for various types of update operations. Consider the case of a replicated employee directory service. If two administrators try to simultaneously assign the same username to different employees by 53 adding them to different directory replicas, only one of them should prevail. This can be ensured by performing the name insertions within totally ordered sessions. Changing an employee’s name from ‘A’ to ‘B’ requires that ‘B’ is added and ‘A’ removed from the directory atomically. The application employing our framework can achieve this by inserting ‘B’ and removing ‘A’ in a session marked as atomic. Subsequently, suppose a new employee is assigned the username ‘A’. For the new name insertion to succeed everywhere, it must be causally preceded by the previous removal of ‘A’. Such a dependency can be expressed in our framework by tagging the session that does the removal as causal. The new insertion thus forms a causal dependency with the previous removal and gets propagated in the correct order. Finally, updates to an employee’s membership information in different business divisions require no ordering since they are commutative operations. 3.3.4 Failure Handling When a consistency or concurrency control guarantee cannot be met due to node failures or network partitions, applications may be optimistic, i.e., continue with the available data in their partition and risk inconsistency, or be pessimistic i.e., treat the data access as having failed and deal with it at the application-level. During failure-free performance, optimistic and pessimistic failure-handling behave identically. 
Our framework provides these two options for handling failures, but offers more flexibility to applications by allowing different tradeoffs to be made for individual sessions between consistency and performance based on network quality. A network link below a quality threshold by some metric, e.g., a combination of RTT and bandwidth, is called a weak link, and is treated as a failed link by optimistic sessions. Thus, an optimistic session’s consistency guarantee holds only relative to replicas reachable by strong links (i.e., with quality above a tolerated threshold), whereas a pessimistic session’s guarantee holds relative to all replicas. By restricting consistency maintenance to an accessible subset of replicas, the optimistic option trades off consistency for higher availability. For example, an enterprise directory could enforce strong consistency only among replicas within a well-connected region such as a campus. Clients can still obtain a globally 54 consistent view of data at the expense of incurring WAN synchronization and reduced availability in case of failures. The file service (except when locking semantics are required) and the music index must be optimistic for increased availability. The online auction service must be pessimistic for correctness. 3.3.5 Visibility and Isolation Our survey revealed that applications have three distinct notions of what constitutes a stable update that preserves the validity of application state: session, per-update, and manual updates. For update propagation to preserve the validity of application state at various replicas, it must take update stability into account. Our framework enables application-dependent notions of update stability to be expressed via two categories of consistency options: update visibility and view isolation. Visibility refers to the time at which a session’s updates are ready to be made visible (i.e., propagated) to remote replicas. The time when they will be actually propagated is determined by the timeliness requirements of remote replicas. Isolation refers to the time during a session when its replica can incorporate incoming updates from remote replicas. A replica must delay applying incoming remote updates until it can meet the isolation requirements of all local sessions. These options only apply with respect to remote sessions. Concurrent local sessions at a replica still see each other’s updates immediately. Our framework provides three useful flavors of visibility and isolation: Session: Session visibility specifies that updates are not to be made visible until the session that issued them ends. They are not interleaved with other updates during propagation, which prevents remote sessions from seeing the intermediate writes of a session. Session isolation specifies that a replica checks for remote updates only before opening a session. Once a session is open, the system delays applying incoming updates until the session ends. Session isolation ensures that a replica’s contents remain unchanged for the duration of the session, which is important for document sharing and atomic updates to shared data structures. 55 Per-update: Per-update visibility means that a session’s updates are immediately available to be propagated to remote replicas. Updates belonging to multiple sessions can be interleaved during propagation. Per-update isolation means that a replica applies incoming updates as soon as they arrive, after serializing them with other ongoing local update operations. 
This setting enables fine-grain replica synchronization in the presence of long-lived sessions, such as for real-time interactive applications like chat, multiplayer games, and event logging. Manual Manual visibility means that a session’s updates are made visible to remote sessions only when explicitly requested, or when the session ends. Manual isolation means that remote updates are only incorporated when the local client explicitly requests that they be applied. These settings can be used in conjunction with manual update propagation to completely hand-tune replica synchronization. Figure 3.5 shows how a WR session’s visibility setting affects the version of data supplied to remote replicas. Replica 1 receives a pull request from replica 2 (in response to a RD request) after an ongoing WR session makes three updates. If the WR session is employing per-update visibility, all the three updates are made visible as soon as they are applied locally. Hence the replica 1 supplies version v3 to replica 2. Instead, if the WR session employs manual visibility and issues a manual push after the first two updates, the version supplied is v2 even though the latest local version is v3 at that time. Finally, if the WR session employs session visibility, version v0 is supplied to replica 2. Figure 3.6 illustrates how an ongoing RD session’s isolation setting determines when incoming updates are applied locally. With per-update isolation, they are applied immediately. With manual isolation, both the pending updates are applied when the session issues a manual pull request. With session isolation, they are deferred until the RD session ends. Mobile file access and online shopping requires session visibility and isolation to ensure data integrity. The shared music index requires per-update visibility and isolation since each index insertion/deletion is independent and needs to be propagated as such. 56 Replica 1 v0 local updates v1 manual push v2 v3 WR Time per−update visibility manual visibility session visibility Replica 2 pull RD Time Figure 3.5. Update Visibility Options. Based on the visibility setting of the ongoing WR session at replica 1, it responds to the pull request from replica 2 by supplying the version (v0, v2 or v3) indicated by the curved arrows. When the WR session employs manual visibility, local version v3 is not yet made visible to be supplied to replica 2. Replica 1 u1 u2 WR Time per−update push session manual per−update Replica 2 RD Time manual pull Figure 3.6. Isolation Options. When replica 2 receives updates u1 and u2 from replica 1, the RD session’s isolation setting determines whether the updates are applied (i) immediately (solid curve), (ii) only when the session issues the next manual pull request (dashed curve), or (iii) only after the session ends (dotted curve). The chat service also requires per-update visibility and isolation to ensure real-time propagation of user messages to others during a long-term chat session. Other combinations of visibility and isolation are also useful. For instance, employing session visibility for update transactions and per-update isolation for long-running (data mining) queries to an inventory database allows users to track the latest trends in live sales data without risking inconsistent results. 3.4 Example Usage The expressive power of configurable consistency comes from the orthogonality of the options described in the previous sections. By orthogonality, we mean the ability 57 to enforce each of the options independently. 
Each individual option is required by at least one of the applications surveyed, but applications widely differ in the particular combination of options they require. Hence, the ability to independently select the various options from different dimensions that our framework provides gives applications greater flexibility than other consistency interfaces. To support our claim, we discuss how the consistency needs of the four focus applications we mentioned in Section 2.2 can be described in our framework. Table 3.2 lists the combination of options that express various application needs. As mentioned in Section 2.3.1, the consistency needs of the file service vary due to the different ways in which files are used by applications. Eventual consistency is adequate for personal files that are rarely write-shared, and media and system files that are rarely written [62]; close-to-open consistency is required for users to safely update shared documents [28]; globally serializable writes [48] are required for reliable file locking across replicas. Log file sharing requires the ability to propagate individual appends as semantic updates, provided by our per-update visibility and isolation settings. The shared music index requires the ability to manually control when to synchronize with remote replicas based on the operation being performed on an index replica, as explained in Section 2.5.4. This can be expressed using the manual timeliness requirement and enforced via explicit pull and push operations. The chat service also requires eventual consistency and per-update visibility, but in addition, it requires a user’s chat message written to the shared transcript file to be propagated at once to other replicas, which requires setting the time bound to zero. At the same time, imposing a hard timeliness bound forces chat clients to synchronously wait for messages to propagate to other clients, which is unnecessary. Hence, the chat service requires a soft time bound of zero. Finally, the online auction service requires strong consistency for serializing bids and finalizing purchases, which can be expressed by our framework’s WRLK mode. The WRLK mode automatically implies a hard most-current timeliness guarantee on data, which is shown in the table by parenthesizing those options. However, the number of users browsing the auction items is typically much larger than those making purchase 58 bids. Hence they must not block bids. This requirement can be expressed by specifying RD mode for casual browsing operations. 3.5 Relationship to Other Consistency Models Due to its flexibility, the configurable consistency framework can be used to express the semantics of a number of existing models. In this section, we explore the consistency semantic space covered by our framework in relation to other consistency solutions. First, we explore several consistency models designed for shared memory in the context of multiprocessors, distributed shared memory systems, and object systems, because many other models can be described as special cases of these models. Next, we discuss a variety of consistency models defined for session-oriented data access in the context of file systems and databases, including transactions. Finally, we discuss several flexible consistency schemes that offer tradeoffs between consistency and performance, in relation to our framework. 
Tables 3.3 and 3.4 list some well-known categories of consistency semantics supported by these schemes (in the first column) and the sets of CC options that express them. 3.5.1 Memory Consistency Models Traditionally, consistency models are discussed in the context of read and write operations on shared data. A variety of consistency models have been developed to express the coherence requirements of shared memory in multiprocessors, DSM systems and shared object systems [70]. Though our framework targets wide area applications, we discuss shared memory consistency semantics here because they form the conceptual basis for reasoning about consistency in many applications. Shared memory systems assume a data access model consisting of individual reads and writes to underlying memory, whereas shared object systems typically enforce consistency at the granularity of object method invocations. Pure memory consistency models by themselves do not enforce concurrency control, but define special operations by which applications can 59 Table 3.3. Expressing various consistency semantics using Configurable Consistency options. Blank cells indicate wildcards (i.e., all options are possible). Consistency model Shared mem[70]: Linearizability Causal consistency FIFO/PRAM consistency Weak consistency Concur. control Read Write Replica Synch. Str- Timeli- Order ength ness RDLK WRLK Update Visibility View Isolation Failure handling eager per-update pess. RD WR soft latest causal eager per-update pess. RD WR soft latest none eager per-update pess. RD WR hard manual none manual manual pess. manual manual pess. manual manual pess. push, pull manual none push manual none pull Eager release consistency Lazy release consistency RD WR hard RD WR hard Close-to-open [28] Close-to-rd[62] Wr-to-rd[74] Wr-to-open[64] wr-follows-rd, monotonic wr [71] rd-your-wr, monotonic rd RD WR hard latest total session session RD RD RD RD WR WR WR WR soft soft hard latest latest time none none total causal session eager eager session per-update per-update session session TXN isolation [40]: Degree 0 Degree 1: rd uncommitted Degree 2: rd committed Degree 3: serializable Snapshot Isolation[8, 52] Read consistency [52] pull on switch RD short WRLK RD long WRLK short long RDLK WRLK long long RDLK WRLK RD WR atomic session session RD atomic session per-update WR atomic atomic atomic atomic 60 Table 3.4. Expressing other flexible models using Configurable Consistency options. Blank cells indicate wildcards (i.e., all options are possible). Consistency model Concur. control Read Write TACT[77]: Numerical error Staleness Order error Fluid Replication [16]: last-writer optimistic pessimistic Lazy repl. [40]: Causal ops Forced ops (total order) Immediate ops (serialized) N-ignorant TXN[39] Timed/delta consistency[72] Cluster consistency[55]: Weak ops Strict ops RD WRLK, WR WR RD WR Replica Synch. Stre- Timeli- Order ngth ness hard mod hard time Update Visibility View Isolation Failure handling all total, time total, app RDLK WRLK causal total WRLK mod N time opt. pess. 61 notify the memory system of synchronization events. We describe how the configurable consistency framework allows expressing most of these models and can be used to implement those special operations. Sequential consistency requires that all replicas see the result of data accesses in the same order, and that order is itself equivalent to some serial execution on a single machine. 
Linearizability [4] is a stronger semantics than sequential consistency that can be expressed in our framework via exclusive modes (RDLK, WRLK) for reads and writes. Linearizability requires a serial order that preserves the time order of arrival of operations at their originating replicas. This is the strongest consistency guarantee that our framework can provide, and hereafter, we refer to it as strong consistency. The rest of the consistency semantics that we list below can be expressed in our framework by employing concurrent mode (RD, WR) sessions with per-update visibility and isolation for accesses. Causal consistency [29] permits concurrent writes to be seen in a different order at different replicas, but requires causally related ones to be seen in the same order everywhere. It can be achieved with our framework by tagging all writes as causally ordered. FIFO/PRAM consistency [43] only requires that all writes issued at a replica be seen in the same order everywhere; it requires no ordering among concurrent writes at different replicas. Our framework’s “unordered” updates option implicitly preserves per-replica update ordering and hence guarantees FIFO consistency. All of the above consistency models enforce consistency on individual reads and writes. The configurable consistency framework’s per-update visibility and isolation options provide similar semantics. Another set of consistency models enforce consistency on groups of reads and writes delimited by special ‘synchronization’ operations. Weak consistency [19] enforces sequential consistency on groups of operations delimited by a special ‘sync’ memory operation. Thus, individual reads and writes need not incur synchronization. A ‘sync’ operation causes all pending updates to be propagated to all replicas before it completes, and can be emulated in our framework by issuing a manual push operation followed by a manual pull operation. (Eager) Release consistency [26] identifies two kinds of special operations, namely acquire and release. The system must ensure that before an 62 acquire completes, all remote writes to locally cached data items have been pulled, and before a release completes, all local writes have been pushed to any remote replicas. Typically, the bulk of the update propagation work is done by the release. However, in lazy release consistency [36], nothing is pushed on release, but the next acquire has to pull pending writes to locally accessed data from all remote replicas before proceeding. Both variants can be achieved with the CC framework. The acquire operation must issue a manual pull request for updates to all data from remote replicas. The release must issue a synchronous push of local updates to all remote replicas to ensure the eager variant of release consistency, and must be a no-op in the case of the lazy variant. Entry consistency [9] mandates association of synchronization variables with each shared data item instead of on the entire shared memory. An acquire on a synchronization variable only brings the associated data items up-to-date. Distributed object systems such as Orca [6] and CRL[30] employ entry consistency to automatically synchronize replicas of an object before and after invocation of each of its methods. Entry consistency can be readily achieved with our framework using a similar implementation of acquire and release operations as for release consistency, but only to push and pull the affected data items. 
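The emulation just described can be summarized in a few lines of C. The cc_pull and cc_push_sync stubs below stand in for the framework's manual pull and synchronous push operations on a named data item; with them, acquire, eager release, and lazy release become thin wrappers. This is a sketch of the mapping, not an implementation of release or entry consistency itself.

    #include <stdio.h>

    /* Stubs for the framework's manual synchronization interface on a named
     * data item; a real implementation would contact the caching layer.    */
    static void cc_pull(const char *item)      { printf("pull remote updates for %s\n", item); }
    static void cc_push_sync(const char *item) { printf("push local updates for %s and wait\n", item); }

    /* Acquire pulls pending remote writes for the protected item. */
    static void acquire(const char *item) { cc_pull(item); }

    /* Eager release pushes local writes synchronously to all replicas. */
    static void release_eager(const char *item) { cc_push_sync(item); }

    /* Lazy release does nothing; the next acquire pulls instead. */
    static void release_lazy(const char *item) { (void)item; }

    int main(void) {
        acquire("shared-object");
        /* ... read and write the protected item here ... */
        release_eager("shared-object");

        acquire("shared-object");
        /* ... */
        release_lazy("shared-object");
        return 0;
    }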
3.5.2 Session-oriented Consistency Models A number of session-oriented models have been developed in the context of file systems [28, 38, 62] and databases [18, 52]. These models define consistency across multiple read and write operations, grouped into sessions. Databases and file locking often require sessions to be serializable [8]. Serializability is the database analogue of sequential consistency in memory systems, and guarantees that any concurrent execution of access sessions produces results equivalent to their serial execution on a single machine in some order. It is achieved in our framework by employing exclusive access mode (RDLK and WRLK) sessions that provide strong consistency or linearizability. A variety of weaker consistency flavors have been developed that sacrifice serializability to improve parallelism and availability [63]. They can be classified based on 63 the hardness of their bound on replica divergence. Hard-bounded (also called bounded inconsistency) flavors ensure an upper limit on replica divergence within a partition, regardless of the network delays between replicas [63]. Soft-bounded (also called eventual consistency) flavors only guarantee eventual replica convergence qualified by a timeliness bound, subject to network delays [18]. We can classify weaker consistency flavors further based on the degree to which they provide isolation among concurrent sessions, as close-to-open, close-to-rd, wr-to-rd and wr-to-open. They can be expressed in our framework with various combinations of visibility and isolation options. Close-to-open semantics, provided by AFS [28], guarantees that the data viewed by a session include the latest writes by remote sessions that were closed before opening that session, and not other writes by ongoing sessions. If an application’s updates preserve data integrity only at session boundaries, this semantics ensures the latest valid (i.e., internally “consistent”) snapshot of data that does not change during a session, regardless of ongoing update activity elsewhere. It can be expressed in our framework with session visibility and isolation (i.e., exporting and importing updates only at session boundaries) and a hard most-current timeliness guarantee. In contrast, “close-to-rd” semantics guarantees to incorporate remote session updates before every read, not just at open() time. It can be obtained by employing per-update isolation. The Pangaea file system [62] provides the soft-bounded (i.e., eventual) flavor of close-to-rd semantics by propagating file updates eagerly as soon as a file’s update session ends. “Wr-to-rd” semantics ensures tighter convergence by guaranteeing that a read sees the latest writes elsewhere, including those of ongoing sessions. Thus it might cause a session’s reads to see the intermediate writes of remote sessions. WebFS [74] provides append-only consistency, which is a special case of wr-to-rd semantics for sharing append-only logs. Several systems (such as Bayou [18]) that synchronize replicas via operational updates provide wr-to-rd semantics. Finally, “wr-to-open” semantics ensures that a replica incorporates the latest remote updates (including those of ongoing sessions) at the time of opening a session. For instance, this is required for the Unix command ‘tail -f logfile’ to correctly track remote 64 appends to a shared log file. 
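One way to read the preceding discussion is as a small table of option vectors, one per file-sharing flavor. The C fragment below records that reading using only the strength, visibility, and isolation dimensions; the enum names are ours, the 'hard' strength reflects the base (bounded) flavor of each semantics, and the soft-bounded variants discussed above would simply relax that field.

    #include <stdio.h>

    /* The option dimensions that distinguish these flavors (our shorthand). */
    enum strength   { HARD, SOFT };
    enum visibility { VIS_SESSION, VIS_PER_UPDATE };
    enum isolation  { ISO_SESSION, ISO_PER_UPDATE };

    struct flavor {
        const char     *name;
        enum strength   strength;    /* timeliness: latest contents         */
        enum visibility visibility;  /* when local writes become propagable */
        enum isolation  isolation;   /* when remote writes are applied      */
    };

    /* One reading of the mapping described in the text above. */
    static const struct flavor flavors[] = {
        { "close-to-open", HARD, VIS_SESSION,    ISO_SESSION    },
        { "close-to-rd",   HARD, VIS_SESSION,    ISO_PER_UPDATE },
        { "wr-to-rd",      HARD, VIS_PER_UPDATE, ISO_PER_UPDATE },
        { "wr-to-open",    HARD, VIS_PER_UPDATE, ISO_SESSION    },
    };

    int main(void) {
        for (size_t i = 0; i < sizeof flavors / sizeof flavors[0]; i++)
            printf("%-14s strength=%s visibility=%s isolation=%s\n",
                   flavors[i].name,
                   flavors[i].strength   == HARD        ? "hard"    : "soft",
                   flavors[i].visibility == VIS_SESSION ? "session" : "per-update",
                   flavors[i].isolation  == ISO_SESSION ? "session" : "per-update");
        return 0;
    }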
The semantics of NFS version 3 [64] file system can be loosely categorized as a time-bounded flavor of wr-to-open semantics, because an NFS client can delay pushing local writes to the server for up to 30 seconds, but checks the server for file updates on every open(). Wr-to-open semantics can be achieved with our framework’s eager write visibility and session isolation options. 3.5.3 Session Guarantees for Mobile Data Access Some applications must present a view of data to users that is consistent with their own previous actions, even if they read and write from multiple replica sites. For example, if a user updates her password at one password database replica and logs in at another replica, she might be denied access if the password update does not get propagated in time. Four types of causal consistency guarantees have been proposed by the designers of Bayou [71] to avoid such problems and to provide a consistent view of data to a mobile user: read-your-writes: reads reflect previous writes by the same client, monotonic reads: successive reads by a client return the result of same or newer updates, writes-follow-reads: writes are propagated after the reads on which they depend, and monotonic writes: writes are propagated after the writes that logically precede them. Among them, writes-follow-reads and monotonic writes are actually forms of causality relations among updates. As mentioned in Section 3.3.2, our framework provides causal update ordering as an explicit option. Read-your-writes and monotonic read semantics can be achieved with our framework as follows. Causal update ordering is enforced as before. In addition, whenever a mobile client application that requires these semantics switches to accessing a new replica, it performs a manual “pull updates” operation that works as described in Section 3.3.2. The pull must necessarily bring the new replica in sync and at least as up-to-date as the previous replica. If the pull operation fails, then the mobile client knows that the monotonic read or read-your-writes semantics cannot be 65 guaranteed (e.g., because the previous replica cannot be reached). If so, the mobile client application can either fail the data access request, or retry later. 3.5.4 Transactions The transaction model of computing is a well known and convenient paradigm for programming concurrent systems. A transaction is a series of data accesses and logical operations that represents an indivisible piece of work. Traditionally, a transaction guarantees the ACID properties: Atomicity: “all or nothing” property for updates, Consistency: the guarantee that application data are always transformed from one valid state to another, Isolation: noninterference of concurrent transactions, e.g., by ensuring their serializability, and Durability: committed updates are never lost, and their effect persists beyond the transaction’s lifetime. The CC framework provides update atomicity as an explicit option. To ensure the durability of a transaction’s updates, their final position in the global commit order must be fixed to be the same at all replicas. This requires updates to be totally ordered. Durability also implies that once a transaction makes its result visible to subsequent transactions, they become causally dependent on it for commitment. The total order imposed on updates must preserve such causal dependencies as well. 
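As a sketch of how such a transaction could be assembled from the framework's primitives, the following C fragment expresses a two-account money transfer in the style of Figure 3.3: the debit and credit sessions are grouped atomically via depends(), and both are tagged for total and causal ordering so that every replica commits them in the same, dependency-preserving position. The cc_-prefixed stubs and flag spellings are our illustration, not Swarm's API, and the sketch omits failure handling.

    #include <stdio.h>

    /* Ordering/dependency tags from Section 3.3.3 (C spellings are ours). */
    #define CC_WR      0x01
    #define CC_ATOMIC  0x02
    #define CC_TOTAL   0x04
    #define CC_CAUSAL  0x08

    typedef int cc_session_t;

    /* Stubs standing in for the framework's session interface. */
    static int next_sid = 1;
    static cc_session_t cc_open(const char *item, int flags) {
        printf("open %s with flags 0x%x\n", item, flags);
        return next_sid++;
    }
    static void cc_depends(cc_session_t s, cc_session_t on, int kind) {
        printf("session %d depends on session %d (kind 0x%x)\n", s, on, kind);
    }
    static void cc_update(cc_session_t s, const char *op) {
        printf("session %d: %s\n", s, op);
    }
    static void cc_close(cc_session_t s) { printf("close session %d\n", s); }

    /* A money transfer: the debit and credit are grouped atomically, and both
     * are tagged TOTAL|CAUSAL so every replica commits them in the same,
     * dependency-preserving order.                                          */
    static void transfer(const char *src, const char *dst, int amount) {
        char op[64];
        cc_session_t s1 = cc_open(src, CC_WR | CC_ATOMIC | CC_TOTAL | CC_CAUSAL);
        cc_session_t s2 = cc_open(dst, CC_WR | CC_TOTAL | CC_CAUSAL);
        cc_depends(s2, s1, CC_ATOMIC);

        snprintf(op, sizeof op, "debit %d", amount);
        cc_update(s1, op);
        snprintf(op, sizeof op, "credit %d", amount);
        cc_update(s2, op);

        cc_close(s1);
        cc_close(s2);
    }

    int main(void) {
        transfer("/bank/acct-A", "/bank/acct-B", 100);
        return 0;
    }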
The CC framework’s exclusive access modes (RDLK, WRLK) pessimistically serialize accesses to eagerly commit updates and ensure durability, but incur global synchronization. Its concurrent access modes (RD, WR) optimistically allow parallel transactions, in which case, durability can be achieved by enforcing both total ordering and causality. To maintain data consistency, the transaction model requires that transactions are executed as if they are isolated from each other. In our data access model, each client’s local cache serves as a private workspace for transaction processing. When a transaction 66 T is invoked at a machine, T’s entire execution is performed on that machine, and remote data is accessed through the machine’s local cache. When the CC framework’s session visibility is used, no partial result of the execution is visible at other replica sites. Moreover, a site’s updates are propagated in the same order everywhere. These features together guarantee that transactions at different sites are isolated from each other. However, concurrent transactions running at the same site have to be isolated via locking schemes provided by the local operating system. Enforcing serializability of transactions requires frequent synchronization, which reduces performance in a wide area replicated environment. To improve concurrency among transactions in the traditional database context, the ANSI SQL standard[3] has defined three isolation levels for queries in a locking-based implementation. The three isolation levels provide increased concurrency by employing either no read locks, short duration read locks (held for the duration of a single data access), or long duration read locks (held for the duration of a transaction). These isolation levels for transactions can be achieved with our framework by employing the RD mode when no locking is desired for reads, and RDLK mode for read locks of various durations. With snapshot isolation [7], provided by Microsoft Exchange and the Oracle database system [52], a transaction T always reads data from a snapshot of committed data valid as of the (logical) time when T started. Updates of other transactions active after T started are not visible to T. T is allowed to commit if no other concurrent committed transaction has already written data that T intends to write; this is called the first-committer-wins rule to prevent lost updates. Oracle’s Read Consistency [52] ensures that each action (e.g., SQL statement) in a transaction T observes a committed snapshot of the database state as it existed before the action started (in logical time). A later action of T observes a snapshot that is at least as recent as that observed by an earlier action of T. Read consistency can provide a different snapshot to every action, whereas snapshot isolation provides the same snapshot to all actions in a transaction. Snapshot isolation and read consistency are the transactional analogues of close-to-open and close-to-rd semantics defined earlier, and hence can be expressed with our framework. By relaxing timeliness bounds beyond ‘most current’, our framework can enhance concurrency further, at the 67 risk of increased aborts of update transactions. To reduce aborts, the relaxed semantics can be employed only for query transactions. There are numerous other relaxed consistency models designed for transactional concurrency control in replicated databases. They are explored in depth in Adya’s PhD thesis [1]. 
Examining the extent to which our framework can express them requires further study, and is beyond the scope of our thesis. 3.5.5 Flexible Consistency Schemes In this section, we discuss several important consistency schemes that provide tradeoffs between consistency and performance, in relation to our framework. The continuous consistency model of the TACT toolkit [77] was specifically designed to provide fine-grain control over replica divergence. Instead of supporting a sessionoriented access model, TACT mediates all application-level accesses to a replicated data store and classifies them as reads and writes. Their reads and writes are generic and encapsulate the application logic comprising an entire query or update transaction respectively. In that sense, propagating a write means reexecuting the update transaction on another database replica. TACT’s consistency model allows applications to dynamically define logical views of data that they care about, called conits, and control the divergence of those views rather than the replica contents themselves. Example conits include the number of available seats in a flight’s reservation database, each discussion thread in a bulletin board, and the number of entries in a distributed queue. Conits help exploit application-level parallelism and avoid false-sharing. But maintaining dynamically defined conits among wide area replicas incurs significant bookkeeping, and it is not clear whether it is justified by the increased parallelism at scale. In contrast, the CC framework allows control over divergence of replica contents. TACT provides three metrics to continuously control divergence, namely, numerical error, order error and staleness. Numerical error limits the total weight of unseen writes affecting a conit that a replica can tolerate, and is analogous to our framework’s modification bound. Staleness places a real-time bound on the delay of write propagation 68 among replicas, and is equivalent to our time bound. Order error limits the number of outstanding tentative writes (i.e., subject to reordering) that affect a conit, at a replica. For reasons mentioned in Section 2.7, the CC framework supports the two extreme values for order error, namely, zero (with its exclusive access modes) and infinity (with its concurrent access modes). The Lazy replication of Ladin et al. [40] supports three types of ordering of various operations on data. Causal operations are causally ordered with respect to other causal operations, forced operations are performed at all replicas in the same order relative to one another, and immediate operations are serializable, i.e., performed at all replicas in the same order relative to all operations. The ordering type for each operation on a replica must be specified by the application-writer at design time. For example, the designer of a replicated mail service would indicate that send mail and read mail are causal, add user and del user are forced, and del at once is immediate. Causal and forced operation ordering can be achieved in our framework by performing those operations in concurrent mode sessions and tagging updates involving those operations as causally or totally ordered, respectively. The effect of immediate operations can be achieved by performing them inside a WRLK session, which ensures serializability with respect to all other operations. In an N-ignorant system [39] developed in the context of databases, an N-ignorant transaction may be ignorant of the results of at most N other transactions. 
To emulate this behavior in our framework, such a transaction is executed as an update operation within a session that sets a modification (mod) bound of N on its local replica. Since the execution of a transaction at a replica increments its divergence by 1, the modification bound ensures that the N-ignorant transaction misses the updates of at most N remote transactions. Timed/delta consistency models (e.g., [72]) require the effect of a write to be observed everywhere within specified time delay. This can be readily expressed in our framework using a time bound. In the cluster consistency model [55] proposed for mobile environments, replicas are partitioned into clusters, where nodes are strongly connected within a cluster but 69 weakly/intermittently connected across clusters. Consistency constraints within a cluster must always be preserved, while inter-cluster consistency may be violated subject to bounds (called m-consistency). To implement this, two kinds of operations have been defined: weak operations check consistency only within their cluster, whereas strict operations ensure consistency across all clusters. The Coda file system’s consistency scheme makes a similar distinction between weak and strong connectivity between client and server, but provides no control over replica divergence in weak connectivity mode. A Coda client skips consistency checks with a server on reads if the quality of the link to its server is “poor” (i.e., below a quality threshold), and pushes writes to the server lazily (called trickle re-integration). The CC framework’s “optimistic” and “pessimistic” failure handling options for sessions provide similar semantics to that of the weak and strict operations of the cluster consistency model, where nodes reachable via network links of higher quality than a threshold are considered to belong to a cluster. 3.6 Discussion In the previous section, we examined the expressiveness and generality of the configurable consistency framework. We discussed how it can support a wide variety of existing consistency models. In this section, we discuss several issues that affect the framework’s adoption for distributed application design. 3.6.1 Ease of Use At first glance, providing a large number of options (as our framework does) rather than a small set of hardwired protocols might appear to impose an extra design burden on application programmers as follows. Programmers need to determine how the selection of a particular option along one dimension (e.g., optimistic failure handling), affects the semantics provided by the options chosen along other dimensions (e.g., exclusive write mode, i.e., WRLK). However, thanks to the orthogonality and composability of our framework’s options, their semantics are roughly additive; each option only restricts the applicability of the semantics of other options and does not alter them in unpredictable ways. For example, employing optimistic failure handling and WRLK mode together 70 for a data access session guarantees exclusive write access for that session only within the replica’s partition (i.e., among replicas well-connected to that replica). Thus by adopting our framework, programmers are not faced with a combinatorial increase in the complexity of semantics to understand. 
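To make this composition concrete, the following C-style sketch (our own illustration; the enum names, field names, and encoding of bounds are assumptions and not part of any actual CC or Swarm interface) shows how a middleware adopting the framework might represent a session's choices along the five dimensions as a single option vector, and how one popular combination could be bundled as a named preset:

/* Hypothetical rendering of a configurable consistency option vector.
 * All names and the encoding of the bounds are our assumptions. */
enum cc_mode     { CC_RD, CC_WR, CC_RDLK, CC_WRLK };            /* concurrency control   */
enum cc_failure  { CC_OPTIMISTIC, CC_PESSIMISTIC };             /* failure handling      */
enum cc_boundary { CC_PER_ACCESS, CC_PER_SESSION, CC_MANUAL };  /* visibility/isolation  */

struct cc_options {
    enum cc_mode     mode;          /* concurrency control                        */
    unsigned         time_bound_ms; /* replica synchronization: staleness bound   */
    unsigned         mod_bound;     /* replica synchronization: unseen writes     */
    enum cc_failure  on_failure;    /* behavior when some replicas are unreachable */
    enum cc_boundary visibility;    /* when local writes become visible elsewhere */
    enum cc_boundary isolation;     /* when remote writes become visible locally  */
};

/* One plausible bundled default: a close-to-open-like policy in which a
 * session sees the most current data when it opens and publishes its
 * writes when it closes (0 here stands for "most current"). */
static const struct cc_options CC_CLOSE_TO_OPEN_LIKE = {
    .mode = CC_WR, .time_bound_ms = 0, .mod_bound = 0,
    .on_failure = CC_PESSIMISTIC,
    .visibility = CC_PER_SESSION, .isolation = CC_PER_SESSION,
};

Because the options are roughly additive, refining one field (say, relaxing time_bound_ms) leaves the meaning of the remaining fields unchanged.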
To ease the adoption of our framework for application design, we anticipate that middleware systems that adopt the configurable consistency framework will bundle popular combinations of options as defaults (e.g., ‘Unix file semantics’, ‘CODA semantics’, or ‘best effort streaming’) for object access, while allowing individual application components to refine their consistency semantics when required. In those circumstances, programmers can customize individual options along some dimensions while retaining the other options from the default set. 3.6.2 Orthogonality As Tables 3.3 and 3.4 show, the expressive power of the configurable consistency framework is due to its orthogonal decomposition of consistency mechanisms. Although the consistency options provided by our framework along various dimensions are largely orthogonal, there are a few exceptions. Certain consistency options implicitly imply other options. For example, exclusive access modes imply session-level isolation and a most-current timeliness guarantee. Also, certain combinations of options do not make sense in practical application design. For example, exclusive read mode and concurrent write mode are unlikely to be useful together in an application. 3.6.3 Handling Conflicting Consistency Semantics Although the composability of options allows a large number of possible combinations and applications can employ different combinations on the same data simultaneously on a per-session basis, the semantics provided by certain combinations are inherently conflicting, i.e., cannot be enforced simultaneously. Although the configurable consistency implementation presented in this thesis correctly serializes sessions that employ conflicting semantics, the CC interface enables some conflicting semantics to be expressed that might not make sense in practical application design, and can also lead 71 to hard-to-detect bugs in distributed applications. For instance, consider two sessions that operate on the same data. Session 1 is a write session that wants all readers to see its writes immediately. Session 2 is a read session that does not want to see any writes until it explicitly asks for them (e.g., via manual pull), or until the session ends. When session 1 issues a write while session 2 is in progress, the write must be synchronously pushed everywhere and must be blocked until all replicas apply it locally and acknowledge it. Therefore, session 1’s write operation has to block until session 2 accepts its write, which could mean an indefinite and unproductive wait. Certain other conflicting semantics are perfectly normal, such as a RDLK and a WRLK session. Ideally, a configurable consistency implementation should be able to detect and reject unproductive combinations of semantics. Further work is needed to enumerate such combinations of semantics possible with configurable consistency. 3.7 Limitations of the Framework Although we have designed our framework to express the consistency semantics needed by a wide variety of distributed services, we have chosen to leave out semantics that are difficult to support in a scalable manner across the wide area or that require consistency protocols with application-specific knowledge. As a consequence, our framework has two specific limitations that may restrict application-level parallelism in some scenarios. 
3.7.1 Conflict Matrices for Abstract Data Types Our framework cannot support application-specific conflict matrices that specify the parallelism possible among application-level operations [5]. Instead, applications must map their operations into the read and write modes provided by our framework, which may restrict parallelism. For instance, a shared editing application cannot specify that structural changes to a document (e.g., adding or removing sections) can be safely allowed to proceed in parallel with changes to individual sections, although their results can be safely merged at the application level. Enforcing such application-level concurrency constraints requires the consistency protocol to track the application operations 72 in progress at each replica site, which is hard to track efficiently in an applicationindependent manner. 3.7.2 Application-defined Logical Views on Data Our framework allows replica divergence constraints to be expressed on data but not on application-defined views of data, unlike TACT’s conit concept [77]. For instance, a user sharing a replicated bulletin board might be more interested in tracking updates to specific threads of discussion than others, or messages posted by her friends than others, etc. To support such requirements, views should be defined on the bulletin board dynamically (e.g., all messages with subject ‘x’) and updates should be tracked by the effect they have on those views, instead of on the whole board. Our framework cannot express such views precisely. It is difficult to manage such dynamic views efficiently across a large number of replicas, because doing so requires frequent communication among replicas, which may offset the parallelism gained by replication. An alternative solution is to split the bulletin board into multiple consistency units (threads) and manage consistency at a finer grain when this is required. Thus, our framework provides alternate ways to express both of these requirements that we believe are likely to be more efficient due to their simplicity and reduced bookkeeping. 3.8 Summary In this chapter, we described a novel consistency framework called configurable consistency that lets a wide variety of applications compose an appropriate consistency solution for their data by allowing them to choose consistency options based on their sharing needs. To motivate our design of the framework, we stated its objectives and requirements. We presented the framework including its options along five dimensions. We illustrated how the diverse consistency needs of the four focus applications we described in the previous chapter can be expressed using the framework. Next, we discussed the generality of the framework by illustrating how it can support a wide variety of existing consistency models. Finally, we discussed a few situations in which our framework shows its limitations. 73 In subsequent chapters, we focus on the practicality of our framework. We start the next chapter by presenting the design of a wide area replication middleware that implements the CC framework. CHAPTER 4 IMPLEMENTING CONFIGURABLE CONSISTENCY To pave the way for building reusable replication middleware for distributed systems, we have developed a configurable consistency framework that supports the diverse consistency requirements of three broad classes of distributed applications. In previous chapters, we described our framework and discussed its expressive power in the context of diverse applications. 
However, to be deployed in real wide area applications, our framework must be practical to implement and use. In this chapter, we show how configurable consistency can be implemented in a peer replication environment spanning variable-quality networks. We do this by presenting the design of Swarm (an acronym for Scalable Wide Area Replication Middleware), an aggressively replicated wide area data store that implements configurable consistency.

4.1 Design Goals

As explained in Section 2.4, the three application categories that we surveyed vary widely with respect to their natural unit of sharing, their typical degree of replication, and their consistency and availability needs. For configurable consistency management to benefit a broad variety of distributed applications in these categories, its implementation must be application-independent, support aggressive data caching, operate well over diverse networks, and be resilient to node and network failures. Application-independence allows mechanisms to be reused across multiple applications. Aggressive caching enables applications to efficiently leverage available storage, processing, and network resources and to scale incrementally with offered load. To efficiently support aggressive caching over networks of diverse capabilities, our consistency implementation must employ a scalable mechanism to manage a large number of replicas, and adapt its communication based on network quality to conserve network bandwidth across slow links. In a wide area system with a large number of nodes, both transient and permanent failures of some of the nodes and/or network links are inevitable, so our system must detect failures in a timely manner and gracefully recover from them. Specifically, the design of our configurable consistency system must meet the following three goals:

Scalability: Our system must scale along several dimensions: the number of shared objects, their granularity of sharing (e.g., whole file or individual blocks), the number of replicas, and the number of participating nodes in the system and their connectivity. The number of shared objects that our target applications use varies from a few hundred (in the case of personal files) to hundreds of thousands (in the case of widespread music sharing). Their typical replication factors also range from a few dozen replicas in some cases to a few thousand in others. The number of nodes in a typical configuration also varies accordingly. Our consistency system must gracefully scale to these sizes.

Network Economy: Most of our targeted applications need to operate in a very wide area network environment consisting of clusters of LANs often spanning continents, over links whose bandwidth ranges from a few tens of Kbps to hundreds of Mbps. Many of these are likely to be wired links that are relatively reliable and stay mostly connected, while others are lossy wireless links. To provide a responsive service, our system must automatically adapt to this diversity by efficiently utilizing the available network resources.

Failure Resilience: Our target applications operate in a variety of environments that can be loosely categorized as server/enterprise, workstation, and mobile environments, with distinct failure characteristics. Server-class environments consist of powerful machines with reliable high-speed network connectivity. Workstations are static commodity PCs of moderate capacity and connectivity that can be turned off by their owners anytime.
Mobile devices such as laptops and PDAs are the least powerful and intermittently connected at different network locations. The typical 76 rate at which nodes join and leave the system, called the node churn rate [59] also varies widely based on the average time spent by a node in the system, called its mean session time. Session times range from days or months (in case of servers) causing low churn, to as low as a few minutes (in case of mobile devices) causing high churn. Our system must gracefully handle failures and node churn in all the three types of environments, while preserving scalability and network economy. 4.1.1 Scope The focus of our work is to show how flexible consistency management can be provided in a distributed system that aggressively replicates data over the wide area. However, there are several other important issues that overlap with the design of consistency management, and must be addressed by a full-fledged data management system to facilitate wide area data sharing in the real-world. Some of these issues are: security and authentication, long-term archival for durability of data, disaster recovery, management of storage and quotas at individual nodes. We do not address these issues in this thesis, but leave them for future work. In the rest of this chapter, we describe our design of a wide area file store called Swarm, our research vehicle for evaluating the feasibility of implementing configurable consistency in a reusable replication middleware. In Section 4.2, we outline Swarm’s basic organization and interface. In Section 4.3, we outline Swarm’s expected usage for building distributed applications. In Section 4.5, we describe the design of its naming and location tracking schemes. In Section 4.6 we describe how Swarm supports largescale network-aware replication. In Section 4.7, we describe our design of configurable consistency in Swarm which is the focus of this dissertation. We discuss the failure resilience characteristics of our design in Section 4.9. We provide details of our prototype Swarm implementation in Section 4.10. Finally, we discuss several issues that our current design does not address, and outline ways to address them. 77 4.2 Swarm Overview Swarm is a distributed file store organized as a collection of peer servers (called Swarm servers) that cooperate to provide coherent wide area file access at variable granularity. Swarm supports the data-shipping paradigm of computing described in Section 2.1 by letting applications store their shared state in Swarm files and operate on it via Swarm servers deployed nearby. Files (also called regions) in Swarm are persistent variable-length flat byte arrays named by globally unique 128-bit numbers called SWIDs. Swarm exports a file system-like session-oriented interface that supports traditional read/write operations on file blocks as well as operational updates on files (explained below). A file block (also called a page) is the smallest unit of data sharing and consistency in Swarm. Its size (which must be a power-of-2 bytes and is 4-Kilobytes by default) can be set for each file when creating it. This gives applications considerable flexibility in balancing consistency overhead and parallelism. A special metadata block holds the file’s attributes such as its size, consistency options and other applicationspecific information, and is similar to an inode in Unix file systems. 
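As a concrete illustration of the per-file state just described, the following C sketch shows roughly what a Swarm metadata block might carry; the field names, widths, and layout are our own guesses, not Swarm's actual format:

#include <stdint.h>

/* Hypothetical sketch of a Swarm file's metadata block (its "inode"). */
typedef struct { uint8_t bytes[16]; } swid_t;   /* 128-bit globally unique file ID */

struct swarm_file_meta {
    swid_t   swid;        /* the file's (region's) SWID                                */
    uint64_t size;        /* current length of the flat byte array                     */
    uint32_t page_size;   /* unit of sharing/consistency; power of two, 4 KB by default */
    uint32_t cc_defaults; /* encoded default consistency options for the file          */
    /* ... application-specific attributes follow ... */
};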
Swarm servers locate files by their SWIDs (as described in Section 4.5), cache them as a side-effect of local access (as described in Section 4.6), and maintain consistency of cached copies according to the per-file consistency attributes (as described in Section 4.7). Each Swarm server utilizes its configured persistent local store for permanent copies of some files and the rest of the available space to cache remotely stored files. Swarm servers discover each other as a side-effect of locating files by their SWIDs. Each Swarm server monitors its connection quality (latency, bandwidth, connectivity) to other Swarm servers with which it communicated in the recent past, and uses this information in forming an efficient hierarchical overlay network of replicas dynamically for each file (as described in 4.6). Swarm allows files to be updated in place in two ways: (i) by directly overwriting previous contents on a per file block basis (called absolute or physical updates), or (ii) by supplying a semantic update procedure that Swarm applies to all replicas via an application-supplied plugin at each Swarm server (called operational updates). Absolute updates are useful for traditional file sharing as well as for implementing persistent pagebased distributed data structures (as we did for the proxy-caching application described 78 in Section 5.3). Operational updates can greatly reduce the amount of data transferred to maintain consistency and increase parallelism when sharing structured objects such as databases and file system directories. Swarm maintains consistency at the smallest granularity at which applications access a file at various replicas. Thus, it can maintain consistency at the granularity of individual file blocks (when using physical updates) or the entire file (when using either physical or operational updates). Swarm clients can control consistency per-file (imposed by default on all replicas), per-replica (imposed by default on its local sessions), or per-session (affecting only one session). This gives each client the ability to tune consistency to its individual resource budget. In the rest of this section, we describe the interface Swarm exports to distributed applications to facilitate shared data programming for transparent caching. 4.2.1 Swarm Interface Swarm exports the file operations listed in Table 4.1 to its client applications via a Swarm client library that is linked to each application process. The Swarm client library locates a nearby Swarm server via an external lookup service similar to DNS, and communicates with it using IPC. sw alloc() creates a new file with consistency attributes specified by the ‘cc options’ and returns its unique Swarm ID (SWID). sw open() starts an access session at a Swarm server for a portion (perhaps all) of the specified file (called a segment), returning a session id (sid). The ‘cc options’ denotes a vector of desired configurable consistency options for this session along the dimensions listed in Table 1.1 overriding the file’s consistency attributes. Depending on the file’s consistency attributes and the ‘cc options’ that include the requested access mode, the Swarm server may need to obtain a copy of the data and/or bring the local copy up to date. A session is Swarm’s point of concurrency control, isolation and consistency management. sw read() and sw write() transfer file contents to/from a user buffer. sw close() terminates the session. 
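The following minimal sketch illustrates how a client might use these core calls. We assume underscored C binding names (sw_alloc, sw_open, and so on) in a header we call "swarm.h", and placeholder types swid_t, sid_t, and cc_options_t; error handling and the preparation of the consistency options are omitted:

#include <stddef.h>
#include "swarm.h"   /* assumed C bindings for the operations in Table 4.1 */

/* Create a file and write one record into it within a single session. */
void write_record(const char *rec, size_t len, cc_options_t cc)
{
    swid_t swid = sw_alloc(4096, cc);         /* new file with 4 KB pages          */
    sid_t  sid  = sw_open(swid, 0, len, cc);  /* open segment [0, len) with        */
                                              /* per-session consistency options   */
    sw_write(sid, 0, len, (void *)rec);       /* write through the local server    */
    sw_close(sid);                            /* end of session; visibility of the */
                                              /* write follows the chosen options  */
}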
sw getattr() and sw setattr() let the client inspect or modify a file’s attributes including its consistency settings. Finally, sw free() evicts the local Swarm replica if the ‘global’ flag is false. Otherwise it 79 Table 4.1. Swarm API. Operation Description SWID ← sw alloc(page size, cc options) create a new file sw free(SWID, global) destroy local replica/all replicas of file sw getattr/setattr(SWID, attr) set/get file attributes (size, consistency etc.) sid ← sw open(SWID, offset, size, cc options) open file/segment with specified consistency sw read/write(sid, offset, size, buf) read/write file segment sw update(sid, plugin id, update proc) apply update proc to file via plugin sw snoop(sid, update filter, callback func) notify remote updates via callback sw depends(sid, SWID, offset, size, dep types) establish interfile/segment dependencies sw close(sid) close file deletes file itself and all its replicas from Swarm, freeing up its resources (including SWID) for reuse. sw update() can be used to provide an operational update procedure (as an opaque update packet) to be invoked on data at the local Swarm server. Example operational updates include “add(name) to directory”, and “debit(1000) from account”. To use this facility, the application must implement a plugin library to be linked to the local Swarm server at each node running an application instance. When a Swarm server receives an update packet, it invokes the specified plugin to interpret and apply its procedure on the local copy and returns the result in response to the sw update() call. Swarm servers propagate the packet instead of the new file contents to synchronize replicas. We describe Swarm’s plugin interface in Section 4.2.2. sw snoop() can be used by clients to register to receive notification of updates made to a file replica via a registered callback function, i.e., to “snoop” on updates arriving for the local replica. It can be used by application components to monitor activity at other components without costly polling. For example, a chat client can snoop on writes to the transcript file by other clients and supply a callback that updates the user’s local screen when a remote write is received. sw depends() can be used to express 80 semantic dependencies (i.e., causality and atomicity) among updates to multiple Swarm objects. Finally, an application can access Swarm files either through the Swarm client interface described above, or natively via a local file system mount point. To facilitate the latter, the Swarm server also exports a convenient native file system interface to Swarm files via the local operating system’s VFS layer. It does so by providing a wrapper that talks to a file system module in the kernel such as CodaFS [38]. The wrapper provides a hierarchical file name space by implementing directories within Swarm files. This allows Swarm files to be accessed via the operating system’s native file interface. We chose to export a file as Swarm’s basic data abstraction as opposed to fixed-length blocks of storage, because of the needs of our target application classes. Choosing a file abstraction simplifies distributed storage management at the middleware level by avoiding the distributed garbage collection issues that arise in block-level storage allocation. The file abstraction also provides applications with an independently growable entity and gives applications the autonomy to manage storage allocation to suit their needs. 
The abstraction directly matches the needs of several target applications (such as file systems and databases). Other applications such as distributed data structures can submanage pages within a file as a persistent page heap by exploiting Swarm’s page-granularity sharing. 4.2.2 Application Plugin Interface A Swarm application plugin must implement the operations listed in Table 4.2. Swarm invokes the ‘apply’ operation to locally apply incoming operational updates. When Swarm receives a remote update but finds that the local copy got updated independently, it invokes the ‘merge’ operation instead to let the plugin resolve the conflict. We discuss conflict resolution in Section 4.8. A plugin can directly operate on the local copy of a Swarm file via the local file system bypassing Swarm. It can also employ local memory buffering techniques to speed up its operations and lazily reflect them to the underlying Swarm file. Swarm invokes the ‘sync’ operation to let the plugin flush those changes to the Swarm file before it transfers raw file contents to other replicas. The ‘cleanup’ 81 Table 4.2. Swarm interface to application plugins. Each plugin must implement these operations in a library linked into local Swarm server. Operation init() ret← apply(file, logentry, update) success←merge(file, neighbor, update) sync(file) cleanup(file) uninit() Description Initialize plugin state apply operational update locally, return result merge update into local copy (resolve conflicts) sync changes into local Swarm file copy (before full file transfer) cleanup file state before eviction Destroy plugin state operation allows the plugin to free up its state for a particular file before its eviction from local cache. 4.3 Using Swarm to Build Distributed Services We envision Swarm being employed to provide coherent wide area caching to distributed services (such as file and directory services) in two ways. A new distributed service can be implemented as a collection of peer-to-peer service agents that operate on Swarm-hosted service data. Alternatively, an existing centralized service (such as an enterprise service) can export its data into Swarm’s space at its home site, and deploy wide area service proxies that access the data via Swarm caches to improve responsiveness to wide area clients. We have built a wrapper library around the BerkeleyDB database library to enable applications to transparently replicate their BerkeleyDB database across the wide area. We describe it in Section 5.4. For example, consider an enterprise service such as payroll that serves its clients by accessing information in a centralized database. Figure 4.1 shows how such a service is typically organized. Clients access the service from geographically dispersed campuses by contacting the primary server in its home campus via RPCs. As the figure shows, the enterprise server is normally implemented on a cluster of machines and consists of two tiers; the “application server” (labelled ‘AS’ in the figure) implements payroll logic, 82 Home Campus ... Clients ... Clients Campus 2 Server cluster RPCs Internet AS DB FS ... Clients Campus 3 Figure 4.1. A centralized enterprise service. Clients in remote campuses access the service via RPCs. while the “DB” server (such as mySQL, BerkeleyDB, or Oracle) handles database access when requested by ‘AS’ using the storage provided by a local storage service (FS). Figure 4.2 shows how the same service could be organized to employ wide area proxies based on Swarm. 
Each enterprise server (home or proxy) is a clone of the original two-tier server with an additional component, namely, a Swarm server. Clients can now access any of the available server clusters, as they are functionally identical. At each enterprise server cluster, the ‘AS’ must be modified to issue its database queries and updates to the Swarm server by wrapping them within Swarm sessions with desired consistency semantics. In response, Swarm brings the local database to desired consistency by coordinating with Swarm servers at other proxy locations, and then invokes the operations on the local DB server. The DB server remains unchanged. Employing this 3-tier architecture offloads data replication and consistency management complexity from the enterprise service by enabling it to leverage Swarm’s support instead. Figure 4.3 shows the architecture of the server cluster and its control flow in more detail. The DB plugin encapsulates DB-specific logic needed by Swarm to manage replicated access. It must be implemented and linked to each Swarm server and has two functions: (i) it must apply the DB query and update operations that Swarm gets from the AS and remote Swarm servers, to the local database replica; (ii) it must detect and resolve conflicts among concurrent updates when requested by Swarm. 83 Campus 2 Home Campus ... Clients Clients Swarm Protocol Server cluster Internet AS Swarm DB ... RPCs Clients FS ... Proxy cluster AS Swarm DB FS Campus 3 Figure 4.2. An enterprise application employing a Swarm-based proxy server. Clients in campus 2 access the local proxy server, while those in campus 3 invoke either server. The steps involved in processing a client request in this organization are as follows: 1. A client issues a query or update request to the payroll application server (AS). 2. The AS identifies the database objects required to process the request, and opens Swarm sessions on those objects specifying the required consistency semantics. It does so via the Swarm client library using the API described in Section 4.2.1. In response, Swarm brings the local database objects to the specified consistency as follows: 3. Swarm contacts remote Swarm servers if necessary, and pulls any outstanding updates to bring local replicas up-to-date. 4a. Swarm applies those updates locally via the DB plugin which invokes them on the local DB server. 4b. (Optional) Sometimes, Swarm must load a fresh copy of a database object to the local store. This is required when instantiating a new database replica, or when the local replica is too far out of sync with other replicas. In either case, Swarm invokes the DB plugin to instantiate the local database replica in the local store. 84 ... Clients 1 5c App server 5b 5a 2 3 Swarm DB plugin 4a DB server 4b 6 Local FS Server cluster Figure 4.3. Control flow in an enterprise service replicated using Swarm. The AS can cache recent Swarm sessions. Hence, it can skip step 2 (i.e., opening Swarm sessions) if it already has the requisite Swarm sessions opened as a sideeffect of previous activity. 5a. The AS can directly invoke query operations on the local DB, bypassing Swarm. 5b. However, the AS must issue update operations on the local DB via Swarm, so Swarm can log them locally and propagate them later to other replicas. Swarm invokes the DB plugin to apply such operations, and returns their results to the AS. 6. In either case, the DB server performs the operation on the local database replica, and is oblivious of replication. 5c. 
(Optional) When opening a session, the AS can optionally ask Swarm to suggest a remote proxy server site to redirect its request for better overall locality and performance. The AS can then forward its client request to the remote AS at 85 the suggested site as an RPC, instead of executing it locally. Swarm dynamically tracks the available data locality at various sites to determine the appropriate choice between caching and RPCs. Wide-area replication can be provided to all or parts of a file system in a similar manner. First, the files and directories that will be accessed over the wide area are exported (i.e., mapped) to Swarm’s file name space. Then, clients access the mapped files by their Swarm path names via local Swarm servers. Each file is mapped to a different Swarm file and its consistency attributes can be controlled individually. 4.3.1 Designing Applications to Use Swarm Employing Swarm for wide area replication in the way described above requires the designers of a distributed service to map its shared data to Swarm’s data model, its consistency requirements to the CC framework and its access methods to Swarm’s access API. Swarm’s familiar data abstractions and intuitive API make this mapping relatively straightforward for applications that already employ the data-shipping paradigm in their design. However, the CC framework requires applications to express their consistency requirements in terms of several low-level consistency options. Though many of these options are intuitive, the right set of options may not be obvious in all application scenarios, and wrong choices lead to suboptimal performance. Hence determining the appropriate CC options for application data requires adhering to certain guidelines, which we describe in this section. First, we briefly outline a recommended methodology for designing distributed applications in the data-shipping paradigm for replication. It consists of three steps that are independent of configurable consistency. 1. Structure the application as a collection of active objects or application entities, each in charge of a well-defined portion of application state, with well-defined operations on that state. This can be done by following the well-known object design methodology adopted for a networked object model such as that of CORBA [50]. Express operations on object state as performed in a multithreaded singlemachine environment (such as employed by the Java programming model). Doing 86 so forces the objects to be designed for concurrency (e.g., by adopting a sessionoriented data access model). This is an important prerequisite step to designing for replication. 2. Identify objects whose state persists across multiple operation invocations, and consider the possible benefit of replicating each of those objects (with its state and operations) on multiple application sites. The likely candidates for replication are those objects for which a clear benefit can be found, such as improved loadbalancing, scalability, access latency or availability. 3. Express all access to a replicable object’s state in terms of access to a file, logical operations on a database, or as memory mapped access to a shared persistent heap. This forces application objects to be designed for persistence, which simplifies their deployment on multiple sites. Once the replicable objects in an application have been identified, the next step is to examine their consistency requirements and express those requirements using CC options. 
The following steps guide this design process: 1. For each of the candidate objects for replication, determine the high-level consistency semantics it needs by identifying unacceptable sharing behavior. What invariants must be maintained at all times on the object state viewed by its operations? What does invalid state mean? At what points in the object’s logic must validity be preserved? What interleavings among concurrent operations must be avoided? Are there any dependencies (e.g., atomicity) among multiple data access operations that must be preserved? This approach of finding the minimal acceptable consistency enables maximal parallelism, while ensuring correctness. 2. For each operation on each object, express the above invariants in terms of the five aspects of configurable consistency described in Table 1.1. For example, unacceptable interleaving of multiple operations determines concurrency control. Identify the minimal sequence of operations in the application logic that preserve state validity. They are likely to be the boundaries of visibility and isolation. Study 87 how each update to state affects other entities viewing that state. The negative consequences of stale data determine replica divergence control decisions. 3. Express all synchronization between application components, such as critical sections, using message-passing primitives, barriers, or exclusive data locks. Configurable consistency is not designed to support intercomponent synchronization unless it can be efficiently expressed in terms of data locks. 4. Protect all accesses to application state by sessions that start and end access to that state. Set the appropriate vector of configurable consistency options for each session based on the invariants identified in step 1. 4.3.2 Application Design Examples We illustrate how some of our focus applications (described in Section 2.3) can be designed to employ Swarm’s caching support, and how the appropriate configurable consistency options for their sharing needs can be determined. 4.3.2.1 Distributed File Service A file service has two kinds of replicable objects: files and directories. A file must be reachable from at least one directory as long as it is not deleted. This requires that the file and its last directory entry must be removed atomically. When a file or directory is concurrently updated at multiple places, a common final result must prevail, which can be ensured by the CC framework’s total ordering option. Source files and documents typically have valid contents only at the end of an editing session, and hence need session visibility and isolation. However, each update to a directory preserves its validity. Hence per-access visibility and isolation are acceptable for directories. A user accessing a shared document typically wants to see prior updates, and requires a hard most-current timeliness guarantee. 4.3.2.2 Music File Sharing There are two classes of replicable objects in a music file sharing application: the music index, which is frequently updated, and the music files themselves, which are 88 read-only. The index stores mappings from attributes to names of files with matching attributes. Replicating the music file index speeds up queries. However, a single global music index could get very large, making synchronization of replicas expensive when a large number of users updates it frequently. Splitting the music index into a hierarchy of independently replicable subindices helps spread this load among multiple nodes. 
In that case, the only invariant to maintain is the integrity of the index hierarchy when splitting or merging subindices. Returning stale results to queries is acceptable, provided eventual replica convergence can be ensured. 4.3.2.3 Wide-area Enterprise Service Proxies An enterprise service typically manages a number of objects that can be cached individually at wide area proxy sites to exploit regional locality. Examples include customer records, sale items, inventory information, and other entities that an e-commerce service deals with. However, these services typically store enterprise objects in a relational database system that aggregates multiple objects into a single table. Replication at the granularity of a table could lead to false sharing among multiple sites which performs poorly under strong consistency requirements. Employing an object veneer such as that provided by Objectstore [41] on top of the relational database enables individual objects to be cached with different consistency semantics based on their sharing patterns for improved overall performance. For instance, close-to-open consistency suffices for accessing customer records at a service center; bounded inconsistency is required for reserving seats in an airline to limit the extent of overbooking [77]; the sale of individual items requires strong consistency; eventual consistency suffices for casual browsing operations. 4.4 Architectural Overview of Swarm Having outlined Swarm and its expected usage for employing caching in distributed applications, we now turn to describing Swarm’s design and implementation. We start by giving an overview of Swarm’s internal architecture. Figure 4.4 shows the structure of a Swarm server with its major modules and their flow of control. 89 Figure 4.4. Structure of a Swarm server and client process. A Swarm file access request is handled by the Swarm server’s session management module, which invokes the replication module to fetch a consistent local copy in the desired access mode. The replication module creates a local replica after locating and connecting to another available replica nearby. It then invokes the consistency module to pull an up-to-date copy from other replicas. The consistency module performs global concurrency control and replica divergence control followed by local concurrency control before granting file access to the local session. The bulk of configurable consistency functionality in a Swarm server is implemented by the consistency, update, log and the inter-node protocol modules. Of these, the update and log modules are responsible for storing updates, and preserving their ordering and dependency constraints during propagation. A Swarm server’s state is maintained in several node-local persistent cache data structures (using BerkeleyDB) and cached files stored in the local file system. Finally, application-supplied plugins can be dynamically linked to a Swarm server. They interpret and apply operational updates issued by application instances and participate in resolving conflicting updates, at all replicas. We describe each of these modules in the following sections. 4.5 File Naming and Location Tracking As mentioned in Section 4.2, Swarm files are named by their SWIDs. To access a file, a Swarm server first locates peers holding permanent copies of the file, called its 90 custodians, based on its SWID via an external location service (e.g., Chord [68], or a directory service). 
The number of custodians can be configured per file, and determines Swarm’s ability to locate the file in spite of custodian failures. One of the custodians, designated the root custodian (also called the home node), coordinates the file’s replication as described in the next section, to ensure consistency. By default, the root custodian is the Swarm server where the file was created. In case of a proxy caching application, the root custodian is usually the Swarm server at the application’s home service site where all application state is permanently stored. When the root node fails, a new root is elected from among the available custodians by a majority voting algorithm. Scalable file naming and location are not the focus of our work, because several scalable solutions exist [58, 68, 79]. Hence, for simplicity, we devised the following scheme for assigning and locating SWIDs in our Swarm prototype. Each Swarm server manages its own local ID space for files created locally, and SWIDs are a combination of its owner Swarm server’s IP address, and its ID within the Swarm server’s local ID space. A Swarm server finds a file’s root custodian based on the address hard-coded in its SWID. Each allocated SWID is also given a generation number that is incremented every time the ID is reassigned to a new file, to distinguish between references to its old and new incarnations. Swarm’s design allows our simple hardwired naming and location scheme to be easily replaced by more robust schemes such as those of Pastry [58] and Chord [68], as follows. Some of the Swarm servers, designated as SWID-location servers, run a peer-to-peer location protocol such as that of Chord to manage the SWID-to-custodian mappings among themselves. They advertise themselves to other Swarm servers by registering with an external DNS-like directory service. When a Swarm server allocates a new SWID, it becomes the SWID’s root custodian and adds the SWID-to-custodian mapping to the SWID location service. Whenever a custodian is added or removed for a SWID (as described in Section 4.6.2), its mapping gets updated by contacting a SWID-location server. We expect this operation to be infrequent, since location servers typically run on reliable machines. When looking up a 91 SWID, a Swarm server first discovers an available SWID-location server via the external directory service and queries it for the SWID’s custodians. To expedite SWID lookups and to reduce load on the fewer location servers, each Swarm server caches the results of recent lookups for custodians as well as file replicas in a lookup cache, and consults it before querying the distributed SWID location service. Our prototype employs lookup caches. 4.6 Replication Swarm’s replication design must meet three goals: it must enable the enforcement of diverse consistency guarantees, provide low latency access to data despite large-scale replication, and limit the load imposed on each replica to a manageable level. Swarm servers create file replicas locally as a side-effect of access by Swarm clients. For network-efficient and scalable consistency management, they dynamically organize replicas of each file into an overlay replica hierarchy rooted at its root custodian, as explained in this section. Each server imposes a user-configurable replica fanout (i.e., number of children, typically 4 to 8) to limit the amount of replica maintenance traffic handled for each local replica. 
To support both strong and weak consistency flavors of the configurable consistency framework across non-uniform links, Swarm employs a consistency protocol based on recursive messaging along the links of the replica hierarchy (described in Section 4.7). Hence the protocol requires that the replica network be acyclic to avoid deadlocks. Figure 4.5 shows a Swarm network of six servers with replica hierarchies for two files. In the rest of this section, we describe how Swarm creates and destroys replicas and maintains dynamic, cycle-free hierarchies, and how a large number of Swarm nodes discover and track each other's accessibility and connection quality.

Figure 4.5. File replication in a Swarm network. Files F1 and F2 are replicated at Swarm servers N1..N6. Permanent copies are shown in darker shade. F1 has two custodians: N4 and N5, while F2 has only one, namely, N5. Replica hierarchies are shown for F1 and F2 rooted at N4 and N5 respectively. Arrows indicate parent links.

4.6.1 Creating Replicas

When a Swarm server R wants to cache a file locally (to serve a local access), it first identifies custodians for the file's SWID by querying the location service and caches them in its local lookup cache of known replicas. It then requests one of them (say P) to be its parent replica and provide a file copy, preferring custodians that can be reached via high-quality (low-latency) links over others. Each node tracks link quality with other nodes as explained in Section 4.6.5. Unless P is serving "too many" child replicas, P accepts R as its child, transfers file contents, and initiates consistency maintenance as explained in Section 4.7. P also sends the identities of its children to R, along with an indication of whether it has already reached its configured fanout (#children) limit. R augments its lookup cache with the supplied information. If P was overloaded, R remembers to avoid asking P for that file in the near future. Otherwise, R sets P as its parent replica. R repeats this process of electing a parent until it has a valid file copy and an accessible parent replica. The root custodian is its own parent. When a replica loses contact with its parent, it reelects its parent in a similar manner. Even when a replica has a valid parent, it continually monitors its network quality to known replicas and reconnects to a closer replica in the background, if found. A replica detaches from its current parent only after attaching successfully to a new parent. This process forms a dynamic hierarchical replica network rooted at the root custodian like Blaze's file caching scheme [10], but avoids multiple hops over slow network links when possible, like Pangaea [62]. The above parent election scheme could form cycles. Hence, before accepting a new parent replica, an interior replica (i.e., one that has children) actively detects cycles by initiating a root probe request that is sent along parent links. If the probe reaches its originator, it means that accepting the new parent would form a cycle. Leaf replicas do not accept new children while they themselves are electing their parent. Hence they need not probe for cycles, nor do they form primitive cycles. Figure 4.6 illustrates how file F2 ends up with the replica hierarchy shown in Figure 4.5, assuming node distances in the figure roughly indicate their network roundtrip times. Initially, the file is homed at node N5, its only custodian. Nodes N1 and N3 cache the file, making N5 their parent (Figure 4.6a).
Subsequently, nodes N2 and N4 also cache it directly from N5, while N1 reconnects to its nearby replica N3 as its parent (Figure 4.6b)2 . Since N5 informs its new children of their sibling N3, they connect to N3, which is nearer to them than N5 (Figure 4.6c). At that time, N2 gets informed of the nearer replica at N1. Thus, N2 makes N1 its parent replica for file F2 (Figure 4.6d). Each Swarm server’s administrator can independently limit the maximum fanout (number of children) of its local file replicas based on the node’s CPU and network load-handling capacity. A high fanout increases a node’s CPU load and the network bandwidth consumed by Swarm communication. A low fanout results in a deep hierarchy, and increases the maximum hops between replicas and access latency. We believe (based on the Pangaea study [62]) that a fanout of 4 to 8 provides a reasonable balance between the load imposed due to fanout and the synchronization latency over a deep hierarchy. We evaluated Swarm with several hundred replicas forming hierarchies up to 4 or 5 levels deep and a fanout of 4 (see Section 5.4.3). Since replicated performance is primarily sensitive to number of hops and hence depth, we expect Swarm to gracefully handle a fanout of 8, enabling it to scale to thousands of replicas with hierarchies of similar depth. 2 Note that N3 could have reconnected to N1 instead. If both try connecting to each other, only one of them succeeds, as one of them has to initiate a root probe which detects the cycle. 94 4.8 Figure 4.6. Replica Hierarchy Construction in Swarm. (a) Nodes N1 and N3 cache file F2 from its home N5. (b) N2 and N4 cache it from N5; N1 reconnects to closer replica N3. (c) Both N2 and N4 reconnect to N3 as it is closer than N5. (d) Finally, N2 reconnects to N1 as it is closer than N3. The replica networking mechanism just described does not always form the densest hierarchy possible, although it forms a hierarchy that enables efficient bandwidth utilization over the wide area. However, a dense hierarchy enables the system to scale logarithmically in the number of replicas. To keep the hierarchy dense, a replica with not enough children periodically advertises itself to its subtree as a potential parent to give some of its descendants a chance to reconnect to it (and move up the hierarchy) if they find it to be nearer. Finally, all the blocks of a file share the same replica hierarchy, although consistency is maintained at the granularity of individual file blocks. This keeps the amount of hierarchy state maintained per file low and independent of file size, and simplifies the enforcement of causality and atomicity constraints for page-grained sharing of distributed data structures. 95 4.6.2 Custodians for Failure-resilience The purpose of multiple custodians is to guard against transient failure of the root replica as well as to ease the load imposed on it by SWID lookup traffic from nodes reconnecting to the hierarchy. A file’s home node recruits its child replicas to keep enough number of custodians as configured for the file. The identity of custodians is propagated to all replicas in the background to help them reconnect to the replica hierarchy after disconnection from their parent. Also, if the root node becomes inaccessible, other custodians reconnect to each other to keep a file’s replica hierarchy intact until the root node comes up, and prevent it from getting permanently partitioned. 
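The parent (re)election and cycle-avoidance procedure of Section 4.6.1, which custodians and ordinary replicas alike use to rejoin the hierarchy, can be summarized by the following C-style sketch. All helper functions are hypothetical stand-ins for Swarm's actual messages, and the replica_state structure is the one sketched earlier in this section; file-content transfer and consistency setup are elided:

/* Sketch of the per-file parent (re)election loop of Section 4.6.1. */
void elect_parent(struct replica_state *r)
{
    while (r->parent == NULL) {
        struct node *cand = closest_usable_replica(r);  /* lowest tracked RTT   */
        if (cand == NULL)
            break;                                      /* retry election later */

        /* An interior replica must not adopt a parent that would create a
         * cycle: send a root probe up cand's parent links and reject cand
         * if the probe comes back to this replica. */
        if (r->nchildren > 0 && root_probe_returns_to(r, cand)) {
            mark_unusable(r, cand);
            continue;
        }
        if (!request_adoption(cand, r)) {    /* cand may be at its fanout limit */
            remember_overloaded(r, cand);    /* avoid re-asking it for a while  */
            continue;
        }
        /* cand accepted us, transferred the file, and sent its child list,
         * which is added to the local lookup cache of known replicas. */
        r->parent = cand;
    }
}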
4.6.3 Retiring Replicas Each Swarm server (other than a custodian) tracks local usage of cached file copies and reclaims its cache space by evicting least recently used ones, after propagating their updates to neighboring replicas and informing neighbors of the replica’s departure. The orphaned child replicas elect a new parent as described above. However, before a custodian can retire, it must inform the root node, as custodians must participate in majority voting for root election. When the root node itself wishes to retire its copy, it first transfers the root status to another custodian (in our prototype, the one with the lowest IP address) via a two-phase protocol and becomes its noncustodian child. This mechanism allows permanent migration of a file’s custody among nodes. 4.6.4 Node Membership Management Swarm servers learn about each other’s existence as a side-effect of looking up SWIDs. Each Swarm server communicates with another via a TCP connection to ensure reliable, ordered delivery of messages. Since Swarm servers form replica hierarchies independently for each file, a Swarm server that caches a lot of files might need to communicate with a large number of other Swarm servers, potentially causing scaling problems. To facilitate scalable communication, each Swarm server keeps a fixed-size LRU cache of live connections to remote nodes (called connection cache), and a larger persistent node cache to efficiently track the status of remote nodes with which it communicated recently. 96 A replica need not maintain an active connection with its neighboring replicas all the time, thus ensuring a Swarm server’s scalable handling of a large number of files and other Swarm servers. A Swarm server assumes that another node is reachable as long as it has an established TCP connection to it. If a communication error (as opposed to a graceful close by peer) causes the TCP layer to shut down the connection, the Swarm server infers that the other node is down and avoids it in subsequent lookups and replica reconnection attempts for a timeout period. A Swarm node treats a busy neighbor differently from an unreachable neighbor by employing longer timeouts for replies from a connected node to avoid prematurely reacting to transient overload. 4.6.5 Network Economy To aid in forming replica hierarchies that utilize the network efficiently, Swarm servers continually monitor the quality of their network links to one another as a sideeffect of normal communication, and persistently cache this information in the node cache. Our prototype employs roundtrip time (RTT) as the link quality metric, but it can be easily replaced by more sophisticated metrics that also encode bandwidth and lossiness of links. The link quality information is used by the consistency management system to enforce various failure-handling options, and by the replication system to reduce the use of slow links in forming replica hierarchies. In our current design, each replica decides for itself which other replicas are close to it by pinging them individually (piggybacked on other Swarm messages if possible). Thus, the ping traffic is initially high when a number of new Swarm servers join the system for the first time, but quickly subsides and is unaffected by nodes going up and down. However, Swarm can easily adopt better RTT estimation schemes such as IDMaps[25] as they become available. 
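The link-quality bookkeeping of Section 4.6.5 amounts to a small per-peer record updated as replies arrive. The sketch below is our own rendering: the prototype's metric is simply round-trip time, while the exponentially weighted smoothing, field names, and smoothing factor are assumptions:

/* Per-peer link-quality tracking, persistently cached in the node cache. */
struct peer_quality {
    double rtt_ms;       /* smoothed RTT to this Swarm server (0 = no sample yet) */
    long   last_sample;  /* timestamp of the most recent measurement              */
};

/* Called when a reply arrives; pings are piggybacked on normal Swarm
 * messages where possible, so most samples cost nothing extra. */
void record_rtt_sample(struct peer_quality *q, double sample_ms, long now)
{
    const double alpha = 0.25;                  /* smoothing weight (assumed) */
    q->rtt_ms = (q->rtt_ms == 0.0)
                    ? sample_ms
                    : (1.0 - alpha) * q->rtt_ms + alpha * sample_ms;
    q->last_sample = now;
}

Replica hierarchy construction then simply prefers candidate parents with lower rtt_ms.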
4.6.6 Failure Resilience

Swarm treats custodians (including the root) as first-class replicas for reliability against permanent loss, and employs a majority voting algorithm [35] to provide Byzantine fault tolerance. Hence, for high availability under strong consistency, custodian copies must be small in number (typically 4) and hosted on well-connected machines. However, Swarm can handle a large number (thousands) of secondary replicas and their failures due to its dynamic hierarchy maintenance. Swarm handles transient root node failure as explained earlier in this section.

4.7 Consistency Management

In this section, we describe our design of configurable consistency management in the context of Swarm. Each Swarm server has a consistency module (CM) that is invoked when clients open or close a file or when clients perform reads or updates within a session via the operations listed in Table 4.1. The CM performs checks, interacting with replica neighbors (parent and children) via pull operations as described below, to ensure that the local file copy meets client requirements. Similarly, when a client issues updates, the CM propagates them via push operations to enforce the consistency guarantees given by peer servers to their clients. Before granting data access, the consistency module performs concurrency control and replica divergence control. Concurrency control delays the request until all other access sessions that conflict with the requested access mode (according to the concurrency matrix in Table 3.1) are completed at all replicas. Replica divergence control pulls updates from other replicas (preserving their ordering constraints) to make sure that the local copy meets the given replica synchronization requirements. To succinctly track the concurrency control and replica divergence guarantees given to client sessions (i.e., time and mod bounds), the CM internally represents them by a construct called the privilege vector (PV). The CM can allow a client to access a local replica without contacting peers if the replica’s consistency guarantee, as indicated by its PV, is at least as strong as the client requires. For example, if a client requires at most 200ms of staleness and the replica’s PV guarantees a maximum staleness of 100ms, no pull is required. CMs at neighboring replicas in the hierarchy exchange PVs based on client consistency demands, ensure that their local PVs do not violate the guarantees of remote PVs, and push enough updates to preserve each other’s PV guarantees. Although a Swarm server is responsible for detecting inaccessible peers and repairing replica hierarchies, its CM must continue to maintain consistency guarantees in spite of reorganization (by denying access if necessary). To recover from unresponsive peers, the CMs at parent replicas grant PVs to children as leases, as explained below.

4.7.1 Overview

4.7.1.1 Privilege Vector

A privilege vector (PV) consists of four components that are independently enforced by different consistency mechanisms: an access mode (mode), a hard time/staleness limit (HT), a hard mod limit (HM), and a soft time+mod limit (STM.[t,m]). Figure 4.7 defines a PV and an “includes” relation (⊇) that partially orders PVs. This relation defines how to compare the strength of the consistency guarantees provided by two PVs. The figure also lists the privilege state maintained by a replica, which will be explained later.
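The ⊇ relation and the PVmax/PVmin operators that Figure 4.7 defines (the figure appears shortly) can be transcribed almost directly into executable form. The sketch below is our illustrative Python reading of that figure, not Swarm code; in particular, the numeric mode ranking used by pv_max/pv_min is a simplification, since WR and RDLK are mutually incomparable.

from dataclasses import dataclass

INF = float("inf")
# WRLK includes WR includes RD; WRLK includes RDLK includes RD; WR and RDLK are incomparable.
_STRONGER = {
    "RD":   {"RD"},
    "WR":   {"RD", "WR"},
    "RDLK": {"RD", "RDLK"},
    "WRLK": {"RD", "WR", "RDLK", "WRLK"},
}

def mode_includes(m1, m2):
    return m2 in _STRONGER[m1]

@dataclass(frozen=True)
class PV:
    mode: str         # RD, WR, RDLK or WRLK
    ht: float = INF   # hard time/staleness limit (enforced by pulls)
    hm: float = INF   # hard mod limit (enforced by pushes)
    st: float = INF   # soft time limit  \ STM.[t, m]
    sm: float = INF   # soft mod limit   /

def pv_includes(pv1, pv2):
    """True if pv1's guarantee is at least as strong as pv2's (pv1 "includes" pv2)."""
    if pv1.mode == "WRLK":
        return True
    if pv1.mode == "RDLK" and mode_includes(pv1.mode, pv2.mode):
        return True
    return (mode_includes(pv1.mode, pv2.mode)
            and all(a <= b for a, b in zip((pv1.ht, pv1.hm, pv1.st, pv1.sm),
                                           (pv2.ht, pv2.hm, pv2.st, pv2.sm))))

_RANK = {"RD": 0, "WR": 1, "RDLK": 2, "WRLK": 3}   # simplification of the mode partial order

def pv_max(a, b):   # strongest combined requirement (Figure 4.7's PVmax)
    m = a.mode if _RANK[a.mode] >= _RANK[b.mode] else b.mode
    return PV(m, min(a.ht, b.ht), min(a.hm, b.hm), min(a.st, b.st), min(a.sm, b.sm))

def pv_min(a, b):   # weakest guarantee deliverable locally (Figure 4.7's PVmin)
    m = a.mode if _RANK[a.mode] <= _RANK[b.mode] else b.mode
    return PV(m, min(a.ht, b.ht), max(a.hm, b.hm), max(a.st, b.st), max(a.sm, b.sm))

# Example: a replica holding [WR, inf, inf, [10, 4]] can serve a session needing only
# [RD, inf, inf, [30, 8]] without pulling, but not one demanding a hard time bound of 0.
assert pv_includes(PV("WR", st=10, sm=4), PV("RD", st=30, sm=8))
assert not pv_includes(PV("WR", st=10, sm=4), PV("RD", ht=0))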
Associated with each replica of a consistency unit (i.e., a file or a portion of it) is a current privilege vector (currentPV in short) that indicates the highest access mode and the tightest (i.e., numerically lowest) staleness and the mod limit guarantees that can be given to local sessions without violating the guarantees given at remote replicas. By default, a file’s root custodian has the highest PV (i.e., [WRLK, *, *, *] where * is a wildcard), whereas a new replica starts with the lowest PV (i.e., [RD, ∞, ∞, ∞]). A replica is a potential writer if its PV contains a WR or WRLK mode, allowing it to update data. 4.7.1.2 Handling Data Access When a Swarm client requests a nearby server to open a session on a data item, the server first caches the item locally and computes the required PV from the session’s consistency options. If the replica’s currentPV is not high enough to include the required PV (i.e., the client wants a stronger consistency guarantee than the replica currently has), the server obtains the required PV from other replicas as described below. Next, the server performs local concurrency control by delaying the request until the close of other local sessions whose access modes conflict with the new session’s mode (according to the concurrency matrix in Table 3.1). These two actions constitute a pull operation. Finally, 99 Privilege Vector (PV): [ mode: (RD, WR, RDLK, WRLK), /* (pull) */ HT: (0, ∞) /* Hard Time limit (pull) */, HM: (0, ∞) /* Hard Mod limit (push) */, STM: [t: (0, ∞), m: (0, ∞)] /* Soft Time and Mod limits (push) */ ] PVincludes(PV1, PV2) /* ⊇relationship */ { /* checks if PV1⊇PV2 */ /* WRLK⊇WR⊇RD; WRLK⊇RDLK⊇RD */ if (PV1.mode == WRLK) or ((PV1.mode == RDLK) and (PV1.mode ⊇PV2.mode)) return TRUE; else if (PV1.mode⊇PV2.mode) and (PV1.[HT, HM, STM] ≤ PV2.[HT, HM, STM]) return TRUE; else return FALSE; } PVmax(PV1, PV2) { return [max(PV1.mode, PV2.mode), min(PV1.HT, PV2.HT), min(PV1.HM, PV2.HM), min(PV1.STM, PV2.STM)]; } PVcompat(localPV) { /* returns the highest remote PV compatible with localPV */ /* localPV ⇒remote PV */ [WRLK] ⇒[RD, ∞, 0, [0,0]] [RDLK] ⇒[RDLK] [WR, finite, *, *] ⇒[RD, ∞, 0, [0,0]] [RD, finite, *, *] ⇒[RDLK] [WR, ∞, *, *] ⇒[WR, ∞, 0, [0,0]] [RD, ∞, *, *] ⇒[WRLK] } Replica Privilege State: foreach neighbor N, struct { PVin; /* PV obtained from N */ PVout; /* PV granted to N */ lease; /* if N is parent: lease on PVin; * if N is child: lease on PVout */ last pull; /* time of last completed pull */ unsent updates; /* #updates to push */ last push; /* time of last completed push */ } N; localPV = PVmax(S.PV) ∀ongoing local sessions S. currentPV = PVmin(N.PVin) ∀neighbors N. Invariants on Privilege State: (PVcompat(N.PVout)⊇ N.PVin and PVcompat(N.PVin)⊇N.PVout) ∀neighbors N. currentPV⊇localPV. PVmin(PV1, PV2) { return [min(PV1.mode, PV2.mode), min(PV1.HT, PV2.HT), max(PV1.HM, PV2.HM), max(PV1.STM, PV2.STM)]; } Figure 4.7. Consistency Privilege Vector. 100 the server opens the local session, allowing the client to access the local replica. Figure 4.8 gives the pseudo-code for a session open operation. 4.7.1.3 Enforcing a PV’s Consistency Guarantee A server obtains a higher PV for a local replica (e.g., to satisfy a client request) by issuing RPCs to its neighbors in the replica’s hierarchy. These RPCs are handled by those neighbors in a similar manner to client session open requests. To aid in tracking remote PVs, a replica remembers, for each neighbor N, the relative PV granted to N (called N.PVout) and obtained from N (called N.PVin). 
The replica’s currentPV is the lowest (according to the PVmin() function defined in Figure 4.7) of the PVs it obtained from its neighbors. Since each replica only keeps track of the PVs of its neighbors, a replica’s PV state is proportional to its fanout and not to the total number of replicas of the data item. When a replica R gets a pull RPC from a neighbor N, R pulls the required PV in turn by recursively issuing pull RPCs to its other neighbors and following it up with local concurrency control. Then, R grants the PV to the requestor N and stores it locally as N.PVout, after updating N.PVin to reflect R’s own local consistency needs relative to the requestor. By granting a PV, a replica promises to callback its neighbor before allowing other accesses in its portion of the hierarchy that violate the granted PV’s consistency guarantee. The recursive nature of the pull operation requires that the replica network be acyclic, since cycles in the network cause deadlocks. The pull algorithm is discussed in Section 4.7.2. 4.7.1.4 Handling Updates When a client issues updates or closes a session, the Swarm server examines the PVs granted to its neighbors. If it finds that their consistency guarantees (as indicated by their N.PVout values) will be violated as a result of accepting local updates, it propagates enough updates to bring them back to consistency before proceeding, a step called the push operation. The update ordering, visibility and isolation options of the CC frame- 101 sw open(SWID, offset, size, mode, cc options) { reqPV = [mode, ∞,∞,∞]; if cc options.strength == hard, reqPV.[HT, HM] = cc options.[time, mod]; else reqPV.STM = cc options.[time, mod]; if reqPV.STM.m > reqPV.HM, reqPV.STM.m = reqPV.HM; cache obj (SWID, offset, size) locally; pull(reqPV, self); open new local session S on obj with mode; S.PV = reqPV; return S.id; } sw update(session id, update) { foreach neighbor N, push(N); /* push outstanding updates to N */ apply update locally; } sw close(session id) { foreach neighbor N, push(N); /* push outstanding updates to N */ close session(session id); wake up waiters; } Figure 4.8. Pseudo-code for opening and closing a session. 102 work are enforced as part of the push operation, as explained in Section 4.7.2. Figure 4.8 gives the pseudo-code for update and session close operations. 4.7.1.5 Leases To reclaim PVs from unresponsive neighbors, a replica always grants PVs to its children as leases (time-limited privileges), which they need to renew after a timeout period. The root node grants a lease of 60 seconds, and other nodes grant a slightly smaller lease than their parent lease to give them sufficient time to respond to parent’s lease revocation. A parent replica can unilaterally revoke PVs from its children (and break its callback promise) after their finite lease expires. Thus, all privileges eventually end up with the root replica. Child replicas that accept updates made using a leased privilege must propagate them to their parent within the lease period, or risk update conflicts and inconsistencies such as lost updates. Leases are periodically refreshed via a simple mechanism whereby a node pings other nodes that have issued it a lease for any data (four times per lease period in our current implementation). Each successful ping response implicitly refreshes all leases issued by the pinged node that are held by the pinging node. If a parent is unresponsive, the node informs its own children that they cannot renew their lease. 
Swarm’s lease maintenance algorithm is discussed in more detail in Section 4.7.6.

4.7.1.6 Contention-aware Caching

A Swarm server normally satisfies data access requests by aggressively caching the data locally and pulling the PV and updates from other replicas as needed. It keeps the PV it obtained indefinitely (renewing its lease if needed) until another node revokes it. When clients near various replicas exhibit significant access locality, access-driven replication and privilege migration help achieve high performance by amortizing their costs over multiple accesses. However, if multiple sites contend for conflicting privileges frequently, aggressive caching can incur high latency due to frequent pulls among replicas (called thrashing). Swarm provides a heuristic algorithm called contention-aware caching that helps restrict frequent migration of access privileges among replicas under contention, to limit performance degradation. We describe this algorithm in Section 4.7.7. Our implementation of consistency via the exchange of privilege vectors enables wide area clients with various degrees of laxity in their concurrency and replica synchronization requirements to coexist. Clients with looser requirements get more parallelism, while the strong requirements of others are respected globally. For example, consider a replicated access scenario for file F2 of Figure 4.6d. Figure 4.9 illustrates the consistency management actions taken in response to clients accessing the different replicas of F2 with different consistency requirements. Figure 4.9a shows the initial state, where the root replica owns the highest PV. Next, a client opens a write session at node N1 with eventual consistency, meaning that it wants to be asynchronously notified of remote writes (Figure 4.9b). In response, Swarm performs pull operations to get this guarantee from other replicas. When N3 replies to N1, it lowers its own PV to [rd, ∞, ∞, ∞]. Thus, when N1 gets update u1 from the client, it does not push it to N3. Subsequently, a client opens a write session at N5 that requires synchronous notification of remote writes, forcing them to serialize with its local updates (Figure 4.9c). When it issues four updates u2-u5, they need not be propagated to N1 right away, since N1 can tolerate up to 4 unseen writes. Next, a client at N4 (Figure 4.9d) requires a different kind of hard guarantee, namely, the latest data just before local access (e.g., at session open() time, called close-to-open consistency). In response, N3 does not pull from N1, as N1 has promised to push its updates to N3 synchronously. When clients at N1 and N5 issue updates u7 and u6 concurrently (Figure 4.9e), the updates are propagated to each other but not to N4, as N4 needs the latest data only before local access. Finally, when a client requests a read lock session at N2 (Figure 4.9f), it blocks until all write sessions terminate globally, and it prevents subsequent writes until the session is closed. Our hierarchy-based consistency mechanism scales to a large number of replicas by requiring replicas to track only a limited number of other replicas (corresponding to their fanout). This is because the PV presented to a replica by each neighbor summarizes the consistency invariants of that neighbor’s portion of the hierarchy. This summarization is possible due to the inclusive property of PVs.
Figure 4.9. Consistency management actions in response to client access to file F2 of Figure 4.5(d). Each replica is labelled with its currentPV, and ‘-’ denotes ∞. (a) The initial replica network and its consistency status. (b) F2 is opened at N1 with eventual consistency. Pulls are issued to let N3 and N5 know that N1 is a potential writer. Subsequently, N1 gets update u1. (c) F2 is opened at N5 with a hard mod limit of 0. N1 must synchronously push its updates from now on. N5 also gets four updates u2-u5. But since N1 set its mod limit to 4, N5 need not push its updates yet. (d) F2 is opened at N4 with a hard time limit of 0 (such as close-to-open consistency). N3 need not pull from N1, as N1 will notify N3 before it updates the file. (e) N1 and N5 concurrently issue updates u7 and u6 respectively. N1 pushes u7 to N5 through N3 and awaits an ack, while N5 asynchronously pushes u2-u6. Updates are not pushed to N4, as it did not set a mod limit in step (d). (f) Finally, F2 is opened in RDLK mode at N2. This reduces PVs at all other replicas to their lowest value.

Specifically, the locking modes include nonlocking modes, and write modes include read modes (with the exception of WR and RDLK). This allows the access privileges of an entire portion of a replica hierarchy to be represented concisely by the highest PV that includes all others.

4.7.2 Core Consistency Protocol

The bulk of the CM’s functionality (including its consistency protocol) consists of the pull and push operations, which we describe in this section.

Pull(PV): Obtain the consistency guarantees represented by the given PV from remote replicas.

Push(neighbor N): Propagate outstanding updates to a given neighbor replica N based on its synchronization and ordering requirements as indicated by the PV granted to it earlier (N.PVout).

A pull operation is performed before opening a session at a replica as well as during a session if explicitly requested by a client. In addition to notifying a replica’s desired consistency semantics to other replicas, it also enforces the replica’s hard time (HT) bound by forcing other replicas to immediately bring it up-to-date by propagating all their pending updates to it. A push operation is performed when updates arrive at a replica (originating from local or remote sessions), are applied locally (after satisfying the isolation options of local sessions) and are ready for propagation (based on the originating session’s visibility setting).
It enforces mod bounds (both hard and soft) as well as soft time bounds at replicas (HM and STM) by propagating updates to them while adhering to their ordering constraints. To implement these operations, Swarm’s core consistency protocol consists of two primary kinds of messages, namely, get and put. A Swarm server’s protocol handler responds to a get message from a neighbor replica by first making a corresponding pull request to the local consistency module to obtain the requested PV locally. It then replies to the neighbor via a put message, transfering the requested privilege along with any updates to bring it in sync. It also recomputes its own PVin to reflect its current 106 consistency requirements compatible with the neighbor’s new PV, using the PVcompat() procedure outlined in Figure 4.7. When a put message arrives, the protocol handler waits for ongoing update operations to finish, applies the incoming updates atomically, and makes a push request to the consistency module to trigger their further propagation. Figure 4.10 gives the pseudo-code for the get and put operations. We describe the pull algorithm in the next section, and the push algorithm in Section 4.7.4. 4.7.3 Pull Algorithm The pull operation implements both concurrency and replica divergence control, and its algorithm is central to the correctness and performance of configurable consistency. To ease presentation, we first describe the basic algorithm (given in Figure 4.11), and refine it later with several improvements. We defer the discussion on enforcing the CC framework’s update ordering options to Section 4.8. The consistency module at a replica R makes each pull operation go through four successive steps: 1. The pull operation is blocked until previous pull operations complete. This ensures FIFO ordering of access requests, so that an access request (e.g., exclusive write) is not starved by a stream of conflicting access requests (e.g., exclusive reads). 2. The desired PV is requested from neighboring replicas that haven’t already granted the privilege, via get messages. For this, R computes a relative PV that must be obtained from each neighbor, so that the desired PV can be guaranteed locally while keeping the PVs established earlier. It sends a get message to a neighbor N if the PV obtained from N (N.PVin) does not include the relative PV needed. We describe how the relative PVs are computed to enforce various timeliness bounds in Section 4.7.4. 3. R waits for replies to its get messages. The neighbors handle get messages as follows (Figure 4.10 gives the pseudo-code). They pull the PV recursively from other neighbors, make sure that no other local access at their site conflicts with the requested PV, and respond with a put reply message that grants the PV to replica 107 get(reqPV, requesting replica R) { 1: pull(reqPV, R); 2: R.PVout = reqPV; /* compute my new PV relative to R */ 3a: R.PVin = PVmax(R.PVout, S.PV) ∀neighbors N6=R, ∀local sessions S; 3b: R.PVout.HT = (R.PVin.mode == WR,WRLK) ? ∞: 0; R.PVin.HT = (R.PVout.mode == WR,WRLK) ? ∞: 0; 3c: nwriters = # neighbors N where N.mode == WR, WRLK; R.PVin.HM /= nwriters; R.PVin.STM.m /= nwriters; if R.PVin.STM.m > R.PVin.HM R.PVin.STM.m = R.PVin.HM; 3d: if (R.PVout.HM == 0) R.PVout.HT = 0; if (R.PVin.HM == 0) R.PVin.HT = 0; 4: send put(R.PVout, R.PVin, updates) as reply message to R, where updates bring R within reqPV.[HT, HM, STM] limits of me and all my neighbors excluding R. 
} put(dstPV, srcPV, updates) from neighbor R { wait behind ongoing update operations apply updates to local replica R.PVin = dstPV; R.PVout = srcPV; foreach neighbor N6=R, push(N); /* push updates to neighbor N */ signal waiters } Figure 4.10. Basic Consistency Management Algorithms. 108 pull(reqPV, requesting replica R) { 1: wait ongoing pull operations to finish (except when deadlock is possible as explained in the text) 2a: /* recompute mod bound if #writers changed */ nwriters = #neighbors N, where N6=R and N.PVout.mode == WR,WRLK; ++ nwriters if reqPV.mode == WR,WRLK; minHM = localPV.HM / nwriters; minSM = localPV.STM.m / nwriters; if nwriters > 0, foreach neighbor N { split = nwriters; PV = (N == R) ? reqPV : N.PVout; –split if PV.mode == WR,WRLK; minHM = min(minHM, PV.HM / split); minSM = min(minSM, PV.STM.m / split); } foreach neighboring replica N6=R { relPV = reqPV; 2b: if relPV.HT < (curtime() - N.last pull), relPV.HT = ∞; 2c: relPV.[HM, STM.m] = [minHM, minSM]; if not PVincludes(N.PVin, relPV) { relPV = PVmax(relPV, N.PVin); send get(relPV) message to N; } } 3: wait for put replies to get messages sent above 4: foreach local session S, if reqPV.mode conflicts with S.PV.mode, wait for S to close. } Figure 4.11. Basic Pull Algorithm for configurable consistency. 109 Figure 4.7. At the end of this step, no replica other than R has a conflicting access privilege, and R is within the desired timeliness bounds (PV.[HT, HM, STM]). 4. Subsequently, the pull operation is blocked until local sessions with conflicting access modes end. The pull operation is complete only at the end of this step. Our algorithm ensures mutual compatibility of the PVs of replicas by establishing the right set of callback promises between replicas to prevent unwanted divergence in step 2, and by serializing conflicting access mode sessions in step 4. The recursive nature of the pull operation is illustrated in Figure 4.9b, where node N1 pulls privilege [wr, ∞,∞, [10, 4]] from N3, which in turn pulls it from N5. Figure 4.12 shows the relative PVs established at all replicas before and after the client session is opened at N1. By issuing parallel get requests to all neighbors along the hierarchy, the pull algorithm incurs latency that only grows logarithmically in the number of replicas (provided the hierarchy is kept dense, as explained in Section 4.6.1). Also, since each replica only keeps track of the privileges held by its neighbors, the privilege state maintained at a replica is proportional to its fanout and not to the total number of replicas. The algorithm exploits locality by performing synchronization (i.e., pulls and pushes) only between replicas that need it, such as active writers and active readers. This is because, in step 2, get messages are sent only to replicas that could potentially violate PV semantics. For instance, in Figure 4.12b, N3 need not pull from N4 because N4 thinks N3’s PV is [wrlk] and hence cannot write locally without first pulling from N3. 4.7.4 Replica Divergence Control As mentioned in Section 4.7.2, a hard time/staleness bound (HT) is enforced by a pull operation, while the other timeliness bounds (HM and STM) are enforced by the push operation. In general, a replica obtains divergence bound guarantees from each neighbor via a pull operation by specifying the tightest (i.e., numerically smallest) bound of its own local sessions and other neighbors. 
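As a concrete illustration of this rule, the sketch below (Python, with invented names; a simplification of steps 2a and 2c of Figure 4.11) computes the divergence bound a replica might request from one neighbor: the tightest time and mod bounds among its local sessions and its other neighbors, with the mod bound split across the neighbors that can write.

INF = float("inf")

def requested_bounds(target, sessions, neighbors):
    """sessions: dicts with the 'time'/'mod' bounds demanded by local clients.
    neighbors: dict name -> {'time': t, 'mod': m, 'writer': bool} promised earlier."""
    others = {n: v for n, v in neighbors.items() if n != target}
    time_bound = min([s["time"] for s in sessions] +
                     [v["time"] for v in others.values()] + [INF])
    mod_bound = min([s["mod"] for s in sessions] +
                    [v["mod"] for v in others.values()] + [INF])
    # Split the global mod bound between the target and the other writer neighbors,
    # so that the total number of unseen remote updates stays within the bound.
    writers = 1 + sum(1 for v in others.values() if v["writer"])
    if mod_bound != INF:
        mod_bound = mod_bound // writers if mod_bound >= writers else 0
    return {"time": time_bound, "mod": mod_bound}

# Example: a local session tolerates 4 unseen writes and 10 s of staleness, and one
# other neighbor is also a writer; ask the target writer to hold back at most 2.
print(requested_bounds("N3",
                       sessions=[{"time": 10, "mod": 4}],
                       neighbors={"N1": {"time": INF, "mod": INF, "writer": True},
                                  "N3": {"time": INF, "mod": INF, "writer": True}}))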
Figure 4.12. Computation of relative PVs during pull operations. The PVs labelling the edges show the PVin of a replica obtained from each of its neighbors, and ‘-’ denotes ∞. For instance, in (b), after node N3 finishes pulling from N5, its N5.PVin = [wr, ∞, ∞, [10,4]], and N5.PVout = [rd, ∞, ∞, ∞]. Its currentPV becomes PVmin([wr, ∞, ∞, [10,4]], [rd, ∞, ∞, ∞], [wrlk]), which is [rd, ∞, ∞, ∞]. (a) Initial status. (b) After open(wr, soft, 10, 4) at N1.

4.7.4.1 Hard time bound (HT)

To enforce HT, a replica R keeps track of the last time it pulled updates from each neighboring replica N, and includes an HT constraint in its pull request only if the maximum time limit has elapsed since the last pull (step 2b in the pull() procedure of Figure 4.11). In response, replica N grants an HT guarantee of 0 to R if its current privileges relative to R (R.PVin) do not allow writes. Otherwise, it grants an HT guarantee of ∞ (step 3b in the get() procedure of Figure 4.10). In the former case, R can unilaterally assume it has the latest contents; in the latter case, it must pull before every access. This is illustrated in Figure 4.9d, where node N4 requests an HT of 0, but N3 grants it ∞, since N3 is a potential writer relative to N4 (on N1’s behalf).

4.7.4.2 Soft time bound

During a pull, a replica obtains a soft time bound guarantee (STM.t) from each neighbor that is equal to the smallest of the soft time bounds of its own local sessions and its other neighbors. To enforce the guarantee, a replica remembers the last time it pushed updates to each of its neighbors, and pushes new updates when the soft time limit guaranteed to the neighbor has elapsed since the last push (step 1 of the push() procedure in Figure 4.13).

push(destination neighbor R) {
1: if R.unsent updates > R.PVout.HM, or R.unsent updates > R.PVout.STM.m, or (curtime() - R.last push) > R.PVout.STM.t, {
2: send put(R.PVout, R.PVin, R updates) message to R, where R updates bring R within R.PVout.[HM, STM] limits.
3: if R.unsent updates > R.PVout.HM, await ack for put message. } }
Figure 4.13. Basic Push Algorithm for Configurable Consistency.

4.7.4.3 Mod bound

To enforce a modification bound of N unseen remote updates globally (HM and STM.m), a replica splits the bound into smaller bounds to be imposed on each of its neighbors that are potential writers (in step 2a of the pull() procedure in Figure 4.11). It thus divides the responsibility of tracking updates among various portions of the replica hierarchy. Those neighbors in turn recursively split the bound among their other neighbors until all replicas are covered. Each replica tracks the number of updates that are ready for propagation but unsent to each neighbor, and pushes updates whenever it crosses the neighbor’s mod limit (in step 1 of the push() procedure). If the mod bound is hard, it pushes updates synchronously, i.e., waits until they are applied and acknowledged by receivers, before making subsequent updates. Since a mod bound guarantee is obtained only from neighbors that are potential writers, it must be re-established via pulls whenever a replica’s neighbor newly becomes a writer. When the neighbor issues a pull for write privilege, the replica recomputes the new bound in step 2a of the pull procedure and issues further pulls if necessary.
It also recomputes its own mod bound relative to the neighbor (in step 3c of the get() procedure) before granting the write privilege. Our algorithms adjust PVs to reflect the fact that a zero HM bound implicitly guarantees a zero HT bound (step 3d of the get() procedure).

4.7.5 Refinements

Next we present several successive refinements to the pull algorithm presented above, both to fix several problems and to improve its performance.

4.7.5.1 Deadlock Avoidance

The serialization of pull operations in step 1 prevents starvation and ensures fairness, but is prone to deadlock when concurrent pull operations at parent and child replicas cause them to send get messages to each other and wait for replies. These gets trigger their own pulls at the receiving replicas, which wait behind the already waiting pulls, causing a deadlock. To avoid this, we introduce asymmetry: we allow a parent’s pull request at a child not to wait behind the child’s pull request to the parent.

4.7.5.2 Parallelism for RD Mode Sessions

The pull algorithm presented so far can sometimes be overly restrictive. Consider an RD mode session, which must not block for ongoing locking mode (RDLK, WRLK) sessions to finish elsewhere. However, such blocking is possible with our pull algorithm if a replica R gets a WR mode session request while an RDLK is held elsewhere. The WR mode session issues a pull, which blocks for the remote RDLK session to finish. If an RD session request arrives at R at this time, it blocks in step 1 behind the ongoing WR mode pull, which will not complete unless the remote RDLK session completes. To fix this problem, we exempt RD mode pulls from waiting in step 1 for pulls of non-RD modes to finish. Instead, separate get requests are sent to other replicas for an RD privilege. Also, a put reply that carries an RD privilege wakes up only waiters for RD mode. Thus, an RD session at a replica proceeds with the latest contents while a WR session waits for conflicting remote sessions to end elsewhere as well. The next refinements add privilege leasing and contention-aware caching to the algorithm, which we describe in the following sections.

4.7.6 Leases for Failure Resilience

Our next refinement adds leasing of privileges for a limited time to help replicas recover from unresponsive neighbors, as explained in Section 4.7.1. When a replica P grants a higher PV than [wr, ∞, ∞, ∞] (indicating a non-trivial consistency guarantee) to its child replica C, it also gives it a lease period L, and locally marks C’s lease expiry time as (local clock time + L). L can be set to twice the maximum latency for a message to travel between the root and the farthest leaf along the hierarchy. When C receives the privilege, it computes its own lease expiry time as (local clock time + L - latency to P), and sets a smaller lease period for its children based on its own lease. Subsequently, if P fails to send a get to C as part of a pull operation, it blocks the pull until C’s lease expires and then unilaterally resets C.PVout to the lowest possible value (i.e., [rd, ∞, ∞, ∞]). The two replicas consider the lease valid as long as the Swarm nodes hosting them communicate successfully at least once in a lease period. A Swarm server keeps the parent leases of its local replicas valid by periodically pinging the servers hosting their parent replicas (four times in a lease period), as sketched below.
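The following sketch illustrates this lease bookkeeping under our own (hypothetical) naming; it is not the prototype's code. It shows the expiry computation that discounts the latency to the parent, the slightly shorter lease re-granted to children, and a ping interval of four pings per lease period.

import time

PINGS_PER_LEASE = 4            # a lease is refreshed by pinging its grantor this often

class LeaseHolder:
    def __init__(self, parent, lease_period, latency_to_parent):
        self.parent = parent
        self.period = lease_period
        # The child discounts the latency to the parent so that it never believes the
        # lease is still valid after the parent has already considered it expired.
        self.expires_at = time.time() + lease_period - latency_to_parent

    def child_lease_period(self, margin=1.0):
        # Grant children a slightly shorter lease so they can respond to a revocation
        # before our own lease with the parent runs out.
        return max(self.period - margin, 0.0)

    def valid(self):
        return time.time() < self.expires_at

    def refreshed_by_ping(self, rtt):
        # A successful ping reply implicitly renews every lease issued by the pinged
        # node; discount the round trip conservatively.
        self.expires_at = time.time() + self.period - rtt

    def ping_interval(self):
        return self.period / PINGS_PER_LEASE

# Example: the root grants a 60 s lease; a child 0.2 s away re-grants roughly 59 s
# to its own children and pings its parent every 15 s.
c = LeaseHolder(parent="N5", lease_period=60.0, latency_to_parent=0.2)
print(round(c.child_lease_period(), 1), round(c.ping_interval(), 1))   # 59.0 15.0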
If the server finds that a remote server hosting parent replicas is unresponsive, it forces local replicas to reconnect to a new parent and re-pull their privilege (via a special ‘lease recovery’ pull, explained below) before the lease with the old parent expires. Replicas that succeed can continue to allow local access. (Such a recovery of a leased privilege is possible if the Swarm server can reach the parent server indirectly through other servers via different network paths; a common example is a Swarm server running on a mobile computer that connects to different networks over time.) For those replicas that cannot recover the privilege within its lease period, the Swarm server informs their child replicas that the lease cannot be renewed. Those replicas must fail ongoing sessions after the lease period, as their consistency can no longer be guaranteed. Normally, a replica blocks pull requests until leases given to unresponsive children expire. When a child replica needs to recover a valid lease after reconnecting to a new parent, it requests a pull marked as a ‘lease recovery’ pull. When a replica receives such a pull request, it revokes leases from its unresponsive children immediately, to enable the reconnecting child replica to recover its leased privilege before it expires. For the same reason, lease recovery pulls arriving at a node are propagated ahead of normal pulls, and trigger similar pulls recursively. Our lease algorithm thus enables replicas to preserve lease validity and maintain consistency guarantees in spite of the inaccessibility of some replicas and a dynamically changing hierarchy.

4.7.7 Contention-aware Caching

Our last refinement adds adaptation to the data locality available at multiple replicas by switching between aggressive caching and redirection of client requests, which we describe next. Caching data at a site is effective if both the data and the associated PVs are locally available when local clients need them most of the time. When clients make interleaved data accesses at multiple replicas that often require them to synchronize via pull operations, data access latency goes up, hurting application performance. In that case, better performance is achieved by centralizing data accesses, i.e., redirecting client requests to fewer replicas and reducing the frequency of pull operations at those replicas. Centralization eliminates the extra network traffic and delays incurred by synchronous pulls between replicas, but adds a network roundtrip to remote client requests. The challenge lies in designing an algorithm that dynamically makes the right tradeoff between caching and centralization and chooses the replica sites to which accesses should be directed, so as to improve overall performance compared to a static choice. We devised a simple heuristic algorithm called adaptive caching to achieve this, which we describe in this section. Though our algorithm is not optimal, it effectively prevents thrashing when accesses are interleaved across replicas, and resorts to aggressive caching otherwise. Figure 4.14 shows the major modifications made to replica state for adaptive caching. To aid in adaptation, Swarm tracks the available locality at various replicas as follows. Each replica monitors the number of access requests to a data item (i.e., session open() attempts) it satisfied locally between successive revocations of its privilege by remote replicas, called its PV reuse count. This count serves as an estimate of the data item’s access locality at that replica.
Replica State for adaptive caching:
foreach neighbor N, struct {
  ...
  cachemode: {MASTER, SLAVE, PEER},
  reuse; /* relative reuse count */
} N;
master; /* none: PEER, self: MASTER, other: SLAVE */
reuse; /* PV reuse count */
master reuse; /* reuse count between notifications to slaves */
low thres, high thres; /* hysteresis thresholds */
notify thres; /* threshold master reuse to notify slaves */
Figure 4.14. Replication State for Adaptive Caching. Only salient fields are presented for readability.

For instance, if clients issue strictly interleaving write sessions at multiple replicas of a file employing close-to-open consistency, their reuse counts will all be zero, indicating a lack of locality. The relative reuse count of a replica with respect to its neighbor is the highest of its own reuse count and those of its other neighbors. Replicas exchange their relative reuse counts with neighbors as part of consistency messages (get and put), and reset their own count after relinquishing their PV to neighbors. Clients can specify a low and a high hysteresis threshold as per-file consistency attributes that indicate the minimum number of local access requests a replica must see before it can issue the next pull. A hysteresis threshold of zero corresponds to aggressive caching, whereas a high value forces centralization by preventing privilege-shuttling, as explained below. An appropriate threshold value for an application would be based on the ratio of the cost of pulling PVs to that of an RPC to a central server. Figures 4.15 and 4.16 outline the configurable consistency algorithms that support adaptive caching. When a replica must pull from other replicas to open a session locally, and its reuse count is less than the file’s low hysteresis threshold, it sends out an election request via its high-reuse neighbors for a master (M) mode replica where the session must actually be opened (see the check locality() algorithm in Figure 4.15). This request, called elect master msg() in Figure 4.16, gets propagated via the replica hierarchy to an already elected master replica or to one that has accumulated a higher relative reuse count (or one higher than the low threshold). The forwarding replica now switches to slave (S) mode, i.e., it transparently creates a session at the master and performs the local session’s reads and writes through the master. Figure 4.16 outlines the master election algorithms. Alternatively, when opening a session, the application can ask Swarm for the location of the elected master replica and directly forward the application-level operation to its application component near the master. In either case, the slave as well as the master replica increment their local reuse counts for each forwarded session. A slave whose reuse count rises to the high threshold unilaterally switches back to pulling access privileges, i.e., switches to peer (P) mode. To keep the slave replicas in slave mode, the master periodically (i.e., whenever its reuse count crosses the low threshold) notifies its local reuse count to all slaves via the replica hierarchy (in a notify slave() message). If a slave sees that it contributed a major portion of the master’s reuse count (via session forwarding), it switches to peer mode without waiting for its count to reach the high threshold. Otherwise it conservatively assumes contention at the master, and stays in slave mode after resetting its own count.
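The core of the hysteresis decision can be paraphrased as follows. This Python sketch is a simplification of the check locality() logic of Figure 4.15 (shown below): it captures only the two clear-cut rules — defer to an elected master while the local reuse count is below the low threshold, and revert to peer mode once it reaches the high threshold — and elides the in-between band in which a slave solicits the master's usage count.

class AdaptiveState:
    def __init__(self, low_thres, high_thres):
        self.low, self.high = low_thres, high_thres
        self.reuse = 0        # opens satisfied locally since the last privilege revocation
        self.master = None    # None: peer mode; otherwise the replica we forward opens to

    def on_local_open_satisfied(self):
        self.reuse += 1

    def on_privilege_revoked(self):
        self.reuse = 0

    def where_to_open(self, elect_master):
        """Return "self" to pull and open locally, or a master replica to forward to."""
        if self.master is not None and self.reuse < self.high:
            return self.master                # stay in slave mode for now
        if self.reuse < self.low:
            self.master = elect_master()      # too little locality: defer to a master
            return self.master
        self.master = None                    # enough local demand: act as a peer again
        return "self"

# Example with hysteresis (low=3, high=5): early opens are forwarded to the elected
# master (assumed here to be N5); once local reuse reaches 5, we pull locally again.
st = AdaptiveState(low_thres=3, high_thres=5)
print(st.where_to_open(lambda: "N5"))         # -> N5 (slave mode)
st.reuse = 5                                  # pretend local demand has built up
print(st.where_to_open(lambda: "N5"))         # -> self (peer mode)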
If many slaves are predominantly inactive (such as under high locality), periodic usage notifications from the master are both unnecessary and wasteful of network bandwidth. To restrict such messages, a master starts out by never sending any usage notifications. Instead, slaves whose usage rises to low threshold explicitly request a notification from master (in a notify master() message). In response, the master sends out a usage notification, and sets its reuse count at that time as the threshold for sending subsequent notifications. Our algorithm has several properties. When replicas reuse their PVs beyond the hysteresis threshold between successive remote pulls (high locality), they all employ aggressive caching. When accesses are interleaved but some replicas see significantly more accesses than others (moderate locality), the most frequently accessed among them gets elected as the master (due to its higher reuse count) to which all other client accesses are redirected. Since master election stops at a replica with reuse count higher than the threshold, there could be multiple master mode replicas active at a time, in which case they are all peers relative to each other. When client accesses are interleaved but uniformly spread among replicas (poor locality, e.g., due to widespread contention), one 117 check locality(): if reuse < low thres, return elect master(); else if reuse < high thres, if I am slave, send notify master(reuse); else master = none; /* PEER */ return master; pull(PV, requesting replica R): ... if not PVincludes(N.PVin, reqPV), M = check locality(); if M != self, return ”try M”; send get(PV) to N; await replies to gets; ... ++ reuse; if master == self, and ++ master reuse > notify threshold, { send notify slave(master reuse) message to each neighbor in slave mode. master reuse = 0; } get(PV, requestor R): ... master = none; /* PEER */ reuse = 0; R.cachemode = PEER; send put(PV) as reply to R; put(dstPV, srcPV, R updates) from neighbor R: ... master = none; /* PEER */ wake up waiters; Figure 4.15. Adaptive Caching Algorithms. 118 elect master(): if master 6=none, return master; find replica N (in the order: parent, self, children) where N.reuse is highest or ≥ low thres; if N6=self, send elect master msg() to N; await reply; else master = self; /* MASTER */ return master; elect master msg() from replica R, forwarded by neighbor N: M = elect master(); N.cachemode = (M==N) ? MASTER : SLAVE; send elect master reply(M) to R; elect master reply(M): master = M; if M == self, notify thres = ∞; notify slave(master usage) from neighbor R: reject unless I’m SLAVE; if reuse > master usage/2, master = none; /* PEER */ else reuse = 0; forward notify slave(master usage) to neighbors N6=R where N.cachemode == SLAVE; notify master(slave usage) from slave replica N: if master reuse ≥ low thres, send notify slave(master reuse) message to each neighbor in slave mode. notify thres = master reuse; master reuse = 0; Figure 4.16. Master Election in Adaptive Caching. The algorithms presented here are simplified for readability and do not handle race conditions such as simultaneous master elections. 119 of the replicas gets elected as master based on its past reuse and remains that way until some asymmetry arises in use counts. This is because, once a node becomes master, its master status is not revoked by others until their local usage rises from zero to low threshold. The rise of slaves’ reuse counts is prevented by the master’s usage notifications. 
Thus, frequent oscillations between master-slave and peer modes are prevented under all conditions. However, our algorithm does not always elect one of the contenders as master unless there is significant asymmetry among replica accesses. Figure 4.17 illustrates the algorithm in the context of accesses to file F2 of Figure 4.5 when employing a low hysteresis threshold of 3 and a high of 5. All replicas start with a zero reuse count. In this example, they all get write session requests with close-to-open consistency (i.e., open(WR, hard, 0, ∞)) that require them to pull on each open. When the replicas at nodes N2 and N4 get an open, they send out election requests to their parent N3 (Figure 4.17a), since among neighbors with equal counts, the parent is chosen to be master. N5 becomes the master, to which N1 and N4 forward one open each, while N2 forwards three. At the third open, N2 solicits a usage notification from N5 (Figures 4.17b and c). Since N2 contributed more than half of the 5 opens at N5, it switches to peer mode and pulls the PV locally (Figure 4.17d) for the next open. When N4 issues an open to N5, N5’s use count has already dropped to 0, so it suggests that N4 try N3. When N1, N3 and N4 issue uniformly interleaved open requests, they all elect N2 as their master due to its recent high activity (Figure 4.17e). However, this time they forward 2, 2, and 3 opens respectively to N2, none of which contributes significantly to the 7 opens at N2. Hence they keep resetting their local use counts in response to N2’s slave notifications and stay as slaves, even if N2 stops getting further opens from local clients (Figure 4.17f). Ideally, one of them should be elected as master instead of N2, but our algorithm does not achieve that.

Figure 4.17. Adaptive Caching of file F2 of Figure 4.5 with hysteresis set to (low=3, high=5). In each figure, the replica in master mode is shown darkly shaded, peers are lightly shaded, and slaves are unshaded. (a) N2, N4 elect N5 as master. (b) Moderate locality at N2. (c) N5 notifies slaves. (d) N2 switches to peer mode. (e) N1, N3, N4 elect N2 as master. (f) Low locality.

4.8 Update Propagation

The goal of update propagation algorithms is to ensure that each update issued by a client on a data replica is eventually applied exactly once at all replicas, while satisfying dependencies and ordering constraints. In other words, an update’s effect must not be duplicated or permanently lost anywhere, except due to other updates. If total ordering is also imposed, an update must have the same effect at all replicas. In this section, we describe Swarm’s update propagation mechanism. Swarm propagates updates (in the form of modified file pages or operational update packets) along the parent-child links of a replica hierarchy via put messages. For example, in the replica hierarchy shown in Figure 4.6d, updates to file F2 at node N2 reach node N4 after traversing N1 and N3.
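Because the replicas of a file form a tree, propagating an update is simply a matter of forwarding it over every parent-child link except the one it arrived on; in a stable hierarchy each replica therefore receives it exactly once. The sketch below illustrates this with the F2 hierarchy of Figure 4.6d (Python, for illustration only; the adjacency map and function names are ours).

# Parent-child adjacency of the F2 replica hierarchy in Figure 4.6d.
NEIGHBORS = {
    "N5": ["N3"],
    "N3": ["N5", "N1", "N4"],
    "N1": ["N3", "N2"],
    "N2": ["N1"],
    "N4": ["N3"],
}

def propagate(origin):
    """Flood an update over the tree, yielding (hop_from, hop_to) pairs, i.e., one
    put message per link, never sending back toward the link it arrived on."""
    frontier = [(origin, n) for n in NEIGHBORS[origin]]
    while frontier:
        src, dst = frontier.pop()
        yield (src, dst)
        frontier += [(dst, n) for n in NEIGHBORS[dst] if n != src]

# An update issued at N2 reaches N4 via N1 and N3, touching every replica once.
print(list(propagate("N2")))
# [('N2', 'N1'), ('N1', 'N3'), ('N3', 'N4'), ('N3', 'N5')]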
When a Swarm client issues an update at a replica (called its origin), it is given a node-local version number and a timestamp based on local clock time. An update’s origin replica and its version number are used to uniquely identify it globally. Operational updates are stored in their arrival order at each node in a persistent update log and propagated in that order to other nodes. Absolute updates that have semantic dependencies associated with them are also logged with their old and new values, as they may need to be undone to preserve dependencies (as explained below). 121 To ensure at-least-once update delivery to all replicas, a Swarm replica keeps local client updates in its update log in tentative state and propagates them via its parent until a custodian acknowledges their receipt. The custodian switches the updates to saved state and takes over the responsibility to propagate them further. The origin replica can then remove them from its local log. To prevent updates from being applied more than once at a replica, Swarm can be configured to employ a mechanism based on loosely synchronized clocks or version vectors, as we explain below. 4.8.1 Enforcing Semantic Dependencies Swarm enforces ordering constraints by tagging each update with the semantic dependency and ordering types of its issuing session. With each update, Swarm tracks the previous updates in the local log on which it has a causal and/or atomic dependency. Replicas always log and propagate updates in their arrival order that also satisfies their dependencies. Atomically grouped updates are always propagated and applied together. 4.8.2 Handling Concurrent Updates When applying incoming remote updates, a Swarm replica checks if independent updates unknown to the sender were made elsewhere, indicating a potential update conflict. Conflicts are possible when clients employ concurrent WR mode sessions. For ‘serially’ ordered updates, Swarm avoids conflicts by forwarding them to the root custodian to be applied sequentially. For ‘unordered’ updates, Swarm ignores conflicts and applies updates in their arrival order, which might vary from one replica to another. For ‘totally’ ordered updates, Swarm relies on a conflict resolution routine to impose a common global order at all replicas. This routine must apply the same update ordering criterion at all replicas to ensure a convergent final outcome. Swarm provides default resolution routines that reorder updates by their origination timestamp or by their arrival order at the root custodian. Swarm invokes the application plugin’s merge() operation (see Table 4.2) as the resolution routine. The application can employ one of Swarm’s default routines or supply its own merge() routine, that exploits application knowledge (such as commutativity) to resolve conflicts more efficiently. 122 If an incoming update could be applied or merged into the local copy successfully, Swarm adds the update to its local log and re-establishes dependencies for further propagation. If the conflict could not be resolved (e.g., if the update got overridden by a ‘later’ update due to reordering), Swarm rolls back and switches the update to rejected state and does not propagate it further. It recursively rolls back and rejects all other updates in its atomicity group as well as those causally dependent on them. 
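A minimal sketch of this per-update bookkeeping is shown below (Python, with invented names such as UpdateLog; these are not Swarm's data structures). Each update carries its origin replica, an origin-local version number and an origination timestamp, starts out tentative, becomes saved once a custodian acknowledges it, and is rejected — together with its atomic and causal dependents — if conflict resolution overrides it.

import itertools, time
from dataclasses import dataclass, field

TENTATIVE, SAVED, REJECTED = "tentative", "saved", "rejected"

@dataclass
class Update:
    origin: str
    version: int                 # node-local version number assigned at the origin
    timestamp: float             # origination time from the local clock
    payload: bytes
    state: str = TENTATIVE
    depends_on: list = field(default_factory=list)   # uids of causal/atomic predecessors

    @property
    def uid(self):
        return (self.origin, self.version)           # globally unique identity

class UpdateLog:
    def __init__(self, node):
        self.node = node
        self._next = itertools.count(1)
        self.entries = []        # arrival order == propagation order

    def append_local(self, payload, depends_on=()):
        u = Update(self.node, next(self._next), time.time(), payload,
                   depends_on=list(depends_on))
        self.entries.append(u)
        return u

    def custodian_acked(self, uids):
        for u in self.entries:
            if u.uid in uids and u.state == TENTATIVE:
                u.state = SAVED   # a custodian now owns further propagation

    def reject(self, uid):
        # Reject the update and, recursively, everything that depends on it.
        for u in self.entries:
            if (u.uid == uid or uid in u.depends_on) and u.state != REJECTED:
                u.state = REJECTED
                self.reject(u.uid)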
If an update is rejected after its causally succeeding updates have already been propagated to other nodes in the hierarchy, they will be undone when the overriding update that caused the initial rejection is propagated to those nodes. This is due to our requirement of a common conflict resolution criterion at all replicas. The Swarm replica informs the identities of rejected and accepted updates to their sending neighbor in a subsequent message. The sender can undo those updates immediately and recursively inform their source, or wait for their overriding updates to arrive. 4.8.3 Enforcing Global Order Swarm provides two approaches to enforcing a global order among concurrent updates, namely, a centralized update ‘commit’ approach similar to that of Bayou [54], and a timestamp-based approach based on loosely synchronized clocks. Centralized Commit: Swarm’s commit-based conflict resolution routine relies on a central node (the root custodian) to order updates similar to Bayou [54]. An update is considered committed if it is successfully applied locally at the root custodian. The update is aborted if the root has received and rejected it. Otherwise, it is tentative. When a Swarm replica receives an update from a client, it must remember the update locally until it knows of the update’s commit or abort status. At a replica, tentative updates are always reordered (i.e., undone and redone) after committed updates that arrive later. Thus, a committed update never needs to be reordered. A client can request a session’s updates to be synchronously committed to ensure a stable, reliable result that is never lost, by specifying a ‘sync’ consistency option when opening the session. However, this incurs reduced performance. Centralized commit requires that absolute updates be logged with 123 old and new values as they may need to be reapplied based on the final commit order at the root. This is expensive for whole file access. Timestamp-based ordering: Swarm’s timestamp-based conflict resolution routine requires nodes to loosely synchronize their clocks using a protocol such as NTP. Unlike the centralized routine, each replica reorders updates locally based on their origination timestamps. Thus, an update must be reordered (i.e., undone and redone) whenever updates with earlier timestamps arrive locally. However, this ordering scheme does not require absolute updates to be logged with old and new values (except to preserve dependencies), as they are applied to each replica at most once, and need not be undone. 4.8.4 Version Vectors To identify the updates to propagate among replicas, Swarm maintains a version vector (VV) at each replica that indicates the version of the latest update originating in every replica and incorporated locally. A VV concisely describes a replica’s contents for the purpose of propagation. When a replica R connects to a new parent P, they exchange their VVs and compare them to determine the updates to propagate to each other. In general, the size of a VV is proportional to the total number of replicas, which could be very large (thousands) in Swarm. Swarm needs to enforce consistency on data units as small as a file block when absolute updates are employed, and at whole-file granularity in case of operational updates. This means potentially maintaining a VV for each file block, which is very expensive. Fortunately, Swarm’s organization of replicas in a tree topology enables several optimizations that significantly reduce the overhead of managing VVs in practice. 
Since a replica propagates updates to its neighbors only once and a tree topology provides only one path between any two replicas, each replica receives remote updates only once in a stable hierarchy. Thus, a replica needs to exchange VVs only when it reconnects to a new parent, not for every update exchange. When replica sites employ loosely synchronized clocks, propagating absolute updates (used for page-based data structures and whole file access) does not require VVs for reasons explained below. 124 4.8.5 Relative Versions To identify the updates to propagate and to detect update conflicts among already connected replicas, Swarm maintains neighbor-relative versions which are lightweight compared to VVs. A replica gives a node-local version number (called its relative version) to each incoming update from its neighbors and uses it as the log seqno (LSN) to log the update locally. It also remembers the sending neighbor to inform it if the update gets rejected later due to a conflict. For each neighbor, it maintains the latest local version sent and neighbor version received. Relative versions are compact and effective for update propagation in the common case where replicas stay connected. When a replica R reconnects to a new parent replica P, R must first reconcile its contents with P to a state from which it can proceed to synchronize using relative versions. When absolute updates are employed, reconciliation simply involves overwriting R with P’s contents if they are newer (or semantically merging their contents by invoking the application plugin). When replicas’ clocks are loosely synchronized, their timestamps are sufficient to determine which of them is newer, obviating the need for VVs. With operational updates (to the whole file as a replication unit), reconciliation is more involved. Both replicas must compare their VVs to identify the updates missing from each other’s copy and to recompute their relative sent and received versions. Subsequently when sending updates to its neighbor, each replica must skip updates already covered by the neighbor’s VV. If neither replica has all the updates in its local log to bring its neighbor up to date (i.e., if the VVs are ‘too far apart’), they must resort to a full file transfer. In a full transfer operation, the child replica obtains the entire file from the parent and then reapplies any local updates unknown to parent, modifying its VV accordingly. The child replica could also reconcile with its new parent by always reverting to the parent’s version via full transfer. In that case, the child need not maintain VVs and does not permanently miss any updates. But without VVs, it must issue a pull from parent if its clients require a monotonic reads guarantee (defined in Section 3.5.2), as the new parent may not have seen updates already seen by the child. Therefore VVs need not be maintained if an application requires the latest data, or if temporarily seeing an older version of data is acceptable. On the other hand, a full file transfer operation 125 is expensive for large files such as database replicas, and could trigger cascading full transfers in the replica’s subtree. A variety of version vector maintenance algorithms have been proposed in the context of wide area replication [18, 62, 56], that employ different techniques to prune the size of VVs as replica join and leave the network. Swarm can employ any of those algorithms. 
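The neighbor-relative versioning just described can be sketched as follows (an illustrative Python sketch under our own names, not the prototype's code): every logged update gets a local log sequence number (LSN), and for each neighbor the replica tracks the highest local LSN it has pushed and the highest LSN it has received, so that it can compute exactly which log suffix to send without echoing updates back to their sender.

class RelativeVersions:
    def __init__(self):
        self.next_lsn = 1
        self.log = {}                # local LSN -> (update, sending neighbor or None)
        self.sent_upto = {}          # neighbor -> highest local LSN already pushed to it
        self.recv_upto = {}          # neighbor -> highest LSN of theirs we have seen

    def log_update(self, update, from_neighbor=None, neighbor_lsn=None):
        lsn = self.next_lsn
        self.next_lsn += 1
        self.log[lsn] = (update, from_neighbor)
        if from_neighbor is not None:
            self.recv_upto[from_neighbor] = max(
                self.recv_upto.get(from_neighbor, 0), neighbor_lsn)
        return lsn

    def updates_to_push(self, neighbor):
        """Everything logged after what the neighbor has already received from us,
        skipping updates that came from that neighbor in the first place."""
        start = self.sent_upto.get(neighbor, 0)
        return [(lsn, upd) for lsn, (upd, src) in sorted(self.log.items())
                if lsn > start and src != neighbor]

    def mark_pushed(self, neighbor, upto_lsn):
        self.sent_upto[neighbor] = max(self.sent_upto.get(neighbor, 0), upto_lsn)

# Example: an update arriving from the parent is logged and later pushed only to
# the children, never echoed back to the parent it came from.
rv = RelativeVersions()
rv.log_update("u1", from_neighbor="parent", neighbor_lsn=7)
rv.log_update("u2")                       # a local client update
print(rv.updates_to_push("parent"))       # [(2, 'u2')]
print(rv.updates_to_push("child-A"))      # [(1, 'u1'), (2, 'u2')]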
However, we do not maintain VVs in our prototype implementation of Swarm, since none of the application scenarios that we evaluate require them. We do not discuss VV algorithms in this dissertation, as they are not the focus of this work. Though version vectors enable Swarm to provide session guarantees, employing neighbor-relative versioning schemes and relying on loosely synchronized clocks makes many of Swarm’s update propagation algorithms simpler and more efficient. 4.9 Failure Resilience In this section, we discuss Swarm’s resilience to node and network failures. We assume a fail-stop model wherein a node either behaves correctly or stops responding, but does not maliciously return incorrect data. We assume both transient and permanent failures. Swarm nodes use TCP for reliable, ordered message delivery over the wide area, and handle network failures as explained in Section 4.6.4. They employ message timeouts to recover from unresponsive nodes. Since Swarm employs recursive messaging, the reply to a Swarm node’s request may take time proportional to the number of message hops, which are hard to determine beforehand. In our prototype we set the reply timeout to 60 seconds, which we found to be adequate for a variety of applications. A Swarm server recovers from unresponsive clients by terminating their access sessions (and aborting atomic updates) if their connection is not reestablished within a session timeout period (60 seconds by default). Swarm client library periodically checks its server connections to make sure that its sessions remain valid. A Swarm server ensures the internal consistency of its local state across node crashes by storing it in transactionally consistent persistent data structures implemented as BerkeleyDB databases [67]. 126 Swarm heals partitions in a replica hierarchy caused by node failures as explained in Section 4.6.1. It employs leases to recover access privileges from unresponsive child replicas as explained in Section 4.7.6. Swarm treats custodian failures differently from other replica failures. When a custodian becomes inaccessible, other replicas bypass it in the hierarchy, but custodians continually monitor its status until it recovers. However, revoking the custodian status of a replica after its permanent failure requires a majority vote by other custodians, and it must currently be initiated by the administrator. Swarm’s update propagation mechanism is resilient to the loss of updates due to node crashes. When a replica permanently fails, only those updates for which it is the origin replica and which have not reached a custodian may be permanently lost. Replicas recover from failure of intermediate parent replicas by repropagating local updates to a custodian via their new parent. 4.9.1 Node Churn Swarm can handle node churn (nodes continually joining and leaving Swarm) well for the following reasons. A majority of nodes in a replica hierarchy are at the leaf level, and their churn does not adversely affect Swarm’s performance. Since Swarm nodes join a replica hierarchy at the leaf level, replicas on stable nodes (with less churn) tend to remain longer and thus become interior replicas. When a replica node fails, its child replicas elect a new parent lazily in response to local accesses instead of electing a new parent immediately. Swarm tends to heal the disruption caused by an interior replica leaving the system in a localized manner. 
This is because lookup caches absorb most searches by child replicas for new parents, avoiding hierarchical lookups. Swarm's persistent node and lookup caches together enable replicas to quickly learn of other nearby replicas by exploiting past information. Unless nodes have a very short life span in Swarm (the time between joining Swarm for the first time and leaving it for the last time), the location and link quality information in their persistent caches remains valid in spite of nodes continually leaving and rejoining the Swarm network. Finally, the highest allowed replica fanout can be configured at each Swarm node based on its expected stability and available resources. For instance, the fanout of a Swarm server running on a mobile device must be set very low (e.g., 0 or 1) to prevent it from becoming a parent replica, while a corporate server can have a high fanout (4 or more).

4.10 Implementation

We have implemented a prototype of Swarm that runs on FreeBSD and Linux. Swarm servers are user-level daemon processes that are accessed by applications via a Swarm client library. Since communication dominates computation in Swarm, a Swarm server or client library is implemented as a single-threaded, non-blocking, event-driven state machine. In response to each external event (such as a message arrival or a timeout), Swarm runs its event handler from start to finish atomically. If the handler needs to block for another event, Swarm saves its state in system data structures and continuations, and restarts the handler when the awaited event happens. This organization allows a Swarm server to efficiently handle a large number of concurrent events with little overhead.

A Swarm server stores all its persistent state (including file data and metadata) in a single directory tree in the local file system. Each server can be configured with the maximum amount of local storage it can consume. It uses that storage to host permanent copies of files as well as to cache remote files. A Swarm server maintains its persistent metadata in four data structures, all implemented as recoverable BerkeleyDB databases. The region directory stores file metadata and consistency state keyed by SWID. The segment directory stores the state of file segments at the granularity at which they are shared by Swarm nodes. The update log stores updates in their arrival order, keyed by log sequence number. The node cache remembers the connectivity and link quality to remote nodes contacted earlier. In addition to these, the server also maintains several in-memory data structures such as a table of open sessions, a cache of TCP connections, and the SWID lookup cache. A Swarm server updates a local file atomically only after completely importing incoming contents, so that a crash in the middle of a file transfer does not corrupt file contents. When a Swarm server recovers from a crash, all open client sessions become invalid and all updates by atomic sessions are undone.

Swarm clients normally incur an RPC for each Swarm API invocation to their local Swarm server. To avoid the high cost of RPCs in the common case where a local client repeatedly accesses the same data item, a Swarm server exposes its session table to local clients via shared memory. A Swarm client can reopen recently closed Swarm sessions locally without issuing an RPC to the local server.
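The persistent metadata layout described above maps naturally onto BerkeleyDB's transactional C API. The sketch below shows one plausible way the four databases could be opened in a recoverable environment; the environment path, database file names, and access-method choices are our assumptions, not details taken from the Swarm sources.

```c
/* Sketch (assumptions, not Swarm's code) of opening the four persistent
 * metadata structures as recoverable BerkeleyDB databases. Error handling
 * is elided for brevity. */
#include <db.h>
#include <stdlib.h>

static DB *open_db(DB_ENV *env, const char *file, DBTYPE type)
{
    DB *db;
    if (db_create(&db, env, 0) != 0)
        abort();
    /* DB_AUTO_COMMIT wraps the open in a transaction, making it recoverable. */
    if (db->open(db, NULL, file, NULL, type, DB_CREATE | DB_AUTO_COMMIT, 0644) != 0)
        abort();
    return db;
}

int main(void)
{
    DB_ENV *env;
    db_env_create(&env, 0);
    /* Transactions plus write-ahead logging give crash recovery for metadata. */
    env->open(env, "/var/swarm/meta",
              DB_CREATE | DB_RECOVER | DB_INIT_TXN | DB_INIT_LOG |
              DB_INIT_LOCK | DB_INIT_MPOOL, 0644);

    DB *region_dir  = open_db(env, "region.db",  DB_BTREE); /* keyed by SWID     */
    DB *segment_dir = open_db(env, "segment.db", DB_BTREE); /* per-segment state */
    DB *update_log  = open_db(env, "updates.db", DB_RECNO); /* keyed by LSN      */
    DB *node_cache  = open_db(env, "nodes.db",   DB_BTREE); /* link quality      */

    /* ... the server's event loop would run here ... */

    node_cache->close(node_cache, 0);
    update_log->close(update_log, 0);
    segment_dir->close(segment_dir, 0);
    region_dir->close(region_dir, 0);
    env->close(env, 0);
    return 0;
}
```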
As our evaluation presented in the next chapter shows, this local session reuse significantly improves latency for fine-grained access to Swarm-based persistent data structures when the client exhibits locality. Our Swarm prototype does not implement full-sized version vectors or causal dependency tracking.

4.11 Discussion

Swarm servers are reactive: they cache objects locally in response to client accesses. While Swarm provides mechanisms for creating cached copies and maintaining their consistency, applications must determine where objects are cached or replicated. Applications can also build object prefetching schemes suited to their access patterns using the Swarm API.

4.11.1 Security

Our Swarm design is based on the assumption that Swarm servers and clients are mutually trustworthy. It does not provide the security and authentication mechanisms necessary for communication between untrusted parties over untrusted channels. Though we leave their design for future research, we outline a few techniques here. Communication over untrusted channels can be accomplished by employing encryption-based communication protocols between Swarm nodes, such as secure TCP sockets, or by tunneling over SSH channels. Swarm nodes can mutually authenticate themselves using well-known public-key encryption mechanisms. However, managing their keys in a decentralized and scalable manner requires further exploration. A recently proposed scheme, called Self-certifying pathnames [33], provides a novel way to use public-key encryption mechanisms to support decentralized user authentication in globally distributed file systems spanning multiple administrative domains.

4.11.2 Disaster Recovery

A Swarm server stores all critical data and metadata in a local file system directory tree. Hence, each Swarm server's data can be archived for disaster recovery and restored in the same way existing local file systems are archived.

4.12 Summary

In this chapter, we described the design of Swarm, a wide area file store that employs configurable consistency. We first described the design goals for Swarm and outlined its basic architecture and expected usage for building distributed applications. We then described its major subsystems, including naming and location, replication, configurable consistency management, and update propagation. In the next chapter, we present our evaluation of Swarm in the context of the four focus applications we presented in Chapter 2. We describe their implementation on Swarm and analyze their performance characteristics.

CHAPTER 5
EVALUATION

In the previous chapter, we showed how configurable consistency (CC) can be implemented in a peer replication environment spanning variable-quality networks. We presented the design of Swarm, a data replication middleware that provides caching with configurable consistency to diverse applications. In this chapter, we evaluate the effectiveness of Swarm's CC implementation in meeting the diverse consistency and performance needs of the four representative applications mentioned in Section 2.3. For each application, we show how its consistency requirements can be precisely expressed in the CC framework. We also show how the resulting application's performance (i.e., latency, throughput and/or network utilization) when using Swarm improves by more than 500% relative to a client-server implementation without caching, and stays within 20% of optimized implementations where available.
Our thesis claims that configurable consistency mechanisms can support the wide area caching needs of diverse applications effectively. Hence, through this evaluation, we demonstrate the following properties of Swarm's configurable consistency implementation, which are important to support wide area caching effectively:

1. Flexible Consistency: (a) Swarm can enforce the diverse consistency semantics required by different applications. Each semantics has its associated performance and availability tradeoffs. (b) Swarm can enforce different semantics on the same data simultaneously.

2. Contention-aware Replication: (a) By adopting configurable consistency, Swarm exploits the available data locality effectively while enforcing a variety of consistency semantics. (b) Swarm's contention-aware replication management mechanism outperforms both aggressive caching and RPCs (i.e., no caching) at all levels of contention when providing strong consistency. Thus, unlike other systems, Swarm can support strongly consistent wide area caching effectively regardless of the degree of contention for shared data.

3. Network Economy: Swarm leverages its proximity-aware replica management to utilize network capacity efficiently for its consistency-related communication and to hide variable network delays from applications.

4. Failure Resilience and Scalability: Swarm performs gracefully and continues to ensure consistency in spite of replicas continually joining and abruptly leaving the Swarm network. Swarm preserves its network economy and failure resilience properties even in networks with hundreds of replicas.

Different subsets of these properties are important for different applications. Since the configurable consistency implementation presented in this thesis has all of these properties, it supports the wide area caching needs of a wide variety of applications effectively. In Section 5.2, we show that Swarmfs, a peer-to-peer file system built on top of Swarm, can exploit locality more effectively than existing file systems (property 2a), while supporting a variety of file sharing modes. In particular, it supports file access semantics ranging from exclusive file locking to close-to-open to eventual consistency on a per-file basis (property 1a). Swarm's proximity-aware replica management enables Swarmfs to provide low latency file access over networks of diverse quality (property 3). In Section 5.3, we show that SwarmProxy, a synthetic enterprise service using Swarm for strongly consistent wide area proxy-caching, can significantly improve its responsiveness to wide-area clients using CC's contention-aware replication control mechanism (property 2b). This mechanism restricts caching to a few sites when contention for shared data is high to limit synchronization overhead, and reverts to aggressive caching when the contention subsides. This adaptive mechanism provides the best performance of both aggressive caching and RPCs under all levels of contention. In Section 5.4, we evaluate how SwarmDB, a wrapper library that augments the BerkeleyDB database library with replication support using Swarm, performs using five different consistency requirements ranging from strong (appropriate for a conventional database) to time-bounded eventual consistency (appropriate for many directory services) (properties 1a and 1b). We find that even slightly relaxing consistency requirements significantly improves throughput, provided the application can tolerate the weaker semantics.
We also show that when Swarm is used for sharing files indexed with SwarmDB, it conserves bandwidth by serving files from nearby sources (property 3) to hundreds of users while they continually join and leave the sharing network (property 4). Finally, in Section 5.5, we demonstrate how SwarmCast, a real-time streaming multicast application, linearly scales its delivered event throughput to a large number of subscribers by using Swarm servers that automatically form a multicast network to relay events. A single Swarm server in our unoptimized prototype is CPU-bound, and thus delivers only 60% of the throughput of a hand-optimized relay server implementation. However, the aggregate throughput delivered to 120 subscribers scales linearly with as many as 20 Swarm servers, an order of magnitude beyond what a single hand-coded relay server can deliver due to its limited outgoing network bandwidth. Moreover, when using Swarm, the SwarmCast application need not deal with networking or buffering, which are the tricky aspects of implementing multicast. Thus, Swarm can be leveraged effectively even for a traditional message-passing application such as real-time content dissemination.

Each of these applications uses Swarm's replication and consistency support in very different ways. Swarm's flexible API enables each of these applications to manipulate shared data in its natural idiom, e.g., shared files, shared objects, or a database. Swarm's configurable consistency enables each application to express and enforce the consistency semantics appropriate for its sharing needs. Swarm's aggressive peer replication enables each application to utilize network bandwidth efficiently and to fully exploit available locality, which results in performance improvements over alternate implementations where available. We begin by describing our experimental environment in the next section. In subsequent sections, we present each of the representative applications in turn, describing how we implemented it using Swarm before we present our evaluation.

5.1 Experimental Environment

For all our experiments, we used the University of Utah's Emulab Network Testbed [76]. Emulab allows us to network a collection of PCs emulating arbitrary network topologies by configuring per-link latency, bandwidth, and packet loss rates. The PCs have 850MHz Pentium-III CPUs with 512MB of RAM. Depending on the experimental requirements, we configure them to run FreeBSD 4.7 (BSD), Redhat Linux 7.2 (RH7.2), or RedHat 9 (RH9). In addition to the emulated experimental network, each PC is connected to a separate 100Mbps control LAN which we use for logging experimental output. Swarm servers run as user-level processes and store files in a single directory in the local file system, using the SWID as the filename. We set the replica fanout on each server to a low value of 4 to induce replica hierarchies up to 4 or 5 levels deep and to magnify the effect of deep hierarchies on access latencies. For the WAN experiments, we frequently use Emulab's 'delayed LAN' configuration to emulate nodes connected via point-to-point links of diverse bandwidths and delays to a common Internet routing network. Figure 5.1 illustrates a 6-node delayed LAN, where each link is configured with 1Mbps of bandwidth and a 10ms one-way propagation delay, resulting in a 40ms roundtrip latency between any two nodes.
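The 40ms figure is simple arithmetic over the emulated topology: a one-way path between two nodes crosses the sender's 10ms access link and the receiver's 10ms access link, and a round trip crosses each of them twice:

\[
\mathrm{RTT} = 2 \times (10\,\mathrm{ms} + 10\,\mathrm{ms}) = 40\,\mathrm{ms}.
\]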
Clients and servers log experimental output to an NFS server over the control network only at the start or finish of experiments to minimize interference.

Figure 5.1. Network topology emulated by Emulab's delayed LAN configuration. The figure shows each node having a 1Mbps link to the Internet routing core, and a 40ms roundtrip delay to/from all other nodes.

5.2 Swarmfs: A Flexible Wide-area File System

Efficient support for wide area write-sharing of files enables more applications than the read-only or rarely write-shared files supported by traditional file systems [38, 33, 57, 62, 47]. For instance, support for coherent write-sharing allows fine-grain collaboration between users across the wide area. Although eventual consistency provides adequate semantics and high availability when files are rarely write-shared, coherent write-sharing requires close-to-open or strong consistency. We built a peer-to-peer distributed file system called Swarmfs that provides the required flexibility in consistency. Swarmfs is implemented as a file system wrapper integrated into each Swarm server. The wrapper exports a native file system interface to Swarm files at the mount point /swarmfs via the local operating system (currently Linux/FreeBSD). It provides a hierarchical file name space by implementing directories within Swarm files using the Coda file system's directory format [38]. Swarmfs synchronizes directory replicas via operational updates. We refer to a Swarm server with the wrapper enabled as a Swarmfs agent. A Swarmfs agent interacts with the CodaFS in-kernel module to provide native Swarmfs file access to local applications. Figure 4.4 illustrates this architecture. It could also be extended to export the NFS server interface to let remote clients mount the Swarmfs file system.

The root directory of a Swarmfs path is specified by its globally unique Swarm file ID (SWID). Thus, an absolute path name in Swarmfs looks like this: "/swarmfs/swid:0xabcd.2/home/me/test.pdf". Here, "swid:0xabcd.2" specifies that the root directory is stored in a Swarm file with SWID "0xabcd" and generation number "2" (recall that a SWID's generation number is incremented each time it is reassigned to a new file). Each Swarmfs agent is an independent file server and can be made to mount a different directory as its root, which is used as the starting point for pathname lookup when its local clients omit the SWID component from their absolute path names (e.g., "/swarmfs/home/me/test.pdf" on that machine). Since numeric SWID components are cumbersome to handle, users can avoid dealing with them by creating symbolic links with more intuitive names. Swarmfs is thus a federation of autonomous peer file servers that provide a decentralized but globally uniform file name space. Each server can independently create files and directories locally, cache remotely created files, and migrate files to other servers. Unlike NFS, Swarmfs' absolute path names are globally unique because they use SWIDs, i.e., an absolute path name refers to the same file at all servers regardless of file migration. By making use of Swarm, Swarmfs provides a unique combination of features not found in existing file systems:

1. Network Economy: Unlike typical client-server file systems (e.g., NFS [64], AFS [28] and Coda [38]), it exploits Swarm's proximity-aware peer replica networking to access files from geographically nearby copies, similar to Pangaea [62].
2. Customizable Consistency: Unlike peer-to-peer file systems such as Pangaea and Ficus [57], Swarmfs makes use of Swarm's configurable consistency framework to support a broader range of file sharing semantics. In particular, it can provide, on a per-file-access basis: the strong consistency semantics of Sprite [48], the close-to-open consistency of AFS [28] and Coda [38], the (weak) eventual consistency of Coda, Pangaea [62], and NFS [64], and the append-only consistency of the WebOS file system [74]. Swarmfs employs close-to-open consistency by default, but allows users to override it on a per-file-session basis. Swarmfs applies a directory's consistency settings automatically to subsequently created files and subdirectories to facilitate easy administration of file consistency semantics.

3. Uniform, Decentralized Name Space: Unlike many existing file systems (AFS, Coda, Pangaea, Ficus) that enforce a single global file system root, Swarmfs is a federated file system similar to NFS and SFS [45]. Swarmfs agents can be configured to mount the same root everywhere, or to share only subtrees of their name space. For example, a user can expose a project subtree of his home directory to be mounted by remote colleagues without exposing the rest of his local Swarmfs file name space. This feature is enabled by the use of SWIDs that are both location-transparent and globally unique.

4. Transparent File Migration: Unlike many file systems including SFS, Swarmfs allows files to be transparently migrated between servers (i.e., permanently moved without disrupting access). This facilitates the use of Swarmfs agents to provide ubiquitous transient storage for mobile users with simple storage administration.

Client applications can perform Swarm-specific operations on individual files and directories, such as viewing or modifying consistency settings or performing operational updates, via Unix ioctl() system calls. When clients access files under /swarmfs, the local operating system invokes the CodaFS in-kernel file system module, which in turn invokes the user-level Swarmfs agent via upcalls on a special FIFO device (just like Coda's client-side implementation). The CodaFS module makes these upcalls only for metadata and open/close operations on files. Hence a Swarmfs agent can mediate client accesses to a file only at those times. It invokes Swarm to obtain a consistent copy of the Swarm file in the local file system, and supplies its name (inode number) to the kernel module. The kernel module performs file reads and writes directly on the supplied file at local file system speed, bypassing Swarmfs.

5.2.1 Evaluation Overview

We evaluate Swarmfs under three different usage scenarios. First, in Section 5.2.2 we consider a personal file system workload with a single client and server connected by either a high-speed LAN or a slow WAN link. This study shows that the inherent inefficiencies of layering Swarmfs over the generic Swarm data store have little impact on its baseline performance relative to a local file system when there is no sharing. Our second and third experiments focus on sharing files across a WAN. In Section 5.2.3 we consider the case where shared files are accessed sequentially (e.g., by a roaming user or by collaborators) on nodes spanning continents. This study shows that Swarm efficiently enforces the close-to-open semantics required for this type of file sharing while exploiting nearby replicas for near-local latency file access, unlike Coda.
Finally, in Section 5.2.4 we consider the case where a shared RCS repository is accessed in parallel by a collection of developers spread across continents. This study shows that Swarm's lock caching not only provides correct semantics for this application, but can also provide order-of-magnitude improvements in latency over client-server RCS by caching files close to frequent sharers.

5.2.2 Personal Workload

We first study the baseline performance of Swarmfs and several other representative distributed file systems in a LAN-based single-server, single-client setup with little sharing. In this experiment, a single client runs Andrew-tcl [62], a scaled-up version of the Andrew benchmark [28], starting with a source tree extracted into a directory on the client. Thus, the client starts with a warm cache. Andrew-tcl is a personal software development workload consisting of five phases, each of which stresses a different aspect of the system (e.g., read/write performance and metadata/directory operation performance): (1) mkdir: creates 200 directories, (2) copy: copies the 745 Tcl-8.4 source files with a total size of 13MB from one directory to another, (3) stat: performs "ls -lR" on the files in both directories to read their metadata/attributes, (4) grep: performs "du" and "grep" on the files, and (5) compile: compiles the source code. As there is no inter-node sharing, the choice of consistency protocol does not significantly affect performance, so we employed Swarmfs' default close-to-open consistency.

We ran the Andrew-tcl benchmark on four file systems: the Redhat Linux 7.2 local file system, NFS, Coda, and Swarmfs. For Swarmfs, we considered two modes: scache, where the client is configured to merely cache remotely homed files, and speer, where files created on the client are locally homed in full peer-to-peer mode. We averaged the results from five runs per system, which resulted in a 95% confidence interval of 3% for all the numbers presented. Figure 5.2 shows the performance of each system when the client and server are separated by a 100Mbps LAN, broken down by the time spent in each phase. Execution time in all cases is dominated by the compute-bound compile phase, which is comparable in all systems. Figure 5.3 focuses on the other phases. As expected, the best performance is achieved running the benchmark directly on the client's local file system. Among the distributed file systems, NFS performs best, followed by speer, but all distributed file systems perform within 10% of one another.

Figure 5.2. Andrew-Tcl Performance on 100Mbps LAN.

Speer performs better than scache during the data-intensive copy and mkdir phases of the benchmark, because files created by the benchmark are homed on the client node. Coda's file copy over the LAN takes twice as long as the scache copy due to Coda's eager flushes of newly created files to the server. In the grep and stat phases, Swarm must retrieve the metadata of 1490 files from the BerkeleyDB database where it stores them, since we only employed an in-memory cache of 500 file metadata entries. This, coupled with Swarm's unoptimized code path, results in four times the stat latency and twice the grep latency of Coda. Figure 5.4 shows the relative performance of Coda and Swarmfs when the client and server are separated by a 1Mbps, 40ms WAN link.
Coda detects that the link to the server is slow and switches to its weak connectivity mode, where it creates files locally just like peer-mode Swarmfs. Thus, both Coda and speer perform within 13% of the local Linux file system for personal file access across a WAN. In summary, for a traditional personal file access workload, Swarmfs provides performance close to that of a local file system both across a LAN and across a WAN link from a file server, despite being layered on top of a generic user-level middleware data server. Like Coda and unlike NFS, Swarmfs exploits whole-file caching to provide local performance across WAN links. In addition, Swarmfs servers can be configured as peer file servers that create files locally, resulting in double the performance of Coda for file copies in a LAN environment. Hence, Swarmfs supports traditional file sharing as well as or better than existing distributed file systems.

Figure 5.3. Andrew-Tcl Details on 100Mbps LAN.
Figure 5.4. Andrew-tcl results over 1Mbps, 40ms RTT link.

5.2.3 Sequential File Sharing over WAN (Roaming)

Our next experiment evaluates Swarmfs on a synthetic "sequential file sharing" or roaming benchmark, where we model collaborators at a series of locations accessing shared files, one location at a time. This type of file sharing is representative of mobile file access or roaming [62], and of workflow applications where collaborators take turns updating shared documents or files. To support close collaboration or near-simultaneous file access, users require tight synchronization such as that provided by close-to-open consistency semantics [48], even under weak connectivity. We compare three flavors of distributed file systems: (1) Swarmfs employing peer servers (speer), (2) Coda forced to remain in its strong connectivity mode even under weak connectivity so that it guarantees close-to-open consistency (coda-s), and (3) Coda in its native adaptive mode, where it switches to providing eventual consistency when it detects weak connectivity (coda-w). We show that Swarmfs not only provides the close-to-open semantics needed to run this benchmark correctly, but also performs better than Coda by exploiting Swarm's ability to access files from nearby replicas and by employing a symmetric consistency protocol. Although the Pangaea file system [62] also exploits nearby replicas like Swarm, it does not provide close-to-open semantics.

We run the sequential file access benchmark on the emulated WAN topology shown in Figure 5.5. The modeled topology consists of five widely distributed campuses, each with two machines on a 100Mbps campus LAN. The node U1 (marked 'home node') initially stores the 163 Tcl-8.4 source files with a total size of 6.7MB. We run Swarmfs servers and Coda clients on all nodes and a Coda server on node U1.

Figure 5.5. Network Topology for the Swarmfs Experiments described in Sections 5.2.3 and 5.2.4.

In our synthetic benchmark, clients at various nodes sequentially access files as follows. Each client modifies one source file (modify), compiles the Tcl-8.4 source tree by invoking 'make' (compile), and then deletes all object files by invoking 'make clean'
(cleanup). These operations represent isolated updates, intensive file-based computation, and the creation and deletion of a large number of temporary files. Clients on each node in each campus (in the order University (U) → ISP (I) → Corporate (C) → Turkey (T) → France (F)) perform the modify-compile-cleanup operations, one client after another. Thus, the benchmark compares Swarmfs with an existing distributed file system on traditional file operations in a distributed file access scenario.

Figure 5.6 shows the replica hierarchy that Swarm created for files in this experiment. Figure 5.7 shows the compilation times on each node. As reported in the previous section, both Swarmfs and Coda perform comparably when the home file server is on the same local LAN (i.e., on nodes U1 and U2). Also, Coda in strong connectivity mode enforces close-to-open semantics by pushing updates synchronously to the server. However, since Swarm creates efficient replica hierarchies and acquires files from nearby replicas (e.g., from another node on the same LAN, or, in the case of France, from Turkey), it outperforms Coda's client-server implementation, which always pulls files from server U1. As a result, compilation at the other LANs was two to five times faster over Swarmfs than over Coda-s on average.

Figure 5.6. The replica hierarchy observed for a Swarmfs file in the network of Figure 5.5.

When Coda detects a slow link to the server (as in the case of nodes starting at node I1), it switches to weak connectivity mode. In this mode, Coda guarantees only eventual consistency, which causes incorrect behavior starting at node T2 (hence the missing results in Figure 5.7 for nodes T2, F1 and F2). For instance, on node T2, the 'make' program skipped compiling several files because it found their object files left over from the previous compilation on T1. Also, corrupt object files were reported during the linking step. On closer examination, we found that these problems are caused by Coda's behavior under weak connectivity: it switches to eventual consistency and eagerly pushes updates during trickle reintegration. During the compilation at T1, Coda-w's trickle reintegration mechanism starts pushing large object files to the server across the slow link as soon as they are created locally, which clogs the server link and delays the propagation of the crucial directory updates that indicate the subsequent object file deletions to the server. By the time T2 sees T1's file deletion operations, it has already started its compile and used obsolete object files. This is because Coda in weak mode never pulls file updates from the server, to avoid the high latency. Coda-s produced correct results because it enforces close-to-open consistency. However, it performs poorly across a WAN because it implements close-to-open semantics conservatively by forcing write-through of all client updates to the server. The write-through policy causes a single file modification on Coda-s to incur double the latency of Coda-w and Swarm, as shown for node T1 in Figure 5.8.
Unlike Coda's asymmetric roles of client and server, Swarm's symmetric (peer) consistency protocol avoids costly write-throughs across WAN links. In summary, unlike Coda-w, Swarm provides the close-to-open semantics required for the correct execution of the sequential sharing benchmark, and is more efficient than Coda-s. Swarm exploits nearby replicas similar to peer-to-peer file systems such as Pangaea and Ficus that provide pervasive replication. But it enforces the tight synchronization required for collaboration under all network conditions, which Pangaea and Ficus cannot guarantee because they ensure only eventual consistency.

Figure 5.7. Roaming File Access: Swarmfs pulls source files from nearby replicas. Strong-mode Coda correctly compiles all files, but exhibits poor performance. Weak-mode Coda performs well, but generates incorrect results on the three nodes (T2, F1, F2) farthest from server U1.
Figure 5.8. Latency to fetch and modify a file sequentially at WAN nodes. Strong-mode Coda writes the modified file synchronously to the server.

5.2.4 Simultaneous WAN Access (Shared RCS)

Some distributed applications (e.g., email servers and version control systems) require reliable file locking or atomic file/directory operations to synchronize concurrent read/write accesses and thus avoid hard-to-resolve update conflicts. However, the atomicity guarantees required by these operations are not provided across replicas by most wide area file systems. As a result, such applications cannot benefit from replication, even if they exhibit high degrees of access locality. For example, the RCS and CVS version control systems use the exclusive file creation semantics provided by the POSIX open() system call's O_EXCL flag to gain exclusive access to repository files. During a checkout/checkin operation, RCS attempts to atomically create a lock file and relies on its pre-existence to determine if someone else is accessing the underlying repository file. The close-to-open consistency semantics provided by AFS and Coda and the eventual consistency semantics provided by most distributed file systems are inadequate to guarantee the exclusive file creation semantics that RCS requires. Thus, using those file systems to replicate an RCS or CVS repository can lead to incorrect behavior. NFS provides the file locking required for repository file sharing via a separate centralized lock manager. However, its client-server architecture inhibits delegation of locking privileges between clients, preventing NFS from fully exploiting regional locality across WANs. As a result, CVS designers do not recommend using distributed file systems (other than NFS) to store repositories, and most CVS installations (such as SourceForge.net) employ a client-server organization to support wide area developers, which inhibits caching [17]. In contrast, Swarmfs users can ensure strong consistency semantics for exclusive updates to repository files and directories, as described below.
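The exclusive-creation idiom that RCS relies on is easy to state in code. The sketch below is our own illustration (not RCS source) of the lock-file acquisition step; a replicated file system must make the O_CREAT|O_EXCL test-and-create atomic across replicas for the lock to be meaningful, which is exactly what Swarmfs provides via exclusive (WRLK) directory access.

```c
/* Illustrative sketch of the RCS-style lock-file idiom (not actual RCS code).
 * open() with O_CREAT|O_EXCL must fail if the file already exists; the file
 * system must preserve that atomicity even when the directory is replicated. */
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>
#include <stdio.h>

int lock_repository_file(const char *lockpath)
{
    int fd = open(lockpath, O_CREAT | O_EXCL | O_WRONLY, 0444);
    if (fd < 0) {
        if (errno == EEXIST)
            fprintf(stderr, "%s: repository file busy, try again later\n",
                    lockpath);
        return -1;                 /* someone else holds the lock */
    }
    close(fd);
    return 0;                      /* lock acquired; unlink lockpath to release */
}
```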
An RCS/CVS root repository directory's consistency attribute can be set to use exclusive access mode (WRLK) for updates, which recursively applies to all subsequently created repository files. As a result, RCS/CVS programs need not be modified to use Swarmfs. Swarmfs can thus safely replicate RCS/CVS files across a WAN, and can also exploit locality for low latency access.

5.2.5 Evaluation

To evaluate how effectively Swarmfs can support concurrent file sharing with strong consistency, we simulated concurrent development activities on a project source tree using a shared RCS repository for version control. Although CVS is a more popular version control system than RCS, we chose the latter for our experiment because RCS employs per-file locking for concurrency control and hence allows more parallelism than CVS, which currently locks the entire repository for every operation. Moreover, CVS stores repository files in the RCS format internally. Hence, modifying CVS to lock individual RCS files would allow more parallelism; our RCS sharing experiments demonstrate the potential benefits of such an effort. Overall, our results show that Swarmfs' strongly consistent caching provides near-local latency to wide area developers for checkin and checkout operations (i.e., several orders of magnitude improvement over client-server) when repository file accesses are highly localized, and up to three times lower latency on average than a traditional client-server version control system even in the presence of widespread sharing.

We evaluated two versions of RCS, one in which the RCS repository resides in Swarmfs (peer sharing mode) and one in which the RCS repository resides on one node and is accessed via ssh from other nodes (client-server/RPC mode). We employed WRLK mode for repository file access in Swarmfs and allowed repository files to be aggressively cached on demand, to determine Swarmfs' behavior under worst-case thrashing conditions. For this set of experiments, we used the topology shown in Figure 5.5 without the ISP (I). The "Home Node" initially hosts three project subdirectories from the Andrew-tcl benchmark sources: unix (39 files, 0.5MB), mac (43 files, 0.8MB), and tests (131 files, 2.1MB).

Our synthetic software development benchmark consists of six phases, each lasting 200 seconds, that simulate diverse patterns of wide area file sharing. Figure 5.9 illustrates the access patterns during these six phases. During each phase, a developer updates a random file every 0.5-2.5 seconds from the module she is currently working on, producing 24 to 120 updates/minute. Each update consists of an RCS checkout (via the 'co' command), a file modification, and a checkin (via the 'ci' command). Though this update rate is much higher than typically produced by a single user, our intention is to study the system's behavior under heavy load. We use the latency of the checkout and checkin operations as the performance metrics. In Phase 1 (widespread shared development), all developers (one on each of the eight nodes) work concurrently on the unix module, and hence contend for unix files across all sites. In Phase 2 (clustered development), the developers at the University (U) and Corporate (C) sites in the U.S. switch to the tests module, which restricts contention to within the U.S. Also, the developers in Turkey (T) continue work on the unix module and the developers in France (F) switch to the mac module, so accesses to those modules become highly localized.
In Phases 3-6 (migratory development), work shifts every 200 seconds between "cooperating" sites: the unix module's development migrates between the U.S. University and Turkey, while the mac module migrates between the Corporate LAN in the U.S. and France (e.g., to time-shift developers).

Figure 5.9. Repository module access patterns during the various phases of the RCS experiment. The graphs in Figures 5.10 and 5.11 show the measured file checkout latencies of Swarmfs-based RCS at the Univ. (U) and Turkey (T) sites relative to client-server RCS.

Figure 5.10 shows the file checkout latencies observed by clients on the University LAN (U), where the primary copy of the RCS repository is hosted, as a timeline. Figure 5.11 shows the checkout latencies observed by clients on the "Turkey" LAN (T) across the slow intercontinental link, also as a timeline. Each graph shows a scatter plot of the latencies observed for Swarmfs-based checkouts, as well as the average latency curves for RPC-based checkouts from various sites for comparison. For example, the latency of a local RCS checkout operation at the primary site U1 is indicated in both graphs by the curve at the bottom labelled 'local U1', whereas the RPC latency for checking out files from the Turkey (T) LAN is indicated by the topmost curve labelled 'rpc by T'. Each vertical line denotes a phase transition. The labels at the bottom indicate the phase number and those at the top indicate which LANs access which modules during each phase.

Figure 5.10. RCS on Swarmfs: Checkout latencies near the home server on the "University" (U) LAN.

In Phase 1, developers from all LANs work on unix files (as indicated by the 'all unix' label at the top), and Swarmfs-based file checkouts at all sites incur latencies that vary widely around 1.5 seconds, as indicated by the widely dispersed points in the 'phase 1' column in both graphs. However, Swarmfs latencies are roughly half of the RPC latency incurred from Turkey. In Phase 2 (labelled 'U, C tests' at the top of the graph in Figure 5.10), when only the developers in the Univ (U) and the Corporate (C) LANs share
In each of the migratory development phases 3 to 6, whenever developers in the U.S. University (U) and Turkey (T) LANs become active, their unix file accesses incur high initial latencies on Swarmfs for a short period during which locking privileges migrate from their peers. However, due to localized access, the latencies quickly drop to that of local RCS at U1. These results are indicated by the latency trends for Swarmfs checkouts in the columns for phases 3 and 5 in Figure 5.10, and in the columns for phases 2, 4 and 6 in Figure 5.11. Thus, Swarmfs caching is highly effective at reducing 149 RCS Checkout Latency (millisec) RCS on Swarmfs: Checkout Latencies at Turkey site (T) 10000 All unix T unix (U unix) T unix (U unix) T unix rpc by T rpc by C 1000 rpc by U2 100 local U1 10 Phase 1 Phase 2 Phase 3 Phase 4 Phase 5 Phase 6 0 200 Swarmfs T1 400 600 800 1000 Time Elapsed (seconds) 1200 1400 Swarmfs T2 Figure 5.11. RCS on Swarmfs: Checkout Latencies at “Turkey” (T) site, far away from home server. latencies when accesses are localized. In contrast, RPC-based checkouts incur the same high WAN latency for all operations regardless of locality. Using Swarmfs, no RCS operations failed, nor did any of our sanity checks detect any problems, which indicates that Swarmfs provides correct file locking semantics and thus enables unmodified RCS to work correctly under all workloads. Swarmfs enables RCS developers to realize the low-latency benefit of caching when there is regional locality, and provides better performance than client-server RCS even when there is widespread contention for files. For instance, when the development work is highly localized as in phases 3 and 5 at site U (Figure 5.10) and phases 2, 4 and 6 at site T (Figure 5.11), the checkout latency quickly drops to that of local RCS operations on the home server U1. Even when developers at multiple sites operate on a common set of files and exhibit no locality (as in Phase 1 when they all access the unix module, and Phase 2 when U and C access the tests module), Swarmfs outperforms client-server RCS. The worst-case RCS latency on Swarmfs is close to 3 seconds for all developers, which is the same as the average latency of client-server RCS across the slow link from Turkey (labelled ‘rpc at T’ in Figure 5.11). But Swarmfs’ average latency is less than half of the RPC 150 latency. This is because Swarm often avoids crossing the slow link for consistency traffic by forming an efficient replica network. Finally, Swarmfs responds quickly to changes in data locality. When the set of users sharing files changes (e.g., when developers at sites U and T work on the unix module alternatively in Phases 3-6), Swarm migrates the replicas to the new set of sharers fairly rapidly. Thus, Swarmfs not only supports reliable file locking across WANs unlike existing file systems, but also exploits data locality and enables applications such as version control to realize latencies close to local file system when accesses are geographically localized. In addition, Swarm’s efficient replica networking enables Swarmfs to mask network delays effectively by avoiding frequent use of slow links for consistency traffic. Due to these benefits, Swarmfs provides a better solution for file locking than traditional client-server approaches including NFS. 
5.2.6 Summary

In this section, we showed that Swarmfs, a distributed file system extension to Swarm, effectively supports fine-grain collaboration as well as reliable file locking applications by supporting the configurable consistency framework, unlike existing wide area file systems. These applications can leverage Swarm's proximity-aware replica management and peer-to-peer consistency protocol to achieve near-local latencies when accesses are highly localized, while achieving latency half that of traditional client-server schemes even when there is widespread contention for shared data.

5.3 SwarmProxy: Wide-area Service Proxy

Enterprise servers that handle business-critical enterprise objects (e.g., sales, inventory or customer records) could improve responsiveness to clients in remote geographic regions by deploying service proxies (commonly known as ASPs) in those regions. Although a centralized cluster of servers can handle a large volume of incoming client requests, the end-to-end throughput and availability of the clustered services are limited by the cluster's Internet link and the geographic spread of clients. As explained in Section 2.3.2, by caching enterprise objects locally, ASPs can improve response times, spread load, and/or improve service availability. However, enterprise services tend to have stringent integrity requirements, and enforcing them often requires strong consistency [41]. Two key design challenges must be overcome to build wide area proxies for enterprise servers: (i) performance: when clients contend for strongly consistent access to write-shared data, aggressively caching data performs poorly; and (ii) availability: some proxies may temporarily become inaccessible, and the service must continue to be available despite such failures. However, since ASPs are typically deployed in an enterprise-class environment consisting of powerful machines with stable Internet connectivity, a small number (dozens) of proxies can handle a large number of clients.

Swarm can support wide-area caching of enterprise objects efficiently in applications such as online shopping and auctions without compromising strong consistency, due to its contention-aware replication control mechanism. In this section, we demonstrate that WAN proxy caching using Swarm improves the aggregate throughput and client access latency of enterprise services by 200% relative to a central clustered solution, even when clients exhibit contention. Swarm's contention-aware caching outperforms both aggressive caching and client-server RPCs under all levels of contention, as well as under dynamically changing data access patterns and contention across multiple sites.

5.3.1 Evaluation Overview

To evaluate how effectively Swarm's contention-aware replication control supports strongly-consistent wide area caching of enterprise objects, we built a simple synthetic proxy-based enterprise service. The enterprise service consists of three tiers: a middle-tier enterprise server accepts service requests from front-end web servers and operates on data stored in a backend object database. Unlike existing clustered architectures for web-based enterprise services [24, 15], we deploy our web servers across the wide area from the primary enterprise server to serve clients in different geographic regions. Figure 5.12 illustrates the network architecture of the modeled service. We run the primary enterprise server at node E0, which hosts a database of enterprise objects at its local Swarm server.
We deploy workload generators at 16 wide area nodes (E1-E16), each of which simulates a busy web server (with 4 threads) processing query and update requests for objects from web browsers at full speed by issuing RPCs to the primary enterprise server. We refer to each workload generator thread as a client. To enhance performance, we gradually deploy proxy enterprise servers, called SwarmProxies, colocated with the web servers, and cause those web servers to issue RPCs to the colocated proxy if available. Each enterprise server (primary or proxy) serves client requests by accessing objects from a local Swarm server, and may forward the request to other servers when suggested by Swarm. For instance, in Figure 5.12, the proxy at node E15 forwards a local client request to E9, which replies directly to the client at E15.

Figure 5.12. Network Architecture of the SwarmProxy service.

We compare the access latency and the aggregate throughput observed by the clients with three configurations under varying contention: (1) proxy servers requesting Swarm to adapt between aggressive caching and master-slave (RPC) modes based on observed contention for each object (denoted as adaptive caching), (2) proxy servers forcing Swarm to always cache objects locally on demand (denoted as aggressive caching), and (3) the traditional clustered organization with no proxies, where all clients issue RPCs to the primary enterprise server across the WAN (denoted as RPC). Our results indicate that deploying adaptive caching proxies across the WAN near client sites improves the aggregate client throughput beyond RPC, even when client requests exhibit a modest locality of 40% (i.e., each client accesses its own 'portion' of the shared object space only 40% of the time). Clients colocated with adaptively caching proxies experience near-local latency for objects that they access frequently, i.e., with locality of 40%, and an average latency 10% higher than that of a WAN RPC for other objects. In contrast, when proxies employ aggressive caching, clients incur latencies 50% to 250% higher than that of a WAN RPC for all objects, due to the high cost of enforcing strong consistency across the WAN. By avoiding this cost, the adaptive scheme improves the aggregate client throughput by 360% over aggressive replication even when there is high locality (i.e., each client accessing its own 'portion' of the shared object space 95% of the time). Finally, Swarm automatically migrates object privileges to frequent usage sites when working sets change. When perfect (100%) locality exists, i.e., when the same object is not shared by multiple clients, Swarm redistributes the objects automatically to the clients that use them, thereby providing linear speedup with WAN proxy caching under both schemes. When a proxy fails, clients near that site can redirect their requests to other sites. After the leases on the cached objects at the failed proxy expire, other sites can serve those objects, thus restoring service availability.

5.3.2 Workload

The objective of our SwarmProxy workload is to study Swarm's ability to handle contention when enforcing strong consistency. Hence, we designed a workload similar to the TPC-A transaction processing benchmark [73].
In TPC-A, each client repeatedly (at full speed) selects a bank account at random on which to operate and then randomly chooses to query or update the account (with 20% updates). Similarly, our synthetic enterprise database consists of two components: an index structure and a collection of 256 enterprise objects (each 4KBytes), all stored in a single Swarm file managed as a persistent page heap (with a page size of 4KBytes). The index structure is a page-based B-tree that maps object IDs to the offsets within the heap where objects are stored. We deliberately employ a small object space to induce a reasonable degree of sharing and contention during a short period of experimental time. We run our workload on 16 WAN nodes (E1-E16), as illustrated in Figure 5.12. Each node is connected to a common Internet backbone routing core by a 1Mbps, 10ms delay link and has a 40ms RTT to other nodes. Each of the four clients per node (i.e., workload generator threads, as explained earlier) repeatedly invokes a query or update operation on a random object at full speed. Processing a client request at a SwarmProxy involves walking the B-tree index in RDLK mode to find the requested object's heap offset from its OID, and performing the requested operation after locking the object's page in the appropriate mode (RDLK or WRLK). Thus, index pages are read-only replicated, whereas object pages incur contention due to read-write replication. We model a higher (50%) proportion of writes than TPC-A to evaluate SwarmProxy's performance under heavy write contention.

We vary the degree of access locality as follows. Associated with each of the 16 web servers are 16 "local" objects (e.g., clients on E1 treat objects 1-16 as their local objects, etc.). When a client randomly selects an object on which to operate, it first decides whether to select a local object or a "random" object from the entire set. We vary the likelihood of selecting a local object from 0%, in which case the client selects any of the 256 objects with uniform probability, to 100%, in which case the client selects one of its node's 16 local objects with uniform probability. In essence, the 100% case represents a partitioned object space with maximal throughput because there is no sharing, while the 0% case represents a scenario with no access locality.

5.3.3 Experiments

We evaluate SwarmProxy performance via three experiments run in sequence. Our first experiment determines the effect of adding wide area proxies on overall service performance when varying the degree of access locality from 0% to 100%. We run the primary enterprise server on node E0 and deploy web servers running the workload described above on each of the 16 nodes (E1-E16). Thus initially, all web servers invoke the primary server via RPCs across WAN links, emulating a traditional client-server organization with no caching. Subsequently, we start a SwarmProxy (and associated Swarm server) at a new node (E1-E16) every 50 seconds, and redirect the "web server" on that node to use its local proxy. As we add proxies, Swarm caches objects near where they are most often accessed, at the cost of potentially increasing the coherence traffic needed to keep individual objects strongly consistent. Our second experiment evaluates how Swarm automatically adapts to clients dynamically changing their working sets of objects, while fixing the amount of locality at 40%.
After we deploy SwarmProxies at all web servers, each client shifts its notion of which objects are "local" to be those of its next cyclical neighbor (clients on E1 treat objects 16-31 as local, etc.). We run this scenario, denoted as "working set shift", for 100 seconds. Our third experiment evaluates how Swarm performs when many geographically widespread clients contend for very few objects, e.g., bidding for a popular auction item. After experiment 2 ends, clients on nodes E9-E16 treat objects 1-16 as their "local" objects, which introduces very heavy contention for those 16 objects. We run this scenario, denoted as "contention", for 100 seconds. We run these experiments with enterprise servers employing both the adaptive and aggressive caching schemes. For the adaptive case, we configured Swarm to require that an object replica be accessed at least 6 times before its privileges can be migrated away. We did this by setting the soft and hard caching thresholds (explained in Section 4.7.7) on the Swarm file hosting the database to 6 and 9, respectively.

5.3.4 Results

Figures 5.13 and 5.14 show how aggregate throughput varies as we add SwarmProxies in Experiment 1 while varying the degree of locality (0%-100%) and enforcing strong consistency. Vertical bars denote the addition of a proxy.

Figure 5.13. SwarmProxy aggregate throughput (ops/sec) with adaptive replication.

To provide a baseline for gauging the performance improvement with proxies, we measured the maximum throughput that a single saturated enterprise server can deliver. To do this, we ran a single web server's workload with its four full-speed threads colocated with the primary server on node E0. The line labelled "local" in the graphs indicates the performance of this scenario, which is 1000 ops/second. The line labelled "rpc" denotes the aggregate throughput achieved when all 16 web servers run the same workload on the single primary server across the WAN links without proxies (480 ops/sec), which is
5.3.4 Results

Figures 5.13 and 5.14 show how aggregate throughput varies as we add SwarmProxies in Experiment 1 when varying the degree of locality (0%-100%) while enforcing strong consistency. Vertical bars denote the addition of a proxy. To provide a baseline for gauging the performance improvement with proxies, we measured the maximum throughput that a single saturated enterprise server can deliver. To do this, we ran a single web server's workload with its four full-speed threads when it is colocated with the primary server on node E0. The line labeled "local" in the graphs indicates the performance of this scenario, which is 1000 ops/second. The line labeled "rpc" denotes the aggregate throughput achieved when all 16 web servers run the same workload on the single primary server across the WAN links without proxies (480 ops/sec), which is roughly half of what a single server can deliver. RPC provides low throughput because node E0's 1Mbps Internet link becomes the bottleneck.

Figure 5.13. SwarmProxy aggregate throughput (ops/sec) with adaptive replication.

Figure 5.14. SwarmProxy aggregate throughput (ops/sec) with aggressive caching. The Y-axis in this graph is set to a smaller scale than in Figure 5.13 to show more detail.

At 100% (i.e., perfect) locality, adding proxies causes near-linear speedup, as expected, for both aggressive and adaptive replication. Swarm automatically partitions the database to exploit perfect locality if it exists. Also, for both replication modes, the higher the degree of locality, the higher the measured throughput. With aggressive replication, the aggregate throughput quickly levels off as more proxies contend for the same objects. Even at 95% locality, throughput never exceeds 2500 ops/sec, because objects and their locking privileges are migrated across the WAN from the site that frequently accesses them to serve cache misses elsewhere. With adaptive replication, clients initially forward all requests to the root server on E0. When locality is 40% or higher, SwarmProxies cache their "local" objects soon after they are spawned. Under these circumstances, nodes use RPCs to access "remote" objects, rather than replicating them, which eliminates thrashing and allows throughput to continue to scale as proxies are added. With high (95%) locality, the adaptive scheme can support almost 9000 ops/sec in the fully proxy-based system (a 360% improvement over the aggressive scheme). Further improvement was not achieved because our contention-aware replica control algorithm incurs some communication overhead to track locality, which could be optimized further.

Figure 5.15 shows the distribution of access latencies for "local" objects on node E9 throughout the experiment. Even with modest (40%) locality, the adaptive scheme reduces the access latency of "local" objects from over 100msecs to under 5msecs once a local proxy is spawned. In contrast, when using aggressive replication, the average access latency hovers around 100msecs due to frequent lock shuttling. As Figure 5.16 shows, once a local proxy is started, the adaptive scheme incurs a latency of around 75msecs for nonlocal objects, which is about 15% higher than the typical client-server RPC latency of 65msecs. Thus, the adaptive scheme performs close to RPC and outperforms the aggressive scheme for access to "nonlocal" objects, despite never caching these objects locally, because it eliminates useless coherence traffic.

Figure 5.15. SwarmProxy latencies for local objects at 40% locality.

Figure 5.16. SwarmProxy latencies for non-local objects at 40% locality.

When we have each node shift the set of objects that most interest it while maintaining locality at 40%, the phase denoted "expt2" in Figures 5.15 and 5.16, the adaptive scheme migrates each object to the node that now accesses it most often within roughly 10 seconds, as seen in Figure 5.15. The performance of aggressive caching does not change, nor does its average access latency for "nonlocal" objects. This shows that Swarm's contention-aware replication mechanism not only directs clients to where an object is more frequently accessed, it can also dynamically track usage shifts between sites. Finally, when we induce extremely heavy contention for a small number of objects, the phase denoted "expt3" in Figures 5.15 and 5.16, the adaptive scheme almost immediately picks a single replica to cache the data and become the master, and shifts other replicas into slave mode. By doing so, Swarm is able to serve even heavily contended objects with RPC latency (under 100 msecs) at node E9. In contrast, we found that the aggressive replication protocol often requires over 500msecs to service a request and only rarely sees sub-100msec latency.
Thus, Swarm automatically switches to centralization when it detects extreme contention and prevents the latency from worsening beyond that of RPC to a central server. In summary, Swarm's contention-aware replication control mechanism provides better performance than either aggressive caching or exclusive use of RPCs for all levels of contention. Swarm-based wide area enterprise proxy servers offer several benefits. First, they improve overall service throughput and responsiveness beyond what can be achieved with clustered services. Second, they automatically detect which objects are in demand at various sites and automatically migrate them to those sites for good overall performance. Finally, when there is heavy contention, they switch to a more efficient centralized mechanism to stem performance degradation.

5.4 SwarmDB: Replicated BerkeleyDB

In this section, we evaluate Swarm's effectiveness in providing wide area replication to database and directory services. Since many distributed applications (such as those for authentication, user profiles, shopping lists and e-commerce) store their state in a database or a directory to ease data management, efficient database replication benefits a variety of applications. The consistency requirements of a replicated database are determined by its application, and vary widely in the spectrum between strong and eventual consistency. Currently, popular databases (such as mySQL, Oracle, and BerkeleyDB) predominantly employ master-slave replication due to its simplicity; read-only replicas are deployed near wide area clients to scale query performance, while updates are applied at a central master site to ensure serializability. For applications that can handle concurrent updates (e.g., directory services that locate resources based on attributes), master-slave replication is restrictive and cannot scale or exploit regional locality in updates. By using Swarm to implement database replication, we can choose on a per-client basis how much consistency is required. Thus, high throughput can be achieved when the consistency requirements are less strict, e.g., for a directory service [46], while the same code base can be used to provide a strongly consistent database.

We augmented the BerkeleyDB embedded database library [67] with replication support, since it is widely used for lightweight database support in a variety of applications including openLDAP [51] and authentication services [33]. We added support for replication by wrapping a library called SwarmDB around the unmodified BerkeleyDB library. SwarmDB stores each BerkeleyDB data structure, such as a B-tree, a hash table or a queue (in its unmodified BerkeleyDB persistent format, hereafter referred to as the DB), in a Swarm file which is automatically cached on demand by Swarm servers near application instances. SwarmDB transparently intercepts most of the BerkeleyDB data access interface calls, and supports additional arguments to the DB open() interface with which applications can specify configurable consistency options for each DB access session. Thus, an existing BerkeleyDB-based application can be retargeted to use SwarmDB with simple modifications to its DB open() invocations, thereby enabling it to reap the benefits of wide area caching. More details on how a SwarmDB-based application is organized are given in Section 4.3. SwarmDB handles BerkeleyDB update operations by invoking them as operational updates on the local Swarm server, treating an entire DB as a single consistency unit.
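The retargeting just described might look roughly as follows. The swarmdb_open() and swarmdb_put() wrappers, the cc_options structure, and the file path are illustrative assumptions; the dissertation only states that SwarmDB wraps the unmodified BerkeleyDB calls, extends the DB open() interface with consistency options, serves reads from the local replica, and ships writes to the local Swarm server as operational updates.

    /* Sketch: a BerkeleyDB application retargeted to SwarmDB (hypothetical
     * wrapper interface; only the division of labor follows the text). */
    #include <db.h>        /* the unmodified BerkeleyDB library */
    #include <string.h>

    struct cc_options {        /* per-session configurable consistency options */
        int concurrent_writes; /* 0 = exclusive (WRLK-style) write sessions    */
        int staleness_ms;      /* tolerated replica divergence ("time=" bound) */
        int hard;              /* 1 = hard bound, 0 = soft best-effort push    */
    };

    /* Assumed wrappers provided by the SwarmDB library: open the DB stored
     * in a Swarm file with the given options; perform a write locally via
     * BerkeleyDB and also ship it to the local Swarm server as an
     * operational update so other replicas can replay it. */
    extern DB  *swarmdb_open(const char *swarm_file, DBTYPE type,
                             const struct cc_options *cc);
    extern int  swarmdb_put(DB *db, DBT *key, DBT *val);

    int store_profile(const char *user, const char *profile)
    {
        struct cc_options cc = { 1, 0, 1 };   /* e.g., close-to-open rd/wr */
        DB *db = swarmdb_open("/swarm/profiles.db", DB_BTREE, &cc);
        if (db == NULL)
            return -1;

        DBT key, val;
        memset(&key, 0, sizeof key);
        key.data = (void *)user;    key.size = strlen(user) + 1;
        memset(&val, 0, sizeof val);
        val.data = (void *)profile; val.size = strlen(profile) + 1;

        /* Reads would call the ordinary db->get(); writes go through the
         * wrapper so Swarm can propagate them as operational updates. */
        return swarmdb_put(db, &key, &val);
    }

All other BerkeleyDB calls in an existing application would remain unchanged, which is consistent with the small size of the wrapper library reported below.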
A SwarmDB plugin linked into each Swarm server is used to apply update operations via the BerkeleyDB library on a local DB replica and to resolve update conflicts. Swarm reapplies those operations at other replicas to keep them consistent. Synchronizing DB replicas via operational updates allows transparent replication without modifying the underlying database code, at the expense of increasing contention at replicas when false sharing occurs. A relational database such as mySQL could also be replicated across the wide area in a similar manner. To illustrate SwarmDB's programming complexity, the original BerkeleyDB library is 100K+ lines of C code. The SwarmDB wrapper library is 945 lines (397 semicolons) of C code, which consists of simple wrappers around BerkeleyDB calls and glue code. The SwarmDB plugin linked to each Swarm server consists of 828 lines (329 semicolons) of C code to apply operational updates on the local copy and to provide conflict resolution and log truncation. SwarmDB took a week to develop and two weeks to debug and to remove performance bottlenecks in the plugin that arose when Swarm servers propagate updates at full speed.

5.4.1 Evaluation Overview

In Section 5.4.2, we demonstrate the spectrum of consistency choices that SwarmDB-based replication with configurable consistency makes available to BerkeleyDB applications, and their diverse performance and scaling characteristics. We run a full-speed, update-intensive DB workload on a SwarmDB database with a mix of five distinct consistency flavors listed in Table 5.1 on up to 48 wide area replicas, and compare the throughput to that of a client-server (RPC) version of BerkeleyDB. Consistency flavors that require tight synchronization, such as strong and close-to-open consistency, offer throughput similar to that of RPC, but do not scale beyond 16 replicas. Relaxing consistency even by a small time bound improves throughput by an order of magnitude. Finally, employing eventual consistency improves throughput by another order of magnitude and also scales well to 48 replicas.

In Section 5.4.3, we present a peer-to-peer (music) file sharing application modeled after KaZaa. It implements its attribute-based file search directory as a distributed hierarchy of SwarmDB indexes, each replicated on demand with eventual consistency. In a 240-node file sharing network where files employ close-to-open consistency, Swarm's ability to obtain files from nearby replicas reduces the latency and WAN bandwidth consumption of new file downloads to roughly one-fifth that of random replica networking, even under high node churn (5 node deaths/second). Swarm quickly recovers from failed replicas due to its efficient implementation of lease-based consistency, causing fewer than 0.5% of the downloads to fail in spite of the high rate of node failures.

5.4.2 Diverse Consistency Semantics

We measure SwarmDB's read and write throughput when running an update-intensive BerkeleyDB workload on a database replicated with Swarm employing the five consistency flavors listed in Table 5.1. We compare SwarmDB against BerkeleyDB's client-server (RPC) implementation.

Table 5.1. Consistency semantics employed for replicated BerkeleyDB (SwarmDB) and the CC options used to achieve them. The unspecified options are set to [RD/WR, time=0, mod=∞, soft, no semantic deps, total order, pessimistic, session visibility & isolation].

    Consistency Semantics                CC options
    locking writes (eventual rd)         WRLK
    master-slave writes (eventual rd)    WR, serial
    close-to-open rd, wr                 RD/WR, time=0, hard
    time-bounded rd, wr                  time=10, hard
    optimistic/eventual rd, wr           RD/WR, time=0, soft
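Read as code, the five rows of Table 5.1 might translate into option sets along the following lines; the struct layout and field names are assumptions made for illustration, and only the values are taken from the table (fields left at zero correspond to the defaults listed in the caption).

    /* Sketch: Table 5.1's consistency flavors as per-session option sets.
     * The structure is hypothetical; the values mirror the table. */
    enum wr_mode { WR_CONCURRENT, WR_LOCKING, WR_SERIAL_AT_MASTER };

    struct cc_opts {
        enum wr_mode writes;       /* WRLK, "WR, serial", or plain RD/WR      */
        int          staleness_ms; /* the "time=" bound on replica divergence */
        int          hard;         /* 1 = hard bound (pull), 0 = soft push    */
    };

    static const struct cc_opts locking_writes = { WR_LOCKING,          0, 0 }; /* WRLK                */
    static const struct cc_opts master_slave   = { WR_SERIAL_AT_MASTER, 0, 0 }; /* WR, serial          */
    static const struct cc_opts close_to_open  = { WR_CONCURRENT,       0, 1 }; /* RD/WR, time=0, hard */
    static const struct cc_opts time_bounded   = { WR_CONCURRENT,      10, 1 }; /* time=10, hard       */
    static const struct cc_opts optimistic     = { WR_CONCURRENT,       0, 0 }; /* RD/WR, time=0, soft */

An application would pass one of these option sets when opening a DB access session, via the extended DB open() interface sketched earlier.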
In our synthetic SwarmDB benchmark, we create a BerkeleyDB B-tree inside a Swarm file and populate it with 100 key-value pairs. The database size does not affect the benchmark's performance, except for the initial replica creation latency, because we employ operational updates, where Swarm treats the entire database (file) as a single consistency unit and propagates operations instead of changed contents. We run a SwarmDB process (i.e., a client process linked with the SwarmDB library) and a Swarm server on each of 2 to 48 nodes. The nodes run FreeBSD 4.7 and are each connected by a 1Mbps, 40-msec roundtrip latency WAN link to a backbone router core using Emulab's delayed LAN configuration illustrated in Figure 5.1. Each SwarmDB process executes an update-intensive workload consisting of 10,000 random operations at full speed (back-to-back, without think time) on its local database replica. The operation mix consists of 5% adds, 5% deletes, 20% updates, 30% lookups, and 40% cursor-based scans. Reads (lookups and cursor-based scans) are performed directly on the local database copy, while writes (adds, deletes, and updates) are sent to the local Swarm server as operational updates. Each operation opens a Swarm session on the database file in the appropriate mode, performs the operation, and closes the session. We employed a high update rate in our workload to characterize Swarm's worst-case performance when employing various consistency semantics.

We employ the following consistency flavors, which in some cases differ for reads and writes:

1. Locking writes and optimistic/eventual reads: This combination ensures strong consistency for writes via exclusive locks to ensure their serializability, but provides eventual consistency for reads, so they can be executed in parallel with writes. Write sessions are executed at multiple replicas, but serialized via locks. Their updates are propagated to other replicas by a best-effort eager push. This flavor can be used when a database's workload is dominated by queries that need not return the latest data, and writes that exhibit high locality but require the latest data and serializability (e.g., a sales database).

2. Master-slave writes and eventual reads: This combination provides the read-only replication supported by many existing wide area database replication solutions (e.g., BerkeleyDB and mySQL). Unlike locking writes, master-slave writes are always forwarded to a single 'master' replica (the root replica in Swarm) where they are executed serially, but interleaved with other local sessions. Updates are propagated to other replicas by a best-effort eager push. This flavor can be used when a database must support a higher proportion of writes issued from multiple sites, or if write conflicts are hard to resolve and should be avoided (e.g., an inventory database).

3. Close-to-open reads and writes: This flavor provides the latest snapshot view of the database at the time it is opened, and that view does not change during an access session. It still allows writes to proceed in parallel at multiple replicas. This semantics is provided by Oracle as snapshot isolation [52].
4. Time-bounded inconsistency for reads and writes: This flavor relaxes the 'latest data' requirement of close-to-open consistency by allowing a small amount of staleness in the view of data in exchange for reduced synchronization frequency and increased parallelism among writes. This semantics is provided by the TACT toolkit [77] via its staleness metric.

5. Optimistic/eventual reads and writes: This flavor allows maximal parallelism among reads and writes at multiple replicas but only guarantees eventual convergence of their contents. It performs a read or write operation without synchronizing with other replicas, and propagates writes by a best-effort eager push. This semantics is provided by Microsoft Active Directory's replication system [46].

Overall, SwarmDB's throughput and scalability characteristics vary widely based on the consistency flavor employed for replication. Compared to a client-server BerkeleyDB implementation, the throughput of BerkeleyDB replicated with Swarm improves by up to two orders of magnitude with optimistic/eventual consistency, and by an order of magnitude even when replica contents are guaranteed not to be out of sync by more than 20 msecs.

Figures 5.17 and 5.18 show the average throughput observed per replica for reads and writes. For comparison, the graphs present BerkeleyDB's baseline performance when clients access the database (i) stored in the local file system (local) via the original library, (ii) stored in a colocated BerkeleyDB server via RPCs (rpclocal), and (iii) stored in a colocated Swarm server via the SwarmDB library (slocal). Slocal represents the best throughput achievable using SwarmDB on top of our Swarm prototype. The big drop in the throughput of slocal relative to local is due to two artifacts of our unoptimized prototype implementation. First, the SwarmDB library communicates with the local Swarm server via socket-based IPC to open sessions and to issue updates, which also slows down the rpclocal case. The IPC could often be avoided by employing a session cache on the client side. Second, the Swarm prototype's IPC marshaling code employs verbose messages, and parsing them more than doubles the message processing latency in the critical path of IPC. When we replaced it with a binary message format, we found that SwarmDB's local throughput doubled. We also found several other optimizations that could be made to improve SwarmDB's raw performance. However, they do not change the results presented here qualitatively.

Figure 5.17. SwarmDB throughput observed at each replica for reads (lookups and cursor-based scans). 'Swarm local' is much worse than 'local' because SwarmDB opens a session on the local server (incurring an inter-process RPC) for every read in our benchmark.

Figure 5.18. SwarmDB throughput observed at each replica for writes (insertions, deletions and updates). Writes under master-slave replication perform slightly worse than RPC due to the update propagation overhead to slaves.

When SwarmDB employs read-write replication with strong or close-to-open consistency, client throughput quickly drops below that of RPC, and does not scale beyond 16 replicas. This workload's high update rate requires replicas to synchronize across WAN links for almost every operation to ensure consistency, incurring high latency. When clients relax consistency by tolerating even a small amount (10ms) of staleness in their results (i.e., employ time-bounded inconsistency), the per-client read and write throughput improves by an order of magnitude over close-to-open consistency. This is because the cost of synchronization over the wide area is very high, and amortizing it over multiple operations has a substantial latency benefit, especially for a high-speed workload such as ours. Thus, the configurable consistency framework's staleness option is important for wide area caching. With time-bounded inconsistency, throughput drops for large replica sets, but scales better than with close-to-open consistency.
When clients employ eventual consistency, they achieve read and write throughput close to the Swarm local (slocal) case, which also scales linearly to large replica sets, for two reasons. First, replicas synchronize by pushing updates in the background without stalling client operations. Second, although replicas accept updates at a high rate and propagate them eagerly to each other, Swarm servers make use of their local SwarmDB plugin to remove self-canceling updates (such as the addition and subsequent removal of a database entry) from the local update log and avoid their unnecessary propagation, reducing update propagation traffic. The write throughput under eventual consistency is less than the read throughput because the updates involve an IPC to the local Swarm server, whereas the SwarmDB library performs a read operation directly on the underlying BerkeleyDB database.

When clients simultaneously employ different semantics for read and write operations, such as strong consistency for writes and eventual consistency for reads, they achieve high throughput for reads but low throughput for writes. In other words, strongly consistent write accesses do not affect the performance of simultaneous weakly consistent reads. Thus, Swarm can support multiple clients operating on the same data with different consistency requirements simultaneously and without interference, provided their semantics do not conflict.

In summary, Swarm enables applications to use the same SwarmDB and application code base to achieve database replication for diverse sharing needs by simply employing different configurable consistency options. Our experiment demonstrated that employing different consistency flavors can result in order-of-magnitude differences in the throughput and scalability of the resulting application.

5.4.3 Failure-resilience and Network Economy at Scale

For configurable consistency to provide a viable consistency solution for wide area replication, its implementation must not only utilize network capacity efficiently in nonuniform networks, but also gracefully recover from failures and enforce consistency in a dynamic environment where a large number of replicas continually join and abruptly leave the sharing network (a phenomenon called node churn). In this section, we demonstrate the failure resilience and network economy of Swarm's replication and lease-based consistency mechanisms under node churn involving hundreds of nodes.
5.4.4 Evaluation

We emulate a large wide area file sharing network where a number of peer nodes continually join and leave the network. Each peer comes online, looks up files in a SwarmDB-based shared index, downloads previously unaccessed files for a while by accessing them via its local Swarm server, and then abruptly goes offline, killing the server as well. As files get cached widely, the latency to download a new file drops, since Swarm is likely to obtain it from nearby peers. To evaluate how the lease-based failure resilience of Swarm's consistency mechanisms performs under churn, we employ close-to-open consistency for shared files, which forces Swarm to maintain leases. We never update the files during the experiment, so when a replica loses contact with its parent, it must reconnect to the hierarchy to ensure that it has the latest data before it can serve the file to another node.

We employ two metrics to evaluate the performance of Swarm's replication and consistency mechanisms under churn: (1) the latency to download a new file, and (2) the wide area network bandwidth consumed for downloads. Our results indicate that Swarm's proximity-aware replica management quickly detects nearby replicas even in a network of 240 nodes, reducing download latency as well as WAN bandwidth to roughly one-fifth that of Swarm with proximity-awareness disabled.

We emulate the network topology shown in Figure 5.19. Machines are clustered in campuses spread over multiple cities across a wide area network. Campuses in the same city are connected via a 10Mbps network with 10ms RTT. Campuses in neighboring cities are connected via a 5Mbps network to a backbone router, and have 50ms RTT to each other. Campuses separated by the WAN have 5Mbps bandwidth and 150ms roundtrip latency between them and need to communicate across multiple backbone routers (two in our topology). Each user node runs a file sharing agent and a Swarm server, which are started and stopped together. To conduct large-scale experiments on limited physical nodes, we emulate 10 user nodes belonging to a campus on a single physical machine. Thus, we emulate 240 user nodes on a total of 24 physical machines. The CPU, memory, and disk were never saturated on any of the machines during our experiment.

Figure 5.19. Emulated network topology for the large-scale file sharing experiment. Each oval denotes a 'campus LAN' with ten user nodes, each running a SwarmDB client and Swarm server. The labels denote the bandwidth (bits/sec) and one-way delay of network links (from node to hub).

We designate 10% of the nodes as 'home' nodes (i.e., custodians) that create a total of 1000 files and make them available for sharing by adding their keys (i.e., attribute-value pairs) to the SwarmDB index. Although Swarm's replication mechanism itself can handle transient root custodian failures, the highly simplified SWID lookup mechanism of our Swarm prototype requires the custodian to be accessible to help bootstrap a new replica, as explained in Section 4.6.1. Hence, we configured the custodian nodes never to go down during the experiment. Once every 1-3 seconds, each node looks up the SwarmDB index for files using a random key and downloads those not available locally by accessing them via Swarm.
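A peer's behavior in this experiment, as just described, can be sketched as follows; the swarmdb_lookup(), swarm_fetch_file(), and cached_locally() calls are hypothetical stand-ins for the SwarmDB index lookup and the Swarm file access, and the key space is illustrative.

    /* Sketch of a peer's main loop in the file-sharing experiment.  The
     * swarmdb_/swarm_ identifiers are assumed, not Swarm's actual API. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    extern int swarmdb_lookup(const char *key, char hits[][64], int max); /* assumed */
    extern int swarm_fetch_file(const char *swid);                        /* assumed */
    extern int cached_locally(const char *swid);                          /* assumed */

    static void peer_loop(void)
    {
        char hits[32][64];

        for (;;) {
            char key[16];
            snprintf(key, sizeof key, "k%d", rand() % 1000);  /* random attribute key */

            int n = swarmdb_lookup(key, hits, 32);   /* query the replicated index */
            for (int i = 0; i < n; i++)
                if (!cached_locally(hits[i]))
                    swarm_fetch_file(hits[i]);       /* download via local Swarm server */

            sleep(1 + rand() % 3);                   /* once every 1-3 seconds */
        }
    }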
We chose a file size of 50KBytes, large enough to stress the bandwidth usage of the network links but small enough to expose the overheads of Swarm's replica hierarchy formation. We initially bring up the home nodes and then the other nodes in random order, one every 2 seconds. We run them for a warm-up period (10 minutes) to build replica hierarchies for various files. We then churn the nonhome nodes for a period of 30 minutes, followed by a quiet (i.e., no churn) period of 10 minutes before shutting down all the nodes. Peer nodes have an exponentially distributed lifetime (i.e., time spent online). We simulate three different median lifetimes (30 seconds, 1 minute, and 5 minutes) by initiating node death events using a Poisson process, which makes them uncorrelated and bursty. We start a new node each time one is killed to keep the total number of nodes constant during the experiment. This model of churn is similar to that described by Liben-Nowell et al. [42] and modeled during the evaluation of the Bamboo DHT system [59]. We deliberately simulate very short node lifetimes to examine the sensitivity of Swarm's replication and consistency mechanisms to high node churn rates, ranging from an average of 5 node deaths/sec to a node dying every 2 seconds.

We compare three configurations of Swarm with different replica networking schemes. In the first configuration, denoted 'random hierarchies', we disable Swarm's proximity awareness and its associated distance estimation mechanism so that replicas randomly connect to one another without regard for network distances. In the other configurations, we enable proximity-aware replica networking (denoted 'WAN-aware'), but employ different policies for how the Swarm server hosting a file replica reacts when it finds a nearer parent replica candidate than its current parent (i.e., one with a shorter RTT). In the second configuration, denoted 'eager download', Swarm reconnects the replica to the new parent in the background, and does not disconnect from the old parent until the reconnection succeeds. As a result, a new replica ends up downloading file contents from the first nearby replica willing to serve as its parent. This policy is based on the intuition that eagerly disrupting a replica hierarchy based on potentially inaccurate proximity information could increase the latency of ongoing accesses. In the third configuration, denoted 'deferred download', so long as a replica does not have valid file contents (e.g., when it is created), Swarm disconnects it from its current parent in addition to reconnecting to the new parent candidate. Thus, the third policy is more aggressive in choosing a nearer parent for a replica before the file download starts, and is based on the intuition that spending a little more time initially to find a nearer parent pays off by utilizing the network bandwidth more efficiently.
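The difference between the two WAN-aware policies can be summarized by the following sketch of a replica's reaction to discovering a nearer parent candidate; the data structure and helper functions are illustrative assumptions, not Swarm's actual code.

    /* Sketch of the 'eager download' vs. 'deferred download' reconnection
     * policies described above.  All identifiers are hypothetical. */
    struct node;

    enum policy { EAGER_DOWNLOAD, DEFERRED_DOWNLOAD };

    struct replica {
        struct node *parent;
        int          has_valid_contents;   /* file data already downloaded? */
    };

    extern void reconnect_in_background(struct replica *r, struct node *candidate);
    extern void disconnect_from_parent(struct replica *r);

    /* Called when the distance-estimation mechanism finds a candidate parent
     * with a shorter RTT than the current one. */
    static void on_nearer_parent(struct replica *r, struct node *candidate,
                                 enum policy p)
    {
        if (p == EAGER_DOWNLOAD) {
            /* Keep the old parent until the new connection succeeds, so an
             * in-progress download is never disrupted. */
            reconnect_in_background(r, candidate);
        } else { /* DEFERRED_DOWNLOAD */
            if (!r->has_valid_contents)
                /* No contents yet: drop the current parent immediately so
                 * the eventual download uses the nearer link. */
                disconnect_from_parent(r);
            reconnect_in_background(r, candidate);
        }
    }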
Figure 5.20 shows a timeline of the mean latency observed for new file accesses by all nodes in each of the three configurations. Initially, file latencies with the 'WAN-aware' schemes are high because a large number of nodes come online for the first time, learn about each other, and find relative network distances by pings. However, access latencies quickly drop as the network traffic subsides and nodes learn about nearby nodes and file replicas. In the 'random hierarchies' scheme, file access latencies do not improve over time. Though both of the WAN-aware schemes are affected by node churn, their proximity awareness mechanism gets files from nearby replicas and results in much better latencies than the random scheme. Under node churn, the eager download scheme performs worse than the deferred scheme, as a new replica often initiates a file download from a distant replica before it finds and connects to a nearby parent. The deferred download scheme utilizes the network much more efficiently, resulting in better latencies that approach those seen during the stable network phase even when a node is replaced every 2 seconds (corresponding to a median node life span of 5 minutes).

Figure 5.20. New file access latencies in a Swarm-based peer file sharing network of 240 nodes under node churn.

To evaluate the WAN bandwidth savings generated by Swarm's proximity-aware replication, we measured the bytes transferred over the inter-router links. Figure 5.21 shows the incoming bandwidth consumed on router R1's WAN link by Swarm traffic in each of the configurations. The observed bandwidth reductions are similar to the access latency reductions discussed above. To evaluate the overhead imposed by Swarm performing background reconnections in the replica hierarchy, we ran an instance of the experiment in which we disabled reconnections, i.e., a replica with a valid parent never reconnects to a nearer parent; we observed virtually no performance improvement, which indicates that the overhead of replica hierarchy reorganization is not significant even under high churn at this scale. Fewer than 0.5% of file accesses failed, mainly because a Swarm server gets killed before it replies to its local client's file request. Finally, fewer than 0.5% of the messages received by any Swarm server were root probes that Swarm uses to detect cycles in the replica network (as explained in Section 4.6), indicating that our replica cycle detection scheme has negligible overhead even in such a dynamic network.

Figure 5.21. The average incoming bandwidth consumed at a router's WAN link by file downloads. Swarm with proximity-aware replica networking consumes much less WAN link bandwidth than with random networking.

In summary, Swarm's proximity-aware replication and consistency mechanisms enable a peer file sharing application to support widespread coherent file sharing among hundreds of WAN nodes while utilizing the network capacity efficiently. Swarm's proximity-aware replication reduces file download latency and WAN bandwidth utilization to one-fifth that of random replication. Its lease-based consistency mechanism continues to provide close-to-open consistency to users in a large, rapidly changing replica network with nodes failing and new nodes joining at a high rate (up to 5 nodes per second).

5.5 SwarmCast: Real-time Multicast Streaming

Real-time collaboration involves producer-consumer style interaction among multiple users or application components in real time. Its key requirement is that data from producers must be delivered to a potentially large number of consumers with minimal latency while utilizing network bandwidth efficiently. For instance, online chat (for text as well as multimedia) and distributed multiplayer gaming require efficient broadcast of text, audio, video, or player moves among all participants in real time to keep their views of the chat transcript or game state consistent.
A live media streaming application must disseminate multimedia efficiently from one producer to multiple subscribers in real time. The data-delivery requirement of these applications can be viewed as the need to deliver a stream of updates to shared state in real time to a large number of replicas without blocking the originator of those updates. When viewed thus, the requirement translates to eventual consistency with best-effort (i.e., soft) push-based updates. Traditionally, a client-server message-passing architecture is employed for these applications, wherein producers send their data to a central relay server, which disseminates the data to interested consumers. It is well known that multicasting data via a network of relays eases the CPU and network load on the central server and scales to a large number of consumers, at the expense of increasing the end-to-end message transfer delay [53]. Since a multicast relay network reduces the load on each individual relay, it is able to disseminate events in real time at a higher rate than a single server can. Though the configurable consistency framework and Swarm are not specifically optimized for real-time multicast support, their flexible interface allows application designers to leverage Swarm's coherent hierarchical caching support for scalable multicast over LANs or WANs without incurring the complexity of a custom multicast solution. To evaluate Swarm's effectiveness at supporting real-time multicast, we have built a synthetic data flow application called SwarmCast. SwarmCast stresses Swarm's ability to support real-time synchronization of a large number of replicas with eventual consistency in the presence of high-speed updates. It validates our hypothesis that the configurable consistency framework can be used effectively for applications that require real-time data dissemination.

5.5.1 Results Summary

Our results show that SwarmCast is able to scale its transmission bandwidth and the number of real-time consumers supported linearly, simply by redirecting consumers to more Swarm servers. With 120 consumers on a 100Mbps switched LAN, a SwarmCast producer is able to push data via the Swarm server network at 60% of the rate possible using an ideal multicast relay network of equivalent size, while keeping end-to-end latency below 250ms. Half of this latency is incurred on the first hop as messages get queued to be picked up by the root server. This is because Swarm servers are compute-bound due to message marshaling overheads in our unoptimized prototype (e.g., mallocs and data copying).

5.5.2 Implementation

We implemented SwarmCast as an extension to the Unix ttcp program. A single producer simulates a high-speed sequence of events by sending a continuous stream of fixed-size (200-byte) data packets to a central relay server. Numerous subscribers obtain those events in real time by registering with the relay or one of its proxies. We implement two types of relays: (i) a single hand-coded relay server that unicasts each packet to subscribers via TCP/IP sockets (implemented by modifying the Unix ttcp program), and (ii) a network of Swarm servers. To use Swarm servers as relays, we built a wrapper library for SwarmCast that appends event packets to a shared log file hosted at an origin Swarm server as semantic updates. At SwarmCast subscribers, the wrapper library snoops for events by registering with one of a set of Swarm servers to receive log file updates, using Swarm's sw_snoop() API described in Section 4.2.1. In response, Swarm servers cache the log file from the origin and receive semantic updates with eventual consistency via an eager, best-effort push protocol. The Swarm servers build a replica hierarchy automatically and use it to propagate updates to each other as well as to the SwarmCast subscribers. Thus, the Swarm replica hierarchy serves as a multicast overlay network for real-time event dissemination. The wrapper library consists of about 250 lines of C code, including code to measure packet latencies and arrival rates, and does not involve any network I/O.
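A rough sketch of this wrapper's two halves appears below. The sw_snoop() call is named in the text; its signature here, along with the other swarm_* identifiers, is an assumption made for illustration.

    /* Sketch of the SwarmCast wrapper library described above.  sw_snoop()
     * is Swarm's snooping API (Section 4.2.1); its signature and the other
     * swarm_* calls are assumed for this sketch. */
    #include <stddef.h>

    #define PACKET_SIZE 200

    typedef int swarm_file_t;
    extern swarm_file_t swarm_open(const char *swid);                      /* assumed */
    extern int  swarm_append_update(swarm_file_t f, const void *buf,
                                    size_t len);                           /* assumed */
    extern int  sw_snoop(swarm_file_t f,
                         void (*cb)(const void *update, size_t len, void *arg),
                         void *arg);                                       /* signature assumed */
    extern void deliver_to_application(const void *buf, size_t len);

    /* Producer: each event packet becomes a semantic (operational) update
     * appended to the shared log file hosted at the origin Swarm server. */
    void swarmcast_publish(swarm_file_t log, const char packet[PACKET_SIZE])
    {
        swarm_append_update(log, packet, PACKET_SIZE);  /* eventual consistency,
                                                           best-effort eager push */
    }

    /* Subscriber: register a callback with a nearby Swarm server; the
     * replica hierarchy delivers updates to it as they propagate. */
    static void on_packet(const void *update, size_t len, void *arg)
    {
        (void)arg;
        deliver_to_application(update, len);   /* e.g., timestamp and count it */
    }

    void swarmcast_subscribe(const char *log_swid)
    {
        swarm_file_t log = swarm_open(log_swid);
        sw_snoop(log, on_packet, NULL);
    }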
5.5.3 Experimental Setup

We run Swarm servers on each of up to 20 PCs running FreeBSD 4.10. We run the producer process on a separate PC, but colocate multiple subscriber processes to run up to 120 subscribers on 30 PCs. Colocating subscriber processes on a single node did not affect our results; the CPU and memory utilization never crossed 50% on those machines. There were no disk accesses during our experiments. We distribute the subscribers evenly among the available relays (i.e., Swarm servers) such that each relay's total fanout (i.e., the number of local subscribers + child relays) is approximately equal. The producer sends 20,480 200-byte packets to the root relay as one-way messages over a TCP connection, follows them up with an RPC request to block until they are all received, and measures the send throughput. We choose a low packet size of 200 bytes to stress the effect of Swarm's message payload overhead. We expect that Swarm would perform better with higher packet sizes. To highlight the effect of CPU utilization at the relays on SwarmCast's overall performance, we run our experiments with all nodes connected by a 100Mbps switched LAN that provisions nodes with ample network capacity. To study the effect of Swarm replica tree depth on performance, we conduct the Swarm experiments with Swarm fanout limits of 2 and 4.

5.5.4 Evaluation Metrics

We employ three metrics to evaluate relay performance: (i) the aggregate bandwidth delivered to SwarmCast subscribers (the sum of their observed receive throughput), (ii) the rate at which the producer can push data (i.e., its send bandwidth), and (iii) the average end-to-end delay for packets traveling from the producer to subscribers at various levels of the multicast tree. We measure end-to-end packet delay as follows. We run the NTP time sync daemon on all the nodes to synchronize their clocks to within 10 milliseconds of each other. We measure the clock skew between the producer and subscriber nodes by issuing probe RPCs and timing them. We timestamp each packet at the producer and receiving subscriber (using the gettimeofday() system call), measure their difference, and adjust it based on their clock skew. We track the minimum, average, and maximum of the observed delays.
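The per-packet delay computation just described might look as follows; the packet layout and the estimate_skew_usec() helper are assumptions, while gettimeofday() and the skew adjustment are as stated above.

    /* Sketch of the end-to-end delay measurement: packets are timestamped
     * with gettimeofday() at the producer and at the subscriber, and the
     * difference is corrected by the clock skew estimated via probe RPCs. */
    #include <sys/time.h>
    #include <stdint.h>

    struct event_packet {
        struct timeval sent_at;      /* stamped by the producer          */
        char payload[200 - sizeof(struct timeval)];
    };

    extern int64_t estimate_skew_usec(void);   /* probe-RPC based, assumed */

    /* Returns the one-way delay of a received packet in microseconds. */
    int64_t packet_delay_usec(const struct event_packet *pkt)
    {
        struct timeval now;
        gettimeofday(&now, NULL);    /* stamped at the subscriber        */

        int64_t raw = (int64_t)(now.tv_sec - pkt->sent_at.tv_sec) * 1000000
                    + (now.tv_usec - pkt->sent_at.tv_usec);
        return raw - estimate_skew_usec();     /* adjust for clock skew  */
    }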
Since each relay can only send data at its link's maximum capacity of 100Mbps, multiple relays should theoretically deliver aggregate throughput equal to the sum of their link capacities. Also, the producer's ideal send bandwidth equals the root relay's link capacity divided by its total fanout. We compare the performance of Swarm and our hand-coded ttcp-based relay against this ideal performance.

5.5.5 Results

Figures 5.22 and 5.23 compare the throughput and latency characteristics of a single Swarm relay with our hand-coded ttcp-based relay. The high end-to-end latency of Swarm in Figure 5.23 indicates that Swarm is compute-bound. The high latency is caused by packets getting queued in the network due to the slow relay. To confirm this, we measured the latency from packet arrival at the relay to arrival at subscribers and found it to be less than 0.5ms. Since a Swarm relay is compute-bound, when there is a single subscriber it delivers only 52% of its outgoing link bandwidth and accepts data at a much slower rate (29%). Our analysis of Swarm's packet processing path revealed that the inefficiency is mostly an artifact of our unoptimized prototype implementation, namely, 30% extra payload in Swarm messages and unnecessary marshaling and unmarshaling overheads, including data copying and dynamic heap memory allocation in the critical path. When we ran the same experiment on faster PCs with 3GHz Pentium IV CPUs and 1GB of RAM, the delays reduced by a factor of 3, and send and receive throughput went up beyond 90% of the ideal values.

Figure 5.22. Data dissemination bandwidth of a single Swarm vs. hand-coded relay for various numbers of subscribers across a 100Mbps switched LAN. With a few subscribers, the Swarm-based producer is CPU-bound, which limits its throughput. But with many subscribers, Swarm's pipelined I/O delivers 90% efficiency. Our hand-coded relay requires a significant redesign to achieve comparable efficiency.

When a Swarm relay serves a large number of subscribers, Swarm's pipelined I/O implementation enables it to deliver more than 90% of its link bandwidth. The multicast efficiency of our hand-coded ttcp relay (percent of outgoing link capacity delivered to consumers) is 90% with up to 15 subscribers, but drops down to 40% with more than 15 subscribers. This drop is an artifact of our simplistic ttcp implementation, which processes packets synchronously (i.e., sends a packet to all outgoing TCP sockets before receiving the next packet). This effect is visible in Figure 5.23 as a high maximum packet delay with the hand-coded relay when serving more than 15 subscribers. When using Swarm, the send throughput of the SwarmCast producer is lower (60%) than the subscribers' aggregate receive throughput (90%, see Figure 5.22), because the high queuing delay incurred by packets on the first hop to the root server is included in the computation of send throughput.
Figure 5.23. End-to-end packet latency of a single relay with various numbers of subscribers across a 100Mbps LAN, shown on a log-log scale. The Swarm relay induces an order of magnitude longer delay, as it is CPU-bound in our unoptimized implementation; the delay is due to packets getting queued in the network link to the Swarm server.

When multiple Swarm servers are deployed to relay packets, Figures 5.24 and 5.25 show that both the producer's sustained send bandwidth and the aggregate multicast throughput scale linearly. The multicast throughput is shown normalized to 100Mbits/sec, which is the capacity of our LAN links. Figure 5.26 demonstrates that Swarm's multicast efficiency does not degrade significantly as more servers are deployed. Swarm-based multicast with up to 20 relays yields more than 80% of the ideal multicast throughput. Throughput is largely unaffected by Swarm's relay fanout (i.e., the number of relays served by a given relay) and hence by the depth of the relay hierarchy. Figure 5.27 shows that when multiple Swarm servers are employed to multicast packets, SwarmCast's end-to-end latency decreases drastically because there is proportionately less load on each individual server. The load reduction is particularly large on the first two levels of servers in the relay hierarchy. This effect can be observed clearly in Figure 5.28, which presents the latency contributed by each level of the Swarm relay hierarchy to the overall end-to-end packet latency on a log-log scale.

Figure 5.24. A SwarmCast producer's sustained send throughput (KBytes/sec) scales linearly as more Swarm servers are used for multicast (shown on a log-log scale), regardless of their fanout.

Figure 5.25. Adding Swarm relays linearly scales multicast throughput on a 100Mbps switched LAN. The graph shows the throughput normalized to a single ideal relay that delivers the full bandwidth of a 100Mbps link. A network of 20 Swarm servers delivers 80% of the throughput of 20 ideal relays.

Figure 5.26. Swarm's multicast efficiency and throughput scaling. The root Swarm server is able to process packets from the producer at only 60% of the ideal rate due to its unoptimized, compute-bound implementation, but its efficient hierarchical propagation mechanism disseminates them at higher efficiency. The fanout of the Swarm network has a minor impact on throughput.

Figure 5.27. End-to-end propagation latency via the Swarm relay network. Swarm servers are compute-bound for packet processing and cause packet queuing delays. However, when multiple servers are deployed, individual servers are less loaded, which reduces the overall packet latency drastically. With a deeper relay tree (obtained with a relay fanout of 2), average packet latency increases marginally due to the extra hops through relays.

Figure 5.28. Packet delay contributed by various levels of the Swarm multicast tree (shown on a log-log scale). The two slanting lines indicate that the root relay and its first-level descendants are operating at their maximum capacity and are clearly compute-bound. Packet queuing at those relays contributes the most to the overall packet latency.

In summary, the SwarmCast application demonstrates that Swarm supports scalable real-time data multicast effectively, in addition to offloading its implementation complexity from applications. In spite of our unoptimized Swarm implementation, Swarm-based multicast provides over 60% of ideal multicast bandwidth with a large number of relays and subscribers, although it increases end-to-end latency by about 200 milliseconds. Through this application, we also illustrated how, by using Swarm, the complexity of optimizing network I/O and building multicast overlays can be offloaded from application design while achieving reasonably efficient real-time multicast.

5.6 Discussion

In this chapter, we have presented our evaluation of the practical effectiveness of the configurable consistency framework for supporting wide area caching. With four distributed services built on top of Swarm, we showed how our implementation of configurable consistency meets diverse consistency requirements, enables applications to exploit locality, utilizes wide area network resources efficiently, and is resilient to failures at scale: four properties that are important for effective wide area replication support.
In Section 5.2, we presented a distributed file system called Swarmfs that supports the close-to-open semantics required for sequential file sharing over the wide area more efficiently than Coda. Unlike other file systems, Swarmfs can also support reliable file locking to provide near-local file latency for source code version control operations when development is highly localized, and up to half the latency of a traditional client-server system. This demonstrates the customizability and network economy of our consistency solution.

In Section 5.3, we showed that Swarm's novel contention-aware replication control mechanism enables enterprise services to efficiently provide wide area proxy caching of enterprise objects with strong consistency under all levels of contention for shared objects. The mechanism automatically inhibits caching when there is high contention, to provide latency within 10% of a traditional client-server system, and enables caching when accesses are highly localized, to provide near-local latency and throughput.

In Section 5.4, we presented a Swarm-based library that augments the BerkeleyDB database library with replication support. We showed how database-backed applications that use this library to cache the database can leverage the same code base to meet different consistency requirements by specifying different configurable consistency choices. We showed that selectively relaxing consistency for reads and writes improves application throughput by several orders of magnitude and enhances scalability.

In Section 5.4.3, we showed how Swarm's proximity-aware replication can be leveraged to support widespread coherent read-write file sharing among hundreds of WAN nodes while reducing WAN bandwidth to roughly one-fifth that of random replication. Swarm's lease-based consistency enforcement mechanism ensures that users get close-to-open consistency for file sharing with high availability in a large network with nodes failing and new nodes joining at a high rate (up to 5 nodes per second). This application demonstrates the failure-resilience and network economy properties of our consistency solution at scale.
Together, these results validate our thesis that the small set of configurable consistency options provided by our consistency framework can satisfy the diverse sharing needs of distributed applications effectively. By adopting our framework, a middleware data service (as exemplified by Swarm) can provide coherent wide area caching support for diverse distributed applications and services.

CHAPTER 6

RELATED WORK

Data replication as a technique to improve the availability and performance of distributed services has been studied extensively by previous work. Since our work targets middleware support for replicated data management, it overlaps with a large body of existing work. In this chapter, we discuss prior research on replication and consistency in relation to Swarm. For ease of presentation, we classify existing replicated data management systems into four categories: (i) systems that target specific application domains (e.g., file systems, conventional databases, collaboration), (ii) systems that provide flexible consistency management, (iii) systems that address the design issues of large-scale or wide area replication, and (iv) systems that provide middleware or infrastructural support for distributed data management.

The array of available consistency solutions for replicated data management can be qualitatively depicted as the two-dimensional spectrum shown in Figure 6.1, based on the diversity of data sharing needs they support (flexibility) and the design effort required to employ them for a given application (difficulty of use). Domain-specific solutions lie at one corner of the spectrum. They are very well tuned and ready to use for specific data sharing needs, but are hard to adapt to a different context. At the other extreme lie application-specific consistency management frameworks such as those of Oceanstore [22] and Globe [75]. They can theoretically support arbitrarily diverse sharing needs, but only provide hooks in the data access path where the application is given control to implement its own consistency protocol. Configurable consistency and TACT offer a middle ground in terms of flexibility and ease of use. They both provide a generic consistency solution with a small set of options that enable their customization for a variety of sharing needs. However, to use the solution, the specific consistency needs of an application must be expressed in terms of those options.
Figure 6.1. The spectrum of available consistency management solutions. We categorize them based on the variety of application consistency semantics they cover and the effort required to employ them in application design.

Next, we discuss several existing approaches to consistency management in relation to configurable consistency.

6.1 Domain-specific Consistency Solutions

The data-shipping paradigm is applicable to many application domains, and a variety of replication and consistency solutions have been devised for various specific domains. An advantage of domain-specific solutions is that they can exploit domain knowledge for efficiency. In this section, we outline the consistency solutions proposed in the context of a variety of application classes and compare them with configurable consistency and Swarm.

Distributed file systems such as NFS [64], Sprite [48], AFS [28], Coda [38], Ficus [57], ROAM [56] and Pangaea [62] target traditional file access with low write-sharing among users. They provide different consistency solutions to efficiently support specific file sharing patterns and environments. For instance, Sprite provides the strong consistency guarantee required for reliable file locking in a LAN. AFS provides close-to-open
This semantics favors data currency over transaction serializability, and is useful for financial quote services. Some systems relax the strict serialization requirement for transactions if the operations involved are commutative (such as incrementing a counter). Finally, systems that allow concurrent writes resolve write conflicts by re-executing transactions and provide hooks for application-level conflict resolution [18]. In contrast to these systems, configurable consistency framework takes a compositional approach to supporting these semantics. 187 It provides options to independently express the concurrency and data currency requirements of transactions, and their data dependencies. Distributed shared memory (DSM) systems provide a shared address space abstraction to simplify parallel programming on distributed memory machines such as multiprocessors and networks of workstations. A key hurdle to achieving good performance in a DSM application is its high sensitivity to the cost of consistency-related communication. The Munin system [13] proposed multiple consistency protocols to suit several common sharing patterns of shared variables and showed that they substantially reduce communication costs in many DSM-based parallel applications. Swarm employs a similar methodology to reduce consistency-related communication in wide-area applications. However, its configurable consistency framework is designed to handle problems unique to the wide area such as partial node/network failures. False-sharing is a serious problem in page-based DSM systems as it forces replicas to synchronize more often than necessary. It is caused by allocating multiple independently accessed shared variables on a single shared memory page. Munin handles false-sharing by allowing concurrent writes to distinct shared variables even if they reside on a single page. It employs write-combining, i.e., collecting multiple updates to a page and propagating them once during memory synchronization. Write-combining can be achieved by using Swarm’s manual replica synchronization control. Distributed object systems avoid false-sharing by enforcing consistency at the granularity of shared objects instead of pages. The configurable consistency framework provides options that allow parallel updates, and relies on an application-provided plugin to resolve write conflicts. The plugin’s resolution routine can perform write-combining similar to Munin. PerDis [65] provides DSM-style access to pointer-rich persistent shared objects in cooperative engineering (CAD) applications. It replicates CAD objects at the granularity of an object cluster (e.g., a ‘file’ representing a CAD design) instead of individual objects. To ensure referential integrity, PerDis requires applications to access object clusters within transactions. Pessimistic transactions serialize accesses via locking, whereas optimistic transactions allow parallel accesses and detect conflicts, but require the application to resolve them. To allow cooperative data sharing among untrusting companies, PerDis 188 distinguishes between administrative/geographic domains. Interdomain consistency is enforced by checking out remote domain clusters via a gateway node and checking in updates later. Collaborative applications can be classified as synchronous where participants must be in contact in real-time, vs. asynchronous where participants choose when to synchronize with each other [21]. 
PerDis [65] provides DSM-style access to pointer-rich persistent shared objects in cooperative engineering (CAD) applications. It replicates CAD objects at the granularity of an object cluster (e.g., a 'file' representing a CAD design) instead of individual objects. To ensure referential integrity, PerDis requires applications to access object clusters within transactions. Pessimistic transactions serialize accesses via locking, whereas optimistic transactions allow parallel accesses and detect conflicts, but require the application to resolve them. To allow cooperative data sharing among mutually distrusting companies, PerDis distinguishes between administrative/geographic domains. Interdomain consistency is enforced by checking out remote domain clusters via a gateway node and checking in updates later.

Collaborative applications can be classified as synchronous, where participants must be in contact in real time, or asynchronous, where participants choose when to synchronize with each other [21]. Multiplayer games and instant messaging belong to the former category, whereas shared calendars, address books, email and other groupware (e.g., Lotus Notes [32]) belong to the latter. Distributed games must disseminate player moves to other players in a common global order and resolve state inconsistencies with minimum latency and bandwidth consumption [53]. In other words, conflicts due to concurrent state updates by players must be detected and resolved in real time. Multiplayer games typically employ a client-server architecture for simplicity: a central arbiter receives all updates, globally orders and disseminates them, and resolves conflicts. However, the arbiter's outgoing bandwidth requirement grows quadratically with the number of clients. A hierarchical replica network (such as that provided by Oceanstore or Swarm) can be used to disseminate updates in a more bandwidth-efficient manner.

Asynchronous collaboration requires that participants be given complete freedom to update data when and where needed, without having to coordinate with others for each access. Bayou [21] proposed epidemic-style replication with eventual consistency for applications of this type. The configurable consistency framework provides optimistic failure-handling and manual synchronization options to support asynchronous collaboration.

6.2 Flexible Consistency

Previous research efforts have identified the need for flexibility in consistency management for wide area applications. Existing systems provide this flexibility in several ways: (i) by providing a set of discrete choices of consistency policies to suit the common needs of applications in a particular domain [74, 49], (ii) by providing a set of application-independent consistency mechanisms that can be configured to serve a variety of application needs [77], or (iii) by providing hooks in the data access path where the application is given control to implement its own consistency management [75, 12, 60]. Consistency solutions of the first type are simple to adopt. They are well-suited to applications whose requirements exactly match the discrete choices supported and rarely change.

WebOS [74] provides three discrete consistency policies in its distributed file system that can be set individually on a per-file-access basis: last-writer-wins, which is appropriate for rarely write-shared Unix file workloads; append-only consistency, which is appropriate for interactive chat and whiteboard applications; and support for broadcasting file updates or invalidation messages to a large number of wide area clients. Fluid replication [16] provides last-writer-wins, optimistic and strong consistency flavors. Bayou, on the other hand, guarantees only eventual consistency among replicas, as it relies on epidemic-style update propagation whenever replicas come into contact with each other. It requires applications to supply procedures to apply updates and to detect and resolve conflicts. For clients that require a self-consistent view of data in spite of mobility, it offers four types of session guarantees, which we discussed in Section 3.5.

TACT [77] provides a consistency model that offers continuous control over the degree of replica divergence along several orthogonal dimensions. Our approach to flexible consistency comes closest to that of TACT. As discussed in Section 3.5.5, TACT provides continuous control over concurrency by bounding the number of uncommitted updates that a replica can hold before it must serialize them with remote updates.
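The core of this numerical bounding can be pictured with a small sketch: a replica buffers tentative local writes and is forced to synchronize once the number of uncommitted updates reaches its configured bound. The code below is a schematic paraphrase of that idea under assumed names (BoundedReplica, order_bound, synchronize); it is not TACT's actual API.

```python
# Schematic illustration of bounding uncommitted updates at a replica: once
# the bound is reached, the replica must synchronize (serialize its tentative
# writes with remote updates) before it may accept further writes.

class BoundedReplica:
    def __init__(self, order_bound, synchronize):
        self.order_bound = order_bound      # max tentative (uncommitted) writes
        self.synchronize = synchronize      # callback that commits/serializes them
        self.tentative = []

    def write(self, update):
        if len(self.tentative) >= self.order_bound:
            self.synchronize(self.tentative)   # blocking anti-entropy / commit
            self.tentative.clear()
        self.tentative.append(update)

# With order_bound=0 every write synchronizes first (strong consistency);
# a large bound approaches purely optimistic, eventually consistent behavior.
```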
Configurable consistency, by comparison, provides only a binary choice between fully exclusive and fully concurrent access, for the reasons discussed in Section 2.7. The TACT model also allows applications to express concurrency constraints in terms of application-level operations (e.g., allow multiple enqueue operations in parallel), whereas the configurable consistency framework requires such operations to be mapped to one of its available concurrency control modes for reads and writes. In contrast, the configurable consistency framework offers additional flexibility in some aspects that the TACT model cannot match. For instance, causal and atomic dependencies are hard to express in the TACT model, as are the soft best-effort timeliness guarantees required by applications like chat and event dissemination where writers cannot be blocked. Expressing them in the configurable consistency framework is straightforward.

Systems such as Globe [75] and Oceanstore [60] require applications to enforce consistency on their own by giving them control before and after data access. Oceanstore provides additional system-level support for update propagation by organizing replicas into a multicast tree. Each Oceanstore update can be associated with a dependency check and a merge procedure, where the application must provide its consistency management logic to be executed by the Oceanstore nodes during update propagation.

In addition to the above, a variety of consistency solutions have been proposed to offer flexibility for specific classes of applications. We have adopted some of their options into our framework based on their value as revealed by our application survey. A detailed comparison of the configurable consistency framework with those solutions is available in Section 3.5.

6.3 Wide-area Replication Algorithms

A variety of techniques have been proposed to support scalable replication over the wide area. In this section, we discuss these techniques in the areas of replica networking, update propagation, consistency maintenance and failure handling.

6.3.1 Replica Networking

For scalability, systems organize replicas into networks with various topologies. Client-server file systems such as AFS, NFS and Coda organize replicas into a simple two-level static hierarchy; clients can only interact with the server, not with other clients. Fluid replication [16] introduces an intermediate cache between clients and the server, called a WayStation, to efficiently manage client access across WAN links. Blaze's PhD thesis [10] showed the value of constructing dynamic per-file cache hierarchies to support efficient large-scale file sharing and to reduce server load in distributed file systems. Pangaea [62] organizes file replicas into a dynamic general graph, as it keeps replicas connected despite several links going down. A graph topology works well to support eventual consistency via update flooding, but cycles cause deadlocks when employing the recursive pulls needed to enforce stronger consistency guarantees. Active Directory [46] employs a hierarchy, but adds shortcut paths between siblings for faster update propagation. Within a site, it connects replicas into a ring for tolerance to single link failures. Bayou offers the maximum flexibility in organizing replicas, as it maintains consistency via pairwise synchronization. It is well-suited to asynchronous collaborative applications, which need this flexibility.
However, to retain this flexibility, Bayou must exchange version vectors proportional in size to the total number of replicas during every synchronization, which is wasteful when replicas stay connected much of the time. Swarm replicas exchange much smaller relative versions when connected, and fall back on version vectors during replica topology changes.

The ROAM [56] mobile wide area file system replicates files on a per-volume basis. For efficient WAN link utilization, it organizes replicas into clusters called WARDs, each with a designated WARD master replica. Replicas within a WARD are expected to be better connected than replicas in different WARDs. However, a WARD's membership and master replica are manually determined by its administrator. Inter-WARD communication happens only through WARD masters. Within a WARD, all replicas communicate in a peer-to-peer fashion by forming a dynamic ring. Unlike ROAM's manual organization, Pangaea builds its graph topology dynamically on a per-file basis, and achieves network economy by building a spanning tree using node link quality information gained from an NxN ping algorithm. Unlike ROAM, and like Pangaea, Swarm builds a proximity-aware replica hierarchy automatically for each file. A new Swarm replica decides for itself which other replicas are closer to it and elects one of them as its parent. This is in contrast to Pangaea, where the best neighbors for a new replica are chosen by existing replicas based on a transitive network distance computation.

6.3.2 Update Propagation

Many replicated systems (e.g., Bayou, Rover, TACT) synchronize replicas by re-executing semantic operations instead of propagating modified contents, for efficiency. Most optimistically replicated systems employ version vectors (VVs) to determine the updates to be exchanged by replicas during synchronization and to detect conflicts. VVs give replicas complete flexibility in choosing other replicas for synchronization. However, the replica topology affects the overall synchronization latency as well as the amount of state exchanged. For replicas to be able to synchronize arbitrarily with each other, they must maintain and exchange version vectors proportional to the total number of replicas in the system [14]. Bayou [54] implements full-length version vectors to retain flexibility in update propagation. Since this hinders large-scale replication (to thousands of replicas), many solutions have been proposed to prune the size of version vectors by somewhat compromising on flexibility. LAN-based systems employ loosely synchronized clocks to avoid having to maintain full-length VVs for each replicated object. ROAM restricts version vector size to that of a WARD by allowing replicas to synchronize directly only with others within their WARD.
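For reference, the sketch below shows the standard version-vector operations that underlie this style of synchronization: comparing two vectors to decide whether one replica's state dominates the other's or whether the two hold concurrent (potentially conflicting) updates, and merging vectors after synchronization. It is a generic textbook illustration, not code from Bayou, ROAM, or Swarm.

```python
# Generic version-vector (VV) helpers: one counter per replica, so the vector
# size grows with the total number of replicas in the system.

def dominates(a, b):
    """True if vector a has seen every update that vector b has."""
    replicas = set(a) | set(b)
    return all(a.get(r, 0) >= b.get(r, 0) for r in replicas)

def concurrent(a, b):
    """Neither dominates the other: the replicas hold conflicting updates."""
    return not dominates(a, b) and not dominates(b, a)

def merge(a, b):
    """Entry-wise maximum, taken after the replicas exchange missing updates."""
    return {r: max(a.get(r, 0), b.get(r, 0)) for r in set(a) | set(b)}

# Example: replicas A and C updated independently -> a conflict is detected.
vv_a = {"A": 3, "B": 1}
vv_c = {"A": 2, "B": 1, "C": 4}
assert concurrent(vv_a, vv_c)
assert merge(vv_a, vv_c) == {"A": 3, "B": 1, "C": 4}
```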
The main focus of this dissertation is the configurable consistency framework. As such, we have described how it can be enforced in systems employing loosely synchronized clocks as well as version vectors. In Section 4.8, we have also outlined several optimizations that are possible in a hierarchy-based implementation of CC such as Swarm.

6.3.3 Consistency Maintenance

Most systems maintain replica consistency by employing a protocol based on invalidations, updates, or a hybrid combination of the two. Invalidation-based protocols (e.g., AFS) pessimistically block accesses to prevent conflicts, whereas update-based protocols never block reads. Optimistic algorithms (e.g., Pangaea, Bayou, ROAM) do not block writes either, but deal with conflicts after they happen. The TACT toolkit [77] never blocks reads, but enforces consistency before allowing writes by synchronizing replicas via pull and push operations. TACT enforces consistency via iterative messaging, where each replica synchronizes with all other replicas directly to enforce the desired consistency guarantees. Swarm synchronizes replicas recursively by propagating pull and push requests along a replica hierarchy. Thus, each replica communicates in parallel with a small (fixed) number of neighbors, enabling Swarm's consistency protocol to scale to a large number of replicas.

When strong consistency guarantees need to be enforced, interleaved accesses among replicas can cause thrashing, where replicas spend much of their time shuttling access privileges instead of accomplishing useful computation. Function-shipping avoids this problem by disabling caching and redirecting all data accesses to a central site. Database and object systems have long recognized the performance tradeoffs between function-shipping and data-shipping [11, 50, 52]. Many provide both paradigms, but require the choice to be made for a given data item at application design time. In contrast, Swarm's contention-aware caching scheme allows applications to make the choice dynamically. The Sprite file system [48] disables client caching when a file is write-shared. Swarm instead chooses the most frequently accessed copy as the master. The Munin DSM system freezes a locking privilege at a replica for a minimum amount of time (hundreds of milliseconds) before allowing it to migrate away. This scheme is simple to implement and works well in a LAN environment. However, in the wide area, the cost of privilege-shuttling is much higher. When accesses to an object are interleaved but some replicas exhibit higher locality than others, the time-based privilege-freezing approach still allows shuttling, which quickly offsets the speedup achieved. We implemented time-based hysteresis in Swarm and observed this problem in a WAN environment. We found that performance can be improved by actively measuring locality and redirecting other accesses to the replica that exhibits the higher locality.
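An illustrative reduction of that idea appears below, under assumed bookkeeping: each replica counts its recent accesses, and the master role migrates only when another replica shows a clear locality advantage over the current master, which limits privilege shuttling under interleaved WAN access. The choose_master function and its margin-based rule are hypothetical simplifications, not Swarm's actual contention-aware algorithm.

```python
# Illustrative locality-driven master selection: migrate the master role only
# when another replica's recent access count clearly exceeds the master's,
# so the privilege does not shuttle back and forth over WAN links.

def choose_master(current_master, access_counts, margin=2.0):
    """access_counts: replica id -> accesses observed in the last epoch."""
    busiest = max(access_counts, key=access_counts.get)
    if busiest == current_master:
        return current_master
    # Require a clear locality advantage before paying the WAN migration cost.
    if access_counts[busiest] >= margin * max(access_counts.get(current_master, 0), 1):
        return busiest
    return current_master

# Example: replica "ny" touches the object far more often than master "sf",
# so the master role (and most consistency traffic) moves to "ny".
assert choose_master("sf", {"sf": 3, "ny": 40, "ldn": 5}) == "ny"
assert choose_master("sf", {"sf": 30, "ny": 40, "ldn": 5}) == "sf"
```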
The Mariposa distributed data manager [66] is a middleware service that sits between a database and its clients, and spreads the query and update processing load among a collection of autonomous wide area sites. Mariposa uses an economic model both to allocate query processing work among sites and to decide when and where to replicate database table fragments to spread load. Each site "pays" another site to supply a cached fragment copy and to send it an update stream that keeps its contents within a specified staleness limit. Clients can specify an upper bound on staleness and a budget for each query. The system splits the query into subqueries on fragments, and redirects the subqueries to sites that bid to execute them at the lowest price. Since there is a cost associated with keeping a cached fragment copy, each site keeps a copy only as long as it can recover its cost by serving queries. Thus, the system's economic model causes sites to switch autonomously between function-shipping and data-shipping. Mariposa runs a bidding protocol to choose the best sites to execute each query. This approach works well when the amount of processing involved in queries is high enough to justify the overhead of bidding. In contrast, Swarm clients send their requests to an arbitrary replicated server site and let the system direct the requests to a site that triggers minimal consistency traffic.

6.3.4 Failure Resilience

Any wide area system must gracefully handle failures of nodes and network links. Client-server systems such as NFS and AFS recover from failures by employing time-limited leases. Swarm implements hierarchical leases for failure recovery in a peer-to-peer replication environment.
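To make the lease mechanism concrete, the sketch below shows the basic invariant behind time-limited leases: a holder may act on cached state only while its lease is unexpired, and the grantor may unilaterally reclaim the privilege once the lease duration has passed, even if the holder is unreachable. This is a generic illustration of leases, not Swarm's hierarchical lease protocol.

```python
import time

# Generic time-limited lease: if the holder crashes or is partitioned away,
# the grantor can safely reclaim the privilege after the lease expires.

class Lease:
    def __init__(self, holder, duration_s):
        self.holder = holder
        self.duration_s = duration_s
        self.expiry = time.monotonic() + duration_s

    def valid(self):
        return time.monotonic() < self.expiry

    def renew(self):
        # Holders renew well before expiry to keep their cached privilege.
        self.expiry = time.monotonic() + self.duration_s


def can_reclaim(lease):
    """The grantor waits out the lease instead of contacting a failed holder."""
    return not lease.valid()
```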
A large-scale peer-to-peer system must also handle node churn, where nodes continually join and leave the system. Recent work has studied the phenomenon of node churn in the context of peer-to-peer Distributed Hash Table (DHT) implementations [59]. That study made several important observations. First, systems are better off not reacting immediately to a failure, as doing so is likely to cause a positive feedback cycle of further failures due to congestion. Second, choosing message timeouts based on actual measured round-trip times performs better under churn than choosing large, statically configured timeouts. Finally, when choosing nearby nodes as overlay neighbors in a peer-to-peer network, employing a simple algorithm with low bandwidth overhead is more effective at reducing latency under churn, since the neighbor set can change rapidly. Our experience with Swarm confirms the validity of those observations. To avoid positive feedback loops due to congestion, a Swarm replica handles a busy neighbor differently from an unreachable neighbor through its use of TCP channels for communication. It treats a neighbor as reachable as long as it can establish a TCP connection with that neighbor, and thus quickly recovers from link and node failures. Swarm employs much longer timeouts to handle busy neighbors, which avoids the positive feedback cycles mentioned above.

Finally, a number of voting-based algorithms have been proposed that employ redundancy to tolerate permanent failure of primary replicas [27]. Our Swarm prototype does not implement multiple custodians and hence cannot tolerate permanent failure of custodians.

6.4 Reusable Middleware

Our goal in designing the configurable consistency framework was to pave the way for effective middleware support for replication in a wide variety of distributed services. Recent research efforts have developed reusable middleware to simplify distributed application development. Thor [44] is a distributed object-oriented database specifically designed to ease the development of future distributed applications. Its designers advocate that persistent shared objects (as opposed to file systems or shared memory) provide the right abstraction for distributed application development, as an object-based approach facilitates tackling a number of hard issues at the system level, such as data and functionality evolution and garbage collection, in addition to caching and consistency. Thor supports sharing at a per-object granularity via transactions. It provides two consistency flavors: strong consistency via locking, and eventual consistency via optimistic replica control.

Oceanstore [60] takes a different approach by providing a global persistent storage utility that exports immutable files and file blocks as the units of data sharing. Unlike Swarm, which updates files in place, each update in Oceanstore creates a new snapshot of a file. Oceanstore classifies servers into those that can be trusted to perform replication protocols and those that cannot be trusted. Though untrusted servers can accept updates, those updates cannot be committed until the client directly contacts the trusted servers and confirms them. Oceanstore's main goals are secure sharing and long-term durability of data, and it accepts poor write performance to achieve them.

CHAPTER 7

FUTURE WORK AND CONCLUSIONS

Before we summarize the contributions of this thesis, we outline several important issues that must be addressed before configurable consistency can be widely adopted by middleware services to support wide area replication, and we suggest ways to address them in future work.

7.1 Future Work

In this dissertation, we showed how a middleware can adopt the configurable consistency framework to provide flexible consistency management to a variety of wide area applications. We did this by presenting a proof-of-concept implementation of a middleware called Swarm and by showing that applications with diverse needs perform well using it. Implementing configurable consistency in a more realistic middleware storage service would allow us to gauge its performance and usability for mainstream application design more accurately. For instance, our prototype does not support multiple custodians for tolerance to permanent failures. However, in realistic systems, relying on the accessibility of a single root replica can severely reduce availability. Also, our prototype does not support the causality and atomicity options, which are essential to support the transactional model of computing. Support for transactional properties is essential for database and directory services.

7.1.1 Improving the Framework's Ease of Use

Although the large set of configurable consistency options provides flexibility, determining the right set of options to employ for a given application scenario may not always be obvious, which could hinder the framework's usability. To ease the design burden on application programmers, there needs to be a higher-level interface to our framework that provides more restricted but popular and well-understood semantics that can also be safely employed in combination for specific application areas. However, as we argued in Chapter 2, applications still need the framework's flexibility in some scenarios, in which case they can use the framework's more expressive interface.

7.1.2 Security and Authentication

Data management over the wide area often requires application components such as clients and servers to communicate across multiple administrative and/or trust domains. Security is important at two levels. First, the middleware must provide a mechanism that ensures that only authenticated users and applications get access to their data. Second, the distributed components that comprise the middleware might themselves belong to multiple administrative domains that are mutually distrusting. In that case, they need to authenticate to each other and secure their communication channels via mechanisms such as SSL encryption. The consistency protocol that we presented in this thesis assumes that the communicating parties trust each other and behave in a fail-stop manner. Extending it to operate correctly among mutually distrusting parties requires further research. However, if the middleware belongs to a single administrative domain, or if the distributed components can authenticate to each other and can be guaranteed to run trusted code (as in the case of middleware systems deployed across a corporate intranet), our existing consistency protocol is adequate.
7.1.3 Applicability to Object-based Middleware

Many researchers advocate the use of a persistent shared object abstraction (as opposed to file systems or shared memory) to ease distributed application development, as it allows a number of hard issues, such as data and functionality evolution and garbage collection, to be tackled at the system level in addition to caching and consistency. We believe that implementing configurable consistency in an object-based middleware system is feasible, because our framework assumes a general data access interface (described in Section 3.2) that applies to a variety of data abstractions, including object systems. However, our belief can only be validated by an actual implementation.

7.2 Summary

In this dissertation, we presented a novel approach to flexible consistency management called configurable consistency to support caching efficiently for diverse distributed applications in non-uniform network environments (including WANs). Although wide area caching can improve end-to-end performance and availability for many distributed services, there currently exists no data caching support that applications can use to manage write-shared data effectively. Several middleware storage services support caching for read-mostly data or rarely write-shared data suitable for traditional file access. However, their consistency mechanisms are not flexible enough to express and enforce the diverse sharing semantics needed by applications such as databases, directory services and real-time collaborative services. As a result, applications that use these middleware services are forced to employ their own replication and consistency solutions outside the middleware.

To determine the feasibility of a replication solution that supports the diverse sharing needs of a variety of applications, we surveyed applications in three diverse classes: file sharing, database and directory services, and collaboration. Our application study, presented in Chapter 2, revealed that applications differ significantly in their data characteristics and consistency needs. Some applications, such as file services and databases, have dynamically varying consistency requirements on the same data over time. Supporting such requirements warrants a high degree of customizability in consistency management. We also observed that certain core design choices recur in the consistency management of diverse applications, although different applications need to make different sets of choices. Identifying this commonality in the consistency enforcement options enabled us to develop a new taxonomy that classifies the consistency needs of diverse applications along five orthogonal dimensions. Based on this taxonomy, we developed a new consistency management framework called configurable consistency that can express a broader set of consistency semantics than existing solutions using a small set of options along those dimensions.

Our consistency framework (presented in Chapter 3) provides several benefits over existing solutions. By adopting our consistency framework, a middleware storage service can support read-write WAN replication efficiently for a wider variety of applications than by adopting an existing solution.
If a distributed service is designed to use the configurable consistency interface, its consistency semantics can be customized for a broader variety of sharing needs via a simple reconfiguration of the interface's options than if the same application were written using another consistency solution. We have demonstrated these benefits by presenting the design and implementation of configurable consistency in a prototype middleware data service called Swarm (in Chapter 4). We showed that by adopting our framework, Swarm is able to support four network services with diverse sharing needs with performance comparable to (i.e., within 20% of) or exceeding (by up to 500%) that of alternate implementations. Our Swarm-based distributed file service can support the consistency semantics provided by existing file systems on a per-file-session basis. We showed that it can exploit access locality more effectively than existing file systems, while supporting traditional file sharing as well as reliable file locking. We presented a proxy-caching enterprise service that supports wide area caching of enterprise objects with strong consistency and outperforms traditional client-server solutions at all levels of contention for shared data, thanks to our novel contention-aware replication control algorithm. We illustrated how transparent replication support can be added to the popular BerkeleyDB database library using Swarm. We showed that configurable consistency enables a replicated BerkeleyDB to be reused, via simple reconfiguration, for a variety of application scenarios with diverse consistency needs. If the application can tolerate relaxed consistency, our framework provides consistency semantics that improve performance by several orders of magnitude beyond that of client-server BerkeleyDB.

Although in this dissertation we present the implementation of configurable consistency in the context of a distributed file store, our framework can be implemented in any middleware that provides a session-oriented data access interface to its applications. Thus, it is applicable to object-based middleware as well as clustered file and storage services. For instance, existing clustered storage services predominantly employ read-only replication for load-balancing of read requests and for fault-tolerance using replicas as standbys. Augmenting those services to provide configurable consistency would enable them to support coherent read-write replication for a wide variety of services. Read-write replication, even within a cluster, can improve the performance of some applications by utilizing the cluster's resources better (e.g., by spreading write traffic among them). Moreover, extending the storage service to provide wide area caching would benefit all of its applications.

In summary, the main contributions of this dissertation are the following: (1) a new taxonomy for classifying the consistency needs of distributed applications, (2) a new consistency framework based on that taxonomy that can express the needs of a broader variety of applications than existing consistency solutions, and (3) a proof by demonstration that the framework can be implemented efficiently in a middleware to support the sharing needs of a wide variety of distributed applications.

REFERENCES

[1] Adya, A. Weak consistency: A generalized theory and optimistic implementations for distributed transactions. Tech. Rep. TR-786, MIT/LCS, 1999.

[2] Ahamad, M., Hutto, P., and John, R. Implementing and programming causal distributed shared memory. In Proceedings of the 11th International Conference on Distributed Computing Systems (May 1991), pp. 274–281.
[3] American National Standard for Information Systems. Database language - SQL, 1992.

[4] Attiya, H., and Welch, J. L. Sequential consistency versus linearizability. In Proceedings of the 3rd Annual ACM Symposium on Parallel Algorithms and Architectures (Hilton Head, South Carolina, July 1991), pp. 304–315.

[5] Badrinath, B., and Ramamritham, K. Semantics-based concurrency control: beyond commutativity. ACM Transactions on Database Systems 17, 1 (Mar. 1992).

[6] Bal, H., Kaashoek, M., and Tanenbaum, A. Orca: A language for parallel programming of distributed systems. IEEE Transactions on Software Engineering (Mar. 1992), 190–205.

[7] Berenson, H., Bernstein, P., Gray, J., Melton, J., O'Neil, E., and O'Neil, P. A critique of ANSI SQL isolation levels. In Proceedings of SIGMOD '95 (May 1995).

[8] Bernstein, P., Hadzilacos, V., and Goodman, N. Concurrency Control and Recovery in Database Systems. Addison-Wesley, Reading, Massachusetts, 1987.

[9] Bershad, B., Zekauskas, M., and Sawdon, W. The Midway distributed shared memory system. In COMPCON '93 (Feb. 1993), pp. 528–537.

[10] Blaze, M. Caching in Large Scale Distributed File Systems. PhD thesis, Princeton University, 1993.

[11] Bohrer, K. Architecture of the San Francisco frameworks. IBM Systems Journal 37, 2 (Feb. 1998).

[12] Brun-Cottan, G., and Makpangou, M. Adaptable replicated objects in distributed environments. Second BROADCAST Open Workshop (1995).

[13] Carter, J. Design of the Munin distributed shared memory system. Journal of Parallel and Distributed Computing 29, 2 (Sept. 1995), 219–227.

[14] Charron-Bost, B. Concerning the size of logical clocks in distributed systems. Information Processing Letters 39 (1991), 11–16.

[15] Inktomi Corporation. The technology behind HotBot. http://www.inktomi.com/whitepap.html, May 1996.

[16] Cox, L., and Noble, B. Fast reconciliations in Fluid replication. In Proceedings of the 21st International Conference on Distributed Computing Systems (Apr. 2001).

[17] CVS. The Concurrent Versions System. http://www.cvshome.org/, 2004.

[18] Demers, A., Petersen, K., Spreitzer, M., Terry, D., Theimer, M., and Welch, B. The Bayou architecture: Support for data sharing among mobile users. In Proceedings of the Workshop on Mobile Computing Systems and Applications (Dec. 1994).

[19] Dubois, M., Scheurich, C., and Briggs, F. Synchronization, coherence, and event ordering in multiprocessors. IEEE Computer 21, 2 (Feb. 1988), 9–21.

[20] eBay. http://www.ebay.com/, 2004.

[21] Edwards, W. K., Mynatt, E. D., Petersen, K., Spreitzer, M. J., Terry, D. B., and Theimer, M. M. Designing and implementing asynchronous collaborative applications with Bayou. In Tenth ACM Symposium on User Interface Software and Technology (Banff, Alberta, Canada, Oct. 1997), pp. 119–128.

[22] Kubiatowicz, J., et al. OceanStore: An architecture for global-scale persistent storage. In Proceedings of the 9th Symposium on Architectural Support for Programming Languages and Operating Systems (Nov. 2000).

[23] Fox, A., and Brewer, E. Harvest, yield and scalable tolerant systems. In Proceedings of the Seventh Workshop on Hot Topics in Operating Systems (Mar. 1999).

[24] Fox, A., Gribble, S., Chawathe, Y., Brewer, E., and Gauthier, P. Cluster-based scalable network services. In Proceedings of the 16th Symposium on Operating Systems Principles (Oct. 1997).
[25] Francis, P., Jamin, S., Jin, C., Jin, Y., Raz, D., Shavitt, Y., and Zhang, L. IDMaps: A global internet host distance estimation service. IEEE/ACM Transactions on Networking (TON) 9, 5 (Oct. 2001), 525–540.

[26] Gharachorloo, K., Lenoski, D., Laudon, J., Gibbons, P., Gupta, A., and Hennessy, J. Memory consistency and event ordering in scalable shared-memory multiprocessors. In Proceedings of the 17th Annual International Symposium on Computer Architecture (Seattle, Washington, May 1990), pp. 15–26.

[27] Gifford, D. Weighted voting for replicated data. In Proceedings of the 13th Symposium on Operating Systems Principles (1979), pp. 150–162.

[28] Howard, J., Kazar, M., Menees, S., Nichols, D., Satyanarayanan, M., Sidebotham, R., and West, M. Scale and performance in a distributed file system. ACM Transactions on Computer Systems 6, 1 (Feb. 1988), 51–82.

[29] Hutto, P., and Ahamad, M. Slow memory: Weakening consistency to enhance concurrency in distributed shared memories. In Proceedings of the 10th International Conference on Distributed Computing Systems (May 1990), pp. 302–311.

[30] Johnson, K., Kaashoek, M., and Wallach, D. CRL: High performance all-software distributed shared memory. In Proceedings of the 15th Symposium on Operating Systems Principles (1995).

[31] Joseph, A., Tauber, J., and Kaashoek, M. F. Mobile computing with the Rover toolkit. IEEE Transactions on Computers: Special issue on Mobile Computing 46, 3 (Mar. 1997).

[32] Kawell Jr., L., Beckhardt, S., Halvorsen, T., Ozzie, R., and Greif, I. Replicated document management in a group communication system. In Groupware: Software for Computer-Supported Cooperative Work, edited by D. Marca and G. Bock, IEEE Computer Society Press (1992), pp. 226–235.

[33] Kaminsky, M., Savvides, G., Mazieres, D., and Kaashoek, M. Decentralized user authentication in a global file system. In Proceedings of the 19th Symposium on Operating Systems Principles (Oct. 2003).

[34] KaZaA. http://www.kazaa.com/, 2000.

[35] Keleher, P. Decentralized replicated-object protocols. In Proceedings of the 18th Annual ACM Symposium on Principles of Distributed Computing (Apr. 1999).

[36] Keleher, P., Cox, A. L., and Zwaenepoel, W. Lazy consistency for software distributed shared memory. In Proceedings of the 19th Annual International Symposium on Computer Architecture (May 1992), pp. 13–21.

[37] Kim, M., Cox, L., and Noble, B. Safety, visibility and performance in a wide-area file system. In Proceedings of the 1st USENIX Conference on File and Storage Technologies (FAST02) (Jan. 2002).

[38] Kistler, J., and Satyanarayanan, M. Disconnected operation in the Coda file system. In Proceedings of the 13th Symposium on Operating Systems Principles (Oct. 1991), pp. 213–225.

[39] Krishnakumar, N., and Bernstein, A. Bounded ignorance: A technique for increasing concurrency in a replicated system. ACM Transactions on Database Systems 19, 4 (Dec. 1994).

[40] Ladin, R., Liskov, B., Shrira, L., and Ghemawat, S. Providing high availability using lazy replication. ACM Transactions on Computer Systems 10, 4 (1992).

[41] Lamb, C., Landis, G., Orenstein, J., and Weinreb, D. The Objectstore database system. Communications of the ACM (Oct. 1991).

[42] Liben-Nowell, D., Balakrishnan, H., and Karger, D. Analysis of the evolution of peer-to-peer systems.

[43] Lipton, R., and Sandberg, J. PRAM: A scalable shared memory. Tech. Rep. CS-TR-180-88, Princeton University, Sept. 1988.
[44] Liskov, B., Adya, A., Castro, M., Day, M., Ghemawat, S., Gruber, R., Maheshwari, U., Myers, A. C., and Shrira, L. Safe and efficient sharing of persistent objects in Thor. In Proceedings of SIGMOD '96 (June 1996).

[45] Mazieres, D. Self-certifying file system. PhD thesis, MIT, May 2000. http://www.fs.net.

[46] Microsoft Corp. Active Directory (in Windows 2000 Server Resource Kit). Microsoft Press, 2000.

[47] Muthitacharoen, A., Chen, B., and Mazieres, D. Ivy: A read/write peer-to-peer file system. In Proceedings of the Fifth Symposium on Operating System Design and Implementation (Dec. 2002).

[48] Nelson, M., Welch, B., and Ousterhout, J. Caching in the Sprite network file system. ACM Transactions on Computer Systems 6, 1 (1988), 134–154.

[49] Noble, B., Fleis, B., Kim, M., and Zajkowski, J. Fluid Replication. In Proceedings of the 1999 Network Storage Symposium (NetStore) (Oct. 1999).

[50] Object Management Group. The Common Object Request Broker: Architecture and Specification, Version 2.0. http://www.omg.org, 1996.

[51] OpenLDAP. OpenLDAP: an open source implementation of the Lightweight Directory Access Protocol. http://www.openldap.org/.

[52] Oracle Corp. Oracle 7 Server Distributed Systems Manual, Vol. 2, 1996.

[53] Pellegrino, J., and Dovrolis, C. Bandwidth requirement and state consistency in three multiplayer game architectures. In ACM NetGames (May 2003), pp. 169–184. http://www.cc.gatech.edu/fac/Constantinos.Dovrolis/start page.html.

[54] Petersen, K., Spreitzer, M. J., Terry, D. B., Theimer, M. T., and Demers, A. J. Flexible update propagation for weakly consistent replication. In Proceedings of the 16th Symposium on Operating Systems Principles (1997).

[55] Pitoura, E., and Bhargava, B. K. Maintaining consistency of data in mobile distributed environments. In Proceedings of the 15th International Conference on Distributed Computing Systems (1995), pp. 404–413.

[56] Ratner, D. ROAM: A scalable replication system for mobile and distributed computing. Tech. Rep. 970044, University of California, Los Angeles, 1997.

[57] Reiher, P., Heidemann, J., Ratner, D., Skinner, G., and Popek, G. Resolving file conflicts in the Ficus file system. In Proceedings of the 1994 Summer Usenix Conference (1994).

[58] Rowstron, A., and Druschel, P. Storage management and caching in PAST, a large-scale, persistent peer-to-peer storage utility. In Proceedings of the 18th Symposium on Operating Systems Principles (2001).

[59] Rhea, S., Geels, D., Roscoe, T., and Kubiatowicz, J. Handling churn in a DHT. In Proceedings of the USENIX 2004 Annual Technical Conference (June 2004).

[60] Rhea, S., et al. Pond: The OceanStore prototype. In Proceedings of the 2nd USENIX Conference on File and Storage Technologies (FAST03) (Mar. 2003).

[61] Saito, Y., Bershad, B., and Levy, H. Manageability, availability and performance in Porcupine: A highly scalable internet mail service. ACM Transactions on Computer Systems (Aug. 2000).

[62] Saito, Y., Karamanolis, C., Karlsson, M., and Mahalingam, M. Taming aggressive replication in the Pangaea wide-area file system. In Proceedings of the Fifth Symposium on Operating System Design and Implementation (2002), pp. 15–30.

[63] Saito, Y., and Shapiro, M. Replication: Optimistic approaches. Tech. Rep. HPL-2002-33, HP Labs, 2001.

[64] Sandberg, R., Goldberg, D., Kleiman, S., Walsh, D., and Lyon, B. Design and implementation of the SUN Network Filesystem. In Proceedings of the Summer 1985 USENIX Conference (1985), pp. 119–130.
[65] Shapiro, M., Ferreira, P., and Richer, N. Experience with the PerDiS large-scale data-sharing middleware. In Intl. Workshop on Persistent Object Systems (Lillehammer, Norway, Sept. 2000), G. Kirby, Ed., vol. 2135 of Lecture Notes in Computer Science, Springer-Verlag, pp. 57–71. http://www-sor.inria.fr/publi/EwPLSDSM pos2000.html.

[66] Sidell, J., Aoki, P., Sah, A., Staelin, C., Stonebraker, M., and Yu, A. Data replication in Mariposa. In Proceedings of the 12th International Conference on Data Engineering (New Orleans, LA, USA, 1996).

[67] Sleepycat Software. The BerkeleyDB database. http://sleepycat.com/, 2000.

[68] Stoica, I., Morris, R., Karger, D., Kaashoek, M. F., and Balakrishnan, H. Chord: A scalable peer-to-peer lookup service for internet applications. In Proceedings of the Sigcomm '01 Symposium (Aug. 2001).

[69] Susarla, S., and Carter, J. DataStations: Ubiquitous transient storage for mobile users. Tech. Rep. UUCS-03-024, University of Utah School of Computer Science, Nov. 2003.

[70] Tanenbaum, A., and van Steen, M. Distributed Systems: Principles and Paradigms. Prentice-Hall, Upper Saddle River, New Jersey, 2002.

[71] Terry, D. B., Demers, A. J., Petersen, K., Spreitzer, M. J., Theimer, M. M., and Welch, B. B. Session guarantees for weakly consistent replicated data. In Proceedings of the International Conference on Parallel and Distributed Information Systems (PDIS) (Sept. 1994), pp. 140–149.

[72] Torres-Rojas, F. J., Ahamad, M., and Raynal, M. Timed consistency for shared distributed objects. In Proceedings of the Eighteenth Annual ACM Symposium on Principles of Distributed Computing (PODC '99) (Atlanta, Georgia, May 1999), ACM, pp. 163–172.

[73] Transaction Processing Performance Council. The TPC-A Benchmark, Revision 2.0, June 1994.

[74] Vahdat, A. Operating System Services For Wide Area Applications. PhD thesis, University of California, Berkeley, CA, 1998.

[75] van Steen, M., Homburg, P., and Tanenbaum, A. Architectural design of Globe: A wide-area distributed system. Tech. Rep. IR-422, Vrije Universiteit, Department of Mathematics and Computer Science, Mar. 1997.

[76] White, B., Lepreau, J., Stoller, L., Ricci, R., Guruprasad, S., Newbold, M., Hibler, M., Barb, C., and Joglekar, A. An integrated experimental environment for distributed systems and networks. In Proceedings of the Fifth Symposium on Operating System Design and Implementation (Boston, MA, Dec. 2002).

[77] Yu, H., and Vahdat, A. Design and evaluation of a continuous consistency model for replicated services. In Proceedings of the Fourth Symposium on Operating System Design and Implementation (Oct. 2000).

[78] Yu, H., and Vahdat, A. The costs and limits of availability for replicated services. In Proceedings of the 18th Symposium on Operating Systems Principles (Oct. 2001).

[79] Zhao, B., Kubiatowicz, J., and Joseph, A. Tapestry: An infrastructure for wide-area fault-tolerant location and routing. Submitted for publication, 2001.