CONFIGURABLE CONSISTENCY FOR WIDE-AREA CACHING
by
Sai R. Susarla
A dissertation submitted to the faculty of
The University of Utah
in partial fulfillment of the requirements for the degree of
Doctor of Philosophy
in
Computer Science
School of Computing
The University of Utah
May 2007
Copyright © Sai R. Susarla 2007
All Rights Reserved
THE UNIVERSITY OF UTAH GRADUATE SCHOOL
SUPERVISORY COMMITTEE APPROVAL
of a dissertation submitted by
Sai R. Susarla
This dissertation has been read by each member of the following supervisory committee and by
majority vote has been found to be satisfactory.
Chair:
John B. Carter
Wilson Hsieh
Jay Lepreau
Gary Lindstrom
Edward R. Zayas
THE UNIVERSITY OF UTAH GRADUATE SCHOOL
FINAL READING APPROVAL
To the Graduate Council of the University of Utah:
I have read the dissertation of
Sai R. Susarla
in its final form and
have found that (1) its format, citations, and bibliographic style are consistent and acceptable;
(2) its illustrative materials including figures, tables, and charts are in place; and (3) the final
manuscript is satisfactory to the Supervisory Committee and is ready for submission to The
Graduate School.
Date
John B. Carter
Chair: Supervisory Committee
Approved for the Major Department
Martin Berzins
Chair/Director
Approved for the Graduate Council
David S. Chapman
Dean of The Graduate School
ABSTRACT
Data caching is a well-understood technique for improving the performance and
availability of wide area distributed applications. The complexity of caching algorithms
motivates the need for reusable middleware support to manage caching. To support
diverse data sharing needs effectively, a caching middleware must provide a flexible
consistency solution that (i) allows applications to express a broad variety of consistency
needs, (ii) enforces consistency efficiently among WAN replicas satisfying those needs,
and (iii) employs application-independent mechanisms that facilitate reuse. Existing
replication solutions either target specific sharing needs and lack flexibility, or leave
significant consistency management burden on the application programmer. As a result,
they cannot offload the complexity of caching effectively from a broad set of applications.
In this dissertation, we show that a small set of customizable data coherence mechanisms can support wide-area replication effectively for distributed services with very
diverse consistency requirements. Specifically, we present a novel flexible consistency
framework called configurable consistency that enables a single middleware to effectively support three important classes of applications, namely, file sharing, shared database
and directory services, and real-time collaboration. Instead of providing a few prepackaged consistency policies, our framework splits consistency management into design choices along five orthogonal aspects, namely, concurrency control, replica synchronization, failure handling, update visibility and view isolation. Based on a detailed
classification of application needs, the design choices can be combined to simultaneously enforce diverse consistency requirements for data access. We have designed and
prototyped a middleware file store called Swarm that provides network-efficient wide
area peer caching with configurable consistency.
To demonstrate the practical effectiveness of the configurable consistency framework, we built four wide area network services that store data with distinct consistency
needs in Swarm. They leverage its caching mechanisms by employing different sets of
configurable consistency choices. The services are: (1) a wide area file system, (2) a
proxy-caching service for enterprise objects, (3) a database augmented with transparent
read-write caching support, and (4) a real-time multicast service. Though existing middleware systems can individually support some of these services, none of them provides a
consistency solution flexible enough to support all of the services efficiently. When these
services employ caching with Swarm, they deliver more than 60% of the performance of
custom-tuned implementations in terms of end-to-end latency, throughput and network
economy. Also, Swarm-based wide-area peer caching improves service performance by
300 to 500% relative to client-server caching and RPCs.
This dissertation is an offering to
The Divine Mother of All Knowledge, Gaayatri.
CONTENTS
ABSTRACT
LIST OF FIGURES
LIST OF TABLES
GLOSSARY OF TERMS USED
ACKNOWLEDGMENTS

CHAPTERS

1. INTRODUCTION
   1.1 Consistency Needs of Distributed Applications
       1.1.1 Requirements of a Caching Middleware
       1.1.2 Existing Work
   1.2 Configurable Consistency
       1.2.1 Configurable Consistency Options
       1.2.2 Discussion
       1.2.3 Limitations of the Framework
       1.2.4 Swarm
   1.3 Evaluation
   1.4 Thesis
   1.5 Roadmap

2. CONFIGURABLE CONSISTENCY: RATIONALE
   2.1 Background
       2.1.1 Programming Distributed Applications
       2.1.2 Consistency
   2.2 Replication Needs of Applications: A Survey
   2.3 Representative Applications
       2.3.1 File Sharing
       2.3.2 Proxy-based Auction Service
       2.3.3 Resource Directory Service
       2.3.4 Chat Service
   2.4 Application Characteristics
   2.5 Consistency Requirements
       2.5.1 Update Stability
       2.5.2 View Isolation
       2.5.3 Concurrency Control
       2.5.4 Replica Synchronization
       2.5.5 Update Dependencies
       2.5.6 Failure-Handling
   2.6 Discussion
   2.7 Limitations of Existing Consistency Solutions
   2.8 Summary

3. CONFIGURABLE CONSISTENCY FRAMEWORK
   3.1 Framework Requirements
   3.2 Data Access Interface
   3.3 Configurable Consistency Options
       3.3.1 Concurrency Control
       3.3.2 Replica Synchronization
           3.3.2.1 Timeliness Guarantee
           3.3.2.2 Strength of Timeliness Guarantee
       3.3.3 Update Ordering Constraints
           3.3.3.1 Ordering Independent Updates
           3.3.3.2 Semantic Dependencies
       3.3.4 Failure Handling
       3.3.5 Visibility and Isolation
   3.4 Example Usage
   3.5 Relationship to Other Consistency Models
       3.5.1 Memory Consistency Models
       3.5.2 Session-oriented Consistency Models
       3.5.3 Session Guarantees for Mobile Data Access
       3.5.4 Transactions
       3.5.5 Flexible Consistency Schemes
   3.6 Discussion
       3.6.1 Ease of Use
       3.6.2 Orthogonality
       3.6.3 Handling Conflicting Consistency Semantics
   3.7 Limitations of the Framework
       3.7.1 Conflict Matrices for Abstract Data Types
       3.7.2 Application-defined Logical Views on Data
   3.8 Summary

4. IMPLEMENTING CONFIGURABLE CONSISTENCY
   4.1 Design Goals
       4.1.1 Scope
   4.2 Swarm Overview
       4.2.1 Swarm Interface
       4.2.2 Application Plugin Interface
   4.3 Using Swarm to Build Distributed Services
       4.3.1 Designing Applications to Use Swarm
       4.3.2 Application Design Examples
           4.3.2.1 Distributed File Service
           4.3.2.2 Music File Sharing
           4.3.2.3 Wide-area Enterprise Service Proxies
   4.4 Architectural Overview of Swarm
   4.5 File Naming and Location Tracking
   4.6 Replication
       4.6.1 Creating Replicas
       4.6.2 Custodians for Failure-resilience
       4.6.3 Retiring Replicas
       4.6.4 Node Membership Management
       4.6.5 Network Economy
       4.6.6 Failure Resilience
   4.7 Consistency Management
       4.7.1 Overview
           4.7.1.1 Privilege Vector
           4.7.1.2 Handling Data Access
           4.7.1.3 Enforcing a PV's Consistency Guarantee
           4.7.1.4 Handling Updates
           4.7.1.5 Leases
           4.7.1.6 Contention-aware Caching
       4.7.2 Core Consistency Protocol
       4.7.3 Pull Algorithm
       4.7.4 Replica Divergence Control
           4.7.4.1 Hard time bound (HT)
           4.7.4.2 Soft time bound
           4.7.4.3 Mod bound
       4.7.5 Refinements
           4.7.5.1 Deadlock Avoidance
           4.7.5.2 Parallelism for RD Mode Sessions
       4.7.6 Leases for Failure Resilience
       4.7.7 Contention-aware Caching
   4.8 Update Propagation
       4.8.1 Enforcing Semantic Dependencies
       4.8.2 Handling Concurrent Updates
       4.8.3 Enforcing Global Order
       4.8.4 Version Vectors
       4.8.5 Relative Versions
   4.9 Failure Resilience
       4.9.1 Node Churn
   4.10 Implementation
   4.11 Discussion
       4.11.1 Security
       4.11.2 Disaster Recovery
   4.12 Summary

5. EVALUATION
   5.1 Experimental Environment
   5.2 Swarmfs: A Flexible Wide-area File System
       5.2.1 Evaluation Overview
       5.2.2 Personal Workload
       5.2.3 Sequential File Sharing over WAN (Roaming)
       5.2.4 Simultaneous WAN Access (Shared RCS)
       5.2.5 Evaluation
       5.2.6 Summary
   5.3 SwarmProxy: Wide-area Service Proxy
       5.3.1 Evaluation Overview
       5.3.2 Workload
       5.3.3 Experiments
       5.3.4 Results
   5.4 SwarmDB: Replicated BerkeleyDB
       5.4.1 Evaluation Overview
       5.4.2 Diverse Consistency Semantics
       5.4.3 Failure-resilience and Network Economy at Scale
       5.4.4 Evaluation
   5.5 SwarmCast: Real-time Multicast Streaming
       5.5.1 Results Summary
       5.5.2 Implementation
       5.5.3 Experimental Setup
       5.5.4 Evaluation Metrics
       5.5.5 Results
   5.6 Discussion

6. RELATED WORK
   6.1 Domain-specific Consistency Solutions
   6.2 Flexible Consistency
   6.3 Wide-area Replication Algorithms
       6.3.1 Replica Networking
       6.3.2 Update Propagation
       6.3.3 Consistency Maintenance
       6.3.4 Failure Resilience
   6.4 Reusable Middleware

7. FUTURE WORK AND CONCLUSIONS
   7.1 Future Work
       7.1.1 Improving the Framework's Ease of Use
       7.1.2 Security and Authentication
       7.1.3 Applicability to Object-based Middleware
   7.2 Summary

REFERENCES
LIST OF FIGURES
2.1 A distributed application with five components A1..A5 employing the function-shipping model. Each component holds a fragment of overall state and operates on other fragments via messages to their holder.
2.2 A distributed application employing the data-shipping model. A1..A5 interact only via caching needed state locally. A1 holds fragments 1 and 2 of overall state. A2 caches fragments 1 and 3. A3 holds 3 and caches 5.
2.3 Distributed application employing a hybrid model. A1, A3 and A5 are in data-shipping mode. A2 is in client mode, and explicitly communicates with A1 and A3. A4 is in hybrid mode. It caches 3, holds 4 and contacts A5 for 5.
3.1 Pseudo-code for a query operation on a replicated database. The cc options are explained in Section 3.3.
3.2 Pseudo-code for an update operation on a replicated database.
3.3 Atomic sessions example: moving file fname from one directory to another. Updates u1, u2 are applied atomically, everywhere.
3.4 Causal sessions example. Updates u1 and u2 happen concurrently and are independent. Clients 3 and 4 receive u1 before their sessions start. Hence u3 and u4 causally depend on u1, but are independent of each other. Though u2 and u4 are independent, they are tagged as totally ordered, and hence must be applied in the same order everywhere. Data item B's causal dependency at client 3 on prior update to A has to be explicitly specified, while that of A at client 4 is implicitly inferred.
3.5 Update Visibility Options. Based on the visibility setting of the ongoing WR session at replica 1, it responds to the pull request from replica 2 by supplying the version (v0, v2 or v3) indicated by the curved arrows. When the WR session employs manual visibility, local version v3 is not yet made visible to be supplied to replica 2.
3.6 Isolation Options. When replica 2 receives updates u1 and u2 from replica 1, the RD session's isolation setting determines whether the updates are applied (i) immediately (solid curve), (ii) only when the session issues the next manual pull request (dashed curve), or (iii) only after the session ends (dotted curve).
4.1 A centralized enterprise service. Clients in remote campuses access the service via RPCs.
4.2 An enterprise application employing a Swarm-based proxy server. Clients in campus 2 access the local proxy server, while those in campus 3 invoke either server.
4.3 Control flow in an enterprise service replicated using Swarm.
4.4 Structure of a Swarm server and client process.
4.5 File replication in a Swarm network. Files F1 and F2 are replicated at Swarm servers N1..N6. Permanent copies are shown in darker shade. F1 has two custodians: N4 and N5, while F2 has only one, namely, N5. Replica hierarchies are shown for F1 and F2 rooted at N4 and N5 respectively. Arrows indicate parent links.
4.6 Replica Hierarchy Construction in Swarm. (a) Nodes N1 and N3 cache file F2 from its home N5. (b) N2 and N4 cache it from N5; N1 reconnects to closer replica N3. (c) Both N2 and N4 reconnect to N3 as it is closer than N5. (d) Finally, N2 reconnects to N1 as it is closer than N3.
4.7 Consistency Privilege Vector.
4.8 Pseudo-code for opening and closing a session.
4.9 Consistency management actions in response to client access to file F2 of Figure 4.5(d). Each replica is labelled with its currentPV, and '-' denotes ∞.
4.10 Basic Consistency Management Algorithms.
4.11 Basic Pull Algorithm for configurable consistency.
4.12 Computation of relative PVs during pull operations. The PVs labelling the edges show the PVin of a replica obtained from each of its neighbors, and '-' denotes ∞. For instance, in (b), after node N3 finishes pulling from N5, its N5.PVin = [wr, ∞, ∞, [10,4]], and N5.PVout = [rd, ∞, ∞, ∞]. Its currentPV becomes PVmin([wr, ∞, ∞, [10,4]], [rd, ∞, ∞, ∞], [wrlk]), which is [rd, ∞, ∞, ∞].
4.13 Basic Push Algorithm for Configurable Consistency.
4.14 Replication State for Adaptive Caching. Only salient fields are presented for readability.
4.15 Adaptive Caching Algorithms.
4.16 Master Election in Adaptive Caching. The algorithms presented here are simplified for readability and do not handle race conditions such as simultaneous master elections.
4.17 Adaptive Caching of file F2 of Figure 4.5 with hysteresis set to (low=3, high=5). In each figure, the replica in master mode is shown darkly shaded, peers are lightly shaded, and slaves are unshaded.
5.1 Network topology emulated by Emulab's delayed LAN configuration. The figure shows each node having a 1Mbps link to the Internet routing core, and a 40ms roundtrip delay to/from all other nodes.
5.2 Andrew-Tcl Performance on 100Mbps LAN.
5.3 Andrew-Tcl Details on 100Mbps LAN.
5.4 Andrew-Tcl results over 1Mbps, 40ms RTT link.
5.5 Network Topology for the Swarmfs Experiments described in Sections 5.2.3 and 5.2.4.
5.6 The replica hierarchy observed for a Swarmfs file in the network of Figure 5.5.
5.7 Roaming File Access: Swarmfs pulls source files from nearby replicas. Strong-mode Coda correctly compiles all files, but exhibits poor performance. Weak-mode Coda performs well, but generates incorrect results on the three nodes (T2, F1, F2) farthest from server U1.
5.8 Latency to fetch and modify a file sequentially at WAN nodes. Strong-mode Coda writes the modified file synchronously to the server.
5.9 Repository module access patterns during various phases of the RCS experiment. Graphs in Figures 5.10 and 5.11 show the measured file checkout latencies on Swarmfs-based RCS at Univ. (U) and Turkey (T) sites relative to client-server RCS.
5.10 RCS on Swarmfs: Checkout Latencies near home server on the "University" (U) LAN.
5.11 RCS on Swarmfs: Checkout Latencies at "Turkey" (T) site, far away from home server.
5.12 Network Architecture of the SwarmProxy service.
5.13 SwarmProxy aggregate throughput (ops/sec) with adaptive replication.
5.14 SwarmProxy aggregate throughput (ops/sec) with aggressive caching. The Y-axis in this graph is set to a smaller scale than in Figure 5.13 to show more detail.
5.15 SwarmProxy latencies for local objects at 40% locality.
5.16 SwarmProxy latencies for non-local objects at 40% locality.
5.17 SwarmDB throughput observed at each replica for reads (lookups and cursor-based scans). 'Swarm local' is much worse than 'local' because SwarmDB opens a session on the local server (incurring an inter-process RPC) for every read in our benchmark.
5.18 SwarmDB throughput observed at each replica for writes (insertions, deletions and updates). Writes under master-slave replication perform slightly worse than RPC due to the update propagation overhead to slaves.
5.19 Emulated network topology for the large-scale file sharing experiment. Each oval denotes a 'campus LAN' with ten user nodes, each running a SwarmDB client and Swarm server. The labels denote the bandwidth (bits/sec) and one-way delay of network links (from node to hub).
5.20 New file access latencies in a Swarm-based peer file sharing network of 240 nodes under node churn.
5.21 The average incoming bandwidth consumed at a router's WAN link by file downloads. Swarm with proximity-aware replica networking consumes much less WAN link bandwidth than with random networking.
5.22 Data dissemination bandwidth of a single Swarm vs. hand-coded relay for various numbers of subscribers across a 100Mbps switched LAN. With a few subscribers, the Swarm-based producer is CPU-bound, which limits its throughput. But with many subscribers, Swarm's pipelined I/O delivers 90% efficiency. Our hand-coded relay requires a significant redesign to achieve comparable efficiency.
5.23 End-to-end packet latency of a single relay with various numbers of subscribers across a 100Mbps LAN, shown to a log-log scale. The Swarm relay induces an order of magnitude longer delay as it is CPU-bound in our unoptimized implementation. The delay is due to packets getting queued in the network link to the Swarm server.
5.24 A SwarmCast producer's sustained send throughput (KBytes/sec) scales linearly as more Swarm servers are used for multicast (shown to log-log scale), regardless of their fanout.
5.25 Adding Swarm relays linearly scales multicast throughput on a 100Mbps switched LAN. The graph shows the throughput normalized to a single ideal relay that delivers the full bandwidth of a 100Mbps link. A network of 20 Swarm servers delivers 80% of the throughput of 20 ideal relays.
5.26 Swarm's multicast efficiency and throughput-scaling. The root Swarm server is able to process packets from the producer at only 60% of the ideal rate due to its unoptimized compute-bound implementation. But its efficient hierarchical propagation mechanism disseminates them at higher efficiency. The fanout of the Swarm network has a minor impact on throughput.
5.27 End-to-end propagation latency via the Swarm relay network. Swarm servers are compute-bound for packet processing, and cause packet queuing delays. However, when multiple servers are deployed, individual servers are less loaded. This reduces the overall packet latency drastically. With a deeper relay tree (obtained with a relay fanout of 2), average packet latency increases marginally due to the extra hops through relays.
5.28 Packet delay contributed by various levels of the Swarm multicast tree (shown to a log-log scale). The two slanting lines indicate that the root relay and its first-level descendants are operating at their maximum capacity, and are clearly compute-bound. Packet queuing at those relays contributes the most to the overall packet latency.
6.1 The spectrum of available consistency management solutions. We categorize them based on the variety of application consistency semantics they cover and the effort required to employ them in application design.
LIST OF TABLES
1.1 Consistency options provided by the Configurable Consistency (CC) framework.
1.2 Configurable Consistency (CC) options for popular consistency flavors.
2.1 Characteristics of several classes of representative wide-area applications.
2.2 Consistency needs of several representative wide-area applications.
3.1 Concurrency matrix for Configurable Consistency.
3.2 Expressing the needs of representative applications using Configurable Consistency.
3.3 Expressing various consistency semantics using Configurable Consistency options.
3.4 Expressing other flexible models using Configurable Consistency options.
4.1 Swarm API.
4.2 Swarm interface to application plugins.
5.1 Consistency semantics employed for replicated BerkeleyDB (SwarmDB) and the CC options used to achieve them.
GLOSSARY OF TERMS USED
AFS Andrew File System, a client-server distributed file system developed at CMU
Andrew a file system benchmark created at CMU
CMU Carnegie-Mellon University
CVS Concurrent Versions System, a more sophisticated version control system based on RCS
Caching Creating a temporary copy of remotely located data to speed up future retrievals nearby
Coda An extension of the AFS file system that operates in weakly-connected network
environments
LAN Local-area Network
NFS Network File System, developed by Sun Microsystems.
NOP null/no operation
NTP Network Time Protocol used to synchronize wall clock time among computers.
RCS Revision Control System, a public-domain file version control software
Replication Generally keeping multiple copies of a data item for any reason such as
performance or redundancy
TTCP A sockets-based Unix program to measure the network bandwidth between two
machines
VFS Virtual File System: an interface developed for an operating system kernel to
interact with file system-specific logic in a portable way
WAN Wide-area Network
ACKNOWLEDGMENTS
At the outset, I owe the successful pursuit of my Ph.D. degree to two great people: my
advisor, John Carter, and my mentor, Jay Lepreau. I really admire John's immeasurable
patience and benevolence in putting up with a ghost student like me who practically
went away in the middle of my Ph.D. for four long years to work full-time (at Novell). I thank
John for giving me complete freedom to explore territory that was new to both of us,
while gently nudging me towards practicality with his wisdom. Thanks to John’s able
guidance, I could experience the thrill of systems research for which I joined graduate
school - conceiving and building a really ‘cool’, practically valuable system of significant
complexity all by myself, from scratch. Finally, I owe most of my writing skills to John.
However, all the errors you may find in this thesis are mine.
I would like to thank Jay Lepreau for providing generous financial as well as moral
support throughout my graduate study at Utah. Jay Lepreau and Bryan Ford have taught
me the baby steps in research. I thank Jay for allowing me to use his incredible tool,
the Emulab network testbed, extensively, without which my work would have been
impossible. I thank the members of the Flux research group at Utah, especially Mike
Hibler, Eric Eide, Robert Ricci and Kirk Webb for patiently listening to my technical
problems and ideas, and providing critical feedback that greatly helped my work. I really
enjoyed the fun atmosphere in the Flux group at Utah.
I feel fortunate to have been guided by Wilson Hsieh, a star researcher. From him, I
learnt a great deal about how to clearly identify and articulate the novel aspects of my work
that formed my thesis. The critical feedback I received from Wilson, Gary Lindstrom and Ed
Zayas shaped my understanding of doctoral research. The positive energy, enthusiasm
and joy that Ed Zayas radiates are contagious. I cherish every moment I spent with him at
Novell and later at Network Appliance.
My graduate study has been a long journey during which I met many great friends of a
lifetime - especially Anand, Gangu, Vamsi and Sagar. Each is unique, inspiring, pleasant
and heart-warming in his own way. I am grateful to my family members including my
parents, grandparents, aunts and uncles (especially Aunt Ratna) and my wife Sarada,
without whose self-giving love and encouragement I would not have achieved anything.
Finally, my utmost gratitude to the Veda Maataa, Gaayatrii (the Divine Mother of
Knowledge), who nurtured me every moment in the form of all these people. She
provided what I needed just when I needed it, and guides me towards the goal of Life
- Eternal Existence (Sat), Infinite Knowledge and Power (Chit), and Inalienable Bliss
(Ananda).
CHAPTER 1
INTRODUCTION
Modern Internet-based services increasingly operate over diverse networks and cater
to geographically widespread users. Caching of service state and user data at multiple
locations is well understood as a technique to scale service capacity, to hide network
variability from users, and to provide a responsive and highly available service. Caching
of mutable data raises the issue of consistency management, which ensures that all
replica contents converge to a common final value in spite of updates. A dynamic
wide area environment poses several unique challenges to caching algorithms, such
as diverse network characteristics, diverse resource constraints, and machine failures.
Their complexity motivates the need for a reusable solution to support data caching and
consistency in new distributed services. In this dissertation, we address the following
question: can a single middleware system provide cached data access effectively in
diverse distributed services with a wide variety of sharing needs?
To identify the core desirable features of a caching middleware, we have surveyed
the data sharing needs of a wide variety of distributed applications ranging from personal file access (with little data sharing) to widespread real-time collaboration (with
fine-grain synchronization). We found that although replicable data is prevalent in many
applications, their data characteristics (e.g., the unit of data access, its mutability, and the
frequency of read/write sharing) and consistency requirements vary widely. To support
this diversity efficiently requires greater customizability of consistency mechanisms than
provided by existing solutions. Also, applications operate in diverse network environments ranging from well-connected corporate servers to intermittently connected mobile
devices. The ability to promiscuously replicate data
1
1
and synchronize with any avail-
The term replication has been used in previous literature [62, 78, 22] to refer both to keeping transient
copies/replicas of data to improve access latency and availability (also called caching or second-class
replication [38]) as well as to maintaining multiple redundant copies of data to protect against permanent
2
able replica, called pervasive replication [62], greatly enhances application availability
and performance in such environments. Finally, we observed that certain core design
choices recur in the consistency management of diverse applications, although different
applications need to make different sets of choices. This commonality in the consistency
enforcement options allows the development of a flexible consistency framework and its
implementation in a caching middleware to support diverse sharing needs.
Based on the above observations, this dissertation presents a novel flexible consistency management framework called configurable consistency that supports efficient
caching for diverse distributed applications running in non-uniform network environments (including WANs). The configurable consistency framework can express a broader
mix of consistency semantics than existing models (ranging from strong to eventual
consistency) by combining a small number of orthogonal design choices. The framework’s choices allow an application to make different tradeoffs between consistency,
availability, and performance over time. Its interface also allows different users to impose
different consistency requirements on the same data simultaneously. Thus, the framework is highly customizable and adaptable to varying user requirements. Configurable
consistency can be enforced efficiently among data replicas spread across non-uniform
networks. If a data service/middleware adopts the configurable consistency protocol to
synchronize peer replicas, it can support the read/write data sharing needs of a variety
of applications efficiently. For clustered services, read-write caching with configurable
consistency helps incrementally scale service capacity to handle client load. For wide
area services, it also improves end-to-end service latency and throughput, and reduces
WAN usage.
To support our flexibility and efficiency claims, we present the design of a middleware
data store called Swarm² that provides pervasive wide area caching, and supports
diverse application needs by implementing configurable consistency. To demonstrate
the flexibility of configurable consistency, we present four network services that store
data with distinct consistency needs in Swarm and leverage its caching support with
different configurable consistency choices. Though existing systems support some of
these services, none of them has a consistency solution flexible enough to support all of
the services efficiently. Under worst-case workloads, services using Swarm middleware
for caching perform within 20% of equivalent hand-optimized implementations
in terms of end-to-end latency, throughput and network utilization. However, relative
to traditional client-server implementations without caching, Swarm-based caching
improves service performance by at least 500% on realistic workloads. Thus, layering these
services on top of Swarm middleware only incurs a low worst-case penalty, but benefits
them significantly in the common case.

²Swarm stands for Scalable Wide Area Replication Middleware.
Swarm accomplishes this by providing the following features: (i) a failure-resilient
proximity-aware replica management mechanism that organizes data replicas into an
overlay hierarchy for scalable synchronization and adjusts it based on observed network
characteristics and node accessibility; (ii) an implementation of the configurable
consistency framework to let applications customize consistency semantics of shared
data to match their diverse sharing and performance needs; and (iii) a contention-aware
replication control mechanism that limits replica synchronization overhead by monitoring the contention among replica sites and adjusting the degree of replication accordingly.
For P2P file sharing with close-to-open consistency semantics, proximity-aware replica
management reduces data access latency and WAN bandwidth consumption to roughly
one-fifth that of random replica networking. With configurable consistency, database
applications can operate with diverse consistency requirements without redesign. Relaxing consistency improves their throughput by an order of magnitude over strong
consistency and read-only replication. For enterprise services employing WAN proxies,
contention-aware replication control outperforms both aggressive caching and RPCs (i.e.,
no replication) at all levels of contention while still providing strong consistency.
In Section 1.1, we present three important classes of distributed applications that
we target for replication support. We discuss the diversity of their sharing needs, the
commonality of their consistency requirements, and the existing ways in which those
requirements are satisfied to motivate our thesis. In Section 1.2, we outline the configurable
consistency framework and present an overview of Swarm. Finally, in Section
1.3, we outline the four specific representative applications we built using Swarm and
our evaluation of their efficiency relative to alternate implementations.
1.1 Consistency Needs of Distributed Applications
Previous research has revealed that distributed applications vary widely in their consistency requirements [77]. It is also well-known that consistency, scalable performance,
and high availability in the wide area are often conflicting goals [23]. Different applications need to make different tradeoffs based on application and network characteristics.
To understand the diversity in replication needs, we studied three important and broad
classes of distributed services: (i) file access, (ii) directory and database services, and
(iii) real-time collaborative groupware. Though efficient custom replication solutions
exist for many individual applications in all these categories, our aim is to see if a more
generic middleware solution is feasible that provides efficient support for a wide variety
of applications.
File systems are used by myriad applications to store and share persistent data, but
applications differ in the way files are accessed. Personal files are rarely write-shared.
Software and multimedia are widely read-shared. Log files are concurrently appended.
Shared calendars and address books are concurrently updated, but their results can often
be merged automatically. Concurrent updates to version control files produce conflicts
that are hard to resolve and must be prevented. Eventual consistency (i.e., propagating
updates lazily) provides adequate semantics and high availability in the normal case
where files are rarely write-shared. But during periods of close collaboration (e.g., an approaching deadline), users need tighter synchronization guarantees such as close-to-open
(to view latest updates) or strong consistency (to prevent update conflicts) to facilitate
productive fine-grained document sharing. Hence users need to make different tradeoffs
between consistency and availability for files at different times. Currently, the only
recourse for users during close collaboration over the wide area is to avoid distributed file
systems and resort to manual synchronization at a central server (via ssh/rsync or email).
A directory service locates resources such as users, devices, and employee records
based on their attributes. The consistency needs of a directory service depend on the
applications using it and their resources being indexed. A music file index used for
peer-to-peer music search such as KaZaa might require a very weak (e.g., eventual) consistency guarantee, but the index must scale to a large floating population of (thousands
of) users frequently updating it. Some updates to an employee directory may need to
be made effective “immediately” at all replicas (e.g., revoking a user’s access privileges
to sensitive data), while other updates can be performed with relaxed consistency. This
requires support for multiple simultaneous accesses to the same data with different consistency requirements.
Enterprise data services (such as auctions, e-commerce, inventory management) involve multi-user access to structured data. Their responsiveness to users spread geographically can be improved by deploying wide area proxies that cache enterprise objects
(e.g., sales and customer records). For instance, the proxy at a call center could cache
many client records locally, thereby speeding up response. However, enterprise services
often need to enforce strong consistency and integrity constraints in spite of wide area
operation. Node/network failures must not degrade service availability when proxies are
added. Also since the popularity of objects may vary widely, caching decisions must
be made based on the available locality. Widely caching data with poor locality leads
to significant coherence traffic and hurts, rather than improves, performance. These
applications typically have semantic dependencies among updates such as atomicity and
causality that must be preserved to ensure data integrity.
Real-time collaboration involves producer-consumer style interaction among multiple users or application components in real-time. The key requirement is to deliver data
from producers to consumers with minimal latency while utilizing network bandwidth
efficiently. Example applications include data logging for analysis, chat (i.e., many-to-many data multicast), stock/game updates, and multimedia streaming to wide area
subscribers. In the traditional organization of these applications, producers send their
data to a central server that disseminates it to interested consumers. Replicating the
central server’s task among a multicast network of servers helps ease load on the central
server and thus has the potential to improve scalability. Such applications differ in the
staleness of data tolerated by consumers.
1.1.1 Requirements of a Caching Middleware
In general, applications widely differ in several respects including their data access
locality, the frequency and extent of read and write sharing among replicas, the typical
replication factor, semantic interdependencies among updates, the likelihood of conflicts
among concurrent updates, and their amenability to automatic conflict resolution. Operating these applications in the wide area introduces several additional challenges. In
a wide area environment, network links typically have non-uniform delay and varying
bandwidth due to congestion and cross-traffic. Both nodes and links may become intermittently unavailable. To support diverse application requirements in such a dynamic
environment, we believe a caching solution must have the following features:
• Customizable Consistency: Applications require a high degree of customizability of consistency mechanisms to achieve the right tradeoff between consistency
semantics, availability, and performance. The same application might need to
operate with different consistency semantics based on its changing resources and
connectivity (e.g., file sharing).
• Pervasive Replication: Application components must be able to freely cache data
and synchronize with any available replica to provide high availability in the wide
area. Rigid communication topologies such as client-server or static hierarchies
prevent efficient utilization of network resources and restrict availability.
• Network Economy: The caching and consistency mechanisms must use network
capacity efficiently and hide variable delays to provide predictable response to
users.
• Failure Resilience: For practical deployment in the wide area, a caching solution
must continue to operate correctly in spite of the failure of some nodes and network
links, i.e., it must not violate the consistency guarantees given to applications.
1.1.2 Existing Work
Previous research efforts have proposed a number of efficient techniques to support
pervasive wide area replication [62, 56, 18] as well as consistency mechanisms that suit
distinct data sharing patterns [38, 74, 2, 52, 41]. However, as we explain below, existing
systems lack one or more of the listed features that we feel are essential to support the
data sharing needs of the diverse application classes mentioned above.
Many systems [38, 62, 52, 41, 46, 61] handle the diversity in replication needs by
devising packaged consistency policies tailored to applications with specific data and
network characteristics. For example, the Coda file system [38] provides two modes of
operation with distinct consistency policies, namely, close-to-open and eventual consistency, based on the connectivity of its clients to servers. Fluid replication [37] provides
three flavors of consistency on a per-file-access basis to support sharing of read-only
files, rarely write-shared personal files, and files requiring serializability. Their approach
adequately serves specific access patterns, but cannot provide slightly different semantics
without a system redesign. For instance, they cannot enforce different policies for file
reads and writes, or control replica synchronization frequency, which our evaluation
shows can significantly improve application performance.
Several research efforts [39, 77] have recognized the need to allow tuning of consistency to specific application needs, and have developed consistency interfaces that
provide options to customize consistency. TACT defines a continuous consistency model
that provides continuous control over the degree of divergence of replica contents [77].
TACT’s model provides three powerful orthogonal metrics in which to express the replica
divergence requirements of applications: numerical error, order error, and staleness.
However, to express important consistency constraints such as atomicity, causality and
isolation, applications such as file and database services need a session abstraction for
grouping multiple data accesses into a unit, which the TACT model lacks. Also, TACT
only supports strict enforcement of divergence bounds by blocking writers. Enforcing
strict bounds is overkill for chat and directory services, and reduces their throughput and
availability for update operations in the presence of failures.
Oceanstore [60] provides an Internet-scale persistent data store for use by arbitrary
distributed applications and comes closest to our vision of a reusable data replication
middleware. However, it takes an extreme approach of requiring applications to both
define and enforce a consistency model. It only provides a semantic update model that
treats updates as procedures guarded by predicates. An application or a consistency
library must design the right set of predicates to associate with updates to achieve the
desired consistency. Their approach leaves unanswered our question about the feasibility
of a flexible consistency solution for the application classes mentioned above. Hence,
although Oceanstore can theoretically support a wide variety of applications, programmers incur the burden of implementing consistency management. Several other systems
[12, 75] adopt a similar approach to flexible consistency.
In summary, the lack of a sufficiently flexible consistency solution remains a key
hurdle to building a replication middleware that supports the wide area data sharing
needs of diverse applications. The difficulty lies in developing a consistency interface
that allows a broad variety of consistency semantics to be expressed, can be enforced
efficiently by a small set of application-independent mechanisms in the wide area, and is
customizable enough to enable a large set of applications to make the right tradeoffs
based on their environment. This dissertation proposes such a consistency interface
and shows how it can be implemented along with the other three features in a single
middleware to support the diverse application classes mentioned above.
In addition to the features mentioned in Section 1.1.1, for practical deployment, a
full-fledged data management middleware must also address several important issues
including security and authentication, fault-tolerance for reliability against permanent
loss of replicas, and long-term archival storage and retrieval for disaster recovery. For the
purpose of this thesis, however, we limit our scope to flexible consistency management
for pervasive replication, as it is an important enabler for reusable data management
middleware.
1.2 Configurable Consistency
To aid us in designing a more flexible consistency interface, we surveyed a number of
applications in the three classes mentioned in the previous section, looking for common
traits in their diverse consistency needs. From the survey, described in Chapter 2, we
found that a variety of consistency needs can be expressed in terms of a small number
of design choices for enforcing consistency. Those choices can be classified into five
mostly orthogonal dimensions:
• concurrency - the degree to which conflicting (read/write) accesses can be tolerated,
• replica synchronization - the degree to which replica divergence can be tolerated,
including the types of interdependencies among updates that must be preserved
when synchronizing replicas,
• failure handling - how data access should be handled when some replicas are
unreachable or have poor connectivity,
• update visibility - the granularity at which the updates issued at a replica should be
made visible globally,
• view isolation - the duration for which the data accesses at a replica should be
isolated from remote updates.
There are multiple reasonable options along each of these dimensions that create a multidimensional space for expressing consistency requirements of applications. Based on this
classification, we developed the novel configurable consistency framework that provides
the options listed in Table 1.1. When these options are combined in various ways, they
yield a rich collection of consistency semantics for reads and updates to shared data,
covering the needs of a broad mix of applications.
For instance, this approach lets a proxy-based auction service employ strong consistency for updates across all replicas, while enabling peer proxies to answer queries
with different levels of accuracy by relaxing consistency for reads to limit synchronization
cost. A user can still get an accurate answer by specifying a stronger consistency
requirement for queries when the operation warrants incurring higher latency, e.g., when
placing a bid.

Table 1.1. Consistency options provided by the Configurable Consistency (CC) framework.
Consistency semantics are expressed for an access session by choosing one of the
alternative options in each row, which are mutually exclusive. Reasonable defaults that
suit many applications are listed below the table. In our discussion, when we leave an
option unspecified, we assume its default value.

Dimension                 Aspect            Available Consistency Options
------------------------------------------------------------------------------------------
Concurrency Control       Access mode       concurrent (RD, WR) | excl (RDLK, WRLK)
Replica Synchronization   Timeliness        time (staleness = 0..∞ secs) |
                                            mod (unseen writes = 0..∞) | manual
                          Strength          hard | soft
                          Semantic Deps.    none | causal | atomic | causal+atomic
                          Update ordering   none | total | serial
Failure Handling                            optimistic (ignore replicas w/ RTT ≥ 0..∞) |
                                            pessimistic
Update Visibility                           session | per-update | manual
View Isolation                              session | per-update | manual

Defaults: concurrent (RD/WR) access mode; time staleness = 0 and mod = ∞; hard
enforcement; no semantic dependencies; total update ordering; pessimistic failure
handling; and session-grain update visibility and view isolation.
The configurable consistency framework assumes that applications access (i.e., read
or write) their data in sessions, and that consistency can be enforced at session boundaries
as well as before and after each read or write access within a session. The framework’s
definition of reads and writes is general and includes queries and updates of arbitrary
complexity. In this framework, an application expresses its consistency requirements for
each session as a vector of consistency options (one from each row of Table 1.1) covering
several aspects of consistency management. Each row of the table indicates several
mutually exclusive options available to control the aspect of consistency indicated in
its first column. The table also lists reasonable default options, which together
enforce the close-to-open consistency semantics provided by AFS [28] for coherent
read-write file sharing. An application can select a different set of options for subsequent
sessions on the same data item to meet dynamically varying consistency requirements.
Also, different application instances can select different sets of options on the same data
item simultaneously. In that case, the framework guarantees that all sessions achieve
their desired semantics by providing mechanisms that serialize sessions with conflicting requirements. Thus, the framework provides applications a significant amount of
customizability in consistency management.
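To make this concrete, the sketch below models the option vector as a plain C struct and fills in the default combination. The type, field, and constant names (cc_options, max_staleness, and so on) are our own invention for illustration; they are not part of Swarm's actual interface, which is described in Chapter 4.

    #include <stdio.h>

    /* Hypothetical encoding of the five dimensions of Table 1.1.  All names are
     * illustrative; none of them are taken from the Swarm sources. */
    enum access_mode  { MODE_RD, MODE_WR, MODE_RDLK, MODE_WRLK };
    enum bound_kind   { BOUND_HARD, BOUND_SOFT };
    enum semantic_dep { DEP_NONE, DEP_CAUSAL, DEP_ATOMIC, DEP_CAUSAL_ATOMIC };
    enum update_order { ORDER_NONE, ORDER_TOTAL, ORDER_SERIAL };
    enum failure_mode { FAIL_PESSIMISTIC, FAIL_OPTIMISTIC };
    enum grain        { GRAIN_SESSION, GRAIN_PER_UPDATE, GRAIN_MANUAL };

    struct cc_options {
        enum access_mode  mode;          /* concurrency control             */
        double            max_staleness; /* replica sync: time bound (secs) */
        unsigned          max_unseen;    /* replica sync: mod bound         */
        enum bound_kind   strength;      /* hard or soft enforcement        */
        enum semantic_dep deps;          /* causality / atomicity           */
        enum update_order ordering;      /* none / total / serial           */
        enum failure_mode on_failure;    /* optimistic vs. pessimistic      */
        enum grain        visibility;    /* when local updates propagate    */
        enum grain        isolation;     /* when remote updates are seen    */
    };

    int main(void) {
        /* The default vector from Table 1.1, which together yields AFS-style
         * close-to-open consistency for a read session. */
        struct cc_options close_to_open = {
            .mode = MODE_RD, .max_staleness = 0.0, .max_unseen = ~0u,
            .strength = BOUND_HARD, .deps = DEP_NONE, .ordering = ORDER_TOTAL,
            .on_failure = FAIL_PESSIMISTIC,
            .visibility = GRAIN_SESSION, .isolation = GRAIN_SESSION,
        };
        printf("close-to-open: staleness bound = %.0f s, session-grain visibility\n",
               close_to_open.max_staleness);
        return 0;
    }

A middleware adopting the framework would accept some such vector when a session is opened and enforce it for the session's duration.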
1.2.1 Configurable Consistency Options
We now briefly describe the options supported by the configurable consistency framework, which are explained in detail in Chapter 3.
The framework provides two flavors of access modes to control the parallelism among
reads and writes. Concurrent flavors (RD, WR) allow arbitrary interleaving of accesses
across replicas, while the exclusive modes (RDLK, WRLK) provide traditional concurrent-read-exclusive-write semantics globally [48].
Divergence of replica contents (called timeliness in Table 1.1) can be controlled via
limiting staleness in terms of time, the number of unseen remote updates, or both. The
divergence bounds can be hard, i.e., strictly enforced by stalling writes if necessary
(similar to TACT’s model [77]), or soft, i.e., enforced in a best-effort fashion without
stalling any accesses.
Two types of semantic dependencies can be expressed among multiple writes (to
the same or different data items), namely, causality and atomicity. When updates are
issued independently at multiple replicas, our framework allows them to be applied (1)
with no particular constraint on their ordering at various replicas (called ‘none’), (2)
in some arbitrary but common order everywhere (called ‘total’), or (3) sequentially via
serialization (called ‘serial’).
When not all replicas are equally well-connected or available, different consistency
options can be imposed dynamically on different subsets of replicas based on their
relative connectivity. For this, the framework allows qualifying the options with a cutoff
value for a link quality metric such as network latency. In that case, consistency options
will be enforced only relative to replicas reachable via network links of higher quality
(e.g., lower latency) than the cutoff. With this option, application instances using a
replica can optimistically make progress with available data even when some replicas
are unreachable due to node/network failures.
Finally, the framework provides control over how long a session is kept isolated from
the updates of remote sessions, as well as when its updates are made ready to be visible to
remote sessions. A session can be made to remain entirely isolated from remote updates
(‘session’, ensuring a snapshot view of data), to apply remote updates to local copies
immediately (‘per-update’, useful for log monitoring), or to apply them only when explicitly
requested via an API (‘manual’). Similarly, a session’s updates can be propagated as soon as they are
issued (useful for chat), when the session ends (useful for file updates), or only upon
explicit request (‘manual’).
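As a worked illustration of how these options combine, the sketch below fills in two contrasting vectors: one a chat client might choose and one a source-file editor might choose. The struct, its string-valued fields, and the 500 ms RTT cutoff are hypothetical choices of ours, not values mandated by the framework.

    #include <stdio.h>

    /* Reduced, string-valued option vector (hypothetical names) covering the
     * aspects that differ between the two examples below. */
    struct cc_vec {
        const char *mode;        /* RD, WR, RDLK, or WRLK                      */
        const char *timeliness;  /* time/mod bound on replica divergence       */
        const char *strength;    /* "hard" or "soft"                           */
        const char *deps;        /* none / causal / atomic                     */
        const char *failure;     /* optimistic (with RTT cutoff) / pessimistic */
        const char *visibility;  /* session / per-update / manual              */
        const char *isolation;   /* session / per-update / manual              */
    };

    int main(void) {
        /* Chat: messages propagate and become visible immediately, best effort,
         * in causal order, and the session stays available across failures by
         * ignoring replicas more than 500 ms away (an arbitrary cutoff). */
        struct cc_vec chat = { "WR", "time=0", "soft", "causal",
                               "optimistic, ignore RTT >= 500 ms",
                               "per-update", "per-update" };

        /* Source-file editing: exclusive writes, a hard most-current bound, and
         * session-grain visibility/isolation so readers see whole saved files. */
        struct cc_vec edit = { "WRLK", "time=0", "hard", "none",
                               "pessimistic", "session", "session" };

        printf("chat session: mode=%s, %s, visibility=%s\n",
               chat.mode, chat.strength, chat.visibility);
        printf("edit session: mode=%s, %s, visibility=%s\n",
               edit.mode, edit.strength, edit.visibility);
        return 0;
    }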
Table 1.2 shows how several popular consistency flavors can be expressed in terms of
configurable consistency options. In the proxy-based auction example described above,
strong consistency can be enforced for updates by employing exclusive write mode
(WRLK) sessions to ensure data integrity, while queries employ concurrent read mode
(RD) sessions with relaxed timeliness settings for high query throughput. A replicated
chat service needs to employ per-update visibility and isolation to force user messages
to be eagerly propagated among chat servers in real-time. On the other hand, updates
to source files and documents are not stable until a write session ends. Hence they need
to employ session visibility and isolation to ensure consistent contents. We discuss the
expressive power of the framework in the context of existing consistency models in detail
in Section 3.5.
1.2.2 Discussion
At first glance, providing a large number of options (as our framework does) rather
than a small set of hardwired protocols might appear to impose an extra design burden on
application programmers. Programmers need to determine how the selection of a particular option along one dimension (e.g., optimistic failure handling) affects the semantics
provided by the options chosen along other dimensions (e.g., exclusive write mode, i.e.,
WRLK). However, thanks to the orthogonality and composability of our framework’s
options, their semantics are roughly additive; each option only restricts the applicability
of the semantics of other options and does not alter them in unpredictable ways. For
example, employing optimistic failure handling and WRLK mode together for a data
access session guarantees exclusive write access for that session only within the replica’s
partition (i.e., among replicas connected well to that replica). Thus by adopting our
framework, programmers are not faced with a combinatorial increase in the complexity
of semantics to understand.

Table 1.2. Configurable Consistency (CC) options for popular consistency flavors.
For options left unspecified, we assume a value from the options vector: [RD/WR,
time=0, mod=∞, hard, no semantic deps, total order, pessimistic, session-grain visibility
& isolation].

• locking rd, wr (strong consistency). CC options: RDLK/WRLK. Sample applications:
  DB, objects, file locking. Existing support: DBMS, Objectstore [41]. Demo applications:
  SwarmProxy, RCS, SwarmDB.
• master-slave wr. CC options: WR, serial. Sample application: shared queue. Existing
  support: read-only replication (mySQL). Demo application: SwarmDB.
• close-to-open rd, wr. CC options: time=0, hard. Sample applications: collaborative
  file sharing, personal file access. Existing support: AFS, Coda [38], Fluid Replication
  [16]. Demo application: Swarmfs.
• bounded inconsistency. CC options: time=x, mod=y, hard. Sample applications:
  airline reservation, online shopping inventory queries. Existing support: TACT [77].
  Demo application: SwarmDB.
• MVCC rd. CC options: RD, time=0, soft, causal+atomic. Sample application:
  inventory queries. Existing support: Objectstore [41], Oracle. Demo application:
  SwarmDB.
• eventual / optimistic (close-to-rd, wr-to-rd). CC options: time=x, mod=y, soft,
  optimistic; per-update isolation for wr-to-rd propagation. Sample applications:
  directory, stock quotes. Existing support: Pangaea [62], Ficus, Coda, Fluid, NFS,
  Active Directory [46]. Demo applications: Swarmfs, SwarmDB.
• append consistency. CC options: WR, time=0, soft, none/total/serial ordering,
  per-update visibility. Sample applications: logging, chat, streaming, games. Existing
  support: WebFS [74], GoogleFS. Demo application: SwarmCast.
To ease application design, we anticipate that middleware systems that adopt our framework will bundle popular combinations of options
as defaults (e.g., ‘Unix file semantics’, ‘CODA semantics’, or ‘best effort streaming’)
for object access, while allowing individual application components to refine their con-
sistency semantics when required. In those circumstances, programmers can customize
individual options along some dimensions while retaining the other options from the
default set.
Although the options are largely orthogonal, a few of them imply others.
For example, the exclusive access modes imply session-grain visibility and isolation,
a hard most-current timeliness guarantee, and serial update ordering. Likewise, serial
update ordering implicitly guarantees that updates are totally ordered as well.
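These implications can be captured mechanically. The following normalization routine is our own sketch of that idea (the names are invented, not Swarm code); it only encodes the two rules stated above.

    #include <stdbool.h>
    #include <stdio.h>

    enum mode  { RD, WR, RDLK, WRLK };
    enum grain { SESSION, PER_UPDATE, MANUAL };
    enum order { NONE, TOTAL, SERIAL };

    struct opts {
        enum mode  mode;
        double     staleness;   /* seconds; 0 means "most current" */
        bool       hard;        /* strictly enforced bound?        */
        enum order ordering;
        enum grain visibility, isolation;
    };

    /* Exclusive modes imply session-grain visibility and isolation, a hard
     * time=0 (most-current) bound, and serial update ordering.  Serial ordering
     * already subsumes a total order, so no further rewriting is needed. */
    static void normalize(struct opts *o) {
        if (o->mode == RDLK || o->mode == WRLK) {
            o->visibility = SESSION;
            o->isolation  = SESSION;
            o->staleness  = 0.0;
            o->hard       = true;
            o->ordering   = SERIAL;
        }
    }

    int main(void) {
        struct opts o = { WRLK, 30.0, false, TOTAL, PER_UPDATE, PER_UPDATE };
        normalize(&o);
        printf("after normalization: staleness=%.0f hard=%d serial=%d\n",
               o.staleness, (int)o.hard, o.ordering == SERIAL);
        return 0;
    }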
1.2.3 Limitations of the Framework
Although we have designed our framework to support the consistency semantics
needed by a wide variety of distributed services, we have chosen to leave out semantics that are difficult to support in a scalable manner across the wide area or that require consistency protocols with application-specific knowledge. As a consequence, our
framework has two specific limitations that may restrict application-level parallelism in
some scenarios.
First, it cannot support application-specific conflict matrices that specify the parallelism possible among application-level operations [5]. Instead, applications must
map their operations into the read and write modes provided by our framework, which
may restrict parallelism. For instance, a shared editing application cannot specify that
structural changes to a document (e.g., adding/removing sections) can be safely allowed
to proceed in parallel with changes to individual sections, although their results can be
safely merged at the application level. Enforcing such application-level concurrency constraints requires the consistency protocol to track the application operations in progress at
each replica site, which is hard to do efficiently in an application-independent manner.
Second, our consistency framework allows replica divergence constraints to be imposed on data but not on application-defined views of data, unlike TACT’s conit concept
[77]. For instance, a user sharing a replicated bulletin board might be more interested in
tracking updates to certain threads of discussion, or to messages posted by her friends,
than to others. To support such requirements, views should be defined on the
bulletin board dynamically (e.g., all messages with subject ‘movie’) and updates should
be tracked by the effect they have on those views, instead of on the whole board. Our
framework cannot express such views precisely. It is difficult to efficiently manage such
dynamic views across a large number of replicas, because doing so requires frequent
bookkeeping communication among replicas that may offset the parallelism gained by
replication. An alternative solution is to split the bulletin board into multiple consistency
units (threads) and manage consistency at a finer grain when this is required. Thus, our
framework provides alternate ways to express both of these requirements that we believe
are likely to be more efficient due to their simplicity and reduced bookkeeping.
1.2.4 Swarm
To demonstrate the practicality of the configurable consistency framework, we have
designed and prototyped the Swarm middleware file store. Swarm is organized as a
collection of peer servers that provide coherent file access at variable granularity behind a
traditional session-oriented file interface. Applications store their shared state in Swarm
files and operate on their state via nearby Swarm servers. By designing applications
this way, application writers are relieved of the burden of implementing their own data
location, caching and consistency mechanisms. Swarm allows applications to access an
entire file or individual file blocks within sessions. When opening a session, applications can specify the desired consistency semantics via configurable consistency options,
which Swarm enforces for the duration of that session. Swarm files can be updated
by overwriting previous contents (physical updates) or by invoking a semantic update
procedure that Swarm later applies to all replicas (with the help of application plugins).
Swarm supports three distinct paradigms of shared data access: (i) whole-file access
with physical updates to support unstructured variable-length data such as files, (ii)
page-grained access to support persistent objects and other heap-based data structures,
and (iii) whole-file access with semantic updates to support structured data such as
databases and directories.
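A typical interaction with Swarm might then look like the sketch below. The function names (swarm_open, swarm_write, swarm_semantic_update, swarm_close) and the option strings are placeholders we made up and stubbed so the example is self-contained; only the overall pattern of sessions, physical updates, and semantic updates comes from the text.

    #include <stdio.h>

    /* --- Hypothetical Swarm client API (placeholders, stubbed for illustration) --- */
    typedef int swarm_session_t;

    static swarm_session_t swarm_open(const char *file, const char *cc_options) {
        printf("open %s with options [%s]\n", file, cc_options);
        return 1;
    }
    static void swarm_write(swarm_session_t s, long off, const void *buf, long len) {
        printf("session %d: physical write of %ld bytes at offset %ld\n", s, len, off);
        (void)buf;
    }
    static void swarm_semantic_update(swarm_session_t s, const char *proc, const char *arg) {
        /* A semantic update names a procedure that Swarm replays at every
         * replica with the help of an application plugin. */
        printf("session %d: semantic update %s(%s)\n", s, proc, arg);
    }
    static void swarm_close(swarm_session_t s) { printf("session %d closed\n", s); }

    int main(void) {
        /* Whole-file access with physical updates (paradigm (i)). */
        swarm_session_t doc = swarm_open("/swarm/docs/report.txt",
                                         "mode=WR, time=0, hard, session visibility");
        swarm_write(doc, 0, "draft", 5);
        swarm_close(doc);

        /* Whole-file access with semantic updates (paradigm (iii)),
         * e.g., a directory stored as structured data. */
        swarm_session_t dir = swarm_open("/swarm/index/music.db",
                                         "mode=WR, time=10, soft, optimistic");
        swarm_semantic_update(dir, "insert", "song=foo.mp3,host=alice");
        swarm_close(dir);
        return 0;
    }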
Swarm builds failure-resilient overlay replica hierarchies dynamically per file to manage large numbers (thousands) of file replicas across non-uniform networks. By default,
Swarm aggressively caches shared data on demand, but dynamically restricts caching
when it observes high contention. We refer to this technique as contention-aware replication control. Swarm’s caching mechanism dynamically monitors network quality
between clients and replicas and reorganizes the replica hierarchy to minimize the use
of slow links, thereby reducing latency and saving WAN bandwidth. We refer to this
feature of Swarm as proximity-aware replica management. Finally, Swarm’s replica
hierarchy mechanism is resilient to intermittent node and network failures that are common in dynamic environments. Swarm’s familiar file interface enables applications to be
programmed for a location-transparent persistent data abstraction, and to achieve automatic replication. Applications can meet diverse consistency requirements by employing
different sets of configurable consistency options. The design and implementation of
Swarm and configurable consistency are described in detail in Chapter 4.
1.3 Evaluation
To support our claim that configurable consistency can meet the data sharing needs of
diverse applications in the wide area efficiently, we used the Swarm prototype to support
four diverse network services. With these services, we demonstrate three properties of
our configurable consistency implementation that are important to support wide area
replication effectively: (1) efficient enforcement of diverse consistency semantics, (2)
network economy, and (3) failure-resilience at scale. The Swarm prototype used for our
evaluation supports all of the configurable consistency options except causality, atomicity, and timeliness control via limiting unseen updates (the mod bound option). As
the last column of Table 1.2 indicates, we evaluated the four services with a variety of
consistency flavors. More details of the services and their evaluation are presented in
Chapter 5.
Our first service, called Swarmfs, is a wide area peer-to-peer file system with a
decentralized but uniform hierarchical file name space. It provides ubiquitous file storage
to mobile users via autonomously managed file servers [69]. With Swarmfs, multiple applications can simultaneously access the same or different files while enforcing diverse
consistency semantics such as the strong consistency of Sprite [48], the close-to-open
consistency of AFS [28] and Coda, the weak eventual consistency flavors of Coda,
Pangaea, NFS, and Ficus file systems [38, 62, 57], and the append-only consistency
of WebFS [74]. On a personal file access workload, Swarmfs delivers 90% of the
performance of comparable network file systems (Coda and NFS) on a LAN and 80%
of local Linux file system performance across a WAN despite being layered on top of
an unoptimized middleware. Unlike existing file systems, Swarmfs can exploit data
locality even while providing the strong consistency required for file locking across
nodes. For WAN sharing of version control files that require file locking (e.g., in a
shared RCS/CVS repository), Swarmfs provides near-local latency for checkin and
checkout operations when repository accesses are highly localized to a site, and less
than one-third the latency of a traditional client-server version control system even in the
presence of widespread sharing.
Our second application, called SwarmProxy, models how the responsiveness of an
online shopping service such as Amazon.com could be improved for clients in different
geographic regions by deploying WAN proxies. By caching remote service objects
such as inventory records locally, proxies can improve the response time, throughput
and availability of the service. However, a shopping service tends to have stringent
data integrity constraints for transactions such as inventory updates. Enforcing those
constraints often requires strong consistency [41] among caches which can be expensive
over a WAN, especially when clients contend for access to shared objects. Our evaluation
of SwarmProxy demonstrates that proxy-caching with Swarm improves the aggregate
throughput and the average latency of an enterprise service by several times relative to
a remote centralized server even when clients exhibit significant contention (i.e., 40%
locality). Swarm’s contention-aware replication control provides the low latency and
high throughput of a LAN-based server under low contention (i.e., 60% locality or more),
and more than 90% of a remote server’s performance even under extreme contention (i.e.,
less than 10% locality). In comparison, aggressive caching underperforms by a factor of
3.6 at all levels of contention due to the high cost of synchronization over a WAN.
Our third application, called SwarmDB, illustrates how Swarm enables read-write
caching to be added transparently to an existing database while providing significant
control over consistency to its applications, and the performance and engineering benefits
of this approach. We augmented the popular BerkeleyDB database library [67] with
caching support by implementing a wrapper library named SwarmDB around unmodified
BerkeleyDB. SwarmDB intercepts BerkeleyDB calls to operate on replicas. It hosts
each BerkeleyDB database inside a Swarm file and uses Swarm’s semantic updates to
synchronize replicas. Unlike BerkeleyDB’s existing master-slave replication that only
supports read-only replicas, Swarm-based BerkeleyDB can support full-fledged read-write replication with a wide range of consistency semantics. We evaluate SwarmDB’s
query and update throughput under an update-intensive workload. We evaluate five
distinct consistency flavors for queries and updates, ranging from strong (appropriate for
a conventional database) to time-based eventual consistency (appropriate for many directory services), and compare the performance of the resulting system to BerkeleyDB’s
client-server (RPC) implementation. We find that relaxing consistency requirements
even slightly improves write throughput by an order of magnitude and scales beyond
RPC and master-slave, enabling SwarmDB to be reused for diverse applications.
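The interception pattern can be sketched as follows. Everything here is a stand-in: the real wrapper links against unmodified BerkeleyDB and uses Swarm's actual semantic-update interface, whereas this self-contained sketch stubs both layers with print statements.

    #include <stdio.h>

    /* Hypothetical stand-ins for the SwarmDB wrapper.  The real wrapper
     * intercepts unmodified BerkeleyDB calls (e.g., DB->put) and forwards
     * them as Swarm semantic updates; here both layers are stubbed. */
    static int local_bdb_put(const char *key, const char *val) {
        /* Stands in for applying the put against the local database replica. */
        printf("local replica: put(%s, %s)\n", key, val);
        return 0;
    }
    static void swarm_log_semantic_update(const char *op, const char *key, const char *val) {
        /* Stands in for handing Swarm an update procedure that it will later
         * replay, in the configured order, at every other replica. */
        printf("swarm: queue %s(%s, %s) for replica synchronization\n", op, key, val);
    }

    /* The interception point: apply the operation locally, then record it as a
     * semantic update so that other replicas converge to the same contents. */
    int swarmdb_put(const char *key, const char *val) {
        int rc = local_bdb_put(key, val);
        if (rc == 0)
            swarm_log_semantic_update("put", key, val);
        return rc;
    }

    int main(void) {
        swarmdb_put("user:42", "alice");
        return 0;
    }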
We implemented a prototype peer-to-peer file sharing service using SwarmDB for
attribute-based file search. We used this service to evaluate the network economy and
failure-resilience of Swarm’s replication and lease-based consistency mechanisms under
node churn (i.e., where nodes continually join and abruptly leave the network). In a 240-node file sharing network, Swarm’s proximity-aware replica management mechanism
reduces the latency and WAN bandwidth consumption for new file downloads and limits
the impact of high node churn (5 node deaths/second) to roughly one-fifth that of random
replica networking (employed by several P2P systems such as KaZaa [34]).
Our final application, called SwarmCast, is an online streaming service that stresses
the ability of Swarm to efficiently multicast information to a large number of sites in real-time. It models applications where a number of users or service components interact in
publish-subscribe mode, such as online real-time chatting, Internet gaming, data logging,
and content dissemination. These applications require events/data produced at a site
(such as a message typed by a user to a chat room or a player’s move in a game) to be
propagated to other affected sites in real-time. In SwarmCast, a single event producer
sends event packets at full-speed to a relay server which in turn multicasts them to a
large number of subscribers. A centralized relay server can be quickly overwhelmed by
the CPU and network load imposed due to a high rate of concurrent events. Swarm’s
replica network can be leveraged as an overlay network of relay servers to scale the
event-handling rate of SwarmCast, as follows. The producer appends event packets at
full-speed to a shared log file stored in Swarm by sending semantic updates to a Swarm
server. Subscribers open the log file for reading at one or more Swarm servers and
register to receive log updates. In response, Swarm servers cache the log file, thereby
forming a replica hierarchy to efficiently multicast the event packets. Thus, network
I/O-handling is largely eliminated from SwarmCast application code. In spite of our
unoptimized Swarm implementation, Swarm-based multicast pushes data at 60% of the
rate possible with ideal multicast on a 100Mbps switched LAN with up to 20 relays and
120 subscribers. However, since our Swarm prototype is compute-bound, the end-to-end packet latency through a tree of Swarm servers 5 levels deep is about 250ms on
an 850MHz Pentium III CPU and 80ms on a 3GHz Pentium 4. Half of the latency is
incurred on the first hop due to queuing before the root server.
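The publish-subscribe pattern behind SwarmCast can be sketched roughly as follows; the function names are hypothetical, and the replica hierarchy is reduced to a single locally registered callback so the sketch stays self-contained.

    #include <stdio.h>

    /* Hypothetical sketch of the SwarmCast pattern: a producer appends event
     * packets to a shared log as semantic updates, and subscribers register a
     * callback that is invoked as the updates reach their local replica. */

    typedef void (*log_callback)(const char *event);

    static log_callback g_subscriber;   /* single local subscriber, for brevity */

    static void swarm_register_update_callback(log_callback cb) { g_subscriber = cb; }

    static void swarm_append(const char *logfile, const char *event) {
        printf("append to %s: %s\n", logfile, event);
        /* In Swarm, the append would propagate down the replica hierarchy;
         * here we simply deliver it to the locally registered subscriber. */
        if (g_subscriber) g_subscriber(event);
    }

    static void on_event(const char *event) {
        printf("subscriber received: %s\n", event);
    }

    int main(void) {
        swarm_register_update_callback(on_event);                /* subscriber side */
        swarm_append("/swarm/chat/room42.log", "alice: hello");  /* producer side   */
        return 0;
    }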
Each of these applications has different requirements for accessing cached data and
achieves different consistency semantics using Swarm. Swarm successfully handles
caching for each of these applications, offloading its complexity from the application
logic. The Swarm-based implementations of these applications provide comparable (at least 60%) or
better performance relative to implementations based on existing support. This shows
that configurable consistency is powerful, i.e., supports a variety of useful consistency
semantics, as well as practical and useful, i.e., effectively supports wide area replication
for diverse distributed applications.
1.4 Thesis
Therefore, we put forth the thesis that a small set of customizable data coherence
mechanisms can support aggressive wide-area data caching effectively for distributed
services with diverse consistency requirements.
The major contributions of this dissertation research are the following:
1. We have studied the diverse data sharing needs of a number of important distributed applications belonging to three major classes, classified those needs in
a way that exposes their commonality, and identified certain core design choices
that recur in the consistency management of these applications.
2. Based on the commonalities identified by our application study, we have developed
a novel flexible consistency framework that can support a broader set of consistency requirements of these applications than existing consistency solutions, with
simple parameters along five orthogonal dimensions.
3. We have shown that the framework can be implemented to support diverse application
needs efficiently for caching in non-uniform networks. To do this, we implemented
a caching middleware that adopts this framework, and demonstrated four applications that reuse the middleware and its flexible consistency in very different ways
for effective wide area caching.
In the remainder of this dissertation, we motivate and then demonstrate the truth of our
thesis statement.
1.5 Roadmap
In Chapter 2, we discuss our survey of the replication needs of diverse distributed applications and the rationale behind our configurable consistency management approach.
In Chapter 3, we describe the configurable consistency framework and discuss the rich
variety of useful consistency semantics that are possible in a system that adopts configurable consistency. In Chapter 4, we first describe how configurable consistency can be
used in the design of distributed applications. We then describe the interface, design and
implementation of the Swarm replication middleware, our research vehicle for evaluating
the practicality of configurable consistency. In Chapter 5, we present our evaluation of
Swarm in the context of the four services mentioned in Section 1.3. For each service, we
start by describing how we implemented it on top of Swarm and then present a detailed
evaluation of its performance and compare it with alternative implementations where
applicable. In Chapter 6, we discuss Swarm in the context of the extensive existing
research on replication and consistency. Finally, in Chapter 7, we suggest some directions
for future work and summarize our conclusions.
CHAPTER 2
CONFIGURABLE CONSISTENCY: RATIONALE
As mentioned in Chapter 1, our goal is to determine if middleware support for data
caching can be provided in a reusable manner for a wide variety of distributed services.
To address this question, we first surveyed a wide variety of distributed services and
identified those that can benefit from data caching. Next, we examined their sharing
needs to see if there is sufficient commonality in those needs for a reusable middleware
to be useful. In this chapter, we describe our survey and our findings to motivate our
decomposed approach to consistency management.
We begin this chapter by defining what we mean by consistency and stating the objective of our application survey in Section 2.1. We describe the classes of applications that
we surveyed in Section 2.2 and, in Section 2.3, describe four representative applications
from these classes that we use to motivate and evaluate our work. In Sections 2.4 and 2.5, we
present our observations on the data characteristics and consistency needs of applications, identifying significant
commonalities. In Section 2.6, we discuss how these commonalities enable the development of our configurable consistency framework. In Section 2.7, we discuss other
approaches to flexible consistency management in the light of our survey of application
requirements.
2.1 Background
2.1.1 Programming Distributed Applications
The paradigms adopted by designers of distributed applications can be loosely classified as data-shipping, function-shipping, or a hybrid of the two. In all three paradigms, a
distributed application is structured in terms of distributed components that own and
manage different portions of its overall state. Here, state refers to application data that
can persist across multiple application-level computations (such as processing different
client requests). In the function-shipping paradigm (commonly known as the client-
server paradigm), a component operates on remote state by sending messages to its
owner. However, in the data-shipping paradigm, a component operates on remote state
by first obtaining it locally, thereby exploiting locality. A shared data abstraction can
relieve the programming effort involved in data-shipping by providing the illusion that
all application state is available locally to each component, and can be accessed as in a
multithreaded single-machine environment. Data caching/replication is relevant to those
portions of a distributed application that are designed for data-shipping. The structure
of an application using function- and data-shipping is illustrated in Figures 2.1 and 2.2
respectively.
For example, a chat service is traditionally split into two kinds of components with
different roles: the client and the server. The client is responsible for managing the
user interface and sending user messages to the server. The server keeps track of all
application state, e.g., the chat transcript and the chat room membership, and communicates client messages to other clients. However, in a pure data-shipping model, both
the client and the server directly access the application state, namely the chat transcript,
thus blurring their distinction. They become peer service replicas. The job of tracking
chat room membership and broadcasting messages between clients is then delegated to
an underlying data management system that mediates access to application state.
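The structural difference can be made concrete with a toy sketch of the two send paths for the chat example; the code below is purely illustrative C with made-up names and is not drawn from any of the systems discussed in this chapter.

    #include <stdio.h>
    #include <string.h>

    /* Function-shipping: the client forwards the message to the server that
     * owns the transcript and the membership list. */
    static void chat_send_rpc(const char *server, const char *msg) {
        printf("RPC to %s: post(\"%s\")\n", server, msg);
    }

    /* Data-shipping: the component appends directly to a locally cached copy
     * of the transcript; an underlying data layer propagates the update to
     * its peer replicas. */
    static char transcript[1024];
    static void chat_send_cached(const char *msg) {
        size_t used = strlen(transcript);
        snprintf(transcript + used, sizeof transcript - used, "%s\n", msg);
        printf("local transcript updated; replication layer will sync peers\n");
    }

    int main(void) {
        chat_send_rpc("chat.example.org", "hello");
        chat_send_cached("hello");
        return 0;
    }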
In a hybrid paradigm, illustrated in Figure 2.3, application components (also called
service agents) can take on either role based on the processing, storage and network
resources at their disposal. For example, the chat service agent at a client site with a high
bandwidth network link can take on the role of a peer chat server to ease load on other
servers, while other service agents continue as clients. Thus, the hybrid paradigm allows considerable flexibility in structuring a distributed application in a way that permits
incremental scaling with offered load.
2.1.2 Consistency
Data-shipping requires coordinating concurrent accesses to application state to ensure
its consistency in the face of distribution and replication. Traditionally, the consistency
of shared data is defined in terms of a consistency model, which is a contract between
an application and the underlying shared data management system regarding data access
[70]. If the application adheres to system-imposed conventions for data access, such as
informing the system before and after each access, the system guarantees to maintain
certain invariants about the quality of data accessed, referred to as its consistency semantics. An example invariant could be “all components see updates in the same order”.
Consistency mechanisms are the actions taken by the system’s components to enforce this
contract. These actions include (i) controlling admission of access and update operations
on data replicas and (ii) propagating updates among replicas to maintain desired data
quality invariants. A consistency framework supports one or more consistency models
for shared data access by providing an interface to express them as well as mechanisms
to enforce them.

Figure 2.1. A distributed application with five components A1..A5 employing the
function-shipping model. Each component holds a fragment of overall state and operates
on other fragments via messages to their holder.
Previous research has revealed that distributed applications vary widely in the consistency semantics they need [77]. It is also well-known that consistency, scalable performance, and availability in the wide area are often conflicting goals. Different applications need to make different tradeoffs based on application and network characteristics.
Figure 2.2. A distributed application employing the data-shipping model. A1..A5
interact only via caching needed state locally. A1 holds fragments 1 and 2 of overall
state. A2 caches fragments 1 and 3. A3 holds 3 and caches 5.

Many existing systems [38, 62, 52, 41, 46, 61] handle this diversity by devising consistency models tailored to specific classes of applications. Several recent research efforts
(e.g., Fluid replication [37], Oceanstore [60], and TACT [77]) have developed consistency
frameworks that cater to a broader set of applications by providing options to customize
consistency. However, a complete reusable middleware solution that helps arbitrary
wide area applications achieve the right balance does not exist. The difficulty lies in
developing a consistency framework that can express a broad variety of consistency
semantics, can be enforced efficiently by a small set of mechanisms, and is flexible
and adaptable enough to enable applications to make the right tradeoffs based on their
environment. The advantage of having such a framework is that it paves the way for
designing a reusable middleware for replication management in distributed applications.
2.2 Replication Needs of Applications: A Survey
To study the feasibility of a consistency framework that supports a broader set of
replication needs than addressed by existing systems, we surveyed a wide variety of
distributed applications looking for common traits in their diverse needs. We considered
three broad categories of popular Internet-based applications that handle a significant
amount of data and can benefit from replication: (i) file access, (ii) collaborative groupware, and (iii) enterprise data services. Efficient replication solutions exist for individual
applications in many of these categories. However, our aim is to arrive at a systematic
understanding of their consistency requirements to see if a more generic middleware
solution is feasible. Our study, presented in this section, reveals that the consistency
requirements of a wide variety of applications can be expressed along a small set of
fairly orthogonal dimensions. As we argue in Section 2.7, to express the same set of
requirements naturally, existing consistency interfaces need to be extended further.

Figure 2.3. Distributed application employing a hybrid model. A1, A3 and A5 are in
data-shipping mode. A2 is in client mode, and explicitly communicates with A1 and A3.
A4 is in hybrid mode. It caches 3, holds 4 and contacts A5 for 5.
The first category of applications facilitates users accessing their files over a network
and sharing them with other users for various collaborative tasks. Access to personal
files, shared documents, media, calendars, address books, and other productivity tools
falls under this category. Although specific instances of end user applications in this
category use files in very different ways, they all exhibit a high degree of locality, making
caching highly effective at improving file availability and access latency. Applications in
the second category enable users to form online communities, exchange information and
interact in real-time or otherwise. Examples include bulletin boards, instant messaging,
Internet gaming, collaborative search and resource sharing. Though these applications
do not always exhibit high degrees of locality, replication helps these applications scale
to a large number of users by spreading the workload among multiple server sites. The
third category involves multi-user access to structured data. In addition to supporting
concurrent read-write sharing, these applications also need to meet integrity constraints
on the data. Example applications include enterprise databases (e.g., inventory, sales, and
customer records), CAD environments, and directory services that index dynamic content such as online shopping catalogs, bulletin board messages, user profiles and network
resources. These applications are typically centrally administered, and replication helps
them provide responsive service to geographically widespread users by hiding wide area
latencies and network failures.
Table 2.1 summarizes the data characteristics and replication needs of several applications in each of the three categories. For each application, it lists why and where
replication is beneficial, the typical replication factor, the available locality, the frequency
and extent of read or write sharing among replicas, and the likelihood of conflicts among
concurrent updates.
2.3 Representative Applications
For the remainder of this discussion, we will focus on four applications that are
broadly representative of a variety of wide area services: a file service, a shared music
index, a chat service, and an online auction service. We refer to these four applications throughout this thesis to motivate the design
decisions that we made. We describe our implementation and performance evaluation of
these applications using our consistency management framework in Chapter 5. Before
proceeding with our analysis in Section 2.4, we describe each of the four applications in
turn.
2.3.1 File Sharing
Our first focus application is a distributed file service. A file system is a convenient
and popular way to access and share data over a network. File systems have been used to
store and share persistent data by myriad applications, so a large number of file systems
have been developed for various environments. Depending on their application, files are
accessed and shared in different ways in a variety of network environments. Personal files
are rarely write-shared by multiple users, whereas software, multimedia and other read-only
files tend to be widely read-shared. Collaboration environments involve multiple users
cooperating to update files via version control systems to prevent update conflicts. Mail
systems store messages in files that are mostly accessed by their recipient.

Table 2.1. Characteristics of several classes of representative wide-area applications.
For each application class (file service, groupware, and enterprise data) and for specific
applications within each class (personal files, version control, ad-hoc document sharing,
media files, web/content distribution, and data/event logging; search indexes, bulletin
boards, email, stock and news updates, producer-consumer applications such as chat and
games, and real-time streaming; shopping catalogs, airline reservation, directories, online
auctions, CAD, and sales/inventory databases), the table lists the natural unit of sharing
(file, document, log, database, message, mailbox, event queue, game state, catalog, or
object), the main benefit of replication (lower latency, availability, lower bandwidth
utilization, or load balancing), the typical replication factor (from 1-10 up to 10,000s of
replicas), the available locality (low, high, or variable), the dominant sharing pattern
(single-user, read-write, write-once/read-many, one-writer/read-many, or read/append),
and the likelihood of conflicts among concurrent updates (ranging from none to frequent).
An application's property is the same as that of its class, unless specifically mentioned.
Though files are used in many ways, as indicated by their diverse sharing characteristics in Table 2.1, they exhibit a high degree of locality in general. Hence caching files
aggressively is highly effective at improving latency and availability in the wide area.
Existing file systems support only a subset of the sharing patterns listed in Table 2.1.
2.3.2 Proxy-based Auction Service
Our second focus application is an online auction service. It employs wide-area
service proxies that cache important auction information locally to improve responsiveness to customers in diverse geographic regions. It is modeled after more sophisticated
enterprise services that handle business-critical enterprise objects such as sales, inventory, or customer records. Many of these services benefit from deploying wide area
proxies that cache enterprise objects locally to serve regional clients. For example, the
administrators of a corporation’s database service can deploy proxies at customer service
(or call) centers that serve different sets of clients. The proxy at a call center caches
many of the client records locally, speeding up response. An online auction service such
as EBay [20] could deploy proxies at different sites to exploit regional locality. In this
case, different proxies handle the sale of different items based on how heavily the items
are traded in their region. Even if trading does not exhibit much locality, proxies help
spread the service load among multiple sites.
An important challenge in designing a proxy-based auction service is the need to
enforce strong consistency requirements and integrity constraints in spite of wide area
operation. Node/network failures must not degrade service availability when proxies
are added. Also since locality varies widely based on the popularity of sale items,
caching decisions must be made based on the available locality. Widely caching data
with poor locality leads to significant coherence traffic and hurts rather than improves
performance.
2.3.3 Resource Directory Service
Our third focus application is a resource directory service that maintains a global
index of shared resources (e.g., network devices, music files, employee information
and machine configuration) based on their attributes. It is modeled after sophisticated
directory services that offer complex search capabilities [46].
Unlike file accesses, the locality of index queries and updates varies widely. The
consistency needs of the directory service depend on the resources being indexed and
the applications relying on them. A very weak (e.g., eventual) consistency guarantee is
sufficient for a music file index used for peer-to-peer music sharing such as KaZaa, but
it must scale to a large floating population of (thousands of) users frequently updating it.
An employee directory may need some updates to be “immediately” made effective at
all replicas (e.g., revoking a user’s access privileges to sensitive data), while others can
be performed with relaxed consistency.
2.3.4 Chat Service
Our final focus application is an online chat service. It models several applications
that involve producer-consumer style interaction among a number of users or service
components, such as instant messaging, Internet gaming, Internet search and data mining.
The application requires that events/data produced at one or more sites, such as a message
typed by a user to a chat room or a player’s move in a game, be propagated to other
affected sites in real-time. In some cases, the events need to be delivered in a certain order,
e.g., total or causal order. For example, chat room messages require causal order for
discussions to make sense (e.g., an answer appearing before its corresponding question
is confusing), whereas players’ moves in a game must be delivered in the same order to
all participants for fairness.
A key issue for scaling such a service is the CPU and network load imposed on
the server. The number of users who interact directly (e.g., participants in a chatroom or
players active in a shared virtual space) is usually small (dozens). But the total number of
passive subscribers to the service could be very large (thousands or more), e.g., listeners
in a chat room. A centralized game or chat room server quickly saturates its outgoing
Internet link, because the bandwidth consumed for broadcast among N users through a
single node grows quadratically with N (since each user’s message must be sent to N-1
other users). Replicating service state on proxy (chat or game) servers helps incrementally scale the overall service with load, as the replicas can be organized into an overlay
network and synchronized via an efficient multicast mechanism. When replicated, the
service must provide event ordering and/or real-time propagation guarantees, and scale
to a large number of participants and a moderate to high rate of concurrent events.
2.4 Application Characteristics
Table 2.1 reveals that, in general, applications differ widely in their natural unit of
sharing, their data access characteristics and their replica scaling requirements. The
natural unit of sharing could be a whole file or database, its individual pages, or the
application objects embedded therein.
Applications differ in their available locality and the likelihood of conflicts among
concurrent updates. By locality, we mean the percentage of data accesses at a site that
are not interleaved by updates to remote replicas. File accesses are known to exhibit high
locality and hence aggressive caching improves their access latency and availability in
the wide area, even when updates need to be serialized to prevent conflicts (as in version
control). Collaborative groupware applications involve concurrent updates to shared data
and exhibit low locality. However, since their updates rarely conflict, caching improves
concurrency. Enterprise applications involve frequent updates that are more likely to
conflict, and hence need to be serialized to preserve data integrity. They exhibit varying
locality. Naively replicating enterprise data with insufficient locality could severely hurt
performance and service availability. Hence unlike in the case of file access, these
applications need to adapt between function-shipping and aggressive replication based
on the amount of data access locality at various sites.
Applications have different scaling requirements in terms of the number of replicas
of a data item, ranging from a few dozens to thousands. The replication factor depends
on the number of users that actively share a data item and the number of users that a
single replica can support.
Table 2.1 shows the data characteristics of the applications surveyed. For the purpose
of this survey, we classify data accesses into reads and writes, but define them broadly
to include arbitrary queries and updates. Reads can be arbitrary queries that do not
modify data. Writes access data as well as make arbitrary modifications either by directly
overwriting existing contents or via update procedures that can be later re-executed at
other replicas. The sharing patterns of applications range between little sharing (e.g.,
personal files), widespread read-only sharing (e.g., music files, software), and frequent
read-write sharing (e.g., documents, databases).
2.5 Consistency Requirements
Table 2.2 summarizes the consistency requirements of the surveyed applications.
The columns classify each application’s required invariants along several aspects. We
now discuss each of these aspects, illustrating them with the four target applications
mentioned in Section 2.3.
2.5.1 Update Stability
Applications differ in their notion of what constitutes a “stable” update to a data item,
as shown in Table 2.2. Informally, a stable update is one that transforms
data from one valid state to another valid state that can be made visible (or propagated)
to remote replicas. Updates to source files and documents are typically considered
stable only at the end of a write session, whereas user messages in a chat session are
independent and can be sent to other users as soon as they are received. These notions of
an application’s acceptable granularity of update visibility are referred to in Table 2.2 as
session and per-update visibility respectively.
2.5.2 View Isolation
Applications have different requirements for the extent to which ongoing data access
at a replica site must be isolated from concurrent remote updates. Reading a source file or
querying an auction service requires an unchanging snapshot view of data across multiple
accesses during a file session or a database query session. On the other hand, to enable
real-time interaction, a chat service requires that incoming updates be applied locally
before every access. Table 2.2 refers to these requirements as session and per-update
isolation respectively.

Table 2.2. Consistency needs of several representative wide-area applications.
For the same applications as in Table 2.1, the table lists the required update stability
(per-update, or at the granularity of a single operation, message, reservation, move, bid,
transaction, or session), the required view isolation (session, per-update, or manual), the
concurrency control needed for reads and for writes (weak or exclusive), the replica
synchronization requirements, namely timeliness (most current, time-bounded,
modification-bounded, or manual), the strength of the bound (hard or soft), and update
dependencies and ordering (none, causal, atomic, total, or serial), and finally the failure
handling mode (optimistic or pessimistic).
In general, applications require the ability to group a set of individual data accesses
into a session for expressing their update visibility and view isolation requirements.
Traditionally, transactions [8] are one common way to express such requirements in the
context of database applications.
2.5.3 Concurrency Control
Applications that involve write-sharing differ in the extent to which they can handle
concurrent reads and writes at multiple replicas. In general, allowing parallel reads and
writes has two consequences. First, if replicas apply updates in different order from one
another, their final results could diverge and the effect of some updates could be lost
permanently. Such updates are said to conflict. Replicas can resolve conflicts by undoing
and reapplying updates in a final order consistent at all replicas (called commit order),
or by merging their results to avoid losing the effect of some updates. This leads to the
second consequence: reads may provide an incorrect view of data if they observe the
effect of uncommitted writes (i.e., those whose position in the final commit order is yet
undetermined) that have to be undone later.
An application can afford to perform parallel writes if their semantics allow conflicts
to be resolved in a way that ensures a consistent final result at all replicas. Otherwise,
the application must serialize writes to avoid conflicts. Similarly, the application must
serialize its reads with writes to ensure that its users always view committed data. A
broad variety of applications that we surveyed either require serialized accesses or can
handle an arbitrary amount of parallelism provided replicas converge on a common final
result.
In Table 2.2, we refer to accesses that must be serialized with
other writes as ‘excl’ (for exclusive) and others as ‘weak’. For the auction service, conflicting concurrent writes at different replicas are unacceptable as they are hard to resolve
automatically (e.g., selling the same item to different clients). A shared music directory
can perform parallel lookups, insertions and deletions as long as replica convergence can
be assured. This is because conflicting insertions and deletions can at most cause some
subsequent file lookups to fail temporarily. Most file accesses except version control
exhibit little write-sharing, and hence need not be serialized. The chat service, data
logging, and other publish-subscribe services perform a special type of write, namely,
appending data to the end of a shared file or queue. Unlike raw writes to a file, appends
can be executed concurrently, as they can be automatically reconciled by applying them
sequentially without losing data.
2.5.4 Replica Synchronization
Replicas must be synchronized in a timely manner to prevent their contents from
diverging indefinitely. Some applications must always access the latest data, while others
can tolerate ‘eventual’ replica convergence. We identify three degrees of acceptable
replica divergence: current, time (time-bounded), and mod (modification-bounded), as
shown under ‘Timeliness’ in Table 2.2.
For example, users working on a shared document expect it to reflect the latest
updates made by other users before each editing session (current). For stock-quote
updates to be useful, their staleness must be bounded by a time interval (time). Finally,
an airline reservations system must limit the extent to which a flight is overbooked. To
do this accurately across multiple database replicas, it must enforce an upper limit on the
number of unseen remote reservations at each replica (mod).
Some applications need to hand-tune their replica synchronization strategy based on
domain-specific knowledge. We refer to their requirement as a ‘manual’ divergence
bound. For instance, consider a shared music file index, portions of which are cached
by a large number of Internet users. An index replica needs to resynchronize with other
replicas only to serve a local cache miss, i.e., if an entry being looked up is not found in
the local index replica. Similarly, when a site advertises a new entry to its local index
replica, it need not actively inform all other replicas about the new entry. However, when
a site removes an entry from the index, it needs to propagate that deletion quickly to
those replicas that got notified of its insertion, to force them to purge that entry from their
copies. With this scheme, each index replica eventually collects its working set of index
entries, and avoids synchronization as long as its working set does not change, regardless
of new additions elsewhere. Manual control over synchronization is also helpful to
conserve battery power on mobile devices by exploiting batched sessions. Transmitting
data in large chunks infrequently is known to consume less power than multiple frequent
transmissions of small packets [31]. Batching updates also helps conserve bandwidth
by removing self-canceling operations (such as addition and deletion of the same name
from a directory) from the update stream.
Applications differ regarding whether their timeliness bounds are ‘hard’, i.e., need
to be strictly enforced, or ‘soft’, indicating that best effort suffices. Strictly enforcing a
timeliness bound incurs synchronization delays and reduces an application’s availability
when some replicas become inaccessible or weakly connected. Moreover, not all applications require hard bounds; strict enforcement is unacceptable for applications such as chat, as it severely
limits their responsiveness (e.g., delaying the acceptance of a user’s next message until
her previous message reaches everybody). Groupware applications such as chat, bulletin
boards and resource directories can tolerate best-effort eventual convergence qualified by
a soft bound in exchange for better responsiveness and service availability.
Table 2.2 lists the strength of timeliness bound required by the applications surveyed as
either hard or soft, and reveals the diversity in this aspect.
2.5.5 Update Dependencies
Some updates are semantically dependent on other updates made earlier to the same
or different data items. For application correctness, such dependencies must be taken
into account when propagating updates among replicas. We identify two important types
of dependencies that capture the needs of a wide variety of applications: causality, and
atomicity. Applications also differ in the order in which they require concurrent independent updates to be applied at various replicas, which we classify as none, total order, and
serial order. These requirements are also shown in Table 2.2.
Causal dependency [18] means that if replica A sees B’s update and then makes an
update, B’s update must be seen before A’s everywhere. For example, in a file system
directory, a file deletion followed by the creation of a new file with the same name must
be performed in the same order everywhere [62]. Other collaborative applications such
as chat service and email delivery require that causally dependent updates must be seen
in causal order everywhere, but otherwise require no ordering.
Some updates must be applied atomically (in an all-or-nothing fashion) at each replica
to preserve data integrity despite partial failures, especially when updates span multiple
objects or consistency units. Examples include the file rename operation in a file system
and money transfer between bank accounts.
Unordered update delivery (none) suffices for applications whose updates are independent and commutative. For instance, updates to different entries in a distributed file
system directory or a music index are commutative and can be applied in any order at
different directory replicas.
When updates are independent but not commutative (e.g., when a data item is overwritten with a new value), they must be applied in the same (i.e., total) order everywhere
to ensure replica convergence. For example, if concurrent conflicting changes to a file
are propagated to different replicas in a different order, their contents could diverge
permanently. Depending on the application, the criterion for the actual order could be
arbitrary or based on update arrival times. Applications that rely on the chronology of
distributed events need to impose a total update ordering that matches their time order.
Examples include user moves in a multi-player game and real-time event monitoring.
Finally, some updates that require total ordering cannot be undone, and hence cannot
be reordered during propagation among multiple replicas. They must be globally serialized (e.g., by executing one after another). For example, consider a replicated queue.
It is unacceptable for multiple clients that concurrently issue the dequeue operation at
different replicas to obtain the same item, although multiple enqueue operations can be
executed concurrently and reordered later to ensure the same order everywhere. Hence
the dequeue operation requires ‘serial’ ordering, while enqueue requires ‘total’ ordering.
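Anticipating the session-based interface introduced in Chapter 3, the two queue operations might be tagged with different ordering requirements roughly as follows; the option names and call signatures are illustrative only.

enqueue(queue, item)
{
    sid = open(queue, WR, ordering=TOTAL);    // enqueues may be undone and reordered
    update(sid, 'enq item');                  // into the same final order everywhere
    close(sid);
}

dequeue(queue)
{
    sid = open(queue, WR, ordering=SERIAL);   // dequeues are never reordered, so no two
    item = update(sid, 'deq');                // replicas can hand out the same item
    close(sid);
    return item;
}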
2.5.6 Failure-Handling
Partial failures (such as some nodes or network links going down) are inevitable
in the wide area due to independent failure modes of components. Applications differ
in how they can handle partial failures. When desired consistency semantics cannot
be guaranteed due to some replicas becoming inaccessible, some applications can be
optimistic, i.e., continue to operate under degraded consistency so as to remain available
to users until the failure heals. Others must pessimistically abort affected operations or
delay them until the failure is repaired, as otherwise the application could enter an invalid
state due to conflicting updates. The choice depends on the likelihood of conflicts and
whether an application can automatically recover from the inconsistency. For example,
an online auction service must be pessimistic when partitioned, as selling the same item
to multiple clients is unacceptable behavior and cannot be undone. In contrast, a file
service can optimistically allow file access when partitioned, because conflicting file and
directory updates are either rare or could be reconciled by merging their results.
Wide area networks often exhibit low network quality due to the presence of low-bandwidth links or due to transient conditions such as congestion. Even if network connectivity is not lost completely, maintaining normal consistency in such ‘weak connectivity’ scenarios could drastically degrade application performance. For some applications,
e.g., a file service, better application performance might be achieved by treating weak
connectivity as a failure condition and switching to degraded consistency. The Coda file
system provides a weak connectivity mode in which a client does not maintain strong
consistency with a server while the quality of its network link to the server drops below
a quality threshold. An enterprise directory may need to maintain different levels of
consistency for clients within a region and for clients spread across campuses to achieve good performance. However, a chat service must continue to provide real-time synchronization
despite weak connectivity.
2.6 Discussion
In the previous section, we have examined the characteristics and consistency needs
of a broad mix of distributed applications. We have expressed their diverse consistency
needs along several dimensions and summarized them in Table 2.2. This exercise reveals
that although the surveyed applications differ widely in their consistency needs, those
needs can be expressed in terms of a small set of options.
In particular, we found that consistency requirements can be classified along five
major dimensions:
• concurrency control - the degree to which concurrent (read/write) accesses can be
tolerated,
• replica synchronization - the degree to which replica divergence can be tolerated,
including the types of interdependencies among updates that must be preserved
when synchronizing replicas,
• failure handling - how data access should be handled when some replicas are
unreachable or have poor connectivity,
• update visibility - the granularity at which the updates issued at a replica should be
made visible globally,
• view isolation - the duration for which the data accesses at a replica should be
isolated from remote updates.
In effect, there is a multidimensional space for expressing the consistency requirements
of applications. There are multiple reasonable options along each of these loosely orthogonal dimensions. Each option suits several applications, though the suitable combination of options varies from one application to another. Moreover, the five dimensions
correspond to the five tasks that any wide area consistency management system must
address. Hence expressing application requirements along these dimensions amounts to
specifying the appropriate consistency mechanisms that suit those needs. Based on these
observations, we developed a novel approach to structuring consistency management in
distributed applications, called configurable consistency. Configurable consistency lets
applications express consistency requirements individually along the above dimensions
on a per-access basis instead of only supporting a few packaged combinations of semantics. We believe that this approach gives applications more flexibility in balancing
consistency, availability, and performance by giving them direct control over the tradeoffs
possible.
2.7 Limitations of Existing Consistency Solutions
A number of consistency solutions have been proposed to meet the replication needs
of the application classes addressed by our survey. In this section, we discuss three
existing solutions that are representative of distinct approaches to flexible consistency
management.
Fluid replication [37] targets cached access to remote file systems. It provides three
discrete flavors of consistency semantics, namely, last-writer, optimistic, and pessimistic
semantics. The first two flavors allow concurrent conflicting updates with total ordering,
while the last option prevents conflicts by serializing updates. The ‘last-writer’ semantics
imposes a total order by arbitrarily choosing one among conflicting updates to overwrite
others, whereas ‘optimistic’ semantics detects and resolves the conflict using applicationspecific knowledge. Fluid replication leaves several important issues open, including
replica synchronization frequency and dependencies between updates.
Oceanstore [60] provides an Internet-scale persistent data store for use by arbitrary
distributed applications. It supports data replication, but lets applications define and
enforce consistency semantics by providing a semantic update model that treats updates
as procedures guarded by predicates. Each application must define its own set of predicates that serve as preconditions for applying updates. Oceanstore provides primary-copy
replication and propagates updates along an application-level multicast tree of secondary
replicas. In contrast to Oceanstore, our goal is to develop a consistency framework with
a small set of predefined predicates that can express the consistency needs of a broad
variety of applications.
The TACT toolkit [77] defines a continuous consistency model that can be tuned for
a variety of consistency needs. Its model provides three orthogonal metrics in which
to express the consistency requirements of applications: numerical error, order error,
and staleness. Numerical error bounds replica divergence in terms of the cumulative
weight of unseen remote writes at a replica, while staleness expresses the tolerable
replica divergence in terms of elapsed time. These metrics capture the ‘mod’ and ‘time’
criteria for timeliness described in Section 2.5.4. TACT’s order error restricts concurrency among data accesses by bounding the number of uncommitted updates that can
be held by a replica before it must serialize with remote updates. A value of zero
enforces serializability (corresponding to the ‘excl’ requirement), whereas a value of
infinity allows unrestricted parallelism (corresponding to ‘weak’ accesses). Though this
metric provides continuous control over concurrency, we believe, based on our study, that
the two extreme settings (‘excl’ and ‘weak’) are both intuitive and sufficient to capture
the needs of many applications. If application designers require finer-grain control over
conflicts among parallel updates, they can achieve it by controlling the frequency of
replica synchronization, which is more intuitive to reason about in application terms.
Finally, to naturally express atomicity, causality and isolation constraints, applications
need a session abstraction.
2.8 Summary
In this chapter, we studied the feasibility of providing reusable support for replicated
data management by surveying a variety of popular distributed applications belonging to
three distinct categories: (i) file access, (ii) collaborative groupware, and (iii) enterprise
data services. We argued that although they have replicable data and their replication
needs vary widely, they have significant common needs that can be satisfied by common
consistency mechanisms. To this end, we classified their consistency requirements along
several dimensions that correspond to various aspects of consistency management. We
argued that expressing applications’ consistency requirements along these dimensions
paves the way for a flexible consistency framework called configurable consistency.
Finally we examined several existing solutions for flexible consistency to see how well
they can support the surveyed application classes.
In the next chapter, we describe the configurable consistency framework and discuss
how it can be used in the design of distributed applications.
CHAPTER 3
CONFIGURABLE CONSISTENCY
FRAMEWORK
In the previous chapter, our survey of a variety of applications revealed several common characteristics that affect consistency. We observed that the consistency needs of
those applications can be expressed as combinations of a small number of requirements
along five dimensions: concurrency control, replica synchronization, failure handling,
update visibility and view isolation. Based on this insight, in this chapter we present a
novel consistency framework called configurable consistency that can express a broad
variety of consistency needs, including those of the applications we surveyed in Chapter
2.
In Section 3.1, we enumerate the specific objectives and requirements of our framework based on the conclusions of our application survey. In Section 3.2, we describe
the data access model for configurable consistency. In Section 3.3, we present the
consistency framework including the options it provides. In Section 3.4, we describe
how these options can be employed in combination to meet the consistency needs of the
four representative applications we described in Section 2.2. In Section 3.5, we discuss
the framework’s generality by showing how its interface allows expressing semantics
provided by a variety of existing consistency models. Finally, in Section 3.6, we discuss
situations in which our framework shows its limitations.
3.1 Framework Requirements
Our goal in developing the configurable consistency (CC) framework is to provide
efficient replication support for diverse applications in the three classes surveyed in
Section 2.2. Those application classes have a broader diversity of replication and scaling
needs than supported by any existing consistency solution. Hence to support them, the
configurable consistency framework must meet the following requirements:
Generality: It must be able to express a broad variety of consistency semantics, specifically those required by the applications listed in Table 2.2.
Practicality: It must be enforceable in a WAN replication environment via a small set
of application-independent mechanisms that facilitate reuse. These mechanisms
must satisfy the diverse consistency needs of the applications mentioned in Table
2.1 efficiently.
3.2 Data Access Interface
Since replication is relevant to applications designed for data-shipping, our consistency framework assumes a system organization based on shared data access as explained
in Section 2.1 and illustrated in Figure 2.2. In that organization, a distributed application
is structured as multiple distributed components, each of which holds a portion of the
global application state and caches other portions locally as needed. A consistency
management system (depicted as a cloud providing the shared state abstraction in Figure
2.2) mediates all accesses to application state/data at each site and enforces the configurable consistency interface. A data item that is the unit of sharing and consistency
maintenance could be an entire database or a file, an individual page, or an application
object embedded therein. Our framework suits systems that provide a session-based
interface for application components (e.g., A1..A5 in Figure 2.2, hereafter referred to as
clients of the consistency management system) to access their local replicas of shared
data. This data access interface allows clients to open a session for each consistency
unit, to read and write the unit’s data in the context of that session, and to close that
session when the access has completed. The definition of reads and writes assumed by
our framework is general, and includes both raw accesses to individual bytes of data (e.g.,
for file access) as well as arbitrary queries and updates in the form of application-specific
semantic operations (e.g., debit account by 100). One possible way for the system to
support semantic update operations in an application-independent manner is to allow
clients to register a plugin that interprets those operations at each site. When clients
supply a semantic update operation as opaque data as part of their writes (e.g., via a
special update interface), the system can invoke the plugin to apply the operation at
various replicas to keep them synchronized. Also, when consistency semantics allow
concurrent updates that can conflict, clients can supply routines via the plugin that detect
and resolve those conflicts in an application-specific manner.
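Such a plugin might expose callbacks along the following lines; the routine names and signatures are purely illustrative and are not the interface of any particular system.

/* Illustrative plugin callbacks registered by a client at each replica site. */
apply_update(replica_data, updt_pkt)
{
    op = unpack(updt_pkt);            // recover the semantic operation, e.g., 'debit 100'
    perform(op, replica_data);        // apply it to the local replica
}

resolve_conflict(replica_data, updt_pkt1, updt_pkt2)
{
    // Invoked when concurrent sessions have updated the same version of data.
    // The application may reexecute, merge, or reject one of the updates.
    ...
}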
Given this data access interface to the consistency management system, our framework extends the interface in several ways to provide consistency guarantees to clients
on a per-session basis. For instance, when opening a session at a replica site, each client
needs to specify its required consistency semantics for the data item accessed during
that session as a vector of configurable consistency options explained in Section 3.3. In
response, the system guarantees that the data accessed during the session will satisfy
those consistency requirements. If allowing a session operation (i.e., open, read, write,
update or close) to proceed would violate the semantics guaranteed to its session or to
other sessions anywhere, the semantics are said to conflict. In that case, the system
delays the session operation until the conflict ceases to exist (i.e., until the session with
conflicting semantics closes). As a result, if two sessions request conflicting consistency
semantics (e.g., both request exclusive write access to data), the system delays one of
them until the other session closes. If the system cannot ascertain whether a session’s
consistency semantics conflict with those of other sessions elsewhere (e.g., because some
replicas are unreachable), the system fails that session’s operations.
Figures 3.1 and 3.2 illustrate in pseudo-code how a replicated database can be queried
and updated using our framework. With the consistency options specified, the query
guarantees to return results that are not more than 5 seconds stale, and the update guarantees to operate on data not more than 1 second stale. A distributed application that adopts
our framework can impose a default per-data-item consistency semantics to be specified
globally for all of its sessions, or impose a per-replica semantics to be specified by default
for all of its local sessions, and still allow individual sessions to override those default
semantics if necessary. Thus, the framework’s per-session consistency enforcement gives
the application a significant degree of flexibility in managing data consistency.
querydb(dbfile)
{
    cc_options = [
        access-mode = RD,
        staleness = 5sec,
        update_ordering = TOTAL,
        failure_handling = OPTIMISTIC,
        isolation & visibility = PER_UPDATE,
    ];
    sid = open(dbfile, cc_options);
    ret = read(sid, dbfile, &buf);
    ...
    close(sid);
}
Figure 3.1. Pseudo-code for a query operation on a replicated database. The cc options
are explained in Section 3.3.
3.3 Configurable Consistency Options
In this section, we describe the options provided by the configurable consistency
framework. Table 1.1 in Section 1.2 lists those options, which can be classified along
five dimensions: concurrency control, replica synchronization, failure handling, update
visibility and view isolation. Clients express their consistency semantics for each session
as a vector of options, choosing one among the alternatives in each row of Table 1.1.
We describe configurable consistency options along each of the five dimensions and
refer to the application needs that motivated their inclusion in our framework’s option
set. When describing the options, we also illustrate their use in the context of the four
representative applications we mentioned in Section 2.2.
3.3.1 Concurrency Control
By concurrency, we mean the parallelism allowed among read and write sessions.
With our framework, this can be controlled by specifying an access mode as one of
the consistency options when opening a session. Our survey revealed that two types of
read and write accesses, namely, concurrent and exclusive accesses, adequately capture
the needs of a wide variety of applications. Based on this observation, we support
two distinct flavors of access mode for each of the reads and writes: concurrent (RD,
updatedb(dbfile, op, params)
{
    cc_options = [
        access-mode = WR,
        staleness = 1sec,
        update_ordering = TOTAL,
        failure_handling = OPTIMISTIC,
        isolation & visibility = SESSION,
    ];
    sid = open(dbfile, WR, cc_options);
    updt_pkt = pack(op, params, plugin);
    ret = update(sid, updt_pkt);
    ...
    close(sid);
}
Figure 3.2. Pseudo-code for an update operation on a replicated database.
WR) and exclusive (RDLK, WRLK) access modes. Concurrent modes allow arbitrary
interleaving of accesses across replicas. As a result, reads could return stale data and
replicas at multiple sites could be written independently, which may cause write conflicts.
Exclusive access mode sessions are globally serialized to enforce traditional concurrent-read-exclusive-write (CREW) semantics for strong consistency [48]. Table 3.1 shows
the concurrency matrix that indicates the concurrency possible among various modes.
When different sessions request both flavors simultaneously, RD mode sessions proceed
in parallel with all other sessions including exclusive sessions, while WR mode sessions
are serialized with exclusive sessions, i.e., they can occur before a RDLK/WRLK session
begins or are deferred until it ends. Finally, exclusive mode sessions can proceed in
parallel with RD sessions, but must block until ongoing WR sessions finish everywhere.
Regardless of the access modes that clients employ for their sessions, individual write
operations by those sessions arriving at a replica are always guaranteed to be applied
atomically and serially (one after another), and each read is guaranteed to return the
result of a previous completed write. Thus replica contents are never clobbered due to
the interleaved or partial execution of writes. Our model allows application-writers to
provide their own routines via a plugin to resolve conflicting updates (caused by parallel
writes to the same version of data via WR mode sessions). A simple resolution policy
Table 3.1. Concurrency matrix for Configurable Consistency.
’X’ indicates that sessions with those modes are not allowed to proceed in parallel as
they have ‘conflicting’ semantics.
                RD      RDLK    WR      WRLK
    RD          -       -       -       -
    RDLK        -       -       X       X
    WR          -       X       -       X
    WRLK        -       X       X       X
is ‘last-writer-wins’ [62], where the latest update by logical modification time overrides
all others. Other resolution options include reexecuting the update (if it is a semantic
procedure), merging results, or rejecting the update.
To ensure serializable transactions that operate on the latest data, our proxy-caching
enterprise application described in Section 2.3.2 would need to access its data within
exclusive mode sessions that provide strong consistency. However, our other focus
applications (music search directory, file access and chat service) can employ concurrent
modes to access their data, since their requirements are not as strict. In particular, the
chat service described in Section 2.3.4 must allow multiple users to concurrently append
to a shared transcript file. This can be supported by treating an append operation as a
semantic update and executing it in WR mode sessions for parallelism. The chat service
can resolve concurrent appends across replicas by providing a resolution routine that
applies all appends one after another in some order without losing data.
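For instance, the resolution routine might simply replay the conflicting appends in some deterministic order; the sketch below assumes a per-append timestamp and is not taken from an actual chat implementation.

resolve_appends(transcript, appends[])
{
    sort_by_timestamp(appends);           // choose one deterministic order
    for each a in appends
        append(transcript, a.message);    // reapply every append; no message is lost
}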
3.3.2 Replica Synchronization
Replica synchronization controls the extent of divergence among replica contents
by propagating updates in a timely fashion. Our framework lets applications express
their replica synchronization requirements in terms of timeliness requirements and the
ordering constraints to be imposed on updates during their propagation.
3.3.2.1 Timeliness Guarantee
Timeliness refers to how close data viewed by a particular client at any given time
must be to a version that includes all updates. Based on the divergence needs revealed by
our application survey, our framework allows clients to impose two flavors of divergence
bounds in combination for a session: a time bound and a modification bound. These flavors are applicable only to concurrent access modes (RD, WR) because exclusive mode
sessions are always guaranteed to view the latest contents. The time and modification
bounds are equivalent to the staleness and numeric error metrics provided by the TACT
consistency model [77]. Relaxed timeliness bounds weaken consistency by reducing
the frequency of replica synchronization for improved parallelism and performance.
Most applications that involve real-time interaction among distributed users, including
cooperative editing, chat, and online auctions need the most current guarantee, which
can be expressed as a time or modification bound of zero.
Time bound ensures that a replica is not stale by more than a given time interval from
a version that includes all updates. It provides a conceptually simple and intuitive way
to weaken consistency, and ensures an upper limit on the synchronization cost regardless
of update activity. However, the extent of replica divergence within a time interval varies
based on the number of updates that happen in that interval.
Modification bound specifies the maximum cumulative weight of unseen remote
writes tolerated at a replica, when each write can be assigned a numeric weight by the
application (the default being 1). Applications can use this to control the degree of replica
divergence in application-specific terms regardless of the frequency of updates.
In addition, our framework allows a client to explicitly request synchronization at
any time during a session (called manual timeliness) via a pull interface that obtains the
latest remote updates, and a push interface that immediately propagates local updates to
all replicas. This enables an application to hand-tune its replica synchronization strategy
based on domain-specific knowledge and relax timeliness of data for performance. As
explained in Section 2.5.4, the shared music file index benefits from manual timeliness
control. (Although we present a design for enforcing modification bounds in Section 4.7,
our prototype implementation of configurable consistency does not support this option.)
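A batched editing session on a mobile device, for example, might combine manual timeliness with an explicit push as sketched below; the option and call names mirror the style of Figures 3.1 and 3.2 and are illustrative only.

batched_edit(doc)
{
    sid = open(doc, WR, timeliness=MANUAL);
    update(sid, ...);                 // accumulate edits locally; the radio stays idle
    update(sid, ...);
    push(sid);                        // transmit all pending updates in one batch
    close(sid);
}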
3.3.2.2 Strength of Timeliness Guarantee
Our framework allows applications to indicate that their timeliness bounds are either
hard or soft (referred to as the ’strength’ option in Table 3.2). When a client imposes a
hard timeliness bound on the data viewed by a session, replicas must be synchronized
(i.e., via pulling or pushing updates) often enough, stalling session operations such as
reads or writes if necessary, to ensure that the data provided during the session stays
within the timeliness bound. For instance, enforcing a hard time bound of zero requires
a replica to pull remote updates before every access. Enforcing a hard mod bound of
zero at a replica requires that as soon as a remote session issues a write operation, the
write does not return successfully until it is propagated and applied to the zero-bound
replica. A session’s hard manual push operation cannot succeed until local updates are
propagated to and acknowledged by all replicas. In contrast, soft timeliness bounds
are enforced by replicas asynchronously pushing updates as they arrive in a best-effort
manner, i.e., without blocking for them to be applied elsewhere. Employing soft bounds
weakens consistency guarantees, but increases concurrency and availability. File sharing
and online auctions require hard timeliness bounds, while a soft bound is adequate for
the chat service and music search index.
3.3.3 Update Ordering Constraints
Update ordering constraints refer to the order in which an application requires multiple updates to be viewed by sessions. Our framework allows two types of ordering
constraints to be expressed: (i) constraints based on semantic dependencies among updates (referred to as ’semantic deps.’ in Table 3.2), and (ii) constraints that impose an
implicit order among concurrent independent updates (referred to as ’update ordering’
in the table). Our framework supports two types of semantic dependencies: causality
and atomicity, and three choices for ordering independent updates: none, total ordering
Table 3.2. Expressing the needs of representative applications using Configurable
Consistency.
The options in parentheses are implied by the WRLK option. ’Serial’ order means that
the write is linearizable with other ongoing writes.
Focus                          Concur.     Replica Synch.                        Update        View          Failure
Application                    control     Strength  Timeliness  Update Deps.    Visibility    Isolation     Handling

File service:
  Personal files,              RD,WR       soft      latest      none            session       session       optimistic
    read-mostly files
  Shared document              RD,WR       hard      latest      total           session       session       optimistic
  Lock file: read lock,        RD, WRLK    (hard)    (latest)    (serial)        (session)     (session)     pessimistic
    write lock
  Shared log                   RD,WR       soft      latest      none            per-update    per-update    optimistic
Music index                    RD,WR       soft      manual      none            per-update    per-update    optimistic
Chat service                   RD,WR       soft      latest      causal          per-update    per-update    optimistic
Auction: browse                RD          soft      time        none            per-update    per-update    pessimistic
Auction: bid, buy, sell        WRLK        (hard)    (latest)    (serial)        (session)     (session)     pessimistic
and serial ordering, because our application survey revealed that they capture the update
ordering constraints of a variety of applications.
Our framework guarantees that updates issued by sessions at the same site are always
applied in the order their sessions issued them, everywhere. Otherwise, updates are
considered to be independent unless clients explicitly specify their semantic dependencies. Applications can express their ordering constraints in our framework by specifying
the dependency and ordering types for each session when opening the session. Each
update inherits the ordering and dependency types of its issuing session. To help capture
dependencies among multiple objects/sessions, the framework also provides a depends()
interface. The depends() interface expresses the dependencies of an ongoing session
with other sessions or data item updates. The depends() interface and the framework’s
interface to express ordering constraints among updates are together powerful enough to
support higher-level constructs such as transactions to ease application programming, as
we explain in Section 3.5.4.
3.3.3.1 Ordering Independent Updates
Updates whose ordering is of type ‘none’ could be applied in different order at
different replicas. Updates tagged with ‘total’ order are eventually applied in the same
final order with respect to each other everywhere, but may need to be reordered (by being
undone and redone) before their position in the final order stabilizes. Updates tagged with
‘serial’ order are guaranteed never to be reordered, but applied only once at each replica,
in their final order. ‘Serial’ order can be used for updates that cannot be undone, such as
results sent to a user.
3.3.3.2 Semantic Dependencies
Given its interface to express dependencies, our framework provides the following
guarantees to sessions, which we illustrate with examples in Figures 3.3 and 3.4:
1. Updates belonging to sessions grouped as ‘atomic’ (via the depends() interface)
are applied atomically everywhere.
2. Updates by a ‘causal’ session are treated as causally dependent on previous causal
and totally ordered updates viewed by the session as well as its causally preceding
sessions (specified via the depends() interface); they are applied in an order that
preserves this causal order.
Figure 3.3 illustrates how update atomicity can be expressed using our framework,
by giving the pseudo-code for a file move operation in a distributed file system from
mv_file(srcdir, dstdir, fname)
{
    sid1 = open(srcdir, WR, ATOMIC);
    sid2 = open(dstdir, WR, TOTAL);
    depends(sid2, sid1, ATOMIC);
    update(sid1, u1 = 'del fname');
    update(sid2, u2 = 'add fname');
    close(sid1);
    close(sid2);
}
Figure 3.3. Atomic sessions example: moving file fname from one directory to another.
Updates u1,u2 are applied atomically, everywhere.
Client 1:
    sid1 = open(A, CAUSAL);
    update(sid1, u1);
    close(sid1);                         (u1 is propagated to clients 3 and 4)

Client 2:
    sid2 = open(A, TOTAL | CAUSAL);
    update(sid2, u2);
    close(sid2);

Client 3 (after receiving u1):
    sid3 = open(B, CAUSAL);
    depends(sid3, A, CAUSAL);
    update(sid3, u3);
    close(sid3);

Client 4 (after receiving u1):
    sid4 = open(A, TOTAL | CAUSAL);
    update(sid4, u4);
    close(sid4);
Figure 3.4. Causal sessions example. Updates u1 and u2 happen concurrently and
are independent. Clients 3 and 4 receive u1 before their sessions start. Hence u3 and
u4 causally depend on u1, but are independent of each other. Though u2 and u4 are
independent, they are tagged as totally ordered, and hence must be applied in the same
order everywhere. Data item B’s causal dependency at client 3 on prior update to A has
to be explicitly specified, while that of A at client 4 is implicitly inferred.
one directory to another. Updates u1 and u2 are applied atomically because they are
issued by sessions grouped as atomic. Figure 3.4 illustrates how causal dependencies
can be expressed, using a scenario where clients at four replicas operate via sessions
on two objects A and B. Client 1 performs update u1 and pushes it to clients 3 and 4.
Subsequently, clients 3 and 4 issue an update each. Independently, client 2 performs
update u2. In this case, u1 causally precedes u3 and u4 but not u2.
Although updates u2 and u4 are independently issued at clients 2 and 4, since they
are performed within totally ordered sessions, they will be applied in the same order
everywhere, while preserving other dependencies. However, no ordering or dependency
exists between updates u3 and u4. Hence they can be applied in different order at
different replicas.
Allowing an application to specify ordering categories for each of its sessions enables it to simultaneously enforce distinct ordering requirements for various types of
update operations. Consider the case of a replicated employee directory service. If two
administrators try to simultaneously assign the same username to different employees by
adding them to different directory replicas, only one of them should prevail. This can
be ensured by performing the name insertions within totally ordered sessions. Changing an employee’s name from ‘A’ to ‘B’ requires that ‘B’ is added and ‘A’ removed
from the directory atomically. The application employing our framework can achieve
this by inserting ‘B’ and removing ‘A’ in a session marked as atomic. Subsequently,
suppose a new employee is assigned the username ‘A’. For the new name insertion to
succeed everywhere, it must be causally preceded by the previous removal of ‘A’. Such
a dependency can be expressed in our framework by tagging the session that does the
removal as causal. The new insertion thus forms a causal dependency with the previous
removal and gets propagated in the correct order. Finally, updates to an employee’s
membership information in different business divisions require no ordering since they
are commutative operations.
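In the style of Figure 3.3, these directory updates might be issued as follows; the operation strings and the dir handle are illustrative.

rename_user(dir, old, new)
{
    // 'new' must appear and 'old' disappear atomically; tagging the session as
    // causal also makes any later reuse of 'old' causally follow this removal.
    sid = open(dir, WR, ATOMIC | CAUSAL);
    update(sid, 'add new');
    update(sid, 'del old');
    close(sid);
}

assign_username(dir, name)
{
    sid = open(dir, WR, TOTAL);       // concurrent assignments of the same name are
    update(sid, 'add name');          // applied in one global order; only one prevails
    close(sid);
}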
3.3.4 Failure Handling
When a consistency or concurrency control guarantee cannot be met due to node
failures or network partitions, applications may be optimistic, i.e., continue with the
available data in their partition and risk inconsistency, or be pessimistic, i.e., treat the
data access as having failed and deal with the failure at the application level. During
failure-free operation, optimistic and pessimistic failure handling behave identically.
Our framework provides these two options for handling failures, but offers more flexibility to applications by allowing different tradeoffs to be made for individual sessions
between consistency and performance based on network quality. A network link below
a quality threshold by some metric, e.g., a combination of RTT and bandwidth, is called
a weak link, and is treated as a failed link by optimistic sessions. Thus, an optimistic
session’s consistency guarantee holds only relative to replicas reachable by strong links
(i.e., with quality above a tolerated threshold), whereas a pessimistic session’s guarantee
holds relative to all replicas. By restricting consistency maintenance to an accessible
subset of replicas, the optimistic option trades off consistency for higher availability. For
example, an enterprise directory could enforce strong consistency only among replicas
within a well-connected region such as a campus. Clients can still obtain a globally
consistent view of data at the expense of incurring WAN synchronization and reduced
availability in case of failures.
The file service (except when locking semantics are required) and the music index
must be optimistic for increased availability. The online auction service must be pessimistic for correctness.
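In terms of the session options, this difference might be expressed as follows; the failure-handling keyword mirrors the option names used in Figures 3.1 and 3.2, and the routines shown are illustrative.

read_file(file)
{
    // Remain available in a partition; consistency is maintained only with
    // replicas reachable over strong links.
    sid = open(file, RD, failure_handling=OPTIMISTIC);
    ...
}

sell_item(item)
{
    // Fail the access rather than risk selling the same item twice.
    sid = open(item, WRLK, failure_handling=PESSIMISTIC);
    ...
}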
3.3.5 Visibility and Isolation
Our survey revealed that applications have three distinct notions of what constitutes
a stable update that preserves the validity of application state: session, per-update, and
manual updates. For update propagation to preserve the validity of application state
at various replicas, it must take update stability into account. Our framework enables
application-dependent notions of update stability to be expressed via two categories of
consistency options: update visibility and view isolation.
Visibility refers to the time at which a session’s updates are ready to be made visible
(i.e., propagated) to remote replicas. The time when they will be actually propagated
is determined by the timeliness requirements of remote replicas. Isolation refers to the
time during a session when its replica can incorporate incoming updates from remote
replicas. A replica must delay applying incoming remote updates until it can meet the
isolation requirements of all local sessions. These options only apply with respect to
remote sessions. Concurrent local sessions at a replica still see each other’s updates
immediately. Our framework provides three useful flavors of visibility and isolation:
Session: Session visibility specifies that updates are not to be made visible until the
session that issued them ends. They are not interleaved with other updates during
propagation, which prevents remote sessions from seeing the intermediate writes
of a session.
Session isolation specifies that a replica checks for remote updates only before
opening a session. Once a session is open, the system delays applying incoming
updates until the session ends. Session isolation ensures that a replica’s contents
remain unchanged for the duration of the session, which is important for document
sharing and atomic updates to shared data structures.
Per-update: Per-update visibility means that a session’s updates are immediately available to be propagated to remote replicas. Updates belonging to multiple sessions
can be interleaved during propagation.
Per-update isolation means that a replica applies incoming updates as soon as they
arrive, after serializing them with other ongoing local update operations. This
setting enables fine-grain replica synchronization in the presence of long-lived
sessions, such as for real-time interactive applications like chat, multiplayer games,
and event logging.
Manual: Manual visibility means that a session’s updates are made visible to remote sessions only when explicitly requested, or when the session ends. Manual isolation
means that remote updates are only incorporated when the local client explicitly
requests that they be applied. These settings can be used in conjunction with
manual update propagation to completely hand-tune replica synchronization.
Figure 3.5 shows how a WR session’s visibility setting affects the version of data supplied
to remote replicas. Replica 1 receives a pull request from replica 2 (in response to a
RD request) after an ongoing WR session makes three updates. If the WR session is
employing per-update visibility, all the three updates are made visible as soon as they are
applied locally. Hence replica 1 supplies version v3 to replica 2. Instead, if the WR
session employs manual visibility and issues a manual push after the first two updates,
the version supplied is v2 even though the latest local version is v3 at that time. Finally,
if the WR session employs session visibility, version v0 is supplied to replica 2.
Figure 3.6 illustrates how an ongoing RD session’s isolation setting determines when
incoming updates are applied locally. With per-update isolation, they are applied immediately. With manual isolation, both the pending updates are applied when the session
issues a manual pull request. With session isolation, they are deferred until the RD
session ends.
Mobile file access and online shopping require session visibility and isolation to
ensure data integrity. The shared music index requires per-update visibility and isolation
since each index insertion/deletion is independent and needs to be propagated as such.
[Figure 3.5 diagram: a timeline at replica 1 shows local updates producing versions v1, v2, and v3 within a WR session, with a manual push issued after v2; a timeline at replica 2 shows a pull issued on behalf of a RD session. Arrows mark the version supplied under per-update, manual, and session visibility.]
Figure 3.5. Update Visibility Options. Based on the visibility setting of the ongoing
WR session at replica 1, it responds to the pull request from replica 2 by supplying the
version (v0, v2 or v3) indicated by the curved arrows. When the WR session employs
manual visibility, local version v3 is not yet made visible to be supplied to replica 2.
[Figure 3.6 diagram: replica 1 pushes updates u1 and u2 from an ongoing WR session to replica 2, which has an open RD session and later issues a manual pull. Arrows mark when the updates are applied under per-update, manual, and session isolation.]
Figure 3.6. Isolation Options. When replica 2 receives updates u1 and u2 from replica
1, the RD session’s isolation setting determines whether the updates are applied (i)
immediately (solid curve), (ii) only when the session issues the next manual pull request
(dashed curve), or (iii) only after the session ends (dotted curve).
The chat service also requires per-update visibility and isolation to ensure real-time propagation of user messages to others during a long-term chat session. Other combinations
of visibility and isolation are also useful. For instance, employing session visibility for
update transactions and per-update isolation for long-running (data mining) queries to an
inventory database allows users to track the latest trends in live sales data without risking
inconsistent results.
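The inventory example just mentioned might combine the options as sketched below; the option spellings follow the style of Figures 3.1 and 3.2 and are illustrative.

sales_update(db)
{
    sid = open(db, WR, visibility=SESSION, isolation=SESSION);
    update(sid, ...);                  // intermediate writes never leak to other replicas
    close(sid);                        // the whole transaction becomes visible here
}

trend_query(db)
{
    sid = open(db, RD, isolation=PER_UPDATE);
    read(sid, db, &buf);               // sees each remote sales update as it arrives,
    ...                                // even during a long-running query session
    close(sid);
}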
3.4 Example Usage
The expressive power of configurable consistency comes from the orthogonality of
the options described in the previous sections. By orthogonality, we mean the ability
to enforce each of the options independently. Each individual option is required by at
least one of the applications surveyed, but applications widely differ in the particular
combination of options they require. Hence, the ability to independently select the
various options from different dimensions that our framework provides gives applications
greater flexibility than other consistency interfaces.
To support our claim, we discuss how the consistency needs of the four focus applications we mentioned in Section 2.2 can be described in our framework. Table 3.2 lists
the combination of options that express various application needs.
As mentioned in Section 2.3.1, the consistency needs of the file service vary due
to the different ways in which files are used by applications. Eventual consistency is
adequate for personal files that are rarely write-shared, and media and system files that
are rarely written [62]; close-to-open consistency is required for users to safely update
shared documents [28]; globally serializable writes [48] are required for reliable file
locking across replicas. Log file sharing requires the ability to propagate individual
appends as semantic updates, provided by our per-update visibility and isolation settings.
The shared music index requires the ability to manually control when to synchronize
with remote replicas based on the operation being performed on an index replica, as explained in Section 2.5.4. This can be expressed using the manual timeliness requirement
and enforced via explicit pull and push operations.
The chat service also requires eventual consistency and per-update visibility, but
in addition, it requires a user’s chat message written to the shared transcript file to be
propagated at once to other replicas, which requires setting the time bound to zero. At
the same time, imposing a hard timeliness bound forces chat clients to synchronously
wait for messages to propagate to other clients, which is unnecessary. Hence, the chat
service requires a soft time bound of zero.
Finally, the online auction service requires strong consistency for serializing bids and
finalizing purchases, which can be expressed by our framework’s WRLK mode. The
WRLK mode automatically implies a hard most-current timeliness guarantee on data,
which is shown in the table by parenthesizing those options. However, the number of
users browsing the auction items is typically much larger than those making purchase
bids. Hence browsing operations must not block bids. This requirement can be expressed by specifying
RD mode for casual browsing operations.
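Put together, the auction service might open its sessions roughly as follows; the staleness value is arbitrary and the routine names are illustrative.

browse_items(catalog)
{
    sid = open(catalog, RD, staleness=30sec);   // casual browsing tolerates stale listings
    read(sid, catalog, &buf);                   // and never blocks bid sessions
    close(sid);
}

place_bid(item, amount)
{
    sid = open(item, WRLK);           // globally serialized; implies hard, most-current data
    update(sid, 'bid amount');
    close(sid);
}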
3.5 Relationship to Other Consistency Models
Due to its flexibility, the configurable consistency framework can be used to express the semantics of a number of existing models. In this section, we explore the
consistency semantic space covered by our framework in relation to other consistency
solutions. First, we explore several consistency models designed for shared memory in
the context of multiprocessors, distributed shared memory systems, and object systems,
because many other models can be described as special cases of these models. Next, we
discuss a variety of consistency models defined for session-oriented data access in the
context of file systems and databases, including transactions. Finally, we discuss several
flexible consistency schemes that offer tradeoffs between consistency and performance,
in relation to our framework.
Tables 3.3 and 3.4 list some well-known categories of consistency semantics supported by these schemes (in the first column) and the sets of CC options that express
them.
3.5.1 Memory Consistency Models
Traditionally, consistency models are discussed in the context of read and write
operations on shared data. A variety of consistency models have been developed to
express the coherence requirements of shared memory in multiprocessors, DSM systems
and shared object systems [70]. Though our framework targets wide area applications,
we discuss shared memory consistency semantics here because they form the conceptual
basis for reasoning about consistency in many applications. Shared memory systems
assume a data access model consisting of individual reads and writes to underlying
memory, whereas shared object systems typically enforce consistency at the granularity
of object method invocations. Pure memory consistency models by themselves do not
enforce concurrency control, but define special operations by which applications can
Table 3.3. Expressing various consistency semantics using Configurable Consistency
options.
Blank cells indicate wildcards (i.e., all options are possible).
Consistency                    Concur. control          Replica Synch.                        Update       View         Failure
model                          Read         Write       Strength  Timeliness      Order       Visibility   Isolation    handling

Shared mem. [70]:
  Linearizability              RDLK         WRLK                                              eager        per-update   pess.
  Causal consistency           RD           WR          soft      latest          causal      eager        per-update   pess.
  FIFO/PRAM consistency        RD           WR          soft      latest          none        eager        per-update   pess.
  Weak consistency             RD           WR          hard      manual          none        manual       manual       pess.
                                                                  (push, pull)
  Eager release consistency    RD           WR          hard      manual (push)   none        manual       manual       pess.
  Lazy release consistency     RD           WR          hard      manual (pull)   none        manual       manual       pess.
Close-to-open [28]             RD           WR          hard      latest          total       session      session
Close-to-rd [62]               RD           WR          soft      latest          none        session      per-update
Wr-to-rd [74]                  RD           WR          soft      latest          none        eager        per-update
Wr-to-open [64]                RD           WR          hard      time            total       eager        session
wr-follows-rd,                                                                    causal      session      session
  monotonic wr [71]
rd-your-wr,                                                       pull on
  monotonic rd                                                    switch
TXN isolation [40]:
  Degree 0                     RD           short WRLK                            atomic
  Degree 1: rd uncommitted     RD           long WRLK                             atomic
  Degree 2: rd committed       short RDLK   long WRLK                             atomic
  Degree 3: serializable       long RDLK    long WRLK                             atomic
Snapshot Isolation [8, 52]     RD           WR                                    atomic      session      session
Read consistency [52]          RD           WR                                    atomic      session      per-update
Table 3.4. Expressing other flexible models using Configurable Consistency options.
Blank cells indicate wildcards (i.e., all options are possible).
Consistency                      Concur. control        Replica Synch.                         Update       View        Failure
model                            Read      Write        Strength  Timeliness  Order            Visibility   Isolation   handling

TACT [77]:
  Numerical error                                        hard      mod
  Staleness                                              hard      time
  Order error                    RD        WRLK, WR
Fluid Replication [16]:
  last-writer                              WR                                 total, time
  optimistic                     RD        WR                                 total, app
  pessimistic                    RDLK      WRLK
Lazy repl. [40]:
  Causal ops                                                                  causal
  Forced ops (total order)                                                    total
  Immediate ops (serialized)               WRLK
N-ignorant TXN [39]                                                mod N
Timed/delta consistency [72]                                       time
Cluster consistency [55]:
  Weak ops                                                                                                              opt.
  Strict ops                                                                                                            pess.
notify the memory system of synchronization events. We describe how the configurable
consistency framework allows expressing most of these models and can be used to
implement those special operations.
Sequential consistency requires that all replicas see the result of data accesses in
the same order, and that order is itself equivalent to some serial execution on a single
machine. Linearizability [4] is a stronger semantics than sequential consistency that
can be expressed in our framework via exclusive modes (RDLK, WRLK) for reads and
writes. Linearizability requires a serial order that preserves the time order of arrival of
operations at their originating replicas. This is the strongest consistency guarantee that
our framework can provide, and hereafter, we refer to it as strong consistency. The rest
of the consistency semantics that we list below can be expressed in our framework by
employing concurrent mode (RD, WR) sessions with per-update visibility and isolation
for accesses. Causal consistency [29] permits concurrent writes to be seen in a different
order at different replicas, but requires causally related ones to be seen in the same
order everywhere. It can be achieved with our framework by tagging all writes as
causally ordered. FIFO/PRAM consistency [43] only requires that all writes issued at
a replica be seen in the same order everywhere; it requires no ordering among concurrent
writes at different replicas. Our framework’s “unordered” updates option implicitly
preserves per-replica update ordering and hence guarantees FIFO consistency. All of
the above consistency models enforce consistency on individual reads and writes. The
configurable consistency framework’s per-update visibility and isolation options provide
similar semantics.
Another set of consistency models enforce consistency on groups of reads and writes
delimited by special ‘synchronization’ operations. Weak consistency [19] enforces sequential consistency on groups of operations delimited by a special ‘sync’ memory
operation. Thus, individual reads and writes need not incur synchronization. A ‘sync’
operation causes all pending updates to be propagated to all replicas before it completes,
and can be emulated in our framework by issuing a manual push operation followed
by a manual pull operation. (Eager) Release consistency [26] identifies two kinds of
special operations, namely acquire and release. The system must ensure that before an
acquire completes, all remote writes to locally cached data items have been pulled, and
before a release completes, all local writes have been pushed to any remote replicas.
Typically, the bulk of the update propagation work is done by the release. However, in
lazy release consistency [36], nothing is pushed on release, but the next acquire has to
pull pending writes to locally accessed data from all remote replicas before proceeding.
Both variants can be achieved with the CC framework. The acquire operation must
issue a manual pull request for updates to all data from remote replicas. The release
must issue a synchronous push of local updates to all remote replicas to ensure the eager
variant of release consistency, and must be a no-op in the case of the lazy variant. Entry
consistency [9] mandates association of synchronization variables with each shared data
item instead of on the entire shared memory. An acquire on a synchronization variable
only brings the associated data items up-to-date. Distributed object systems such as
Orca [6] and CRL[30] employ entry consistency to automatically synchronize replicas
of an object before and after invocation of each of its methods. Entry consistency can
be readily achieved with our framework using a similar implementation of acquire and
release operations as for release consistency, but only to push and pull the affected data
items.
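A minimal sketch of acquire and release built from the framework’s manual pull and push operations follows; the call names are illustrative.

acquire(item)
{
    sid = open(item, WR, timeliness=MANUAL);
    pull(sid);              // before acquire completes, fetch pending remote writes
    return sid;
}

release(sid)                // eager release consistency
{
    push(sid);              // synchronously push local writes to all remote replicas
    close(sid);
}

release_lazy(sid)           // lazy release consistency: nothing is pushed here;
{                           // the next acquire's pull fetches the pending writes
    close(sid);
}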
3.5.2 Session-oriented Consistency Models
A number of session-oriented models have been developed in the context of file
systems [28, 38, 62] and databases [18, 52]. These models define consistency across
multiple read and write operations, grouped into sessions.
Databases and file locking often require sessions to be serializable [8]. Serializability
is the database analogue of sequential consistency in memory systems, and guarantees
that any concurrent execution of access sessions produces results equivalent to their
serial execution on a single machine in some order. It is achieved in our framework
by employing exclusive access mode (RDLK and WRLK) sessions that provide strong
consistency or linearizability.
A variety of weaker consistency flavors have been developed that sacrifice serializability to improve parallelism and availability [63]. They can be classified based on
the hardness of their bound on replica divergence. Hard-bounded (also called bounded
inconsistency) flavors ensure an upper limit on replica divergence within a partition,
regardless of the network delays between replicas [63]. Soft-bounded (also called eventual consistency) flavors only guarantee eventual replica convergence qualified by a
timeliness bound, subject to network delays [18]. We can classify weaker consistency
flavors further based on the degree to which they provide isolation among concurrent
sessions, as close-to-open, close-to-rd, wr-to-rd and wr-to-open. They can be expressed
in our framework with various combinations of visibility and isolation options.
Close-to-open semantics, provided by AFS [28], guarantees that the data viewed by a
session include the latest writes by remote sessions that were closed before opening that
session, and not other writes by ongoing sessions. If an application’s updates preserve
data integrity only at session boundaries, this semantics ensures the latest valid (i.e.,
internally “consistent”) snapshot of data that does not change during a session, regardless
of ongoing update activity elsewhere. It can be expressed in our framework with session
visibility and isolation (i.e., exporting and importing updates only at session boundaries)
and a hard most-current timeliness guarantee.
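In the notation of Figures 3.1 and 3.2, a close-to-open file session might therefore be opened roughly as follows; the exact option spellings are illustrative.

edit_document(doc)
{
    cc_options = [
        access-mode = WR,
        staleness = 0, strength = HARD,     // most-current data at open time
        isolation & visibility = SESSION,   // import/export updates only at open/close
    ];
    sid = open(doc, cc_options);
    ... read and write the document ...
    close(sid);                             // updates become visible to later opens
}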
In contrast, “close-to-rd” semantics guarantees to incorporate remote session updates
before every read, not just at open() time. It can be obtained by employing per-update
isolation. The Pangaea file system [62] provides the soft-bounded (i.e., eventual) flavor
of close-to-rd semantics by propagating file updates eagerly as soon as a file’s update
session ends.
“Wr-to-rd” semantics ensures tighter convergence by guaranteeing that a read sees
the latest writes elsewhere, including those of ongoing sessions. Thus it might cause
a session’s reads to see the intermediate writes of remote sessions. WebFS [74] provides append-only consistency, which is a special case of wr-to-rd semantics for sharing
append-only logs. Several systems (such as Bayou [18]) that synchronize replicas via
operational updates provide wr-to-rd semantics.
Finally, “wr-to-open” semantics ensures that a replica incorporates the latest remote
updates (including those of ongoing sessions) at the time of opening a session. For
instance, this is required for the Unix command ‘tail -f logfile’ to correctly track remote
appends to a shared log file. The semantics of the NFS version 3 [64] file system can be
loosely categorized as a time-bounded flavor of wr-to-open semantics, because an NFS
client can delay pushing local writes to the server for up to 30 seconds, but checks the
server for file updates on every open(). Wr-to-open semantics can be achieved with our
framework’s eager write visibility and session isolation options.
3.5.3 Session Guarantees for Mobile Data Access
Some applications must present a view of data to users that is consistent with their
own previous actions, even if they read and write from multiple replica sites. For example, if a user updates her password at one password database replica and logs in at another
replica, she might be denied access if the password update does not get propagated in
time. Four types of causal consistency guarantees have been proposed by the designers
of Bayou [71] to avoid such problems and to provide a consistent view of data to a mobile
user:
read-your-writes: reads reflect previous writes by the same client,
monotonic reads: successive reads by a client return the result of same or newer updates,
writes-follow-reads: writes are propagated after the reads on which they depend, and
monotonic writes: writes are propagated after the writes that logically precede them.
Among them, writes-follow-reads and monotonic writes are actually forms of causality
relations among updates. As mentioned in Section 3.3.2, our framework provides causal
update ordering as an explicit option. Read-your-writes and monotonic read semantics
can be achieved with our framework as follows. Causal update ordering is enforced as
before. In addition, whenever a mobile client application that requires these semantics
switches to accessing a new replica, it performs a manual “pull updates” operation that
works as described in Section 3.3.2. The pull must bring the new replica into sync, making it at least as up-to-date as the previous replica. If the pull operation fails, then
the mobile client knows that the monotonic read or read-your-writes semantics cannot be
guaranteed (e.g., because the previous replica cannot be reached). If so, the mobile client
application can either fail the data access request, or retry later.
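The control flow just described is small enough to sketch. The sketch below is illustrative only: the names (replica_t, cc_pull_updates, cc_read) and the stub bodies are hypothetical stand-ins for a client library built on the framework, not an interface defined in this chapter, and causal update ordering is assumed to be enabled on the data as described above.

    /* Sketch: preserving read-your-writes / monotonic reads when a mobile
     * client switches replicas.  All names and stub bodies are illustrative. */
    #include <stdio.h>

    typedef struct { const char *site; } replica_t;

    /* Stub for the framework's manual "pull updates" operation (Section 3.3.2):
     * 0 on success, -1 if the previously accessed replica cannot be reached.  */
    static int cc_pull_updates(replica_t *r) { (void)r; return 0; }

    static int cc_read(replica_t *r, char *buf, int len)   /* stub read */
    { (void)r; (void)buf; (void)len; return 0; }

    int read_after_switch(replica_t *new_replica, char *buf, int len)
    {
        /* Bring the new replica at least as up to date as the previously
         * accessed one before serving any read.                            */
        if (cc_pull_updates(new_replica) != 0) {
            /* The guarantee cannot be given: fail the access or retry later. */
            fprintf(stderr, "%s: session guarantee unavailable\n",
                    new_replica->site);
            return -1;
        }
        return cc_read(new_replica, buf, len);
    }

    int main(void)
    {
        replica_t r = { "replica-B" };
        char buf[64];
        return read_after_switch(&r, buf, (int)sizeof buf) < 0;
    }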
3.5.4 Transactions
The transaction model of computing is a well known and convenient paradigm for
programming concurrent systems. A transaction is a series of data accesses and logical operations that represents an indivisible piece of work. Traditionally, a transaction
guarantees the ACID properties:
Atomicity: “all or nothing” property for updates,
Consistency: the guarantee that application data are always transformed from one valid
state to another,
Isolation: noninterference of concurrent transactions, e.g., by ensuring their serializability, and
Durability: committed updates are never lost, and their effect persists beyond the transaction’s lifetime.
The CC framework provides update atomicity as an explicit option. To ensure the
durability of a transaction’s updates, their final position in the global commit order must
be fixed to be the same at all replicas. This requires updates to be totally ordered.
Durability also implies that once a transaction makes its result visible to subsequent
transactions, they become causally dependent on it for commitment. The total order
imposed on updates must preserve such causal dependencies as well. The CC framework’s exclusive access modes (RDLK, WRLK) pessimistically serialize accesses to
eagerly commit updates and ensure durability, but incur global synchronization. Its
concurrent access modes (RD, WR) optimistically allow parallel transactions, in which
case, durability can be achieved by enforcing both total ordering and causality.
To maintain data consistency, the transaction model requires that transactions are
executed as if they are isolated from each other. In our data access model, each client’s
local cache serves as a private workspace for transaction processing. When a transaction
T is invoked at a machine, T’s entire execution is performed on that machine, and
remote data is accessed through the machine’s local cache. When the CC framework’s
session visibility is used, no partial result of the execution is visible at other replica
sites. Moreover, a site’s updates are propagated in the same order everywhere. These
features together guarantee that transactions at different sites are isolated from each other.
However, concurrent transactions running at the same site have to be isolated via locking
schemes provided by the local operating system.
Enforcing serializability of transactions requires frequent synchronization, which
reduces performance in a wide area replicated environment. To improve concurrency
among transactions in the traditional database context, the ANSI SQL standard [3] has
defined three isolation levels for queries in a locking-based implementation. The three
isolation levels provide increased concurrency by employing either no read locks, short
duration read locks (held for the duration of a single data access), or long duration read
locks (held for the duration of a transaction). These isolation levels for transactions can
be achieved with our framework by employing the RD mode when no locking is desired
for reads, and RDLK mode for read locks of various durations.
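The correspondence between the three locking-based isolation levels and the framework's read modes can be captured in a small table. The enum and field names below are hypothetical; the table only restates the mapping given above (no read locks map to RD, short- or long-duration read locks map to RDLK held per access or per transaction).

    /* Sketch: the three locking-based isolation levels as combinations of
     * access mode and read-lock duration.  Names are illustrative only.    */
    typedef enum { MODE_RD, MODE_RDLK } access_mode_t;
    typedef enum { LOCK_NONE, LOCK_PER_ACCESS, LOCK_PER_TXN } lock_span_t;

    typedef struct {
        const char   *locking_discipline;
        access_mode_t read_mode;   /* RD = no read locks, RDLK = read locks */
        lock_span_t   lock_span;   /* how long each read lock is held       */
    } isolation_map_t;

    static const isolation_map_t isolation_levels[] = {
        { "no read locks",             MODE_RD,   LOCK_NONE       },
        { "short-duration read locks", MODE_RDLK, LOCK_PER_ACCESS },
        { "long-duration read locks",  MODE_RDLK, LOCK_PER_TXN    },
    };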
With snapshot isolation [7], provided by Microsoft Exchange and the Oracle database
system [52], a transaction T always reads data from a snapshot of committed data valid as
of the (logical) time when T started. Updates of other transactions active after T started
are not visible to T. T is allowed to commit if no other concurrent committed transaction
has already written data that T intends to write; this is called the first-committer-wins
rule to prevent lost updates. Oracle’s Read Consistency [52] ensures that each action
(e.g., SQL statement) in a transaction T observes a committed snapshot of the database
state as it existed before the action started (in logical time). A later action of T observes
a snapshot that is at least as recent as that observed by an earlier action of T. Read
consistency can provide a different snapshot to every action, whereas snapshot isolation
provides the same snapshot to all actions in a transaction. Snapshot isolation and read
consistency are the transactional analogues of close-to-open and close-to-rd semantics
defined earlier, and hence can be expressed with our framework. By relaxing timeliness
bounds beyond ‘most current’, our framework can enhance concurrency further, at the
risk of increased aborts of update transactions. To reduce aborts, the relaxed semantics
can be employed only for query transactions.
There are numerous other relaxed consistency models designed for transactional
concurrency control in replicated databases. They are explored in depth in Adya’s PhD
thesis [1]. Examining the extent to which our framework can express them requires
further study, and is beyond the scope of our thesis.
3.5.5 Flexible Consistency Schemes
In this section, we discuss several important consistency schemes that provide tradeoffs between consistency and performance, in relation to our framework.
The continuous consistency model of the TACT toolkit [77] was specifically designed
to provide fine-grain control over replica divergence. Instead of supporting a session-oriented access model, TACT mediates all application-level accesses to a replicated
data store and classifies them as reads and writes. Their reads and writes are generic
and encapsulate the application logic comprising an entire query or update transaction
respectively. In that sense, propagating a write means reexecuting the update transaction
on another database replica.
TACT’s consistency model allows applications to dynamically define logical views of
data that they care about, called conits, and control the divergence of those views rather
than the replica contents themselves. Example conits include the number of available
seats in a flight’s reservation database, each discussion thread in a bulletin board, and the
number of entries in a distributed queue. Conits help exploit application-level parallelism
and avoid false-sharing. But maintaining dynamically defined conits among wide area
replicas incurs significant bookkeeping, and it is not clear whether it is justified by
the increased parallelism at scale. In contrast, the CC framework allows control over
divergence of replica contents.
TACT provides three metrics to continuously control divergence, namely, numerical
error, order error and staleness. Numerical error limits the total weight of unseen
writes affecting a conit that a replica can tolerate, and is analogous to our framework’s
modification bound. Staleness places a real-time bound on the delay of write propagation
among replicas, and is equivalent to our time bound. Order error limits the number of
outstanding tentative writes (i.e., those subject to reordering) that affect a conit at a replica.
For reasons mentioned in Section 2.7, the CC framework supports the two extreme values
for order error, namely, zero (with its exclusive access modes) and infinity (with its
concurrent access modes).
The Lazy replication of Ladin et al. [40] supports three types of ordering of various
operations on data. Causal operations are causally ordered with respect to other causal
operations, forced operations are performed at all replicas in the same order relative to
one another, and immediate operations are serializable, i.e., performed at all replicas
in the same order relative to all operations. The ordering type for each operation on
a replica must be specified by the application-writer at design time. For example, the
designer of a replicated mail service would indicate that send mail and read mail are
causal, add user and del user are forced, and del at once is immediate. Causal and forced
operation ordering can be achieved in our framework by performing those operations in
concurrent mode sessions and tagging updates involving those operations as causally or
totally ordered, respectively. The effect of immediate operations can be achieved by
performing them inside a WRLK session, which ensures serializability with respect to
all other operations.
In an N-ignorant system [39] developed in the context of databases, an N-ignorant
transaction may be ignorant of the results of at most N other transactions. To emulate
this behavior in our framework, such a transaction is executed as an update operation
within a session that sets a modification (mod) bound of N on its local replica. Since the
execution of a transaction at a replica increments its divergence by 1, the modification
bound ensures that the N-ignorant transaction misses the updates of at most N remote
transactions.
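A minimal sketch of this emulation follows. The option structure and cc_open_session() call are hypothetical placeholders for a session interface; the only substantive point is that the session's modification bound is set to N.

    /* Sketch: emulating an N-ignorant transaction with a modification bound.
     * The structure and cc_open_session() are hypothetical; the only
     * substantive point is mod_bound = N.                                   */
    #include <limits.h>

    typedef struct {
        int mod_bound;      /* max unseen remote updates tolerated           */
        int time_bound_ms;  /* staleness bound in milliseconds (unused here) */
    } cc_options_t;

    static int cc_open_session(const char *object, const cc_options_t *opt)
    { (void)object; (void)opt; return 1; }   /* stub returning a session id  */

    /* Each remote transaction adds 1 to local divergence, so a bound of N
     * lets this transaction miss at most N of them.                         */
    int open_n_ignorant_txn(const char *object, int n)
    {
        cc_options_t opt = { n, INT_MAX };
        return cc_open_session(object, &opt);
    }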
Timed/delta consistency models (e.g., [72]) require the effect of a write to be observed everywhere within a specified time delay. This can be readily expressed in our
framework using a time bound.
In the cluster consistency model [55] proposed for mobile environments, replicas
are partitioned into clusters, where nodes are strongly connected within a cluster but
weakly/intermittently connected across clusters. Consistency constraints within a cluster
must always be preserved, while inter-cluster consistency may be violated subject to
bounds (called m-consistency). To implement this, two kinds of operations have been
defined: weak operations check consistency only within their cluster, whereas strict
operations ensure consistency across all clusters. The Coda file system’s consistency
scheme makes a similar distinction based on weak versus strong connectivity between client and server, but provides no control over replica divergence in weak connectivity mode.
A Coda client skips consistency checks with a server on reads if the quality of the link
to its server is “poor” (i.e., below a quality threshold), and pushes writes to the server
lazily (called trickle re-integration). The CC framework’s “optimistic” and “pessimistic”
failure handling options for sessions provide similar semantics to that of the weak and
strict operations of the cluster consistency model, where nodes reachable via network
links of higher quality than a threshold are considered to belong to a cluster.
3.6 Discussion
In the previous section, we examined the expressiveness and generality of the configurable consistency framework. We discussed how it can support a wide variety of
existing consistency models. In this section, we discuss several issues that affect the
framework’s adoption for distributed application design.
3.6.1 Ease of Use
At first glance, providing a large number of options (as our framework does) rather
than a small set of hardwired protocols might appear to impose an extra design burden on
application programmers as follows. Programmers need to determine how the selection
of a particular option along one dimension (e.g., optimistic failure handling) affects
the semantics provided by the options chosen along other dimensions (e.g., exclusive
write mode, i.e., WRLK). However, thanks to the orthogonality and composability of our
framework’s options, their semantics are roughly additive; each option only restricts the
applicability of the semantics of other options and does not alter them in unpredictable
ways. For example, employing optimistic failure handling and WRLK mode together
for a data access session guarantees exclusive write access for that session only within
the replica’s partition (i.e., among replicas well-connected to that replica). Thus by
adopting our framework, programmers are not faced with a combinatorial increase in
the complexity of semantics to understand.
To ease the adoption of our framework for application design, we anticipate that middleware systems that adopt the configurable consistency framework will bundle popular
combinations of options as defaults (e.g., ‘Unix file semantics’, ‘CODA semantics’, or
‘best effort streaming’) for object access, while allowing individual application components to refine their consistency semantics when required. In those circumstances,
programmers can customize individual options along some dimensions while retaining
the other options from the default set.
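As a rough illustration of this usage style, the sketch below starts from a named default bundle and overrides a single dimension. The option fields, the bundle-lookup helper, and its stub body are assumptions made for illustration; they are not the framework's actual defaults or API.

    /* Sketch: customizing one dimension of a middleware-supplied default
     * bundle.  Fields, values, and cc_default_bundle() are illustrative.    */
    typedef enum { VIS_PER_UPDATE, VIS_SESSION } visibility_t;
    typedef enum { ISO_PER_UPDATE, ISO_SESSION } isolation_t;
    typedef enum { FAIL_OPTIMISTIC, FAIL_PESSIMISTIC } failure_t;

    typedef struct {
        visibility_t visibility;
        isolation_t  isolation;
        failure_t    failure_handling;
        int          time_bound_ms;   /* -1 could denote "most current" */
    } cc_options_t;

    /* Stub: a real middleware would return the options making up the named
     * bundle, e.g. "unix", "coda", or "best-effort-streaming".              */
    static cc_options_t cc_default_bundle(const char *name)
    {
        (void)name;
        cc_options_t d = {0};
        return d;
    }

    /* Keep a bundle's defaults but force pessimistic failure handling. */
    cc_options_t coda_like_but_pessimistic(void)
    {
        cc_options_t opt = cc_default_bundle("coda");
        opt.failure_handling = FAIL_PESSIMISTIC;
        return opt;
    }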
3.6.2 Orthogonality
As Tables 3.3 and 3.4 show, the expressive power of the configurable consistency
framework is due to its orthogonal decomposition of consistency mechanisms. Although
the consistency options provided by our framework along various dimensions are largely
orthogonal, there are a few exceptions. Certain consistency options implicitly entail
other options. For example, exclusive access modes imply session-level isolation and a
most-current timeliness guarantee. Also, certain combinations of options do not make
sense in practical application design. For example, exclusive read mode and concurrent
write mode are unlikely to be useful together in an application.
3.6.3 Handling Conflicting Consistency Semantics
Although the composability of options allows a large number of possible combinations, and applications can employ different combinations on the same data simultaneously on a per-session basis, the semantics provided by certain combinations are inherently conflicting, i.e., they cannot be enforced at the same time. Although the configurable consistency implementation presented in this thesis correctly serializes sessions
that employ conflicting semantics, the CC interface enables some conflicting semantics to
be expressed that might not make sense in practical application design, and can also lead
to hard-to-detect bugs in distributed applications. For instance, consider two sessions
that operate on the same data. Session 1 is a write session that wants all readers to see its
writes immediately. Session 2 is a read session that does not want to see any writes until
it explicitly asks for them (e.g., via manual pull), or until the session ends. When session
1 issues a write while session 2 is in progress, the write must be pushed synchronously to all replicas and block until each of them applies and acknowledges it locally. Therefore, session 1's write operation has to block until session 2 accepts the write, which
could mean an indefinite and unproductive wait. Certain other conflicting semantics are
perfectly normal, such as a RDLK and a WRLK session. Ideally, a configurable consistency implementation should be able to detect and reject unproductive combinations of
semantics. Further work is needed to enumerate such combinations of semantics possible
with configurable consistency.
3.7 Limitations of the Framework
Although we have designed our framework to express the consistency semantics
needed by a wide variety of distributed services, we have chosen to leave out semantics that are difficult to support in a scalable manner across the wide area or that require consistency protocols with application-specific knowledge. As a consequence, our
framework has two specific limitations that may restrict application-level parallelism in
some scenarios.
3.7.1 Conflict Matrices for Abstract Data Types
Our framework cannot support application-specific conflict matrices that specify the
parallelism possible among application-level operations [5]. Instead, applications must
map their operations into the read and write modes provided by our framework, which
may restrict parallelism. For instance, a shared editing application cannot specify that
structural changes to a document (e.g., adding or removing sections) can be safely allowed to proceed in parallel with changes to individual sections, although their results
can be safely merged at the application level. Enforcing such application-level concurrency constraints requires the consistency protocol to track the application operations in progress at each replica site, which is hard to do efficiently in an application-independent manner.
3.7.2 Application-defined Logical Views on Data
Our framework allows replica divergence constraints to be expressed on data but not
on application-defined views of data, unlike TACT’s conit concept [77]. For instance,
a user sharing a replicated bulletin board might be more interested in tracking updates
to specific threads of discussion, or to messages posted by her friends, than to the rest of the board. To support such requirements, views should be defined on the bulletin board
dynamically (e.g., all messages with subject ‘x’) and updates should be tracked by the
effect they have on those views, instead of on the whole board. Our framework cannot
express such views precisely. It is difficult to manage such dynamic views efficiently
across a large number of replicas, because doing so requires frequent communication
among replicas, which may offset the parallelism gained by replication. An alternative
solution is to split the bulletin board into multiple consistency units (threads) and manage
consistency at a finer grain when this is required. Thus, our framework provides alternate
ways to express both of these requirements that we believe are likely to be more efficient
due to their simplicity and reduced bookkeeping.
3.8 Summary
In this chapter, we described a novel consistency framework called configurable
consistency that lets a wide variety of applications compose an appropriate consistency
solution for their data by allowing them to choose consistency options based on their
sharing needs. To motivate our design of the framework, we stated its objectives and requirements. We presented the framework including its options along five dimensions. We
illustrated how the diverse consistency needs of the four focus applications we described
in the previous chapter can be expressed using the framework. Next, we discussed the
generality of the framework by illustrating how it can support a wide variety of existing
consistency models. Finally, we discussed a few situations in which our framework
shows its limitations.
In subsequent chapters, we focus on the practicality of our framework. We start
the next chapter by presenting the design of a wide area replication middleware that
implements the CC framework.
CHAPTER 4
IMPLEMENTING CONFIGURABLE
CONSISTENCY
To pave the way for building reusable replication middleware for distributed systems, we have developed a configurable consistency framework that supports the diverse
consistency requirements of three broad classes of distributed applications. In previous
chapters, we described our framework and discussed its expressive power in the context
of diverse applications. However, to be deployed in real wide area applications, our
framework must be practical to implement and use. In this chapter, we show how
configurable consistency can be implemented in a peer replication environment spanning variable-quality networks. We do this by presenting the design of SWARM,1 an
aggressively replicated wide area data store that implements configurable consistency.
4.1 Design Goals
As explained in Section 2.4, the three application categories that we surveyed vary
widely with respect to their natural unit of sharing, their typical degree of replication,
and their consistency and availability needs.
For configurable consistency management to benefit a broad variety of distributed
applications in these categories, its implementation must be application-independent,
support aggressive data caching, operate well over diverse networks, and be resilient to
node and network failures. Application-independence allows mechanisms to be reused
across multiple applications. Aggressive caching enables applications to efficiently leverage available storage, processing, and network resources and incrementally scale with
offered load. To efficiently support aggressive caching over networks of diverse capabilities, our consistency implementation must employ a scalable mechanism to manage
a large number of replicas, and adapt its communication based on network quality to
1 SWARM is an acronym for Scalable Wide Area Replication Middleware.
conserve network bandwidth across slow links. In a wide area system with a large
number of nodes, both transient and permanent failures of some of the nodes and/or
network links are inevitable, so our system must detect failures in a timely manner and
gracefully recover from them.
Specifically, the design of our configurable consistency system must meet the following three goals:
Scalability: Our system must scale along several dimensions: the number of shared
objects, their granularity of sharing (e.g., whole file or individual blocks), the
number of replicas, the number of participating nodes in the system and their
connectivity. The number of shared objects that our target applications use varies from a few hundred (in the case of personal files) to hundreds of thousands (in
the case of widespread music sharing). Their typical replication factors also range
from a few dozen replicas in some cases to a few thousand in others. The
number of nodes in a typical configuration also varies accordingly. Our consistency
system must gracefully scale to these sizes.
Network Economy: Most of our targeted applications need to operate in a very wide
area network environment consisting of clusters of LANs often spanning continents, over links whose bandwidth ranges from a few tens of Kbps to hundreds
of Mbps. Many of them are likely to be wired links that are more reliable and
stay mostly connected, while others are lossy wireless links. To provide a responsive service, our system must automatically adapt to this diversity by efficiently
utilizing the available network resources.
Failure Resilience: Our target applications operate in a variety of environments that can
be loosely categorized as server/enterprise, workstation, and mobile environments,
with distinct failure characteristics. Server-class environments consist of powerful
machines with reliable high-speed network connectivity. Workstations are static
commodity PCs of moderate capacity and connectivity that can be turned off by
their owners anytime. Mobile devices such as laptops and PDAs are the least
powerful and intermittently connected at different network locations. The typical
rate at which nodes join and leave the system, called the node churn rate [59], also varies widely based on the average time spent by a node in the system, called its mean session time. Session times range from days or months (in the case of servers), causing low churn, to as low as a few minutes (in the case of mobile devices), causing high churn. Our system must gracefully handle failures and node churn in all three types of environments, while preserving scalability and network economy.
4.1.1 Scope
The focus of our work is to show how flexible consistency management can be
provided in a distributed system that aggressively replicates data over the wide area.
However, there are several other important issues that overlap with the design of consistency management, and must be addressed by a full-fledged data management system to
facilitate wide area data sharing in the real world. Some of these issues are security and authentication, long-term archival for durability of data, disaster recovery, and management
of storage and quotas at individual nodes. We do not address these issues in this thesis,
but leave them for future work.
In the rest of this chapter, we describe our design of a wide area file store called
Swarm, our research vehicle for evaluating the feasibility of implementing configurable
consistency in a reusable replication middleware. In Section 4.2, we outline Swarm’s
basic organization and interface. In Section 4.3, we outline Swarm’s expected usage for
building distributed applications. In Section 4.5, we describe the design of its naming
and location tracking schemes. In Section 4.6, we describe how Swarm supports large-scale network-aware replication. In Section 4.7, we describe our design of configurable consistency in Swarm, which is the focus of this dissertation. We discuss the failure
resilience characteristics of our design in Section 4.9. We provide details of our prototype
Swarm implementation in Section 4.10. Finally, we discuss several issues that our current
design does not address, and outline ways to address them.
4.2 Swarm Overview
Swarm is a distributed file store organized as a collection of peer servers (called
Swarm servers) that cooperate to provide coherent wide area file access at variable
granularity. Swarm supports the data-shipping paradigm of computing described in
Section 2.1 by letting applications store their shared state in Swarm files and operate
on it via Swarm servers deployed nearby. Files (also called regions) in Swarm are
persistent variable-length flat byte arrays named by globally unique 128-bit numbers
called SWIDs. Swarm exports a file system-like session-oriented interface that supports
traditional read/write operations on file blocks as well as operational updates on files
(explained below). A file block (also called a page) is the smallest unit of data sharing
and consistency in Swarm. Its size (which must be a power of two bytes and is 4 kilobytes
by default) can be set for each file when creating it. This gives applications considerable
flexibility in balancing consistency overhead and parallelism. A special metadata block
holds the file’s attributes such as its size, consistency options and other application-specific information, and is similar to an inode in Unix file systems. Swarm servers
locate files by their SWIDs (as described in Section 4.5), cache them as a side-effect
of local access (as described in Section 4.6), and maintain consistency of cached copies
according to the per-file consistency attributes (as described in Section 4.7). Each Swarm
server utilizes its configured persistent local store for permanent copies of some files and
the rest of the available space to cache remotely stored files. Swarm servers discover each
other as a side-effect of locating files by their SWIDs. Each Swarm server monitors its
connection quality (latency, bandwidth, connectivity) to other Swarm servers with which
it communicated in the recent past, and uses this information in forming an efficient
hierarchical overlay network of replicas dynamically for each file (as described in Section 4.6).
Swarm allows files to be updated in place in two ways: (i) by directly overwriting
previous contents on a per file block basis (called absolute or physical updates), or
(ii) by supplying a semantic update procedure that Swarm applies to all replicas via an
application-supplied plugin at each Swarm server (called operational updates). Absolute
updates are useful for traditional file sharing as well as for implementing persistent page-based distributed data structures (as we did for the proxy-caching application described
in Section 5.3). Operational updates can greatly reduce the amount of data transferred
to maintain consistency and increase parallelism when sharing structured objects such
as databases and file system directories. Swarm maintains consistency at the smallest
granularity at which applications access a file at various replicas. Thus, it can maintain
consistency at the granularity of individual file blocks (when using physical updates) or
the entire file (when using either physical or operational updates). Swarm clients can
control consistency per-file (imposed by default on all replicas), per-replica (imposed by
default on its local sessions), or per-session (affecting only one session). This gives each
client the ability to tune consistency to its individual resource budget.
In the rest of this section, we describe the interface Swarm exports to distributed
applications to facilitate shared data programming for transparent caching.
4.2.1 Swarm Interface
Swarm exports the file operations listed in Table 4.1 to its client applications via
a Swarm client library that is linked to each application process. The Swarm client
library locates a nearby Swarm server via an external lookup service similar to DNS,
and communicates with it using IPC. sw alloc() creates a new file with consistency attributes specified by the ‘cc options’ and returns its unique Swarm ID (SWID).
sw open() starts an access session at a Swarm server for a portion (perhaps all) of the
specified file (called a segment), returning a session id (sid). The ‘cc options’ argument denotes a vector of desired configurable consistency options for this session along the dimensions listed in Table 1.1, overriding the file’s consistency attributes. Depending on the file’s
consistency attributes and the ‘cc options’ that include the requested access mode, the
Swarm server may need to obtain a copy of the data and/or bring the local copy up
to date. A session is Swarm’s point of concurrency control, isolation and consistency
management. sw read() and sw write() transfer file contents to/from a user buffer.
sw close() terminates the session. sw getattr() and sw setattr() let the
client inspect or modify a file’s attributes including its consistency settings. Finally,
sw free() evicts the local Swarm replica if the ‘global’ flag is false. Otherwise it
Table 4.1. Swarm API.

Operation                                         Description
SWID ← sw alloc(page size, cc options)            create a new file
sw free(SWID, global)                             destroy local replica/all replicas of file
sw getattr/setattr(SWID, attr)                    set/get file attributes (size, consistency etc.)
sid ← sw open(SWID, offset, size, cc options)     open file/segment with specified consistency
sw read/write(sid, offset, size, buf)             read/write file segment
sw update(sid, plugin id, update proc)            apply update proc to file via plugin
sw snoop(sid, update filter, callback func)       notify remote updates via callback
sw depends(sid, SWID, offset, size, dep types)    establish interfile/segment dependencies
sw close(sid)                                     close file
deletes the file itself and all its replicas from Swarm, freeing up its resources (including
SWID) for reuse.
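A minimal client-side sketch of this interface follows. The underscored identifiers (sw_alloc, sw_open, sw_write, sw_close), the argument types, and the cc_options encoding are assumptions made to render Table 4.1 as compilable C; the Swarm calls are stubbed so the sketch stands alone rather than linking against the real client library.

    /* Sketch of a client creating a Swarm file and writing it through a
     * session, following Table 4.1.  Names, argument types, and the stub
     * bodies are assumptions; a real client would use the Swarm client
     * library instead of these stubs.                                       */
    #include <stdint.h>

    typedef uint64_t swid_t;   /* Swarm SWIDs are 128-bit; narrowed for the sketch */
    typedef struct { int access_mode; int time_bound_ms; } cc_options_t;

    static swid_t sw_alloc(int page_size, const cc_options_t *cc)
    { (void)page_size; (void)cc; return 1; }
    static int sw_open(swid_t id, long off, long len, const cc_options_t *cc)
    { (void)id; (void)off; (void)len; (void)cc; return 1; }
    static int sw_write(int sid, long off, long len, const void *buf)
    { (void)sid; (void)off; (void)len; (void)buf; return 0; }
    static int sw_close(int sid) { (void)sid; return 0; }

    int main(void)
    {
        cc_options_t cc = { 0, 0 };              /* placeholder option vector   */
        swid_t id  = sw_alloc(4096, &cc);        /* new file with 4 KB pages    */
        int    sid = sw_open(id, 0, 4096, &cc);  /* session on the first block  */
        const char msg[] = "hello, swarm";
        sw_write(sid, 0, (long)sizeof msg, msg); /* write inside the session    */
        return sw_close(sid);                    /* end of session: visibility
                                                    now follows the cc options  */
    }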
sw update() can be used to provide an operational update procedure (as an opaque
update packet) to be invoked on data at the local Swarm server. Example operational
updates include “add(name) to directory”, and “debit(1000) from account”. To use this
facility, the application must implement a plugin library to be linked to the local Swarm
server at each node running an application instance. When a Swarm server receives an
update packet, it invokes the specified plugin to interpret and apply its procedure on the
local copy and returns the result in response to the sw update() call. Swarm servers
propagate the packet instead of the new file contents to synchronize replicas. We describe
Swarm’s plugin interface in Section 4.2.2.
sw snoop() can be used by clients to register to receive notification of updates
made to a file replica via a registered callback function, i.e., to “snoop” on updates
arriving for the local replica. It can be used by application components to monitor activity
at other components without costly polling. For example, a chat client can snoop on
writes to the transcript file by other clients and supply a callback that updates the user’s
local screen when a remote write is received. sw depends() can be used to express
semantic dependencies (i.e., causality and atomicity) among updates to multiple Swarm
objects.
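The chat-transcript example can be sketched as follows. The callback signature, the NULL update filter, and the stub body of the snoop call are assumptions layered on Table 4.1; the point is only that remote writes arrive as callbacks instead of being polled for.

    /* Sketch of the chat-transcript example: register a callback so remote
     * writes trigger a screen refresh instead of polling.  Names and stubs
     * are assumptions layered on Table 4.1.                                 */
    #include <stdio.h>

    typedef void (*snoop_cb_t)(int sid, long offset, long size);

    /* Stub standing in for the Swarm client library's snoop call. */
    static int sw_snoop(int sid, const char *update_filter, snoop_cb_t cb)
    { (void)sid; (void)update_filter; (void)cb; return 0; }

    /* Redraw only the newly written portion of the transcript. */
    static void on_remote_write(int sid, long offset, long size)
    {
        printf("session %d: refresh transcript bytes [%ld, %ld)\n",
               sid, offset, offset + size);
    }

    int watch_transcript(int transcript_sid)
    {
        /* NULL filter: be notified of every remote update to this replica. */
        return sw_snoop(transcript_sid, NULL, on_remote_write);
    }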
Finally, an application can access Swarm files either through the Swarm client interface described above, or natively via a local file system mount point. To facilitate the
latter, the Swarm server also exports a convenient native file system interface to Swarm
files via the local operating system’s VFS layer. It does so by providing a wrapper that
talks to a file system module in the kernel such as CodaFS [38]. The wrapper provides a
hierarchical file name space by implementing directories within Swarm files. This allows
Swarm files to be accessed via the operating system’s native file interface.
We chose to export a file as Swarm’s basic data abstraction as opposed to fixed-length
blocks of storage, because of the needs of our target application classes. Choosing a file
abstraction simplifies distributed storage management at the middleware level by avoiding the distributed garbage collection issues that arise in block-level storage allocation.
The file abstraction also provides applications with an independently growable entity and
gives them the autonomy to manage storage allocation to suit their needs. The
abstraction directly matches the needs of several target applications (such as file systems
and databases). Other applications such as distributed data structures can submanage
pages within a file as a persistent page heap by exploiting Swarm’s page-granularity
sharing.
4.2.2 Application Plugin Interface
A Swarm application plugin must implement the operations listed in Table 4.2. Swarm
invokes the ‘apply’ operation to locally apply incoming operational updates. When
Swarm receives a remote update but finds that the local copy has been updated independently,
it invokes the ‘merge’ operation instead to let the plugin resolve the conflict. We discuss
conflict resolution in Section 4.8. A plugin can directly operate on the local copy of a
Swarm file via the local file system, bypassing Swarm. It can also employ local memory
buffering techniques to speed up its operations and lazily reflect them to the underlying
Swarm file. Swarm invokes the ‘sync’ operation to let the plugin flush those changes
to the Swarm file before it transfers raw file contents to other replicas. The ‘cleanup’
Table 4.2. Swarm interface to application plugins.
Each plugin must implement these operations in a library linked into local Swarm server.

Operation                                   Description
init()                                      Initialize plugin state
ret ← apply(file, logentry, update)         apply operational update locally, return result
success ← merge(file, neighbor, update)     merge update into local copy (resolve conflicts)
sync(file)                                  sync changes into local Swarm file copy (before full file transfer)
cleanup(file)                               cleanup file state before eviction
uninit()                                    Destroy plugin state
operation allows the plugin to free up its state for a particular file before its eviction from
the local cache.
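The plugin contract in Table 4.2 amounts to a small vector of operations. The sketch below fills it in for a trivial counter object whose operational updates mean "add N"; the vtable layout and the counter example are illustrative assumptions, not Swarm's actual linkage convention.

    /* Sketch of an application plugin implementing the Table 4.2 operations
     * for a trivial counter object whose operational updates mean "add N".
     * The vtable layout is assumed, not Swarm's actual linkage convention.  */
    typedef struct { const char *path; long counter; } swarm_file_t;

    typedef struct {
        void (*init)(void);
        int  (*apply)(swarm_file_t *f, void *logentry, const void *update);
        int  (*merge)(swarm_file_t *f, const void *neighbor, const void *update);
        void (*sync)(swarm_file_t *f);
        void (*cleanup)(swarm_file_t *f);
        void (*uninit)(void);
    } swarm_plugin_ops_t;

    static void counter_init(void)   {}
    static void counter_uninit(void) {}

    /* Apply an operational update ("add N") to the local copy. */
    static int counter_apply(swarm_file_t *f, void *log, const void *update)
    { (void)log; f->counter += *(const long *)update; return 0; }

    /* Conflict resolution: additions commute, so merging just applies. */
    static int counter_merge(swarm_file_t *f, const void *nb, const void *update)
    { (void)nb; return counter_apply(f, (void *)0, update); }

    static void counter_sync(swarm_file_t *f)    { (void)f; /* flush buffers */ }
    static void counter_cleanup(swarm_file_t *f) { (void)f; /* drop state    */ }

    static const swarm_plugin_ops_t counter_plugin = {
        counter_init, counter_apply, counter_merge,
        counter_sync, counter_cleanup, counter_uninit,
    };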
4.3 Using Swarm to Build Distributed Services
We envision Swarm being employed to provide coherent wide area caching to distributed services (such as file and directory services) in two ways. A new distributed
service can be implemented as a collection of peer-to-peer service agents that operate on
Swarm-hosted service data. Alternatively, an existing centralized service (such as an enterprise service) can export its data into Swarm’s space at its home site, and deploy wide
area service proxies that access the data via Swarm caches to improve responsiveness
to wide area clients. We have built a wrapper library around the BerkeleyDB database
library to enable applications to transparently replicate their BerkeleyDB database across
the wide area. We describe it in Section 5.4.
For example, consider an enterprise service such as payroll that serves its clients by
accessing information in a centralized database. Figure 4.1 shows how such a service is
typically organized. Clients access the service from geographically dispersed campuses
by contacting the primary server in its home campus via RPCs. As the figure shows,
the enterprise server is normally implemented on a cluster of machines and consists of
two tiers; the “application server” (labelled ‘AS’ in the figure) implements payroll logic,
Figure 4.1. A centralized enterprise service. Clients in remote campuses access the service via RPCs.
while the “DB” server (such as mySQL, BerkeleyDB, or Oracle) handles database access
when requested by ‘AS’ using the storage provided by a local storage service (FS).
Figure 4.2 shows how the same service could be organized to employ wide area
proxies based on Swarm. Each enterprise server (home or proxy) is a clone of the
original two-tier server with an additional component, namely, a Swarm server. Clients
can now access any of the available server clusters, as they are functionally identical.
At each enterprise server cluster, the ‘AS’ must be modified to issue its database queries
and updates to the Swarm server by wrapping them within Swarm sessions with desired
consistency semantics. In response, Swarm brings the local database to the desired consistency by coordinating with Swarm servers at other proxy locations, and then invokes
the operations on the local DB server. The DB server remains unchanged. Employing
this 3-tier architecture offloads data replication and consistency management complexity
from the enterprise service by enabling it to leverage Swarm’s support instead.
Figure 4.3 shows the architecture of the server cluster and its control flow in more
detail. The DB plugin encapsulates DB-specific logic needed by Swarm to manage
replicated access. It must be implemented and linked to each Swarm server and has
two functions: (i) it must apply the DB query and update operations that Swarm gets
from the AS and remote Swarm servers, to the local database replica; (ii) it must detect
and resolve conflicts among concurrent updates when requested by Swarm.
Figure 4.2. An enterprise application employing a Swarm-based proxy server. Clients in campus 2 access the local proxy server, while those in campus 3 invoke either server.
The steps involved in processing a client request in this organization are as follows:
1. A client issues a query or update request to the payroll application server (AS).
2. The AS identifies the database objects required to process the request, and opens
Swarm sessions on those objects specifying the required consistency semantics. It
does so via the Swarm client library using the API described in Section 4.2.1. In
response, Swarm brings the local database objects to the specified consistency as
follows:
3. Swarm contacts remote Swarm servers if necessary, and pulls any outstanding updates
to bring local replicas up-to-date.
4a. Swarm applies those updates locally via the DB plugin, which invokes them on the
local DB server.
4b. (Optional) Sometimes, Swarm must load a fresh copy of a database object to the
local store. This is required when instantiating a new database replica, or when
the local replica is too far out of sync with other replicas. In either case, Swarm
invokes the DB plugin to instantiate the local database replica in the local store.
Figure 4.3. Control flow in an enterprise service replicated using Swarm.
The AS can cache recent Swarm sessions. Hence, it can skip step 2 (i.e., opening
Swarm sessions) if it already has the requisite Swarm sessions opened as a side-effect of previous activity.
5a. The AS can directly invoke query operations on the local DB, bypassing Swarm.
5b. However, the AS must issue update operations on the local DB via Swarm, so Swarm
can log them locally and propagate them later to other replicas. Swarm invokes the
DB plugin to apply such operations, and returns their results to the AS.
6. In either case, the DB server performs the operation on the local database replica, and
is oblivious of replication.
5c. (Optional) When opening a session, the AS can ask Swarm to suggest
a remote proxy server site to redirect its request for better overall locality and
performance. The AS can then forward its client request to the remote AS at
the suggested site as an RPC, instead of executing it locally. Swarm dynamically
tracks the available data locality at various sites to determine the appropriate choice
between caching and RPCs.
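A compressed sketch of steps 2 and 5b from the application server's point of view follows. The underscored call names, the plugin id, the "-1 means the whole file" convention, and the stub bodies are assumptions for illustration; a real AS would link against the Swarm client library and its DB plugin instead.

    /* Sketch of steps 2 and 5b: the application server wraps a database
     * update in a Swarm session and issues it via the update call so Swarm
     * can log and propagate it.  Identifiers and stubs are illustrative.    */
    typedef struct { int access_mode; } cc_options_t;

    static int sw_open(unsigned long swid, long off, long len, const cc_options_t *cc)
    { (void)swid; (void)off; (void)len; (void)cc; return 1; }
    static int sw_update(int sid, int plugin_id, const char *update_proc)
    { (void)sid; (void)plugin_id; (void)update_proc; return 0; }
    static int sw_close(int sid) { (void)sid; return 0; }

    enum { DB_PLUGIN_ID = 7 };   /* hypothetical id of the DB plugin */

    int handle_payroll_update(unsigned long payroll_swid, const char *update_stmt)
    {
        cc_options_t cc = { 0 };                      /* desired consistency vector */
        int sid = sw_open(payroll_swid, 0, -1, &cc);  /* step 2: open a session     */
        int rc  = sw_update(sid, DB_PLUGIN_ID, update_stmt); /* step 5b: via Swarm  */
        sw_close(sid);                                /* session ends               */
        return rc;
    }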
Wide-area replication can be provided to all or parts of a file system in a similar manner.
First, the files and directories that will be accessed over the wide area are exported (i.e.,
mapped) to Swarm’s file name space. Then, clients access the mapped files by their
Swarm path names via local Swarm servers. Each file is mapped to a different Swarm
file and its consistency attributes can be controlled individually.
4.3.1 Designing Applications to Use Swarm
Employing Swarm for wide area replication in the way described above requires
the designers of a distributed service to map its shared data to Swarm’s data model, its
consistency requirements to the CC framework and its access methods to Swarm’s access
API. Swarm’s familiar data abstractions and intuitive API make this mapping relatively
straightforward for applications that already employ the data-shipping paradigm in their
design. However, the CC framework requires applications to express their consistency
requirements in terms of several low-level consistency options. Though many of these
options are intuitive, the right set of options may not be obvious in all application scenarios, and wrong choices lead to suboptimal performance. Hence determining the appropriate CC options for application data requires adhering to certain guidelines, which
we describe in this section.
First, we briefly outline a recommended methodology for designing distributed applications in the data-shipping paradigm for replication. It consists of three steps that are
independent of configurable consistency.
1. Structure the application as a collection of active objects or application entities,
each in charge of a well-defined portion of application state, with well-defined
operations on that state. This can be done by following the well-known object
design methodology adopted for a networked object model such as that of CORBA
[50]. Express operations on object state as performed in a multithreaded single-machine environment (such as employed by the Java programming model). Doing so forces the objects to be designed for concurrency (e.g., by adopting a session-oriented data access model). This is an important prerequisite step to designing for
replication.
2. Identify objects whose state persists across multiple operation invocations, and
consider the possible benefit of replicating each of those objects (with its state
and operations) on multiple application sites. The likely candidates for replication
are those objects for which a clear benefit can be found, such as improved load-balancing, scalability, access latency or availability.
3. Express all access to a replicable object’s state in terms of access to a file, logical
operations on a database, or as memory mapped access to a shared persistent heap.
This forces application objects to be designed for persistence, which simplifies
their deployment on multiple sites.
Once the replicable objects in an application have been identified, the next step is to examine their consistency requirements and express those requirements using CC options.
The following steps guide this design process:
1. For each of the candidate objects for replication, determine the high-level consistency semantics it needs by identifying unacceptable sharing behavior. What
invariants must be maintained at all times on the object state viewed by its operations? What does invalid state mean? At what points in the object’s logic
must validity be preserved? What interleavings among concurrent operations must
be avoided? Are there any dependencies (e.g., atomicity) among multiple data
access operations that must be preserved? This approach of finding the minimal
acceptable consistency enables maximal parallelism, while ensuring correctness.
2. For each operation on each object, express the above invariants in terms of the
five aspects of configurable consistency described in Table 1.1. For example,
unacceptable interleaving of multiple operations determines concurrency control.
Identify the minimal sequence of operations in the application logic that preserve
state validity. They are likely to be the boundaries of visibility and isolation. Study
how each update to state affects other entities viewing that state. The negative
consequences of stale data determine replica divergence control decisions.
3. Express all synchronization between application components, such as critical sections, using message-passing primitives, barriers, or exclusive data locks. Configurable consistency is not designed to support intercomponent synchronization
unless it can be efficiently expressed in terms of data locks.
4. Protect all accesses to application state by sessions that start and end access to
that state. Set the appropriate vector of configurable consistency options for each
session based on the invariants identified in step 1.
4.3.2 Application Design Examples
We illustrate how some of our focus applications (described in Section 2.3) can be
designed to employ Swarm’s caching support, and how the appropriate configurable
consistency options for their sharing needs can be determined.
4.3.2.1 Distributed File Service
A file service has two kinds of replicable objects: files and directories. A file must
be reachable from at least one directory as long as it is not deleted. This requires that
the file and its last directory entry be removed atomically. When a file or directory
is concurrently updated at multiple places, a common final result must prevail, which
can be ensured by the CC framework’s total ordering option. Source files and documents
typically have valid contents only at the end of an editing session, and hence need session
visibility and isolation. However, each update to a directory preserves its validity. Hence
per-access visibility and isolation are acceptable for directories. A user accessing a
shared document typically wants to see prior updates, and requires a hard most-current
timeliness guarantee.
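These choices can be written down as two option vectors, sketched below. The enum and field names are placeholders for the dimensions of Table 1.1, and the directory's timeliness setting is shown as most-current purely for illustration; the substantive content is the split stated above: session visibility and isolation for documents, per-access visibility and isolation for directories, and total ordering for both.

    /* Sketch: option vectors for the two kinds of file-service objects.
     * Enum and field names stand in for the dimensions of Table 1.1.        */
    typedef enum { VIS_PER_UPDATE, VIS_SESSION } visibility_t;
    typedef enum { ISO_PER_UPDATE, ISO_SESSION } isolation_t;
    typedef enum { ORDER_CAUSAL, ORDER_TOTAL } ordering_t;

    typedef struct {
        visibility_t visibility;
        isolation_t  isolation;
        ordering_t   ordering;       /* total ordering => common final result */
        int          most_current;   /* 1 = hard "most current" timeliness    */
    } cc_options_t;

    /* Documents: contents are valid only at session boundaries. */
    static const cc_options_t document_options = {
        VIS_SESSION, ISO_SESSION, ORDER_TOTAL, 1
    };

    /* Directories: every update preserves validity, so per-access
     * visibility/isolation is acceptable (timeliness shown here only as an
     * illustrative choice).                                                 */
    static const cc_options_t directory_options = {
        VIS_PER_UPDATE, ISO_PER_UPDATE, ORDER_TOTAL, 1
    };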
4.3.2.2 Music File Sharing
There are two classes of replicable objects in a music file sharing application: the
music index, which is frequently updated, and the music files themselves, which are
read-only. The index stores mappings from attributes to names of files with matching
attributes. Replicating the music file index speeds up queries. However, a single global
music index could get very large, making synchronization of replicas expensive when a
large number of users updates it frequently. Splitting the music index into a hierarchy of
independently replicable subindices helps spread this load among multiple nodes. In that
case, the only invariant to maintain is the integrity of the index hierarchy when splitting
or merging subindices. Returning stale results to queries is acceptable, provided eventual
replica convergence can be ensured.
4.3.2.3 Wide-area Enterprise Service Proxies
An enterprise service typically manages a number of objects that can be cached individually at wide area proxy sites to exploit regional locality. Examples include customer
records, sale items, inventory information, and other entities that an e-commerce service
deals with. However, these services typically store enterprise objects in a relational
database system that aggregates multiple objects into a single table. Replication at the
granularity of a table could lead to false sharing among multiple sites, which performs
poorly under strong consistency requirements. Employing an object veneer such as
that provided by Objectstore [41] on top of the relational database enables individual
objects to be cached with different consistency semantics based on their sharing patterns
for improved overall performance. For instance, close-to-open consistency suffices for
accessing customer records at a service center; bounded inconsistency is required for
reserving seats in an airline to limit the extent of overbooking [77]; the sale of individual
items requires strong consistency; eventual consistency suffices for casual browsing
operations.
4.4 Architectural Overview of Swarm
Having outlined Swarm and its expected usage for employing caching in distributed
applications, we now turn to describing Swarm’s design and implementation. We start
by giving an overview of Swarm’s internal architecture. Figure 4.4 shows the structure
of a Swarm server with its major modules and their flow of control.
Figure 4.4. Structure of a Swarm server and client process.
A Swarm file access request is handled by the Swarm server’s session management
module, which invokes the replication module to fetch a consistent local copy in the
desired access mode. The replication module creates a local replica after locating and
connecting to another available replica nearby. It then invokes the consistency module
to pull an up-to-date copy from other replicas. The consistency module performs global
concurrency control and replica divergence control followed by local concurrency control
before granting file access to the local session. The bulk of configurable consistency
functionality in a Swarm server is implemented by the consistency, update, log and the
inter-node protocol modules. Of these, the update and log modules are responsible for
storing updates, and preserving their ordering and dependency constraints during propagation. A Swarm server’s state is maintained in several node-local persistent cache data
structures (using BerkeleyDB) and cached files stored in the local file system. Finally,
application-supplied plugins can be dynamically linked to a Swarm server. They interpret
and apply operational updates issued by application instances and participate in resolving
conflicting updates, at all replicas. We describe each of these modules in the following
sections.
4.5 File Naming and Location Tracking
As mentioned in Section 4.2, Swarm files are named by their SWIDs. To access a
file, a Swarm server first locates peers holding permanent copies of the file, called its
custodians, based on its SWID via an external location service (e.g., Chord [68], or a
directory service). The number of custodians can be configured per file, and determines
Swarm’s ability to locate the file in spite of custodian failures. One of the custodians, designated the root custodian (also called the home node), coordinates the file’s replication
as described in the next section, to ensure consistency. By default, the root custodian is
the Swarm server where the file was created. In case of a proxy caching application, the
root custodian is usually the Swarm server at the application’s home service site where
all application state is permanently stored. When the root node fails, a new root is elected
from among the available custodians by a majority voting algorithm.
Scalable file naming and location are not the focus of our work, because several
scalable solutions exist [58, 68, 79]. Hence, for simplicity, we devised the following
scheme for assigning and locating SWIDs in our Swarm prototype. Each Swarm server
manages its own local ID space for files created locally, and a SWID is a combination of its owner Swarm server’s IP address and the file’s ID within that server’s local ID
space. A Swarm server finds a file’s root custodian based on the address hard-coded in
its SWID. Each allocated SWID is also given a generation number that is incremented
every time the ID is reassigned to a new file, to distinguish between references to its old
and new incarnations.
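The prototype's SWID layout can be sketched directly. The field widths below are illustrative (Swarm's SWIDs are 128-bit values); the generation check shows why reuse of a local id does not confuse stale references.

    /* Sketch of the prototype's SWID scheme: owner server address, local id,
     * and a generation number distinguishing reuse of the id.  Field widths
     * are illustrative only.                                                */
    #include <stdint.h>
    #include <stdbool.h>

    typedef struct {
        uint32_t owner_ip;    /* root custodian's address, hard-coded in the SWID */
        uint64_t local_id;    /* id within the owner's local id space             */
        uint32_t generation;  /* incremented each time local_id is reassigned     */
    } swid_t;

    /* A cached reference is stale if it names an older incarnation of the id. */
    static bool swid_is_current(const swid_t *ref, const swid_t *current)
    {
        return ref->owner_ip   == current->owner_ip &&
               ref->local_id   == current->local_id &&
               ref->generation == current->generation;
    }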
Swarm’s design allows our simple hardwired naming and location scheme to be easily
replaced by more robust schemes such as those of Pastry [58] and Chord [68], as follows.
Some of the Swarm servers, designated as SWID-location servers, run a peer-to-peer
location protocol such as that of Chord to manage the SWID-to-custodian mappings
among themselves. They advertise themselves to other Swarm servers by registering
with an external DNS-like directory service.
When a Swarm server allocates a new SWID, it becomes the SWID’s root custodian
and adds the SWID-to-custodian mapping to the SWID location service. Whenever a
custodian is added or removed for a SWID (as described in Section 4.6.2), its mapping
gets updated by contacting a SWID-location server. We expect this operation to be
infrequent, since location servers typically run on reliable machines. When looking up a
SWID, a Swarm server first discovers an available SWID-location server via the external
directory service and queries it for the SWID’s custodians.
To expedite SWID lookups and to reduce the load on the relatively few location servers, each
Swarm server caches the results of recent lookups for custodians as well as file replicas
in a lookup cache, and consults it before querying the distributed SWID location service.
Our prototype employs lookup caches.
4.6 Replication
Swarm’s replication design must meet three goals: it must enable the enforcement
of diverse consistency guarantees, provide low latency access to data despite large-scale
replication, and limit the load imposed on each replica to a manageable level.
Swarm servers create file replicas locally as a side-effect of access by Swarm clients.
For network-efficient and scalable consistency management, they dynamically organize
replicas of each file into an overlay replica hierarchy rooted at its root custodian, as
explained in this section. Each server imposes a user-configurable replica fanout (i.e.,
number of children, typically 4 to 8) to limit the amount of replica maintenance traffic
handled for each local replica. To support both strong and weak consistency flavors
of the configurable consistency framework across non-uniform links, Swarm employs a
consistency protocol based on recursive messaging along the links of the replica hierarchy (described in Section 4.7). Hence the protocol requires that the replica network be
acyclic to avoid deadlocks. Figure 4.5 shows a Swarm network of six servers with replica
hierarchies for two files.
In the rest of this section, we describe how Swarm creates and destroys replicas
and maintains dynamic cycle-free hierarchies and how a large number of Swarm nodes
discover and track each other’s accessibility and connection quality.
4.6.1 Creating Replicas
When a Swarm server R wants to cache a file locally (to serve a local access), it
first identifies custodians for the file’s SWID by querying the location service and caches
them in its local lookup cache of known replicas. It then requests one of them (say P)
Figure 4.5. File replication in a Swarm network. Files F1 and F2 are replicated at Swarm servers N1..N6. Permanent copies are shown in darker shade. F1 has two custodians: N4 and N5, while F2 has only one, namely, N5. Replica hierarchies are shown for F1 and F2 rooted at N4 and N5 respectively. Arrows indicate parent links.
to be its parent replica and provide a file copy, preferring those that can be reached via
high quality (low-latency) links over others. Each node tracks link quality with other
nodes as explained in Section 4.6.5. Unless P is serving “too many” child replicas, P
accepts R as its child, transfers file contents, and initiates consistency maintenance (as explained in Section 4.7). P also sends the identities of its children to R, along with an indication of whether it has already reached its configured fanout (#children) limit. R augments
its lookup cache with the supplied information. If P was overloaded, R remembers to
avoid asking P for that file in the near future. Otherwise R sets P as its parent replica.
R repeats this process of electing a parent until it has a valid file copy and an accessible
parent replica. The root custodian is its own parent. When a replica loses contact with
its parent, it reelects its parent in a similar manner. Even when a replica has a valid
parent, it continually monitors its network quality to known replicas and reconnects to a
closer replica in the background, if found. A replica detaches from its current parent only
after attaching successfully to a new parent. This process forms a dynamic hierarchical
replica network rooted at the root custodian, like Blaze’s file caching scheme [10], but
avoids multiple hops over slow network links when possible, like Pangaea [62].
The above parent election scheme could form cycles. Hence before accepting a new
parent replica, an interior replica (i.e., one that has children) actively detects cycles by
initiating a root probe request that is sent along parent links. If the probe reaches its
originator, it means that accepting the new parent will form a cycle. Leaf replicas do not
accept new children while they themselves are electing their parent. Hence they need not
probe for cycles, nor do they form primitive cycles.
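The root probe amounts to walking parent links from the candidate parent toward the root and checking whether the walk returns to the probing replica. The sketch below is ours and purely illustrative; in Swarm the probe is a message forwarded along parent links rather than a walk over a locally known map.

    # Illustrative cycle check before adopting a new parent (not Swarm's code).
    def would_form_cycle(candidate_parent, parent_of, me):
        """parent_of: dict node -> its current parent (the root maps to itself)."""
        node = candidate_parent
        seen = set()
        while node not in seen:                 # terminate even on a corrupt map
            if node == me:
                return True                     # probe reached its originator: cycle
            seen.add(node)
            nxt = parent_of.get(node)
            if nxt is None or nxt == node:      # reached the root custodian
                return False
            node = nxt
        return True

    # Example: N1's parent is N3 and N3's parent is N5 (root);
    # N3 adopting N1 as its parent would form a cycle.
    assert would_form_cycle("N1", {"N1": "N3", "N3": "N5", "N5": "N5"}, "N3")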
Figure 4.6 illustrates how file F2 ends up with the replica hierarchy shown in Figure
4.5, assuming node distances in the figure roughly indicate their network roundtrip times.
Initially, the file is homed at node N5, its only custodian. Nodes N1 and N3 cache the
file, making N5 their parent (Figure 4.6a). Subsequently, nodes N2 and N4 also cache
it directly from N5, while N1 reconnects to its nearby replica N3 as its parent (Figure
4.6b).2 Since N5 informs its new children of their sibling N3, they connect to N3, which
is nearer to them than N5 (Figure 4.6c). At that time, N2 gets informed of the nearer
replica at N1. Thus, N2 makes N1 its parent replica for file F2 (Figure 4.6d).
Each Swarm server’s administrator can independently limit the maximum fanout
(number of children) of its local file replicas based on the node’s CPU and network
load-handling capacity. A high fanout increases a node’s CPU load and the network
bandwidth consumed by Swarm communication. A low fanout results in a deep hierarchy, and increases the maximum hops between replicas and access latency. We believe
(based on the Pangaea study [62]) that a fanout of 4 to 8 provides a reasonable balance
between the load imposed due to fanout and the synchronization latency over a deep
hierarchy. We evaluated Swarm with several hundred replicas forming hierarchies up to
4 or 5 levels deep and a fanout of 4 (see Section 5.4.3). Since replicated performance is
primarily sensitive to the number of hops, and hence to hierarchy depth, we expect Swarm to gracefully
handle a fanout of 8, enabling it to scale to thousands of replicas with hierarchies of
similar depth.
2. Note that N3 could have reconnected to N1 instead. If both try connecting to each other, only
one of them succeeds, as one of them has to initiate a root probe which detects the cycle.
Figure 4.6. Replica Hierarchy Construction in Swarm. (a) Nodes N1 and N3 cache
file F2 from its home N5. (b) N2 and N4 cache it from N5; N1 reconnects to closer
replica N3. (c) Both N2 and N4 reconnect to N3 as it is closer than N5. (d) Finally, N2
reconnects to N1 as it is closer than N3.
The replica networking mechanism just described does not always form the densest
hierarchy possible, although the hierarchy it forms uses wide-area bandwidth efficiently.
A dense hierarchy, however, is what enables the system to scale logarithmically in the
number of replicas. To keep the hierarchy dense, a replica with fewer children than its
fanout limit periodically advertises itself to its subtree as a potential parent, giving its
descendants a chance to reconnect to it (and move up the hierarchy) if they find it to be
nearer.
Finally, all the blocks of a file share the same replica hierarchy, although consistency
is maintained at the granularity of individual file blocks. This keeps the amount of
hierarchy state maintained per file low and independent of file size, and simplifies the enforcement of causality and atomicity constraints for page-grained sharing of distributed
data structures.
4.6.2 Custodians for Failure-resilience
The purpose of multiple custodians is to guard against transient failure of the root
replica as well as to ease the load imposed on it by SWID lookup traffic from nodes
reconnecting to the hierarchy. A file’s home node recruits its child replicas as custodians
to maintain the number of custodians configured for the file. The identities of the
custodians are propagated to all replicas in the background to help them reconnect to the
replica hierarchy after losing contact with their parent. Also, if the root node becomes
inaccessible, the other custodians reconnect to each other to keep the file’s replica
hierarchy intact until the root node comes back up, preventing the hierarchy from
becoming permanently partitioned.
4.6.3 Retiring Replicas
Each Swarm server (other than a custodian) tracks local usage of cached file copies
and reclaims its cache space by evicting the least recently used copies, after propagating
their updates to neighboring replicas and informing those neighbors of the replica’s departure.
The orphaned child replicas elect a new parent as described above. However, before
a custodian can retire, it must inform the root node, as custodians must participate in
majority voting for root election. When the root node itself wishes to retire its copy,
it first transfers the root status to another custodian (in our prototype, the one with the
lowest IP address) via a two-phase protocol and becomes its noncustodian child. This
mechanism allows permanent migration of a file’s custody among nodes.
4.6.4 Node Membership Management
Swarm servers learn about each other’s existence as a side-effect of looking up SWIDs.
Each Swarm server communicates with another via a TCP connection to ensure reliable,
ordered delivery of messages. Since Swarm servers form replica hierarchies independently for each file, a Swarm server that caches a lot of files might need to communicate
with a large number of other Swarm servers, potentially causing scaling problems. To
facilitate scalable communication, each Swarm server keeps a fixed-size LRU cache of
live connections to remote nodes (called connection cache), and a larger persistent node
cache to efficiently track the status of remote nodes with which it communicated recently.
A replica need not maintain an active connection with its neighboring replicas all the
time, thus ensuring a Swarm server’s scalable handling of a large number of files and
other Swarm servers.
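The fixed-size LRU connection cache described above can be sketched as follows. The sketch is ours, not Swarm's source; the connect and close callables stand in for the actual TCP connection setup and teardown, and eviction only closes the idle connection while the peer's status remains in the larger persistent node cache.

    # Minimal sketch of a fixed-size LRU connection cache (illustrative only).
    from collections import OrderedDict

    class ConnectionCache:
        def __init__(self, capacity, connect, close):
            self.capacity = capacity          # max simultaneous live connections
            self.connect = connect            # callable: node_id -> connection
            self.close = close                # callable: connection -> None
            self.conns = OrderedDict()        # node_id -> connection, in LRU order

        def get(self, node_id):
            if node_id in self.conns:
                self.conns.move_to_end(node_id)      # mark as most recently used
                return self.conns[node_id]
            conn = self.connect(node_id)             # open a new connection on demand
            self.conns[node_id] = conn
            if len(self.conns) > self.capacity:      # evict the least recently used
                _, old = self.conns.popitem(last=False)
                self.close(old)
            return conn

    # Usage: cache = ConnectionCache(2, connect=lambda n: "conn-" + n, close=print)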
A Swarm server assumes that another node is reachable as long as it has an established TCP connection to it. If a communication error (as opposed to a graceful close
by peer) causes the TCP layer to shut down the connection, the Swarm server infers
that the other node is down and avoids it in subsequent lookups and replica reconnection
attempts for a timeout period. A Swarm node treats a busy neighbor differently from an
unreachable neighbor by employing longer timeouts for replies from a connected node
to avoid prematurely reacting to transient overload.
4.6.5 Network Economy
To aid in forming replica hierarchies that utilize the network efficiently, Swarm
servers continually monitor the quality of their network links to one another as a side-effect of normal communication, and persistently cache this information in the node
cache. Our prototype employs roundtrip time (RTT) as the link quality metric, but it
can be easily replaced by more sophisticated metrics that also encode bandwidth and
lossiness of links. The link quality information is used by the consistency management
system to enforce various failure-handling options, and by the replication system to
reduce the use of slow links in forming replica hierarchies. In our current design, each
replica decides for itself which other replicas are close to it by pinging them individually
(piggybacked on other Swarm messages if possible). Thus, the ping traffic is initially
high when a number of new Swarm servers join the system for the first time, but quickly
subsides and is unaffected by nodes going up and down. However, Swarm can easily
adopt better RTT estimation schemes such as IDMaps [25] as they become available.
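For concreteness, RTT tracking of this kind can be sketched as below. This is our illustration, not the prototype's code: the exponentially weighted moving average and its 0.8/0.2 weights are assumptions, and the class and method names are hypothetical.

    # Illustrative sketch of per-peer RTT tracking as a side-effect of messages.
    class LinkQuality:
        def __init__(self):
            self.rtt_ms = {}                       # peer -> smoothed RTT estimate (ms)

        def observe(self, peer, sent_at, received_at):
            """Fold one measured round trip (times in seconds) into the estimate."""
            sample = (received_at - sent_at) * 1000.0
            old = self.rtt_ms.get(peer)
            self.rtt_ms[peer] = sample if old is None else 0.8 * old + 0.2 * sample

        def closest(self, peers):
            """Pick the lowest-RTT peer among candidates; unknown peers rank last."""
            return min(peers, key=lambda p: self.rtt_ms.get(p, float("inf")))

    # Usage: lq = LinkQuality(); lq.observe("N3", sent_at=10.00, received_at=10.02)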
4.6.6 Failure Resilience
Swarm treats custodians (including root) as first-class replicas for reliability against
permanent loss, and employs a majority voting algorithm [35] to provide Byzantine fault-tolerance. Hence, for high availability under strong consistency, custodian copies must be
small in number (typically 4), and hosted on well-connected machines. However, Swarm
can handle a large number (thousands) of secondary replicas and their failures due to its
dynamic hierarchy maintenance. Swarm handles transient root node failure as explained
earlier in this section.
4.7 Consistency Management
In this section, we describe our design of configurable consistency management in
the context of Swarm.
Each Swarm server has a consistency module (CM) that is invoked when clients open
or close a file or when clients perform reads or updates within a session via the operations
listed in Table 4.1. The CM performs checks, interacting with replica neighbors (parent
and children) via pull operations as described below, to ensure that the local file copy
meets client requirements. Similarly, when a client issues updates, the CM propagates
them via push operations to enforce consistency guarantees given by peer servers to their
clients.
Before granting data access, the consistency module performs concurrency control
and replica divergence control. Concurrency control delays the request until all other
access sessions that conflict with the requested access mode (according to the concurrency matrix in Table 3.1) are completed at all replicas. Replica divergence control pulls
updates from other replicas (preserving their ordering constraints) to make sure that the
local copy meets given replica synchronization requirements.
To succinctly track the concurrency control and replica divergence guarantees given
to client sessions (i.e., time and mod bounds), the CM internally represents them by a
construct called the privilege vector (PV). The CM can allow a client to access a local
replica without contacting peers if the replica’s consistency guarantee indicated by its
PV is at least as strong as required by the client. For example, if a client requires
200ms staleness and the replica’s PV guarantees a max staleness of 100ms, no pull is
required. CMs at neighboring replicas in the hierarchy exchange PVs based on client
consistency demands, ensure that PVs do not violate guarantees of remote PVs, and
push enough updates to preserve each other’s PV guarantees. Although a Swarm server
is responsible for detecting inaccessible peers and repairing replica hierarchies, its CM
must continue to maintain consistency guarantees in spite of reorganization (by denying
access if necessary). To recover from unresponsive peers, the CMs at parent replicas
grant PVs to children as leases, as explained below.
4.7.1 Overview
4.7.1.1 Privilege Vector
A privilege vector (PV) consists of four components that are independently enforced
by different consistency mechanisms: an access mode (mode), a hard time/staleness limit
(HT), a hard mod limit (HM) and a soft time+mod limit (STM.[t,m]). Figure 4.7 defines
a PV and an “includes” relation (⊇) that partially orders PVs. This relation defines how
to compare the strength of consistency guarantee provided by two PVs. The figure also
lists the privilege state maintained by a replica, which will be explained later. Associated
with each replica of a consistency unit (i.e., a file or a portion of it) is a current privilege
vector (currentPV for short) that indicates the highest access mode and the tightest (i.e.,
numerically lowest) staleness and mod limit guarantees that can be given to local
sessions without violating the guarantees given at remote replicas. By default, a file’s
root custodian has the highest PV (i.e., [WRLK, *, *, *] where * is a wildcard), whereas
a new replica starts with the lowest PV (i.e., [RD, ∞, ∞, ∞]). A replica is a potential
writer if its PV contains a WR or WRLK mode, allowing it to update data.
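Figure 4.7 defines the PV and its relations in pseudo-code. The following executable Python sketch of the same structure is ours and is only illustrative; it encodes the mode ordering stated in the text (WRLK includes WR and RDLK, which both include RD, while WR and RDLK do not include each other) and treats numerically smaller time/mod bounds as tighter.

    # Illustrative sketch of a privilege vector; INF denotes an unbounded limit.
    from dataclasses import dataclass

    INF = float("inf")
    RD, WR, RDLK, WRLK = "RD", "WR", "RDLK", "WRLK"

    def mode_includes(m1, m2):
        # WRLK ⊇ WR ⊇ RD and WRLK ⊇ RDLK ⊇ RD; WR and RDLK do not include each other.
        covers = {WRLK: {WRLK, WR, RDLK, RD}, WR: {WR, RD}, RDLK: {RDLK, RD}, RD: {RD}}
        return m2 in covers[m1]

    @dataclass
    class PV:
        mode: str = RD
        ht: float = INF        # hard time (staleness) limit, enforced by pulls
        hm: float = INF        # hard mod limit, enforced by pushes
        stm_t: float = INF     # soft time limit
        stm_m: float = INF     # soft mod limit

        def includes(self, other):
            """The PV 'includes' relation (⊇) of Figure 4.7."""
            if self.mode == WRLK:
                return True
            if self.mode == RDLK and mode_includes(self.mode, other.mode):
                return True
            return (mode_includes(self.mode, other.mode)
                    and self.ht <= other.ht and self.hm <= other.hm
                    and self.stm_t <= other.stm_t and self.stm_m <= other.stm_m)

    # A new replica starts with the lowest PV, the root custodian with the highest:
    lowest, highest = PV(RD, INF, INF, INF, INF), PV(WRLK)
    assert highest.includes(lowest) and not lowest.includes(highest)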
4.7.1.2 Handling Data Access
When a Swarm client requests a nearby server to open a session on a data item, the
server first caches the item locally and computes the required PV from the session’s
consistency options. If the replica’s currentPV is not high enough to include the required
PV (i.e., the client wants a stronger consistency guarantee than the replica currently has),
the server obtains the required PV from other replicas as described below. Next, the
server performs local concurrency control by delaying the request until the close of other
local sessions whose access modes conflict with the new session’s mode (according to the
concurrency matrix in Table 3.1). These two actions constitute a pull operation. Finally,
Privilege Vector (PV):
[
  mode: (RD, WR, RDLK, WRLK)      /* (pull) */,
  HT:   (0, ∞)                    /* Hard Time limit (pull) */,
  HM:   (0, ∞)                    /* Hard Mod limit (push) */,
  STM:  [t: (0, ∞), m: (0, ∞)]    /* Soft Time and Mod limits (push) */
]

PVincludes(PV1, PV2)   /* the ⊇ relationship */
{
  /* checks if PV1 ⊇ PV2 */
  /* WRLK ⊇ WR ⊇ RD; WRLK ⊇ RDLK ⊇ RD */
  if (PV1.mode == WRLK) or
     ((PV1.mode == RDLK) and (PV1.mode ⊇ PV2.mode))
    return TRUE;
  else if (PV1.mode ⊇ PV2.mode) and
          (PV1.[HT, HM, STM] ≤ PV2.[HT, HM, STM])
    return TRUE;
  else return FALSE;
}

PVmax(PV1, PV2)
{
  return [max(PV1.mode, PV2.mode),
          min(PV1.HT, PV2.HT),
          min(PV1.HM, PV2.HM),
          min(PV1.STM, PV2.STM)];
}

PVmin(PV1, PV2)
{
  return [min(PV1.mode, PV2.mode),
          min(PV1.HT, PV2.HT),
          max(PV1.HM, PV2.HM),
          max(PV1.STM, PV2.STM)];
}

PVcompat(localPV)
{
  /* returns the highest remote PV compatible with localPV */
  /* localPV            ⇒  remote PV */
  [WRLK]               ⇒  [RD, ∞, 0, [0,0]]
  [RDLK]               ⇒  [RDLK]
  [WR, finite, *, *]   ⇒  [RD, ∞, 0, [0,0]]
  [RD, finite, *, *]   ⇒  [RDLK]
  [WR, ∞, *, *]        ⇒  [WR, ∞, 0, [0,0]]
  [RD, ∞, *, *]        ⇒  [WRLK]
}

Replica Privilege State:
foreach neighbor N,
  struct {
    PVin;             /* PV obtained from N */
    PVout;            /* PV granted to N */
    lease;            /* if N is parent: lease on PVin;
                         if N is child: lease on PVout */
    last_pull;        /* time of last completed pull */
    unsent_updates;   /* #updates to push */
    last_push;        /* time of last completed push */
  } N;
localPV   = PVmax(S.PV)   ∀ ongoing local sessions S.
currentPV = PVmin(N.PVin) ∀ neighbors N.

Invariants on Privilege State:
  (PVcompat(N.PVout) ⊇ N.PVin  and
   PVcompat(N.PVin)  ⊇ N.PVout)   ∀ neighbors N.
  currentPV ⊇ localPV.

Figure 4.7. Consistency Privilege Vector.
the server opens the local session, allowing the client to access the local replica. Figure
4.8 gives the pseudo-code for a session open operation.
4.7.1.3 Enforcing a PV’s Consistency Guarantee
A server obtains a higher PV for a local replica (e.g., to satisfy a client request) by
issuing RPCs to its neighbors in the replica’s hierarchy. These RPCs are handled by
those neighbors in a similar manner to client session open requests. To aid in tracking
remote PVs, a replica remembers, for each neighbor N, the relative PV granted to N
(called N.PVout) and obtained from N (called N.PVin). The replica’s currentPV is the
lowest (according to the PVmin() function defined in Figure 4.7) of the PVs it obtained
from its neighbors. Since each replica only keeps track of the PVs of its neighbors, a
replica’s PV state is proportional to its fanout and not to the total number of replicas of
the data item.
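The per-neighbor bookkeeping and the "can this be served locally?" check can be illustrated as follows. The sketch is ours and reuses the PV class from the sketch in Section 4.7.1.1; it simplifies the currentPV computation by saying a pull is needed whenever the PV obtained from some neighbor does not include the PV the client session requires.

    # Illustrative sketch of per-neighbor privilege state (not Swarm's data structures).
    class ReplicaState:
        def __init__(self):
            self.pv_in = {}    # neighbor -> PV obtained from that neighbor
            self.pv_out = {}   # neighbor -> PV granted to that neighbor

        def needs_pull(self, required_pv):
            """Neighbors from which a stronger PV must be pulled before opening."""
            return [n for n, pv in self.pv_in.items() if not pv.includes(required_pv)]

    # Example (staleness in ms): a 100 ms guarantee from the parent already covers a
    # 200 ms requirement, but the trivial guarantee does not.
    r = ReplicaState()
    r.pv_in["parent"] = PV(RD, ht=100)
    assert r.needs_pull(PV(RD, ht=200)) == []
    r.pv_in["parent"] = PV(RD)
    assert r.needs_pull(PV(RD, ht=200)) == ["parent"]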
When a replica R gets a pull RPC from a neighbor N, R pulls the required PV in turn
by recursively issuing pull RPCs to its other neighbors and following it up with local
concurrency control. Then, R grants the PV to the requestor N and stores it locally as
N.PVout, after updating N.PVin to reflect R’s own local consistency needs relative to the
requestor. By granting a PV, a replica promises to callback its neighbor before allowing
other accesses in its portion of the hierarchy that violate the granted PV’s consistency
guarantee. The recursive nature of the pull operation requires that the replica network be
acyclic, since cycles in the network cause deadlocks. The pull algorithm is discussed in
Section 4.7.2.
4.7.1.4 Handling Updates
When a client issues updates or closes a session, the Swarm server examines the PVs
granted to its neighbors. If it finds that their consistency guarantees (as indicated by
their N.PVout values) will be violated as a result of accepting local updates, it propagates
enough updates to bring them back to consistency before proceeding, a step called the
push operation. The update ordering, visibility and isolation options of the CC frame-
sw_open(SWID, offset, size, mode, cc_options)
{
  reqPV = [mode, ∞, ∞, ∞];
  if cc_options.strength == hard,
    reqPV.[HT, HM] = cc_options.[time, mod];
  else
    reqPV.STM = cc_options.[time, mod];
  if reqPV.STM.m > reqPV.HM,
    reqPV.STM.m = reqPV.HM;
  cache obj (SWID, offset, size) locally;
  pull(reqPV, self);
  open new local session S on obj with mode;
  S.PV = reqPV;
  return S.id;
}

sw_update(session_id, update)
{
  foreach neighbor N,
    push(N);   /* push outstanding updates to N */
  apply update locally;
}

sw_close(session_id)
{
  foreach neighbor N,
    push(N);   /* push outstanding updates to N */
  close session(session_id);
  wake up waiters;
}
Figure 4.8. Pseudo-code for opening and closing a session.
work are enforced as part of the push operation, as explained in Section 4.7.2. Figure 4.8
gives the pseudo-code for update and session close operations.
4.7.1.5 Leases
To reclaim PVs from unresponsive neighbors, a replica always grants PVs to its
children as leases (time-limited privileges), which they need to renew after a timeout
period. The root node grants a lease of 60 seconds, and other nodes grant a slightly
smaller lease than their parent lease, to give them sufficient time to respond to the parent’s
lease revocation. A parent replica can unilaterally revoke PVs from its children (and
break its callback promise) after their finite lease expires. Thus, all privileges eventually
end up with the root replica. Child replicas that accept updates made using a leased
privilege must propagate them to their parent within the lease period, or risk update
conflicts and inconsistencies such as lost updates. Leases are periodically refreshed via
a simple mechanism whereby a node pings other nodes that have issued it a lease for any
data (four times per lease period in our current implementation). Each successful ping
response implicitly refreshes all leases issued by the pinged node that are held by the
pinging node. If a parent is unresponsive, the node informs its own children that they
cannot renew their lease. Swarm’s lease maintenance algorithm is discussed in more
detail in Section 4.7.6.
4.7.1.6 Contention-aware Caching
A Swarm server normally satisfies data access requests by aggressively caching the
data locally and pulling the PV and updates from other replicas as needed. It keeps
the PV obtained indefinitely (renewing its lease if needed) until another node revokes
it. When clients near various replicas exhibit significant access locality, access-driven
replication and privilege migration help achieve high performance by amortizing their
costs over multiple accesses. However, if multiple sites contend for conflicting privileges
frequently, aggressive caching can incur high latency due to frequent pulls among replicas (called thrashing). Swarm provides a heuristic algorithm called contention-aware
caching that helps restrict frequent migration of access privileges among replicas under
contention to limit performance degradation. We describe this algorithm in Section 4.7.7.
Our implementation of consistency via exchange of privilege vectors enables wide
area clients with various degrees of laxity in concurrency and replica synchronization
requirements to co-exist. Clients with looser requirements get more parallelism, while
the strong requirements of others are respected globally.
For example, consider a replicated access scenario for file F2 of Figure 4.6d. Figure
4.9 illustrates the consistency management actions taken in response to clients accessing
the different replicas of F2 with different consistency requirements. Figure 4.9a shows
the initial state where the root replica owns the highest PV. Next, a client opens a write
session at node N1 with eventual consistency, meaning that it wants to be asynchronously
notified of remote writes (Figure 4.9b). In response, Swarm performs pull operations to
get this guarantee from other replicas. When N3 replies to N1, it lowers its own PV to
[rd, ∞, ∞, ∞]. Thus, when N1 gets update u1 from the client, it does not push it to N3.
Subsequently, a client opens a write session at N5 requiring synchronous notification
of remote writes, thus forcing them to serialize with its local updates (Figure 4.9c). When
it issues four updates u2-u5, they need not be propagated to N1 right away, since N1 can
tolerate up to 4 unseen writes. Next, a client at N4 (Figure 4.9d) requires a different kind
of hard guarantee, namely, latest data just before local access (e.g., at session open()
time, called close-to-open consistency). In response, N3 does not pull from N1, as N1
has promised to push its updates to N3 synchronously. When clients at N1 and N5 issue
updates u7 and u6 concurrently (Figure 4.9e), they are propagated to each other but not
to N4, as N4 needs latest data only before local access. Finally, when a client requests a
read lock session at N2 (Figure 4.9f), it blocks until all write sessions terminate globally,
and prevents subsequent writes until the session is closed.
Our hierarchy-based consistency mechanism scales to a large number of replicas by
requiring replicas to track only a limited number of other replicas (corresponding to
their fanout). This is because the PV presented to a replica by each neighbor includes
the summarized consistency invariants of that neighbor’s portion of the hierarchy. This
summarization is possible due to the inclusive property of PVs. Specifically, the locking
Figure 4.9. Consistency management actions in response to client access to file F2 of
Figure 4.6(d). Each replica is labelled with its currentPV, and ‘-’ denotes ∞.
(a) The initial replica network and its consistency status. (b) F2 is opened at N1 with
eventual consistency. Pulls are issued to let N3 and N5 know that N1 is a potential writer.
Subsequently, N1 gets update u1. (c) F2 is opened at N5 with a hard mod limit of 0. N1
must synchronously push its updates from now on. N5 also gets four updates u2-u5. But
since N1 set its mod limit to 4, N5 need not push its updates yet. (d) F2 is opened at N4
with a hard time limit of 0 (such as close-to-open consistency). N3 need not pull from
N1, as N1 will notify N3 before it updates the file. (e) N1 and N5 concurrently issue
updates u7 and u6 respectively. N1 pushes u7 to N5 through N3 and awaits an ack, while
N5 asynchronously pushes u2-u6. Updates are not pushed to N4, as it did not set a mod
limit in step (d). (f) Finally, F2 is opened in RDLK mode at N2. This reduces PVs at all
other replicas to their lowest value.
modes include nonlocking modes, and write modes include read modes (with the exception of WR and RDLK). This allows the access privileges of an entire portion of a replica
hierarchy to be represented concisely by the highest PV that includes all others.
4.7.2 Core Consistency Protocol
The bulk of the CM’s functionality (including its consistency protocol) consists of
the pull and push operations, which we describe in this section.
Pull(PV): Obtain the consistency guarantees represented by the given PV from remote
replicas.
Push(neighbor N): Propagate outstanding updates to a given neighbor replica N based
on its synchronization and ordering requirements as indicated by the PV granted
to it earlier (N.PVout).
A pull operation is performed before opening a session at a replica as well as during
a session if explicitly requested by a client. In addition to notifying a replica’s desired
consistency semantics to other replicas, it also enforces the replica’s hard time (HT)
bound by forcing other replicas to immediately bring it up-to-date by propagating all
their pending updates to it. A push operation is performed when updates arrive at a
replica (originating from local or remote sessions), are applied locally (after satisfying
the isolation options of local sessions) and are ready for propagation (based on the
originating session’s visibility setting). It enforces mod bounds (both hard and soft)
as well as soft time bounds at replicas (HM and STM) by propagating updates to them
while adhering to their ordering constraints.
To implement these operations, Swarm’s core consistency protocol consists of two
primary kinds of messages, namely, get and put. A Swarm server’s protocol handler
responds to a get message from a neighbor replica by first making a corresponding
pull request to the local consistency module to obtain the requested PV locally. It then
replies to the neighbor via a put message, transferring the requested privilege along with
any updates to bring it in sync. It also recomputes its own PVin to reflect its current
consistency requirements compatible with the neighbor’s new PV, using the PVcompat()
procedure outlined in Figure 4.7.
When a put message arrives, the protocol handler waits for ongoing update operations to finish, applies the incoming updates atomically, and makes a push request
to the consistency module to trigger their further propagation. Figure 4.10 gives the
pseudo-code for the get and put operations. We describe the pull algorithm in the next
section, and the push algorithm in Section 4.7.4.
4.7.3 Pull Algorithm
The pull operation implements both concurrency and replica divergence control, and
its algorithm is central to the correctness and performance of configurable consistency.
To ease presentation, we first describe the basic algorithm (given in Figure 4.11), and
refine it later with several improvements. We defer the discussion on enforcing the CC
framework’s update ordering options to Section 4.8. The consistency module at a replica
R makes each pull operation go through four successive steps:
1. The pull operation is blocked until previous pull operations complete. This ensures
FIFO ordering of access requests, so that an access request (e.g., exclusive write)
is not starved by a stream of conflicting access requests (e.g., exclusive reads).
2. The desired PV is requested from neighboring replicas that haven’t already granted
the privilege, via get messages. For this, R computes a relative PV that must be
obtained from each neighbor, so that the desired PV can be guaranteed locally
while keeping the PVs established earlier. It sends a get message to a neighbor N
if the PV obtained from N (N.PVin) does not include the relative PV needed. We
describe how the relative PVs are computed to enforce various timeliness bounds
in Section 4.7.4.
3. R waits for replies to its get messages. The neighbors handle get messages as
follows (Figure 4.10 gives the pseudo-code). They pull the PV recursively from
other neighbors, make sure that no other local access at their site conflicts with the
requested PV, and respond with a put reply message that grants the PV to replica
get(reqPV, requesting replica R)
{
 1:  pull(reqPV, R);
 2:  R.PVout = reqPV;
     /* compute my new PV relative to R */
 3a: R.PVin = PVmax(R.PVout, S.PV)
       ∀ neighbors N ≠ R, ∀ local sessions S;
 3b: R.PVout.HT = (R.PVin.mode  == WR, WRLK) ? ∞ : 0;
     R.PVin.HT  = (R.PVout.mode == WR, WRLK) ? ∞ : 0;
 3c: nwriters = #neighbors N where N.mode == WR, WRLK;
     R.PVin.HM    /= nwriters;
     R.PVin.STM.m /= nwriters;
     if R.PVin.STM.m > R.PVin.HM,
       R.PVin.STM.m = R.PVin.HM;
 3d: if (R.PVout.HM == 0) R.PVout.HT = 0;
     if (R.PVin.HM  == 0) R.PVin.HT  = 0;
 4:  send put(R.PVout, R.PVin, updates) as reply message to R,
     where updates bring R within reqPV.[HT, HM, STM] limits
     of me and all my neighbors excluding R.
}

put(dstPV, srcPV, updates) from neighbor R
{
  wait behind ongoing update operations;
  apply updates to local replica;
  R.PVin  = dstPV;
  R.PVout = srcPV;
  foreach neighbor N ≠ R,
    push(N);   /* push updates to neighbor N */
  signal waiters;
}

Figure 4.10. Basic Consistency Management Algorithms.
pull(reqPV, requesting replica R)
{
 1:  wait ongoing pull operations to finish
     (except when deadlock is possible, as explained in the text);
 2a: /* recompute mod bound if #writers changed */
     nwriters = #neighbors N, where N ≠ R and
                N.PVout.mode == WR, WRLK;
     ++nwriters if reqPV.mode == WR, WRLK;
     minHM = localPV.HM    / nwriters;
     minSM = localPV.STM.m / nwriters;
     if nwriters > 0,
       foreach neighbor N {
         split = nwriters;
         PV = (N == R) ? reqPV : N.PVout;
         --split if PV.mode == WR, WRLK;
         minHM = min(minHM, PV.HM    / split);
         minSM = min(minSM, PV.STM.m / split);
       }
     foreach neighboring replica N ≠ R {
       relPV = reqPV;
 2b:   if relPV.HT < (curtime() - N.last_pull),
         relPV.HT = ∞;
 2c:   relPV.[HM, STM.m] = [minHM, minSM];
       if not PVincludes(N.PVin, relPV) {
         relPV = PVmax(relPV, N.PVin);
         send get(relPV) message to N;
       }
     }
 3:  wait for put replies to get messages sent above;
 4:  foreach local session S,
       if reqPV.mode conflicts with S.PV.mode,
         wait for S to close.
}

Figure 4.11. Basic Pull Algorithm for configurable consistency.
R (see Figure 4.7 for the PV relations involved). At the end of this step, no replica other
than R has a conflicting access privilege, and R is within the desired timeliness bounds
(PV.[HT, HM, STM]).
4. Subsequently, the pull operation is blocked until local sessions with conflicting
access modes end. The pull operation is complete only at the end of this step.
Our algorithm ensures mutual compatibility of the PVs of replicas by establishing the
right set of callback promises between replicas to prevent unwanted divergence in step
2, and by serializing conflicting access mode sessions in step 4. The recursive nature of
the pull operation is illustrated in Figure 4.9b, where node N1 pulls privilege [wr, ∞, ∞,
[10, 4]] from N3, which in turn pulls it from N5. Figure 4.12 shows the relative PVs
established at all replicas before and after the client session is opened at N1.
By issuing parallel get requests to all neighbors along the hierarchy, the pull algorithm incurs latency that only grows logarithmically in the number of replicas (provided
the hierarchy is kept dense, as explained in Section 4.6.1). Also, since each replica only
keeps track of the privileges held by its neighbors, the privilege state maintained at a
replica is proportional to its fanout and not to the total number of replicas. The algorithm
exploits locality by performing synchronization (i.e., pulls and pushes) only between
replicas that need it, such as active writers and active readers. This is because, in step 2,
get messages are sent only to replicas that could potentially violate PV semantics. For
instance, in Figure 4.12b, N3 need not pull from N4 because N4 thinks N3’s PV is [wrlk]
and hence cannot write locally without first pulling from N3.
4.7.4 Replica Divergence Control
As mentioned in Section 4.7.2, a hard time/staleness bound (HT) is enforced by a pull
operation, while the other timeliness bounds (HM and STM) are enforced by the push
operation. In general, a replica obtains divergence bound guarantees from each neighbor
via a pull operation by specifying the tightest (i.e., numerically smallest) bound of its
own local sessions and other neighbors.
Figure 4.12. Computation of relative PVs during pull operations: (a) initial status;
(b) after open(wr, soft, 10, 4) at N1. The PVs labelling the edges show the PVin of a
replica obtained from each of its neighbors, and ‘-’ denotes ∞. For instance, in (b), after
node N3 finishes pulling from N5, its N5.PVin = [wr, ∞, ∞, [10,4]], and N5.PVout =
[rd, ∞, ∞, ∞]. Its currentPV becomes PVmin([wr, ∞, ∞, [10,4]], [rd, ∞, ∞, ∞], [wrlk]),
which is [rd, ∞, ∞, ∞].
4.7.4.1 Hard time bound (HT)
To enforce HT, a replica R keeps track of the last time it pulled updates from each
neighboring replica N, and includes an HT constraint in its pull request only if the
maximum time limit has elapsed since the last pull (step 2b in the pull() procedure of
Figure 4.11). In response, replica N grants a HT guarantee of 0 to R if its current
privileges relative to R (R.PVin) do not allow writes. Otherwise, it grants a HT guarantee
of ∞ (step 3b in the get() procedure of Figure 4.10). In the former case, R can unilaterally
assume the latest contents; in the latter case, it must pull before every access. This is
illustrated in Figure 4.9d: node N4 requests an HT of 0, but N3 grants it ∞, since N3 is a
potential writer relative to N4.
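The two sides of HT enforcement just described can be sketched as follows. The sketch is our illustration and not the Swarm implementation; the function and parameter names are hypothetical, and times are in milliseconds.

    # Illustrative sketch of hard time (HT) bound enforcement.
    INF = float("inf")

    def ht_to_request(required_ht, last_pull, now):
        """Requester side: include the HT constraint in the pull only if more time
        than the bound has elapsed since the last pull from this neighbor."""
        return required_ht if (now - last_pull) > required_ht else INF

    def ht_to_grant(grantor_may_write):
        """Grantor side: grant 0 (always fresh) if the grantor holds no write
        privilege relative to the requester; otherwise grant infinity, so the
        requester must pull before every access."""
        return INF if grantor_may_write else 0

    # A replica that pulled 50 ms ago need not re-ask for a 100 ms bound yet:
    assert ht_to_request(100, last_pull=9950, now=10000) == INF
    assert ht_to_grant(grantor_may_write=False) == 0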
4.7.4.2 Soft time bound
During a pull, a replica obtains a soft time bound guarantee (STM.t) from each
neighbor that is equal to the smallest of the soft time bounds of its own local sessions
and its other neighbors. To enforce the guarantee, a replica remembers the last time it
pushed updates to each of its neighbors, and pushes new updates when the soft time limit
guaranteed to the neighbor has elapsed since the last push (step 1 of the push() procedure
in Figure 4.13).
push(destination neighbor R)
{
 1: if R.unsent_updates > R.PVout.HM, or
       R.unsent_updates > R.PVout.STM.m, or
       (curtime() - R.last_push) > R.PVout.STM.t, {
 2:   send put(R.PVout, R.PVin, R_updates) message to R,
      where R_updates bring R within R.PVout.[HM, STM] limits.
 3:   if R.unsent_updates > R.PVout.HM,
        await ack for put message.
    }
}

Figure 4.13. Basic Push Algorithm for Configurable Consistency.
4.7.4.3 Mod bound
To enforce a modification bound of N unseen remote updates globally (HM and
STM.m), a replica splits the bound into smaller bounds to be imposed on each of its
neighbors that are potential writers (in step 2a of the pull() procedure in Figure 4.11). It
thus divides the responsibility of tracking updates among various portions of the replica
hierarchy. Those neighbors in turn recursively split the bound among other neighbors
until all replicas are covered. Each replica tracks the number of updates that are ready
for propagation but unsent to each neighbor, and pushes updates whenever it crosses the
neighbor’s mod limit (in step 1 of the push() procedure). If the mod bound is hard, it
pushes updates synchronously, i.e., waits until they are applied and acknowledged by
receivers, before making subsequent updates.
Since a mod bound guarantee is obtained only from neighbors that are potential
writers, it must be re-established via pulls whenever a replica’s neighbor newly becomes
a writer. When the neighbor issues a pull for a write privilege, the replica recomputes
the new bound in step 2a of the pull procedure and issues further pulls if necessary. It also
recomputes its own mod bound relative to the neighbor (in step 3c of the get() procedure)
before granting the write privilege. Our algorithms adjust PVs to reflect the fact that a
zero HM bound implicitly guarantees zero HT bound (step 3d of the get() procedure).
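The core idea of dividing a global mod bound among writer subtrees can be illustrated with a much simplified sketch; Figure 4.11 (step 2a) gives the precise rule used by the pull algorithm, which also accounts for the requesting neighbor. The sketch below is ours and only shows the basic split.

    # Simplified, illustrative split of a global mod bound among writer neighbors.
    def split_mod_bound(global_bound, writer_neighbors):
        """Give each potentially-writing neighbor an equal share of the bound, so
        the total number of unseen remote updates cannot exceed global_bound."""
        if not writer_neighbors:
            return {}
        share = global_bound // len(writer_neighbors)   # whole updates per subtree
        return {n: share for n in writer_neighbors}

    # A bound of 4 unseen writes split across two writing subtrees gives each a bound of 2.
    assert split_mod_bound(4, ["N1", "N5"]) == {"N1": 2, "N5": 2}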
4.7.5 Refinements
Next, we present several successive refinements to the pull algorithm presented above,
both to fix problems and to improve its performance.
4.7.5.1 Deadlock Avoidance
The serialization of pull operations in step 1 prevents starvation and ensures fairness,
but is prone to deadlock where concurrent pull operations at parent and child replicas
cause them to send get messages to each other and wait for replies. These gets trigger
their own pulls at their receiving replicas which wait behind the already waiting pulls,
causing a deadlock. To avoid this, we introduce asymmetry: we allow a parent’s pull
request at a child not to wait behind the child’s pull request to parent.
4.7.5.2 Parallelism for RD Mode Sessions
The pull algorithm presented so far can sometimes be overly restrictive. Consider
a RD mode session which must not block for ongoing locking mode (RDLK, WRLK)
sessions to finish elsewhere. However, this scenario is possible with our pull algorithm if
a replica R gets a WR mode session request, while a RDLK is held elsewhere. The WR
mode session issues a pull, which blocks for the remote RDLK session to finish. If a RD
session request arrives at R at this time, it blocks in step 1 behind the ongoing WR mode
pull, which will not complete unless the remote RDLK session completes. To fix this
problem, we exempt RD mode pulls from waiting in step 1 for pulls of non-RD modes
to finish. Instead, separate get requests are sent to other replicas for a RD privilege.
Also, a put reply that carries a RD privilege wakes up only waiters for RD mode. Thus,
a RD session at a replica proceeds with the latest contents while a WR session waits for
conflicting remote sessions to end elsewhere as well.
The next refinements add privilege-leasing and contention-aware caching to the algorithm, which we describe in the following sections.
4.7.6 Leases for Failure Resilience
Our next refinement adds leasing of privileges for a limited time to help replicas
recover from unresponsive neighbors, as explained in Section 4.7.1. When a replica P
grants a higher PV than [wr, ∞, ∞, ∞] (indicating a non-trivial consistency guarantee)
to its child replica C, it also gives it a lease period L, and locally marks C’s lease expiry
time as (local clock time + L). L can be set to twice the maximum latency for a message
to travel between the root and the farthest leaf along the hierarchy. When C receives the
privilege, it computes its own lease expiry time as (local clock time + L - latency to P),
and sets a smaller lease period for its children based on its own lease. Subsequently, if
P cannot deliver a get to C as part of a pull operation, it blocks the pull until C’s lease
expires and then unilaterally resets C.PVout to the lowest possible value (i.e., [rd, ∞, ∞,
∞]). The two replicas consider the lease valid as long as the Swarm nodes hosting
them communicate successfully at least once in a lease period.
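The lease bookkeeping just described can be sketched as follows. The 60-second root lease and the expiry formula (local clock plus the granted lease minus the latency to the parent) follow the text; the margin a child withholds from its own children and the helper names are our assumptions.

    # Illustrative sketch of lease bookkeeping at a child replica (times in seconds).
    ROOT_LEASE = 60.0           # lease period granted by the root custodian
    PINGS_PER_LEASE = 4         # keep-alive frequency toward the lease issuer

    def child_lease_state(granted_lease, latency_to_parent, now, margin=2.0):
        """Return (local expiry time, lease to offer our own children, ping interval)."""
        expiry = now + granted_lease - latency_to_parent
        lease_for_children = max(0.0, granted_lease - latency_to_parent - margin)
        ping_interval = granted_lease / PINGS_PER_LEASE
        return expiry, lease_for_children, ping_interval

    # e.g. a child 0.2 s away from the root refreshes its 60 s lease every 15 s:
    expiry, child_lease, ping = child_lease_state(ROOT_LEASE, 0.2, now=1000.0)
    assert ping == 15.0 and expiry == 1059.8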
A Swarm server keeps the parent leases of its local replicas of data items valid
by periodically pinging the servers hosting their parent replicas (four times in a lease
period). If the server finds that a remote server hosting parent replicas is unresponsive, it
forces local replicas to reconnect to a new parent and re-pull their privilege (via a special
‘lease recovery’ pull, explained below) before the lease with the old parent expires.
Replicas that succeed can continue to allow local access.3 For those replicas that cannot
recover the privilege within its lease period, the Swarm server informs their child replicas
that the lease cannot be renewed. Those replicas must fail ongoing sessions after the lease
period as their consistency can no longer be guaranteed.
Normally, a replica blocks pull requests until leases given to unresponsive children
expire. When a child replica needs to recover a valid lease after reconnecting to a new
parent, it requests a pull marked as a ‘lease recovery’ pull. When a replica receives such
a pull request, it immediately revokes leases from its unresponsive children, to enable the
reconnecting child replica to recover its leased privilege before it expires. For the same
3. Such a recovery of a leased privilege is possible if the Swarm server could reach the parent server
indirectly through other servers via different network paths. A common example is a Swarm server running
on a mobile computer that gets connected to different networks over time.
reason, lease recovery pulls arriving at a node are propagated ahead of normal pulls, and
trigger similar pulls recursively.
Our lease algorithm thus enables replicas to preserve lease validity and maintain
consistency guarantees in spite of inaccessibility of some replicas and a dynamically
changing hierarchy.
4.7.7 Contention-aware Caching
Our last refinement adds adaptation to the available data locality at multiple replicas
by switching between aggressive caching and redirection of client requests, which we
describe next.
Caching data at a site is effective if both the data and associated PVs are locally available when local clients need them most of the time. When clients make interleaved data
accesses at multiple replicas that often require them to synchronize via pull operations,
data access latency goes up, hurting application performance. In that case, better performance is achieved by centralizing data accesses, i.e., redirecting client requests to fewer
replicas and reducing the frequency of pull operations at those replicas. Centralization
eliminates the extra network traffic and delays incurred by synchronous pulls between
replicas, but adds a network roundtrip to remote client requests. The challenge lies in
designing an algorithm that dynamically makes the right tradeoff between caching and
centralization and chooses the replica sites where accesses should be directed to improve
overall performance compared to a static choice. We devised a simple heuristic algorithm
called adaptive caching to achieve this, which we describe in this section. Though our
algorithm is not optimal, it effectively prevents thrashing when accesses are interleaved
across replicas, and resorts to aggressive caching otherwise. Figure 4.14 shows the major
modifications made to replica state for adaptive caching.
To aid in adaptation, Swarm tracks the available locality at various replicas as follows. Each replica monitors the number of access requests to a data item (i.e., session
open() attempts) it satisfied locally between successive revocations of its privilege by
remote replicas, called its PV reuse count. This count serves as an estimate of the data
item’s access locality at that replica. For instance, if clients issue strictly interleaving
Replica State for adaptive caching:
foreach neighbor N,
  struct {
    ...
    cachemode: {MASTER, SLAVE, PEER},
    reuse;           /* relative reuse count */
  } N;
master;          /* none: PEER, self: MASTER, other: SLAVE */
reuse;           /* PV reuse count */
master_reuse;    /* reuse count between notifications to slaves */
low_thres, high_thres;   /* hysteresis thresholds */
notify_thres;    /* threshold on master_reuse to notify slaves */

Figure 4.14. Replication State for Adaptive Caching. Only salient fields are presented
for readability.
write sessions at multiple replicas of a file employing close-to-open consistency, their
reuse counts will all be zero, indicating lack of locality. The relative reuse count of a
replica with respect to its neighbor is the highest of its own reuse count and those of its
other neighbors. Replicas exchange their relative reuse counts with neighbors as part of
consistency messages (get and put), and reset their own count after relinquishing their
PV to neighbors.
Clients can specify a low and a high hysteresis threshold as per-file consistency
attributes that indicate the minimum number of local access requests a replica must see
before it can issue the next pull. A hysteresis threshold of zero corresponds to aggressive
caching, whereas a high value forces centralization by preventing privilege-shuttling as
explained below. An appropriate threshold value for an application would be based on
the ratio of the cost of pulling PVs to that of RPC to a central server.
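The hysteresis decision can be illustrated with the short sketch below. It is our simplification, not Swarm's code; check_locality() in Figure 4.15 is the authoritative version, and the string return values and parameter names here are hypothetical.

    # Illustrative sketch of the caching-vs-centralization decision.
    def caching_decision(reuse, low_thres, high_thres, current_master):
        if reuse < low_thres:
            return "elect-master-and-forward"   # too little locality: centralize
        if reuse >= high_thres and current_master is not None:
            return "switch-to-peer"             # enough locality: pull privileges locally
        return "keep-current-mode"

    # With thresholds (low=3, high=5), a replica that served only 1 open since its
    # last revocation forwards the next session instead of pulling:
    assert caching_decision(1, 3, 5, current_master=None) == "elect-master-and-forward"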
Figures 4.15 and 4.16 outline the configurable consistency algorithms that support
adaptive caching. When a replica must pull from other replicas to open a session locally, if its
reuse count is less than the file’s low hysteresis threshold, it sends out an election request
via its high-reuse neighbors for a master (M) mode replica where the session must
actually be opened (see the check_locality() algorithm in Figure 4.15). This request, called
elect_master_msg() in Figure 4.16, gets propagated via the replica hierarchy to an already
elected master replica or one that has accumulated a higher relative reuse count (or higher
than the low threshold). The forwarding replica now switches to slave (S) mode, i.e., it
transparently creates a session at the master and performs the local session’s reads and
writes through the master. Figure 4.16 outlines the master election algorithms.
Alternatively, when opening a session, the application can ask Swarm for the location
of the elected master replica and directly forward the application-level operation to its
application component near the master. In any case, the slave as well as the master replica
increment their local reuse counts for each forwarded session. A slave whose reuse count
rises to high threshold unilaterally switches to pulling access privileges, i.e., switches to
peer (P) mode. To keep the slave replicas in slave mode, the master periodically (i.e.,
whenever reuse crosses the low threshold) notifies its local reuse count to all slaves via
the replica hierarchy (in a notify_slave() message). If a slave sees that it contributed
to a major portion of the master’s reuse count (via session forwarding), it switches to peer
mode without waiting for its count to reach high threshold. Otherwise it conservatively
assumes contention at the master, and stays in slave mode after resetting its own count.
If many slaves are predominantly inactive (such as under high locality), periodic
usage notifications from the master are both unnecessary and wasteful of network bandwidth. To restrict such messages, a master starts out by never sending any usage notifications. Instead, slaves whose usage rises to the low threshold explicitly request a notification
from the master (in a notify_master() message). In response, the master sends out
a usage notification, and sets its reuse count at that time as the threshold for sending
subsequent notifications.
Our algorithm has several properties. When replicas reuse their PVs beyond the
hysteresis threshold between successive remote pulls (high locality), they all employ
aggressive caching. When accesses are interleaved but some replicas see significantly
more accesses than others (moderate locality), the most frequently accessed among them
gets elected as the master (due to its higher reuse count) to which all other client accesses
are redirected. Since master election stops at a replica with reuse count higher than
the threshold, there could be multiple master mode replicas active at a time, in which
case they are all peers relative to each other. When client accesses are interleaved but
uniformly spread among replicas (poor locality, e.g., due to widespread contention), one
check_locality():
  if reuse < low_thres,
    return elect_master();
  else if reuse < high_thres,
    if I am slave,
      send notify_master(reuse);
  else
    master = none;   /* PEER */
  return master;

pull(PV, requesting replica R):
  ...
  if not PVincludes(N.PVin, reqPV),
    M = check_locality();
    if M != self,
      return "try M";
  send get(PV) to N;
  await replies to gets;
  ...
  ++reuse;
  if master == self, and
     ++master_reuse > notify_thres, {
    send notify_slave(master_reuse) message
      to each neighbor in slave mode.
    master_reuse = 0;
  }

get(PV, requestor R):
  ...
  master = none;   /* PEER */
  reuse = 0;
  R.cachemode = PEER;
  send put(PV) as reply to R;

put(dstPV, srcPV, R_updates) from neighbor R:
  ...
  master = none;   /* PEER */
  wake up waiters;

Figure 4.15. Adaptive Caching Algorithms.
elect_master():
  if master ≠ none,
    return master;
  find replica N (in the order: parent, self, children)
    where N.reuse is highest or ≥ low_thres;
  if N ≠ self,
    send elect_master_msg() to N;
    await reply;
  else
    master = self;   /* MASTER */
  return master;

elect_master_msg() from replica R, forwarded by neighbor N:
  M = elect_master();
  N.cachemode = (M == N) ? MASTER : SLAVE;
  send elect_master_reply(M) to R;

elect_master_reply(M):
  master = M;
  if M == self, notify_thres = ∞;

notify_slave(master_usage) from neighbor R:
  reject unless I’m SLAVE;
  if reuse > master_usage / 2,
    master = none;   /* PEER */
  else
    reuse = 0;
  forward notify_slave(master_usage)
    to neighbors N ≠ R where N.cachemode == SLAVE;

notify_master(slave_usage) from slave replica N:
  if master_reuse ≥ low_thres,
    send notify_slave(master_reuse) message
      to each neighbor in slave mode.
  notify_thres = master_reuse;
  master_reuse = 0;

Figure 4.16. Master Election in Adaptive Caching. The algorithms presented here are
simplified for readability and do not handle race conditions such as simultaneous master
elections.
of the replicas gets elected as master based on its past reuse and remains that way until
some asymmetry arises in use counts. This is because, once a node becomes master,
its master status is not revoked by others until their local usage rises from zero to low
threshold. The rise of slaves’ reuse counts is prevented by the master’s usage notifications. Thus, frequent oscillations between master-slave and peer modes are prevented
under all conditions. However, our algorithm does not always elect one of the contenders
as master unless there is significant asymmetry among replica accesses.
Figure 4.17 illustrates the algorithm in the context of accesses to file F2 of Figure 4.5
when employing a low hysteresis threshold of 3 and a high of 5. All replicas start with a
zero reuse count. In this example, they all get write session requests with close-to-open
consistency (i.e., open(WR, hard, 0, ∞)) that require them to pull on each open. When
the replicas at nodes N2 and N4 get an open, they send out election requests to their
parent N3 (Figure 4.17a), since among neighbors with equal counts, the parent is chosen
to be master. N5 becomes the master, to which N1 and N4 forward one open each, while
N2 forwards 3. At the third open, N2 solicits a usage notification from N5 (Figures 4.17b
and c). Since N2 contributed to more than half of the 5 opens at N5, it switches to peer
mode and pulls the PV locally (Figure 4.17d) for the next open. When N4 issues an open
to N5, N5’s use count has already dropped to 0, so N5 suggests that N4 try N3. When N1, N3
and N4 issue uniformly interleaved open requests, they all elect N2 as their master due to
its recent high activity (Figure 4.17e). However, this time, they forward 2, 2, and 3 opens
respectively to N2, none of which contribute significantly to the 7 opens at N2. Hence
they keep resetting their local use counts in response to N2’s slave notifications and
stay as slaves, even if N2 stops getting further opens from local clients (Figure 4.17f).
Ideally, one of them should be elected as master instead of N2, but our algorithm does
not achieve that.
4.8 Update Propagation
The goal of update propagation algorithms is to ensure that each update issued by a
client on a data replica is eventually applied exactly once at all replicas, while satisfying
dependencies and ordering constraints. In other words, an update’s effect must not be
Figure 4.17. Adaptive Caching of file F2 of Figure 4.5 with hysteresis set to (low=3,
high=5). (a) N2, N4 elect N5 as master. (b) Moderate locality at N2. (c) N5 notifies
slaves. (d) N2 switches to peer mode. (e) N1, N3, N4 elect N2 as master. (f) Low
locality. In each figure, the replica in master mode is shown darkly shaded, peers are
lightly shaded, and slaves are unshaded.
duplicated or permanently lost anywhere, except due to other updates. If total ordering
is also imposed, an update must have the same effect at all replicas. In this section, we
describe Swarm’s update propagation mechanism.
Swarm propagates updates (in the form of modified file pages or operational update
packets) along the parent-child links of a replica hierarchy via put messages. For
example, in the replica hierarchy shown in Figure 4.6d, updates to file F2 at node N2
reach node N4 after traversing N1 and N3. When a Swarm client issues an update at
a replica (called its origin), it is given a node-local version number and a timestamp
based on local clock time. An update’s origin replica and its version number are used
to uniquely identify it globally. Operational updates are stored in their arrival order
at each node in a persistent update log and propagated in that order to other nodes.
Absolute updates that have semantic dependencies associated with them are also logged
with their old and new values, as they may need to be undone to preserve dependencies
(as explained below).
To ensure at-least-once update delivery to all replicas, a Swarm replica keeps local
client updates in its update log in tentative state and propagates them via its parent until
a custodian acknowledges their receipt. The custodian switches the updates to saved
state and takes over the responsibility to propagate them further. The origin replica can
then remove them from its local log. To prevent updates from being applied more than
once at a replica, Swarm can be configured to employ a mechanism based on loosely
synchronized clocks or version vectors, as we explain below.
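The per-update metadata and log-state transitions described above can be sketched as follows. The field names, the state constants, and the helper methods are illustrative assumptions, not Swarm's actual types.

    # Illustrative sketch of update records and their log states (not Swarm's code).
    from dataclasses import dataclass, field
    import itertools, time

    TENTATIVE, SAVED, REJECTED = "tentative", "saved", "rejected"

    @dataclass
    class UpdateRecord:
        origin: str                    # replica where the client issued the update
        version: int                   # node-local version number at the origin
        timestamp: float               # origin's local clock time
        payload: bytes = b""
        depends_on: list = field(default_factory=list)   # causal/atomic predecessors
        state: str = TENTATIVE

        def uid(self):
            return (self.origin, self.version)            # globally unique identity

    class UpdateLog:
        def __init__(self, node_id):
            self.node_id = node_id
            self._next = itertools.count(1)
            self.entries = []                              # kept in arrival order

        def append_local(self, payload, depends_on=()):
            rec = UpdateRecord(self.node_id, next(self._next), time.time(),
                               payload, list(depends_on))
            self.entries.append(rec)
            return rec

        def ack_from_custodian(self, uid):
            """A custodian acknowledged receipt: it takes over further propagation."""
            for rec in self.entries:
                if rec.uid() == uid:
                    rec.state = SAVED

    # Usage: log = UpdateLog("N2"); u = log.append_local(b"write"); log.ack_from_custodian(u.uid())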
4.8.1 Enforcing Semantic Dependencies
Swarm enforces ordering constraints by tagging each update with the semantic dependency and ordering types of its issuing session. With each update, Swarm tracks the
previous updates in the local log on which it has a causal and/or atomic dependency.
Replicas always log and propagate updates in their arrival order that also satisfies their
dependencies. Atomically grouped updates are always propagated and applied together.
4.8.2 Handling Concurrent Updates
When applying incoming remote updates, a Swarm replica checks if independent updates unknown to the sender were made elsewhere, indicating a potential update conflict.
Conflicts are possible when clients employ concurrent WR mode sessions. For ‘serially’
ordered updates, Swarm avoids conflicts by forwarding them to the root custodian to
be applied sequentially. For ‘unordered’ updates, Swarm ignores conflicts and applies
updates in their arrival order, which might vary from one replica to another. For ‘totally’
ordered updates, Swarm relies on a conflict resolution routine to impose a common
global order at all replicas. This routine must apply the same update ordering criterion at
all replicas to ensure a convergent final outcome. Swarm provides default resolution
routines that reorder updates by their origination timestamp or by their arrival order
at the root custodian. Swarm invokes the application plugin’s merge() operation (see
Table 4.2) as the resolution routine. The application can employ one of Swarm’s default
routines or supply its own merge() routine that exploits application knowledge (such as
commutativity) to resolve conflicts more efficiently.
If an incoming update could be applied or merged into the local copy successfully,
Swarm adds the update to its local log and re-establishes dependencies for further propagation. If the conflict could not be resolved (e.g., if the update got overridden by a
‘later’ update due to reordering), Swarm rolls back and switches the update to rejected
state and does not propagate it further. It recursively rolls back and rejects all other
updates in its atomicity group as well as those causally dependent on them. If an update
is rejected after its causally succeeding updates have already been propagated to other
nodes in the hierarchy, they will be undone when the overriding update that caused the
initial rejection is propagated to those nodes. This is due to our requirement of a common
conflict resolution criterion at all replicas. The Swarm replica informs the sending
neighbor of the identities of rejected and accepted updates in a subsequent message. The
sender can undo those updates immediately and recursively inform their source, or wait
for their overriding updates to arrive.
4.8.3 Enforcing Global Order
Swarm provides two approaches to enforcing a global order among concurrent updates, namely, a centralized update ‘commit’ approach similar to that of Bayou [54], and
a timestamp-based approach based on loosely synchronized clocks.
Centralized Commit: Swarm’s commit-based conflict resolution routine relies on a
central node (the root custodian) to order updates similar to Bayou [54]. An
update is considered committed if it is successfully applied locally at the root
custodian. The update is aborted if the root has received and rejected it. Otherwise,
it is tentative. When a Swarm replica receives an update from a client, it must
remember the update locally until it knows of the update’s commit or abort status.
At a replica, tentative updates are always reordered (i.e., undone and redone)
after committed updates that arrive later. Thus, a committed update never needs
to be reordered. A client can request a session’s updates to be synchronously
committed to ensure a stable, reliable result that is never lost, by specifying a
‘sync’ consistency option when opening the session. However, this reduces
performance. Centralized commit requires that absolute updates be logged with
old and new values as they may need to be reapplied based on the final commit
order at the root. This is expensive for whole-file access.
Timestamp-based ordering: Swarm’s timestamp-based conflict resolution routine requires nodes to loosely synchronize their clocks using a protocol such as NTP.
Unlike the centralized routine, each replica reorders updates locally based on their
origination timestamps. Thus, an update must be reordered (i.e., undone and
redone) whenever updates with earlier timestamps arrive locally. However, this
ordering scheme does not require absolute updates to be logged with old and new
values (except to preserve dependencies), as they are applied to each replica at
most once, and need not be undone.
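The following sketch (illustrative only) captures the reordering step of the timestamp-based routine: when an update with an earlier origination timestamp arrives, the replica undoes the suffix of already-applied updates with later timestamps, applies the newcomer, and redoes the suffix.

/* Sketch of timestamp-based reordering (illustrative only). */
#include <stdio.h>
#include <string.h>

#define MAX_LOG 16

struct upd { long ts; const char *op; };

static void undo(const struct upd *u)  { printf("undo  %s (ts=%ld)\n", u->op, u->ts); }
static void apply(const struct upd *u) { printf("apply %s (ts=%ld)\n", u->op, u->ts); }

/* 'log' holds applied updates sorted by timestamp; returns the new length. */
static int incorporate(struct upd *log, int n, struct upd incoming)
{
    int pos = n;
    while (pos > 0 && log[pos - 1].ts > incoming.ts) {
        undo(&log[pos - 1]);          /* roll back later-timestamped updates */
        pos--;
    }
    memmove(&log[pos + 1], &log[pos], (n - pos) * sizeof(log[0]));
    log[pos] = incoming;
    for (int i = pos; i <= n; i++)    /* (re)apply from the insertion point  */
        apply(&log[i]);
    return n + 1;
}

int main(void)
{
    struct upd log[MAX_LOG];
    int n = 0;
    n = incorporate(log, n, (struct upd){ .ts = 10, .op = "append A" });
    n = incorporate(log, n, (struct upd){ .ts = 30, .op = "append C" });
    n = incorporate(log, n, (struct upd){ .ts = 20, .op = "append B" }); /* forces a reorder */
    return 0;
}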
4.8.4 Version Vectors
To identify the updates to propagate among replicas, Swarm maintains a version
vector (VV) at each replica that indicates the version of the latest update originating in
every replica and incorporated locally. A VV concisely describes a replica’s contents for
the purpose of propagation. When a replica R connects to a new parent P, they exchange
their VVs and compare them to determine the updates to propagate to each other. In
general, the size of a VV is proportional to the total number of replicas, which could be
very large (thousands) in Swarm. Swarm needs to enforce consistency on data units as
small as a file block when absolute updates are employed, and at whole-file granularity
in case of operational updates. This means potentially maintaining a VV for each file
block, which is very expensive.
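A minimal sketch of the VV exchange follows, assuming for simplicity that replica ids index the vector directly; it only illustrates how two vectors determine what each side must send.

/* Sketch of a version-vector exchange between a replica and a new parent.
 * Replica ids 0..N-1 index the vector directly (a simplification; real VVs
 * map arbitrary replica ids and can be pruned). */
#include <stdio.h>

#define N_REPLICAS 4

typedef unsigned long vv_t[N_REPLICAS]; /* vv[r] = latest version originating at
                                           replica r that is incorporated locally */

int main(void)
{
    vv_t mine   = { 5, 2, 0, 7 };  /* child replica R's vector                    */
    vv_t parent = { 5, 0, 0, 9 };  /* new parent P's vector, exchanged on connect */

    /* For each origin, the side with the larger entry has updates the other lacks. */
    for (int r = 0; r < N_REPLICAS; r++) {
        if (mine[r] > parent[r])
            printf("R sends P updates from origin %d, versions %lu..%lu\n",
                   r, parent[r] + 1, mine[r]);
        else if (parent[r] > mine[r])
            printf("R pulls from P updates from origin %d, versions %lu..%lu\n",
                   r, mine[r] + 1, parent[r]);
    }
    return 0;
}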
Fortunately, Swarm’s organization of replicas in a tree topology enables several optimizations that significantly reduce the overhead of managing VVs in practice. Since a
replica propagates updates to its neighbors only once and a tree topology provides only
one path between any two replicas, each replica receives remote updates only once in a
stable hierarchy. Thus, a replica needs to exchange VVs only when it reconnects to a new
parent, not for every update exchange. When replica sites employ loosely synchronized
clocks, propagating absolute updates (used for page-based data structures and whole file
access) does not require VVs for reasons explained below.
4.8.5 Relative Versions
To identify the updates to propagate and to detect update conflicts among already
connected replicas, Swarm maintains neighbor-relative versions which are lightweight
compared to VVs. A replica gives a node-local version number (called its relative
version) to each incoming update from its neighbors and uses it as the log sequence number (LSN)
to log the update locally. It also remembers the sending neighbor to inform it if the
update gets rejected later due to a conflict. For each neighbor, it maintains the latest local
version sent and neighbor version received. Relative versions are compact and effective
for update propagation in the common case where replicas stay connected.
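The per-neighbor state can be summarized by the following sketch (the field names are ours): each incoming update is numbered with a node-local LSN, and the replica records how far it has sent to and received from each neighbor.

/* Sketch of neighbor-relative version bookkeeping (hypothetical field names). */
#include <stdint.h>
#include <stdio.h>

struct neighbor_state {
    int      id;            /* neighbor's node id                    */
    uint64_t last_sent;     /* highest local LSN already sent to it  */
    uint64_t last_received; /* highest of its LSNs received from it  */
};

struct replica_state {
    uint64_t next_lsn;      /* next node-local log sequence number   */
};

/* Log an update that arrived from 'from': assign it a local LSN (its relative
 * version here) and note how far we have heard from that neighbor. */
static uint64_t log_incoming(struct replica_state *r, struct neighbor_state *from,
                             uint64_t neighbor_lsn)
{
    uint64_t lsn = r->next_lsn++;
    if (neighbor_lsn > from->last_received)
        from->last_received = neighbor_lsn;
    printf("logged update as LSN %llu (from neighbor %d, its LSN %llu)\n",
           (unsigned long long)lsn, from->id, (unsigned long long)neighbor_lsn);
    return lsn;  /* the sender is remembered so it can be told of a later rejection */
}

int main(void)
{
    struct replica_state me = { .next_lsn = 1 };
    struct neighbor_state parent = { .id = 7, .last_sent = 0, .last_received = 0 };
    log_incoming(&me, &parent, 41);
    log_incoming(&me, &parent, 42);
    return 0;
}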
When a replica R reconnects to a new parent replica P, R must first reconcile its contents with P to a state from which it can proceed to synchronize using relative versions.
When absolute updates are employed, reconciliation simply involves overwriting R with
P’s contents if they are newer (or semantically merging their contents by invoking the
application plugin). When replicas’ clocks are loosely synchronized, their timestamps
are sufficient to determine which of them is newer, obviating the need for VVs.
With operational updates (to the whole file as a replication unit), reconciliation is
more involved. Both replicas must compare their VVs to identify the updates missing
from each other’s copy and to recompute their relative sent and received versions. Subsequently when sending updates to its neighbor, each replica must skip updates already
covered by the neighbor’s VV. If neither replica has all the updates in its local log to
bring its neighbor up to date (i.e., if the VVs are ‘too far apart’), they must resort to
a full file transfer. In a full transfer operation, the child replica obtains the entire file
from the parent and then reapplies any local updates unknown to the parent, modifying its
VV accordingly. The child replica could also reconcile with its new parent by always
reverting to the parent’s version via full transfer. In that case, the child need not maintain
VVs and does not permanently miss any updates. But without VVs, it must issue a pull
from parent if its clients require a monotonic reads guarantee (defined in Section 3.5.2),
as the new parent may not have seen updates already seen by the child. Therefore VVs
need not be maintained if an application requires the latest data, or if temporarily seeing
an older version of data is acceptable. On the other hand, a full file transfer operation
is expensive for large files such as database replicas, and could trigger cascading full
transfers in the replica’s subtree.
A variety of version vector maintenance algorithms have been proposed in the context
of wide area replication [18, 62, 56]; they employ different techniques to prune the size of
VVs as replicas join and leave the network. Swarm can employ any of those algorithms.
However, we do not maintain VVs in our prototype implementation of Swarm, since
none of the application scenarios that we evaluate require them. We do not discuss VV
algorithms in this dissertation, as they are not the focus of this work.
Though version vectors enable Swarm to provide session guarantees, employing
neighbor-relative versioning schemes and relying on loosely synchronized clocks make
many of Swarm’s update propagation algorithms simpler and more efficient.
4.9 Failure Resilience
In this section, we discuss Swarm’s resilience to node and network failures. We
assume a fail-stop model wherein a node either behaves correctly or stops responding,
but does not maliciously return incorrect data. We consider both transient and permanent
failures.
Swarm nodes use TCP for reliable, ordered message delivery over the wide area, and
handle network failures as explained in Section 4.6.4. They employ message timeouts to
recover from unresponsive nodes. Since Swarm employs recursive messaging, the reply
to a Swarm node’s request may take time proportional to the number of message hops,
which is hard to determine beforehand. In our prototype, we set the reply timeout to 60
seconds, which we found to be adequate for a variety of applications. A Swarm server
recovers from unresponsive clients by terminating their access sessions (and aborting
atomic updates) if their connection is not reestablished within a session timeout period
(60 seconds by default). The Swarm client library periodically checks its server connections
to make sure that its sessions remain valid.
A Swarm server ensures the internal consistency of its local state across node crashes
by storing it in transactionally consistent persistent data structures implemented as BerkeleyDB databases [67].
Swarm heals partitions in a replica hierarchy caused by node failures as explained
in Section 4.6.1. It employs leases to recover access privileges from unresponsive child
replicas as explained in Section 4.7.6. Swarm treats custodian failures differently from
other replica failures. When a custodian becomes inaccessible, other replicas bypass it
in the hierarchy, but custodians continually monitor its status until it recovers. However,
revoking the custodian status of a replica after its permanent failure requires a majority
vote by other custodians, and it must currently be initiated by the administrator.
Swarm’s update propagation mechanism is resilient to the loss of updates due to
node crashes. When a replica permanently fails, only those updates for which it is the
origin replica and which have not reached a custodian may be permanently lost. Replicas
recover from failure of intermediate parent replicas by repropagating local updates to a
custodian via their new parent.
4.9.1 Node Churn
Swarm can handle node churn (nodes continually joining and leaving Swarm) well
for the following reasons. A majority of nodes in a replica hierarchy are at the leaf
level, and their churn does not adversely affect Swarm’s performance. Since Swarm
nodes join a replica hierarchy at the leaf level, replicas on stable nodes (with less churn)
tend to remain longer and thus become interior replicas. When a replica node fails, its
child replicas elect a new parent lazily, in response to local accesses, rather than
immediately. Swarm tends to heal the disruption caused by an interior
replica leaving the system in a localized manner. This is because lookup caches absorb
most searches by child replicas for new parents, avoiding hierarchical lookups. Swarm’s
persistent node and lookup caches together enable replicas to quickly learn of other
nearby replicas by exploiting past information. Unless nodes have a very short life span
in Swarm (the time between joining Swarm for the first time and leaving it for the last
time), the location and link quality information in their persistent caches remains valid in
spite of nodes continually leaving and rejoining the Swarm network. Finally, the highest
allowed replica fanout can be configured at each Swarm node based on its expected
stability and available resources. For instance, the fanout of a Swarm server running on
a mobile device must be set very low (e.g., 0 or 1) to prevent it from becoming a parent
replica, while a corporate server can have a high fanout (4 or more).
4.10 Implementation
We have implemented a prototype of Swarm that runs on FreeBSD and Linux. Swarm
servers are user-level daemon processes that are accessed by applications via a Swarm
client library. Since communication dominates computation in Swarm, a Swarm server
or client library is implemented as a single-threaded non-blocking event-driven state
machine. In response to each external event (such as message arrival or a timeout),
Swarm runs its event handler from start to finish atomically. If the handler needs to block
for another event, Swarm saves its state in system data structures and continuations, and
restarts the handler when the awaited event happens. This organization allows a Swarm
server to efficiently handle a large number of concurrent events with little overhead.
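The sketch below illustrates this structure with invented event types and handlers: a handler that must wait for another event parks a continuation, and the dispatch loop resumes it when that event arrives.

/* Minimal sketch of an event-driven, continuation-based server loop
 * (illustrative only; Swarm's real event types and handlers differ). */
#include <stdio.h>

enum ev_type { EV_CLIENT_OPEN, EV_REPLY_ARRIVED, EV_NONE };

struct event { enum ev_type type; int session; };

struct continuation {
    enum ev_type waiting_for;                      /* EV_NONE when the slot is free */
    int session;                                   /* saved handler state           */
    void (*resume)(struct event *ev, int session);
};

static struct continuation pending = { EV_NONE, 0, NULL };

static void finish_open(struct event *ev, int session)
{
    (void)ev;
    printf("session %d: reply arrived, open completes\n", session);
}

static void handle_client_open(struct event *ev)
{
    printf("session %d: need data from parent, sending request\n", ev->session);
    /* Cannot block: save state as a continuation and return to the event loop. */
    pending.waiting_for = EV_REPLY_ARRIVED;
    pending.session = ev->session;
    pending.resume = finish_open;
}

static void dispatch(struct event *ev)
{
    if (pending.waiting_for == ev->type) {         /* resume a parked handler */
        struct continuation c = pending;
        pending.waiting_for = EV_NONE;
        c.resume(ev, c.session);
    } else if (ev->type == EV_CLIENT_OPEN) {
        handle_client_open(ev);
    }
}

int main(void)
{
    struct event events[] = { { EV_CLIENT_OPEN, 3 }, { EV_REPLY_ARRIVED, 3 } };
    int n = (int)(sizeof(events) / sizeof(events[0]));
    for (int i = 0; i < n; i++)
        dispatch(&events[i]);     /* each handler runs to completion atomically */
    return 0;
}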
A Swarm server stores all its persistent state (including file data and metadata) in
a single directory tree in the local file system. Each server can be configured with the
maximum amount of local storage it can consume. It uses that storage to host permanent
copies of files as well as to cache remote files. A Swarm server maintains its persistent metadata state in four data structures, all implemented as recoverable BerkeleyDB
databases. The region directory stores file metadata and consistency state keyed by
SWID. The segment directory stores the state of various file segments at the granularity
at which they are shared by Swarm nodes. The update log stores updates in their arrival
order keyed by log sequence number. The node cache remembers the connectivity and
link quality to remote nodes contacted earlier. In addition to these, the server also
maintains several in-memory data structures such as a table of open sessions, cache of
TCP connections, and the SWID lookup cache.
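The following sketch shows how such stores could be opened as transactional BerkeleyDB databases; the environment directory and database file names are illustrative, not the prototype's actual ones.

/* Sketch: open the four metadata stores as transactional BerkeleyDB databases.
 * The paths and database names below are illustrative. Build with -ldb. */
#include <db.h>
#include <stdlib.h>

static DB *open_db(DB_ENV *env, const char *name)
{
    DB *db;
    if (db_create(&db, env, 0) != 0)
        exit(1);
    /* DB_AUTO_COMMIT wraps the open in its own transaction. */
    if (db->open(db, NULL, name, NULL, DB_BTREE, DB_CREATE | DB_AUTO_COMMIT, 0644) != 0)
        exit(1);
    return db;
}

int main(void)
{
    DB_ENV *env;
    if (db_env_create(&env, 0) != 0)
        exit(1);
    /* Transactions and write-ahead logging give crash recovery for the metadata. */
    if (env->open(env, "/var/swarm/server1",
                  DB_CREATE | DB_INIT_MPOOL | DB_INIT_LOCK |
                  DB_INIT_LOG | DB_INIT_TXN | DB_RECOVER, 0) != 0)
        exit(1);

    DB *regions  = open_db(env, "region.db");    /* file metadata, keyed by SWID       */
    DB *segments = open_db(env, "segment.db");   /* per-segment sharing state           */
    DB *updlog   = open_db(env, "updlog.db");    /* updates, keyed by log seq. number   */
    DB *nodes    = open_db(env, "nodecache.db"); /* connectivity/link quality of peers  */

    nodes->close(nodes, 0);
    updlog->close(updlog, 0);
    segments->close(segments, 0);
    regions->close(regions, 0);
    env->close(env, 0);
    return 0;
}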
A Swarm server updates a local file atomically after completely importing incoming
contents, so that a crash in the middle of file transfer does not corrupt file contents.
When a Swarm server recovers from a crash, all open client sessions become invalid and
all updates by atomic sessions are undone.
Swarm clients normally incur an RPC to their local Swarm server for each Swarm API
invocation. To avoid this cost in the common case where a local client
repeatedly accesses the same data item, a Swarm server exposes its session table
to local clients via shared memory. A Swarm client can reopen recently closed Swarm
sessions locally without issuing an RPC to the local server. As our evaluation presented in
the next chapter shows, this significantly improves latency for fine-grained access to
Swarm-based persistent data structures when the client exhibits locality.
Our Swarm prototype does not implement full-sized version vectors or causal dependency tracking.
4.11 Discussion
Swarm servers are reactive. They cache objects locally in response to client accesses.
While Swarm provides mechanisms for creating cached copies and maintaining their
consistency, applications must determine where objects are cached/replicated. Applications can also build object prefetching schemes to suit their access patterns using the Swarm
API.
4.11.1 Security
Our Swarm design is based on the assumption that Swarm servers and clients are
mutually trustworthy. It does not provide security and authentication mechanisms necessary for communication between untrusted parties over untrusted channels. Though we
leave their design for future research, we outline a few techniques here. Communication
over untrusted channels can be accomplished by employing encryption-based communication protocols between Swarm nodes, such as secure TCP sockets, or by tunneling over
SSH channels. Swarm nodes can mutually authenticate themselves using well-known
public-key encryption mechanisms. However, managing their keys in a decentralized
and scalable manner requires further exploration. A recently proposed scheme, called
Self-certifying pathnames [33], provides a novel way to use public-key encryption mechanisms to support decentralized user authentication in globally distributed file systems
spanning multiple administrative domains.
4.11.2 Disaster Recovery
A Swarm server stores all critical data and metadata in a local file system directory
tree. Hence, each Swarm server’s data can be archived for disaster recovery and restored
in the same way existing local file systems are archived.
4.12 Summary
In this chapter, we described the design of Swarm, a wide area file store that employs
configurable consistency. We first described the design goals for Swarm and outlined
its basic architecture and expected usage for building distributed applications. We then
described its major subsystems including naming and location, replication, configurable
consistency management and update propagation.
In the next chapter, we present our evaluation of Swarm in the context of the four
focus applications we presented in Chapter 2. We describe their implementation on
Swarm and analyze their performance characteristics.
CHAPTER 5
EVALUATION
In the previous chapter, we showed how configurable consistency (CC) can be implemented in a peer replication environment spanning variable-quality networks. We
presented the design of Swarm, a data replication middleware that provides caching
with configurable consistency to diverse applications. In this chapter, we evaluate the
effectiveness of Swarm’s CC implementation in meeting the diverse consistency and
performance needs of the four representative applications mentioned in Section 2.3. For
each application, we show how its consistency requirements can be precisely expressed
in the CC framework. We also show how the resulting application's performance (i.e., latency,
throughput and/or network utilization) when using Swarm improves by more than 500%
relative to a client-server implementation without caching, and stays within 20% of
optimized implementations where available.
Our thesis claims that configurable consistency mechanisms can support the wide
area caching needs of diverse applications effectively. Hence, through this evaluation,
we demonstrate the following properties of Swarm’s configurable consistency implementation, which are important to support wide area caching effectively:
1. Flexible Consistency: (a) Swarm can enforce the diverse consistency semantics
required by different applications. Each semantics has its associated performance
and availability tradeoffs. (b) Swarm can enforce different semantics on the same
data simultaneously.
2. Contention-aware Replication: (a) By adopting configurable consistency, Swarm
exploits the available data locality effectively while enforcing a variety of consistency semantics. (b) Swarm’s contention-aware replication management mechanism outperforms both aggressive caching and RPCs (i.e., no caching) at all
levels of contention when providing strong consistency. Thus unlike other systems,
Swarm can support strongly consistent wide area caching effectively regardless of
the degree of contention for shared data.
3. Network Economy: Swarm leverages its proximity-aware replica management to
utilize network capacity efficiently for its consistency-related communication and
to hide variable network delays from applications.
4. Failure resilience and Scalability: Swarm performs gracefully and continues to
ensure consistency in spite of replicas continually joining and abruptly leaving
the Swarm network. Swarm preserves its network economy and failure resilience
properties even in networks with hundreds of replicas.
Different subsets of these properties are important for different applications. Since
the configurable consistency implementation presented in this thesis has all of these
properties, it supports the wide area caching needs of a wide variety of applications
effectively.
In Section 5.2, we show that Swarmfs, a peer-to-peer file system built on top of
Swarm, can exploit locality more effectively than existing file systems (property 2a),
while supporting a variety of file sharing modes. In particular, it supports file access
semantics ranging from exclusive file locking to close-to-open to eventual consistency
on a per-file basis (property 1a). Swarm’s proximity-aware replica management enables
Swarmfs to provide low latency file access over networks of diverse quality (property 3).
In Section 5.3, we show that SwarmProxy, a synthetic enterprise service using Swarm
for strongly consistent wide area proxy-caching, can significantly improve its responsiveness to wide-area clients using CC’s contention-aware replication control mechanism
(property 2b). This mechanism restricts caching to a few sites when contention for shared
data is high to limit synchronization overhead, and reverts to aggressive caching when
the contention subsides. This adaptive mechanism provides the best performance of both
aggressive caching and RPCs under all levels of contention.
In Section 5.4, we evaluate how SwarmDB, a wrapper library that augments the
BerkeleyDB database library with replication support using Swarm, performs using five
different consistency requirements ranging from strong (appropriate for a conventional
database) to time-bounded eventual consistency (appropriate for many directory services) (properties 1a and 1b). We find that even slightly relaxing consistency requirements
significantly improves throughput, provided the application can tolerate the weaker
semantics. We also show that when Swarm is used for sharing files indexed with SwarmDB, it conserves bandwidth by serving files from nearby sources (property 3) to hundreds of users while they continually join and leave the sharing network (property 4).
Finally, in Section 5.5, we demonstrate how SwarmCast, a real-time streaming multicast application, linearly scales its delivered event throughput to a large number of
subscribers by using Swarm servers that automatically form a multicast network to relay events. A single Swarm server in our unoptimized prototype is CPU-bound, and
thus delivers only 60% of the throughput delivered by a hand-optimized relay server
implementation. However, the aggregate throughput delivered to 120 subscribers scales
linearly with as many as 20 Swarm servers, an order of magnitude beyond what a single
hand-coded relay server can deliver due to its limited outgoing network bandwidth.
Moreover, when using Swarm, the SwarmCast application need not deal with networking
or buffering, which are the tricky aspects of implementing multicast. Thus, Swarm can be
leveraged effectively even for a traditional message-passing application such as real-time
content dissemination.
Each of these applications uses Swarm’s replication and consistency support in very
different ways. Swarm's flexible API enables each of these applications to manipulate shared data in its natural idiom, e.g., shared files, shared objects, and
databases. Swarm's configurable consistency enables each application to express and
enforce the consistency semantics appropriate for its sharing needs. Swarm's aggressive
peer replication enables each application to utilize network bandwidth efficiently and to
fully exploit available locality, which results in performance improvements over alternate
implementations where available.
We begin by describing our experimental environment in the next section. In subsequent sections, we present each of the representative applications in turn, describing how
we implemented it using Swarm before we present our evaluation.
5.1 Experimental Environment
For all our experiments, we used the University of Utah’s Emulab Network Testbed
[76]. Emulab allows us to network a collection of PCs emulating arbitrary network
topologies by configuring per-link latency, bandwidth, and packet loss rates. The PCs
have 850MHz Pentium-III CPUs with 512MB of RAM. Depending on the experimental
requirements, we configure them to run FreeBSD 4.7 (BSD), Redhat Linux 7.2 (RH7.2),
or RedHat 9 (RH9). In addition to the emulated experimental network, each PC is
connected to a separate 100Mbps control LAN which we use for logging experimental
output.
Swarm servers run as user-level processes and store files in a single directory in the
local file system, using the SWID as the filename. We set the replica fanout on each
server to a low value of 4 to induce replica hierarchies up to 4 or 5 levels deep and to
magnify the effect of deep hierarchies on access latencies. For the WAN experiments,
we frequently use Emulab’s ‘delayed LAN’ configuration to emulate nodes connected
via point-to-point links of diverse bandwidths and delays to a common Internet routing
network. Figure 5.1 illustrates a 6-node delayed LAN, where each link is configured
with 1Mbps of bandwidth and a 10ms one-way propagation delay, resulting in a 40ms
roundtrip latency between any two nodes. Clients and servers log experimental output
to an NFS server over the control network only at the start or finish of experiments to
minimize interference.
Figure 5.1. Network topology emulated by Emulab's delayed LAN configuration. The
figure shows each node having a 1Mbps link to the Internet routing core, and a 40ms
roundtrip delay to/from all other nodes.
5.2 Swarmfs: A Flexible Wide-area File System
Efficient support for wide area write-sharing of files enables more applications than
the read-only or rarely write-shared access supported by traditional file systems [38, 33, 57,
62, 47]. For instance, support for coherent write-sharing allows fine-grain collaboration
between users across the wide area. Although eventual consistency provides adequate
semantics and high availability when files are rarely write-shared, coherent write-sharing
requires close-to-open or strong consistency.
We built a peer-to-peer distributed file system called Swarmfs that provides the required flexibility in consistency. Swarmfs is implemented as a file system wrapper
integrated into each Swarm server. The wrapper exports a native file system interface
to Swarm files at the mount point /swarmfs via the local operating system (currently
Linux/FreeBSD). It provides a hierarchical file name space by implementing directories
within Swarm files using the Coda file system’s directory format [38]. Swarmfs synchronizes directory replicas via operational updates. We refer to a Swarm server with
the wrapper enabled as a Swarmfs agent. A Swarmfs agent interacts with the CodaFS
in-kernel module to provide native Swarmfs file access to local applications. Figure 4.4
illustrates this architecture. It could also be extended to export the NFS server interface
to let remote clients mount the Swarmfs file system.
The root directory of a Swarmfs path is specified by its globally unique Swarm file
ID (SWID). Thus, an absolute path name in Swarmfs looks like this:
"/swarmfs/swid:0xabcd.2/home/me/test.pdf".
Here, "swid:0xabcd.2" specifies that the root directory is stored in a Swarm file
with SWID "0xabcd" and generation number "2" (recall that a SWID's generation
number is incremented each time it is reassigned to a new file). Each Swarmfs agent is an
independent file server and can be made to mount a different directory as its root, which
is used as the starting point for pathname lookup when its local clients omit the SWID
component from their absolute path names (e.g., "/swarmfs/home/me/test.pdf" on that
machine). Since numeric SWID components are cumbersome to handle, users can avoid
dealing with them by creating symbolic links with more intuitive names. Swarmfs is thus
a federation of autonomous peer file servers that provide a decentralized but globally
uniform file name space. Each server can independently create files and directories
locally, cache remotely created files, and migrate files to other servers. Unlike NFS,
Swarmfs’ absolute path names are globally unique because they use SWIDs, i.e., an
absolute path name refers to the same file at all servers regardless of file migration.
By making use of Swarm, Swarmfs provides a unique combination of features not
found in existing file systems:
1. Network Economy: Unlike typical client-server file systems (e.g., NFS [64], AFS
[28] and Coda [38]), it exploits Swarm's proximity-aware peer replica networking
to access files from geographically nearby copies, similar to Pangaea [62].
2. Customizable Consistency: Unlike peer-to-peer file systems like Pangaea and Ficus [57], Swarmfs makes use of Swarm's configurable consistency framework to
support a broader range of file sharing semantics. In particular, it can provide
on a per-file-access basis: the strong consistency semantics of Sprite [48], the
close-to-open consistency of AFS [28] and Coda [38], the (weak) eventual consistency of Coda, Pangaea [62], and NFS [64], and the append-only consistency
of the WebOS file system [74]. Swarmfs employs close-to-open consistency by
default, but allows users to override it on a per-file-session basis. Swarmfs applies
a directory’s consistency settings automatically to subsequently created files and
subdirectories to facilitate easy administration of file consistency semantics.
3. Uniform, Decentralized Name Space: Unlike many existing file systems (AFS,
Coda, Pangaea, Ficus) that enforce a single global file system root, Swarmfs is
a federated file system similar to NFS and SFS [45]. Swarmfs agents can be
configured to mount the same root everywhere, or share only subtrees of their
name space. For example, a user can expose a project subtree of his home directory to be mounted by remote colleagues without exposing the rest of his local
Swarmfs file name space. This feature is enabled by the use of SWIDs that are
both location-transparent and globally unique.
4. Transparent File Migration: Unlike many file systems including SFS, Swarmfs
allows files to be transparently migrated between servers (i.e., permanently moved
without disrupting access). This facilitates the use of Swarmfs agents to provide
ubiquitous transient storage for mobile users with simple storage administration.
Client applications can perform Swarm-specific operations on individual files and directories, such as viewing or modifying consistency settings or performing operational updates,
via Unix ioctl() system calls. When clients access files under /swarmfs, the local
operating system invokes the CodaFS in-kernel file system module, which in turn invokes
the user-level Swarmfs agent via upcalls on a special FIFO device (just like Coda’s
client-side implementation). The CodaFS module makes these upcalls only for metadata
and open/close operations on files. Hence a Swarmfs agent can mediate client accesses to
a file only during those times. It invokes Swarm to obtain a consistent Swarm file copy
into the local file system, and supplies its name (inode number) to the kernel module.
The kernel module performs file reads and writes directly on the supplied file at local file
system speed, bypassing Swarmfs.
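To illustrate the ioctl()-based control path mentioned above, the sketch below shows how a client program might read and change a file's consistency options; the request codes and the option structure are invented for illustration and are not Swarmfs' actual interface.

/* Hypothetical example of driving Swarmfs-specific operations through ioctl();
 * the request codes and option struct below are invented for illustration. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Invented request codes; a real agent would publish these in a header. */
#define SWARMFS_IOC_GET_CC 0x53570001
#define SWARMFS_IOC_SET_CC 0x53570002

struct swarmfs_cc_opts {          /* invented: a small subset of CC options   */
    int concurrency;              /* e.g., RD/WR vs. RDLK/WRLK access modes   */
    int staleness_ms;             /* time bound for eventual consistency      */
    int sync_updates;             /* nonzero: commit updates synchronously    */
};

int main(void)
{
    int fd = open("/swarmfs/home/me/test.pdf", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct swarmfs_cc_opts cc;
    if (ioctl(fd, SWARMFS_IOC_GET_CC, &cc) == 0) {
        cc.sync_updates = 1;                     /* e.g., request the 'sync' option */
        if (ioctl(fd, SWARMFS_IOC_SET_CC, &cc) != 0)
            perror("SWARMFS_IOC_SET_CC");
    } else {
        perror("SWARMFS_IOC_GET_CC");
    }
    close(fd);
    return 0;
}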
5.2.1 Evaluation Overview
We evaluate Swarmfs under three different usage scenarios. First, in Section 5.2.2
we consider a personal file system workload with a single client and server connected
by either a high-speed LAN or a slow WAN link. This study shows that the inherent
inefficiencies of layering Swarmfs over the generic Swarm data store have little impact
on its baseline performance relative to a local file system when there is no sharing. Our
second and third experiments focus on sharing files across a WAN. In Section 5.2.3
we consider the case where shared files are accessed sequentially (e.g., by a roaming
user or by collaborators) on nodes spanning continents. This study shows that Swarm
efficiently enforces the close-to-open semantics required for this type of file sharing
while exploiting nearby replicas for near-local latency file access, unlike Coda. Finally,
in Section 5.2.4 we consider the case where a shared RCS repository is accessed in
parallel by a collection of developers spread across continents. This study shows that
Swarm’s lock caching not only provides correct semantics for this application, but can
also provide order-of-magnitude improvements in latency over client-server RCS by
caching files close to frequent sharers.
5.2.2 Personal Workload
We first study the baseline performance of Swarmfs and several other representative
distributed file systems in a LAN-based single-server, single-client setup with little sharing. In this experiment, a single client runs Andrew-tcl [62], a scaled-up version of the
Andrew benchmark [28], starting with a source tree already extracted into a directory on the
client. Thus, the client starts with a warm cache. Andrew-tcl is a personal software development workload consisting of five phases, each of which stresses a different aspect of
the system (e.g., read/write performance and metadata/directory operation performance):
(1) mkdir: creates 200 directories, (2) copy: copies the 745 Tcl-8.4 source files with a
total size of 13MB from one directory to another, (3) stat: performs “ls -lR” on the
files in both directories to read their metadata/attributes, (4) grep: performs “du” and
“grep” on the files, and (5) compile: compiles the source code. As there is no inter-node
sharing, the choice of consistency protocol does not significantly affect performance,
so we employed Swarmfs’ default close-to-open consistency. We ran the Andrew-tcl
benchmark on four file systems: the Redhat Linux 7.2 local file system, NFS, Coda,
and Swarmfs. For Swarmfs, we considered two modes: scache, where the client is
configured to merely cache remotely homed files, and speer, where files created on the
client are locally homed in full peer-to-peer mode. We averaged the results from five
runs per system, which resulted in a 95% confidence interval of 3% for all the numbers
presented.
Figure 5.2 shows the performance of each system when the client and server are
separated by a 100Mbps LAN, broken down by the time spent in each phase. Execution
time in all cases is dominated by the compute-bound compile phase, which is comparable
in all systems. Figure 5.3 focuses on the other phases. As expected, the best performance
is achieved running the benchmark directly on the client’s local file system. Among the
distributed file systems, NFS performs best, followed by speer, but all distributed file
systems perform within 10% of one another.
[Figure 5.2: stacked bar chart of Andrew-Tcl runtimes (seconds) on the 100Mbps LAN for speer, scache, coda, nfs, and linux local, broken down into the mkdir, copy, stat, grep, and compile phases.]
Figure 5.2. Andrew-Tcl Performance on 100Mbps LAN.
Speer performs better than scache during
the data-intensive copy and mkdir phases of the benchmark, because files created by
the benchmark are homed on the client node. Coda’s file copy over the LAN takes twice
as long as the scache copy due to Coda’s eager flushes of newly created files to the
server. In the grep and stat phases, Swarm must retrieve the metadata of 1490 files from
a BerkeleyDB database where it stores them, since we only employed an in-memory
cache of 500 file metadata entries. This, coupled with Swarm's unoptimized code path,
results in four times the stat latency and twice the grep latency of Coda.
Figure 5.4 shows the relative performance of Coda and Swarmfs when the client and
server are separated by a 1Mbps, 40ms WAN link. Coda detects that the link to the server
is slow and switches to its weak connectivity mode, where it creates files locally just like
peer-mode Swarmfs. Thus, both Coda and Speer perform within 13% of a local Linux
file system for personal file access across a WAN.
[Figure 5.3: bar chart of the non-compile Andrew-Tcl phase times (copy, grep, mkdir, and stat, in seconds) on the 100Mbps LAN for speer, scache, coda, nfs, and linux local.]
Figure 5.3. Andrew-Tcl Details on 100Mbps LAN.
[Figure 5.4: stacked bar chart of Andrew-Tcl phase times (seconds) for coda and speer across the 1Mbps, 40ms RTT link.]
Figure 5.4. Andrew-tcl results over 1Mbps, 40ms RTT link.
In summary, for a traditional personal file access workload, Swarmfs provides performance close to that of a local file system both across a LAN and across a WAN link
from a file server, despite being layered on top of a generic user-level middleware data
server. Like Coda and unlike NFS, Swarmfs exploits whole-file caching to provide local
performance across WAN links. In addition, Swarmfs servers can be configured as peer
file servers that create files locally, resulting in double the performance of Coda for file
copies in a LAN environment. Hence, Swarmfs supports traditional file sharing as well as
or better than existing distributed file systems.
5.2.3 Sequential File Sharing over WAN (Roaming)
Our next experiment evaluates Swarmfs on a synthetic “sequential file sharing” or
roaming benchmark, where we model collaborators at a series of locations accessing
shared files, one location at a time. This type of file sharing is representative of mobile
file access or roaming [62], and workflow applications where collaborators take turns
updating shared documents or files. To support close collaboration or near-simultaneous
file access, users require tight synchronization, such as that provided by close-to-open consistency semantics [48], even under weak connectivity. We compare three flavors of
distributed file systems: (1) Swarmfs employing peer servers (speer), (2) Coda forced to
remain in its strong connectivity mode even under weak connectivity so that it guarantees
close-to-open consistency (coda-s), and (3) Coda in its native adaptive mode where it
switches to providing eventual consistency when it detects weak connectivity (coda-w).
We show that Swarmfs not only provides the close-to-open semantics needed to run this
benchmark correctly, but also performs better than Coda by exploiting Swarm’s ability
to access files from nearby replicas and by employing a symmetric consistency protocol.
Although the Pangaea file system [62] also exploits nearby replicas like Swarm, it does
not provide close-to-open semantics.
We run the sequential file access benchmark on the emulated WAN topology shown
in Figure 5.5. The topology modeled consists of five widely distributed campuses, each
with two machines on a 100Mbps campus LAN. The node U1 (marked ‘home node’)
initially stores the 163 Tcl-8.4 source files with a total size of 6.7MB. We run Swarmfs
servers and Coda clients on all nodes and a Coda server on node U1.
[Figure 5.5: diagram of the emulated WAN topology: five campus LANs (Univ. with home node U1, ISP, Corp., Turkey, and France), each with two machines on a 100Mbps LAN, connected to an Internet routing core via 1Mbps links with varying delays.]
Figure 5.5. Network Topology for the Swarmfs Experiments described in Sections 5.2.3 and 5.2.4.
In our synthetic benchmark, clients at various nodes sequentially access files as
follows. Each client modifies one source file (modify), compiles the Tcl-8.4 source tree
by invoking ‘make’ (compile), and then deletes all object files by invoking ‘make clean’
(cleanup). These operations represent isolated updates, intensive file-based computation,
and creating and deleting a large number of temporary files. Clients on each node in each
campus (in the order University (U) → ISP (I) → Corporate (C) → Turkey (T) → France
(F)) perform the modify-compile-cleanup operations, one client after another. Thus, the
benchmark compares Swarmfs with an existing distributed file system on traditional file
operations in a distributed file access scenario. Figure 5.6 shows the replica hierarchy
that Swarm created for files in this experiment.
Figure 5.7 shows the compilation times on each node. As reported in the previous
section, both Swarmfs and Coda perform comparably when the home file server is on the
same local LAN (i.e., on nodes U1 and U2). Also, Coda in strong connectivity mode enforces
close-to-open semantics by pushing updates synchronously to the server. However,
since Swarm creates efficient replica hierarchies and acquires files from nearby replicas
(e.g., from another node on the same LAN, or, in the case of France, from Turkey), it
outperforms Coda’s client-server implementation, which always pulls files from server
U1. As a result, compilation at other LANs was two to five times faster over Swarmfs
than over Coda-s on average.
When Coda detects a slow link to the server (as in the case of nodes starting at
node I1), it switches to weak connectivity mode.
[Figure 5.6: the observed replica hierarchy for a Swarmfs file, with home node U1 at the root and parent replica links connecting the Univ., ISP, Corp., Turkey, and France nodes.]
Figure 5.6. The replica hierarchy observed for a Swarmfs file in the network of Figure 5.5.
In this mode, Coda guarantees only
eventual consistency, which causes incorrect behavior starting at node T2 (hence the
missing results in Figure 5.7 for nodes T2, F1 and F2). For instance, on node T2, the
‘make’ program skipped compiling several files as it found their object files from the
previous compilation on T1 still sticking around. Also, corrupt object files were reported
during the linking step. On closer examination, we found that they are caused by Coda’s
undesirable behavior during weak connectivity by switching to eventual consistency and
performing eager push of updates during trickle reintegration. During the compilation
at T1, Coda-w’s trickle reintegration mechanism starts pushing large object files to the
server across the slow link as soon as they are created locally, which clogs the server
link and delays the propagation of crucial directory updates that indicate the subsequent
object file deletions to the server. By the time T2 sees T1’s file deletion operations, it has
already started its compile and used obsolete object files. This is because Coda in weak
mode never pulls file updates from the server, to avoid the high latency.
Coda-s produced correct results because it enforces close-to-open consistency. However, it performs poorly across a WAN because it implements close-to-open semantics
conservatively by forcing write-through of all client updates to the server. The write-through policy causes a single file modification on Coda-s to incur double the latency
of Coda-w and Swarm, as shown for node T1 in Figure 5.8.
[Figure 5.7: plot of compile latency (seconds) at each LAN node, ordered by RTT to home node U1, for speer, coda-s, and coda-w.]
Figure 5.7. Roaming File Access: Swarmfs pulls source files from nearby replicas. Strong-mode Coda correctly compiles all files, but exhibits poor performance. Weak-mode Coda performs well, but generates incorrect results on the three nodes (T2, F1, F2) farthest from server U1.
Unlike Coda's asymmetric
roles of client and server, Swarm’s symmetric (peer) consistency protocol avoids costly
write-throughs across WAN links.
In summary, unlike Coda-w, Swarm provides the close-to-open semantics required
for the correct execution of the sequential sharing benchmark, and is more efficient
than Coda-s. Swarm exploits nearby replicas similar to peer-to-peer file systems such
as Pangaea and Ficus that provide pervasive replication. But it enforces the tight synchronization required for collaboration under all network conditions, which Pangaea and
Ficus cannot guarantee because they ensure only eventual consistency.
[Figure 5.8: plot of the latency (seconds) to fetch and modify a file at each WAN node, ordered by RTT to home node U1, for speer, coda-s, and coda-w.]
Figure 5.8. Latency to fetch and modify a file sequentially at WAN nodes. Strong-mode Coda writes the modified file synchronously to server.
5.2.4 Simultaneous WAN Access (Shared RCS)
Some distributed applications (e.g., email servers and version control systems) require reliable file locking or atomic file/directory operations to synchronize concurrent
read/write accesses and thus avoid hard to resolve update conflicts. However, the atomicity
guarantees required by these operations are not provided by most wide area file
systems across replicas. As a result, such applications cannot benefit from replication,
even if they exhibit high degrees of access locality.
For example, the RCS and CVS version control systems use the exclusive file creation
semantics provided by the POSIX open() system call's O_EXCL flag to gain exclusive access to repository files. During a checkout/checkin operation, RCS attempts to atomically
create a lock file and relies on its pre-existence to determine if someone else is accessing
the underlying repository file. The close-to-open consistency semantics provided by
AFS and Coda and the eventual consistency semantics provided by most distributed
file systems are inadequate to guarantee the exclusive file creation semantics that RCS
requires. Thus using those file systems to replicate an RCS or CVS repository can lead
to incorrect behavior. NFS provides the file locking required for repository file sharing
via a separate centralized lock manager. However, its client-server architecture inhibits
delegation of locking privileges between clients, preventing NFS from fully exploiting
regional locality across WANs. As a result, CVS designers do not recommend using
distributed file systems (other than NFS) to store repositories and most CVS installations (such as SourceForge.net) employ a client-server organization to support wide area
developers, which inhibits caching [17].
In contrast, Swarmfs users can ensure strong consistency semantics for exclusive
updates to repository files and directories as follows. An RCS/CVS root repository
directory’s consistency attribute can be set to use exclusive access mode (WRLK) for
updates, which recursively applies to all subsequently created repository files. As a
result, RCS/CVS programs need not be modified to use Swarmfs. Swarmfs can thus
safely replicate RCS/CVS files across a WAN, and also exploit locality for low latency
access.
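Concretely, the one-time setup could look like the following sketch, reusing the hypothetical ioctl interface from Section 5.2; the request code and mode constant are invented, but the effect mirrors the WRLK attribute described above.

/* Hypothetical one-time setup: mark an RCS repository root so that updates use
 * exclusive (WRLK) access, which then applies to files created under it.
 * The ioctl code and mode constant are invented for illustration. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

#define SWARMFS_IOC_SET_UPDATE_MODE 0x53570003   /* invented request code          */
#define SWARMFS_MODE_WRLK           2            /* invented: exclusive writer lock */

int main(void)
{
    int fd = open("/swarmfs/projects/RCSROOT", O_RDONLY);
    if (fd < 0) { perror("open repository root"); return 1; }

    int mode = SWARMFS_MODE_WRLK;
    if (ioctl(fd, SWARMFS_IOC_SET_UPDATE_MODE, &mode) != 0)
        perror("set WRLK update mode");
    /* Unmodified 'co'/'ci' still rely on O_EXCL lock-file creation; with WRLK
     * updates inherited by repository files, Swarmfs serializes them across replicas. */
    close(fd);
    return 0;
}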
5.2.5 Evaluation
To evaluate how effectively Swarmfs can support concurrent file sharing with strong
consistency, we simulated concurrent development activities on a project source tree
using a shared RCS repository for version control. Although CVS is a more popular
version control system than RCS, we chose the latter for our experiment, because RCS
employs per-file locking for concurrency control and hence allows more parallelism
than CVS, which currently locks the entire repository for every operation. Moreover,
CVS stores repository files in RCS format internally. Hence, modifying CVS to lock individual RCS files would allow more parallelism. Our RCS sharing experiments demonstrate the
potential benefits of such an effort. Overall, our results show that Swarmfs’ strongly
consistent caching provides near-local latency to wide area developers for checkin and
checkout operations (i.e., several orders of magnitude improvement over client-server)
when repository file accesses are highly localized, and up to three times lower latency
on average than a traditional client-server version control system even in the presence of
widespread sharing.
We evaluated two versions of RCS, one in which the RCS repository resides in
Swarmfs (peer sharing mode) and one in which the RCS repository resides on one
node and is accessed via ssh from other nodes (client-server/RPC mode). We employed
WRLK mode for repository file access in Swarmfs and allowed repository files to be
aggressively cached on demand to determine Swarmfs' behavior under worst-case thrashing conditions. For this set of experiments, we used the topology shown in Figure 5.5
without the ISP (I). The “Home Node” initially hosts three project subdirectories from
the Andrew-tcl benchmark sources: unix (39 files, 0.5MB), mac (43 files, 0.8MB), and
tests (131 files, 2.1MB).
Our synthetic software development benchmark consists of six phases, each lasting 200 seconds, that simulate diverse patterns of wide area file sharing. Figure 5.9
illustrates the access patterns during these six phases. During each phase, a developer
updates a random file every 0.5-2.5 seconds from the module she is currently working
on, producing 24 to 120 updates/minute. Each update consists of an RCS checkout (via
the ‘co’ command), a file modification, and a checkin (via the ‘ci’ command). Though
this update rate is much higher than typically produced by a single user, our intention is
to study the system’s behavior under heavy load. We use the latency of the checkout and
checkin operations as the performance metrics.
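The per-developer loop can be pictured with the following sketch; the file names, iteration count, and RCS invocations are illustrative of our driver rather than a verbatim copy of it.

/* Sketch of the per-developer workload loop (illustrative; file names, flags,
 * and counts are not the actual driver's). Assumes an RCS working directory. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    const char *files[] = { "unix/file1.c", "unix/file2.c", "unix/file3.c" };
    const int nfiles = sizeof(files) / sizeof(files[0]);
    char cmd[256];
    srand((unsigned)time(NULL));

    for (int i = 0; i < 80; i++) {                     /* roughly one 200-second phase */
        const char *f = files[rand() % nfiles];
        snprintf(cmd, sizeof(cmd), "co -l %s", f);     /* checkout with an RCS lock    */
        system(cmd);
        snprintf(cmd, sizeof(cmd), "echo edit >> %s", f);         /* modify the file   */
        system(cmd);
        snprintf(cmd, sizeof(cmd), "ci -u -m'bench edit' %s", f); /* check back in     */
        system(cmd);

        long us = 500000 + rand() % 2000000;           /* wait 0.5-2.5 seconds         */
        struct timespec ts = { us / 1000000, (us % 1000000) * 1000L };
        nanosleep(&ts, NULL);
    }
    return 0;
}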
In Phase 1 (widespread shared development), all developers (one on each of the eight
nodes) work concurrently on the unix module, and hence contend for unix files across
all sites. In Phase 2 (clustered development), the developers at the University (U) and
Corporate (C) sites in the U.S. switch to the tests module, which restricts contention
to within the U.S. Also, the developers in Turkey (T) continue work on the unix module
and the developers in France (F) switch to the mac module, so accesses to those modules
become highly localized. In Phases 3-6 (migratory development), work is shifted (every
200 seconds) between “cooperating” sites – the unix module’s development migrates
between the U.S. University and Turkey, while the mac module migrates between the
Corporate LAN in the U.S. and France (e.g., to time shift developers).
Figure 5.10 shows the file checkout latencies observed by clients on the University
LAN (U), where the primary copy of the RCS repository is hosted, as a timeline. Figure
5.11 shows the checkout latencies observed by clients on the “Turkey” LAN (T) across
the slow intercontinental link, also as a timeline. Each graph shows a scatter plot of the
latencies observed for Swarmfs-based checkouts, and also the average latency curves for
RPC-based checkouts from various sites for comparison.
[Figure 5.9: diagrams of which modules (unix, mac, tests) the Univ. (U), Corp. (C), Turkey (T), and France (F) sites access in Phase 1 (shared), Phase 2 (clustered), Phases 3 and 5 (migratory), and Phases 4 and 6 (migratory).]
Figure 5.9. Repository module access patterns during various phases of the RCS experiment. Graphs in Figures 5.10 and 5.11 show the measured file checkout latencies on Swarmfs-based RCS at Univ. (U) and Turkey (T) sites relative to client-server RCS.
For example, the latency of
a local RCS checkout operation at the primary site U1 is indicated in both graphs by
the curve at the bottom labelled ‘local U1’, whereas the RPC latency for checking out
files from the Turkey (T) LAN is indicated by the topmost curve labelled ‘rpc by T’.
Each vertical line denotes a phase transition. The labels at the bottom indicate the phase
number and those at the top indicate which LANs access which modules during each
phase.
In Phase 1, developers from all LANs work on unix files (as indicated by the ‘all
unix’ label at the top), and Swarmfs-based file checkouts at all sites incur latencies that
vary widely around 1.5 seconds, as indicated by the widely dispersed points in the ‘phase
1’ column in both graphs. However, Swarmfs latencies are roughly half of the RPC
latency incurred from Turkey.
[Figure 5.10: timeline scatter plot of Swarmfs-based RCS checkout latencies (milliseconds) at nodes U1 and U2 over the six phases, with average latency curves for RPC checkouts by T, C, and U2 and for local checkouts at U1.]
Figure 5.10. RCS on Swarmfs: Checkout Latencies near home server on the "University" (U) LAN.
In phase 2 (labelled ‘U, C tests’ at the top in the graph of
Figure 5.10), when only the developers in the Univ (U) and the Corporate (C) LANs share
the tests module files, Swarmfs-based checkouts incur widely varying but reduced
latencies as there are no sharers across the slow intercontinental link. At the same time,
the development work gets localized in the Turkey LAN (indicated by the label ‘T unix’ in
the phase 2 column of Figure 5.11). As a result, the two Swarmfs servers in Turkey quickly
cache the working set of files and provide local latency to their clients, as indicated by the
latency points clustered at the bottom near the ‘local U1’ curve. Although NFS provides
file locking it would not have provided the low latency benefit seen in this scenario
because it would force T1 and T2 to always get their file locks from the remote home
server U1. In each of the migratory development phases 3 to 6, whenever developers in
the U.S. University (U) and Turkey (T) LANs become active, their unix file accesses
incur high initial latencies on Swarmfs for a short period during which locking privileges
migrate from their peers. However, due to localized access, the latencies quickly drop to
that of local RCS at U1. These results are indicated by the latency trends for Swarmfs
checkouts in the columns for phases 3 and 5 in Figure 5.10, and in the columns for
phases 2, 4 and 6 in Figure 5.11.
[Figure 5.11: timeline scatter plot of Swarmfs-based RCS checkout latencies (milliseconds) at Turkey nodes T1 and T2 over the six phases, with average latency curves for RPC checkouts by T, C, and U2 and for local checkouts at U1.]
Figure 5.11. RCS on Swarmfs: Checkout Latencies at "Turkey" (T) site, far away from home server.
Thus, Swarmfs caching is highly effective at reducing
latencies when accesses are localized. In contrast, RPC-based checkouts incur the same
high WAN latency for all operations regardless of locality.
Using Swarmfs, no RCS operations failed, nor did any of our sanity checks detect any
problems, which indicates that Swarmfs provides correct file locking semantics and thus
enables unmodified RCS to work correctly under all workloads. Swarmfs enables RCS
developers to realize the low-latency benefit of caching when there is regional locality,
and provides better performance than client-server RCS even when there is widespread
contention for files. For instance, when the development work is highly localized as in
phases 3 and 5 at site U (Figure 5.10) and phases 2, 4 and 6 at site T (Figure 5.11), the
checkout latency quickly drops to that of local RCS operations on the home server U1.
Even when developers at multiple sites operate on a common set of files and exhibit no
locality (as in Phase 1 when they all access the unix module, and Phase 2 when U and
C access the tests module), Swarmfs outperforms client-server RCS. The worst-case
RCS latency on Swarmfs is close to 3 seconds for all developers, which is the same
as the average latency of client-server RCS across the slow link from Turkey (labelled
‘rpc by T’ in Figure 5.11). But Swarmfs’ average latency is less than half of the RPC
latency. This is because Swarm often avoids crossing the slow link for consistency traffic
by forming an efficient replica network. Finally, Swarmfs responds quickly to changes
in data locality. When the set of users sharing files changes (e.g., when developers at
sites U and T work on the unix module alternately in Phases 3-6), Swarm migrates
the replicas to the new set of sharers fairly rapidly.
Thus, Swarmfs not only supports reliable file locking across WANs unlike existing
file systems, but also exploits data locality and enables applications such as version
control to realize latencies close to those of a local file system when accesses are geographically
localized. In addition, Swarm’s efficient replica networking enables Swarmfs to mask
network delays effectively by avoiding frequent use of slow links for consistency traffic.
Due to these benefits, Swarmfs provides a better solution for file locking than traditional
client-server approaches including NFS.
5.2.6 Summary
In this section, we showed that Swarmfs, a distributed file system extension to Swarm,
effectively supports fine-grain collaboration as well as reliable file locking applications
by supporting the configurable consistency framework, unlike existing wide area file
systems. These applications can leverage Swarm’s proximity-aware replica management
and peer-to-peer consistency protocol to achieve near-local latencies when accesses are
highly localized, while achieving latency half that of traditional client-server schemes
even when there is widespread contention for shared data.
5.3 SwarmProxy: Wide-area Service Proxy
Enterprise servers that handle business-critical enterprise objects (e.g., sales, inventory or customer records) could improve responsiveness to clients in remote geographic
regions by deploying service proxies (commonly known as ASPs) in those regions.
Although a centralized cluster of servers can handle a large volume of incoming client
requests, the end-to-end throughput and availability of the clustered services are limited
by the cluster’s Internet link and the geographic spread of clients. As explained in Section
2.3.2, by caching enterprise objects locally, ASPs can improve response times, spread
load, and/or improve service availability. However, enterprise services tend to have
stringent integrity requirements, and enforcing them often requires strong consistency
[41].
Two key design challenges must be overcome to build wide area proxies for enterprise
servers: (i) performance: when clients contend for strongly consistent access to write-shared data, aggressively caching data performs poorly; (ii) availability: some proxies
may temporarily become inaccessible, and the service must continue to be available
despite such failures. However, since ASPs are typically deployed in an enterprise-class
environment consisting of powerful machines with stable Internet connectivity, a small
number (dozens) of proxies can handle a large number of clients.
Swarm can support wide-area caching of enterprise objects efficiently in applications
such as online shopping and auctions without compromising strong consistency due to
its contention-aware replication control mechanism. In this section, we demonstrate
that WAN proxy caching using Swarm improves the aggregate throughput and client
access latency of enterprise services by 200% relative to a central clustered solution
even when clients exhibit contention. Swarm’s contention-aware caching outperforms
both aggressive caching and client-server RPCs under all levels of contention, as well as
under dynamically changing data access patterns and contention across multiple sites.
5.3.1 Evaluation Overview
To evaluate how effectively Swarm’s contention-aware replication control supports
strongly-consistent wide area caching of enterprise objects, we built a simple synthetic
proxy-based enterprise service. The enterprise service consists of three tiers: a middle-tier enterprise server accepts service requests from front-end web servers and operates
on data stored in a backend object database. Unlike existing clustered architectures for
web-based enterprise services [24, 15], we deploy our web servers across the wide area
from the primary enterprise server to serve clients in different geographic regions.
Figure 5.12 illustrates the network architecture of the modeled service. We run the
primary enterprise server at node E0 that hosts a database of enterprise objects at its
local Swarm server. We deploy workload generators at 16 wide area nodes (E1-E16),
each of which simulates a busy web server (with 4 threads) processing query and update requests for objects from web browsers at full speed by issuing RPCs to the primary enterprise server. We refer to each workload generator thread as a client. To enhance performance, we gradually deploy proxy enterprise servers, called SwarmProxies, colocated with the web servers, and cause those web servers to issue RPCs to the colocated proxy if available. Each enterprise server (primary or proxy) serves client requests by accessing objects from a local Swarm server, and may forward the request to other servers when suggested by Swarm. For instance, in Figure 5.12, the proxy at node E15 forwards a local client request to E9, which replies directly to the client at E15.
Figure 5.12. Network Architecture of the SwarmProxy service.
We compare the access latency and the aggregate throughput observed by the clients
with three configurations under varying contention: (1) proxy servers requesting Swarm
to adapt between aggressive caching and master-slave (RPC) modes based on observed
contention for each object (denoted as adaptive caching), (2) proxy servers forcing
Swarm to always cache objects locally on demand (denoted as aggressive caching), and
(3) the traditional clustered organization with no proxies, where all clients issue RPCs to
the primary enterprise server across WAN (denoted as RPC).
Our results indicate that deploying adaptive caching proxies across WAN near client
sites improves the aggregate client throughput beyond RPC, even when client requests
exhibit a modest locality of 40% (i.e., each client accesses its own ‘portion’ of the shared
object space only 40% of the time). Clients colocated with adaptively caching proxies
experience near-local latency for objects that they access frequently, i.e., with locality of
40%, and an average latency 10% higher than that of a WAN RPC for other objects.
In contrast, when proxies employ aggressive caching, clients incur latencies 50% to
250% higher than that of a WAN RPC for all objects due to the high cost of enforcing
strong consistency across WAN. By avoiding this cost, the adaptive scheme improves
the aggregate client throughput by 360% over aggressive replication even when there is
high locality (i.e., each client accessing its own ‘portion’ of the shared object space 95%
of the time). Finally, Swarm automatically migrates object privileges to frequent usage
sites when working sets change. When perfect (100%) locality exists, i.e., when the same
object is not shared by multiple clients, Swarm redistributes the objects automatically to
clients that use them, and thereby provides linear speedup with WAN proxy caching
under both schemes.
When a proxy fails, clients near that site can redirect their requests to other sites.
After the leases on the cached objects at the failed proxy expire, other sites can serve
those objects, thus restoring service availability.
5.3.2 Workload
The objective of our SwarmProxy workload is to study Swarm’s ability to handle
contention when enforcing strong consistency. Hence we designed a workload similar
to the TPC-A transaction processing benchmark [73]. In TPC-A, each client repeatedly
(at full speed) selects a bank account at random on which to operate and then randomly
chooses to query or update the account (with 20% updates). Similarly, our synthetic
enterprise database consists of two components: an index structure and a collection of
256 enterprise objects (each 4KBytes), all stored in a single Swarm file managed as
a persistent page heap (with a page size of 4KBytes). The index structure is a page-based B-tree that maps object IDs to offsets within the heap where objects are stored.
We deliberately employ a small object space to induce a reasonable degree of sharing
and contention during a short period of experimental time. We run our workload on
16 WAN nodes (E1-E16) as illustrated in Figure 5.12. Each node is connected to a
common Internet backbone routing core by a 1Mbps, 10ms delay link and has 40ms
RTT to other nodes. Each of four clients per node (i.e., workload generator threads, as
explained earlier) repeatedly invokes a query or update operation on a random object
at full speed. Processing a client request at a SwarmProxy involves walking the B-tree
index in RDLK mode to find the requested object’s heap offset from its OID, and performing the requested operation after locking the object’s page in the appropriate mode
(RDLK or WRLK). Thus, index pages are read-only replicated whereas object pages
incur contention due to read-write replication.
We model a higher (50%) proportion of writes than TPC-A to evaluate SwarmProxy’s
performance under heavy write contention. We vary the degree of access locality as
follows. Associated with each of the 16 web servers are 16 “local” objects (e.g., clients
on E1 treat objects 1-16 as their local objects etc.). When a client randomly selects an
object on which to operate, it first decides whether to select a local object or a “random”
object from the entire set. We vary the likelihood of selecting a local object from 0%, in
which case the client selects any of the 256 objects with uniform probability, to 100%, in
which case the client selects one of its node’s 16 local objects with uniform probability.
In essence, the 100% case represents a partitioned object space with maximal throughput
because there is no sharing, while the 0% case represents a scenario where there is no
access locality.
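To make the selection rule concrete, the following sketch (our illustration, not the actual workload generator code) shows how a client thread might pick the next object ID; the constants and the zero-based object numbering are simplifying assumptions.

/* Illustrative sketch (not the actual workload generator) of how a client
 * thread on node 'node_id' picks the next object for a given locality level.
 * Object IDs are zero-based here for simplicity. */
#include <stdio.h>
#include <stdlib.h>

#define NUM_OBJECTS    256   /* total enterprise objects in the database */
#define LOCAL_SET_SIZE  16   /* objects treated as "local" by each node  */

/* node_id is in [1..16]; locality is the probability of choosing a local object. */
static int pick_object(int node_id, double locality)
{
    if (drand48() < locality) {
        int base = (node_id - 1) * LOCAL_SET_SIZE;     /* this node's slice */
        return base + (int)(drand48() * LOCAL_SET_SIZE);
    }
    return (int)(drand48() * NUM_OBJECTS);             /* any object, uniformly */
}

int main(void)
{
    srand48(1);
    for (int i = 0; i < 5; i++)                        /* e.g., node E1 at 40% locality */
        printf("next object: %d\n", pick_object(1, 0.40));
    return 0;
}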
5.3.3 Experiments
We evaluate SwarmProxy performance via three experiments run in sequence. Our
first experiment determines the effect of adding wide area proxies on the overall service
performance when varying the degree of access locality from 0% to 100%. We run
the primary enterprise server on node E0 and deploy web servers running the workload
described above on each of the 16 nodes (E1-E16). Thus initially, all web servers invoke
the primary server via RPCs across WAN links, emulating a traditional client-server
organization with no caching. Subsequently, we start a SwarmProxy (and associated
Swarm server) at a new node (E1-E16) every 50 seconds, and redirect the “web server”
on that node to use its local proxy. As we add proxies, Swarm caches objects near where
they are most often accessed, at the cost of potentially increasing the coherence traffic
needed to keep individual objects strongly consistent.
Our second experiment evaluates how Swarm automatically adapts to clients dynamically changing their working sets of objects, while fixing the amount of locality at 40%.
After we deploy SwarmProxies at all web servers, each client shifts its notion of what
objects are “local” to be those of its next cyclical neighbor (Clients on E1 treat objects
16-31 as local, etc.). We run this scenario, denoted "working set shift", for 100 seconds.
Our third experiment evaluates how Swarm performs when many geographically
widespread clients contend for very few objects, e.g., bidding for a popular auction
item. After experiment 2 ends, clients on nodes E9-E16 treat objects 1-16 as their
“local” objects, which introduces very heavy contention for those 16 objects. We run
this scenario, denoted as “contention”, for 100 seconds.
We run these experiments with enterprise servers employing both adaptive and aggressive caching schemes. For the adaptive case, we configured Swarm to require that
an object replica be accessed at least 6 times before its privileges can be migrated away.
We did this by setting the soft and hard caching thresholds (explained in Section 4.7.7)
on the Swarm file hosting the database to 6 and 9 respectively.
5.3.4 Results
Figures 5.13 and 5.14 show how aggregate throughput varies as we add SwarmProxies in Experiment 1 when varying the degree of locality (0%-100%) while enforcing
strong consistency. Vertical bars denote the addition of a proxy. To provide a baseline
for gauging the performance improvement with proxies, we measured the maximum
throughput that a single saturated enterprise server can deliver. To do this, we ran a
single web server’s workload with its four full-speed threads when it is colocated with
the primary server on node E0. The line labelled “local” in the graphs indicates the
performance of this scenario, which is 1000 ops/second. The line labelled "rpc" denotes
the aggregate throughput achieved when all the 16 web servers run the same workload on
the single primary server across the WAN links without proxies (480 ops/sec), which is
roughly half of what a single server can deliver. RPC provides low throughput because node E0's 1Mbps Internet link becomes the bottleneck.
Figure 5.13. SwarmProxy aggregate throughput (ops/sec) with adaptive replication.
At 100% (i.e., perfect) locality, adding proxies causes near-linear speedup, as expected, for both aggressive and adaptive replication. Swarm automatically partitions the
database to exploit perfect locality if it exists. Also for both replication modes, the higher
the degree of locality, the higher the measured throughput.
With aggressive replication, the aggregate throughput quickly levels off as more proxies contend for the same objects. Even at 95% locality, throughput never exceeds 2500
ops/sec, because objects and their locking privileges are migrated across the WAN from
the site that frequently accesses them to serve cache misses elsewhere. With adaptive
replication, clients initially forward all requests to the root server on E0. When locality
is 40% or higher, SwarmProxies cache their “local” objects soon after they are spawned.
Under these circumstances, nodes use RPCs to access “remote” objects, rather than
replicating them, which eliminates thrashing and allows throughput to continue to scale
as proxies are added. With high (95%) locality, the adaptive scheme can support almost
9000 ops/sec in the fully proxy-based system (a 360% improvement over the aggressive
scheme). Further improvement was not achieved because our contention-aware replica
control algorithm incurs some communication overhead to track locality, which could be optimized further.
Figure 5.14. SwarmProxy aggregate throughput (ops/sec) with aggressive caching. The Y-axis in this graph is set to a smaller scale than in Figure 5.13 to show more detail.
Figure 5.15. SwarmProxy latencies for local objects at 40% locality.
Figure 5.15 shows the distribution of access latencies for “local” objects on node
E9 throughout the experiment. Even with modest (40%) locality, the adaptive scheme
reduces the access latency of “local” objects from over 100msecs to under 5msecs once a
local proxy is spawned. In contrast, when using aggressive replication, the average access
latency hovers around 100msecs due to frequent lock shuttling. As Figure 5.16 shows,
once a local proxy is started, the adaptive scheme incurs a latency of around 75msecs for
nonlocal objects, which is about 15% higher than the typical client-server RPC latency
of 65msecs. Thus, the adaptive scheme performs close to RPC and outperforms the
aggressive scheme for access to “nonlocal” objects, despite never caching these objects
locally, because it eliminates useless coherence traffic.
When we have each node shift the set of objects that most interest it while maintaining
locality at 40%, the phase denoted “expt2” in Figures 5.15 and 5.16, the adaptive scheme
migrates each object to the node that now accesses it most often within roughly 10
seconds, as seen in Figure 5.15. The performance of the aggressive caching does not
change, nor does its average access latency for “nonlocal” objects. This shows that
Swarm's contention-aware replication mechanism not only directs clients to where an object is more frequently accessed, but also dynamically tracks usage shifts between sites.
Figure 5.16. SwarmProxy latencies for non-local objects at 40% locality.
Finally, when we induce extremely heavy contention for a small number of objects,
the phase denoted “expt3” in Figures 5.15 and 5.16, the adaptive scheme almost immediately picks a single replica to cache the data and become master and shifts other
replicas into slave mode. By doing so, Swarm is able to serve even heavily contended
objects with RPC latency (under 100 msecs) at node E9. In contrast, we found that the
aggressive replication protocol often requires over 500msecs to service a request and only
rarely sees sub-100msec latency. Thus, Swarm automatically switches to centralization
when it detects extreme contention and prevents the latency from worsening beyond RPC
to a central server.
In summary, Swarm’s contention-aware replication control mechanism provides better performance than either aggressive caching or exclusive use of RPCs for all levels
of contention. Swarm-based wide area enterprise proxy servers offer several benefits.
First, they improve overall service throughput and responsiveness beyond what can be
achieved with clustered services. Second, they automatically detect which objects are in
demand at various sites and automatically migrate them to those sites for good overall
performance. Finally, when there is heavy contention, they switch to a more efficient
centralized mechanism to stem performance degradation.
5.4 SwarmDB: Replicated BerkeleyDB
In this section, we evaluate Swarm’s effectiveness in providing wide area replication
to database and directory services. Since many distributed applications (such as authentication, user-profile, shopping-list, and e-commerce services) store their state in a database
or a directory to ease data management, efficient database replication benefits a variety
of applications. The consistency requirements of a replicated database are determined by
its application, and vary widely in the spectrum between strong and eventual consistency.
Currently, popular databases (such as mySQL, Oracle, and BerkeleyDB) predominantly
employ master-slave replication due to its simplicity; read-only replicas are deployed
near wide area clients to scale query performance, while updates are applied at a central
master site to ensure serializability. For applications that can handle concurrent updates
(e.g., directory services that locate resources based on attributes), master-slave replication is restrictive and cannot scale or exploit regional locality in updates. By using
Swarm to implement database replication, we can choose on a per-client basis how much
consistency is required. Thus, high throughput can be achieved when the consistency
requirements are less strict, e.g., a directory service [46], while the same code base can
be used to provide a strongly consistent database.
We augmented the BerkeleyDB embedded database library [67] with replication
support, since it is widely used for lightweight database support in a variety of applications including openLDAP [51] and authentication services [33]. We added support for
replication by wrapping a library called SwarmDB around the unmodified BerkeleyDB
library. SwarmDB stores each BerkeleyDB data structure such as a B-tree, a hash table
or a queue (in its unmodified BerkeleyDB persistent format, hereafter referred to as the
DB) in a Swarm file which is automatically cached on demand by Swarm servers near
application instances. SwarmDB transparently intercepts most of the BerkeleyDB data
access interface calls, and supports additional arguments to the DB open() interface with
which applications can specify configurable consistency options for each DB access session. Thus, an existing BerkeleyDB-based application can be retargeted to use SwarmDB
with simple modifications to its DB open() invocations, thereby enabling it to reap the
benefits of wide area caching. More details on how a SwarmDB-based application is
organized are given in Section 4.3. SwarmDB handles BerkeleyDB update operations
by invoking them as operational updates on the local Swarm server, treating an entire
DB as a single consistency unit. A SwarmDB plugin linked into each Swarm server is
used to apply update operations via the BerkeleyDB library on a local DB replica and
to resolve update conflicts. Swarm reapplies those operations at other replicas to keep
them consistent. Synchronizing DB replicas via operational updates allows transparent
replication without modifying the underlying database code, at the expense of increasing
contention at replicas when false sharing occurs. A relational database such as mySQL
could also be replicated across the wide area in a similar manner.
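To make the notion of operational updates concrete, the sketch below shows one plausible shape for the update records that the SwarmDB wrapper could ship to its local Swarm server; the record layout and function name are illustrative assumptions, not SwarmDB's actual wire format. The essential point is that only the operation and its arguments travel between replicas; the SwarmDB plugin replays them through the unmodified BerkeleyDB library at each site.

/* Minimal sketch (our own illustration, not SwarmDB's actual format) of an
 * operational update record: the operation type plus its arguments, rather
 * than the changed database pages. Swarm logs such records and replays them,
 * via the SwarmDB plugin, at every other replica of the DB file. */
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

enum swarmdb_op { SDB_PUT = 1, SDB_DEL = 2 };

struct swarmdb_update {
    uint32_t op;        /* SDB_PUT or SDB_DEL                  */
    uint32_t key_len;   /* length of the key that follows      */
    uint32_t data_len;  /* length of the value (0 for SDB_DEL) */
    /* key bytes, then data bytes, follow immediately after this header */
};

/* Serialize one update into a malloc'ed buffer; the caller frees it. */
static void *encode_update(enum swarmdb_op op,
                           const void *key, uint32_t key_len,
                           const void *data, uint32_t data_len,
                           size_t *out_len)
{
    struct swarmdb_update hdr = { (uint32_t)op, key_len, data_len };
    size_t len = sizeof(hdr) + key_len + data_len;
    char *buf = malloc(len);
    if (buf == NULL)
        return NULL;
    memcpy(buf, &hdr, sizeof(hdr));
    memcpy(buf + sizeof(hdr), key, key_len);
    if (data_len > 0)
        memcpy(buf + sizeof(hdr) + key_len, data, data_len);
    *out_len = len;
    return buf;
}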
To put SwarmDB's programming complexity in perspective, the original BerkeleyDB library is over 100K lines of C code. The SwarmDB wrapper library is 945 lines (397 semicolons) of
C code, which consists of simple wrappers around BerkeleyDB calls and glue code. The
SwarmDB plugin linked to each Swarm server consists of 828 lines (329 semicolons) of
C code to apply operational updates on the local copy and to provide conflict resolution
and log truncation. SwarmDB took a week to develop and two weeks to debug and to
remove performance bottlenecks in the plugin that arose when Swarm servers propagate
updates at full speed.
5.4.1 Evaluation Overview
In Section 5.4.2, we demonstrate the spectrum of consistency choices that SwarmDB-based replication with configurable consistency makes available to BerkeleyDB applications, and their diverse performance and scaling characteristics. We run a full-speed update-intensive DB workload on a SwarmDB database with a mix of the five distinct consistency flavors listed in Table 5.1 on up to 48 wide area replicas, and compare the throughput to that of a client-server (RPC) version of BerkeleyDB. Consistency flavors that require tight synchronization, such as strong and close-to-open consistency, offer throughput
similar to that of RPC, but do not scale beyond 16 replicas. Relaxing consistency even by
a small time-bound improves throughput by an order of magnitude. Finally, employing
eventual consistency improves throughput by another order of magnitude and scales well to 48 replicas.
In Section 5.4.3, we present a peer-to-peer (music) file sharing application modeled
after KaZaa. It implements its attribute-based file search directory as a distributed hierarchy of SwarmDB indexes, each replicated on demand with eventual consistency. In a
240-node file sharing network where files employ close-to-open consistency, Swarm's ability to obtain files from nearby replicas reduces the latency and WAN bandwidth consumption of new file downloads to roughly one-fifth that of random replica networking, and limits the impact of high node churn (5 node deaths/second). Thanks to its efficient implementation of lease-based consistency, Swarm recovers quickly from failed replicas, and fewer than 0.5% of downloads fail in spite of the high rate of node failures.
5.4.2 Diverse Consistency Semantics
We measure SwarmDB’s read and write throughput when running an update-intensive
BerkeleyDB workload on a database replicated with Swarm employing five consistency
flavors listed in Table 5.1. We compare SwarmDB against BerkeleyDB’s client-server
(RPC) implementation.
Table 5.1. Consistency semantics employed for replicated BerkeleyDB (SwarmDB) and the CC options used to achieve them. The unspecified options are set to [RD/WR, time=0, mod=∞, soft, no semantic deps, total order, pessimistic, session visibility & isolation].

Consistency Semantics                 CC options
locking writes (eventual rd)          WRLK
master-slave writes (eventual rd)     WR, serial
close-to-open rd, wr                  RD/WR, time=0, hard
time-bounded rd, wr                   time=10, hard
optimistic/eventual rd, wr            RD/WR, time=0, soft
In our synthetic SwarmDB benchmark, we create a BerkeleyDB B-tree inside a
Swarm file and populate it with 100 key-value pairs. The database size does not affect
the benchmark’s performance, except initial replica creation latency, because we employ
operational updates where Swarm treats the entire database (file) as a single consistency
unit and propagates operations instead of changed contents. We run a SwarmDB process
(i.e., a client process linked with the SwarmDB library) and a Swarm server on each of 2
to 48 nodes. The nodes run FreeBSD 4.7 and are each connected by a 1Mbps, 40-msec
roundtrip latency WAN link to a backbone router core using Emulab’s delayed LAN configuration illustrated in Figure 5.1. Each SwarmDB process executes an update-intensive
workload consisting of 10,000 random operations at full speed (back-to-back, without
think time) on its local database replica. The operation mix consists of 5% adds, 5%
deletes, 20% updates, 30% lookups, and 40% cursor-based scans. Reads (lookups and
cursor-based scans) are performed directly on the local database copy, while writes (adds,
deletes, and updates) are sent to the local Swarm server as operational updates. Each
operation opens a Swarm session on the database file in the appropriate mode, performs
the operation, and closes the session. We employed a high update rate in our workload
to characterize Swarm’s worst-case performance when employing various consistency
semantics.
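For concreteness, the following simplified sketch shows one iteration of this benchmark against an open BerkeleyDB handle; it is our reconstruction rather than the actual benchmark code, and the per-operation Swarm session open/close and error handling are elided.

/* Simplified sketch of one benchmark iteration (our reconstruction, not the
 * actual benchmark). 'db' is an open BerkeleyDB B-tree handle; under SwarmDB,
 * the write calls are intercepted and forwarded to the local Swarm server as
 * operational updates, while reads go to the local replica directly. */
#include <db.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static void run_one_op(DB *db)
{
    char keybuf[16], valbuf[32];
    DBT key, val;
    int r = rand() % 100;

    snprintf(keybuf, sizeof(keybuf), "key%d", rand() % 100);
    memset(&key, 0, sizeof(key));
    memset(&val, 0, sizeof(val));
    key.data = keybuf;
    key.size = strlen(keybuf) + 1;

    if (r < 5) {                        /* 5% adds */
        snprintf(valbuf, sizeof(valbuf), "value%d", rand());
        val.data = valbuf;
        val.size = strlen(valbuf) + 1;
        db->put(db, NULL, &key, &val, DB_NOOVERWRITE);
    } else if (r < 10) {                /* 5% deletes */
        db->del(db, NULL, &key, 0);
    } else if (r < 30) {                /* 20% updates */
        snprintf(valbuf, sizeof(valbuf), "value%d", rand());
        val.data = valbuf;
        val.size = strlen(valbuf) + 1;
        db->put(db, NULL, &key, &val, 0);
    } else if (r < 60) {                /* 30% lookups, served locally */
        db->get(db, NULL, &key, &val, 0);
    } else {                            /* 40% cursor-based scans */
        DBC *cur;
        if (db->cursor(db, NULL, &cur, 0) == 0) {
            while (cur->c_get(cur, &key, &val, DB_NEXT) == 0)
                ;                       /* walk every record */
            cur->c_close(cur);
        }
    }
}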
We employ the following consistency flavors, which in some cases differ for reads
and writes:
1. Locking writes and optimistic/eventual reads: This combination enforces strong consistency for writes via exclusive locks to guarantee their serializability, but provides eventual consistency for reads, so they can execute in parallel with writes. Write sessions are executed at multiple replicas, but serialized via locks. Their updates are propagated to other replicas by a best-effort eager push. This flavor can be used when a database's workload is dominated by queries that need not return the latest data, together with writes that exhibit high locality but require the latest data and serializability (e.g., a sales database).
2. Master-slave writes and eventual reads: This combination provides read-only
replication supported by many existing wide area database replication solutions
(e.g., BerkeleyDB and mySQL). Unlike locking writes, master-slave writes are
always forwarded to a single ‘master’ replica (the root replica in Swarm) where
they are executed serially, but interleaved with other local sessions. Updates are
propagated to other replicas by a best-effort eager push. This flavor can be used
when a database must support a higher proportion of writes issued from multiple
sites, or if write conflicts are hard to resolve and should be avoided (e.g., an
inventory database).
3. Close-to-open reads and writes: This flavor provides the latest snapshot view of the database as of the time it is opened; the view does not change during an access session.
It still allows writes to proceed in parallel at multiple replicas. This semantics is
provided by Oracle as snapshot isolation [52].
4. Time-bounded inconsistency for reads and writes: This flavor relaxes the ‘latest
data’ requirement of close-to-open consistency by allowing a small amount of
staleness in the view of data in exchange for reduced synchronization frequency
and increased parallelism among writes. This semantics is provided by the TACT
toolkit [77] via its staleness metric.
5. Optimistic/eventual reads and writes: This flavor allows maximal parallelism among
reads and writes at multiple replicas but only guarantees eventual convergence of
their contents. It performs a read or write operation without synchronizing with
other replicas, and propagates writes by a best-effort eager push. This semantics is
provided by the Microsoft Active Directory’s replication system [46].
Overall, SwarmDB’s throughput and scalability characteristics vary widely based on the
consistency flavor employed for replication. Compared to a client-server BerkeleyDB
implementation, the throughput of BerkeleyDB replicated with Swarm improves by up
to two orders of magnitude with optimistic/eventual consistency, and by an order of
magnitude even when replica contents are guaranteed not to be out of sync by more
than 20 msecs.
Figures 5.17 and 5.18 show the average throughput observed per replica for reads and
writes. For comparison, the graphs present BerkeleyDB’s baseline performance when
clients access the database (i) stored in the local file system (local) via the original library,
(ii) stored in a colocated BerkeleyDB server via RPCs (rpclocal) and, (iii) stored in a
colocated Swarm server via the SwarmDB library (slocal). Slocal represents the best
throughput achievable using SwarmDB on top of our Swarm prototype. The big drop
in the throughput of slocal relative to local is due to two artifacts of our unoptimized
prototype implementation. First, the SwarmDB library communicates with the local
Swarm server via socket-based IPC to open sessions and to issue updates, which also
slows down the rpclocal case. The IPC could often be avoided by employing a session
cache on the client side. Second, the Swarm prototype’s IPC marshalling code employs
verbose messages, and parsing them more than doubles the message processing latency
in the critical path of IPC. When we replaced it with a binary message format, we found
that SwarmDB’s local throughput doubles. We also found several other optimizations
that could be made to improve SwarmDB’s raw performance. However, they do not
change the results presented here qualitatively.
When SwarmDB employs read-write replication with strong or close-to-open consistency, client throughput quickly drops below that of RPC, and does not scale beyond
16 replicas. This workload’s high update rate requires replicas to synchronize across
WAN links for almost every operation to ensure consistency, incurring high latency.
When clients relax consistency by tolerating even a small amount (10ms) of staleness
in their results (i.e., employ time-bounded inconsistency), the per-client read and write
throughput improves by an order of magnitude over close-to-open consistency. This
is because the cost of synchronization over the wide area is very high, and amortizing it over multiple operations yields a substantial latency benefit, especially for a high-speed
workload such as ours. Thus, the configurable consistency framework’s staleness option is important for wide area caching. With time-bounded inconsistency, throughput
drops for large replica sets, but scales better than with close-to-open consistency. When
clients employ eventual consistency, they achieve read and write throughput close to the
Swarm local (slocal) case that also scales linearly to large replica sets for two reasons.
Figure 5.17. SwarmDB throughput observed at each replica for reads (lookups and cursor-based scans). 'Swarm local' is much worse than 'local' because SwarmDB opens a session on the local server (incurring an inter-process RPC) for every read in our benchmark.
Figure 5.18. SwarmDB throughput observed at each replica for writes (insertions, deletions and updates). Writes under master-slave replication perform slightly worse than RPC due to the update propagation overhead to slaves.
First, replicas synchronize by pushing updates in the background without stalling client
operations. Second, although replicas accept updates at a high rate and propagate them
eagerly to each other, Swarm servers make use of their local SwarmDB plugin to remove
self-canceling updates (such as the addition and subsequent removal of a database entry)
from the local update log and avoid their unnecessary propagation, reducing update
propagation traffic. The write throughput under eventual consistency is less than the
read throughput because the updates involve an IPC to the local Swarm server, whereas
the SwarmDB library performs a read operation directly on the underlying BerkeleyDB
database.
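As an illustration of the self-canceling-update optimization mentioned above, the sketch below (our own, not the SwarmDB plugin's code) prunes a pending update log by dropping a key's creation, intermediate updates, and deletion when they all occur before the log is propagated.

/* Minimal sketch (not the actual SwarmDB plugin) of pruning self-canceling
 * updates from a pending update log: if an entry is added and then deleted
 * before the log is propagated, none of its operations need to be sent. */
#include <string.h>

enum op { OP_ADD, OP_UPDATE, OP_DEL };

struct log_entry {
    enum op op;
    char    key[32];
    int     dead;                /* 1 if pruned from the log */
};

/* Returns the number of live entries remaining in log[0..n-1]. */
static int prune_self_canceling(struct log_entry *log, int n)
{
    int live = n;
    for (int i = 0; i < n; i++) {
        if (log[i].op != OP_DEL || log[i].dead)
            continue;
        /* Look backwards for a still-pending ADD of the same key. */
        int add_idx = -1;
        for (int j = i - 1; j >= 0; j--) {
            if (log[j].dead || strcmp(log[j].key, log[i].key) != 0)
                continue;
            if (log[j].op == OP_ADD) { add_idx = j; break; }
            if (log[j].op == OP_DEL) break;   /* key predates this window */
        }
        if (add_idx < 0)
            continue;            /* the entry existed before; the delete must propagate */
        /* The entry was created and destroyed within this log: drop the ADD,
         * any intermediate UPDATEs of the key, and the DEL itself. */
        for (int j = add_idx; j <= i; j++) {
            if (!log[j].dead && strcmp(log[j].key, log[i].key) == 0) {
                log[j].dead = 1;
                live--;
            }
        }
    }
    return live;
}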
When clients simultaneously employ different semantics for read and write operations, such as strong consistency for writes and eventual consistency for reads, they achieve high throughput for reads but low throughput for writes. In other words,
strongly consistent write accesses do not affect the performance of simultaneous weakly
consistent reads. Thus, Swarm can support multiple clients operating on the same data
with different consistency requirements simultaneously without interference, provided
their semantics do not conflict.
In summary, Swarm enables applications to use the same SwarmDB and application
code base to achieve database replication for diverse sharing needs by simply employing
different configurable consistency options. Our experiment demonstrated that employing
different consistency flavors can result in order-of-magnitude differences in the throughput and scalability of the resulting application.
5.4.3 Failure-resilience and Network Economy at Scale
For configurable consistency to provide a viable consistency solution for wide area
replication, its implementation must not only utilize network capacity efficiently in nonuniform networks, but also gracefully recover from failures and enforce consistency in
a dynamic environment where a large number of replicas continually join and abruptly
leave the sharing network (a phenomenon called node churn). In this section, we demonstrate the failure resilience and network economy of Swarm’s replication and lease-based
consistency mechanisms under node churn involving hundreds of nodes.
5.4.4 Evaluation
We emulate a large wide area file sharing network where a number of peer nodes
continually join and leave the network. Each peer comes online, looks up files in a
SwarmDB-based shared index and downloads previously unaccessed files for a while
by accessing them via its local Swarm server, and then abruptly goes offline, killing the
server as well. As files get cached widely, the latency to download a new file drops
since Swarm is likely to obtain it from nearby peers. To evaluate how the lease-based
failure resilience of Swarm’s consistency mechanisms performs under churn, we employ
close-to-open consistency for shared files, which forces Swarm to maintain leases. We
never update the files during the experiment, so when a replica loses contact with its
parent, it must reconnect to the hierarchy to ensure that it has the latest data before it
can serve the file to another node. We employ two metrics to evaluate the performance
of Swarm’s replication and consistency mechanisms under churn: (1) the latency to
download a new file, and (2) the wide area network bandwidth consumed for downloads.
Our results indicate that Swarm’s proximity-aware replica management quickly detects
nearby replicas even in a network of 240 nodes, reducing download latency as well as
WAN bandwidth to roughly one-fifth that of Swarm with proximity-awareness disabled.
We emulate the network topology shown in Figure 5.19. Machines are clustered in
campuses spread over multiple cities across a wide area network. Campuses in the same
city are connected via a 10Mbps network with 10ms RTT. Campuses in neighboring
cities are connected via a 5Mbps network to a backbone router, and have 50ms RTT
to each other. Campuses separated by the WAN have 5Mbps bandwidth and 150ms
roundtrip latency between them and need to communicate across multiple backbone
routers (two in our topology). Each user node runs a file sharing agent and a Swarm
server, which are started and stopped together. To conduct large-scale experiments on
limited physical nodes, we emulate 10 user nodes belonging to a campus on a single
physical machine. Thus, we emulate 240 user nodes on a total of 24 physical machines.
The CPU, memory, and disk were never saturated on any of the machines during our
experiment.
Figure 5.19. Emulated network topology for the large-scale file sharing experiment. Each oval denotes a 'campus LAN' with ten user nodes, each running a SwarmDB client and Swarm server. The labels denote the bandwidth (bits/sec) and one-way delay of network links (from node to hub).
We designate 10% of the nodes as ‘home’ nodes (i.e., custodians) that create a total
of 1000 files and make them available for sharing by adding their keys (i.e., attribute-value pairs) to the SwarmDB index. Although Swarm's replication mechanism itself can
handle transient root custodian failures, the highly simplified SWID lookup mechanism
of our Swarm prototype requires the custodian to be accessible to help bootstrap a new
replica, as explained in Section 4.6.1. Hence we configured the custodian nodes never to
go down during the experiment.
Once every 1-3 seconds, each node looks up the SwarmDB index for files using a
random key and downloads those not available locally by accessing them via Swarm. We
chose a file size of 50KBytes so that files are large enough to stress the bandwidth
usage of the network links, but small enough to expose the overheads of Swarm’s replica
hierarchy formation. We initially bring up the home nodes and then the other nodes in
random order, one every 2 seconds. We run them for a warm up period (10 minutes)
to build replica hierarchies for various files. We then churn the nonhome nodes for a
period of 30 minutes, followed by a quiet (i.e., no churn) period of 10 minutes before
shutting down all the nodes. Peer nodes have an exponentially distributed lifetime (i.e.,
time spent online). We simulate three different median lifetimes (30 seconds, 1 minute,
and 5 minutes) by initiating node death events using a Poisson process that makes them
uncorrelated and bursty. We start a new node each time one is killed to keep the total
number of nodes constant during the experiment. This model of churn is similar to that
described by Liben-Nowell et al. [42] and modeled during the evaluation of the Bamboo
DHT system [59]. We deliberately simulate very short node lifetimes to examine the
sensitivity of Swarm’s replication and consistency mechanisms to high node churn rates,
ranging from an average of 5 node deaths/sec to a node dying every 2 seconds.
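For reference, the following sketch (illustrative only, not the experiment driver) shows how node lifetimes with a given median can be drawn from an exponential distribution so that death events form a Poisson process.

/* Illustrative sketch of sampling exponentially distributed node lifetimes
 * with a given median (median = ln(2) / lambda for an exponential). Whenever
 * a peer starts, its death is scheduled this far in the future; a replacement
 * peer is then started to keep the population constant. */
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

static double sample_lifetime(double median_secs)
{
    double lambda = M_LN2 / median_secs;
    double u = drand48();               /* uniform in [0,1) */
    return -log(1.0 - u) / lambda;      /* inverse-CDF sampling */
}

int main(void)
{
    const double median = 60.0;         /* e.g., the 1-minute configuration */
    srand48(42);
    for (int i = 0; i < 5; i++)
        printf("peer %d lifetime: %.1f s\n", i, sample_lifetime(median));
    return 0;
}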
We compare three configurations of Swarm with different replica networking schemes.
In the first configuration, denoted 'random hierarchies', we disable Swarm's proximity awareness and its associated distance estimation mechanism so that replicas randomly
connect to one another without regard for network distances. In the other configurations,
we enable proximity-aware replica networking (denoted ‘WAN-aware’), but employ different policies for how the Swarm server hosting a file replica reacts when it finds a
nearer parent replica candidate than its current parent (i.e., with a shorter RTT). In the
second configuration, denoted ‘eager download’, Swarm reconnects the replica to the
new parent in the background, and does not disconnect from the old parent until the
reconnection succeeds. As a result, a new replica ends up downloading file contents from
the first nearby replica willing to serve as its parent. This policy is based on the intuition
that eagerly disrupting a replica hierarchy based on potentially inaccurate proximity
information could increase the latency of ongoing accesses. In the third configuration,
denoted ‘deferred download’, so long as a replica does not have valid file contents
(e.g., when it is created), Swarm disconnects it from its current parent in addition to
reconnecting to the new parent candidate. Thus, the third policy is more aggressive in
choosing a nearer parent for a replica before the file download starts and is based on
the intuition that spending a little more time initially to find a nearer parent pays off by
utilizing the network bandwidth more efficiently.
Figure 5.20 shows a timeline of the mean latency observed for new file accesses by all
nodes in each of the three configurations. Initially, file latencies with the ‘WAN-aware’
schemes are high because a large number of nodes come online for the first time, learn
about each other, and find relative network distances by pings. However, access latencies
quickly drop as the network traffic subsides and nodes learn about nearby nodes and
file replicas. In the ‘random hierarchies’ scheme, file access latencies do not improve
over time. Though both of the WAN-aware schemes are affected by node churn, their
proximity awareness mechanism gets files from nearby replicas, and results in much
better latencies than with the random scheme. Under node churn, the eager download
scheme performs worse than the deferred scheme, as a new replica often initiates a file
download from a distant replica before it finds and connects to a nearby parent. The
deferred download scheme utilizes the network much more efficiently, resulting in better
latencies that approach those seen during the stable network phase even when a node is
replaced every 2 seconds (corresponding to a median node life span of 5 minutes).
To evaluate the WAN bandwidth savings generated by Swarm’s proximity-aware
replication, we measured the bytes transferred over the inter-router links. Figure 5.21
shows the incoming bandwidth consumed on router R1’s WAN link by Swarm traffic
in each of the configurations. The observed bandwidth reductions are similar to the
access latency reductions discussed above. To evaluate the overhead imposed by Swarm
performing background reconnections in the replica hierarchy, we ran an instance of the
experiment in which we disabled reconnections, i.e., a replica with a valid parent never
reconnects to a nearer parent; we observed virtually no performance improvement, which
indicates that the overhead of replica hierarchy reorganization is not significant even
under high churn at this scale. Fewer than 0.5% of file accesses failed, mainly because a
Swarm server gets killed before it replies to its local client’s file request. Finally, fewer
than 0.5% of the messages received by any Swarm server were root probes that Swarm
uses to detect cycles in the replica network (as explained in Section 4.6), indicating
that our replica cycle detection scheme has negligible overhead even in such a dynamic
network.
Figure 5.20. New file access latencies in a Swarm-based peer file sharing network of 240 nodes under node churn.
In summary, Swarm’s proximity-aware replication and consistency mechanisms enable a peer file sharing application to support widespread coherent file sharing among
hundreds of WAN nodes while utilizing the network capacity efficiently. Swarm's proximity-aware replication reduces file download latency and WAN bandwidth utilization to
one-fifth that of random replication. Its lease-based consistency mechanism continues to
provide close-to-open consistency to users in a large, rapidly changing replica network
with nodes failing and new nodes joining at a high rate (up to 5 nodes per second).
Figure 5.21. The average incoming bandwidth consumed at a router's WAN link by file downloads. Swarm with proximity-aware replica networking consumes much less WAN link bandwidth than with random networking.
5.5 SwarmCast: Real-time Multicast Streaming
Real-time collaboration involves producer-consumer style interaction among multiple users or application components in real-time. Its key requirement is that data from producers must be delivered to a potentially large number of consumers with minimal latency while utilizing network bandwidth efficiently. For instance, online chat (for text as well as multimedia) and distributed multiplayer gaming require efficient broadcast of text, audio, video, or player moves between all participants in real-time to keep their views of the chat transcript or game state consistent. A live media streaming application
must disseminate multimedia efficiently from one producer to multiple subscribers in
real-time. The data-delivery requirement of these applications can be viewed as the need
to deliver a stream of updates to shared state in real-time to a large number of replicas
without blocking the originator of those updates. When viewed thus, the requirement
translates to eventual consistency with best-effort (i.e., soft) push-based updates.
Traditionally, a client-server message-passing architecture is employed for these applications, wherein producers send their data to a central relay server which disseminates
the data to interested consumers. It is well known that multicasting data via a network
of relays eases the CPU and network load on the central server and scales to a large
number of consumers, at the expense of increasing the end-to-end message transfer delay
[53]. Since a multicast relay network reduces load on each individual relay, it is able
to disseminate events in real-time at a higher rate than a single server can. Though
the configurable consistency framework and Swarm are not specifically optimized for
real-time multicast support, their flexible interface allows application designers to leverage Swarm’s coherent hierarchical caching support for scalable multicast over LANs or
WANs without incurring the complexity of a custom multicast solution. To evaluate
Swarm’s effectiveness at supporting real-time multicast, we have built a synthetic data
flow application called SwarmCast. SwarmCast stresses Swarm’s ability to support
real-time synchronization of a large number of replicas with eventual consistency in
the presence of high-speed updates. It validates our hypothesis that the configurable
consistency framework can be used effectively for applications that require real-time
data dissemination.
5.5.1 Results Summary
Our results show that SwarmCast is able to scale its transmission bandwidth and
the number of real-time consumers supported linearly simply by redirecting them to
more Swarm servers. With 120 consumers on a 100Mbps switched LAN, a SwarmCast
producer is able to push data via the Swarm server network at 60% of the rate possible using
an ideal multicast relay network of equivalent size, while keeping end-to-end latency
below 250ms. Half of this latency is incurred on the first hop as messages get queued to
be picked up by the root server. This is because Swarm servers are compute-bound due
to message marshaling overheads in our unoptimized prototype (e.g., mallocs and data
copying).
5.5.2 Implementation
We implemented SwarmCast as an extension to the Unix ttcp program. A single
producer simulates a high-speed sequence of events by sending a continuous stream
of fixed size (200-byte) data packets to a central relay server. Numerous subscribers
obtain those events in real-time by registering with the relay or one of its proxies. We
implement two types of relays: (i) a single hand-coded relay server that unicasts each
packet to subscribers via TCP/IP sockets (implemented by modifying the Unix ttcp
program), and (ii) a network of Swarm servers. To use Swarm servers as relays, we built
a wrapper library for SwarmCast that appends event packets to a shared log file hosted
at an origin Swarm server as semantic updates. At SwarmCast subscribers, the wrapper
library snoops for events by registering with one of a set of Swarm servers to receive log
file updates (using Swarm's sw_snoop() API, described in Section 4.2.1). In response, Swarm servers cache the log file from the origin and receive
semantic updates with eventual consistency via an eager best-effort push protocol. The
Swarm servers build a replica hierarchy automatically and use it to propagate updates to
each other as well as to the SwarmCast subscribers. Thus, the Swarm replica hierarchy
serves as a multicast overlay network for real-time event dissemination. The wrapper
library consists of about 250 lines of C code, including code to measure packet latencies
and arrival rates, and does not involve any network I/O.
5.5.3 Experimental Setup
We run Swarm servers on each of up to 20 PCs running FreeBSD 4.10. We run
the producer process on a separate PC, but colocate multiple subscriber processes to
run up to 120 subscribers on 30 PCs. Colocating subscriber processes on a single
node did not affect our results; the CPU and memory utilization never crossed 50% on
those machines. There were no disk accesses during our experiments. We distribute the
subscribers evenly among the available relays (i.e., Swarm servers) such that each relay’s
total fanout (i.e., the number of local subscribers + child relays) is approximately equal.
The producer sends 20480 200-byte packets to the root relay as one-way messages over a
TCP connection, follows them up with an RPC request to block till they are all received,
and measures the send throughput. We choose a low packet size of 200 bytes to stress
the effect of Swarm’s message payload overhead. We expect that Swarm would perform
better with higher packet sizes. To highlight the effect of CPU utilization at the relays
on SwarmCast’s overall performance, we run our experiments with all nodes connected
by a 100Mbps switched LAN that provisions nodes with ample network capacity. To
study the effect of Swarm replica tree depth on performance, we conduct the Swarm
experiments with Swarm fanout limits of 2 and 4.
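The producer's send loop is essentially the following (a simplified sketch, not the actual ttcp extension; the relay address and port are placeholders, and the final blocking RPC is reduced to a comment).

/* Simplified sketch of the SwarmCast producer's send loop (not the actual
 * ttcp extension). It streams 20480 fixed-size 200-byte packets to the root
 * relay over one TCP connection and reports the send throughput.
 * RELAY_ADDR and RELAY_PORT are placeholders. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

#define RELAY_ADDR "10.0.0.1"   /* placeholder for the root relay's address */
#define RELAY_PORT 5010         /* placeholder port */
#define PKT_SIZE   200
#define NUM_PKTS   20480

int main(void)
{
    struct sockaddr_in addr;
    char pkt[PKT_SIZE];
    struct timeval start, end;
    int s = socket(AF_INET, SOCK_STREAM, 0);

    if (s < 0) { perror("socket"); return 1; }
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(RELAY_PORT);
    inet_pton(AF_INET, RELAY_ADDR, &addr.sin_addr);
    if (connect(s, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("connect");
        return 1;
    }

    memset(pkt, 'x', sizeof(pkt));
    gettimeofday(&start, NULL);
    for (int i = 0; i < NUM_PKTS; i++) {
        /* One-way message: no per-packet acknowledgment is awaited. */
        if (write(s, pkt, sizeof(pkt)) != sizeof(pkt)) {
            perror("write");
            return 1;
        }
    }
    /* The real producer now issues a blocking RPC to the relay and waits until
     * all packets have been received before computing throughput. */
    gettimeofday(&end, NULL);

    double secs = (end.tv_sec - start.tv_sec) + (end.tv_usec - start.tv_usec) / 1e6;
    printf("send throughput: %.1f KB/sec\n", NUM_PKTS * PKT_SIZE / 1024.0 / secs);
    close(s);
    return 0;
}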
5.5.4 Evaluation Metrics
We employ three metrics to evaluate relay performance: (i) the aggregate bandwidth
delivered to SwarmCast subscribers (the sum of their observed receive throughput),
(ii) the rate at which the producer can push data (i.e., its send bandwidth), and (iii)
the average end-to-end delay for packets travelling from the producer to subscribers at
various levels of the multicast tree. We measure end-to-end packet delay as follows. We
run the NTP time sync daemon on all the nodes to synchronize their clocks to within
10 milliseconds of each other. We measure the clock skew between the producer and
subscriber nodes by issuing probe RPCs and timing them. We timestamp each packet
at the producer and receiving subscriber (using the gettimeofday() system call),
measure their difference, and adjust it based on their clock skew. We track the minimum,
average, and maximum of the observed delays.
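A sketch of the resulting per-packet delay computation appears below (illustrative only; the skew variable and timestamp handling are our own simplification of the procedure just described).

/* Illustrative sketch of the end-to-end delay computation. The producer
 * stamps each packet with gettimeofday(); the receiving subscriber timestamps
 * arrival, takes the difference, and corrects it by the clock skew estimated
 * earlier from timed probe RPCs (a placeholder value here). */
#include <stdio.h>
#include <sys/time.h>

static long clock_skew_usec = 0;   /* subscriber clock minus producer clock, in usec */

static double packet_delay_ms(const struct timeval *sent_at)
{
    struct timeval now;
    gettimeofday(&now, NULL);

    long delta_usec = (now.tv_sec - sent_at->tv_sec) * 1000000L
                    + (now.tv_usec - sent_at->tv_usec)
                    - clock_skew_usec;           /* remove estimated skew */
    return delta_usec / 1000.0;
}

int main(void)
{
    /* Example: pretend a packet was stamped 42 ms ago. */
    struct timeval sent;
    gettimeofday(&sent, NULL);
    sent.tv_usec -= 42000;
    if (sent.tv_usec < 0) { sent.tv_usec += 1000000; sent.tv_sec -= 1; }
    printf("measured delay: %.2f ms\n", packet_delay_ms(&sent));
    return 0;
}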
Since each relay can only send data at its link’s maximum capacity of 100Mbps,
multiple relays should theoretically deliver aggregate throughput equal to the sum of
their link capacities. Also, the producer’s ideal send bandwidth equals the root relay’s
link capacity divided by its total fanout. We compare the performance of Swarm and our
hand-coded ttcp-based relay against this ideal performance.
5.5.5 Results
Figures 5.22 and 5.23 compare the throughput and latency characteristics of a single
Swarm relay with our hand-coded TTCP-based relay. The high end-to-end latency of
Swarm in Figure 5.23 indicates that Swarm is compute-bound. The high latency is caused
by packets getting queued in the network due to the slow relay. To confirm this, we
measured the latency from packet arrival at the relay to arrival at subscribers and found
it to be less than 0.5ms. Since a Swarm relay is compute-bound, when there is a single
subscriber, it delivers only 52% of its outgoing link bandwidth and accepts data at a much
slower rate (29%). Our analysis of Swarm’s packet processing path revealed that the
inefficiency is mostly an artifact of our unoptimized prototype implementation, namely,
30% extra payload in Swarm messages and unnecessary marshaling and unmarshaling
overheads including data copying and dynamic heap memory allocation in the critical
path. When we ran the same experiment on faster PCs with 3GHz Pentium 4 CPUs and 1GB of RAM, the delays dropped by a factor of 3, and send and receive throughput went
up beyond 90% of the ideal values.
Figure 5.22. Data dissemination bandwidth of a single Swarm vs. hand-coded relay for various numbers of subscribers across a 100Mbps switched LAN. With a few subscribers, the Swarm-based producer is CPU-bound, which limits its throughput. But with many subscribers, Swarm's pipelined I/O delivers 90% efficiency. Our hand-coded relay requires a significant redesign to achieve comparable efficiency.
When a Swarm relay serves a large number of subscribers, Swarm’s pipelined I/O
implementation enables it to deliver more than 90% of its link bandwidth. The multicast
efficiency of our hand-coded TTCP relay (percent of outgoing link capacity delivered
to consumers) is 90% with up to 15 subscribers, but drops down to 40% with more
than 15 subscribers. This drop is an artifact of our simplistic TTCP implementation
that processes packets synchronously (i.e., sends a packet to all outgoing TCP sockets
before receiving the next packet). This effect is visible in Figure 5.23 as a high maximum
packet delay with the hand-coded relay when serving more than 15 subscribers. When
using Swarm, the send throughput of the SwarmCast producer is lower (60%) than the
subscribers' aggregate receive throughput (90%, see Figure 5.22), because the high queuing delay incurred by packets on the first hop to the root server was included in the computation
of send throughput.
When multiple Swarm servers are deployed to relay packets, Figures 5.24 and 5.25
show that both the producer’s sustained send bandwidth and the aggregate multicast
throughput scale linearly. The multicast throughput is shown normalized to 100Mbits/sec, which is the capacity of our LAN links. Figure 5.26 demonstrates that Swarm's multicast efficiency does not degrade significantly as more servers are deployed. Swarm-based multicast with up to 20 relays yields more than 80% of the ideal multicast throughput. Throughput is largely unaffected by Swarm's relay fanout (i.e., the number of relays served by a given relay) and hence by the depth of the relay hierarchy.
Figure 5.23. End-to-end packet latency of a single relay with various numbers of subscribers across a 100Mbps LAN, shown on a log-log scale. The Swarm relay induces an order of magnitude longer delay as it is CPU-bound in our unoptimized implementation. The delay is due to packets getting queued in the network link to the Swarm server.
Figure 5.27 shows that when multiple Swarm servers are employed to multicast
packets, SwarmCast's end-to-end latency decreases drastically because proportionately less load falls on each individual server. The reduction is particularly pronounced at the first two levels of servers in the relay hierarchy. This effect can be observed clearly
in Figure 5.28, which presents the latency contributed by each level of the Swarm relay
hierarchy to the overall end-to-end packet latency on a log-log scale.
In summary, the SwarmCast application demonstrates that Swarm supports scalable real-time data multicast effectively, in addition to offloading its implementation complexity from applications. In spite of our unoptimized Swarm implementation, Swarm-based multicast provides over 60% of the ideal multicast bandwidth with a large number of relays and subscribers, although it increases end-to-end latency by about 200 milliseconds. Through this application, we also illustrated how, by using Swarm, the complexity of optimizing network I/O and building multicast overlays can be offloaded from application design while achieving reasonably efficient real-time multicast.
Figure 5.24. A SwarmCast producer's sustained send throughput (KBytes/sec) scales linearly as more Swarm servers are used for multicast (shown on a log-log scale), regardless of their fanout.
5.6 Discussion
In this chapter, we have presented our evaluation of the practical effectiveness of
the configurable consistency framework for supporting wide area caching. With four
distributed services built on top of Swarm, we showed how our implementation of configurable consistency meets diverse consistency requirements, enables applications to
exploit locality, utilizes wide area network resources efficiently and is resilient to failures
at scale - four properties that are important for effective wide area replication support.
Figure 5.25. Adding Swarm relays linearly scales multicast throughput on a 100Mbps switched LAN. The graph shows the throughput normalized to a single ideal relay that delivers the full bandwidth of a 100Mbps link. A network of 20 Swarm servers delivers 80% of the throughput of 20 ideal relays.
In Section 5.2, we presented a distributed file system called Swarmfs that supports
the close-to-open semantics required for sequential file sharing over the wide area more
efficiently than Coda. Unlike other file systems, Swarmfs can also support reliable file
locking to provide near-local file latency for source code version control operations when
development is highly localized, and up to half the latency of a traditional client-server
system. This demonstrates the customizability and network economy of our consistency
solution.
In Section 5.3, we showed that Swarm’s novel contention-aware replication control
mechanism enables enterprise services to efficiently provide wide area proxy caching
of enterprise objects with strong consistency under all levels of contention for shared
objects. The mechanism automatically inhibits caching when there is high contention
to provide latency within 10% of a traditional client-server system, and enables caching
when accesses are highly localized to provide near-local latency and throughput.
Figure 5.26. Swarm's multicast efficiency and throughput scaling, expressed as a percentage of ideal throughput for the send and receive paths with Swarm relays (fanouts 2 and 4) and a hand-coded relay. The root Swarm server is able to process packets from the producer at only 60% of the ideal rate due to its unoptimized, compute-bound implementation, but its efficient hierarchical propagation mechanism disseminates them at higher efficiency. The fanout of the Swarm network has only a minor impact on throughput.
In Section 5.4, we presented a Swarm-based library to augment the BerkeleyDB
database library with replication support. We showed how database-backed applications
using this library to cache the database can leverage the same code base to meet different
consistency requirements by specifying different composable consistency choices. We
showed that selectively relaxing consistency for reads and writes improves application
throughput by several orders of magnitude and enhances scalability.
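To make this reuse-through-reconfiguration concrete, the sketch below shows how an application might attach consistency options to its database sessions so that the same replicated code path serves both strongly and weakly consistent workloads. It is a minimal illustration under assumed names: the option enums and the open_session call are hypothetical placeholders, not the actual Swarm or BerkeleyDB-wrapper interface.

```cpp
// Illustrative sketch only: the option names and the session call below are
// hypothetical stand-ins for a configurable-consistency interface; they are
// not the actual Swarm or BerkeleyDB-wrapper API.
#include <cstdio>

enum class Concurrency { Exclusive, Concurrent };   // writes serialized vs. parallel
enum class Sync        { Eager, Manual };           // when replicas are synchronized
enum class Failure     { Pessimistic, Optimistic }; // block or proceed on disconnection

struct ConsistencyOptions {
    Concurrency writes;
    Sync        propagation;
    Failure     on_failure;
};

// Hypothetical session open: the same replicated-database code path is reused;
// only the options passed at open time change.
void open_session(const char* db, const ConsistencyOptions& opts) {
    std::printf("open %s: writes=%s sync=%s failure=%s\n", db,
                opts.writes == Concurrency::Exclusive ? "exclusive" : "concurrent",
                opts.propagation == Sync::Eager ? "eager" : "manual",
                opts.on_failure == Failure::Pessimistic ? "pessimistic" : "optimistic");
}

int main() {
    // Strong, bank-style configuration: exclusive writes, eager propagation.
    open_session("accounts.db", {Concurrency::Exclusive, Sync::Eager, Failure::Pessimistic});
    // Relaxed, catalogue-style configuration: concurrent writes, manual synchronization.
    open_session("catalogue.db", {Concurrency::Concurrent, Sync::Manual, Failure::Optimistic});
}
```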
In Section 5.4.3, we showed how Swarm’s proximity-aware replication can be leveraged to support widespread coherent read-write file sharing among hundreds of WAN
nodes while reducing WAN bandwidth consumption to roughly one-fifth of that incurred by random replication. Swarm’s
lease-based consistency enforcement mechanism ensures that users get close-to-open
consistency for file sharing with high availability in a large network with nodes failing
and new nodes joining at a high rate (up to 5 nodes per second). This application
demonstrates the failure-resilience and network economy properties of our consistency
solution at scale.
Figure 5.27. End-to-end propagation latency via the Swarm relay network, for relay fanouts of 2 and 4 and message paths of two to six hops, together with the percentage increase in latency due to the extra hops. Swarm servers are compute-bound for packet processing and cause packet queuing delays. However, when multiple servers are deployed, individual servers are less loaded, which reduces the overall packet latency drastically. With a deeper relay tree (obtained with a relay fanout of 2), average packet latency increases marginally due to the extra hops through relays.
Together, these results validate our thesis that the small set of configurable consistency options provided by our consistency framework can satisfy the diverse sharing
needs of distributed applications effectively. By adopting our framework, a middleware
data service (as exemplified by Swarm) can provide coherent wide area caching support
for diverse distributed applications/services.
Figure 5.28. Packet delay contributed by the various levels of the Swarm multicast tree (shown on a log-log scale), for relay fanouts of 2 and 4. The two slanting lines indicate that the root relay and its first-level descendants are operating at their maximum capacity and are clearly compute-bound. Packet queuing at those relays contributes the most to the overall packet latency.
CHAPTER 6
RELATED WORK
Data replication as a technique to improve the availability and performance of distributed services has been studied extensively. Since our work targets
middleware support for replicated data management, it overlaps with a large body of existing
work. In this chapter, we discuss prior research on replication and consistency in relation
to Swarm. For ease of presentation, we classify existing replicated data management
systems into four categories: (i) systems that target specific application domains (e.g.,
file systems, conventional databases, collaboration), (ii) systems that provide flexible
consistency management, (iii) systems that address the design issues of large-scale or
wide area replication, and (iv) systems that provide middleware or infrastructural support
for distributed data management.
The array of available consistency solutions for replicated data management can be
qualitatively depicted as a two-dimensional spectrum shown in Figure 6.1 based on the
diversity of data sharing needs they support (flexibility) and the design effort required
to employ them for a given application (difficulty of use). Domain-specific solutions lie
at one corner of the spectrum. They are very well tuned and ready to use for specific
data sharing needs, but are hard to adapt to a different context. At the other extreme lie
application-defined consistency management frameworks such as those of Oceanstore
[22] and Globe [75]. They can theoretically support arbitrarily diverse sharing needs,
but only provide hooks in the data access path where the application is given control
to implement its own consistency protocol. Configurable consistency and TACT offer
a middle ground in terms of flexibility and ease of use. They both provide a generic
consistency solution with a small set of options that enable their customization for a
variety of sharing needs. However, to use the solution, the specific consistency needs of
an application must be expressed in terms of those options.
Figure 6.1. The spectrum of available consistency management solutions, categorized by the variety of application consistency semantics they cover (flexibility, ranging from one through few and many to most) and the effort required to employ them in application design (difficulty of use, ranging from preset through select and configure to implement). Domain-specific systems (NFS, Pangaea, databases) and preset-choice systems (Fluid Replication, WebOS, Coda) occupy the low-flexibility, low-effort corner; application-defined frameworks (Oceanstore, Globe) occupy the high-flexibility, high-effort corner; Swarm and TACT lie in between.
Next, we discuss several existing approaches to consistency management in relation
to configurable consistency.
6.1 Domain-specific Consistency Solutions
The data-shipping paradigm is applicable to many application domains, and a variety
of replication and consistency solutions have been devised for various specific domains.
An advantage of domain-specific solutions is that they can exploit domain knowledge for
efficiency. In this section, we outline the consistency solutions proposed in the context
of a variety of application classes and compare them with configurable consistency and
Swarm.
Distributed file systems such as NFS [64], Sprite [48], AFS [28], Coda [38], Ficus
[57], ROAM [56] and Pangaea [62] target traditional file access with low write-sharing
among users. They provide different consistency solutions to efficiently support specific
file sharing patterns and environments. For instance, Sprite provides the strong consistency guarantee required for reliable file locking in a LAN. AFS provides close-to-open
consistency to support coherent sharing of source files and documents. Coda provides
weak/eventual consistency to improve file availability during disconnected operation.
Ficus, ROAM and Pangaea provide eventual consistency that suits rarely write-shared
files in a peer replication environment. In contrast, building a file system on top of
Swarm makes it possible to support multiple file sharing patterns with different consistency
needs in a single wide-area file system. Employing a flexible consistency solution such
as configurable consistency or TACT allows consistency settings to be configured for
individual files on a per-access basis.
Many distributed file systems (e.g., Coda and Fluid Replication [16]) track the consistency of file replicas in terms of file sets (such as volumes), which can significantly
reduce the bookkeeping cost of replica maintenance relative to tracking files individually.
For instance, this allows changes to a volume to be quickly detected to avoid checks on
individual files (which could be numerous). This optimization exemplifies how domain-specific knowledge can be used to improve the performance of a consistency solution.
Employing such an optimization in conjunction with configurable consistency would require us to treat volumes as replicated objects. Then, volume-level updates can be used
as a trigger to enable checks on individual files.
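A minimal sketch of this volume-as-object idea follows. The types and version fields are hypothetical illustrations of the trigger mechanism, not code from Coda, Fluid Replication, or Swarm.

```cpp
// Hypothetical illustration of volume-level change detection: a cached volume
// carries a version number, and only when that version advances do we bother
// revalidating the (potentially numerous) files it contains.
#include <string>
#include <vector>
#include <cstdint>
#include <iostream>

struct CachedFile   { std::string name; bool needs_check = false; };
struct CachedVolume {
    uint64_t version = 0;                 // last volume version seen by this replica
    std::vector<CachedFile> files;
};

// Called when the volume object itself is synchronized with its replica parent.
void on_volume_update(CachedVolume& vol, uint64_t remote_version) {
    if (remote_version == vol.version) return;        // nothing changed: skip per-file checks
    for (auto& f : vol.files) f.needs_check = true;   // something changed: revalidate files lazily
    vol.version = remote_version;
}

int main() {
    CachedVolume vol{7, {{"a.c"}, {"b.h"}}};
    on_volume_update(vol, 7);   // no-op: volume unchanged
    on_volume_update(vol, 8);   // marks a.c and b.h for revalidation
    for (const auto& f : vol.files)
        std::cout << f.name << (f.needs_check ? " stale\n" : " fresh\n");
}
```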
Database system designers have devised numerous ways to weaken consistency among
replicas to improve concurrency and availability [8, 52, 39, 55, 7, 1, 40]. Objectstore
[41] provides multiversion concurrency control (MVCC), which allows clients accessing
read-only replicas to see stale but internally consistent snapshots of data without blocking
writers. It improves concurrency while ensuring serializability. Degree-two consistency
[8] avoids cascading transaction aborts by allowing a transaction to release and reacquire
read locks on data and obtain different results on subsequent reads. This semantics favors
data currency over transaction serializability, and is useful for financial quote services.
Some systems relax the strict serialization requirement for transactions if the operations
involved are commutative (such as incrementing a counter). Finally, systems that allow
concurrent writes resolve write conflicts by re-executing transactions and provide hooks
for application-level conflict resolution [18]. In contrast to these systems, the configurable
consistency framework takes a compositional approach to supporting these semantics.
It provides options to independently express the concurrency and data currency requirements of transactions, and their data dependencies.
Distributed shared memory (DSM) systems provide a shared address space abstraction to simplify parallel programming on distributed memory machines such as multiprocessors and networks of workstations. A key hurdle to achieving good performance in a
DSM application is its high sensitivity to the cost of consistency-related communication.
The Munin system [13] proposed multiple consistency protocols to suit several common
sharing patterns of shared variables and showed that they substantially reduce communication costs in many DSM-based parallel applications. Swarm employs a similar
methodology to reduce consistency-related communication in wide-area applications.
However, its configurable consistency framework is designed to handle problems unique
to the wide area such as partial node/network failures.
False-sharing is a serious problem in page-based DSM systems as it forces replicas to
synchronize more often than necessary. It is caused by allocating multiple independently
accessed shared variables on a single shared memory page. Munin handles false-sharing
by allowing concurrent writes to distinct shared variables even if they reside on a single
page. It employs write-combining, i.e., collecting multiple updates to a page and propagating them once during memory synchronization. Write-combining can be achieved
by using Swarm’s manual replica synchronization control. Distributed object systems
avoid false-sharing by enforcing consistency at the granularity of shared objects instead
of pages. The configurable consistency framework provides options that allow parallel
updates, and relies on an application-provided plugin to resolve write conflicts. The
plugin’s resolution routine can perform write-combining similar to Munin.
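As an illustration of how an application plugin could perform Munin-style write-combining under the framework's parallel-update option, consider the sketch below. The Resolver interface and the field-level diff representation are assumptions made for the example, not Swarm's actual plugin API.

```cpp
// Hypothetical conflict-resolution plugin that write-combines concurrent
// updates to disjoint fields of a shared record (Munin-style), and falls back
// to last-writer-wins when the same field was changed on both replicas.
#include <map>
#include <string>
#include <iostream>

using Record = std::map<std::string, int>;   // field name -> value
using Diff   = std::map<std::string, int>;   // only the fields a replica wrote

struct Resolver {                            // assumed plugin interface
    virtual Record merge(Record base, const Diff& local, const Diff& remote) = 0;
    virtual ~Resolver() = default;
};

struct WriteCombiningResolver : Resolver {
    Record merge(Record base, const Diff& local, const Diff& remote) override {
        for (const auto& [field, value] : local)  base[field] = value;
        for (const auto& [field, value] : remote) base[field] = value; // remote wins on overlap
        return base;
    }
};

int main() {
    Record base{{"x", 1}, {"y", 2}};
    WriteCombiningResolver r;
    Record merged = r.merge(base, /*local*/ {{"x", 10}}, /*remote*/ {{"y", 20}});
    std::cout << merged["x"] << " " << merged["y"] << "\n";   // 10 20: both updates kept
}
```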
PerDis [65] provides DSM-style access to pointer-rich persistent shared objects in
cooperative engineering (CAD) applications. It replicates CAD objects at the granularity
of an object cluster (e.g., a ‘file’ representing a CAD design) instead of individual objects.
To ensure referential integrity, PerDis requires applications to access object clusters
within transactions. Pessimistic transactions serialize accesses via locking, whereas optimistic transactions allow parallel accesses and detect conflicts, but require the application
to resolve them. To allow cooperative data sharing among untrusting companies, PerDis
distinguishes between administrative/geographic domains. Interdomain consistency is
enforced by checking out remote domain clusters via a gateway node and checking in
updates later.
Collaborative applications can be classified as synchronous, where participants must
be in contact in real time, or asynchronous, where participants choose when to synchronize with each other [21]. Multiplayer games and instant messaging belong to the former
category, whereas shared calendars, address books, email and other groupware (e.g., Lotus Notes [32]) belong to the latter. Distributed games must disseminate player moves to
other players in a common global order and resolve state inconsistencies with minimum
latency and bandwidth consumption [53]. In other words, conflicts due to concurrent
state updates by players must be detected and resolved in real-time. For multiplayer
games, a client-server architecture is typically employed for simplicity. Here, a central
arbiter receives all updates, globally orders and disseminates them, and resolves conflicts.
However, the arbiter’s outgoing bandwidth requirement grows quadratically with the number
of clients. A hierarchical replica network (such as that provided by Oceanstore or Swarm)
can be used to disseminate updates in a more bandwidth-efficient manner. Asynchronous
collaboration requires that participants are given complete freedom to update data when
and where needed, without having to coordinate with others for each access. Bayou [21]
proposed epidemic-style replication with eventual consistency for applications of this
type. The configurable consistency framework provides optimistic failure-handling and
manual synchronization options to support asynchronous collaboration.
6.2 Flexible Consistency
Previous research efforts have identified the need for flexibility in consistency management of wide area applications. Existing systems provide this flexibility in several
ways: (i) by providing a set of discrete choices of consistency policies to suit the common
needs of applications in a particular domain [74, 49], (ii) by providing a set of application-independent consistency mechanisms that can be configured to serve a variety of application needs [77], or (iii) by providing hooks in the data access path where the application
is given control to implement its own consistency management [75, 12, 60]. Consistency
solutions of the first type are simple to adopt. They are well-suited to applications whose
requirements exactly match the discrete choices supported and rarely change.
WebOS [74] provides three discrete consistency policies in its distributed file system
that can be set individually on a per-file-access basis: last-writer-wins, which is appropriate for rarely write-shared Unix file workloads; append-only consistency, which is
appropriate for interactive chat and whiteboard applications; and support for broadcast
file updates or invalidation messages to a large number of wide area clients. Fluid
replication [16] provides last-writer-wins, optimistic and strong consistency flavors. On
the other hand, Bayou only guarantees eventual consistency among replicas as it relies on
epidemic-style update propagation whenever replicas come into contact with each other.
It requires applications to supply procedures to apply updates and to detect and resolve
conflicts. For clients that require a self-consistent view of data in spite of mobility, it
offers four types of session guarantees, which we discussed in Section 3.5.
TACT [77] provides a consistency model that offers continuous control over the
degree of replica divergence along several orthogonal dimensions. Our approach to
flexible consistency comes closest to that of TACT. As discussed in Section 3.5.5, TACT
provides continuous control over concurrency by bounding the number of uncommitted
updates that can be held by a replica before it must serialize with remote updates. Configurable consistency provides a binary choice between fully exclusive and fully concurrent
access for reasons mentioned in Section 2.7. The TACT model also allows applications
to express concurrency constraints in terms of application-level operations (e.g., allow
multiple enqueue operations in parallel), whereas the configurable consistency framework requires such operations to be mapped to one of its available concurrency control
modes for reads and writes. On the other hand, the configurable consistency framework offers
flexibility in some aspects that the TACT model does not. For instance, causal
and atomic dependencies are hard to express in the TACT model, as are soft best-effort
timeliness guarantees required by applications like chat and event dissemination where
writers cannot be blocked. Expressing them in the configurable consistency framework
is straightforward.
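To make the contrast concrete, the sketch below caricatures TACT-style bounded divergence: a replica accepts tentative local writes until a numeric bound on unsynchronized updates is reached, whereas a configurable-consistency replica would simply pick exclusive or concurrent mode up front. All names and the bound semantics shown here are illustrative, not the real TACT interface.

```cpp
// Caricature of TACT-style order-error bounding (not the real TACT API):
// a replica may buffer at most `bound` unsynchronized local writes before it
// must synchronize with its peers. A bound of 0 degenerates to the fully
// serialized case, while a very large bound approaches fully concurrent access.
#include <iostream>

class BoundedReplica {
    int pending_ = 0;      // local writes not yet seen by other replicas
    int bound_;
public:
    explicit BoundedReplica(int bound) : bound_(bound) {}

    void write() {
        if (pending_ >= bound_) synchronize();   // forced serialization point
        ++pending_;
    }
    void synchronize() {                         // stand-in for anti-entropy with peers
        std::cout << "sync after " << pending_ << " tentative writes\n";
        pending_ = 0;
    }
};

int main() {
    BoundedReplica chat(100);   // timeliness-oriented data: rarely forced to sync
    BoundedReplica bank(0);     // strongly consistent data: sync before every write
    for (int i = 0; i < 3; ++i) { chat.write(); bank.write(); }
}
```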
Systems such as Globe [75], and Oceanstore [60] require applications to enforce
consistency on their own by giving them control before and after data access. Oceanstore
provides additional system-level support for update propagation by organizing replicas
into a multicast tree. Each Oceanstore update can be associated with a dependency check
and merge procedure where the application must provide its consistency management
logic to be executed by the Oceanstore nodes during update propagation.
In addition to the above, a variety of consistency solutions have been proposed to
offer flexibility for specific classes of applications. We have adopted some of their
options into our framework based on their value as revealed by our application survey.
A detailed comparison of the configurable consistency framework with those solutions
is available in Section 3.5.
6.3 Wide-area Replication Algorithms
A variety of techniques have been proposed to support scalable replication over the
wide area. In this section, we discuss these techniques in the areas of replica networking,
update propagation, consistency maintenance and failure handling.
6.3.1 Replica Networking
For scalability, systems organize replicas into a network with various topologies.
Client-server file systems such as AFS, NFS and Coda organize replicas into a simple
two-level static hierarchy. Clients can only interact with the server but not with other
clients. Fluid replication [16] introduces an intermediate cache between clients and
the server, called a WayStation, to efficiently manage client access across WAN links.
Blaze’s PhD thesis [10] showed the value of constructing dynamic per-file cache hierarchies to support efficient large-scale file sharing and to reduce server load in distributed
file systems. Pangaea [62] organizes file replicas into a dynamic general graph as it
keeps replicas connected despite several links going down. A graph topology works well
to support eventual consistency via update flooding, but cycles cause deadlocks when
employing recursive pulls needed to enforce stronger consistency guarantees. Active
Directory [46] employs a hierarchy, but adds shortcut paths between siblings for
faster update propagation. Within a site, it connects replicas into a ring for tolerance
to single link failures. Bayou offers the maximum flexibility in organizing replicas as
it maintains consistency via pairwise synchronization. It is well-suited to asynchronous
collaborative applications that need this flexibility. However, to retain this flexibility,
Bayou must exchange version vectors whose size is proportional to the total number of replicas
during every synchronization, which is wasteful when replicas stay connected much of
the time. Swarm replicas exchange much smaller relative versions when connected, and
fall back on version vectors during replica topology changes.
The ROAM [56] mobile wide area file system replicates files on a per-volume basis.
For efficient WAN link utilization, it organizes replicas into clusters called WARDs with
a designated WARD master replica. Replicas within a WARD are expected to be better
connected than across WARDs. However, a WARD’s membership and master replica are
manually determined by its administrator. Inter-WARD communication happens only
through WARD masters. Within a WARD, all replicas communicate in a peer-to-peer
fashion by forming a dynamic ring. Unlike ROAM’s manual organization, Pangaea
builds its graph topology dynamically on a per-file basis, and achieves network economy
by building a spanning tree using node link quality information gained from an NxN ping
algorithm.
Unlike ROAM, and like Pangaea, Swarm builds a proximity-aware replica hierarchy
automatically for each file. A new Swarm replica decides for itself which other replicas
are closer to it and elects one of them as its parent. This is in contrast to Pangaea, where
the best neighbors for a new replica are chosen by existing replicas based on transitive
network distance computation.
6.3.2 Update Propagation
For efficiency, many replicated systems (e.g., Bayou, Rover, TACT) synchronize replicas by re-executing semantic operations instead of propagating modified contents. Most
optimistically replicated systems employ version vectors (VVs) to determine the updates
to be exchanged by replicas during synchronization and to detect conflicts. VVs give
replicas complete flexibility in choosing other replicas for synchronization. However, the
replica topology affects the overall synchronization latency as well as the amount of state
exchanged. For replicas to be able to arbitrarily synchronize with each other, they must
maintain and exchange version vectors proportional to the total number of replicas in the
system [14]. Bayou [54] implements full-length version vectors to retain flexibility in
update propagation. Since this hinders large-scale replication (to thousands of replicas),
many solutions have been proposed to prune the size of version vectors by somewhat
compromising on flexibility. LAN-based systems employ loosely synchronized clocks
to avoid having to maintain full-length VVs for each replicated object. ROAM restricts
version vector size to that of a WARD by limiting replicas to directly synchronize with
others only within their WARD.
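The bookkeeping cost discussed above is easy to see in code: a version vector carries one counter per replica, and comparing two vectors determines whether one replica's state dominates the other's or whether their updates conflict. The sketch below is a generic illustration of that comparison, not Bayou's or Swarm's implementation.

```cpp
// Generic version-vector comparison: each replica keeps one counter per known
// replica, so the vector size (and the state exchanged on every pairwise
// synchronization) grows with the total number of replicas in the system.
#include <map>
#include <string>
#include <cstdint>
#include <iostream>

using VersionVector = std::map<std::string, uint64_t>;  // replica id -> updates seen

enum class Order { Equal, Before, After, Concurrent };

Order compare(const VersionVector& a, const VersionVector& b) {
    auto entry = [](const VersionVector& v, const std::string& id) {
        auto it = v.find(id); return it == v.end() ? uint64_t{0} : it->second;
    };
    bool a_ahead = false, b_ahead = false;
    for (const auto& [id, n] : a) if (n > entry(b, id)) a_ahead = true;
    for (const auto& [id, n] : b) if (n > entry(a, id)) b_ahead = true;
    if (a_ahead && b_ahead) return Order::Concurrent;   // conflicting updates
    if (a_ahead) return Order::After;
    if (b_ahead) return Order::Before;
    return Order::Equal;
}

int main() {
    VersionVector x{{"r1", 3}, {"r2", 1}};
    VersionVector y{{"r1", 2}, {"r2", 2}};
    std::cout << (compare(x, y) == Order::Concurrent ? "conflict\n" : "ordered\n");
}
```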
The main focus of this dissertation is the configurable consistency framework. As
such, we have described how it can be enforced in systems employing loosely synchronized clocks as well as version vectors. In Section 4.8, we have also outlined
several optimizations that are possible in a hierarchy-based implementation of CC such
as Swarm.
6.3.3 Consistency Maintenance
Most systems maintain replica consistency by employing a protocol based on invalidations, updates, or a hybrid of the two. Invalidation-based protocols
(e.g., AFS) pessimistically block accesses to prevent conflicts, whereas update-based
protocols never block reads. Optimistic algorithms (e.g., Pangaea, Bayou, ROAM) do
not block writes as well, but deal with conflicts after they happen. The TACT toolkit [77]
never blocks reads, but enforces consistency before allowing writes by synchronizing
replicas via pull and push operations. TACT enforces consistency via iterative messaging
where each replica synchronizes with all other replicas directly to enforce the desired
consistency guarantees. Swarm synchronizes replicas recursively by propagating pull
and push requests along a replica hierarchy. Thus, each replica communicates in parallel
with a small (fixed) number of neighbors, enabling Swarm’s consistency protocol to scale
to a large number of replicas.
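The difference between TACT's iterative peer-to-peer synchronization and Swarm's recursive, hierarchy-based propagation can be sketched as follows; the tree structure and pull operation shown are a simplified illustration of the idea rather than Swarm's protocol code.

```cpp
// Simplified illustration of recursive pull propagation along a replica tree:
// each replica forwards the pull only to its direct neighbors (its children),
// so no node needs to contact every other replica directly.
#include <vector>
#include <cstdint>
#include <algorithm>
#include <iostream>

struct Replica {
    uint64_t version = 0;
    std::vector<Replica*> children;
};

// Returns the freshest version reachable in this subtree and brings the
// subtree root up to date with it.
uint64_t recursive_pull(Replica& r) {
    uint64_t newest = r.version;
    for (Replica* child : r.children)                 // contact only direct neighbors
        newest = std::max(newest, recursive_pull(*child));
    r.version = newest;
    return newest;
}

int main() {
    Replica leaf1{5}, leaf2{9}, mid{3, {&leaf1, &leaf2}}, root{1, {&mid}};
    std::cout << "root sees version " << recursive_pull(root) << "\n";  // 9
}
```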
When strong consistency guarantees need to be enforced, interleaved accesses among
replicas could cause thrashing, where replicas spend much of the time shuttling access privileges instead of accomplishing useful computation. Function-shipping avoids
the problem by disabling caching and redirecting all data accesses to a central site.
Database and object systems have long recognized the performance tradeoffs between
function-shipping and data-shipping [11, 50, 52]. Many provide both paradigms, but
require the choice to be made for a given data item at application design time. In
contrast, Swarm’s contention-aware caching scheme allows applications to make the
choice dynamically. The Sprite file system [48] disables client caching when a file is
write-shared. Swarm chooses the most frequently accessed copy as the master. The
Munin DSM system freezes a locking privilege at a replica for a minimum amount of
time (hundreds of milliseconds) before allowing it to migrate away. This scheme is
simple to implement, and works well in a LAN environment. However, in the wide
area, the cost of privilege-shuttling is much higher. When accesses to an object are
interleaved but some replicas exhibit higher locality than others, the time-based privilege
freezing approach still allows shuttling, which quickly offsets the speedup achieved.
We implemented time-based hysteresis in Swarm and observed this problem in a WAN
environment. We found that performance can be improved by actively measuring the
locality and redirecting other accesses to the replica that exhibits higher locality.
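A minimal sketch of this locality-driven policy is given below; the access-counting heuristic and the way a master is chosen are illustrative assumptions rather than Swarm's exact algorithm.

```cpp
// Illustrative contention-aware policy: track recent accesses per replica and,
// when accesses are interleaved, keep the write privilege at the replica with
// the highest observed locality, redirecting the others to it (i.e., switching
// the low-locality replicas from data-shipping to function-shipping).
#include <vector>
#include <cstddef>
#include <iostream>

struct ReplicaStats { std::size_t recent_accesses = 0; };

// Returns the index of the replica that should act as master for this object.
std::size_t choose_master(const std::vector<ReplicaStats>& replicas) {
    std::size_t best = 0;
    for (std::size_t i = 1; i < replicas.size(); ++i)
        if (replicas[i].recent_accesses > replicas[best].recent_accesses)
            best = i;
    return best;
}

int main() {
    std::vector<ReplicaStats> replicas{{120}, {7}, {3}};    // replica 0 is highly local
    std::cout << "redirect remote accesses to replica " << choose_master(replicas) << "\n";
}
```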
The Mariposa distributed data manager [66] is a middleware service that sits between
a database and its clients, and spreads the query and update processing load among a
collection of autonomous wide area sites. Mariposa uses an economic model for the
allocation of query processing work among sites as well as to decide when and where
to replicate database table fragments to spread load. Each site “pays” another site to
supply a cached fragment copy and to send it an update stream to keep its contents
within a specified staleness limit. Clients can specify an upper bound on staleness and
a budget for each query. The system splits the query into subqueries on fragments, and
redirects subqueries to sites that bid to execute them at the lowest price. Since there is
cost associated with keeping a cached fragment copy, each site keeps a copy only as long
as it can recover its cost by serving queries. Thus, the system’s economic model causes
the sites to autonomously switch between function-shipping and data-shipping.
Mariposa runs a bidding protocol to choose the best sites to execute each query. This
approach works well when the amount of processing involved in queries is high enough
to justify the overhead of bidding. In contrast, Swarm clients send their requests to an
arbitrary replicated server site and let the system direct the requests to a site that triggers
minimal consistency traffic.
6.3.4 Failure Resilience
Any wide area system must gracefully handle failures of nodes and network links.
Client-server systems such as NFS and AFS recover from failures by employing time-limited leases. Swarm implements hierarchical leases for failure recovery in a peer-to-peer replication environment.
A large-scale peer-to-peer system must also handle node churn where nodes continually join and leave the system. Recent work has studied the phenomenon of node
churn in the context of peer-to-peer Distributed Hash Table (DHT) implementations
[59]. Their study made several important observations. First, systems are better off
not reacting immediately to a failure, as doing so will likely trigger a positive feedback cycle of
more failures due to congestion. Second, choosing message timeouts based on actual
measured round-trip times performs better under churn than choosing large statically
configured timeouts. Finally, when choosing nearby nodes as overlay neighbors in a
peer-to-peer network, employing a simple algorithm with low bandwidth overhead is
more effective at reducing latency under churn since the neighbor set could change
rapidly. Our experience with Swarm confirms the validity of those observations. To avoid
positive feedback loops due to congestion, a Swarm replica handles a busy neighbor
differently from an unreachable neighbor by its use of TCP channels for communication.
It treats the neighbor as reachable as long as it can establish a TCP connection with the
neighbor. Thus it quickly recovers from link and node failures. Swarm employs much
longer timeouts to handle busy neighbors to avoid the positive feedback cycles mentioned
above.
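The sketch below illustrates this distinction between a busy and an unreachable neighbor; the connectivity probe and the timeout values are hypothetical stand-ins for Swarm's actual channel management.

```cpp
// Hypothetical failure-handling policy: a neighbor that still accepts TCP
// connections is treated as busy (use a long timeout and keep waiting), while
// one that refuses connections is treated as failed and recovered from quickly.
#include <chrono>
#include <iostream>

enum class NeighborState { Reachable, Unreachable };

// Stand-in for "can we establish a TCP connection to this neighbor?"
NeighborState probe(bool tcp_connect_succeeded) {
    return tcp_connect_succeeded ? NeighborState::Reachable : NeighborState::Unreachable;
}

std::chrono::seconds reply_timeout(NeighborState s) {
    using namespace std::chrono;
    // Busy-but-reachable neighbors get a generous timeout to avoid the positive
    // feedback cycle of retries under congestion; unreachable ones fail fast.
    return s == NeighborState::Reachable ? seconds(120) : seconds(5);
}

int main() {
    std::cout << "busy neighbor timeout: " << reply_timeout(probe(true)).count() << "s\n";
    std::cout << "dead neighbor timeout: " << reply_timeout(probe(false)).count() << "s\n";
}
```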
Finally, a number of voting-based algorithms have been proposed that employ redundancy to tolerate permanent failure of primary replicas [27]. Our Swarm prototype
does not implement multiple custodians and hence cannot tolerate permanent failure of
custodians.
6.4 Reusable Middleware
Our goal of designing the configurable consistency framework was to pave the way
for effective middleware support for replication in a wide variety of distributed services.
Recent research efforts have developed reusable middleware to simplify distributed application development. Thor [44] is a distributed object-oriented database specifically
designed to ease the development of future distributed applications. Its designers advocate that persistent shared objects (as opposed to file systems or shared memory) provide
the right abstraction to ease distributed application development, as an object-based
approach facilitates tackling a number of hard issues at the system level, such as data and
functionality evolution and garbage collection, in addition to caching and consistency.
Thor supports sharing at a per-object granularity via transactions. It provides two consistency flavors: strong consistency via locking and eventual consistency via optimistic
replica control.
Oceanstore [60] takes a different approach by providing a global persistent storage
utility that exports immutable files and file blocks as the units of data sharing. Unlike
Swarm that updates files in-place, each update in Oceanstore creates new snapshots of
a file. Oceanstore classifies servers into those that can be trusted to perform replication
protocols and those that cannot be trusted. Though untrusted servers can accept updates,
those updates cannot be committed until the client directly contacts the trusted servers
and confirms them. Oceanstore’s main goals are secure sharing as well as long-term
durability of data, and it accepts poor write performance to achieve them.
CHAPTER 7
FUTURE WORK AND CONCLUSIONS
Before we summarize the contributions of this dissertation, we outline several important
issues that must be resolved before configurable consistency can be widely adopted by
middleware services to support wide area replication, and we suggest ways to address them
in future work.
7.1 Future Work
In this dissertation, we described how a middleware can adopt the configurable consistency framework to provide flexible consistency management to a variety of wide
area applications. We did this by presenting a proof-of-concept implementation of a
middleware called Swarm and by showing that applications with diverse needs perform
well using it. Implementing configurable consistency in a more realistic middleware
storage service would allow us to gauge its performance and usability more accurately
for mainstream application design. For instance, our prototype does not support multiple
custodians for tolerance to permanent failures. However, in realistic systems, relying
on the accessibility of a single root replica can severely reduce availability. Also, our
prototype does not support the causality and atomicity options, which are needed to
support the transactional model of computing. Support for transactional properties is
essential for database and directory services.
7.1.1 Improving the Framework’s Ease of Use
Although the large set of configurable consistency options provides flexibility, determining the right set of options to employ for a given application scenario may not always
be straightforward, which could hinder the framework's usability. To ease the design burden on
application programmers, our framework needs a higher-level interface that
provides more restricted but popular and well-understood semantics that can also be
safely employed in combination for specific application areas. However, as we argued in
Chapter 2, applications still need the framework’s flexibility in some scenarios, in which
case they can use the framework’s more expressive interface.
7.1.2 Security and Authentication
Data management over the wide area often requires application components such as
clients and servers to communicate across multiple administrative and/or trust domains.
Security is important at two levels. First, the middleware must provide a mechanism
that ensures that only authenticated users/applications get access to their data. Second,
the distributed components that comprise the middleware might themselves belong to
multiple administrative domains that are mutually distrusting. In that case, they need to
authenticate to each other and secure their communication channels via mechanisms such
as SSL encryption. The consistency protocol that we presented in this thesis assumes that
the communicating parties trust each other and behave in a fail-stop manner. It must be
extended to operate correctly among mutually distrusting parties, which requires further
research. However, if the middleware belongs to a single administrative domain or if
the distributed components can authenticate to each other and can be guaranteed to run
trusted code (such as in the case of middleware systems deployed across a corporate
Intranet), our existing consistency protocol is adequate.
7.1.3 Applicability to Object-based Middleware
Many researchers advocate the use of a persistent shared object abstraction (as opposed
to file systems or shared memory) to ease distributed application development, as it
allows a number of hard issues such as data and functionality evolution and garbage
collection to be tackled at the system level in addition to caching and consistency. We believe that implementing configurable consistency in an object-based middleware system
is feasible because our framework assumes a general data access interface (described
in Section 3.2) that applies to a variety of data abstractions including object systems.
However, our belief can only be validated by an actual implementation.
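For concreteness, a session-oriented access interface of the kind the framework assumes might look like the sketch below when transplanted to an object middleware. The names and behavior are illustrative assumptions only; they are not the interface defined in Section 3.2.

```cpp
// Illustrative session-oriented access interface for an object middleware:
// consistency options are attached when a session on an object is opened, and
// all reads and writes happen within that session. Names are hypothetical.
#include <string>
#include <iostream>

struct SessionOptions {               // stand-in for configurable consistency options
    bool exclusive_writes = true;
    bool eager_propagation = true;
};

class ObjectSession {
    std::string id_;
public:
    ObjectSession(std::string object_id, SessionOptions opts) : id_(std::move(object_id)) {
        std::cout << "open " << id_ << (opts.exclusive_writes ? " [excl]" : " [conc]") << "\n";
    }
    std::string read()                    { return "state of " + id_; }
    void write(const std::string& update) { std::cout << "apply " << update << "\n"; }
    ~ObjectSession()                      { std::cout << "close " << id_ << "\n"; }
};

int main() {
    ObjectSession s("calendar:alice", SessionOptions{false, false});  // relaxed session
    s.write("add meeting");
    std::cout << s.read() << "\n";
}   // session close is where manual synchronization could be triggered
```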
7.2 Summary
In this dissertation, we presented a novel approach to flexible consistency management called configurable consistency to support caching efficiently for diverse distributed
applications in non-uniform network environments (including WANs). Although wide
area caching can improve end-to-end performance and availability for many distributed
services, there currently exists no data caching support that applications can use to manage write-shared data effectively. Several middleware storage services support caching
for read-mostly data or rarely write-shared data suitable for traditional file access. However, their consistency mechanisms are not flexible enough to express and enforce the
diverse sharing semantics needed by applications such as databases, directory services
and real-time collaborative services. As a result, applications that use these middleware services are forced to employ their own replication and consistency solutions outside the middleware.
To determine the feasibility of a replication solution that supports the diverse sharing needs of a variety of applications, we have surveyed applications in three diverse
classes: file sharing, database and directory services, and collaboration. Our application study, presented in Chapter 2, revealed that applications differ significantly in
their data characteristics and consistency needs. Some applications such as file services
and databases have dynamically varying consistency requirements on the same data
over time. Supporting such requirements warrants a high degree of customizability in
consistency management. We also observed that certain core design choices recur in the
consistency management of diverse applications, although different applications need to
make different sets of choices. Identifying this commonality in the consistency enforcement options enabled us to develop a new taxonomy to classify the consistency needs
of diverse applications along five orthogonal dimensions. Based on this taxonomy, we
developed a new consistency management framework called configurable consistency
that can express a broader set of consistency semantics than existing solutions using a
small set of options along those dimensions.
Our consistency framework (presented in Chapter 3) provides several benefits over
existing solutions. By adopting our consistency framework, a middleware storage service
can support read-write WAN replication efficiently for a wider variety of applications
than by adopting an existing solution. If a distributed service is designed to use the
configurable consistency interface, its consistency semantics can be customized for a
broader variety of sharing needs via a simple reconfiguration of the interface’s options
than if the same application were written using another consistency solution.
We have demonstrated these benefits by presenting the design and implementation
of configurable consistency in a prototype middleware data service called Swarm (in
Chapter 4). We showed that by adopting our framework, Swarm is able to support four
network services with diverse sharing needs, with performance comparable to (i.e., within
20% of) or exceeding (by up to 500%) that of alternate implementations.
Our Swarm-based distributed file service can support the consistency semantics provided by existing file systems on a per-file-session basis. We showed that it can exploit
access locality more effectively than existing file systems, while supporting traditional
file sharing as well as reliable file locking.
We presented a proxy-caching enterprise service that supports wide area caching
of enterprise objects with strong consistency, and outperforms traditional client-server
solutions at all levels of contention for shared data, thanks to our novel contention-aware
replication control algorithm.
We illustrated how transparent replication support can be added to the popular BerkeleyDB database library using Swarm. We showed that configurable consistency enables
a replicated BerkeleyDB to be reused, via simple reconfiguration, for a variety of application scenarios with diverse consistency needs. If the application can tolerate relaxed
consistency, our framework provides consistency semantics that improve performance
by several orders of magnitude beyond that of client-server BerkeleyDB.
Although in this dissertation we present the implementation of configurable consistency in the context of a distributed file store, our framework can be implemented in
any middleware that provides a session-oriented data access interface to its applications.
Thus, it is applicable to object-based middleware as well as clustered file and storage services. For instance, existing clustered storage services predominantly employ read-only
replication for load-balancing of read requests and for fault-tolerance using replicas as
standbys. Augmenting those services to provide configurable consistency enables them
to support coherent read-write replication for a wide variety of services. Read-write
replication, even within a cluster, can improve the performance of some applications
by utilizing the cluster’s resources better (e.g., by spreading write traffic among them).
Moreover, extending the storage service to provide wide area caching would benefit all
of its applications.
In summary, the main contributions of this dissertation are the following: (1) a new
taxonomy for classifying the consistency needs of distributed applications, (2) a new
consistency framework based on that taxonomy that can express the needs of a broader variety of
applications than existing consistency solutions, and (3) a proof by demonstration that
the framework is implementable efficiently in a middleware to support the sharing needs
of a wide variety of distributed applications.
REFERENCES
[1] ADYA, A. Weak consistency: A generalized theory and optimistic implementations for distributed transactions. Tech. Rep. TR-786, MIT/LCS, 1999.
[2] AHAMAD, M., HUTTO, P., AND JOHN, R. Implementing and programming causal distributed shared memory. In Proceedings of the 11th International Conference on Distributed Computing Systems (May 1991), pp. 274–281.
[3] AMERICAN NATIONAL STANDARD FOR INFORMATION SYSTEMS. Database language - SQL, 1992.
[4] ATTIYA, H., AND WELCH, J. L. Sequential consistency versus linearizability. In Proceedings of the 3rd Annual ACM Symposium on Parallel Algorithms and Architectures (Hilton Head, South Carolina, July 1991), pp. 304–315.
[5] BADRINATH, B., AND RAMAMRITHAM, K. Semantics-based concurrency control: Beyond commutativity. ACM Transactions on Database Systems 17, 1 (Mar. 1992).
[6] BAL, H., KAASHOEK, M., AND TANENBAUM, A. Orca: A language for parallel programming of distributed systems. IEEE Transactions on Software Engineering (Mar. 1992), 190–205.
[7] BERENSON, H., BERNSTEIN, P., GRAY, J., MELTON, J., O'NEIL, E., AND O'NEIL, P. A critique of ANSI SQL isolation levels. In Proceedings of SIGMOD '95 (May 1995).
[8] BERNSTEIN, P., HADZILACOS, V., AND GOODMAN, N. Concurrency Control and Recovery in Database Systems. Addison-Wesley, Reading, Massachusetts, 1987.
[9] BERSHAD, B., ZEKAUSKAS, M., AND SAWDON, W. The Midway distributed shared memory system. In COMPCON '93 (Feb. 1993), pp. 528–537.
[10] BLAZE, M. Caching in Large Scale Distributed File Systems. PhD thesis, Princeton University, 1993.
[11] BOHRER, K. Architecture of the San Francisco Frameworks. IBM Systems Journal 37, 2 (Feb. 1998).
[12] BRUN-COTTAN, G., AND MAKPANGOU, M. Adaptable replicated objects in distributed environments. Second BROADCAST Open Workshop (1995).
[13] CARTER, J. Design of the Munin distributed shared memory system. Journal of Parallel and Distributed Computing 29, 2 (Sept. 1995), 219–227.
[14] CHARRON-BOST, B. Concerning the size of logical clocks in distributed systems. Information Processing Letters 39 (1991), 11–16.
[15] INKTOMI CORPORATION. The technology behind HotBot. http://www.inktomi.com/whitepap.html, May 1996.
[16] COX, L., AND NOBLE, B. Fast reconciliations in Fluid Replication. In Proc. 21st Intl. Conference on Distributed Computing Systems (Apr. 2001).
[17] CVS. The Concurrent Versions System. http://www.cvshome.org/, 2004.
[18] DEMERS, A., PETERSEN, K., SPREITZER, M., TERRY, D., THEIMER, M., AND WELCH, B. The Bayou architecture: Support for data sharing among mobile users. In Proceedings of the Workshop on Mobile Computing Systems and Applications (Dec. 1994).
[19] DUBOIS, M., SCHEURICH, C., AND BRIGGS, F. Synchronization, coherence, and event ordering in multiprocessors. IEEE Computer 21, 2 (Feb. 1988), 9–21.
[20] EBAY. http://www.ebay.com/, 2004.
[21] EDWARDS, W. K., MYNATT, E. D., PETERSEN, K., SPREITZER, M. J., TERRY, D. B., AND THEIMER, M. M. Designing and implementing asynchronous collaborative applications with Bayou. In Tenth ACM Symposium on User Interface Software and Technology (Banff, Alberta, Canada, Oct. 1997), pp. 119–128.
[22] KUBIATOWICZ, J., ET AL. OceanStore: An architecture for global-scale persistent storage. In Proceedings of the 9th Symposium on Architectural Support for Programming Languages and Operating Systems (Nov. 2000).
[23] FOX, A., AND BREWER, E. Harvest, yield and scalable tolerant systems. In Proceedings of the Seventh Workshop on Hot Topics in Operating Systems (Mar. 1999).
[24] FOX, A., GRIBBLE, S., CHAWATHE, Y., BREWER, E., AND GAUTHIER, P. Cluster-based scalable network services. In Proceedings of the 16th Symposium on Operating Systems Principles (Oct. 1997).
[25] FRANCIS, P., JAMIN, S., JIN, C., JIN, Y., RAZ, D., SHAVITT, Y., AND ZHANG, L. IDMaps: A global Internet host distance estimation service. IEEE/ACM Transactions on Networking (TON) 9, 5 (Oct. 2001), 525–540.
[26] GHARACHORLOO, K., LENOSKI, D., LAUDON, J., GIBBONS, P., GUPTA, A., AND HENNESSY, J. Memory consistency and event ordering in scalable shared-memory multiprocessors. In Proceedings of the 17th Annual International Symposium on Computer Architecture (Seattle, Washington, May 1990), pp. 15–26.
[27] GIFFORD, D. Weighted voting for replicated data. In Proceedings of the 7th Symposium on Operating Systems Principles (1979), pp. 150–162.
[28] HOWARD, J., KAZAR, M., MENEES, S., NICHOLS, D., SATYANARAYANAN, M., SIDEBOTHAM, R., AND WEST, M. Scale and performance in a distributed file system. ACM Transactions on Computer Systems 6, 1 (Feb. 1988), 51–82.
[29] HUTTO, P., AND AHAMAD, M. Slow memory: Weakening consistency to enhance concurrency in distributed shared memories. In Proceedings of the 10th International Conference on Distributed Computing Systems (May 1990), pp. 302–311.
[30] JOHNSON, K., KAASHOEK, M., AND WALLACH, D. CRL: High-performance all-software distributed shared memory. In Proceedings of the 15th Symposium on Operating Systems Principles (1995).
[31] JOSEPH, A., TAUBER, J., AND KAASHOEK, M. F. Mobile computing with the Rover toolkit. IEEE Transactions on Computers: Special Issue on Mobile Computing 46, 3 (Mar. 1997).
[32] KAWELL JR., L., BECKHARDT, S., HALVORSEN, T., OZZIE, R., AND GREIF, I. Replicated document management in a group communication system. In Groupware: Software for Computer-Supported Cooperative Work, edited by D. Marca and G. Bock, IEEE Computer Society Press (1992), pp. 226–235.
[33] KAMINSKY, M., SAVVIDES, G., MAZIERES, D., AND KAASHOEK, M. Decentralized user authentication in a global file system. In Proceedings of the 19th Symposium on Operating Systems Principles (Oct. 2003).
[34] KAZAA. http://www.kazaa.com/, 2000.
[35] KELEHER, P. Decentralized replicated-object protocols. In Proceedings of the 18th Annual ACM Symposium on Principles of Distributed Computing (Apr. 1999).
[36] KELEHER, P., COX, A. L., AND ZWAENEPOEL, W. Lazy release consistency for software distributed shared memory. In Proceedings of the 19th Annual International Symposium on Computer Architecture (May 1992), pp. 13–21.
[37] KIM, M., COX, L., AND NOBLE, B. Safety, visibility and performance in a wide-area file system. In Proceedings of the 1st USENIX Conference on File and Storage Technologies (FAST '02) (Jan. 2002).
[38] KISTLER, J., AND SATYANARAYANAN, M. Disconnected operation in the Coda file system. In Proceedings of the 13th Symposium on Operating Systems Principles (Oct. 1991), pp. 213–225.
[39] KRISHNAKUMAR, N., AND BERNSTEIN, A. Bounded ignorance: A technique for increasing concurrency in a replicated system. ACM Transactions on Database Systems 19, 4 (Dec. 1994).
[40] LADIN, R., LISKOV, B., SHRIRA, L., AND GHEMAWAT, S. Providing high availability using lazy replication. ACM Transactions on Computer Systems 10, 4 (1992).
[41] LAMB, C., LANDIS, G., ORENSTEIN, J., AND WEINREB, D. The ObjectStore database system. Communications of the ACM (Oct. 1991).
[42] LIBEN-NOWELL, D., BALAKRISHNAN, H., AND KARGER, D. Analysis of the evolution of peer-to-peer systems.
[43] LIPTON, R., AND SANDBERG, J. PRAM: A scalable shared memory. Tech. Rep. CS-TR-180-88, Princeton University, Sept. 1988.
[44] LISKOV, B., ADYA, A., CASTRO, M., DAY, M., GHEMAWAT, S., GRUBER, R., MAHESHWARI, U., MYERS, A. C., AND SHRIRA, L. Safe and efficient sharing of persistent objects in Thor. In Proceedings of SIGMOD '96 (June 1996).
[45] MAZIERES, D. Self-certifying file system. PhD thesis, MIT, May 2000. http://www.fs.net.
[46] MICROSOFT CORP. Active Directory (in Windows 2000 Server Resource Kit). Microsoft Press, 2000.
[47] MUTHITACHAROEN, A., CHEN, B., AND MAZIERES, D. Ivy: A read/write peer-to-peer file system. In Proceedings of the Fifth Symposium on Operating System Design and Implementation (Dec. 2002).
[48] NELSON, M., WELCH, B., AND OUSTERHOUT, J. Caching in the Sprite network file system. ACM Transactions on Computer Systems 6, 1 (1988), 134–154.
[49] NOBLE, B., FLEIS, B., KIM, M., AND ZAJKOWSKI, J. Fluid Replication. In Proceedings of the 1999 Network Storage Symposium (NetStore) (Oct. 1999).
[50] OBJECT MANAGEMENT GROUP. The Common Object Request Broker: Architecture and Specification, Version 2.0. http://www.omg.org, 1996.
[51] OPENLDAP. OpenLDAP: An open source implementation of the Lightweight Directory Access Protocol. http://www.openldap.org/.
[52] ORACLE CORP. Oracle 7 Server Distributed Systems Manual, Vol. 2, 1996.
[53] PELLEGRINO, J., AND DOVROLIS, C. Bandwidth requirement and state consistency in three multiplayer game architectures. In ACM NetGames (May 2003), pp. 169–184. http://www.cc.gatech.edu/fac/Constantinos.Dovrolis/start page.html.
[54] PETERSEN, K., SPREITZER, M. J., TERRY, D. B., THEIMER, M. T., AND DEMERS, A. J. Flexible update propagation for weakly consistent replication. In Proceedings of the 16th Symposium on Operating Systems Principles (1997).
[55] PITOURA, E., AND BHARGAVA, B. K. Maintaining consistency of data in mobile distributed environments. In Proceedings of the 15th International Conference on Distributed Computing Systems (1995), pp. 404–413.
[56] RATNER, D. ROAM: A scalable replication system for mobile and distributed computing. Tech. Rep. 970044, University of California, Los Angeles, 1997.
[57] REIHER, P., HEIDEMANN, J., RATNER, D., SKINNER, G., AND POPEK, G. Resolving file conflicts in the Ficus file system. In Proceedings of the 1994 Summer USENIX Conference (1994).
[58] ROWSTRON, A., AND DRUSCHEL, P. Storage management and caching in PAST, a large-scale, persistent peer-to-peer storage utility. In Proceedings of the 18th Symposium on Operating Systems Principles (2001).
[59] RHEA, S., GEELS, D., ROSCOE, T., AND KUBIATOWICZ, J. Handling churn in a DHT. In Proceedings of the USENIX 2004 Annual Technical Conference (June 2004).
[60] RHEA, S., ET AL. Pond: The OceanStore prototype. In Proceedings of the 2nd USENIX Conference on File and Storage Technologies (FAST '03) (Mar. 2003).
[61] SAITO, Y., BERSHAD, B., AND LEVY, H. Manageability, availability and performance in Porcupine: A highly scalable Internet mail service. ACM Transactions on Computer Systems (Aug. 2000).
[62] SAITO, Y., KARAMANOLIS, C., KARLSSON, M., AND MAHALINGAM, M. Taming aggressive replication in the Pangaea wide-area file system. In Proceedings of the Fifth Symposium on Operating System Design and Implementation (2002), pp. 15–30.
[63] SAITO, Y., AND SHAPIRO, M. Replication: Optimistic approaches. Tech. Rep. HPL-2002-33, HP Labs, 2001.
[64] SANDBERG, R., GOLDBERG, D., KLEIMAN, S., WALSH, D., AND LYON, B. Design and implementation of the Sun Network Filesystem. In Proceedings of the Summer 1985 USENIX Conference (1985), pp. 119–130.
[65] SHAPIRO, M., FERREIRA, P., AND RICHER, N. Experience with the PerDiS large-scale data-sharing middleware. In Intl. Workshop on Persistent Object Systems (Lillehammer, Norway, Sept. 2000), G. Kirby, Ed., vol. 2135 of Lecture Notes in Computer Science, Springer-Verlag, pp. 57–71. http://www-sor.inria.fr/publi/EwPLSDSM pos2000.html.
[66] SIDELL, J., AOKI, P., SAH, A., STAELIN, C., STONEBRAKER, M., AND YU, A. Data replication in Mariposa. In Proceedings of the 12th International Conference on Data Engineering (New Orleans, LA, USA, 1996).
[67] SLEEPYCAT SOFTWARE. The BerkeleyDB database. http://sleepycat.com/, 2000.
[68] STOICA, I., MORRIS, R., KARGER, D., KAASHOEK, M. F., AND BALAKRISHNAN, H. Chord: A scalable peer-to-peer lookup service for Internet applications. In Proceedings of the SIGCOMM '01 Symposium (Aug. 2001).
[69] SUSARLA, S., AND CARTER, J. DataStations: Ubiquitous transient storage for mobile users. Tech. Rep. UUCS-03-024, University of Utah School of Computer Science, Nov. 2003.
[70] TANENBAUM, A., AND VAN STEEN, M. Distributed Systems: Principles and Paradigms. Prentice-Hall, Upper Saddle River, New Jersey, 2002.
[71] TERRY, D. B., DEMERS, A. J., PETERSEN, K., SPREITZER, M. J., THEIMER, M. M., AND WELCH, B. B. Session guarantees for weakly consistent replicated data. In Proceedings of the International Conference on Parallel and Distributed Information Systems (PDIS) (Sept. 1994), pp. 140–149.
[72] TORRES-ROJAS, F. J., AHAMAD, M., AND RAYNAL, M. Timed consistency for shared distributed objects. In Proceedings of the Eighteenth Annual ACM Symposium on Principles of Distributed Computing (PODC '99) (Atlanta, Georgia, May 1999), ACM, pp. 163–172.
[73] TRANSACTION PROCESSING PERFORMANCE COUNCIL. The TPC-A Benchmark, Revision 2.0, June 1994.
[74] VAHDAT, A. Operating System Services for Wide Area Applications. PhD thesis, University of California, Berkeley, CA, 1998.
[75] VAN STEEN, M., HOMBURG, P., AND TANENBAUM, A. Architectural design of Globe: A wide-area distributed system. Tech. Rep. IR-422, Vrije Universiteit, Department of Mathematics and Computer Science, Mar. 1997.
[76] WHITE, B., LEPREAU, J., STOLLER, L., RICCI, R., GURUPRASAD, S., NEWBOLD, M., HIBLER, M., BARB, C., AND JOGLEKAR, A. An integrated experimental environment for distributed systems and networks. In Proceedings of the Fifth Symposium on Operating System Design and Implementation (Boston, MA, Dec. 2002).
[77] YU, H., AND VAHDAT, A. Design and evaluation of a continuous consistency model for replicated services. In Proceedings of the Fourth Symposium on Operating System Design and Implementation (Oct. 2000).
[78] YU, H., AND VAHDAT, A. The costs and limits of availability for replicated services. In Proceedings of the 18th Symposium on Operating Systems Principles (Oct. 2001).
[79] ZHAO, B., KUBIATOWICZ, J., AND JOSEPH, A. Tapestry: An infrastructure for wide-area fault-tolerant location and routing. Submitted for publication, 2001.