The Scalable Commutativity Rule:
Designing Scalable Software for Multicore Processors
by
Austin T. Clements
S.B., Massachusetts Institute of Technology (2006)
M.Eng., Massachusetts Institute of Technology (2008)
Submitted to the Department of Electrical Engineering and Computer Science
in partial fulfillment of the requirements for the degree of
Doctor of Philosophy in Computer Science
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
June 2014
© Massachusetts Institute of Technology 2014. All rights reserved.
Author:
Department of Electrical Engineering and Computer Science
May 21, 2014

Certified by:
M. Frans Kaashoek
Charles Piper Professor of Computer Science and Engineering
Thesis Supervisor

Certified by:
Nickolai Zeldovich
Associate Professor
Thesis Supervisor

Accepted by:
Leslie A. Kolodziejski
Chair, Department Committee on Graduate Theses
The Scalable Commutativity Rule:
Designing Scalable Software for Multicore Processors
by
Austin T. Clements
Submitted to the Department of Electrical Engineering and Computer Science
on May 21, 2014, in partial fulfillment of the
requirements for the degree of
Doctor of Philosophy in Computer Science
Abstract
What fundamental opportunities for multicore scalability are latent in software interfaces,
such as system call APIs? Can scalability opportunities be identified even before any implementation exists, simply by considering interface specifications? To answer these questions,
this dissertation introduces the scalable commutativity rule: Whenever interface operations
commute, they can be implemented in a way that scales. This rule aids developers in building
scalable multicore software starting with interface design and carrying on through implementation, testing, and evaluation.
This dissertation formalizes the scalable commutativity rule and defines a novel form of
commutativity named SIM commutativity that makes it possible to fruitfully apply the rule to
complex and highly stateful software interfaces.
To help developers apply the rule, this dissertation introduces an automated method
embodied in a new tool named Commuter, which accepts high-level interface models,
generates tests of operations that commute and hence could scale, and uses these tests to
systematically evaluate the scalability of implementations. We apply Commuter to a model
of 18 POSIX file and virtual memory system operations. Using the resulting 26,238 scalability
tests, Commuter systematically pinpoints many problems in the Linux kernel that past work
has observed to limit application scalability and identifies previously unknown bottlenecks
that may be triggered by future hardware or workloads.
Finally, this dissertation applies the scalable commutativity rule and Commuter to the
design and implementation of a new POSIX-like operating system named sv6. sv6’s novel
file and virtual memory system designs enable it to scale for 99% of the tests generated
by Commuter. These results translate to linear scalability on an 80-core x86 machine for
applications built on sv6’s commutative operations.
Thesis Supervisor: M. Frans Kaashoek
Title: Charles Piper Professor of Computer Science and Engineering
Thesis Supervisor: Nickolai Zeldovich
Title: Associate Professor
Acknowledgments
This work would not have been possible without the combined efforts and determination of
my advisors Frans Kaashoek, Nickolai Zeldovich, Robert Morris, and Eddie Kohler. Frans’
unwavering focus on The Technical Nugget led to our simplest and most powerful ideas.
Nickolai’s raw energy and boundless breadth are reflected in every corner of this dissertation.
Robert’s keen vision and diligent skepticism pushed the frontiers of what we thought possible.
And Eddie’s inexhaustible enthusiasm and remarkable intuition reliably yielded solutions to
seemingly impossible problems. I am honored to have had them as my mentors.
This research was greatly improved by the valuable feedback of Silas Boyd-Wickizer,
Yandong Mao, Xi Wang, Marc Shapiro, Rachid Guerraoui, Butler Lampson, Paul McKenney,
and the twenty-three anonymous reviewers who pored over the many drafts of this work.
Silas and Nickolai contributed substantially to the implementation of sv6 and sv6 would
not have been possible without xv6, which was written by Frans, Robert, and Russ Cox.
I thank my parents, Hank and Robin, for their encouragement, guidance, and support.
Finally, I thank my wife, Emily, for her love and understanding, and for being my beacon
through it all.
∗ ∗ ∗
This research was supported by NSF awards SHF-964106 and CNS-1301934, by grants from
Quanta Computer and Google, and by a VMware graduate fellowship.
∗ ∗ ∗
This dissertation incorporates and extends work previously published in the following papers:
Austin T. Clements, M. Frans Kaashoek, Nickolai Zeldovich, Robert T. Morris,
and Eddie Kohler. The scalable commutativity rule: Designing scalable software
for multicore processors. In Proceedings of the 24th ACM Symposium on Operating
Systems Principles (SOSP), Farmington, Pennsylvania, November 2013.
Austin T. Clements, M. Frans Kaashoek, and Nickolai Zeldovich. RadixVM:
Scalable address spaces for multithreaded applications. In Proceedings of the
ACM EuroSys Conference, Prague, Czech Republic, April 2013.
Austin T. Clements, M. Frans Kaashoek, and Nickolai Zeldovich. Concurrent
address spaces using RCU balanced trees. In Proceedings of the 17th International
Conference on Architectural Support for Programming Languages and Operating
Systems (ASPLOS), London, UK, March 2012.
Contents
1 Introduction
  1.1 Parallelize or perish
  1.2 A rule for interface design
  1.3 Applying the rule
  1.4 Contributions
  1.5 Outline

2 Related work
  2.1 Thinking about scalability
  2.2 Designing scalable operating systems
  2.3 Commutativity
  2.4 Test case generation

3 Scalability and conflict-freedom
  3.1 Conflict-freedom and multicore processors
  3.2 Conflict-free operations scale
  3.3 Limitations of conflict-free scalability
  3.4 Summary

4 The scalable commutativity rule
  4.1 Actions
  4.2 SIM commutativity
  4.3 Implementations
  4.4 Rule
  4.5 Example
  4.6 Proof
  4.7 Discussion

5 Designing commutative interfaces
  5.1 Decompose compound operations
  5.2 Embrace specification non-determinism
  5.3 Permit weak ordering
  5.4 Release resources asynchronously

6 Analyzing interfaces using Commuter
  6.1 Analyzer
  6.2 Testgen
  6.3 Mtrace
  6.4 Implementation

7 Conflict-freedom in Linux
  7.1 POSIX test cases
  7.2 Linux conflict-freedom

8 Achieving conflict-freedom in POSIX
  8.1 Refcache: Scalable reference counting
  8.2 RadixVM: Scalable address space operations
  8.3 ScaleFS: Conflict-free file system operations
  8.4 Difficult-to-scale cases

9 Performance evaluation
  9.1 Experimental setup
  9.2 File system microbenchmarks
  9.3 File system application performance
  9.4 Virtual memory microbenchmarks
  9.5 Virtual memory application benchmark
  9.6 Memory overhead
  9.7 Discussion

10 Future directions
  10.1 The non-scalable non-commutativity rule
  10.2 Synchronized clocks
  10.3 Scalable conflicts
  10.4 Not everything can commute
  10.5 Broad conflict-freedom

11 Conclusion
Figures and tables
3-1 A basic cache-coherence state machine.
3-2 Organization of benchmark machines.
3-3 Conflict-free accesses scale.
3-4 Conflicting accesses do not scale.
3-5 Operations scale until they exceed cache or directory capacity.

4-1 Constructed non-scalable implementation mns for history H.
4-2 Constructed scalable implementation m for history H.

6-1 The components of Commuter.
6-2 The SIM commutativity test algorithm.
6-3 A simplified version of our rename model.
6-4 The SIM commutativity test algorithm specialized to two operations.
6-5 Symbolic execution tree of commutes2 for rename/rename.
6-6 An example test case for two rename calls generated by Testgen.

7-1 Conflict-freedom of commutative system call pairs in Linux.

8-1 Conflict-freedom of commutative system call pairs in sv6.
8-2 Refcache example showing a single object over eight epochs.
8-3 Refcache algorithm.
8-4 Key structures in a POSIX VM system.
8-5 Throughput of skip list lookups with concurrent inserts and deletes.
8-6 A radix tree containing a three page file mapping.

9-1 File system microbenchmark throughput.
9-2 Mail server benchmark throughput.
9-3 Virtual memory microbenchmark throughput.
9-4 Metis application scalability.
9-5 Memory usage for alternate VM representations.
One
Introduction
This dissertation presents a pragmatic and formal approach to the design and implementation
of scalable multicore software that spans from the earliest stages of software interface design
through testing and maintenance of complete implementations.
The rest of this chapter introduces the multicore architectures that now dominate general-purpose computing, the problematic ways in which software developers are coping with these
new architectures, and a new interface-driven approach to the design and implementation of
software for multicore architectures.
1.1 Parallelize or perish
In the mid-2000s, there was a fundamental shift in the construction of high performance
software. For decades, CPU clock speeds had ridden the exponential curve of Moore’s Law
and rising clock speeds naturally translated to faster software performance. But higher clock
speeds require more power and generate more heat, and around 2005 CPUs reached the
thermal dissipation limits of a few square centimeters of silicon. CPU architects could no
longer significantly increase the clock speed of a single CPU core, so they began to increase
parallelism instead by putting more CPU cores on the same chip. Now, with the widespread
adoption of multicore architectures for general-purpose computing, parallel programming
has gone from niche to necessary. Total cycles per second continues to grow exponentially, but
now software must be increasingly parallel to take advantage of this growth. Unfortunately,
while software performance naturally scaled with clock speed, scaling with parallelism is an
untamed problem. Even with careful engineering, software rarely achieves the holy grail of
linear scalability, where doubling hardware parallelism doubles the software’s performance.
Operating system kernels exemplify both the importance of parallelism and the difficulty
of achieving it. Many applications depend heavily on the shared services and resources
provided by the kernel. As a result, if the kernel doesn’t scale, many applications won’t scale.
At the same time, the kernel must cope with diverse and unknown workloads while supporting
the combined parallelism of all applications on a computer. Additionally, the kernel’s role as
the arbiter of shared resources makes it particularly susceptible to scalability problems.
Yet, despite the extensive efforts of kernel and application developers alike, scaling software
performance on multicores remains an inexact science dominated by guesswork, measurement, and expensive cycles of redesign. The state of the art for evaluating and improving
the scalability of multicore software is to choose some workload, plot performance at varying numbers of cores, and use tools such as differential profiling [47] to identify scalability
bottlenecks.
This approach focuses developer effort on demonstrable issues, but is ultimately a nearsighted approach. Each new hardware model and workload powers a Sisyphean cycle of
finding and fixing new bottlenecks. Projects such as Linux require continuous infusions of
manpower to maintain their scalability edge. Worse, scalability problems that span layers—for
example, application behavior that triggers kernel bottlenecks—require cross-layer solutions,
and few applications have the reach or resources to accomplish this.
But the deeper problem with this workload-driven approach is that many scalability
problems lie not in the implementation, but in the design of the software interface. By the
time developers have an implementation, a workload, and the hardware to demonstrate a
bottleneck, interface-level solutions may be impractical or impossible.
As an example of interface design that limits implementation scalability, consider the
POSIX open call [63]. This call opens a file by name and returns a file descriptor, a number used
to identify the open file in later operations. Even though this identifier is opaque, POSIX—the
standard for the Unix interface—requires that open return the numerically lowest available
file descriptor for the calling process, forcing the kernel to allocate file descriptors one at
a time, even when many parallel threads are opening files simultaneously. This simplified
the kernel interface during the early days of Unix, but this interface design choice is now
a burden on implementation scalability. It’s an unnecessary burden, too: a simple change
to allow open to return any available file descriptor would enable the kernel to choose file
descriptors scalably. This particular example is well-known [10], but myriad subtler issues
exist in POSIX and other interfaces.
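The cost of the lowest-FD rule is easy to see in miniature. The following toy sketch (our own illustration, not code from any kernel; the class names are invented) contrasts a lowest-available allocator, which must serialize every allocation on one shared table, with a relaxed allocator that hands each thread a disjoint range of identifiers:

    import threading

    class LowestFdTable:
        """POSIX-style policy: return the lowest free descriptor.
        Every alloc() must scan and update one shared set under one
        lock, so concurrent allocations serialize on this structure."""
        def __init__(self):
            self.lock = threading.Lock()
            self.used = set()
        def alloc(self):
            with self.lock:              # global serialization point
                fd = 0
                while fd in self.used:
                    fd += 1
                self.used.add(fd)
                return fd

    class AnyFdTable:
        """Relaxed policy: any unused descriptor is acceptable, so each
        thread can draw from its own pre-assigned range with no shared
        writes on the common path."""
        def __init__(self, nthreads, chunk=1024):
            self.ranges = [iter(range(t * chunk, (t + 1) * chunk))
                           for t in range(nthreads)]
        def alloc(self, thread_id):
            return next(self.ranges[thread_id])   # conflict-free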
Interface design choices have implications for implementation scalability. If interface
designers could distinguish interfaces that definitely have a scalable implementation from
those that don’t, they would have the predictive power to design scalable interfaces that enable
scalable implementations.
1.2 A rule for interface design
This dissertation presents a new approach to designing scalable software that starts with the
design of scalable software interfaces. This approach makes reasoning about multicore scalability possible before an implementation exists and before the necessary hardware is available
to measure the implementation’s scalability. It can highlight inherent scalability problems,
leading to better interface designs. It sets a clear scaling target for the implementation of a
scalable interface. And it enables systematic testing of an implementation’s scalability.
At the core of this dissertation’s approach is this scalable commutativity rule: In any
situation where several operations commute—meaning there’s no way to distinguish their
execution order using the interface—they have an implementation that is conflict-free during
those operations—meaning no core writes a cache line that was read or written by another core.
Empirically, conflict-free operations scale, so this implementation scales. Or, more concisely,
whenever interface operations commute, they can be implemented in a way that scales.
This rule makes intuitive sense: when operations commute, their results (return values and
effect on system state) are independent of order. Hence, communication between commutative
operations is unnecessary, and eliminating it yields a conflict-free implementation. On modern
shared-memory multicores, conflict-free operations can execute entirely from per-core caches,
so the performance of a conflict-free implementation will scale linearly with the number of
cores.
The intuitive version of the rule is useful in practice, but not precise enough to reason
about formally. Therefore, this dissertation formalizes the scalable commutativity rule, proves
that commutative operations have a conflict-free implementation, and demonstrates experimentally that, under reasonable assumptions, conflict-free implementations scale linearly on
modern multicore hardware.
An important consequence of this presentation is a novel form of commutativity we name
SIM commutativity. The usual definition of commutativity (e.g., for algebraic operations)
is so stringent that it rarely applies to the complex, stateful interfaces common in systems
software. SIM commutativity, in contrast, is state-dependent and interface-based, as well as
monotonic. When operations commute in the context of a specific system state, specific
operation arguments, and specific concurrent operations, we show that an implementation
exists that is conflict-free for that state and those arguments and concurrent operations. SIM
commutativity exposes many more opportunities to apply the rule to real interfaces—and thus
discover scalable implementations—than a more conventional notion of commutativity would.
Despite its state dependence, SIM commutativity is interface-based: rather than requiring
all operation orders to produce identical internal states, it requires the resulting states to be
indistinguishable via the interface. SIM commutativity is thus independent of any specific
implementation, enabling developers to apply the rule directly to interface design.
1.3 Applying the rule
The scalable commutativity rule leads to a new way to design scalable software: analyze
the interface’s commutativity; if possible, refine or redesign the interface to improve its
commutativity; and then design an implementation that scales when operations commute.
For example, consider file creation in a POSIX-like file system. Imagine that multiple
processes create files in the same directory at the same time. Can the creation system calls
be made to scale? Our first answer was “obviously not”: the system calls modify the same
directory, so surely the implementation must serialize access to the directory. But it turns
out these operations commute if the two files have different names (and no hard or symbolic
links are involved) and, therefore, have an implementation that scales for such names. One
such implementation represents each directory as a hash table indexed by file name, with
an independent lock per bucket, so that creation of differently named files is conflict-free,
barring hash collisions. Before the rule, we tried to determine if these operations could scale
by analyzing all of the implementations we could think of. This process was difficult, unguided,
and itself did not scale to complex interfaces, which motivated our goal of reasoning about
scalability in terms of interfaces.
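A minimal sketch of this directory design, assuming per-bucket locking is the only concurrency control needed (a real file system has far more going on):

    import threading

    NBUCKETS = 256

    class Directory:
        """A directory as a hash table with an independent lock per
        bucket. Creating two differently named files usually touches
        two different buckets, so the two calls write disjoint memory
        and are conflict-free, barring hash collisions."""
        def __init__(self):
            self.buckets = [{} for _ in range(NBUCKETS)]
            self.locks = [threading.Lock() for _ in range(NBUCKETS)]

        def create(self, name, inode):
            b = hash(name) % NBUCKETS
            with self.locks[b]:          # per-bucket, not per-directory
                if name in self.buckets[b]:
                    raise FileExistsError(name)
                self.buckets[b][name] = inode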
Real-world interfaces and their implementations are complex. Even with the rule, it can
be difficult to spot and reason about all commutative cases. To address this challenge, this dissertation introduces a method to automate reasoning about interfaces and implementations,
embodied in a software tool named Commuter. Commuter takes a symbolic interface model,
computes precise conditions under which sets of operations commute, generates concrete
tests of commutative operations, and uses these tests to reveal conflicts in an implementation.
Any conflicts Commuter finds represent an opportunity for the developer to improve the
scalability of their implementation. This tool can be integrated into the software development process to drive initial design and implementation, to incrementally improve existing
implementations, and to help developers understand the commutativity of an interface.
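To convey the flavor of this analysis, here is a brute-force stand-in for what Commuter does (a sketch only: the real tool symbolically executes an interface model, whereas the commutes function below checks a much weaker, fully concrete notion of commutativity on invented toy operations):

    import copy

    def commutes(state, op_a, op_b):
        """Do two concrete operations commute in `state`?  Both orders
        must produce the same responses and indistinguishable state."""
        s1, s2 = copy.deepcopy(state), copy.deepcopy(state)
        ra_first, rb_second = op_a(s1), op_b(s1)
        rb_first, ra_second = op_b(s2), op_a(s2)
        return ra_first == ra_second and rb_first == rb_second and s1 == s2

    def create(name):
        """A toy model of file creation in a single directory."""
        def op(fs):
            if name in fs:
                return "EEXIST"
            fs[name] = True
            return "ok"
        return op

    for a, b in [("x", "y"), ("x", "x")]:
        print(a, b, commutes({}, create(a), create(b)))
    # x y True   -- different names commute: emit a scalability test
    # x x False  -- same name: the responses depend on order

Commuter itself derives such conditions symbolically rather than by trying concrete values, and then generates tests from them.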
We apply Commuter to a model of 18 POSIX file system and virtual memory system
calls. From this model, Commuter generates 26,238 tests of commutative system call pairs,
all of which can be made conflict-free according to the rule. Applying this suite to Linux,
we find that the Linux kernel is conflict-free for 17,206 (65%) of these cases. Many of the
commutative cases where Linux is not conflict-free are important to applications—such as
commutative mmaps and creating different files in a shared directory—and reflect bottlenecks
found in previous work [11]. Others reflect previously unknown problems that may become
bottlenecks on future machines or workloads.
Finally, to demonstrate the application of the rule and Commuter to the design and
implementation of a real system, we use these tests to guide the implementation of a new
research operating system kernel named sv6. sv6 doubles as an existence proof showing
that the rule can be applied fruitfully to the design and implementation of a large software
system, and as an embodiment of several novel scalable kernel implementation techniques.
Commuter verifies that sv6 is conflict-free for 26,115 (99%) of the tests generated by our
POSIX model and confirms that sv6 addresses many of the sources of conflicts found in the
Linux kernel. sv6’s conflict-free implementations of commutative system calls translate to
dramatic improvements in measured scalability for both microbenchmarks and application
benchmarks.
1.4 Contributions
This dissertation’s broad contribution is a new approach to building scalable software using
interface-based reasoning to guide design, implementation, and testing. This dissertation
makes the following intellectual contributions:
• The scalable commutativity rule, its formalization, and a proof of its correctness.
• SIM commutativity, a novel form of interface commutativity that is state-sensitive and
interface-based. As we demonstrate with Commuter, SIM commutativity enables us
to identify myriad commutative cases in the highly stateful POSIX interface.
• A set of guidelines for commutative interface design based on SIM commutativity.
Using these guidelines, we propose specific enhancements to POSIX and empirically
demonstrate that these changes enable dramatic improvements in application scalability.
• An automated method for reasoning about interface commutativity and generating
implementation scalability tests using symbolic execution. This method is embodied in
a new tool named Commuter, which generates 26,238 tests of commutative operations
in our model of 18 POSIX file system and virtual memory system operations. These
tests cover many subtle cases, identify many substantial scalability bottlenecks in the
Linux kernel, and guide the implementation of sv6.
This dissertation also contributes several operating system implementation techniques:
• Refcache, a novel scalable reference counting scheme that achieves conflict-free increment and decrement operations with efficient periodic reconciliation and O(1)
per-object space overhead.
• RadixVM, a POSIX virtual memory system based on radix trees that is conflict-free
for commutative mmap, munmap, and pagefault operations.
• ScaleFS, a POSIX file system that is conflict-free for the vast majority of commutative
file system operations.
We validate RadixVM and ScaleFS, and thus their design methodology based on the rule and
Commuter, by evaluating their performance and scalability on an 80-core x86 machine.
The source code to all of the software produced for this dissertation is publicly available
under an MIT license from http://pdos.csail.mit.edu/commuter.
1.5 Outline
The rest of this dissertation presents the scalable commutativity rule in depth and explores its
consequences from interface design to implementation to testing.
We begin by relating our thinking about scalability to previous work in chapter 2.
We then turn to formalizing and proving the scalable commutativity rule, which we
approach in two steps. First, chapter 3 establishes experimentally that conflict-free operations
are generally scalable on modern, large multicore machines. Chapter 4 then formalizes the rule,
develops SIM commutativity, and proves that commutative operations can have conflict-free
(and thus scalable) implementations.
We next turn to applying the rule. Chapter 5 starts by applying the rule to interface design,
developing a set of guidelines for designing interfaces that enable scalable implementations
and proposing specific modifications to POSIX that broaden its commutativity.
Chapter 6 presents Commuter, which uses the rule to automate reasoning about interface
commutativity and the conflict-freedom of implementations. Chapter 7 uses Commuter
to analyze the Linux kernel and demonstrates that Commuter can systematically pinpoint
significant scalability problems even in mature systems.
Finally, we turn to the implementation of scalable systems guided by the rule. Chapter 8
describes the implementation of sv6 and how it achieves conflict-freedom for the vast majority
of commutative POSIX file system and virtual memory operations. Chapter 9 confirms that
theory translates into practice by evaluating the performance and scalability of sv6 on real
hardware for several microbenchmarks and application benchmarks.
Chapter 10 takes a step back and explores some promising future directions for this work.
Chapter 11 concludes.
Two
Related work
The scalable commutativity rule is, to the best of our knowledge, the first observation to directly
connect scalability to interface commutativity. This chapter relates the rule and its use in sv6
and Commuter to prior work.
2.1 Thinking about scalability
Israeli and Rappoport introduce the notion of disjoint-access-parallel memory systems [41].
Roughly, if a shared memory system is disjoint-access-parallel and a set of processes access
disjoint memory locations, then those processes scale linearly. Like the commutativity rule,
this is a conditional scalability guarantee: if the application uses shared memory in a particular
way, then the shared memory implementation will scale. However, where disjoint-access
parallelism is specialized to the memory system interface, our work encompasses any software
interface. Attiya et al. extend Israeli and Rappoport’s definition to additionally require non-disjoint reads to scale [3]. Our work builds on the assumption that memory systems behave
this way, and we confirm that real hardware closely approximates this behavior (chapter 3).
Both the original disjoint-access parallelism paper and subsequent work, including the
paper by Roy et al. [56], explore the scalability of processes that have some amount of non-disjoint sharing, such as compare-and-swap instructions on a shared cache line or a shared
lock. Our work takes a black-and-white view because we have found that, on real hardware, a
single modified shared cache line can wreck scalability (chapters 3 and 9).
The Laws of Order [2] explore the relationship between the strong non-commutativity of
an interface and whether any implementation of that interface must have atomic instructions
or fences (e.g., mfence on the x86) for correct concurrent execution. These instructions slow
down execution by interfering with out-of-order execution, even if there are no memory
access conflicts. The Laws of Order resemble the commutativity rule, but draw conclusions
about sequential performance, rather than scalability. Paul McKenney explores the Laws of
Order in the context of the Linux kernel, and points out that the Laws of Order may not apply
if linearizability is not required [48].
It is well understood that cache-line contention can result in bad scalability. A clear
example is the design of the MCS lock [50], which eliminates scalability collapse by avoiding contention for a particular cache line. Other good examples include scalable reference
counters [19, 29]. The commutativity rule builds on this understanding and identifies when
arbitrary interfaces can avoid conflicting memory accesses.
2.2 Designing scalable operating systems
Practitioners often follow an iterative process to improve scalability: design, implement, measure, repeat [14]. Through a great deal of effort, this approach has led kernels such as Linux to
scale well for many important workloads. However, Linux still has many scalability bottlenecks, and absent a method for reasoning about interface-level scalability, it is unclear which
of the bottlenecks are inherent to its system call interface. This dissertation identifies situations
where POSIX permits or limits scalability and points out specific interface modifications that
would permit greater implementation scalability.
Venerable scalable kernels like K42 [1], Tornado [31], and Hurricane [65] have developed
design patterns like clustered objects and locality-preserving IPC as general building blocks
of scalable kernels. These patterns complement the scalable commutativity rule by suggesting
practical ways to achieve conflict-freedom for commutative operations as well as ways to cope
with non-commutative operations.
Multikernels for multicore processors aim for scalability by avoiding shared data structures
in the kernel [4, 68]. These systems implement shared abstractions using distributed systems
techniques (such as name caches and state replication) on top of message passing. It should be
possible to generalize the commutativity rule to distributed systems, and relate the interface
exposed by a shared abstraction to its scalability, even if implemented using message passing.
The designers of the Corey operating system [10] argue for putting the application in
control of managing the cost of sharing without providing a guideline for how applications
should do so; the commutativity rule could be a helpful guideline for application developers.
2.3 Commutativity
The use of commutativity to increase concurrency has been widely explored. Steele describes
a parallel programming discipline in which all operations must be either causally related
or commutative [61]. His work approximates commutativity as conflict-freedom. This dissertation shows that commutative operations always have a conflict-free implementation,
making Steele’s model more broadly applicable. Rinard and Diniz describe how to exploit
commutativity to automatically parallelize code [55]. They allow memory conflicts, but generate synchronization code to ensure atomicity of commutative operations. Similarly, Prabhu
et al. describe how to automatically parallelize code using manual annotations rather than
automatic commutativity analysis [53]. Rinard and Prabhu’s work focuses on the safety of
executing commutative operations concurrently. This gives operations the opportunity to
scale, but does not ensure that they will. Our work focuses on scalability directly: given
concurrent, commutative operations, we show they have a scalable implementation.
The database community has long used logical readsets and writesets, conflicts, and
execution histories to reason about how transactions can be interleaved while maintaining
serializability [7]. Weihl extends this work to abstract data types by deriving lock conflict
relations from operation commutativity [67]. Transactional boosting applies similar techniques in the context of software transactional memory [35]. Shapiro et al. extend this to a
distributed setting, leveraging commutative operations in the design of replicated data types
that support updates during faults and network partitions [59, 60]. Like Rinard and Prabhu’s
work, the work in databases and its extensions focuses on the safety of executing commutative
operations concurrently, not directly on scalability.
2.4 Test case generation
Prior work on concolic testing [33, 58] and symbolic execution [12, 13] generates test cases by
symbolically executing a specific implementation. Our Commuter tool uses a combination
of symbolic and concolic execution, but generates test cases for an arbitrary implementation
based on a model of that implementation’s interface. This resembles QuickCheck’s [15] or
Gast’s [42] model-based testing, but uses symbolic techniques. Furthermore, while symbolic
execution systems often avoid reasoning precisely about symbolic memory accesses (e.g.,
accessing a symbolic offset in an array), Commuter’s test case generation aims to achieve
conflict coverage (section 6.2), which tests different access patterns when using symbolic
addresses or indexes.
Three
Scalability and conflict-freedom
Understanding multicore scalability requires first understanding the hardware. This chapter
shows that, under reasonable assumptions, conflict-free operations scale linearly on modern
multicore hardware. The following chapter will use conflict-freedom as a stepping-stone in
establishing the scalable commutativity rule.
3.1 Conflict-freedom and multicore processors
The connection between conflict-freedom and scalability mustn’t be taken for granted. Indeed,
some early multi-processor architectures such as the Intel Pentium depended on shared buses
with global lock lines [40: §8.1.4], so even conflict-free operations did not scale.
Today’s multicores avoid such centralized components. Modern, large, cache-coherent
multicores utilize peer-to-peer interconnects between cores and sockets; partition and distribute physical memory between sockets (NUMA); and have deep cache hierarchies with
per-core write-back caches. To maintain a unified, globally consistent view of memory despite
their distributed architecture, multicores depend on MESI-like coherence protocols [52] to
coordinate ownership of cached memory. A key invariant of these coherence protocols is that
either 1) a cache line is not present in any cache, 2) a mutable copy is present in a single cache,
or 3) the line is present in any number of caches but is immutable. Maintaining this invariant
requires coordination, and this is where the connection to scalability lies.
Figure 3-1 shows the basic state machine implemented by each cache for each cache line.
This maintains the above invariant by ensuring a cache line is either “invalid” in all caches,
“modified” in one cache and “invalid” in all others, or “shared” in any number of caches. Practical implementations add further states—MESI’s “exclusive” state, Intel’s “forward” state [34],
and AMD’s “owned” state [27: §7.3]—but these do not change the basic communication
required to maintain cache coherence.
Figure 3-1: A basic cache-coherence state machine. “R” and “W” indicate local read and write operations, while “rR” and “rW” indicate remote read and write operations. Thick red lines show operations that cause communication. Thin green lines show operations that occur without communication.

Roughly, a set of operations scales when maintaining coherence does not require communication in the steady state. There are three memory access patterns that do not require communication:
• Multiple cores reading different cache lines. This scales because, once each cache line is
in each core’s cache, no further communication is required to access it, so further reads
can proceed independently of concurrent operations.
• Multiple cores writing different cache lines scales for much the same reason.
• Multiple cores reading the same cache line scales. A copy of the line can be kept in each
core’s cache in shared mode, which further reads from those cores can access without
communication.
That is, when memory accesses are conflict-free, they do not require communication. Furthermore, higher-level operations composed of conflict-free reads and writes are themselves
conflict-free and will also execute independently and in parallel. In all of these cases, conflict-free operations execute in the same time in isolation as they do concurrently, so the total
throughput of N such concurrent operations is proportional to N. Therefore, given a perfect
implementation of MESI, conflict-free operations scale linearly. The following sections verify
this assertion holds on real hardware under reasonable workload assumptions and explore
where it breaks down.
The converse is also true: conflicting operations cause cache state transitions and the
resulting coordination limits scalability. That is, if a cache line written by one core is read or
written by other cores, those operations must coordinate and, as a result, will slow each other
down. While this doesn’t directly concern the scalable commutativity rule (which says only
when operations can be conflict-free, not when they must be conflicted), the huge effect that
conflicts can have on scalability affirms the importance of conflict-freedom. The following
sections also demonstrate the effect of conflicts on real hardware.
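This communication structure can be made concrete with a toy simulator of the state machine in Figure 3-1 (our own sketch, modeling only the basic MSI invariant and none of the extra states or timing): it tracks one line's state per cache and counts the accesses that require cross-cache communication.

    class Coherence:
        """Toy MSI simulator, one cache line per address.  Counts the
        accesses that require communication with other caches."""
        def __init__(self):
            self.state = {}      # (core, addr) -> 'I', 'S', or 'M'
            self.messages = 0

        def access(self, core, addr, write):
            mine = self.state.get((core, addr), 'I')
            others = [k for k, s in self.state.items()
                      if k[1] == addr and k[0] != core and s != 'I']
            if write:
                if mine != 'M' or others:
                    self.messages += 1   # gain exclusive ownership
                    for k in others:
                        self.state[k] = 'I'
                self.state[(core, addr)] = 'M'
            else:
                if mine == 'I':
                    self.messages += 1   # fetch a copy
                    for k in others:
                        self.state[k] = 'S'   # demote any writer
                if mine != 'M':
                    self.state[(core, addr)] = 'S'

    sim = Coherence()
    for _ in range(1000):
        for core in range(4):
            sim.access(core, addr=core, write=True)  # conflict-free
    conflict_free = sim.messages                     # 4: cold misses only

    sim = Coherence()
    for _ in range(1000):
        for core in range(4):
            sim.access(core, addr=0, write=True)     # conflicting
    print(conflict_free, sim.messages)               # 4 versus 4000

Disjoint or read-shared access patterns settle into a silent steady state, while the conflicting pattern pays one coherence event on every single write.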
3.2 Conflict-free operations scale

We use two machines to evaluate conflict-free and conflicting operations on real hardware: an 80-core (8 sockets × 10 cores) Intel Xeon E7-8870 (the same machine used for evaluation in chapter 9) and, to show that our conclusions generalize, a 48-core (8 sockets × 6 cores) AMD Opteron 8431. Both are cc-NUMA x86 machines with directory-based cache coherence, but the two manufacturers use different architectures, interconnects, and coherence protocols. Figure 3-2 shows how the two machines are broadly organized.

Figure 3-2: Organization of Intel and AMD machines used for benchmarks [18, 21, 22].
Figure 3-3 shows the time required to perform conflict-free memory accesses from varying
numbers of cores. The first benchmark, shown in the top row of Figure 3-3, stresses read/read
sharing by repeatedly reading the same cache line from N cores. The latency of these reads
remains roughly constant regardless of N. After the first access from each core, the cache line
remains in each core’s local cache, so later accesses occur locally and independently, allowing
read/read accesses to scale perfectly. Reads of different cache lines from different cores (not
shown) yield identical results to reads of the same cache line.
The bottom row of Figure 3-3 shows the results of stressing conflict-free writes by assigning
each core a different cache line and repeatedly writing these cache lines from each of N
cores. In this case these cache lines enter a “modified” state at each core, but then remain
in that state, so as with the previous benchmark, further writes can be performed locally
and independently. Again, latency remains constant regardless of N, demonstrating that
conflict-free write accesses scale.
Figure 3-3: Conflict-free accesses scale. Each graph shows the cycles required to perform a conflict-free read or write from N cores. Shading indicates the latency distribution for each N (darker shading indicates higher frequency).

Figure 3-4 turns to the cost of conflicting accesses. The top row shows the latency of N cores writing the same cache line simultaneously. The cost of a write/write conflict grows
dramatically as the number of writing cores increases because ownership of the modified
cache line must pass to each writing core, one at a time. On both machines, we also see a
uniform distribution of write latencies, which further illustrates this serialization, as some
cores acquire ownership quickly, while others take much longer.
For comparison, an operation like open typically takes 1,000–2,000 cycles on these machines, while a single conflicting write instruction can take upwards of 50,000 cycles. In the
time it takes one thread to execute this one machine instruction, another could open 25 files.
The bottom row of Figure 3-4 shows the latency of N cores simultaneously reading a
cache line last written by core 0 (a read/write conflict). For the AMD machine, the results are
nearly identical to the write/write conflict case, since this machine serializes requests for the
cache line at CPU 0’s socket. On the Intel machine, the cost of read/write conflicts also grows,
albeit more slowly, as Intel’s architecture aggregates the read requests at each socket. We see
this effect in the latency distribution, as well, with read latency exhibiting up to eight different
modes. These modes reflect the order in which the eight sockets’ aggregated read requests are
served by CPU 0’s socket. Intel’s optimization helps reduce the absolute latency of reads, but
nevertheless, read/write conflicts do not scale on either machine.
Figure 3-4: Conflicting accesses do not scale. Each graph shows the cycles required to perform
a conflicting read or write from N cores. Shading indicates the latency distribution for each
N (estimated using kernel density estimation).
3.3 Limitations of conflict-free scalability
Conflict-freedom is a good predictor of scalability on real hardware, but it’s not perfect.
Limited cache capacity and associativity cause caches to evict cache lines (later resulting in
cache misses) even in the absence of coherence traffic. And, naturally, a core’s very first access
to a cache line will miss. Such misses directly affect sequential performance, but they may
also affect the scalability of conflict-free operations. Satisfying a cache miss (due to conflicts
or capacity) requires the cache to fetch the cache line from another cache or from memory.
If this requires communicating with remote cores or memory, the fetch may contend with
concurrent operations for interconnect resources or memory controller bandwidth.
Applications with good cache behavior are unlikely to exhibit such issues, while applications with poor cache behavior usually have sequential performance problems that outweigh
scalability concerns. Nevertheless, it’s important to understand where our assumptions about
conflict-freedom break down.
Figure 3-5: Operations scale until they exceed cache or directory capacity. Each graph shows the latency for repeatedly reading a shared region of memory (top) and writing separate per-core regions (bottom), as a function of region size and number of cores.

Figure 3-5 shows the results of a benchmark that explores some of these limits by performing conflict-free accesses to regions of varying sizes from varying numbers of cores. This
benchmark stresses the worst case: each core reads or writes in a tight loop and all memory is
physically allocated from CPU 0’s socket, so all misses contend for that socket’s resources.
The top row of Figure 3-5 shows the latency of reads to a shared region of memory. On both
machines, we observe slight increases in latency as the region exceeds the L1 cache and later
the L2 cache, but the operations continue to scale until the region exceeds the L3 cache. At
this point the benchmark becomes bottlenecked by the DRAM controller of CPU 0’s socket,
so the reads no longer scale, despite being conflict-free.
We observe a similar effect for writes, shown in the bottom row of Figure 3-5. On the
Intel machine, the operations scale until the combined working set of the cores on a socket
exceeds the socket’s L3 cache size. On the AMD machine, we observe an additional effect for
smaller regions at high core counts: in this machine, each socket has a 1 MB directory for
tracking ownership of that socket’s physical memory, which this benchmark quickly exceeds.
These benchmarks show some of the limitations to how far we can push conflict-freedom
before it no longer aligns with scalability. Nevertheless, even in the worst cases demonstrated
by these benchmarks, conflict-free operations both perform and scale far better than conflicted
operations.
3.4 Summary
Despite some limitations, conflict-freedom is a good predictor of linear scalability in practice.
Most software has good cache locality and high cache hit rates both because this is crucial
for sequential performance, and because it’s in the interest of CPU manufacturers to design
caches that fit typical working sets. For workloads that exceed cache capacity, NUMA-aware
allocation spreads physical memory use across sockets and DRAM controllers, partitioning
physical memory access, distributing the DRAM bottleneck, and giving cores greater aggregate
DRAM bandwidth.
Chapter 9 will return to hard numbers on real hardware to show that conflict-free implementations of commutative interfaces enable software to scale not just at the level of memory
microbenchmarks, but at the level of an entire OS kernel and its applications.
Four
The scalable commutativity rule
This chapter addresses two questions: What is the precise definition of the scalable commutativity rule, and why is the rule true? We answer these questions using a formalism based on
abstract actions, histories, and implementations, combined with the empirical result of the
previous chapter. This formalism relies on a novel form of commutativity, SIM commutativity, whose generality makes it possible to broadly apply the scalable commutativity rule to
complex software interfaces. Our constructive proof of the scalable commutativity rule also
sheds some light on how real conflict-free implementations might be built, though the actual
construction is not very practical.
4.1 Actions
Following earlier work [37], we model a system execution as a sequence of actions, where an
action is either an invocation or a response. In the context of an operating system, an invocation
represents a system call with arguments (such as getpid() or open("file", O_RDWR)) and a
response represents the corresponding return value (a PID or a file descriptor). Invocations
and responses are paired. Each invocation is made by a specific thread, and the corresponding
response is returned to the same thread. An action thus comprises (1) an operation class
(e.g., which system call is being invoked); (2) operation arguments (for invocations) or a
return value (for responses); (3) the relevant thread; and (4) a tag for uniqueness. We’ll
write an invocation as a plain letter, such as A (“invoke A”), and its response as the same
letter primed, A′ (“respond A”); matching letters pair an invocation with its response. Each
action belongs to a specific thread, which we note where it matters.
A system execution is called a history. For example
H = A B C A′ C′ B′ D D′ E E′ F G H F′ H′ G′
consists of eight invocations and eight corresponding responses across three different threads.
We’ll consider only well-formed histories, in which each thread’s actions form a sequence of
invocation–response pairs. H above is well-formed; checking this for the thread t that executes A, D, and H, we see
that the thread-restricted subhistory H∣t = A A′ D D′ H H′ formed by selecting t’s actions
from H alternates invocations and responses as one would want. In a well-formed history,
each thread has at most one outstanding invocation at every point.
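These definitions are mechanical enough to execute. The following sketch (our own rendering of the formalism, with invented field names) represents actions as tuples and checks well-formedness one thread-restricted subhistory at a time:

    from collections import namedtuple

    # kind is 'inv' or 'res'; matching tags pair an invocation with its
    # response, and thread records which thread performs the action.
    Action = namedtuple('Action', ['kind', 'op', 'thread', 'tag'])

    def subhistory(history, thread):
        """H|t: the subsequence of H consisting of thread t's actions."""
        return [a for a in history if a.thread == thread]

    def well_formed(history):
        """Each thread's subhistory must alternate invocation, response,
        with every response matching the preceding invocation's tag."""
        for t in {a.thread for a in history}:
            pending = None          # tag of the outstanding invocation
            for a in subhistory(history, t):
                if a.kind == 'inv':
                    if pending is not None:
                        return False   # two outstanding invocations
                    pending = a.tag
                else:
                    if pending != a.tag:
                        return False   # response with no invocation
                    pending = None
        return True

    H = [Action('inv', 'A', 't1', 1), Action('inv', 'B', 't2', 2),
         Action('res', 'B', 't2', 2), Action('res', 'A', 't1', 1)]
    assert well_formed(H)    # the history A B B' A' is well-formed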
The specification distinguishes whether or not a history is “correct.” A specification S is
a prefix-closed set of well-formed histories. The set’s contents depend on the system being
modeled; for example, if S specified a Unix-like OS, then [A = getpid(), A′ = 0] ∉ S, since
no Unix thread can have PID 0. Our definitions and proof require that some specification
exists, but we aren’t concerned with how it is constructed.
4.2 SIM commutativity
Commutativity captures the idea that the order of a set of actions “doesn’t matter.” This
happens when the caller of an interface cannot distinguish the order in which events actually
occurred, either through those actions’ responses or through any possible future actions.
To formalize this inability to distinguish orders, we use the specification. A set of actions
commutes in some context when the specification is indifferent to the execution order of that
set. This means that any response valid for one order of the commutative set is valid for any
order of the commutative set, and likewise any response invalid for one order is invalid for
any order. The rest of this section formalizes this intuition as SIM commutativity.
Our definition has two goals: state dependence and interface basis. State dependence
means SIM commutativity must capture when operations commute in some states, even if
those same operations do not commute in other states. This is important because it allows
the rule to apply to a much broader range of situations than traditional non-state-dependent
notions of commutativity. For example, few OS system calls unconditionally commute in
every state, but many system calls commute in restricted states. Consider POSIX’s open call.
In general, two calls to open("a", O_CREAT|O_EXCL) don’t commute: one call will create the file
and the other will fail because the file already exists. However, two such calls do commute if
called from processes with different working directories; or if the file "a" already exists (both
calls will return the same error). State dependence makes it possible to distinguish these
cases, even though the operations are the same in each. This, in turn, means the scalable
commutativity rule can tell us that scalable implementations exist in all of these commutative
cases.
Interface basis means SIM commutativity must evaluate the consequences of execution
order using only the specification, without reference to any particular implementation. Since
our goal is to reason about possible implementations, it’s necessary to capture the scalability
inherent in the interface itself. This in turn makes it possible to use the scalable commutativity
rule early in software development, during interface design and initial implementation.
The right definition for commutativity that achieves both of these goals is a little tricky,
so we build it up in two steps.
Definition. An action sequence, or region, H ′ is a reordering of an action sequence H when
H∣t = H′∣t for every thread t. That is, regions H and H′ contain the same invocations and
responses in the same order for each individual thread, but may interleave threads differently.
If H = A B A′ C B′ C′, where one thread executes A and C and another executes B, then
B B′ A A′ C C′ is a reordering of H, but B C B′ C′ A A′ is not, since it doesn’t respect the
order of actions in the thread that executes A and C.
Definition. Consider a history H = X ∣∣ Y (where ∣∣ concatenates action sequences). Y
SI-commutes in H when given any reordering Y ′ of Y, and any action sequence Z,
X ∣∣ Y ∣∣ Z ∈ S if and only if X ∣∣ Y ′ ∣∣ Z ∈ S .
This definition captures the state dependence and interface basis we need. The action sequence
X puts the system into the state we wish to consider, without specifying any particular
representation of that state (which would depend on an implementation). Switching regions
Y and Y ′ requires that the exact responses in Y remain valid according to the specification
even if Y is reordered. The presence of region Z in both histories requires that reorderings of
actions in region Y are indistinguishable by future operations.
Unfortunately, SI commutativity doesn’t suffice for our needs because it is non-monotonic.
Given an action sequence X ∣∣ Y1 ∣∣ Y2 , it is possible for Y1 ∣∣ Y2 to SI-commute after region X
even though Y1 without Y2 does not. For example, consider a get/set interface and the history
Y = [A = set(1), A′, B = set(2), B′, C = set(2), C′],
where Y1 = [A = set(1), A′, B = set(2), B′], Y2 = [C = set(2), C′], and A and C are invoked
by one thread while B is invoked by another.
Y SI-commutes because set returns nothing and every reordering sets the underlying value to 2,
so future get operations cannot distinguish reorderings of Y. However, its prefix Y1 alone does
not SI-commute because some orders set the value to 1 and some to 2, so future get operations
can distinguish them. Whether or not Y1 will ultimately form part of a commutative region
thus depends on future operations! This is usually incompatible with scalability: operations
in Y1 must “plan for the worst” by remembering their order, even if they ultimately form part
of an SI-commutative region. Requiring monotonicity eliminates this problem.
Definition. An action sequence Y SIM-commutes in a history H = X ∣∣ Y when for any
prefix P of any reordering of Y (including P = Y), P SI-commutes in X ∣∣ P.
Returning to the get/set example, while the sequence Y given in the example SI-commutes
(in any history), Y does not SIM-commute because its prefix Y1 is not SI-commutative.
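The get/set example can be checked mechanically. In the sketch below (our own code, not part of the dissertation's formal development), si_commutes approximates "indistinguishable by future operations" by comparing the register value that any future get would observe, which suffices for a region consisting only of sets:

    def reorderings(region):
        """All interleavings preserving each thread's internal order."""
        threads = {}
        for t, op in region:
            threads.setdefault(t, []).append(op)
        def go(remaining):
            if all(not q for q in remaining.values()):
                yield []
                return
            for t in remaining:
                if remaining[t]:
                    rest = {k: list(v) for k, v in remaining.items()}
                    first = rest[t].pop(0)
                    for tail in go(rest):
                        yield [(t, first)] + tail
        return list(go(threads))

    def final_value(region):
        """Replay a region of set() operations against a register."""
        val = None
        for _, (_, v) in region:
            val = v
        return val

    def si_commutes(region):
        """Every reordering must be indistinguishable to future gets."""
        return len({final_value(r) for r in reorderings(region)}) == 1

    def sim_commutes(region):
        """Every prefix of every reordering must itself SI-commute."""
        return all(si_commutes(r[:n])
                   for r in reorderings(region)
                   for n in range(1, len(r) + 1))

    # A and C on thread t1, B on thread t2, as in the example above.
    Y = [('t1', ('set', 1)), ('t2', ('set', 2)), ('t1', ('set', 2))]
    print(si_commutes(Y))    # True: every reordering leaves the value 2
    print(sim_commutes(Y))   # False: the prefix [set(1), set(2)] fails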
Like SI commutativity, SIM commutativity captures state dependence and interface basis.
Unlike SI commutativity, SIM commutativity excludes cases where the commutativity of a
region changes depending on future operations and, as we show below, suffices to prove the
scalable commutativity rule.
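
To make the monotonicity requirement concrete, the definitions can be checked by brute force. The following sketch is our own illustration, not part of the formalism: it enumerates reorderings of the get/set example, collapsing each invocation/response pair into a single atomic action (safe for a sequentially consistent specification) and placing A's and C's calls on one thread and B's on another, as in the reordering example above.

    from itertools import permutations

    def reorderings(region):
        # All interleavings of region that preserve each thread's own order.
        threads = {t for t, _, _ in region}
        out = set()
        for perm in permutations(range(len(region))):
            cand = tuple(region[i] for i in perm)
            if all([a for a in cand if a[0] == t] ==
                   [a for a in region if a[0] == t] for t in threads):
                out.add(cand)
        return out

    def run(value, region):
        # Sequential replay of a get/set register; set returns None.
        responses = []
        for _, op, arg in region:
            if op == 'set':
                value = arg
                responses.append(None)
            else:
                responses.append(value)
        return responses, value

    def si_commutes(value0, region):
        # Every reordering must give the same responses and a state that
        # future operations cannot distinguish (here, the same value).
        base = run(value0, list(region))
        return all(run(value0, list(r)) == base for r in reorderings(region))

    def sim_commutes(value0, region):
        # Additionally, every prefix of every reordering must SI-commute.
        return all(si_commutes(value0, r[:n])
                   for r in reorderings(region)
                   for n in range(1, len(region) + 1))

    Y1 = [(1, 'set', 1), (2, 'set', 2)]   # thread 1: set(1); thread 2: set(2)
    Y  = Y1 + [(1, 'set', 2)]             # thread 1 then issues set(2)
    print(si_commutes(0, Y))     # True: every reordering leaves the value 2
    print(si_commutes(0, Y1))    # False: the final value depends on the order
    print(sim_commutes(0, Y))    # False: the prefix Y1 does not SI-commute

Running it confirms the example: Y SI-commutes, its prefix Y1 does not, and therefore Y does not SIM-commute.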
4.3 Implementations
To reason about the scalability of an implementation of an interface, we need to model
implementations in enough detail to tell whether different threads’ “memory accesses” are
conflict-free. We represent an implementation as a step function: given a state and an invocation, it produces a new state and a response. We can think of this step function as being
invoked by a driver algorithm (a scheduler in operating systems parlance) that repeatedly picks
which thread to take a step on and passes state from one application of the step function to
the next. Special yield responses let the step function request that the driver “run” a different
thread and let us represent concurrent overlapping operations and blocking operations.
We begin by defining three sets:
• S is the set of implementation states.
• I is the set of valid invocations, including continue.
• R is the set of valid responses, including yield.
Definition. An implementation m is a function in S × I ↦ S × R. Given an old state and
an invocation, the implementation produces a new state and a response. The response must
have the same thread as the invocation. A yield response indicates that a real response
for that thread is not yet ready, and gives the driver the opportunity to take a step on a
different thread without a real response from the current thread. continue invocations give
the implementation an opportunity to complete an outstanding request on that thread (or
further delay its response).¹
An implementation generates a history when calls to the implementation (perhaps including continue invocations) could potentially produce the corresponding history. For example,
this sequence shows an implementation m generating the history [inv A, inv B, res B, res A]:
• m(s0, inv A) = ⟨s1, yield⟩
• m(s1, inv B) = ⟨s2, yield⟩
• m(s2, continue) = ⟨s3, yield⟩
• m(s3, continue) = ⟨s4, res B⟩
• m(s4, continue) = ⟨s5, res A⟩
¹There are restrictions on how a driver can choose arguments to the step function. We assume, for example,
that it passes a continue invocation for thread t if and only if the last step on t returned yield. Furthermore,
since implementations are functions, they must be deterministic. We could model implementations instead as
relations, allowing non-determinism, though this would complicate later arguments somewhat.
The state is threaded from step to step; invocations appear as arguments and responses as
return values. The generated history consists of the invocations and responses, in order, with
yields and continues removed.
An implementation m is correct for some specification S when the responses it generates
are always allowed by the specification. Specifically, let H be a valid history that can be
generated by m. We say that m is correct when every such H is in S . Note that a correct
implementation need not be capable of generating every possible legal response or every
possible history in S ; it’s just that every response it does generate is legal.
To reason about conflict freedom, we must peek into implementation states, identify reads
and writes, and check for access conflicts. Let each state s ∈ S be a tuple ⟨s.0, . . . , s.j⟩, and let s_{i←x} indicate component replacement: s_{i←x} = ⟨s.0, . . . , s.(i−1), x, s.(i+1), . . . , s.j⟩.
Now consider an implementation step m(s, a) = ⟨s′, r⟩. This step writes state component i when s.i ≠ s′.i. It reads state component i when s.i may affect the step's behavior; that is, when for some y,
m(s_{i←y}, a) ≠ ⟨s′_{i←y}, r⟩.
Two implementation steps have an access conflict when they are on different threads and one
writes a state component that the other either writes or reads. This notion of access conflicts
maps directly onto the read and write access conflicts on real shared-memory machines
explored in chapter 3. A set of implementation steps is conflict-free when no pair of steps in
the set has an access conflict; that is, no thread’s steps read or write a state component written
by another thread’s steps.
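
These definitions are directly executable. The following sketch (illustrative names throughout, with the existential "for some y" in the read test approximated by a few probe values) computes the read and write sets of a step for a toy implementation whose state components are per-thread counters, then checks a set of steps for conflict-freedom:

    def replace(s, i, x):
        # Component replacement s_{i<-x}: state s with component i set to x.
        return s[:i] + (x,) + s[i + 1:]

    def footprint(m, s, a, probes):
        # Read/write sets of the step m(s, a) = (s2, r). A component is
        # written when it changes, and read when some probe value y makes
        # m(s_{i<-y}, a) differ from (s2_{i<-y}, r). Probing a few y values
        # only approximates the definition's "for some y".
        s2, r = m(s, a)
        writes = {i for i in range(len(s)) if s[i] != s2[i]}
        reads = {i for i in range(len(s)) for y in probes
                 if m(replace(s, i, y), a) != (replace(s2, i, y), r)}
        return reads, writes

    def conflict_free(steps):
        # steps: (thread, reads, writes) triples. Conflict-free when no
        # thread's writes intersect another thread's reads or writes.
        return not any(t1 != t2 and (w1 & (r2 | w2))
                       for t1, r1, w1 in steps
                       for t2, r2, w2 in steps)

    def m(s, a):
        # Toy implementation: component t is thread t's private counter,
        # and invocation (t, 'inc') increments it and returns the new value.
        t, _ = a
        return replace(s, t, s[t] + 1), s[t] + 1

    s0 = (0, 0)
    steps = [(t,) + footprint(m, s0, (t, 'inc'), probes=(0, 1, 2))
             for t in (0, 1)]
    print(conflict_free(steps))   # True: the steps touch disjoint components

The two inc steps touch disjoint state components, so the sketch reports them conflict-free, which is exactly the property the rule promises for commutative regions.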
4.4 Rule
We can now formally state the scalable commutativity rule. Assume a specification S with a
correct reference implementation M. Consider a history H = X ∣∣ Y where Y SIM-commutes
in H, and where M can generate H. Then there exists a correct implementation m of S whose
steps in the Y region of H are conflict-free. Empirically, conflict-free operations scale linearly
on modern multicore hardware (chapter 3), so, given reasonable workload assumptions, m
scales in the Y region of H.
4.5 Example
Before we turn to why the scalable commutativity rule is true, we’ll first illustrate how the
rule helps designers think about interfaces and implementations, using reference counters as
a case study.
In its simplest form, a reference counter has two operations, inc and dec, which respectively
increment and decrement the value of the counter and return its new value. We’ll also consider
a third operation, iszero, which returns whether the reference count is zero. Together, these
operations and their behavior define a reference counter specification Src . Src has a simple
reference implementation using a shared counter, which we can represent formally as
mrc (s, inc) ≡ ⟨s + 1, s + 1⟩
mrc (s, dec) ≡ ⟨s − 1, s − 1⟩
mrc (s, iszero) ≡ ⟨s, s = 0⟩
Consider a reference counter that starts with a value of 2 and the history
H = [ A = iszero(), A = false, B = iszero(), B = false, C = dec(), C = 1, D = dec(), D = 0 ]
where region HAB contains A's and B's actions and region HCD contains C's and D's actions.
The region HAB SIM commutes in H. Thus, by the rule, there is an implementation of Src
that is conflict-free for HAB . In fact, this is already true of the shared counter reference
implementation mrc because iszero reads the state, but does not change it. On the other hand,
HCD does not SIM commute in H, and therefore the rule does not apply (indeed, no correct
implementation can be conflict-free for HCD ).
The rule suggests a direction to make HCD conflict-free: if we modify the specification so
that dec (and inc) return nothing, then these modified operations do commute (more precisely:
any region consisting exclusively of these operations commutes in any history). With this
modified specification, Src′ , the caller must invoke iszero to detect when the object is no
longer referenced, but in many cases this delayed zero detection is acceptable and represents
a desirable trade-off.
The equivalent history with this modified specification is
H′ = [ A = iszero(), A = false, B = iszero(), B = false, C = dec(), C, D = dec(), D ]
where H′AB and H′CD are the regions of A/B and C/D actions, respectively, and H′ABC is the region spanning A's, B's, and C's actions.
Unlike HCD, H′CD SIM commutes. And, accordingly, there is an implementation of Src′ that is conflict-free for H′CD. By using per-thread counters, we can construct such an implementation. Each dec can modify its local counter, while iszero sums the per-thread values. Per-thread and per-core sharding of data structures like this is a common and long-standing pattern in scalable implementations.
The rule highlights at least one more opportunity in this history. H′ABC also SIM commutes (still assuming an initial count of 2). However, the implementation given above for H′CD is not conflict-free for H′ABC because C's dec will write one component of the state that is read and summed by A's and B's iszero calls. But, again, there is a conflict-free implementation based on adding a Boolean iszero snapshot to the state. iszero simply returns this snapshot. When dec's per-thread value reaches zero, it can read and sum all per-thread values and update the iszero snapshot if necessary.
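
A minimal sketch of this snapshot-based design (our own construction following the description above, assuming a fixed set of threads, no concurrency inside the methods themselves, and decrements issued by threads whose component is positive):

    class LazyRefcount:
        def __init__(self, per_thread):
            self.local = list(per_thread)     # per-thread counter components
            self.zero = (sum(self.local) == 0)  # the Boolean iszero snapshot

        def inc(self, t):
            self.local[t] += 1                # touches only thread t's component

        def dec(self, t):
            self.local[t] -= 1
            if self.local[t] == 0:
                # Only when a per-thread value reaches zero do we pay to
                # read and sum all components; update the snapshot only if
                # it changes, so a dec that leaves the count nonzero writes
                # nothing shared.
                z = (sum(self.local) == 0)
                if z != self.zero:
                    self.zero = z

        def iszero(self):
            return self.zero                  # reads only the snapshot

With H′'s count of 2 held as local = [0, 0, 1, 1] for threads A, B, C, and D, the operations of H′ABC are pairwise conflict-free: A's and B's iszero calls read only the snapshot, and C's dec writes only C's component and, finding the sum still 1, leaves the snapshot alone. The same implementation is not conflict-free for H′CD, since C's re-sum reads the component D writes, which is exactly the trade-off discussed below.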
mns(s, a) ≡
    If head(s.h) = a:                                   // replay s.h
        r ← yield
    else if a = continue and head(s.h) is a response
            and thread(head(s.h)) = thread(a):
        r ← head(s.h)
    else if s.h ≠ emulate:                              // H complete or input diverged
        H′ ← an invocation sequence consistent with s.h
        For each invocation x in H′:
            ⟨s.refstate, _⟩ ← M(s.refstate, x)
        s.h ← emulate                                   // switch to emulation mode
    If s.h = emulate:                                   // emulation mode
        ⟨s.refstate, r⟩ ← M(s.refstate, a)
    else:                                               // replay mode
        s.h ← tail(s.h)
    Return ⟨s, r⟩

Figure 4-1: Constructed non-scalable implementation mns for history H and reference implementation M.
These two implementations of Src′ are fundamentally different. Which is most desirable
depends on whether the workload is expected to be write-heavy (mostly inc and dec) or
read-heavy (mostly iszero). Thus an implementer must determine what opportunities to scale
exist, decide which are likely to be the most valuable, and choose the implementation that
scales in those situations.
In section 8.1, we’ll return to the topic of reference counters when we describe Refcache, a
practical and broadly conflict-free implementation of a lazy reference counter similar to Src′ .
4.6 Proof
We derived implementations of the reference counter example by hand, but a general, constructive proof for the scalable commutativity rule is possible. The construction builds a
conflict-free implementation m from an arbitrary reference implementation M and history
H = X ∣∣ Y. The constructed implementation emulates the reference implementation and is
thus correct for any history. Its performance properties, however, are specialized for H. For
any history X ∣∣ P where P is a prefix of a reordering of Y, the constructed implementation’s
steps in P are conflict-free. That is, within the SIM-commutative region, m scales.
To understand the construction, it helps to first imagine constructing a non-scalable
implementation mns from the reference M. This non-scalable implementation begins in
replay mode. As long as each invocation matches the next invocation in H, mns simply replays
the corresponding responses from H, without invoking the reference implementation. If the
input invocations diverge from H, however, mns can no longer replay responses from H, so it
enters emulation mode. This requires feeding M all previously received invocations to prepare
its state. After this, mns responds by passing all invocations to the reference implementation
and returning its responses.
A state s for mns contains two components. First, s.h either holds the portion of H that
remains to be replayed or has the value emulate, which denotes emulation mode. s.h is
initialized to H. Second, s.refstate is the state of the reference implementation, and starts as
the value of the reference implementation’s initial state. Figure 4-1 shows how the simulated
implementation works. We make several simplifying assumptions, including that mns receives
continue invocations in a restricted way; these assumptions aren’t critical for the argument.
One line requires expansion, namely the choice of H ′ “consistent with s.h” when the input
sequence diverges. This step calculates the prefix of H up to, but not including, s.h; excludes
responses; and adds continue invocations as appropriate.
This implementation is correct—its responses for any history always match those from
the reference implementation. But it isn’t conflict-free. In replay mode, any two steps of mns
conflict on accessing s.h. These accesses track which invocations have occurred; without
them it would be impossible to later initialize the state of M. And this is where commutativity comes in. The action order in a SIM-commutative region doesn’t matter by definition.
Since the specification doesn’t distinguish among orders, it is safe to initialize the reference
implementation with the commutative actions in a different order than they were received. All
future responses will still be valid according to the specification.
Figure 4-2 shows the construction of m, a version of M that scales over Y in H = X ∣∣ Y.
m is similar to mns , but extends it with a conflict-free mode used to execute actions in Y. Its
state is as follows:
• s.h[t]—a per-thread history. Initialized to X ∣∣ commute ∣∣ (Y∣t), where the special
commute action indicates the commutative region has begun.
• s.commute[t]—a per-thread flag indicating whether the commutative region has been
reached. Initialized to false.
• s.refstate—the reference implementation’s state.
Each step of m in the commutative region accesses only state components specific to the
invoking thread. This means that any two steps in the commutative region are conflict-free,
and the scalable commutativity rule is proved. The construction uses SIM commutativity
when initializing the reference implementation’s state via H ′ . If the observed invocations
diverge before the commutative region, then just as in mns , H ′ will exactly equal the observed
invocations. If the observed invocations diverge in or after the commutative region, however,
there’s not enough information to recover the order of invocations. (The s.h[t] components
track which invocations have happened per thread, but not the order of those invocations
m(s, a) ≡
    t ← thread(a)
    If head(s.h[t]) = commute:                          // enter conflict-free mode
        s.commute[t] ← true; s.h[t] ← tail(s.h[t])
    If head(s.h[t]) = a:                                // replay s.h
        r ← yield
    else if a = continue and head(s.h[t]) is a response
            and thread(head(s.h[t])) = t:
        r ← head(s.h[t])
    else if s.h[t] ≠ emulate:                           // H complete/input diverged
        H′ ← an invocation sequence consistent with s.h[∗]
        For each invocation x in H′:
            ⟨s.refstate, _⟩ ← M(s.refstate, x)
        s.h[u] ← emulate for each thread u
    If s.h[t] = emulate:                                // emulation mode
        ⟨s.refstate, r⟩ ← M(s.refstate, a)
    else if s.commute[t]:                               // conflict-free mode
        s.h[t] ← tail(s.h[t])
    else:                                               // replay mode
        s.h[u] ← tail(s.h[u]) for each thread u
    Return ⟨s, r⟩

Figure 4-2: Constructed scalable implementation m for history H and reference implementation M.
between threads.) Therefore, H ′ might reorder the invocations in Y. SIM commutativity
guarantees that replaying H ′ will nevertheless produce results indistinguishable from those
of the actual invocation order, even if the execution diverges within the commutative region.²
4.7 Discussion
The rule and proof construction push state and history dependence to an extreme: the proof
construction is specialized for a single commutative region. This can be mitigated by repeated
application of the construction to build an implementation that scales over multiple commutative regions in a history, or for the union of many histories.³ Nevertheless, the implementation
²We effectively have assumed that M, the reference implementation, produces the same results for any
reordering of the commutative region. This is stricter than SIM commutativity, which places requirements on
the specification, not the implementation. We also assumed that M is indifferent to the placement of continue
invocations in the input history. Neither of these restrictions is fundamental, however. If during replay M
produces responses that are inconsistent with the desired results, m could throw away M’s state, produce a new
H ′ with different continue invocations and/or commutative region ordering, and try again. This procedure must
eventually succeed and does not change the conflict-freedom of m in the commutative region.
³This is possible because, once the constructed machine leaves the specialized region, it passes invocations
directly to the reference and has the same conflict-freedom properties as the reference.
constructed by the proof is impractical and real implementations achieve broad scalability
using different techniques, such as the ones this dissertation explores in chapter 8.
We believe such broad implementation scalability is made easier by broadly commutative
interfaces. In broadly commutative interfaces, the arguments and system states for which a set
of operations commutes often collapse into fairly well-defined classes (e.g., file creation might
commute whenever the containing directories are different). In practice, implementations
scale for whole classes of states and arguments, not just for specific histories.
On the other hand, there can be limitations on how broadly an implementation can scale.
It is sometimes the case that a set of operations commutes in more than one class of situation,
but no single implementation can scale for all classes. The reference counter example in
section 4.5 hinted at this when we constructed several possible implementations for different
situations, but never arrived at a broadly conflict-free one. As an example that’s easier to
reason about, consider an interface with two calls: put(x) records a sample with value x, and
max() returns the maximum sample recorded so far (or 0). Suppose
H = [ A = put(1), A, B = put(1), B, C = max(), C = 1 ]
where region HAB contains A's and B's actions and region HBC contains B's and C's actions.
Both HAB and HBC SIM commute in H, but H overall is not SIM commutative. An implementation could store per-thread maxima reconciled by max and be conflict-free for HAB. Alternatively, it could use a global maximum that put checked before writing; this is conflict-free for HBC. But no correct implementation can be conflict-free across all of H. Since HAB and HBC together span H, that means no single implementation can be conflict-free for both HAB and HBC.
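
Both alternatives are easy to sketch (illustrative Python; the per-thread variant needs an explicit thread id):

    class PerThreadMax:
        # Conflict-free for concurrent puts (HAB): each put writes only
        # its own thread's slot; max reconciles by scanning all slots.
        def __init__(self, nthreads):
            self.slots = [0] * nthreads

        def put(self, t, x):
            if x > self.slots[t]:
                self.slots[t] = x

        def max(self):
            return max(self.slots)

    class GlobalMax:
        # Conflict-free for HBC: a put that does not raise the maximum
        # only reads the shared value, so it doesn't conflict with max.
        def __init__(self):
            self.val = 0

        def put(self, t, x):
            if x > self.val:
                self.val = x

        def max(self):
            return self.val

In H, PerThreadMax makes A's and B's puts conflict-free, but C's max must read the slot B wrote; GlobalMax makes B's put(1) a pure read (the maximum is already 1 after A's put), but then A's put conflicts with B's.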
In our experience, real-world interface operations rarely demonstrate such mutually
exclusive implementation choices. For example, the POSIX implementation in chapter 8 scales
quite broadly, with only a handful of cases that would require incompatible implementations.
We hope to further explore this gap between the specificity of the formalized scalable
commutativity rule and the generality of practical implementations. We’ll return to this
question and several other avenues for future work in chapter 10. However, as the rest of this
dissertation shows, the rule is already an effective guideline for achieving practical scalability.
Five
Designing commutative interfaces
The rule facilitates scalability reasoning at the interface and specification level, and SIM
commutativity lets us apply the rule to complex interfaces. This chapter demonstrates the
interface-level reasoning enabled by the rule, using POSIX as a case study. Already, many
POSIX operations commute with many other operations, a fact we will quantify in the
following chapters; this chapter focuses on problematic cases to give a sense of the subtler
issues of commutative interface design.
The following sections explore four general classes of changes that make operations
commute in more situations, enabling more scalable implementations.
5.1 Decompose compound operations
Many POSIX APIs combine several operations into one, limiting the combined operation’s
commutativity. For example, fork both creates a new process and snapshots the current
process’s entire memory state, file descriptor state, signal mask, and several other properties.
As a result, fork fails to commute with most other operations in the same process, such as
memory writes, address space operations, and many file descriptor operations. However,
applications often follow fork with exec, which undoes most of fork’s sub-operations. With
only fork and exec, applications are forced to accept these unnecessary sub-operations that
limit commutativity.
POSIX has a little-known API called posix_spawn that addresses this problem by creating a process and loading an image directly (CreateProcess in Windows is similar). This is
equivalent to fork/exec, but its specification eliminates the intermediate sub-operations. As
a result, posix_spawn commutes with most other operations and permits a broadly scalable
implementation.
Another example, stat, retrieves and returns many different attributes of a file simultaneously, which makes it non-commutative with operations on the same file that change
any attribute returned by stat (such as link, chmod, chown, write, and even read). In practice,
applications invoke stat for just one or two of the returned fields. An alternate API that
gave applications control of which field or fields were returned would commute with more
operations and enable a more scalable implementation of stat, as we show in section 9.2.
POSIX has many other examples of compound return values. sigpending returns all
pending signals, even if the caller only cares about a subset; and select returns all ready
file descriptors, even if the caller needs only one ready FD.
5.2 Embrace specification non-determinism
POSIX’s “lowest available FD” rule is a classic example of overly deterministic design that
results in poor scalability. Because of this rule, open operations in the same process (and
any other FD allocating operations) do not commute, since the order in which they execute
determines the returned FDs. This constraint is rarely needed by applications and an alternate
interface that could return any unused FD would allow FD allocation operations to commute
and enable implementations to use well-known scalable allocation methods. We will return
to this example, too, in section 9.2. Many other POSIX interfaces get this right: mmap can
return any unused virtual address and creat can assign any unused inode number to a new
file.
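
One well-known method is per-core free pools carved out of disjoint slices of the identifier space, so commuting allocations touch no shared state. A toy sketch (our own, not sv6's actual allocator; a real one would also need rebalancing when a pool empties and handling of cross-core frees):

    class AnyFDAllocator:
        # Each core draws from a private pool of FDs taken from a
        # disjoint slice of the FD space, so concurrent alloc and free
        # calls on different cores are conflict-free.
        def __init__(self, ncores, slice_size=1024):
            self.pool = [list(range(c * slice_size, (c + 1) * slice_size))
                         for c in range(ncores)]

        def alloc(self, core):
            return self.pool[core].pop()     # any unused FD will do

        def free(self, core, fd):
            self.pool[core].append(fd)       # recycle into the local pool

Note that this satisfies only the relaxed "any unused FD" specification; under POSIX's lowest-FD rule, cores would have to agree on the global minimum, reintroducing shared state.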
5.3 Permit weak ordering
Another common source of limited commutativity is strict ordering requirements between
operations. For many operations, ordering is natural and keeps interfaces simple to use; for
example, when one thread writes data to a file, other threads can immediately read that
data. Synchronizing operations like this are naturally non-commutative. Communication
interfaces, on the other hand, often enforce strict ordering, but may not need to. For instance,
most systems order all messages sent via a local Unix domain socket, even when using
SOCK_DGRAM, so any send and recv system calls on the same socket do not commute (except
in error conditions). This is often unnecessary, especially in multi-reader or multi-writer
situations, and an alternate interface that does not enforce ordering would allow send and
recv to commute as long as there is both enough free space and enough pending messages on
the socket, which would in turn allow an implementation of Unix domain sockets to support
scalable communication (which we use in section 9.3).
5.4 Release resources asynchronously
A closely related problem is that many POSIX operations have global effects that must be
visible before the operation returns. This is generally good design for usable interfaces, but for
operations that release resources, this is often stricter than applications need and expensive
to ensure. For example, writing to a pipe must deliver SIGPIPE immediately if there are no
read FDs for that pipe, so pipe writes do not commute with the last close of a read FD. This
requires aggressively tracking the number of read FDs; a relaxed specification that promised
to eventually deliver the SIGPIPE would allow implementations to use more scalable read FD
tracking. Similarly, munmap does not commute with memory reads or writes of the unmapped
region from other threads. Enforcing this requires non-scalable remote TLB shootdowns
before munmap can return, even though depending on this behavior usually indicates a bug.
An munmap (perhaps an madvise) that released virtual memory asynchronously would let the
kernel reclaim physical memory lazily and batch or eliminate remote TLB shootdowns.
Six
Analyzing interfaces using Commuter
Fully understanding the commutativity of a complex interface is tricky, and achieving an
implementation that avoids sharing when operations commute adds another dimension to
an already difficult task. However, by leveraging the formality of the commutativity rule,
developers can automate much of this reasoning. This chapter presents a systematic, test-driven approach to applying the commutativity rule to real implementations, embodied in a tool named Commuter, whose components are shown in Figure 6-1.
First, Analyzer takes a symbolic model of an interface and computes precise conditions
for when that interface’s operations commute. Second, Testgen uses these conditions to
generate concrete tests of sets of operations that commute according to the interface model,
and thus should have a conflict-free implementation according to the commutativity rule.
Third, Mtrace checks whether a particular implementation is conflict-free for each test case.
A developer can use these test cases to understand the commutative cases they should
consider, to iteratively find and fix scalability issues in their code, or as a regression test suite
to ensure scalability bugs do not creep into the implementation over time.
6.1 Analyzer
Analyzer automates the process of analyzing the commutativity of an interface, saving developers from the tedious and error-prone process of considering large numbers of interactions
between complex operations. Analyzer takes as input a model of the behavior of an interface,
[Python model] → Analyzer (§6.1) → [commutativity conditions] → Testgen (§6.2) → [test cases] → Mtrace (§6.3) → [shared cache lines], with the implementation under test as Mtrace's second input.

Figure 6-1: The components of Commuter.
written in a symbolic variant of Python, and outputs commutativity conditions: expressions
in terms of arguments and state for exactly when sets of operations commute. A developer
can inspect these expressions to understand an interface’s commutativity or pass them to
Testgen (section 6.2) to generate concrete examples of when interfaces commute.
Given the Python code for a model, Analyzer uses symbolic execution to consider
all possible behaviors of the interface model and construct complete commutativity conditions. Symbolic execution also enables Analyzer to reason about the external behavior
of an interface, rather than specifics of the model’s implementation, and enables models
to capture specification non-determinism (like creat’s ability to choose any free inode) as
under-constrained symbolic values.
6.1.1 Concrete commutativity analysis
Starting from an interface model, Analyzer computes the commutativity condition of each
multiset of operations of a user-specified size. To determine whether a set of operations
commutes, Analyzer executes the SIM commutativity test algorithm given in Figure 6-2. To
begin with, we can think of this test in concrete (non-symbolic) terms as a test of whether a set
of operations commutes starting from a specific initial state s0 and given specific operation
arguments args.
This test is implemented by a function called commutes. commutes codifies the definition
of SIM commutativity, except that it requires the specification to be sequentially consistent so
it needn’t interleave partial operations. Recall that Y SI-commutes in H = X ∣∣ Y when, given
any reordering Y ′ of Y and any action sequence Z,
X ∣∣ Y ∣∣ Z ∈ S if and only if X ∣∣ Y ′ ∣∣ Z ∈ S .
Further, for Y to SIM-commute in H, every prefix of every reordering of Y must SI-commute.
In commutes, the initial state s0 serves the role of the prefix X: to put the system in some
state. ops serves the role of Y (assuming sequential consistency) and the loop in commutes
generates every Y ′ , that is, all prefixes of all reorderings of Y. This loop performs two tests.
First, the result equivalence test ensures that each operation gives the same response in all reorderings. Second, the state equivalence test serves the role of the future actions, Z, by requiring all prefixes of all reorderings to converge on states that are indistinguishable by
future operations.
Since commutes substitutes state equivalence in place of considering all possible future
operations (which would be difficult with symbolic execution), it’s up to the model’s author to
define state equivalence as whether two states are externally indistinguishable. This is standard
practice for high-level data types (e.g., two sets represented as trees could be equal even if
they are balanced differently). For the POSIX model we present in chapter 7, only a few types
need special handling beyond what Analyzer’s high-level data types already provide.
def commutes(s0, ops, args):
    states = {frozenset(): s0}
    results = {}
    # Generate all (non-empty) prefixes of all reorderings of ops
    todo = list((op,) for op in ops)
    while todo:
        perm = todo.pop()
        # Execute next operation in this permutation
        past, op = perm[:-1], perm[-1]
        s = states[frozenset(past)].copy()
        r = op(s, args[op])
        # Test for result equivalence
        if op not in results:
            results[op] = r
        elif r != results[op]:
            return False
        # Test for state equivalence
        if frozenset(perm) not in states:
            states[frozenset(perm)] = s
        elif s != states[frozenset(perm)]:
            return False
        # Queue all extensions of perm
        todo.extend(perm + (nextop,) for nextop in ops if nextop not in perm)
    return True
Figure 6-2: The SIM commutativity test algorithm. s0 is the initial state, ops is the list of
operations to test for commutativity, and args gives the arguments to pass to each operation.
For clarity, this implementation assumes all values in ops are distinct.
The commutes algorithm can be optimized by observing that if two permutations of the same prefix reach the same state, only one needs to be considered further. This optimization gives commutes a pleasing symmetry: it becomes equivalent to exploring all n-step paths from ⟨0, 0, . . .⟩ to ⟨1, 1, . . .⟩ in an n-cube, where each unit step is an operation and each vertex is a state.
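
Concretely, the pruning is a small change to the state-equivalence branch of commutes in Figure 6-2. A sketch (ours) of the pruned variant:

    def commutes_pruned(s0, ops, args):
        states = {frozenset(): s0}
        results = {}
        todo = list((op,) for op in ops)
        while todo:
            perm = todo.pop()
            past, op = perm[:-1], perm[-1]
            s = states[frozenset(past)].copy()
            r = op(s, args[op])
            if op not in results:
                results[op] = r
            elif r != results[op]:
                return False
            if frozenset(perm) not in states:
                states[frozenset(perm)] = s
            elif s != states[frozenset(perm)]:
                return False
            else:
                # Another permutation of these operations already reached
                # this exact state, so its extensions cover ours: prune.
                continue
            todo.extend(perm + (nextop,) for nextop in ops
                        if nextop not in perm)
        return True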
6.1.2 Symbolic commutativity analysis
So far, we’ve considered only how to determine if a set of operations commutes for a specific
initial state and specific arguments. Ultimately, we’re interested in the entire space of states
and arguments for which a set of operations commutes. To find this, Analyzer executes both
the interface model and commutes symbolically, starting with an unconstrained symbolic
initial state and unconstrained symbolic operation arguments. Symbolic execution lets Analyzer efficiently consider all possible initial states and arguments and precisely determine the
SymInode    = tstruct(data = tlist(SymByte),
                      nlink = SymInt)
SymIMap     = tdict(SymInt, SymInode)
SymFilename = tuninterpreted('Filename')
SymDir      = tdict(SymFilename, SymInt)

class POSIX(tstruct(fname_to_inum = SymDir,
                    inodes = SymIMap)):
    @symargs(src=SymFilename, dst=SymFilename)
    def rename(self, src, dst):
        if not self.fname_to_inum.contains(src):
            return (-1, errno.ENOENT)
        if src == dst:
            return 0
        if self.fname_to_inum.contains(dst):
            self.inodes[self.fname_to_inum[dst]].nlink -= 1
        self.fname_to_inum[dst] = self.fname_to_inum[src]
        del self.fname_to_inum[src]
        return 0
Figure 6-3: A simplified version of our rename model.
def commutes2(s0, opA, argsA, opB, argsB):
    # Run reordering [opA, opB] starting from s0
    sAB = s0.copy()
    rA  = opA(sAB, *argsA)
    rAB = opB(sAB, *argsB)
    # Run reordering [opB, opA] starting from s0
    sBA = s0.copy()
    rB  = opB(sBA, *argsB)
    rBA = opA(sBA, *argsA)
    # Test commutativity
    return rA == rBA and rB == rAB and sAB == sBA
Figure 6-4: The SIM commutativity test algorithm specialized to two operations.
(typically infinite) set of states and arguments for which the operations commute (that is, for
which commutes returns True).
Figure 6-3 gives an example of how a developer could model the rename operation in Analyzer. The first five lines declare symbolic types used by the model (tuninterpreted declares
a type whose values support only equality). The POSIX class, itself a symbolic type, represents
the system state of the file system and its methods implement the interface operations to
be tested. The implementation of rename itself is straightforward. Indeed, the familiarity of
Python and ease of manipulating state were part of why we chose it over abstract specification
languages.
To explore how Analyzer analyzes rename, we’ll use the version of commutes given in
Figure 6-4, which is specialized for pairs of operations. In practice, we typically analyze pairs
of operations rather than larger sets because larger sets take exponentially longer to analyze
and rarely reveal problems beyond those already revealed by pairs.
By symbolically executing commutes2 for two rename operations, rename(a, b) and rename(c, d), Analyzer computes that these operations commute if any of the following hold:
• Both source files exist, and the file names are all different (a and c exist, and a, b, c, d all
differ).
• One rename’s source does not exist, and it is not the other rename’s destination (either
a exists, c does not, and b≠c, or c exists, a does not, and d≠a).
• Neither a nor c exists.
• Both calls are self-renames (a=b and c=d).
• One call is a self-rename of an existing file (a exists and a=b, or c exists and c=d) and
it’s not the other call’s source (a≠c).
• Two hard links to the same inode are renamed to the same new name (a and c point to
the same inode, a≠c, and b=d).
Despite rename’s seeming simplicity, Analyzer’s symbolic execution systematically explores its hidden complexity, revealing the many combinations of state and arguments for
which two rename calls commute. Here we see again the value of SIM commutativity: every
condition above except the self-rename case depends on state and would not have been
captured by a traditional, non-state-sensitive definition of commutativity.
Figure 6-5 illustrates the symbolic execution of commutes2 that arrives at these conditions.
By and large, this symbolic execution proceeds like regular Python execution, except when it
encounters a conditional branch on a symbolic value (such as any if statement in rename).
Symbolic execution always begins with an empty symbolic path condition. To execute a
conditional branch on a symbolic value, Analyzer uses an SMT solver to determine whether
that symbolic value can be true, false, or either, given the path condition accumulated so far.
If the branch can go both ways, Analyzer logically forks the symbolic execution and extends
the path conditions of the two forks with the constraints that the symbolic value must be true
or false, respectively. These growing path conditions thereby constrain further execution on
the two resulting code paths.
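
A toy sketch of this forking step using Z3's Python bindings (which our prototype also uses; see section 6.4). The function names here are illustrative, not Commuter's actual API:

    from z3 import Solver, Bool, And, Not, sat

    def can_be(path_condition, cond, value):
        # Ask the SMT solver whether `cond` can take `value` under the
        # constraints accumulated along this code path so far.
        s = Solver()
        s.add(And(path_condition + [cond if value else Not(cond)]))
        return s.check() == sat

    def sym_branch(path_condition, cond):
        # Fork execution at a conditional branch on symbolic `cond`,
        # returning each feasible branch with its extended path condition.
        forks = []
        if can_be(path_condition, cond, True):
            forks.append((True, path_condition + [cond]))
        if can_be(path_condition, cond, False):
            forks.append((False, path_condition + [Not(cond)]))
        return forks

    src_exists = Bool('src_exists')
    print(len(sym_branch([], src_exists)))            # 2: both branches feasible
    print(len(sym_branch([src_exists], src_exists)))  # 1: only the true branch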
The four main regions of Figure 6-5 correspond to the four calls to rename from commutes2 as it tests the two different reorderings of the two operations. Each call region shows
the three conditional branches in rename. The first call forks at every conditional branch
because the state and arguments are completely unconstrained at this point; Analyzer therefore explores every code path through the first call to rename. The second call forks similarly.
Figure 6-5: Symbolic execution tree of commutes2 for rename/rename. Each node represents a conditional branch on a symbolic value. The terminals at the right indicate whether each path constraint yields a commutative execution of the two operations (✓) or, if not, whether it diverged on return values (R) or final state (S).
The third and fourth calls generally do not fork; by this point, the symbolic values of s0, argsA,
and argsB are heavily constrained by the path condition produced by the first two calls. As a
result, these calls are often forced to make the same branch as the corresponding earlier call.
Finally, after executing both reorderings of rename/rename, commutes2 tests their commutativity by checking if each operation’s return value is equivalent in both permutations
and if the system states reached by both permutations are equivalent. This, too, is symbolic
and may fork execution if it’s still possible for the pair of operations to be either commutative
or non-commutative (Figure 6-5 contains two such forks).
Together, the path conditions that pass this final commutativity test form the commutativity condition of this pair of operations. Barring SMT solver time-outs, the disjunction of
the path conditions for which commutes2 returns True captures the precise and complete set
of initial system states and operation arguments for which the operations commute.
As this example shows, when system calls access shared, mutable state, reasoning about
every commutative case by hand can become difficult. Developers can easily overlook cases,
both in their understanding of an interface’s commutativity, and when making their implementation scale for commutative cases. Analyzer automates reasoning about all possible
system states, all possible sets of operations that can be invoked, and all possible arguments
to those operations.
6.2 Testgen
While a developer can examine the commutativity conditions produced by Analyzer directly,
for complex interfaces these formulas can be large and difficult to decipher. Further, real implementations are complex and likely to contain unintentional sharing, even if the developer
understands an interface's commutativity. Testgen takes the first step toward helping developers apply commutativity to real implementations by converting Analyzer's commutativity conditions into concrete test cases.
To produce a test case, Testgen computes a satisfying assignment for the corresponding
commutativity condition. The assignment specifies concrete values for every symbolic variable
in the model, such as the fname_to_inum and inodes data structures and the rename arguments
shown in Figure 6-3. Testgen then invokes a model-specific function on the assignment to
produce actual C test case code. For example, one test case that Testgen generates is shown in
Figure 6-6. The test case includes setup code that configures the initial state of the system and
a set of functions to run on different cores. Every Testgen test case should have a conflict-free
implementation.
The goal of these test cases is to expose potential scalability problems in an implementation,
but it is impossible for Testgen to know exactly what inputs might trigger conflicting memory
accesses. Thus, as a proxy for achieving good coverage on the implementation, Testgen aims
to achieve good coverage of the Python model.
void setup_rename_rename_path_ec_test0(void) {
    close(open("__i0", O_CREAT|O_RDWR, 0666));
    link("__i0", "f0");
    link("__i0", "f1");
    unlink("__i0");
}

int test_rename_rename_path_ec_test0_op0(void) {
    return rename("f0", "f0");
}

int test_rename_rename_path_ec_test0_op1(void) {
    return rename("f1", "f0");
}
Figure 6-6: An example test case for two rename calls generated by Testgen for the model in
Figure 6-3.
We consider two forms of coverage. The first is the standard notion of path coverage,
which Testgen achieves by relying on Analyzer’s symbolic execution. Analyzer produces
a separate path condition for every possible code path through a set of operations. However,
even a single path might encounter conflicts in interestingly different ways. For example,
the code path through two pwrites is the same whether they’re writing to the same offset
or different offsets, but the access patterns are very different. To capture different conflict
conditions as well as path conditions, we introduce a new notion called conflict coverage.
Conflict coverage exercises all possible access patterns on shared data structures: looking
up two distinct items from different operations, looking up the same item, etc. Testgen
approximates conflict coverage by concolically executing itself to enumerate distinct tests for
each path condition. Testgen starts with the constraints of a path condition from Analyzer,
tracks every symbolic expression forced to a concrete value by the model-specific test code
generator, negates any equivalent assignment of these expressions from the path condition,
and generates another test, repeating this process until it exhausts assignments that satisfy
the path condition or the SMT solver fails. Since path conditions can have infinitely many
satisfying assignments (e.g., there are infinitely many calls to read with different FD numbers
that return EBADF), Testgen partitions most values in isomorphism groups and considers
two assignments equivalent if each group has the same pattern of equal and distinct values in
both assignments. For our POSIX model, this bounds the number of enumerated test cases.
These two forms of coverage ensure that the test cases generated by Testgen will cover
all possible paths and data structure access patterns in the model, and to the extent that the
implementation is structured similarly to the model, should achieve good coverage for the
implementation as well. As we demonstrate in chapter 7, Testgen produces a total of 26,238
test cases for our model of 18 POSIX system calls, and these test cases find scalability issues in
the Linux ramfs file system and virtual memory system.
6.3 Mtrace
Finally, Mtrace runs the test cases generated by Testgen on a real implementation and
checks that the implementation is conflict-free for every test. If it finds a violation of the
commutativity rule—a test whose commutative operations are not conflict-free—it reports
which variables were shared and what code accessed them. For example, when running the
test case shown in Figure 6-6 on a Linux ramfs file system, Mtrace reports that the two
functions make conflicting accesses to the dcache reference count and lock, which limits the
scalability of those operations.
Mtrace runs the entire operating system in a modified version of qemu [5]. At the
beginning of each test case, it issues a hypercall to qemu to start recording memory accesses,
and then executes the test operations on different virtual cores. During test execution, Mtrace
logs all reads and writes by each core, along with information about the currently executing
kernel thread, to filter out irrelevant conflicts by background threads or interrupts. After
execution, Mtrace analyzes the log and reports all conflicting memory accesses, along with
the C data type of the accessed memory location (resolved from DWARF [28] information
and logs of every dynamic allocation’s type) and stack traces for each conflicting access.
6.4 Implementation
We built a prototype implementation of Commuter’s three components. Analyzer and
Testgen consist of 3,050 lines of Python code, including the symbolic execution engine,
which uses the Z3 SMT solver [24] via Z3’s Python bindings. Mtrace consists of 1,594 lines
of code changed in qemu, along with 612 lines of code changed in the guest Linux kernel (to
report memory type information, context switches, etc.). Another program, consisting of
2,865 lines of C++ code, processes the log file to find and report memory locations that are
shared between different cores for each test case.
Seven
Conflict-freedom in Linux
To understand whether Commuter is useful to kernel developers, we modeled several POSIX
file system and virtual memory calls in Commuter, then used this both to evaluate Linux’s
scalability and to develop a scalable file and virtual memory system for our sv6 research kernel.
The rest of this chapter focuses on Linux and uses this case study to answer the following
questions:
• How many test cases does Commuter generate, and what do they test?
• How good are current implementations of the POSIX interface? Do the test cases
generated by Commuter find cases where current implementations don’t scale?
In the next chapter, we’ll use this same POSIX model to guide the implementation of a
new operating system kernel, sv6.
7.1 POSIX test cases
To answer the first question, we developed a simplified model of the POSIX file system and
virtual memory APIs in Commuter. The model covers 18 system calls, and includes inodes,
file names, file descriptors and their offsets, hard links, link counts, file lengths, file contents,
file times, pipes, memory-mapped files, anonymous memory, processes, and threads. Our
model also supports nested directories, but we disable them because Z3 does not currently
handle the resulting constraints. We restrict file sizes and offsets to page granularity because, for pragmatic reasons, some kernel data structures are designed to be conflict-free for offsets on different pages while offsets within a page conflict. Commuter generates a total of 26,238 test
cases from our model. Generating the test cases and running them on both Linux and sv6
takes a total of 16 minutes on the machine described in section 9.1.
The model implementation and its model-specific test code generator are 596 and 675
lines of Python code, respectively. Figure 6-3 showed a part of our model, and Figure 6-6
gave an example test case generated by Commuter. We verified that all test cases return the
expected results on both Linux and sv6.
Figure 7-1: Conflict-freedom of commutative system call pairs in Linux (17,206 of 26,238 cases scale), showing the fraction and absolute number of test cases generated by Commuter that are not conflict-free for each pair of the 18 modeled system calls (open, link, unlink, rename, stat, fstat, lseek, close, pipe, read, write, pread, pwrite, mmap, munmap, mprotect, memread, memwrite). One example test case was shown in Figure 6-6.
7.2 Linux conflict-freedom
To evaluate the scalability of existing file and virtual memory systems, we used Mtrace to
check the above test cases against Linux kernel version 3.8. Linux developers have invested
significant effort in making the file system scale [11], and it already scales in many interesting
cases, such as concurrent operations in different directories or concurrent operations on
different files in the same directory that already exist [20]. We evaluated the ramfs file system
because ramfs is effectively a user-space interface to the Linux buffer cache. Since exercising
ramfs is equivalent to exercising the buffer cache and the buffer cache underlies all Linux
file systems, this represents the best-case scalability for a Linux file system. Linux’s virtual
memory system, in contrast, involves process-wide locks that are known to limit its scalability
and impact real applications [11, 16, 62].
Figure 7-1 shows the results. Out of 26,238 test cases, 9,032 cases, widely distributed
across the system call pairs, were not conflict-free. This indicates that even a mature and
reasonably scalable operating system implementation misses many cases that can be made to
scale according to the commutativity rule.
A common source of access conflicts is shared reference counts. For example, most file
name lookup operations update the reference count on a struct dentry; the resulting write
conflicts cause them to not scale. Similarly, most operations that take a file descriptor update
the reference count on a struct file, making commutative operations such as two fstat calls on
the same file descriptor not scale. Coarse-grained locks are another source of access conflicts.
For instance, Linux locks the parent directory for any operation that creates file names,
even though operations that create distinct names generally commute. Similarly, we see that
coarse-grained locking in the virtual memory system severely limits the conflict-freedom of
address space manipulation operations. This agrees with previous findings [11, 16, 17], which
demonstrated these problems in the context of several applications.
Figure 7-1 also reveals many previously unknown bottlenecks that may be triggered by
future workloads or hardware.
The next chapter shows how these current and future bottlenecks can be removed in a
practical implementation of POSIX.
Eight
Achieving conflict-freedom in POSIX
Given that many commutative operations are not conflict-free in Linux, is it feasible to build
file systems and virtual memory systems that do achieve conflict-freedom for commutative
operations? To answer this question, we designed and implemented a ramfs-like in-memory
file system called ScaleFS and a virtual memory system called RadixVM for sv6, our research
kernel based on xv6 [23]. Although it is in principle possible to make the same changes in
Linux, we chose not to implement ScaleFS in Linux because ScaleFS’s design would have
required extensive changes throughout the Linux kernel. The designs of both RadixVM and
ScaleFS were guided by the commutativity rule. For ScaleFS, we relied heavily on Commuter
throughout development to guide its design and identify sharing problems in its implementation. RadixVM was built prior to Commuter, but was guided by manual reasoning about
commutativity and conflicts (which was feasible because of the virtual memory system’s
relatively simple interface). We later validated RadixVM using Commuter.
Figure 8-1 shows the result of applying Commuter to sv6. In contrast with Linux, sv6 is
conflict-free for nearly every commutative test case.
For a small number of commutative operations, sv6 is not conflict-free. Some appear to
require implementations that would be incompatible with conflict-freedom in other cases. In
these situations, we preferred the conflict-freedom of the cases we considered more important.
Other non-conflict-free cases represent intentional engineering decisions in the interest of
practical constraints on memory consumption and sequential performance. Complex software
systems inevitably involve conflicting requirements, and scalability is no different. However,
the presence of the rule forced us to explicitly recognize, evaluate, and justify where we made
such trade-offs.
The rest of this chapter describes the design of sv6 and how it achieves conflict-freedom
for cases where current file and virtual memory systems do not. This chapter starts with the
design of Refcache, a scalable reference counting mechanism used throughout sv6. It then
covers the designs of RadixVM and ScaleFS. Finally, it discusses operations that are difficult
to make conflict-free without sacrificing other practical concerns.
Figure 8-1: Conflict-freedom of commutative system call pairs in sv6 (26,115 of 26,238 cases scale), for the same system call pairs as Figure 7-1.
8.1 Refcache: Scalable reference counting
Reference counting is critical to many OS functions and used throughout RadixVM and
ScaleFS. RadixVM must reference count physical pages shared between address spaces, such
as when forking a process, as well as nodes in its internal radix tree. Likewise, ScaleFS must
reference count file descriptions, FD tables, and entries in various caches as well as scalably
maintain other counts such as link counts. This section introduces Refcache, a novel reference
counting scheme used by RadixVM and ScaleFS. Refcache implements space-efficient, lazy,
scalable reference counting using per-core reference delta caches. Refcache targets uses of
reference counting that can tolerate some latency in releasing reference counted resources
and is particularly suited to uses where increment and decrement operations often occur on
the same core (e.g., the same thread that allocated pages also frees them).
Despite the seeming simplicity of reference counting, designing a practical reference
counting scheme that is conflict-free and scalable for most operations turns out to be an
excellent exercise in applying the scalable commutativity rule to both implementation and
interface design.
We first touched on this problem in section 4.5, where we introduced the notion of a
lazy reference counter that separates manipulation from zero detection. Simple reference
counters have two operations, inc(o) and dec(o), which both return the new reference count.
These operations trivially commute when applied to different objects, so we focus on the case
of a single object. When applied to the same object, these operations never commute (any
reordering will change their responses) and cannot be made conflict-free.
A better (and more common) interface returns nothing from inc and, from dec, returns
only an indication of whether the count is zero. This is equivalent to the Linux kernel’s
atomic_inc and atomic_dec_and_test interface and is echoed in software ranging from Glib to
Python. Here, a sequence of inc and dec operations commutes if (and only if) the count does
not reach zero in any reordering. In this case, the results of the operations will be the same
in all reorderings and, since reordering does not change the final sum, no future operations
can distinguish different orders. Therefore, by the scalable commutativity rule, any such
sequence has some conflict-free implementation. Indeed, supposing a particular history with
a commutative region, an implementation can partition the count’s value across cores such
that no per-core partition drops to zero in the commutative region. However, a practical,
general-purpose conflict-free implementation of this interface remains elusive.
Refcache pushes this interface reduction further, targeting reference counters that can
tolerate some latency between when the count reaches zero and when the system detects that
it’s reached zero.
In Refcache, both inc and dec return nothing and hence always commute, even if the
count reaches zero. A new review() operation finds all objects whose reference counts recently reached zero (and, in practice, calls their destructors). review does not commute in
any sequence where any object’s reference count has reached zero and its implementation
conflicts on a small number of cache lines even when it does commute. However, unlike
dec, the system can control when and how often it invokes review. sv6 invokes it only at
10ms intervals—several orders of magnitude longer than the time required by even the most
expensive conflicts on current multicores. Refcache strikes a balance between broad conflict-freedom, strict adherence to the scalable commutativity rule, and practical implementation concerns, providing a general-purpose, efficient implementation of an interface suitable for many uses of reference counting.
By separating count manipulation and zero detection, Refcache can batch increments
and decrements and reduce cache line conflicts while offering an adjustable time bound on
when an object will be garbage collected after its reference count drops to zero. inc and dec are
conflict-free with high probability and review induces only a small constant rate of conflicts
for global epoch maintenance.
Refcache inherits ideas from sloppy counters [11], Scalable NonZero Indicators (SNZI) [29],
distributed counters [1], shared vs. local counts in Modula-2+ [26], and approximate counters [19]. All of these techniques speed up increment and decrement operations using per-core
counters, but make different trade-offs between conflict-freedom, zero-detection cost, and
space. With the exception of SNZIs, these techniques do not detect when a count reaches
zero, so the system must poll each counter for this condition. SNZIs detect when a counter
reaches zero immediately, but at the cost of conflicts when a counter’s value is small. Refcache
balances these extremes, enabling conflict-free operations and efficient zero-detection, but
with a short delay. Furthermore, in contrast with sloppy, SNZI, distributed, and approximate
counters, Refcache requires space proportional to the sum of the number of reference counted
objects and the number of cores, rather than the product, and the per-core overhead can be
adjusted to trade off space and scalability by controlling the reference delta cache collision
rate. Space overhead is particularly important for RadixVM, which must reference count
every physical page; at large core counts, other scalable reference counters would require
more than half of physical memory just to track the remaining physical memory.
8.1.1 Basic Refcache
In Refcache, each reference counted object has a global reference count (much like a regular
reference count) and each core also maintains a local, fixed-size cache of deltas to objects’
reference counts. Incrementing or decrementing an object’s reference count modifies only the
local, cached delta and review periodically flushes this delta to the object’s global reference
count. The true reference count of an object is thus the sum of its global count and any local
deltas for that object found in the per-core delta caches. The value of the true count is generally
unknown, but we assume that once the true count drops to zero, it will remain zero (in the
absence of weak references, which we discuss later). Refcache depends on this stability to
detect a zero true count after some delay.
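As a rough illustration, the per-core structure might look like the following C sketch; the sizes, names, and eviction path are simplified assumptions, not sv6's actual code:

#include <stdatomic.h>

#define NCORES     80      /* assumed machine size */
#define CACHE_WAYS 4096    /* assumed delta cache size */

struct obj { atomic_long global_count; };   /* review bookkeeping omitted */
struct delta_slot { struct obj *obj; long delta; };

/* One direct-mapped delta cache per core; callers run with preemption off. */
static struct delta_slot cache[NCORES][CACHE_WAYS];

static void evict(struct delta_slot *s) {
    /* Apply the cached delta to the object's global count. A real
     * implementation also queues the object for review when the global
     * count reaches zero; that path is omitted here. */
    if (s->obj)
        atomic_fetch_add(&s->obj->global_count, s->delta);
}

static void refcache_inc(struct obj *o, int core) {
    struct delta_slot *s = &cache[core][((unsigned long)o >> 6) % CACHE_WAYS];
    if (s->obj != o) {      /* hash collision: flush the old entry */
        evict(s);
        s->obj = o;
        s->delta = 0;
    }
    s->delta++;             /* core-local write only: no cross-core conflicts */
}

/* refcache_dec is identical except that it decrements s->delta. */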
To detect a zero true reference count, Refcache divides time into periodic epochs during
which each core calls review once to flush all of the reference count deltas in its cache and
apply these updates to the global reference count of each object. The last core in an epoch
to finish flushing its cache ends the epoch and all of the cores repeat this process after some
delay (our implementation uses 10ms). Since these flushes occur in no particular order and
the caches batch reference count changes, updates to the reference count can be reordered.
As a result, a zero global reference count does not imply a zero true reference count. However,
once the true count is zero, there will be no more updates, so if the global reference count
of an object drops to zero and remains zero for an entire epoch, then review can guarantee
that the true count is zero and free the object. To detect this, the first review operation to
set an object’s global reference count to zero adds the object to the local per-core review
queue. Review then reexamines it two epochs later (which guarantees one complete epoch
has elapsed) to decide whether its true reference count is zero.
Figure 8-2 gives an example of a single object over the course of eight epochs. Epoch 1
demonstrates the power of batching: despite six reference count manipulations spread over
three cores, all inc and dec operations are conflict-free (as we would hope, given that they
commute), the object’s global reference count is never written to and cache line conflicts arise
only from epoch maintenance. The remaining epochs demonstrate the complications that
arise from batching and the resulting lag between the true reference count and the global
reference count of an object.

Figure 8-2: Refcache example showing a single object over eight epochs. Plus and minus
symbols represent increment and decrement operations, dotted lines show when cores flush
these to the object's global count, and blue circles show when each core flushes its local
reference cache. The loops around the global count show when the object is in core 0's review
queue and the red zeroes indicate dirty zeroes.
Because of the flush order, the inc and dec in epoch 2 are applied to the global reference
count in the opposite order of how they actually occurred. As a result, core 0 observes the
global count temporarily drop to zero when it flushes in epoch 2, even though the true count is
non-zero. This is remedied as soon as core 1 flushes its increment, and when core 0 reexamines
the object at the beginning of epoch 4, after all cores have again flushed their delta caches, it
can see that the global count is non-zero; hence, the zero count it observed was not a true
zero and the object should not be freed.
It is necessary but not sufficient for the global reference count to be zero when an object
is reexamined; there must also have been no deltas flushed to the object’s global count in the
interim, even if those changes canceled out or the deltas themselves were zero. For example,
core 0’s review will observe a zero global reference count at the end of epoch 4, and again
when it reexamines the object in epoch 6. However, the true count is not zero, and the global
reference count was temporarily non-zero during the epoch. We call this a dirty zero and in
this situation review will queue the object to be examined again two epochs later, in epoch 8.
8.1.2 Weak references
As described, Refcache is well suited to reference counts that track the number of true
references to an object, since there is no danger of the count going back up from zero once the
object becomes unreachable. However, operating systems often need untracked references to
objects; for example, ScaleFS’s caches track objects that may be deleted at any time, and may
even need to bring an object’s reference count back up from zero. RadixVM’s radix tree has
similar requirements. To support such uses, we extend Refcache with weak references, which
provide a tryget operation that will either increment the object’s reference count (even if it has
reached zero) and return the object, or will indicate that the object has already been deleted.
A weak reference is simply a pointer marked with a “dying” bit, along with a back-reference
from the referenced object. When an object’s global reference count initially reaches zero,
review sets the weak reference’s dying bit. After this, tryget can “revive” the object by atomically
clearing the dying bit and fetching the pointer value, and then incrementing the object’s
reference count as usual. When review decides to free an object, it first atomically clears both
the dying bit and the pointer in the weak reference. If this succeeds, it can safely delete the
object. If this fails, it reexamines the object again two epochs later. In a race between tryget
and review, whether the object is reviewed or deleted is determined by which operation clears
the dying bit first.
This protocol can be extended to support multiple weak references per object using a
two-phase approach in which review first puts all weak references to an object in an intermediate state that can be rolled back if any reference turns out to be revived. Our current
implementation does not support this because it is unnecessary for RadixVM or ScaleFS.
8.1.3 Algorithm
Figure 8-3 shows pseudocode for Refcache. Each core maintains a hash table storing its
reference delta cache and the review queue that tracks objects whose global reference counts
reached zero. A core reviews an object after two epoch boundaries have passed, which guarantees that all cores have flushed their reference caches at least once.
All of the functions in Figure 8-3 execute with preemption disabled, meaning they are
atomic with respect to each other on a given core, which protects the consistency of per-core
data structures. Fine-grained locks protect the Refcache-related fields of individual objects.
For epoch management, our current implementation uses a barrier scheme that tracks
a global epoch counter, per-core epochs, and a count of how many per-core epochs have
reached the current global epoch. This scheme suffices for our benchmarks, but schemes with
fewer cache-line conflicts are possible, such as the tree-based quiescent state detection used
by Linux’s hierarchical RCU implementation [46].
8.1.4 Discussion
Refcache trades latency for scalability by batching increment and decrement operations in per-core caches. As a result, except when collisions in the reference delta cache cause evictions, inc
and dec are naturally conflict-free with all other operations. Furthermore, because Refcache
uses per-core caches rather than per-core counts, it is more space-efficient than other scalable
reference counting techniques. While not all uses of reference counting can tolerate Refcache’s
latency, its scalability and space-efficiency are well suited to the requirements of RadixVM
and ScaleFS.
inc(obj) ≡
If local-cache[hash(obj)].obj ≠ obj:
evict(local-cache[hash(obj)])
local-cache[hash(obj)] ← ⟨obj, 0⟩
local-cache[hash(obj)].delta += 1
tryget(weakref) ≡
Do:
⟨obj, dying⟩ ← weakref
while weakref.cmpxchg(⟨obj, dying⟩, ⟨obj, false⟩) fails
If obj is not null:
inc(obj)
Return obj
evict(obj, delta) ≡
If delta = 0 and obj.refcnt ≠ 0: return
With obj locked:
obj.refcnt ← obj.refcnt + delta
If obj.refcnt = 0:
If obj is not on any review queue:
obj.dirty ← false
obj.weakref.dying ← true
Add ⟨obj, epoch⟩ to the local review queue
else:
obj.dirty ← true
flush() ≡
Evict all local-cache entries and clear cache
Update the current epoch
review() ≡
flush()
For each ⟨obj, objepoch⟩ in local review queue:
If epoch < objepoch + 2: continue
With obj locked:
Remove obj from the review queue
If obj.refcnt ≠ 0:
obj.weakref.dying ← false
else if obj.dirty or obj.weakref.cmpxchg(⟨obj, true⟩, ⟨null, false⟩) fails:
obj.dirty ← false
obj.weakref.dying ← true
Add ⟨obj, epoch⟩ to the local review queue
else:
Free obj
Figure 8-3: Refcache algorithm. Each core calls review periodically. evict may be called by
flush or because of a collision in the reference cache. dec is identical to inc except that it
decrements the locally cached delta.
8.2 RadixVM: Scalable address space operations
The POSIX virtual memory interface is rife with commutativity, but existing implementations
scale poorly. We model a virtual memory system as four key logical operations: mmap adds a
region of memory to a process’ virtual address space, munmap removes a region of memory
from the address space, and memread and memwrite access virtual memory at a particular
virtual address or fail if no mapped region contains that address. In reality, the hardware implements memread and memwrite directly and the VM system instead implements a pagefault
operation; we discuss this distinction below.
VM operations from different processes (and hence address spaces) trivially commute.
Most existing implementations use per-process address space representations, so such operations are also naturally conflict-free (and scale well). VM operations from the same process
also often commute; in particular, operations on non-overlapping regions of the address
space commute. Many multithreaded applications exercise exactly this increasingly important
scenario: mmap, munmap, and related variants lie at the core of high-performance memory
allocators and garbage collectors and partitioning the address space between threads is a key
design component of these systems [30, 32, 44]. Applications that frequently map and unmap
files also generate such workloads for the OS kernel.
Owing to complex invariants in virtual memory systems, widely used kernels such as Linux
and FreeBSD protect each address space with a single lock. This induces both conflicts and,
often, complete serialization between commutative VM operations on the same address space,
which can dramatically limit application scalability [11, 16]. As a result, applications often
resort to workarounds with significant downsides. For example, a Google engineer reported to
us that Google’s memory allocator is reluctant to return memory to the OS precisely because
of scaling problems with munmap and as a result applications tie up gigabytes of memory until
they exit. This delays the start of other applications and uses servers inefficiently. Engineers
from other companies have reported similar problems to us.
RadixVM is a novel virtual memory system in which commutative operations on non-overlapping address space regions are almost always conflict-free. This ensures that if two
threads operate on non-overlapping regions of virtual memory, the VM system does not
limit the application’s scalability. Furthermore, if multiple threads operate on the same shared
region, RadixVM constrains conflicts to the cores executing those threads.
Achieving this within the constraints of virtual memory hardware, without violating
POSIX’s strict requirements on the ordering and global visibility of VM operations, and
without unacceptable memory overhead is challenging. The following sections describe
RadixVM’s approach. Section 8.2.1 describes the general architecture of POSIX VM systems.
Sections 8.2.2 and 8.2.3 describe the data structures that form the core of RadixVM. Finally,
section 8.2.4 describes how RadixVM uses these structures to implement standard VM
operations.
[Diagram: an application issues mmap and munmap against the memory map, while its loads and stores trigger page faults and TLB misses against the page table and per-CPU TLBs.]
Figure 8-4: Key structures in a POSIX VM system, showing an address space with three
mapped regions. Crossed-out page table entries are marked “not present.” Some pages have
not been faulted yet, and thus are not present, despite being in a mapped region.
8.2.1 POSIX VM architecture
Between the POSIX VM interface on one side and the hardware virtual memory interface on
the other, RadixVM’s general architecture necessarily resembles typical Unix VM systems.
Figure 8-4 shows the principal structures any Unix VM system must manage. On the left is the
kernel’s internal representation of the address space. Logically, this consists of a set of mapped
regions of the virtual address space, where each region may map either a file or “anonymous”
heap memory, has various access permissions, and may be either shared or copied across fork.
mmap and munmap manipulate this view of the address space.
Translating virtual addresses to physical addresses is performed in hardware by the memory management unit (MMU) and the MMU’s representation of virtual memory generally
differs significantly from the kernel’s internal representation of the address space. Most hardware architectures specify some form of page table data structure, like the x86 page table
depicted in Figure 8-4, for the kernel to inform the MMU of the mapping from virtual addresses to physical addresses. These structures map addresses at a page granularity (typically
4 KB, though most architectures also support larger pages) and represent access permissions
(which are checked by the MMU), but have little flexibility.
While we model the VM system in terms of mmap, munmap, memread, and memwrite, only
the first two are directly implemented by the VM system. The latter two are implemented by the
MMU using the kernel-provided page table. However, the VM system has a key hook into the
MMU: when the virtual address of a read or a write is marked “not present” in the page table
or fails a permission check, the MMU invokes the VM system’s pagefault operation. Modern
Unix kernels exploit this mechanism to implement demand paging, populating page table
entries by allocating pages or reading them from disk only once a page is first accessed, rather
than when it is mapped. Hence, mmap and munmap simply clear the page table entries of the
region being mapped or unmapped; mmap depends on pagefault operations to later fill these
entries with mappings. While demand paging significantly affects the implementation of the
VM system, it is nevertheless an implementation issue and does not affect the commutativity
of memread or memwrite.
The final key structure the VM system must manage is the hardware translation lookaside buffer (TLB). The TLB is a per-core associative cache of virtual-to-physical mappings.
In architectures with a hardware-filled TLB (x86, ARM, PowerPC, and many others), the
MMU transparently fills this cache from the page table. Invalidating the TLB, however, is the
responsibility of the VM system. Hence, munmap, in addition to removing regions from the
kernel’s internal address space representation and clearing entries in the page table, must also
invalidate the corresponding entries from each core’s TLB.
8.2.2 Radix tree
We first tackle RadixVM’s internal address space representation. Widely-used operating
system kernels represent an address space as a balanced tree of mapped memory regions.
For example, Linux uses a red-black tree [64], FreeBSD uses a splay tree [54], and Solaris
and Windows use AVL trees [51, 66]. This ensures O(log n) lookups and modifications for
address spaces that can easily contain thousands of mapped regions, but these data structures
are poorly suited for concurrent access: not only do they induce cache line conflicts between
operations on non-overlapping regions, they force all of these kernels to use coarse-grained
locking.
Lock-free data structures seem like a compelling solution to this problem. However,
while lock-free (or partially lock-free) data structures such as the Bonsai tree used by the
Bonsai VM [16], relativistic red-black trees [38], and lock-free skip lists [36] eliminate the
coarse-grained serialization that plagues popular VM systems, their operations are far from
conflict-free. For example, insert and lookup operations for a lock-free concurrent skip list can
conflict on interior nodes in the skip list—even when the lookup and insert involve different
keys and hence commute—because insert must modify interior nodes to maintain O(log n)
lookup time. The effect of these conflicts on performance can be dramatic, as shown by the
simple experiment in Figure 8-5. Furthermore, while these data structures maintain their own
integrity under concurrent operations, a VM system must maintain higher-level invariants
across the address space representation, page tables, and TLBs, and it is difficult to extend the
precise design of a single lock-free data structure to cover these broader invariants.
This raises the question: what is a good conflict-free data structure for virtual memory
metadata?
[Graph: lookups/sec/core vs. # cores performing lookups (1-80), with 0, 1, and 5 concurrent writers.]
Figure 8-5: Throughput of skip list lookups with concurrent inserts and deletes. The skip list
represents ∼1,000 virtual memory regions. Lookups commute with modifications, but even a
small number of writers significantly limits lookup scalability.
Straw-man solution. A (completely impractical) way to achieve conflict-free commutative
address space operations is to represent a process’s address space as a large linear array indexed
by virtual page number that stores each virtual page’s metadata individually. In this linear
representation, mmap, munmap, and pagefault can lock and manipulate precisely the range
of pages being mapped, unmapped, or faulted. Operations on non-overlapping regions are
clearly conflict-free and precise range locking makes maintaining invariants across structures
relatively straightforward.
Compressed radix trees. RadixVM follows the same general scheme as this straw-man
design, but reins in its memory consumption using a multilevel, compressed radix tree.
RadixVM’s internal address space representation resembles a hardware page table structurally, storing mapping metadata in a fixed-depth radix tree, where each level of the tree is
indexed by nine or fewer bits of the virtual page number, as shown in Figure 8-6. Like the
linear array, the radix tree supports only point queries (not range queries) and iteration, but
unlike the linear array, RadixVM can compress repeated entries and lazily allocate the nodes
of the radix tree. The radix tree folds any node that would consist entirely of identical values
into a single value stored in the parent node. This continues up to the root node of the tree,
allowing the radix tree to represent vast swaths of unused virtual address space with a handful
of empty slots and to set large ranges to identical values very quickly. This optimization reins
in the radix tree’s memory use, makes it efficient to set large ranges to the same value quickly,
and enables fast range scans. It does come at a small cost: operations that expand a subtree
will conflict with other operations in that same subtree, even if their regions do not ultimately
overlap. However, after initial address space construction, the radix tree is largely “fleshed
out” and these initialization conflicts become rare.
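To make the folding concrete, here is a hedged C sketch of a point lookup over such a tree, using a low tag bit to distinguish folded values from child pointers; this encoding is our illustration, not necessarily sv6's:

#include <stdint.h>

#define BITS_PER_LEVEL 9
#define SLOTS          (1 << BITS_PER_LEVEL)
#define FOLDED         1ul    /* tag bit: slot holds a value, not a child */

struct rnode { uintptr_t slot[SLOTS]; };

static uintptr_t radix_lookup(struct rnode *root, uint64_t vpn, int levels) {
    uintptr_t s = (uintptr_t)root;
    for (int l = levels - 1; l >= 0; l--) {
        if (s & FOLDED)            /* folded: one value covers this subtree */
            return s & ~FOLDED;
        if (!s)                    /* lazily allocated: nothing here yet */
            return 0;
        struct rnode *n = (struct rnode *)s;
        s = n->slot[(vpn >> (l * BITS_PER_LEVEL)) & (SLOTS - 1)];
    }
    return s & ~FOLDED;            /* leaf slot (possibly folded) */
}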
Figure 8-6: A radix tree containing a three page file mapping. This highlights the path for
looking up the 36-bit virtual page number shown in binary at the top of the figure. The last
level of the tree contains separate mapping metadata for each page.
Mapping metadata. To record each mapping, RadixVM logically stores a separate instance
of the mapping metadata in the radix tree for each page in the mapped range. This differs
from a typical design that allocates a single metadata object to represent the entire range of a
mapping (e.g., virtual memory areas in Linux). This is practical because RadixVM’s mapping
metadata object is small and foldable (the mapping metadata objects of all pages in a mapped
range are initially byte-wise identical), so large mappings can be created and stored efficiently
in just a few slots in the radix tree.
Also unlike typical virtual memory system designs, RadixVM stores pointers to physical
memory pages in the mapping metadata for pages that have been allocated. This is natural to
do in RadixVM because, modulo folding, there is a separate mapping metadata object for
each page. It’s also important to have this canonical representation of the physical memory
backing a virtual address space because of the way RadixVM handles TLB shootdown (see
section 8.2.3). This does increase the space required by the radix tree, but, asymptotically, it’s
no worse than the hardware page tables, and it means that the hardware page tables themselves
are cacheable memory that can be discarded by the OS to free memory.
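For illustration, a per-page mapping metadata object along these lines might be as follows; the fields are our guesses at the essentials, not sv6's actual layout:

#include <stdint.h>

struct mapmeta {
    struct inode *backing;    /* file backing the mapping, or NULL if anonymous */
    uint64_t      start_off;  /* offset of the mapping's start in the backing
                                 object, so fresh per-page copies are byte-wise
                                 identical and therefore foldable */
    uint32_t      perms;      /* protection and mapping flags */
    struct page  *phys;       /* physical page once faulted, else NULL */
    uint64_t      tlb_cores;  /* cores that may cache this mapping (see 8.2.3);
                                 a 64-bit mask for brevity */
};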
Radix node garbage collection. To keep the memory footprint of the radix tree in check,
the OS must be able to free nodes that no longer contain any valid mapping metadata. To
accomplish this without introducing undue conflicts, we leverage Refcache to scalably track
the number of used slots in each node. When this count drops to zero, the radix tree can
remove the node from the tree and delete it. Since RadixVM may begin using a node again
before Refcache reconciles the used slot count, nodes link to their children using weak
references, which allows the radix tree to revive nodes that go from empty to used before
Refcache deletes them, and to safely detect when an empty child node has been deleted.
Collapsing the radix tree does potentially introduce conflicts; however, unlike with eager
garbage collection schemes, rapidly changing mappings cannot cause the radix tree to rapidly
delete and recreate nodes. Since a node must go unused for at least two Refcache epochs
before it is deleted, any cost of deleting or recreating it (and any additional contention that
results) is amortized.
In contrast with more traditional balanced trees, using a radix tree to manage address space
metadata allows RadixVM to achieve near-perfect conflict-freedom (and hence scalability)
for metadata operations on non-overlapping regions of an address space. This comes at the
cost of a potentially larger memory overhead; however, address space layouts tend to exhibit
good locality and folding efficiently compresses large ranges, making radix trees a good fit for
a VM system, as we’ll confirm in section 9.6.
8.2.3 TLB management
The other major impediment to scaling mmap and munmap operations is the need to explicitly
invalidate cached virtual-to-physical mappings in per-core TLBs when a page is unmapped
(or remapped). Because TLB shootdowns must be performed on every CPU that may have
cached a page mapping that’s being modified, and because hardware does not provide information about which CPUs may have cached a particular mapping, typical Unix VM systems
conservatively broadcast TLB shootdown interrupts to all CPUs using the modified address
space [8], inducing cache line conflicts and limiting scalability.
RadixVM addresses this problem by precisely tracking the set of CPUs that have accessed
each page mapping. In architectures with software-filled TLBs (such as the MIPS or UltraSPARC) this is easy, since the MMU informs the kernel of each miss in the TLB. The kernel
can use these TLB miss faults to track exactly which CPUs have a given mapping cached and,
when a later mmap or munmap changes this mapping, it can deliver shootdown requests only
to cores that have accessed this mapping. On architectures with hardware-filled TLBs such as
the x86, RadixVM achieves the same effect using per-core page tables. With TLB tracking,
if an application thread allocates, accesses, and frees memory on one core, with no other
threads accessing the same memory region, then RadixVM will perform no TLB shootdowns.
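A sketch of the resulting targeted shootdown, using the mapmeta sketch above and assuming at most 64 cores so the tracking set fits in one word (the IPI helper is hypothetical):

#include <stdint.h>

void send_invalidate_ipi(int core, uint64_t va);   /* hypothetical IPI helper */

/* pagefault records the faulting core when it fills its per-core page table. */
static void record_tlb_fill(struct mapmeta *m, int core) {
    m->tlb_cores |= 1ull << core;
}

/* munmap and remap deliver invalidations only to cores that accessed the page. */
static void tlb_shootdown_page(struct mapmeta *m, uint64_t va) {
    for (int c = 0; c < 64; c++)
        if (m->tlb_cores & (1ull << c))
            send_invalidate_ipi(c, va);
    m->tlb_cores = 0;
}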
The downside to this approach is the extra memory required for per-core page tables.
We show in section 9.6 that this overhead is small in practice compared to the total memory
footprint of an application, but for applications with poor partitioning, it may be necessary
for the application to provide hints about widely shared regions so the kernel can share page
tables (similar to Corey address ranges [10]) or for the kernel to detect such regions. The
kernel can also reduce overhead by simply discarding page table pages when memory is low.
8.2.4 VM operations
With the components described above, RadixVM’s implementation of the core VM operations
is surprisingly straightforward. One of the most difficult aspects of the POSIX VM interface
in a multicore setting is its strict ordering and global visibility requirements [16]. For example,
before munmap in one thread can return, the region must appear unmapped to every thread
on every core despite caching at both software and hardware levels. Similarly, after mmap
returns, any thread on any core must be able to access the mapped region. Furthermore,
though not required by POSIX, it is generally assumed that these operations are atomic.
RadixVM enforces these semantics by always locking, from left to right, the bottom-most
radix tree entries for the region of an operation. This simple mechanism ensures that mmap,
munmap, and pagefault operations are linearizable [37] without causing conflicts between
operations on non-overlapping regions (assuming the tree is already expanded).
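The locking discipline can be sketched as follows; the per-slot locks and leaf accessor are hypothetical, and a real implementation advances past a folded slot rather than relocking it for each page it covers:

/* Lock the bottom-most radix tree entries for [lo, hi), always in
 * ascending address order. The fixed order prevents deadlock between
 * overlapping operations, and operations on disjoint ranges take
 * disjoint locks, so they remain conflict-free. */
static void lock_range(struct rnode *tree, uint64_t lo, uint64_t hi) {
    for (uint64_t vpn = lo; vpn < hi; vpn++)
        slot_lock(leaf_slot(tree, vpn));   /* hypothetical per-slot lock */
}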
An mmap invocation, like all RadixVM operations, first locks the range being mapped.
If there are existing mappings within the range, mmap unmaps them, as described later for
munmap. mmap then fills each slot in the region with the new mapping metadata (protection
level and flags arguments to mmap, as well as what backs this virtual memory range, such as
a file or anonymous memory). If parts of the mapping span entire nodes of the radix tree,
RadixVM will fold them into individual higher-level slots. Finally, RadixVM unlocks the
range. Like in other VM systems, mmap doesn’t allocate any physical pages, but leaves that to
pagefault, so that pages are allocated only when they are needed.
A pagefault invocation traverses the radix tree to find the mapping metadata for the
faulting address, and acquires a lock on it. It then allocates a physical page if one has not been
allocated yet (for anonymous memory), or fetches it from the buffer cache (for file mappings)
and stores it in the mapping metadata. Finally, pagefault fills in the page table entry in the local
core’s page table, and adds the local core number to the TLB shootdown list in the mapping
metadata for that address. pagefault then releases the lock and returns.
To implement munmap, RadixVM must clear mapping metadata from the radix tree, clear
page tables, invalidate TLBs, and free physical pages. munmap begins by locking the range
being unmapped, after which it can scan the region’s metadata to gather references to the
physical pages backing the region, collect the set of cores that have faulted pages in the region
into their per-core page tables, and clear each page’s metadata. It can then send inter-processor
interrupts to the set of cores it collected in the first step. These interrupts cause the remote
cores (and the core running munmap) to clear the appropriate range in their per-core page
table and invalidate the corresponding local TLB entries. Once all cores have completed
this shootdown process, munmap can safely release its lock on the range and decrement the
reference counts on the physical pages that were unmapped.
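Putting the pieces together, munmap's phases can be sketched as follows; all helpers are hypothetical and error handling is omitted:

static void vm_munmap(struct rnode *tree, uint64_t lo, uint64_t hi) {
    lock_range(tree, lo, hi);
    uint64_t cores = 0;
    for (uint64_t vpn = lo; vpn < hi; vpn++) {
        struct mapmeta *m = leaf_meta(tree, vpn);   /* hypothetical accessor */
        if (!m)
            continue;
        cores |= m->tlb_cores;    /* 1. gather cores that faulted these pages */
        defer_put_page(m->phys);  /*    and references to the backing pages */
        clear_meta(m);            /* 2. clear per-page mapping metadata */
    }
    tlb_shootdown(cores, lo, hi); /* 3. IPIs clear per-core page tables and TLBs */
    unlock_range(tree, lo, hi);
    put_deferred_pages();         /* 4. now decrement the pages' refcounts */
}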
8.2.5 Discussion
By combining Refcache for scalable reference counting, radix trees for maintaining address
space metadata, and per-core page tables for precise TLB tracking and shootdown, RadixVM
achieves conflict-freedom for the vast majority of commutative VM operations. The clear
commutativity properties of the POSIX VM interface, combined with the right data structures,
make this possible with a straightforward concurrency plan based on precise range locking.
RadixVM balances practical memory consumption with strict adherence to the scalable
commutativity rule, allowing conflicts between some commutative operations where doing
so enables far more efficient memory use. However, such conflicts are rare: operations that
allocate a radix node will conflict with later operations on that node’s range of the key space,
but typical address space usage involves few node-allocating operations.
As shown in Figure 8-1, Commuter confirmed that most commutative operations in
RadixVM are conflict-free. In chapter 9, we further confirm that RadixVM’s design translates
into scalable performance for application workloads.
8.3 ScaleFS: Conflict-free file system operations
ScaleFS encompasses sv6’s unified buffer cache and VFS layers, providing operations such as
read, write, open, and unlink. We focused on the VFS and buffer cache layers because these
are the common denominators of all file systems in a Unix kernel. In contrast with RadixVM,
ScaleFS makes extensive use of well-known techniques for scalable implementations, such
as per-core resource allocation, double-checked locking, lock-free readers using RCU [49],
and seqlocks [43: §6]. ScaleFS also employs Refcache for tracking both internal resources and
inode link counts. ScaleFS is also structured much like contemporary Unix VFS subsystems,
with inode and directory caches represented as concurrent hash tables and per-file page
caches. What sets ScaleFS apart is that the details of its implementation were guided by the
scalable commutativity rule and, in particular, by Commuter. This led to several common
design patterns, which we illustrate with example test cases from Commuter:
Layer scalability. ScaleFS uses data structures that themselves naturally satisfy the commutativity rule, such as linear arrays, radix trees, and hash tables. In contrast with structures like
balanced trees, these data structures typically share no cache lines when different elements
are accessed or modified. For example, ScaleFS stores the cached data pages for a given inode
using a radix tree, so that concurrent reads or writes to different file pages scale, even in the
presence of operations extending or truncating the file. Commuter led us to an additional
benefit of this representation: many operations also use this radix tree to determine if some
offset is within the file’s bounds without reading the size and conflicting with operations that
change the file’s size. For example, pread first probes this radix tree for the requested read
offset: if the offset is beyond the last page of the file, it can return 0 immediately without
reading (and potentially conflicting on) the file size.
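A sketch of that fast path, under the assumption that the page-cache tree holds a page for every in-bounds offset, so a missing page implies the offset is past the last page (helpers and the file layout are hypothetical):

static ssize_t file_pread(struct file *f, void *buf, size_t n, uint64_t off) {
    /* Probe the per-file page cache radix tree instead of reading the
     * (frequently written) size field. */
    if (!radix_lookup(f->pages, off / PAGE_SIZE, FILE_LEVELS))
        return 0;     /* past the last page: EOF, with no conflicting reads */
    return do_pread_locked(f, buf, n, off);   /* ordinary path, omitted */
}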
Defer work. Many ScaleFS resources are shared, such as file descriptions and inode objects,
and must be freed when no longer referenced. Typically, kernels release resources immediately,
but this requires eagerly tracking references to resources, causing commutative operations
that access the same resource to conflict. Where releasing a resource is not time-sensitive,
ScaleFS uses Refcache to batch reference count reconciliation and zero detection. This way,
resources are eventually released, but within each Refcache epoch commutative operations
can be conflict-free.
Some resources are artificially scarce, such as inode numbers in a typical Unix file system.
When a typical Unix file system runs out of free inodes, it must reuse an inode from a recently
deleted file. This requires finding and garbage collecting unused inodes, which induces
conflicts. However, the POSIX interface does not require that inode numbers be reused, only
that the same inode number is not used for two files at once. Thus, ScaleFS never reuses
inode numbers. Instead, inode numbers are generated by a monotonically increasing per-core
counter, concatenated with the core number that allocated the inode. This allows ScaleFS
to defer inode garbage collection for longer periods of time, and enables conflict-free and
scalable per-core inode allocation.
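This scheme is essentially the following, where the width of the core field is an assumption:

#include <stdint.h>

#define CORE_BITS 8   /* assumed: up to 256 cores */

/* Per-core monotonic counter concatenated with the allocating core's id.
 * No shared state is touched, so concurrent allocations are conflict-free,
 * and no two files ever share an inode number. */
static uint64_t alloc_inum(int core) {
    static uint64_t next[NCORES];   /* callers run with preemption disabled */
    return (next[core]++ << CORE_BITS) | (uint64_t)core;
}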
Precede pessimism with optimism. Many operations in ScaleFS have an optimistic check
stage followed by a pessimistic update stage, a generalized sort of double-checked locking. The
optimistic stage checks conditions for the operation and returns immediately if no updates
are necessary (this is often the case for error returns, but can also happen for success returns).
This stage does no writes or locking, but because no updates are necessary, it is often easy to
make atomic. If updates are necessary, the operation acquires locks or uses lock-free protocols,
re-verifies its conditions to ensure atomicity of the update stage, and performs updates. For
example, lseek computes the new offset using a lock-free read-only protocol and returns early
if the new offset is invalid or equal to the current offset. Otherwise, lseek locks the file offset
and re-computes the new offset to ensure consistency. In fact, lseek has surprisingly complex
interactions with state and other operations, and arriving at a protocol that was both correct
and conflict-free in all commutative cases would have been difficult without Commuter.
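The pattern for lseek looks roughly like this; the lock-free read and the helpers are stand-ins for ScaleFS's actual protocol:

static off_t file_lseek(struct file *f, off_t delta, int whence) {
    /* Optimistic stage: compute the new offset from a lock-free snapshot. */
    off_t cur  = seqlock_read(&f->off);            /* hypothetical helper */
    off_t next = compute_offset(f, cur, delta, whence);
    if (next < 0)
        return -1;         /* error return: no writes, no locks taken */
    if (next == cur)
        return next;       /* nothing to update: stays conflict-free */

    /* Pessimistic stage: lock, re-verify, then update atomically. */
    lock(&f->off_lock);
    next = compute_offset(f, f->off, delta, whence);
    if (next >= 0)
        f->off = next;
    unlock(&f->off_lock);
    return next;
}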
rename is similar. If two file names a and b point to the same inode, rename(a, b) should
remove the directory entry for a, but it does not need to modify the directory entry for b, since
it already points at the right inode. By checking the directory entry for b before updating it,
rename(a, b) avoids conflicts with other operations that look up b. As we saw in section 6.1.2,
rename has many surprising and subtle commutative cases and, much like lseek, Commuter
was instrumental in helping us find an implementation that was conflict-free in these cases.
Don’t read unless necessary. A common internal interface in a file system implementation
is a namei function that checks whether a path name exists, and if so, returns the inode for
that path. However, reading the inode is unnecessary if the caller wants to know only whether
a path name existed, such as an access(F_OK) system call. In particular, the namei interface
makes it impossible for concurrent access(b, F_OK) and rename(a, b) operations to scale when
a and b point to different inodes, even though they commute. ScaleFS has a separate internal
interface to check for existence of a file name, without looking up the inode, which allows
access and rename to scale in such situations.
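The split interface amounts to the following, with illustrative names:

#include <stdbool.h>

bool nameiexists(const char *path);      /* directory entries only; never
                                            reads or conflicts on the inode */
struct inode *namei(const char *path);   /* full lookup, returns the inode */

/* access(path, F_OK) needs only existence, so it can commute with a
 * concurrent rename that moves a different inode onto another name. */
static int access_fok(const char *path) {
    return nameiexists(path) ? 0 : -1;
}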
8.4 Difficult-to-scale cases
As Figure 8-1 illustrates, there are a few (123 out of 26,238) commutative test cases for which
RadixVM and ScaleFS are not conflict-free. The majority of these tests involve idempotent
updates to internal state, such as two lseek operations that both seek a file descriptor to the
same offset, or two anonymous mmap operations with the same fixed base address and permissions. While it is possible to implement these scalably, every implementation we considered
significantly impacted the performance of more common operations, so we explicitly chose
to favor common-case performance over total scalability. Even though we decided to forego
scalability in these cases, the commutativity rule and Commuter forced us to consciously
make this trade-off.
Other difficult-to-scale cases are more varied. Several involve reference counting of pipe
file descriptors. Closing the last file descriptor for one end of a pipe must immediately affect the
other end; however, since there’s generally no way to know a priori if a close will close the pipe,
a shared reference count is used in some situations. Other cases involve operations that return
the same result in either order, but for different reasons, such as two reads from a file filled
with identical bytes. By the rule, each of these cases has some conflict-free implementation,
but making these particular cases conflict-free would have required sacrificing the conflict-freedom of many other operations.
Nine
Performance evaluation
Given that nearly all commutative ScaleFS and RadixVM operations are conflict-free, in
principle, applications built on these operations should scale perfectly. This chapter confirms
this, completing a pyramid whose foundations were set in chapter 3 when we demonstrated
that conflict-free memory accesses scale in most circumstances on real hardware. This chapter
extends these results, showing that complex operating system calls built on conflict-free
memory accesses scale and that, in turn, applications built on these operations scale. We focus
on the following questions:
• Do conflict-free implementations of commutative operations and applications built
using them scale on real hardware?
• Do non-commutative operations limit performance on real hardware?
Since real systems cannot focus on scalability to the exclusion of other performance
characteristics, we also consider the balance of performance requirements by exploring the
following question:
• Can implementations optimized for linear scalability of commutative operations also
achieve competitive sequential performance, reasonable (albeit sub-linear) scalability
of non-commutative operations, and acceptable memory use?
To answer these questions, we use sv6. In addition to the operations analyzed in chapter 8,
we scalably implemented other commutative operations (e.g., posix_spawn) and many of the
modified POSIX APIs from chapter 5. All told, sv6 totals 51,732 lines of code, including
user space and library code. Using sv6, we evaluate two microbenchmarks and one application benchmark focused on file system operations, and three microbenchmarks and one
application benchmark focused on virtual memory operations.
9.1 Experimental setup
We ran experiments on an 80-core machine with eight 2.4 GHz 10-core Intel E7-8870 chips
and 256 GB of RAM (detailed earlier in Figure 3-2). When varying the number of cores,
benchmarks enable whole sockets at a time, so each 30 MB socket-level L3 cache is shared by
exactly 10 enabled cores. We also report single-core numbers for comparison, though these
are expected to be higher because without competition from other cores in the socket, the
one core can use the entire 30 MB cache.
We run all benchmarks with the hardware prefetcher disabled because we found that it
often prefetched contended cache lines to cores that did not ultimately access those cache
lines, causing significant variability in our benchmark results and hampering our efforts to
precisely control sharing. We believe that, as large multicores and highly parallel applications
become more prevalent, prefetcher heuristics will likewise evolve to not induce this false
sharing.
As a performance baseline, we run the same benchmarks on Linux 3.5.7 from Ubuntu
Quantal. All benchmarks compile and run on Linux and sv6 without modifications. Direct comparison is difficult because Linux implements many features sv6 does not, but this
comparison confirms sv6’s sequential performance is sensible.
We run each benchmark three times and report the mean. Variance from the mean is
always under 5% and typically under 1%.
9.2 File system microbenchmarks
Each file system benchmark has two variants, one that uses standard, non-commutative
POSIX APIs and another that accomplishes the same task using the modified, more broadly
commutative APIs from chapter 5. By benchmarking the standard interfaces against their
commutative counterparts, we can isolate the cost of non-commutativity and also examine
the scalability of conflict-free implementations of commutative operations.
In general, it’s difficult to argue that an implementation of a non-commutative
interface achieves the best possible scalability for that interface and that no implementation
could scale better. However, in limited cases, we can do exactly this. We start with statbench,
which measures the scalability of fstat with respect to link. This benchmark creates a single file
that n/2 cores repeatedly fstat. The other n/2 cores repeatedly link this file to a new, unique file
name, and then unlink the new file name. As discussed in chapter 5, fstat does not commute
with link or unlink on the same file because fstat returns the link count. In practice, applications
rarely invoke fstat to get the link count, so sv6 introduces fstatx, which allows applications to
request specific fields (a similar system call has been proposed for Linux [39]).
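For example, a caller interested in everything but the link count might write something like the following; the mask constants are illustrative, and we do not show sv6's exact flag names:

struct stat st;
/* Request all fields except st_nlink; omitting the link count makes the
 * call commute with concurrent link and unlink on the same file. */
fstatx(fd, &st, STAT_ALL & ~STAT_NLINK);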
We run statbench in two modes: one mode uses fstat, which does not commute with the
link and unlink operations performed by the other threads, and the other mode uses fstatx
to request all fields except the link count, an operation that does commute with link and
unlink. We use a Refcache scalable counter for the link count so that the links and unlinks are
conflict-free, and place it on its own cache line to avoid false sharing. Figure 9-1(a) shows the
results.
[Graphs: (a) statbench throughput (fstats/sec/core and links+unlinks/sec/core) for fstatx+Refcache, fstat+Refcache, and fstat+Shared counter; (b) openbench throughput (opens/sec/core) for Any FD and Lowest FD; 1-80 cores.]
Figure 9-1: File system microbenchmark throughput in operations per second per core with
varying core counts on sv6. The blue dots indicate single core Linux performance for comparison.
With the commutative fstatx, statbench scales perfectly for both fstatx and link/unlink
and experiences zero L2 cache misses in fstatx. On the other hand, the traditional fstat scales
poorly and the conflicts induced by fstat impact the scalability of the threads performing link
and unlink.
To better isolate the difference between fstat and fstatx, we run statbench in a third mode
that uses fstat, but represents the link count using a simple shared counter instead of Refcache.
In this mode, fstat performs better at low core counts, but fstat, link, and unlink all suffer at
higher core counts. With a shared link count, each fstat call experiences exactly one L2 cache
miss (for the cache line containing the link count), which means this is the most scalable
that fstat can possibly be in the presence of concurrent links and unlinks. Yet, despite sharing
only a single cache line, the seemingly innocuous conflict arising from the non-commutative
interface limits the implementation’s scalability. One small tweak to make the operation
commute by omitting st_nlink eliminates the barrier to scaling, demonstrating that even an
optimal implementation of a non-commutative operation can have severely limited scalability,
underscoring the results of chapter 3.
In the case of fstat, optimizing for scalability sacrifices some sequential performance.
Tracking the link count with Refcache (or some scalable counter) is necessary to make link
and unlink scale linearly, but requires fstat to reconcile the distributed link count to return
st_nlink. The exact overhead depends on the core count, which determines the number of
Refcache caches, but with 80 Refcache caches, fstat is 3.9× more expensive than on Linux. In
contrast, fstatx can avoid this overhead unless the caller requests link counts; like fstat with a
shared count, it performs similarly to Linux’s fstat on a single core.
Figure 9-1(b) shows the results of openbench, which stresses the file descriptor
allocation performed by open. In openbench, n threads concurrently open and close per-thread files. These calls do not commute because each open must allocate the lowest unused
file descriptor in the process. For many applications, it suffices to return any unused file
descriptor (in which case the open calls commute), so sv6 adds an O_ANYFD flag to open,
which it implements using per-core partitions of the FD space. Much like statbench, the
standard, non-commutative open interface limits openbench’s scalability, while openbench
with O_ANYFD scales linearly.
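A caller that does not depend on lowest-FD semantics simply adds the flag, as in this usage sketch:

/* Opt out of lowest-FD allocation; sv6 may then return a descriptor from
 * this core's partition of the FD space, keeping concurrent opens
 * conflict-free. */
int fd = open("spool/msg", O_CREAT | O_RDWR | O_ANYFD, 0600);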
Furthermore, there is no performance penalty to ScaleFS’s open, with or without O_ANYFD:
at one core, both cases perform identically and outperform Linux’s open by 27%. Some of the
performance difference is because sv6 doesn’t implement things like permissions checking,
but much of Linux’s overhead comes from locking that ScaleFS avoids.
9.3 File system application performance
We perform a similar experiment using a simple mail server to produce a file system workload
more representative of a real application. The mail server uses a sequence of separate, communicating processes, each with a specific task, roughly like qmail [6]. mail-enqueue takes a
mailbox name and a single message on standard in, writes the message and the envelope to
two files in a mail queue directory, and notifies the queue manager by writing the envelope
file name to a Unix domain datagram socket. mail-qman is a long-lived multithreaded process
where each thread reads from the notification socket, reads the envelope information, opens
the queued message, spawns and waits for the delivery process, and then deletes the queued
message. Finally, mail-deliver takes a mailbox name and a single message on standard in and
delivers the message to the appropriate Maildir. The benchmark models a mail client with n
threads that continuously deliver email by spawning and feeding mail-enqueue.
As in the microbenchmarks, we run the mail server in two configurations: in one we use
lowest FD, an order-preserving socket for queue notifications, and fork/exec to spawn helper
processes; in the other we use O_ANYFD, an unordered notification socket, and posix_spawn,
all as described in chapter 5. For queue notifications, we use a Unix domain datagram socket.
sv6 implements this with a single shared queue in ordered mode. In unordered mode, sv6 uses
load-balanced per-core message queues. Load balancing only triggers when a core attempts
to read from an empty queue, so operations on unordered sockets are conflict-free as long as
consumers don't outpace producers. Finally, because fork commutes with essentially no other
operations in the same process, sv6 implements posix_spawn by constructing the new process
image directly and building the new file table. This implementation is conflict-free with most
other operations, including operations on O_CLOEXEC files (except those specifically duped
into the new process).

[Graph: emails/sec/core vs. # cores (1-80) for Commutative APIs and Regular APIs.]
Figure 9-2: Mail server benchmark throughput. The blue dot indicates single core Linux
performance for comparison.
Figure 9-2 shows the resulting scalability of these two configurations. Even though the mail
server performs a much broader mix of operations than the microbenchmarks and doesn’t
focus solely on non-commutative operations, the results are quite similar. Non-commutative
operations cause the benchmark’s throughput to collapse at a small number of cores, while the
configuration that uses commutative APIs achieves 7.5× scalability from 1 socket (10 cores)
to 8 sockets.
9.4 Virtual memory microbenchmarks
To understand the scalability and performance of sv6’s RadixVM virtual memory system, we
use three microbenchmarks, each of which exercises a specific pattern of address space usage.
As in the earlier benchmarks, we compare RadixVM against Linux 3.5.7. We additionally compare against the Bonsai VM system [16] (based on Linux 2.6.37). As discussed in
section 8.2.2, Linux’s mmap, munmap, and pagefault all acquire a per-address space lock. This
serializes mmap and munmap operations. Linux’s pagefault acquires the lock in shared mode,
allowing pagefaults to proceed in parallel, but acquiring the address space lock in shared mode
still induces conflicts on the cache line storing the lock. Bonsai modifies the Linux virtual
memory system so that pagefaults do not acquire this lock. This makes most commutative
pagefaults conflict-free in Bonsai, though Bonsai does not modify the locking behavior of
mmap and munmap, so all mmap and munmap operations on the same address space conflict.
[Graphs: page writes/sec/core vs. # cores (1-80) for sv6, Linux, and Bonsai: (a) vmlocal, (b) vmpipeline, (c) vmglobal.]
Figure 9-3: Virtual memory microbenchmark throughput in page writes per second per core
with varying core counts.
Figure 9-3 shows the throughput of our three microbenchmarks on sv6, Bonsai, and
Linux. For consistency, we measure the number of pages written per second per core in all
three benchmarks. Because sv6 uses per-core page tables, each of these writes translates into
a page fault, even if the page has already been faulted by another core. Linux and Bonsai incur
fewer page faults than sv6 for the pipeline and global microbenchmarks because all cores use
the same page table.
The vmlocal microbenchmark exercises completely disjoint address space usage.
In vmlocal, each thread mmaps a single page in the shared address space, writes to the page
(invoking pagefault), and then munmaps the page. Because each thread maps a page at a
different virtual address, all operations performed by vmlocal commute.
Many concurrent memory allocators use per-thread memory pools that specifically
optimize for thread-local allocations and exhibit this pattern of address space manipulation [30, 32]. However, such memory allocators typically map memory in large batches and
conservatively return memory to the operating system to avoid burdening the virtual
memory system. Our microbenchmark does the opposite, using 4 KB regions to maximally
stress the VM system.
The vmlocal microbenchmark scales linearly on sv6. While this benchmark does incur
cache misses, none are conflict misses and none require cross-core communication. Regardless
of the number of cores, we observe about 75 L2 cache misses per iteration and about 50
L3 cache misses, almost all a result of page zeroing. None of these require cross-socket
communication as all are satisfied from the core’s local DRAM, which is consistent with
Commuter’s determination that these operations are conflict-free. Likewise, because sv6
can track remote TLBs precisely, the vmlocal microbenchmark sends no TLB shootdowns.
Because these operations are conflict-free—there is no lock contention, a small and fixed
number of cache misses, no remote DRAM accesses, and no TLB shootdowns—the time
required to mmap, pagefault, and munmap is constant regardless of the number of cores.
Linux and Bonsai, on the other hand, slow down as we add more cores. This is not
unexpected: Linux acquires the address space lock three times per iteration and Bonsai twice
per iteration, effectively serializing the benchmark. sv6 also demonstrates good sequential
performance: at one core, sv6’s performance is within 8% of Linux, and it is likely this could
be improved.
The vmpipeline benchmark captures the pattern of streaming or pipeline communication, such as a map task communicating with a reduce task in MapReduce [25]. In
vmpipeline, each thread mmaps a page of memory, writes to the page, and passes the page
to the next thread in sequence, which also writes to the page and then munmaps it. The
operations performed by neighboring cores do not commute and their implementation is
not conflict-free. However, the operations of non-neighboring cores do commute, so unlike
the non-commutative cases in the file system benchmarks, increasing the number of cores
in vmpipeline increases the number of conflicted cache lines without increasing the number of
cores conflicting on each cache line. As a result, vmpipeline still scales well on sv6, but not
linearly. We observe similar cache miss rates as vmlocal, but vmpipeline induces cross-socket
memory references for pipeline synchronization, returning freed pages to their home nodes
when they are passed between sockets, and cross-socket shootdown IPIs. For Linux and
Bonsai, vmpipeline is almost identical to vmlocal except that it writes twice to each page; hence
we see almost identical performance, scaled up by a factor of two. Again, sv6’s single core
performance is within 8% of Linux.
Finally, vmglobal simulates a widely shared region such as a memory-mapped
library or a shared data structure like a hash table. In vmglobal, each thread mmaps a 64 KB
part of a large region of memory, then all threads access all of the pages in the large region
in a random order. These operations commute within each phase—all mmaps are to non-overlapping regions and pagefaults always commute—but not between phases.
Vmglobal scales well on sv6, despite being conceptually poorly suited to sv6’s per-core
page tables and targeted TLB shootdown. In this benchmark, sv6 performance is limited by
the cost of TLB shootdowns: at 80 cores, delivering shootdown IPIs to the other 79 cores and
waiting for acknowledgments takes nearly a millisecond. However, at 80 cores, the shared
region is 20 MB, so this cost is amortized over a large number of page faults. Linux and Bonsai
perform better on this benchmark than on vmlocal and vmpipeline because it has a higher
ratio of page faults to mmap and munmap calls, but they still fail to scale.
9.5 Virtual memory application benchmark
To evaluate the impact of RadixVM on application performance, we use Metis, a high-performance single-server multithreaded MapReduce library, to compute a word position
index from a 4 GB in-memory text file [25, 45]. Metis exhibits all of the sharing patterns
exercised by the microbenchmarks: it uses core-local memory, it uses a globally shared B+-tree
to store key-value pairs, and it also has pairwise sharing of intermediate results between map
tasks and reduce tasks. Metis also allows a direct comparison with the Bonsai virtual memory
system [16], which used Metis as its main benchmark.
By default Metis uses the Streamflow memory allocator [57], which is designed to minimize pressure on the VM system, but nonetheless suffers from contention in the VM system
when running on Linux [16]. Previous systems that used this library avoided contention
for in-kernel locks by using super pages and improving the granularity of the super page
allocation lock in Linux [11], or by having the memory allocator pre-allocate all memory
upfront [45]. While these workarounds do allow Metis to scale on Linux, we wanted to focus
on the root scalability problem in the VM system rather than the efficacy of workarounds
and to eliminate compounding factors from differing library implementations, so we use
a custom allocator on both Linux and sv6 designed specifically for Metis. In contrast with
general-purpose memory allocators, this allocator is simple and designed to have no internal
contention: memory is mapped in fixed-sized blocks, free lists are exclusively per-core, and
the allocator never returns memory to the OS.
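The following C sketch conveys the flavor of such an allocator. It is illustrative, not the actual Metis allocator: the object size is fixed, the names are ours, and cross-thread frees (which a real allocator must handle) simply land on the freeing thread's list.

    /* Sketch of a contention-free allocator: fixed-size blocks from
     * mmap, strictly per-thread free lists, and memory is never
     * returned to the OS. */
    #include <stddef.h>
    #include <sys/mman.h>

    #define BLOCK_SIZE (8 * 1024 * 1024)   /* allocation unit, e.g. 8 MB */
    #define CHUNK      4096                /* object size handed out */

    struct freenode { struct freenode *next; };

    /* One free list per thread: no locks, no sharing, no conflicts. */
    static __thread struct freenode *freelist;

    static void *chunk_alloc(void) {
        if (freelist == NULL) {
            /* Refill from the OS in one large fixed-size block; with a
             * large unit, mmap invocations (and any conflicts they
             * cause) are rare. */
            char *block = mmap(NULL, BLOCK_SIZE, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (block == MAP_FAILED)
                return NULL;
            for (size_t off = 0; off < BLOCK_SIZE; off += CHUNK) {
                struct freenode *n = (struct freenode *)(block + off);
                n->next = freelist;
                freelist = n;
            }
        }
        struct freenode *n = freelist;
        freelist = n->next;
        return n;
    }

    static void chunk_free(void *p) {
        /* Freed chunks go back on this thread's list and are never
         * munmapped, so the VM system sees only mmaps and faults. */
        struct freenode *n = p;
        n->next = freelist;
        freelist = n;
    }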
Two factors determine the scalability of Metis: conflicts between concurrent mmaps during
the map phase and conflicts between concurrent pagefaults during the reduce phase. If the
memory allocator uses a large allocation unit, Metis can avoid the first source of contention by
minimizing the number of mmap invocations. Therefore, we measure Metis using two different
allocation units: 8 MB to stress pagefault and 64 KB to stress mmap. At 80 cores, Metis invokes
mmap 4,145 times in the 8 MB configuration, and 232,464 in the 64 KB configuration. In both
cases, it invokes pagefault approximately 10 million times, where 65% of these page faults
cause it to allocate new physical pages and the rest bring pages already faulted on another
core into the per-core page tables.
Figure 9-4 shows how Metis scales for the three VM systems.
[Figure 9-4 plots throughput in jobs/hour/core against the number of cores (1 to 80) for sv6/8 MB, sv6/64 KB, Linux/8 MB, Linux/64 KB, Bonsai/8 MB, and Bonsai/64 KB.]
Figure 9-4: Metis application scalability for different VM systems and allocation unit sizes.
sv6/8 MB and sv6/64 KB perform identically, so their lines are indistinguishable.
Metis on sv6 not only scales near-linearly, but performs identically in both the 64 KB and
8 MB configurations because virtually all of the virtual memory operations performed by
Metis are commutative and thus conflict-free on RadixVM.
In contrast, Metis scales poorly on Linux in both configurations, spending most of its
time at high core counts in the virtual memory system acquiring the address space lock rather
than performing useful computation. This is true even in the pagefault-heavy configuration
because, while pagefaults run in parallel on Linux, acquiring the address space lock in shared
mode induces cache line conflicts.
For the 8 MB pagefault-heavy configuration, Bonsai scales as well as RadixVM because it
also achieves conflict-free pagefault operations. Furthermore, in this configuration sv6 and
Bonsai perform similarly—sv6 is ∼ 5% slower than Bonsai at all core counts—suggesting
that there is little or no sequential performance penalty to RadixVM’s very different design.
It’s likely we could close this gap with further work on the sequential performance of sv6.
Bonsai suffers in the 64 KB configuration because, unlike sv6, Bonsai’s mmap and munmap
operations are not conflict-free.
9.6 Memory overhead
Thus far, we’ve considered two measures of throughput: scalability and sequential performance.
This section turns to memory overhead, another important performance metric.
While ScaleFS's data structures closely resemble those of traditional Unix kernels, the scalability of RadixVM depends on a less compact representation of virtual memory than a traditional binary tree of memory regions, as well as on per-core page tables.
To quantify the memory overhead of RadixVM’s radix tree, we took snapshots of the
virtual memory state of various memory-intensive applications and servers running on Linux
and measured the space required to represent the address space metadata in both Linux and
RadixVM. Linux uses a single object to represent each contiguous range of mapped memory
(a virtual memory area or VMA), arranged in a red-black tree, which makes its basic address
space layout metadata very compact. As a result, Linux must store information about the
physical pages backing mapped virtual pages separately, which it cleverly does in the hardware
page table itself, making the hardware page table a crucial part of the address space metadata.
RadixVM, on the other hand, stores both OS-specific metadata and physical page mappings
together in the radix tree. This makes the radix tree larger than the equivalent VMA tree, but
means RadixVM can freely discard memory used for per-core hardware page tables.
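The contrast can be sketched with two simplified C structures; the field names here are hypothetical and much is omitted (locks, file mappings, and so on).

    /* One Linux VMA compactly describes an entire mapped range;
     * per-page state lives in the hardware page table. */
    struct vma {
        unsigned long start, end;     /* virtual address range */
        unsigned long prot_flags;     /* metadata for the whole range */
    };

    /* A RadixVM-style radix tree instead keeps one entry per virtual
     * page, holding the OS metadata and the physical mapping together,
     * which lets per-core hardware page tables be discarded freely. */
    struct radix_leaf_entry {
        unsigned long prot_flags;     /* OS metadata for this page */
        void         *phys_page;      /* backing physical page, if any */
    };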
                      Linux                     RadixVM
           RSS        VMA tree    Page table    Radix tree (rel. to Linux)
Firefox    352 MB     117 KB      1.5 MB        3.9 MB (2.4×)
Chrome     152 MB     124 KB      1.1 MB        2.4 MB (2.0×)
Apache     16 MB      44 KB       368 KB        616 KB (1.5×)
MySQL      84 MB      18 KB       348 KB        980 KB (2.7×)

Table 9-5: Memory usage for alternate VM representations. RSS (resident set size) gives the size of the memory mapped and faulted by the application. The other columns give the size of the address space metadata structures used by Linux and RadixVM.

Table 9-5 summarizes the memory overhead of these two representations for four applications: the Firefox and Chrome web browsers, after significant use, and the Apache web server
and MySQL database server used by the EuroSys 2013 paper submission web site. Compared
to each application’s resident set size (the memory directly allocated by the application), the
radix tree incurred at most a 3.7% overhead in total memory footprint.
9.7 Discussion
This chapter completes our trajectory from theory to practice. The benchmarks herein have
demonstrated that commutative operations can be implemented to scale, confirming the
scalable commutativity rule for complex interfaces and real hardware and validating the
designs and design methodologies set out in chapter 8. Several of these benchmarks have
also demonstrated the necessity of commutativity and conflict-freedom for scalability by
showing that even a single contended cache line in a complex operation can severely limit
scalability. Furthermore, all of this can be accomplished at little cost to sequential performance
or memory use.
Ten
Future directions
In the course of research, every good result inspires at least two good questions. This chapter
takes a step back and reviews some of the questions we have only begun to explore.
10.1 The non-scalable non-commutativity rule
The scalable commutativity rule shows that SIM-commutative regions have conflict-free
implementations. It does not show the inverse, however: that given a non-commutative
region, there is no implementation that is conflict-free (or linearly scalable) in that region.
In general, the inverse is not true. For example, consider an interface with two operations:
increment and get, where get can return either the current value or unknown. increment and
get operations do not SIM commute because the specification permits an implementation
where get always returns the current value and thus the order of these operations is observable.
But the specification does permit a trivially conflict-free implementation with no state where
get always returns unknown. This disproves the inverse of the rule, but seems to do so on an
unsatisfying technicality: this example is “too non-deterministic.”
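The offending implementation is almost trivial to write down; the following C sketch (with hypothetical signatures) keeps no state at all, so no two operations can ever conflict.

    /* A conflict-free implementation permitted by the specification:
     * increment does nothing and get always answers "unknown". */
    void increment(void) {
        /* no state to update */
    }

    int get(long *value_out) {
        (void)value_out;
        return 0;   /* 0 means "unknown"; the count is never revealed */
    }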
What if we swing to the other extreme and consider only interfaces that are sequentially
consistent (every legal history has a sequential reordering that is also legal) and sequentially
deterministic (the sequence of invocations in a sequential history uniquely determines the
sequence of responses)? Many interfaces satisfy these constraints, such as arrays, ordered
sets, and ordered maps, as well as many subsets of POSIX like file I/O (read/write/etc.). For
sequentially consistent, sequentially deterministic interfaces, the inverse of the rule is true.
To see why, first consider an implementation that is conflict-free for Y in X ∥ Y. By
definition, the state components read by all of the steps of each thread in Y must be disjoint
from the state components written by steps on other threads, so the steps of a thread cannot
be affected by the steps of other threads. Since the invocations on each thread will be the same
in any reordering of Y and the implementation is conflict-free in Y, it must yield the same
responses and the same final state for any reordering of Y.
However, if Y is non-SIM commutative in X ∥ Y and the specification is sequentially consistent and sequentially deterministic, an implementation must yield some different response or a different final state in some reordering of Y. Thus, an implementation cannot be
conflict-free in Y.
Many real-world interfaces are sequentially consistent and sequentially deterministic,
but many are not (e.g., unsurprisingly, POSIX). Indeed, we consider non-determinism an
important tool in the design of broadly commutative interfaces.
We believe, however, that specifications with limited non-determinism are similarly
constrained by the inverse of the rule. For example, a specification consisting exclusively
of POSIX’s open is sequentially consistent and sequentially deterministic (concurrent calls
to open appear equivalent to sequential calls and non-concurrent calls have deterministic
behavior). Adding any-FD semantics, under which open may return any unused descriptor, makes open commute in more regions, but does not
fundamentally alter the behavior of regions that did not commute for other reasons (such as
concurrent, exclusive creates of the same file name).
Understanding the inverse of the rule would shed light on the precise boundaries of
scalable implementations. However, the greater value of the inverse may be to highlight
ways to escape these boundaries. The following two sections explore hardware features not
captured by the rule’s formalization that may expand the reach of the rule and enable scalable
implementations of broader classes of interfaces.
10.2 Synchronized clocks
Section 4.3 formalized implementations as step functions reflecting a machine capable of
general computation and communication through shared memory. Some hardware has at
least one useful capability not captured by this model: synchronized timestamp counters.
Reads of a synchronized timestamp counter will always observe increasing values, even if
the reads occur on different cores. With this capability, operations in a commutative region
can record their order without communicating and later operations can depend on this order.
For example, consider an append operation that appends data to a file. With synchronized
timestamp counters, the implementation of append could log the current timestamp and the
appended data to per-thread state. Later reads of the file could reconcile the file contents by
sorting these logs by timestamp. The appends do not commute, yet this implementation of
append is conflict-free.
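A sketch of this idea in C appears below. It assumes an x86 rdtsc whose counters are synchronized across cores; the structures, sizes, and names are illustrative, and bounds checking is omitted.

    /* Conflict-free append using a synchronized timestamp counter:
     * each append records its timestamp and data in per-thread state
     * only. */
    #include <stdint.h>
    #include <string.h>
    #include <x86intrin.h>

    #define MAX_ENTRIES 1024

    struct log_entry {
        uint64_t ts;         /* synchronized timestamp at append time */
        char     data[64];   /* appended bytes (fixed size for brevity) */
    };

    struct thread_log {
        struct log_entry entries[MAX_ENTRIES];
        int n;
    };

    static __thread struct thread_log mylog;

    void file_append(const char *buf, size_t len) {
        /* Record the order without communicating with other cores. */
        struct log_entry *e = &mylog.entries[mylog.n++];
        e->ts = __rdtsc();
        memcpy(e->data, buf, len < sizeof e->data ? len : sizeof e->data);
    }

A read would gather every thread's log, sort all entries by timestamp, and concatenate their data to reconstruct the file contents; that reconciliation step is where communication finally happens.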
Formally, we can model a synchronized timestamp counter as an additional argument to
the implementation step function that must increase monotonically over a history. With this
additional argument, many of the conclusions drawn by the proof of the scalable commutativity rule no longer hold.
Recent work by Boyd-Wickizer developed a technique for scalable implementations called
OpLog based on using synchronized timestamp counters [9]. Exactly what interfaces are
amenable to OpLog implementations remains unclear and is a question worth exploring.
10.3 Scalable conflicts
Another potential way to expand the reach of the rule and create more opportunities for
scalable implementations is to find ways in which non-conflict-free operations can scale.
For example, while streaming computations are in general not linearly scalable because
of interconnect and memory contention, we’ve had success with scaling interconnect-aware
streaming computations. These computations place threads on cores so that the structure of
sharing between threads matches the structure of the hardware interconnect and such that no
link is oversubscribed. For example, on the 80-core x86 from chapter 9, repeatedly shifting
tokens around a ring mapped to the hardware interconnect achieves the same throughput
regardless of the number of cores in the ring, even though every operation causes conflicts
and communication.
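The following C sketch illustrates the token-shifting pattern under simplifying assumptions: a single token circulates, and thread i is simply pinned to core i, whereas a real run would map ring neighbors to physically adjacent cores, which is machine-specific.

    /* Token shifting around a ring of pinned threads. Every handoff
     * conflicts, but only ever between the same two cores, so no
     * interconnect link is oversubscribed as the ring grows. */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdatomic.h>

    #define NCORES 8
    #define ITERS  1000000

    /* One slot per ring position, padded so each handoff conflicts on
     * exactly one cache line shared by two fixed cores. */
    struct padded_slot { atomic_int token; char pad[60]; };
    static struct padded_slot slot[NCORES];

    static void *worker(void *arg) {
        long id = (long)arg;
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(id, &set);               /* pin thread id to core id */
        pthread_setaffinity_np(pthread_self(), sizeof set, &set);

        for (int i = 0; i < ITERS; i++) {
            while (atomic_load(&slot[id].token) == 0)
                ;                        /* spin on our own cache line */
            atomic_store(&slot[id].token, 0);
            atomic_store(&slot[(id + 1) % NCORES].token, 1);
        }
        return NULL;
    }

    int main(void) {
        pthread_t tids[NCORES];
        atomic_store(&slot[0].token, 1); /* inject the token */
        for (long i = 0; i < NCORES; i++)
            pthread_create(&tids[i], NULL, worker, (void *)i);
        for (int i = 0; i < NCORES; i++)
            pthread_join(tids[i], NULL);
        return 0;
    }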
It is unclear what useful computations can be mapped to this model given the varying
structures of multicore interconnects. However, this problem has close ties to job placement
in data centers and may be amenable to similar approaches. Likewise, the evolving structures
of data center networks could inform the design of multicore interconnects that support more
scalable computations.
10.4 Not everything can commute
This dissertation advocated fixing scalability problems starting with interface design, by making interfaces as commutative as possible. But some interfaces are set in stone, and others, such as synchronization
interfaces, are fundamentally non-commutative. It may not be possible to make implementations of these scale linearly, but making them scale as well as possible is as important as
making commutative operations scale.
This dissertation addressed this problem only in ad hoc ways. Most notably, Refcache’s
design focused all of the non-commutativity and non-scalability inherent in resource reclamation into a single, non-critical-path operation. This reclamation operation ran only once every
10 ms, allowing Refcache to batch and eliminate many conflicts and amortize the cost of this
operation. However, whether there is a general interface-driven approach to the scalability of
non-commutative operations remains an open question.
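A heavily simplified C sketch of the Refcache idea follows. Real Refcache must also handle counts that dip to zero and rise again within an epoch, have each core flush its own deltas, and pad per-core state to cache lines; none of that appears here.

    /* Per-core reference count deltas: inc and dec touch only local
     * state, so they are conflict-free; a periodic flush folds the
     * deltas into the true count and checks for zero. */
    #include <stdatomic.h>

    #define NCORES 8

    struct refcounted {
        atomic_long true_count;     /* touched only by the flusher */
        long        delta[NCORES];  /* per-core (padding omitted) */
    };

    /* Fast path: conflict-free. */
    void obj_inc(struct refcounted *o, int core) { o->delta[core]++; }
    void obj_dec(struct refcounted *o, int core) { o->delta[core]--; }

    /* Slow path, run once per epoch (e.g. every 10 ms), off the
     * critical path. This is where all the non-commutativity lives. */
    void flush_epoch(struct refcounted *o) {
        long sum = atomic_load(&o->true_count);
        for (int c = 0; c < NCORES; c++) {
            sum += o->delta[c];
            o->delta[c] = 0;
        }
        atomic_store(&o->true_count, sum);
        if (sum == 0) {
            /* reclaim(o) would run here once the count is stably zero */
        }
    }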
10.5 Broad conflict-freedom
As evidenced by sv6 in chapter 8, real implementations of real interfaces can be conflict-free
in nearly all commutative situations. But, formally, the scalable commutativity rule states
something far more restricted: that for a specific commutative region of a specific history,
there is a conflict-free implementation. In other words, there is some implementation that
is conflict-free for each of the 26,238 tests Commuter ran, but passing all of them might
require 26,238 different implementations. This strictness is a necessary consequence of SIM
commutativity, but, of course, sv6 shows that reality is far more tolerant.
This gap between the theory and the practice of the rule suggests that there may be a space
of trade-offs between interface properties and construction generality. A more restrictive
interface property may enable the construction of broadly conflict-free implementations. If
possible, this “alternate rule” may capture a more practically useful construction; perhaps even
a construction that could be applied mechanically to build practical scalable implementations.
Eleven
Conclusion
We are in the midst of a sea change in software performance, as everything from top-tier
servers to embedded devices turns to increasing parallelism to maintain a competitive performance edge.
This dissertation introduced a new approach for software developers to understand and
exploit multicore scalability during software interface design, implementation, and testing.
We defined, formalized, and proved the scalable commutativity rule, the key observation
that underlies this new approach. We defined SIM commutativity, which allows developers
to apply the rule to complex, stateful interfaces. We further introduced Commuter to help
programmers analyze interface commutativity and test that an implementation scales in
commutative situations. Finally, using sv6, we showed that it is practical to achieve a broadly
scalable implementation of POSIX by applying the rule, and that commutativity is essential to
achieving scalability and performance on real hardware. As scalability becomes increasingly
important at all levels of the software stack, we hope that the scalable commutativity rule will
help shape the way developers meet this challenge.
Bibliography
[1] Jonathan Appavoo, Dilma da Silva, Orran Krieger, Marc Auslander, Michal Ostrowski,
Bryan Rosenburg, Amos Waterland, Robert W. Wisniewski, Jimi Xenidis, Michael
Stumm, and Livio Soares. Experience distributing objects in an SMMP OS. ACM
Transactions on Computer Systems, 25(3), August 2007.
[2] Hagit Attiya, Rachid Guerraoui, Danny Hendler, Petr Kuznetsov, Maged M. Michael,
and Martin Vechev. Laws of order: Expensive synchronization in concurrent algorithms
cannot be eliminated. In Proceedings of the 38th ACM Symposium on Principles of
Programming Languages, Austin, TX, January 2011.
[3] Hagit Attiya, Eshcar Hillel, and Alessia Milani. Inherent limitations on disjoint-access
parallel implementations of transactional memory. In Proceedings of the 21st Annual
ACM Symposium on Parallelism in Algorithms and Architectures, Calgary, Canada, August
2009.
[4] Andrew Baumann, Paul Barham, Pierre-Evariste Dagand, Tim Harris, Rebecca Isaacs,
Simon Peter, Timothy Roscoe, Adrian Schüpbach, and Akhilesh Singhania. The Multikernel: A new OS architecture for scalable multicore systems. In Proceedings of the 22nd
ACM Symposium on Operating Systems Principles (SOSP), Big Sky, MT, October 2009.
[5] Fabrice Bellard et al. QEMU. http://www.qemu.org/.
[6] Daniel J. Bernstein. Some thoughts on security after ten years of qmail 1.0. In Proceedings
of the ACM Workshop on Computer Security Architecture, Fairfax, VA, November 2007.
[7] Philip A. Bernstein and Nathan Goodman. Concurrency control in distributed database
systems. ACM Computing Surveys, 13(2):185–221, June 1981.
[8] D. L. Black, R. F. Rashid, D. B. Golub, and C. R. Hill. Translation lookaside buffer
consistency: a software approach. In Proceedings of the 3rd International Conference
on Architectural Support for Programming Languages and Operating Systems (ASPLOS),
pages 113–122, Boston, MA, April 1989.
[9] Silas Boyd-Wickizer. Optimizing Communication Bottlenecks in Multiprocessor Operating
System Kernels. PhD thesis, Massachusetts Institute of Technology, February 2014.
[10] Silas Boyd-Wickizer, Haibo Chen, Rong Chen, Yandong Mao, M. Frans Kaashoek, Robert
Morris, Aleksey Pesterev, Lex Stein, Ming Wu, Yuehua Dai, Yang Zhang, and Zheng
Zhang. Corey: An operating system for many cores. In Proceedings of the 8th Symposium
on Operating Systems Design and Implementation (OSDI), San Diego, CA, December
2008.
[11] Silas Boyd-Wickizer, Austin Clements, Yandong Mao, Aleksey Pesterev, M. Frans
Kaashoek, Robert Morris, and Nickolai Zeldovich. An analysis of Linux scalability
to many cores. In Proceedings of the 9th Symposium on Operating Systems Design and
Implementation (OSDI), Vancouver, Canada, October 2010.
[12] Cristian Cadar, Daniel Dunbar, and Dawson Engler. KLEE: Unassisted and automatic
generation of high-coverage tests for complex systems programs. In Proceedings of the
8th Symposium on Operating Systems Design and Implementation (OSDI), San Diego,
CA, December 2008.
[13] Cristian Cadar, Vijay Ganesh, Peter M. Pawlowski, David L. Dill, and Dawson R. Engler. EXE: Automatically generating inputs of death. In Proceedings of the 13th ACM
Conference on Computer and Communications Security, 2006.
[14] Bryan Cantrill and Jeff Bonwick. Real-world concurrency. Communications of the ACM,
51(11):34–39, 2008.
[15] Koen Claessen and John Hughes. QuickCheck: A lightweight tool for random testing of
Haskell programs. In Proceedings of the 5th ACM SIGPLAN International Conference on
Functional Programming, Montreal, Canada, September 2000.
[16] Austin T. Clements, M. Frans Kaashoek, and Nickolai Zeldovich. Concurrent address
spaces using RCU balanced trees. In Proceedings of the 17th International Conference
on Architectural Support for Programming Languages and Operating Systems (ASPLOS),
London, UK, March 2012.
[17] Austin T. Clements, M. Frans Kaashoek, and Nickolai Zeldovich. RadixVM: Scalable
address spaces for multithreaded applications. In Proceedings of the ACM EuroSys
Conference, Prague, Czech Republic, April 2013.
[18] Super Micro Computer. X8OBN-F manual, 2012.
[19] Jonathan Corbet. The search for fast, scalable counters, May 2010. http://lwn.net/Articles/
170003/.
[20] Jonathan Corbet. Dcache scalability and RCU-walk, April 23, 2012. http://lwn.net/
Articles/419811/.
[21] Tyan Computer Corporation. M4985 manual, 2006.
[22] Tyan Computer Corporation. S4985G3NR manual, 2006.
[23] Russ Cox, M. Frans Kaashoek, and Robert T. Morris. Xv6, a simple Unix-like teaching
operating system. http://pdos.csail.mit.edu/6.828/2012/xv6.html.
[24] Leonardo de Moura and Nikolaj Bjørner. Z3: An efficient SMT solver. In Proceedings
of the 14th International Conference on Tools and Algorithms for the Construction and
Analysis of Systems, Budapest, Hungary, March–April 2008.
[25] Jeffrey Dean and Sanjay Ghemawat. MapReduce: simplified data processing on large
clusters. Communications of the ACM, 51(1):107–113, 2008.
[26] John DeTreville. Experience with concurrent garbage collectors for Modula-2+. Technical Report 64, DEC Systems Research Center, November 1990.
[27] Advanced Micro Devices. AMD64 Architecture Programmer's Manual, volume 2. Advanced
Micro Devices, March 2012.
[28] DWARF Debugging Information Format Committee. DWARF debugging information
format, version 4, June 2010.
[29] Faith Ellen, Yossi Lev, Victor Luchangco, and Mark Moir. SNZI: Scalable nonzero indicators. In Proceedings of the 26th ACM SIGACT-SIGOPS Symposium on Principles of
Distributed Computing, Portland, OR, August 2007.
[30] Jason Evans. A scalable concurrent malloc(3) implementation for FreeBSD. In Proc. of
the BSDCan Conference, Ottawa, Canada, April 2006.
[31] Ben Gamsa, Orran Krieger, Jonathan Appavoo, and Michael Stumm. Tornado: Maximizing locality and concurrency in a shared memory multiprocessor operating system.
In Proceedings of the 3rd Symposium on Operating Systems Design and Implementation
(OSDI), pages 87–100, New Orleans, LA, February 1999.
[32] Sanjay Ghemawat. TCMalloc: Thread-caching malloc, 2007. http://gperftools.googlecode.
com/svn/trunk/doc/tcmalloc.html.
[33] Patrice Godefroid, Nils Klarlund, and Koushik Sen. DART: Directed automated random
testing. In Proceedings of the 2005 ACM SIGPLAN Conference on Programming Language
Design and Implementation, Chicago, IL, June 2005.
[34] J. R. Goodman and H. H. J. Hum. MESIF: A two-hop cache coherency protocol for
point-to-point interconnects. Technical report, University of Auckland and Intel, 2009.
[35] Maurice Herlihy and Eric Koskinen. Transactional boosting: A methodology for highly-concurrent transactional objects. In Proceedings of the 13th ACM Symposium on Principles
and Practice of Parallel Programming, Salt Lake City, UT, February 2008.
[36] Maurice Herlihy and Nir Shavit. The art of multiprocessor programming. Morgan
Kaufmann, 2008.
[37] Maurice P. Herlihy and Jeannette M. Wing. Linearizability: A correctness condition for
concurrent objects. ACM Transactions on Programming Languages Systems, 12(3):463–
492, 1990.
[38] Philip W. Howard and Jonathan Walpole. Relativistic red-black trees. Technical Report
10-06, Portland State University, Computer Science Department, 2010.
[39] David Howells. Extended file stat functions, Linux patch, 2010. https://lkml.org/lkml/
2010/7/14/539.
[40] Intel. Intel 64 and IA-32 Architectures Software Developer’s Manual, volume 3. Intel
Corporation, 2013.
[41] Amos Israeli and Lihu Rappoport. Disjoint-access-parallel implementations of strong
shared memory primitives. In Proceedings of the 13th ACM SIGACT-SIGOPS Symposium
on Principles of Distributed Computing, Los Angeles, CA, August 1994.
[42] Pieter Koopman, Artem Alimarine, Jan Tretmans, and Rinus Plasmeijer. Gast: Generic
automated software testing. In Proceedings of the 14th International Workshop on the
Implementation of Functional Languages, Madrid, Spain, September 2002.
[43] Christoph Lameter. Effective synchronization on Linux/NUMA systems. In Gelato
Conference, May 2005. http://www.lameter.com/gelato2005.pdf.
[44] Ran Liu and Haibo Chen. SSMalloc: A low-latency, locality-conscious memory allocator
with stable performance scalability. In Proceedings of the 3rd Asia-Pacific Workshop on
Systems, Seoul, South Korea, July 2012.
[45] Yandong Mao, Robert Morris, and Frans Kaashoek. Optimizing MapReduce for multicore architectures. Technical Report MIT-CSAIL-TR-2010-020, MIT CSAIL, May
2010.
[46] Paul McKenney. Hierarchical RCU, November 2008. https://lwn.net/Articles/305782/.
[47] Paul E. McKenney. Differential profiling. Software: Practice and Experience, 29(3):219–
234, 1999.
[48] Paul E. McKenney. Concurrent code and expensive instructions. https://lwn.net/Articles/
423994/, January 2011.
[49] Paul E. McKenney, Dipankar Sarma, Andrea Arcangeli, Andi Kleen, Orran Krieger,
and Rusty Russell. Read-copy update. In Proceedings of the Linux Symposium, Ottawa,
Canada, June 2002.
[50] John M. Mellor-Crummey and Michael L. Scott. Algorithms for scalable synchronization
on shared-memory multiprocessors. ACM Transactions on Computer Systems, 9(1):21–65,
1991.
[51] Microsoft Corp. Windows research kernel. http://www.microsoft.com/resources/sharedsource/windowsacademic/researchkernelkit.mspx.
[52] Mark S. Papamarcos and Janak H. Patel. A low-overhead coherence solution for multiprocessors with private cache memories. In Proceedings of the 11th Annual International
Symposium on Computer Architecture, Ann Arbor, MI, June 1984.
[53] Prakash Prabhu, Soumyadeep Ghosh, Yun Zhang, Nick P. Johnson, and David I. August. Commutative set: A language extension for implicit parallel programming. In
Proceedings of the 2011 ACM SIGPLAN Conference on Programming Language Design
and Implementation, San Jose, CA, June 2011.
[54] The FreeBSD Project. FreeBSD source code. http://www.freebsd.org/.
[55] Martin C. Rinard and Pedro C. Diniz. Commutativity analysis: A new analysis technique
for parallelizing compilers. ACM Transactions on Programming Languages and Systems,
19(6):942–991, November 1997.
[56] Amitabha Roy, Steven Hand, and Tim Harris. Exploring the limits of disjoint access
parallelism. In Proceedings of the 1st USENIX Workshop on Hot Topics in Parallelism,
Berkeley, CA, March 2009.
[57] Scott Schneider, Christos D. Antonopoulos, and Dimitrios S. Nikolopoulos. Scalable
locality-conscious multithreaded memory allocation. In Proc. of the 2006 ACM SIGPLAN International Symposium on Memory Management, Ottawa, Canada, June 2006.
[58] Koushik Sen, Darko Marinov, and Gul Agha. CUTE: A concolic unit testing engine for
C. In Proceedings of the 13th ACM SIGSOFT International Symposium on Foundations of
Software Engineering, Lisbon, Portugal, September 2005.
[59] Marc Shapiro, Nuno Preguica, Carlos Baquero, and Marek Zawirski. Conflict-free
replicated data types. In Proceedings of the 13th International Conference on Stabilization,
Safety, and Security of Distributed Systems, Grenoble, France, October 2011.
[60] Marc Shapiro, Nuno Preguica, Carlos Baquero, and Marek Zawirski. Convergent and
commutative replicated data types. Bulletin of the EATCS, 104:67–88, June 2011.
[61] Guy L. Steele, Jr. Making asynchronous parallelism safe for the world. In Proceedings of
the 17th ACM Symposium on Principles of Programming Languages, San Francisco, CA,
January 1990.
[62] Gil Tene, Balaji Iyengar, and Michael Wolf. C4: The continuously concurrent compacting
collector. SIGPLAN Notices, 46(11):79–88, June 2011.
[63] The Institute of Electrical and Electronics Engineers (IEEE) and The Open Group. The
Open Group base specifications issue 7, 2013 edition (POSIX.1-2008/Cor 1-2013), April
2013.
[64] Linus Torvalds et al. Linux source code. http://www.kernel.org/.
[65] R. Unrau, O. Krieger, B. Gamsa, and M. Stumm. Hierarchical clustering: A structure for
scalable multiprocessor operating system design. Journal of Supercomputing, 1995.
[66] Landy Wang. Windows 7 memory management, November 2009. http://download.
microsoft.com/download/7/E/7/7E7662CF-CBEA-470B-A97E-CE7CE0D98DC2/
mmwin7.pptx.
[67] W. E. Weihl. Commutativity-based concurrency control for abstract data types. IEEE
Transactions on Computers, 37(12):1488–1505, December 1988.
[68] David Wentzlaff and Anant Agarwal. Factored operating systems (fos): The case for
a scalable operating system for multicores. ACM SIGOPS Operating System Review,
43(2):76–85, 2009.