ABSTRACTIONS FOR ROBUST HIGHER-ORDER MESSAGE-BASED
COMMUNICATION
A Dissertation
Submitted to the Faculty
of
Purdue University
by
Lukasz S. Ziarek
In Partial Fulfillment of the
Requirements for the Degree
of
Doctor of Philosophy
May 2011
Purdue University
West Lafayette, Indiana
To my Wife and Parents...
ACKNOWLEDGMENTS
There are many people who have helped me in my journey to complete my dissertation.
I want to first thank my wife for helping me through my defense and thesis writing. She
always kept me focused and grounded. Without her this work would not have been possible.
I would also like to thank my advisor Professor Suresh Jagannathan who helped me
develop, focus, and expand my research and ideas. Professor Jagannathan helped me at
every stage of my graduate career and our frequent meetings in his office were the springboard for this work. Without him this dissertation would not have seen the light of day.
I also want to thank my committee members for their insightful comments and critiques.
Professors Vitek, Eugster, and Li helped me to refine this dissertation.
My fellow lab mates were always available for interesting discussions, critiquing ideas,
and even proofreading. Many of them helped shape and refine the ideas that are present
in this dissertation. I especially want to thank all of my co-authors on the various research
papers that comprise the core of this dissertation. They were instrumental in the ideas,
semantics, and implementations of the three language primitives found in this dissertation.
Lastly I would like to thank my parents who always believed in me. In their minds there
was never any doubt that I would complete this journey even when I doubted myself.
TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
ABBREVIATIONS
ABSTRACT
1 INTRODUCTION
  1.1 Context
    1.1.1 Concurrent ML
    1.1.2 Semantics and Case Study
  1.2 Contributions and Outline
2 CONCURRENT ML
  2.1 Programming with CML
3 MLTON
  3.1 Multi-MLton
    3.1.1 Threading System
    3.1.2 Communication
    3.1.3 Concluding Remarks
4 SWERVE
  4.0.4 Processing a Request
5 LIGHTWEIGHT CHECKPOINTING FOR CONCURRENT ML
  5.1 Programming Model
    5.1.1 Interaction of Stable Sections
  5.2 Motivating Example
    5.2.1 Cut
  5.3 Semantics
    5.3.1 Example
    5.3.2 Soundness
  5.4 Incremental Construction
    5.4.1 Example
  5.5 Efficiency
  5.6 Implementation
    5.6.1 Supporting First-Class Events
    5.6.2 Handling References
    5.6.3 Graph Representation
    5.6.4 Handling Exceptions
    5.6.5 The Rest of CML
  5.7 Performance Results
    5.7.1 Synthetic Benchmarks
    5.7.2 Open-Source Benchmarks
    5.7.3 Case Studies: Injecting Stabilizers
  5.8 Related Work
  5.9 Concluding Remarks
6 PARTIAL MEMOIZATION OF CONCURRENCY AND COMMUNICATION
  6.1 Programming Model
  6.2 Motivation
  6.3 Semantics
    6.3.1 Language
    6.3.2 Partial Memoization
    6.3.3 Constraint Matching
    6.3.4 Example
    6.3.5 Issues
    6.3.6 Schedule Aware Partial Memoization
  6.4 Soundness
  6.5 Implementation
    6.5.1 Parallel CML and Hooks
    6.5.2 Supporting Memoization
  6.6 Performance Results
    6.6.1 Synthetic Benchmarks
    6.6.2 Constraint Matching and Discharge Overheads
    6.6.3 Case Study: File Caching
  6.7 Related Work
  6.8 Concluding Remarks
7 ASYNCHRONOUS CML
  7.1 Design Considerations
    7.1.1 Putting it All Together
  7.2 Asynchronous Events
    7.2.1 Combinators
    7.2.2 Mailboxes and Multicast
  7.3 Semantics
    7.3.1 Encoding Communication
    7.3.2 Base Events
    7.3.3 Event Evaluation
    7.3.4 Communication and Ordering
    7.3.5 Combinators
    7.3.6 Choose Events
    7.3.7 Synchronization of Choice
    7.3.8 sWrap, aWrap, and Guard of Choose Events
  7.4 Implementation
  7.5 Case Study: A Parallel Web-server
    7.5.1 Lock-step File and Network I/O
    7.5.2 Underlying I/O and Logging
    7.5.3 Performance Results
  7.6 Related Work
  7.7 Concluding Remarks
8 CONCLUDING REMARKS AND FUTURE DIRECTIONS
  8.1 Future Directions
    8.1.1 Stabilizers
    8.1.2 Partial Memoization
    8.1.3 Asynchronous CML
LIST OF REFERENCES
VITA
LIST OF TABLES

5.1 Benchmark characteristics and dynamic counts.
5.2 Benchmark graph sizes and normalized overheads.
5.3 Restoration of the entire web-server.
5.4 Instrumented recovery.
7.1 Per module performance numbers for Swerve.
LIST OF FIGURES

1.1 Single arrows depict communication actions and double arrows represent potential matching communication actions for a thread T1.
2.1 CML event operators.
3.1 Abstractions found in Multi-MLton. Communication, either synchronous or asynchronous (expressed via asynchronous events as defined in Chapter 7), is depicted through arrows. Parasites are shown as raw stack frames that comprise a single runtime thread.
3.2 The runtime architecture of Multi-MLton utilizes pthreads, each of which contains a queue of lightweight threads, the currently executing lightweight thread, and a pointer to a section of shared memory for allocation.
4.1 Swerve module interactions for processing a request (solid lines) and error handling control and data flow (dashed lines) for timeouts. The number above the lines indicates the order in which communication actions occur.
5.1 A simple server-side RPC abstraction using synchronous communication.
5.2 Interaction between stable sections. Clear circles indicate thread-local checkpoints; dark circles represent stabilization actions.
5.3 An excerpt of the File Processing module in Swerve. The code fragment displayed on the bottom shows the code modified to use stabilizers. Italics mark areas in the original where the code is changed.
5.4 An excerpt of the Network Processor module in Swerve. The main processing of the hosting thread, created by the Listener module, is wrapped in a stable section and the timeout handling code can be removed. The code fragment on the bottom shows the modifications made to use stabilizers. Italics in the code fragment on the top mark areas in the original where the code is removed in the version modified to use stabilizers.
5.5 An excerpt of the Timeout Manager module in Swerve. The bottom code fragment shows the code modified to use stabilizers. The expired function can be removed and trigger now calls stabilize. Italics mark areas in the original where the code is changed.
5.6 A multi-server implementation which utilizes a central coordinator and multiple servers. A series of requests is multiplexed between the servers by the coordinator. Each server handles its own transient faults. The shaded portions represent computation which is unrolled due to the stabilize action performed by the server. Single arrows represent communication to servers and double arrows depict return communication. Circular wedges depict communications which are not considered because a cut operation limits their effects.
5.7 Stabilizers Semantics – The syntax, evaluation contexts, and domain equations for a core call-by-value language for stabilizers.
5.8 Stabilizers Semantics – The global evaluation rules for a core call-by-value language for stabilizers.
5.9 An example used to illustrate the interaction of inter-thread communication and stable sections. The call to f establishes an initial checkpoint. Although g and h do not interact directly with f, the checkpoint established by f may nonetheless be restored on a call to stabilize, as illustrated by Figure 5.10.
5.10 An example of global checkpoint construction where the inefficiency of global checkpointing causes the restoration of a checkpoint established prior to the stable section in which a call to stabilize occurs.
5.11 Stabilizers Semantics – Additional syntax, local evaluation rules, as well as domain equations for a semantics that utilizes incremental checkpoint construction.
5.12 Stabilizers Semantics – Global evaluation rules for a semantics that utilizes incremental checkpoint construction.
5.13 Stabilizers Semantics – Global evaluation rules for a semantics that utilizes incremental checkpoint construction (continued).
5.14 An example of incremental checkpoint construction for the code presented in Figure 5.9.
5.15 The relation ↦ defines how to evaluate a schedule T derived from a graph G.
5.16 Sample code utilizing exceptions and stabilizers.
5.17 Asynchronous Communication runtime overheads.
5.18 Asynchronous Communication memory overheads.
5.19 Communication Pipeline runtime overheads.
5.20 Communication Pipeline memory overheads.
5.21 Communication graphs for the Asynchronous Communication synthetic benchmark.
5.22 Communication graphs for the Communication Pipeline synthetic benchmark.
5.23 Swerve file size overheads for Stabilizers.
5.24 Swerve quantum overheads for Stabilizers.
6.1 Syntax, grammar, evaluation contexts, and domain equations for a concurrent language with synchronous communication.
6.2 Operational semantics for a concurrent language with synchronous communication.
6.3 Memoization Semantics – A concurrent language supporting memoization of synchronous communication and dynamic thread creation.
6.4 Memoization Semantics – A concurrent language supporting memoization of synchronous communication and dynamic thread creation.
6.5 Memoization Semantics – A concurrent language supporting memoization of synchronous communication and dynamic thread creation (continued).
6.6 Memoization Semantics – The function ℑ yields the set of constraints C which are not satisfiable in program state P.
6.7 Memoization Semantics – Constraint matching is defined by four rules. Communication constraints are matched with threads performing the opposite communication action of the constraint.
6.8 Determining if an application can fully leverage memo information may require examining an arbitrary number of possible thread interleavings.
6.9 The communication pattern of the code in Figure 6.8. Circles represent operations on channels. Gray circles are sends and white circles are receives. Double arrows represent communications that are captured as constraints during memoization.
6.10 The second application of f can only be partially memoized up to the second send since only the first receive made by g is blocked in the global state.
6.11 Memoization of the function f can lead to the starvation of either g or h depending on which value the original application of f consumed from channel c.
6.12 Memoization Semantics – Schedule Aware Partial Memoization.
6.13 T defines an erasure property on program states. The first four rules remove memo information and restore evaluation contexts.
6.14 Normalized runtime percent speedup for the k-clustering benchmark of memoized evaluation compared to non-memoized execution.
6.15 Normalized runtime percent speedup for STMBench-7 of memoized evaluation compared to non-memoized execution.
6.16 Normalized percent runtime overhead for discharging send and receive constraints for ping-pong.
6.17 Normalized runtime percent speedup for Swerve of memoized evaluation compared to non-memoized execution.
7.1 Two server-client model based concurrency abstractions extended to utilize logging. In (a) asynchronous sends are depicted as solid lines, whereas in (b) synchronous sends are depicted as solid lines and lightweight threads as boxes. Logging actions are presented as dotted lines.
7.2 The figure shows a complex asynchronous event ev, built from a base event aSendEvt, being executed by Thread 1. (a) When the event is synchronized via aSync, the value v is placed on channel c and post-creation actions are executed. Afterwards, control returns to Thread 1. (b) When Thread 2 consumes the value v from channel c, an implicit thread of control is created to execute any post-consumption actions.
7.3 The figure shows a complex asynchronous event ev, built from a base event aRecvEvt, being executed by Thread 1. (a) When the event is synchronized via aSync, the receive action is placed on channel c and post-creation actions are executed. Afterwards, control returns to Thread 1. (b) When Thread 2 sends the value v to channel c, an implicit thread of control is created to execute any post-consumption actions passing v as the argument.
7.4 The figure shows a callback event constructed from a complex asynchronous event ev and a callback function f. (a) When the callback event is synchronized via aSync, the action associated with the event ev is placed on channel c and post-creation actions are executed. A new event, ev', is created and passed to Thread 1. (b) An implicit thread of control is created after the base event of ev is consumed. Post-consumption actions are executed passing v, the result of consuming the base event for ev, as an argument. The result of the post-consumption actions, v', is sent on clocal. (c) When ev' is synchronized upon, f is called with v'.
7.5 The figure shows Thread 1 synchronizing on a complex asynchronous event ev, built from a choice between two base asynchronous send events; one sending v on channel c and the other v' on c'. Thread 2 is willing to receive from channel c.
7.6 CML Event and AEvent operators.
7.7 CML mailbox structure.
7.8 An excerpt of a CML mailbox structure implemented utilizing asynchronous events.
7.9 Syntax, grammar, and evaluation contexts for a core language for asynchronous events.
7.10 Domain equations for a core language for asynchronous events.
7.11 A core language for asynchronous events – base events as well as rules for spawn, function application, and channel creation.
7.12 A core language for asynchronous events – rules for matching communication and ordering.
7.13 A core language for asynchronous events – rules for waiting, blocking, and enqueueing.
7.14 A core language for asynchronous events – combinator extensions for asynchronous events.
7.15 A core language for asynchronous events – choose events and choose event flattening.
7.16 A core language for asynchronous events – synchronizing and evaluating SCHOOSE events.
7.17 A core language for asynchronous events – combinators for SCHOOSE events.
ABBREVIATIONS

ACML  Asynchronous Concurrent ML
CML   Concurrent ML
DFS   Depth-First Search
Dom   Domain
GC    Garbage Collector
min   Minimum
MPI   Message Passing Interface
OS    Operating System
PML   Parallel ML
RPC   Remote Procedure Call
SCC   Single-Chip Cloud Computer
SML   Standard ML
STM   Software Transactional Memory
TM    Transactional Memory
ABSTRACT

Ziarek, Lukasz S. Ph.D., Purdue University, May 2011. Abstractions for Robust Higher-Order Message-Based Communication. Major Professor: Suresh Jagannathan.
Message passing programming idioms alleviate the burden of reasoning about implicit
program interactions that can lead to deadlock or race conditions by explicitly defining the
interactions between threads and modules. This simplicity comes at the cost of having to
reason about global protocols that span multiple interacting threads and software components. Reasoning about a given thread or component requires reasoning about potential
communication partners and protocols in which the thread participates. Therefore, building modular yet composable communication abstractions is challenging. In this dissertation we present three language abstractions for building robust and efficient communication
protocols. We show how such language abstractions can be leveraged to simplify protocol
design, improve protocol performance, and modularly extend protocols with asynchronous
actions. Extending protocols with asynchronous actions, specifically asynchronous events,
is currently not possible in the context of languages like Concurrent ML.
1 INTRODUCTION
The advent of multi-core and multi-processor systems into mainstream computing has
posed numerous challenges for software development. Notably, the development and maintenance of software that utilizes these additional computing resources is notoriously difficult. In this dissertation we explore three language-based mechanisms aimed at simplifying
writing and reasoning about multi-threaded programs.
Message passing is a useful abstraction for writing concurrent and parallel programs.
In message passing languages threads communicate with each other by sending and receiving messages on channels. Message passing is the prevalent means of synchronization
and communication between threads in languages such as Erlang [1], Concurrent ML [2],
Manticore [3], and is the basis for MPI [4] and JMS [5]. Message passing is also the key
component in current trends in CMP processor and operating systems design. For example, Intel’s recently announced SCC [6] (Single-Chip Cloud Computer) is a manycore CPU
that features 24 tiles comprised of dual-core x86 IA processors. Notably, there is no shared
L2 cache among these tiles. Instead, communication across cores is via hardware-assisted
message passing over a 2D high-bandwidth mesh network. Thus, the SCC does not support
uniform access memory; application performance is dictated by the degree of affinity that
exists between threads and the data they access. At the software level, Barrelfish [7] is
a new operating system kernel design that treats the underlying machine as a network of
independent cores, and assumes no inter-core sharing at the lowest level. It recasts all OS
services, including memory management and inter-core communication, in terms of message passing, arguing that such a reformulation leads to improved scalability and efficiency
due to additional pipelining and batching opportunities.
In this dissertation we explore message passing as a programming style that is implemented on top of a shared memory system, instead of as a low-level communication
mechanism. In this context, messages are data that threads pass to one another via global
conduits, or channels. A message is transferred from one thread to another when a thread
sends the message on a channel and another receives the message from the same channel.
Communication actions can either be synchronous or asynchronous. The former requires
a thread performing a communication action to block until another thread is willing to perform a matching communication action; either receive from or send to the waiting thread.
Asynchronous communication, on the other hand, does not wait for a matching participant.
For example, in the case of a send, the message is placed on the channel regardless of the
existence of a communication partner. Messages, data sent/received on channels, can be
propagated either by copying or by reference. The former typically makes a deep copy
of the data composing the message, thus sender and recipient each execute with their own
separate version of the data. Alternatively, messages can be passed by reference so that
sender and receiver both have access to the same data in memory.
Languages structured around a message passing programming idiom offer a number
of benefits over shared memory. Immutable data that can be witnessed by concurrently
executing threads of control is made explicit in message passing languages through sends
and receives of such data on channels. This relieves the programmer from having to
reason about shared state, since there is no explicit sharing between threads when data
is immutable. However, in the presence of mutable data, things become more
complicated. If mutable data is passed by reference between two communicating threads,
future updates to this data may suffer from data-races. If mutable data, on the other hand,
is passed by copy, no association is manifested between the various copies. Updates in one
thread of control are not propagated to another. Yet, even with strictly immutable state,
message passing languages are no panacea. The ease of not having to reason about shared
state comes at the complexity of reasoning about global protocols.
Message passing programs are typically structured around global protocols that govern the interactions between threads and software components. Reasoning about a given
thread, or given region of code, requires reasoning about which protocols that thread may
participate in. This is typically difficult since a thread may potentially participate in many
different protocols and participating in any given protocol may preclude future communication actions, allow different communication actions, or generate new communication
actions. This occurs because control flow of a given thread, creation of new threads, or
communication actions can be dependent upon the values a thread sends or receives.

Figure 1.1. Single arrows depict communication actions and double arrows represent potential matching communication actions for a thread T1.
To reason about a given region of code in a message passing language, one must take
into account the possible candidates with whom that region might communicate. As an
example, consider the program in Figure 1.1 which consists of some finite number of communicating threads. Communication actions on channels are represented by circles, gray
for sends and white for receives. Potential communications are depicted by arrows. To reason about thread T1 we must consider its communication partners. Clearly, T1 can complete
by matching its communication actions with T2 . However, T1 may be able to also match
with T3 . Notice that T3 is able to send to T1 on channel C1 . Thus, T1 can complete if there
exists a thread (T4 ... Tn ) which is also willing to send on channel C2 . Alternatively there
could be a thread (T4 ... Tn ) willing to receive on channel C1 from T2 , which would allow
the second communication action of T2 to match with the second receive in T1 . Reasoning about the different communications for a given thread requires examining the whole program. Notably,
a given protocol may be composed of many interacting threads.
Reasoning about the interaction between software components is a bit simpler as abstraction boundaries effectively decouple participants in a protocol. Message passing explicitly defines the protocols that govern component interactions, simplifying software design and engineering. However, maintaining and augmenting a specific software component once again requires the programmer to reason about which cross-component protocols
that component may interact with. Changes to cross-component protocols affect both parties.
In this dissertation we explore the design and maintenance of message passing programs
through the definition of three language abstractions: stabilizers, partial memoization, and
asynchronous events. We show how such language abstractions can be leveraged to simplify protocol design, improve protocol performance, and modularly extend protocols with
asynchronous actions.
1.1 Context
The work presented in this dissertation has been done in MLton [8], a whole program
optimizing compiler for Standard ML. ML is a family of functional programming languages whose main dialects include Standard ML, Objective Caml, and F#. More specifically the work focuses on Concurrent ML, a message passing language extension for Standard ML.
1.1.1 Concurrent ML
Concurrent ML is a concurrent extension of Standard ML to include message passing
primitives, and has been extended to execute on parallel architectures [9]. Concurrent ML
at its core is structured around synchronous message passing and typically implemented
using lightweight threads (i.e., green threads). In CML, channels are first-class entities and
can be created dynamically. CML channels support both send and receive operations by a
given thread. In addition to synchronous message passing, part of this dissertation is devoted to exploring an asynchronous extension to CML. We provide additional background
on CML and its primitives in Chapter 2.
1.1.2 Semantics and Case Study
For each language abstraction we present a formal operational semantics and relevant
case studies. All of the operational semantics begin with a core functional language with
threading and communication primitives. We show how to extend such a language to support the various language abstractions presented in each chapter. This core functional language is augmented with additional primitives in specific chapters where necessary.
In each chapter we present a case study detailing the performance characteristics of
each language abstraction on Swerve, a real-world third-party benchmark. Swerve is a
web-server entirely written in CML and is roughly 16,000 lines of CML code consisting
of a number of modules and specialized libraries. The goal of Swerve was to highlight
and utilize CML as the predominant concurrency abstraction. The overwhelming majority of the interactions between modules and intra-module concurrency is in fact achieved
through CML primitives, combinators, and events. As such we utilize Swerve as our main
case study in each chapter. We augment the case study with smaller synthetic benchmarks
throughout the chapters. We provide a detailed description of Swerve in Chapter 4.
1.2 Contributions and Outline
We begin by providing details on CML primitives in Chapter 2 as well as a few code
examples to illustrate their use. We then introduce the MLton compiler and its multi-core
extension Multi-MLton in Chapter 3. We present salient details and background information on the implementation environment and extend the description in specific chapters
where additional details are necessary. Details on Swerve are given in Chapter 4.
In Chapter 5 we explore the effects of atomicity and state reversion in a message passing
language through the definition of a lightweight checkpointing abstraction, called stabilizers, and show how it can be used to: a) simplify error handling code and b) recover from
transient faults efficiently. Stabilizers ensure an annotated region of code executes atomically: either all of its effects are visible or none are. They do so through the introduction
of a new keyword: stable, used to demarcate atomic regions of code. If a transient error is
encountered during the execution of a stable region of code, the region is reverted to a safe
checkpoint through the use of the primitive stabilize. Two atomic regions of code which
have communicated via message passing are reverted as a single atomic unit. We demonstrate the usefulness of stabilizers by simplifying the handling of timeouts in Swerve.
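To make the programming model concrete before its full treatment in Chapter 5, the sketch below shows how a server-style handler might be protected. The helper function process, the channels reqChan and respChan, and the Transient exception are hypothetical; the signatures assumed here for stable and stabilize (stable : ('a -> 'b) -> 'a -> 'b and stabilize : unit -> 'a) are assumptions of this sketch, with the precise interface given in Chapter 5.

(* A minimal sketch, assuming the hypothetical helpers and signatures above. *)
exception Transient                         (* hypothetical transient-fault signal *)

fun handleRequest req =
  let val resp = process req                (* may communicate with other threads *)
  in send (respChan, resp)
  end

val stableHandler = stable handleRequest    (* delimits a stable (atomic) region *)

fun serverLoop () =
  ((stableHandler (recv reqChan)
      handle Transient => stabilize ())     (* revert to a consistent checkpoint *)
   ; serverLoop ())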
Chapter 5 makes the following contributions:
1. The design and semantics of stabilizers, a new modular language abstraction for transient fault recovery in concurrent programs. To the best of our knowledge, stabilizers
are the first language-centric design of a checkpointing facility that provides global
consistency and safety guarantees for transient fault recovery in programs with dynamic thread creation, and selective message passing communication.
2. A lightweight dynamic monitoring algorithm faithful to the semantics that constructs
efficient global checkpoints based on the context in which a restoration action is
performed. Efficiency is defined with respect to the amount of rollback required to
ensure that all threads resume execution after a checkpoint is restored to a consistent
global state as compared to a global checkpointing scheme.
3. A formal semantics along with soundness theorems that formalize the correctness
and efficiency of our design.
4. A detailed explanation of an implementation built as an extension of the Concurrent ML library [2] within the MLton [8] compiler. The library includes support
for synchronous, selective communication, threading primitives, exceptions, shared
memory, as well as events.
5. An evaluation study that quantifies the cost of using stabilizers on various open-source server-style applications. Our results reveal that the cost of defining and monitoring thread state is small, typically adding roughly no more than 4–6% overhead
to overall execution time. Memory overheads are equally modest.
In Chapter 6 we explore monitoring of message passing in our formulation of partial
memoization and show how partial memoization can be leveraged to improve communication protocol performance. Partial memoization is an optimization technique that allows
the omission of redundant computation by monitoring side-effecting actions at the first call
of a function. Subsequent applications of the same function can be avoided if all side-effecting actions of the first call can be replayed. Such an optimization can be utilized to
reduce overheads of state reversion in a message passing language. To allow for the successful elimination of a redundant call, our partial memoization technique ensures that all
spawns, communication, and channel creation actions are performed in the order witnessed
by the prior execution of the candidate function. If this is not possible, partial memoization will resume execution of the computation from the first such action that cannot be
performed. We demonstrate the usefulness of partial memoization by accelerating the performance of a wide variety of benchmarks including a streaming benchmark and Swerve.
The benchmarks leverage a number of different communication protocols and communication patterns. As a case study we show how to implement an error aware file caching
mechanism in Swerve.
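To give a flavor of the bookkeeping this entails, the following conceptual sketch (in CML-style SML) records, for one candidate function, the actions that must be replayed. The type and constructor names are purely illustrative and do not correspond to the Multi-MLton data structures described in Chapter 6.

(* Conceptual sketch only: what a memo entry must remember about the first call. *)
datatype 'a constraint =
    Spawned                            (* a thread creation to re-perform           *)
  | MadeChan of 'a chan                (* a channel creation to re-perform          *)
  | Sent     of 'a chan * 'a           (* a send the original application performed *)
  | Received of 'a chan * 'a           (* a receive and the value it obtained       *)

type 'a memoEntry =
  { constraints : 'a constraint list,  (* discharged in the original order          *)
    result      : 'a }

(* On a later application with the same argument, constraints are discharged in
   order; if one cannot be discharged (e.g., no matching communication partner),
   execution resumes from that point instead of replaying the entire call. *)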
Chapter 6 makes the following contributions:
1. The definition of partial memoization for a core functional language with synchronous
communication primitives as well as threading constructs.
2. A formal definition of partial memoization in terms of an operational semantics along
with a safety theorem for partial memoization. We include a detailed proof of the
safety theorem.
3. A description of partial memoization in the context of Multi-MLton, a parallel extension of MLton.
4. We present detailed performance evaluation of partial memoization on three parallel
benchmarks. We consider the effect of memoization on improving performance of
multi-threaded CML applications executing on a multi-core architecture. Our results
indicate that memoization can lead to substantial runtime performance improvement
of around 30% over a non-memoized version of the same program, with only modest
increases in memory overhead (15% on average).
In Chapter 7 we study how to add explicit support for asynchronous events into the
CML event framework and show how such events can be utilized to extend existing software in a composable and modular fashion. Asynchronous events interoperate fully with
synchronous events, allowing programmers to specify both synchronous and asynchronous
actions on a given channel. This uniformity allows the programmer to modify one component to utilize asynchrony without having to change the other participants in cross-component protocols. Asynchronous events provide the ability to utilize asynchrony in
selective communication, which was not possible in CML without sacrificing ordering and
visibility guarantees. Additionally, we provide a rich set of asynchronous event combinators for constructing first-class communication protocols. We demonstrate the usefulness
of asynchronous events by augmenting Swerve with additional functionality and improving
its performance on a multi-core processor.
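As a small illustrative sketch, using the aSendEvt and aSync combinators whose precise types are given in Chapter 7, an asynchronous logging action can be issued without blocking a thread that otherwise performs ordinary synchronous receives. The channels logChan and reqChan and the function handleReq are hypothetical.

(* Sketch only: the asynchronous send is placed on logChan and control returns
   immediately; the synchronous receive on reqChan behaves as in ordinary CML. *)
fun logAsync msg = aSync (aSendEvt (logChan, msg))

fun serve () =
  let val req = sync (recvEvt reqChan)
  in logAsync "handling request";
     handleReq req;                    (* handleReq is a hypothetical handler *)
     serve ()
  end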
Chapter 7 makes the following contributions:
1. We present a comprehensive design for asynchronous events, and describe a family
of combinators analogous to their synchronous variants available in CML. To the best
of our knowledge, this is the first treatment to consider the meaningful integration of
composable CML-style event abstractions with asynchronous functionality.
2. We provide implementations of useful asynchronous abstractions such as callbacks
and mailboxes, along with a number of case studies extracted from realistic concurrent applications (e.g., a concurrent I/O library, concurrent file processing, etc.). Our
abstractions operate over ordinary CML channels, enabling interoperability between
synchronous and asynchronous protocols.
3. We discuss an implementation of asynchronous events that has been incorporated
into Multi-MLton, a parallel extension of MLton [8], and present a detailed benchmark study that shows how asynchronous events can help improve the expression
and performance of highly concurrent server applications.
Related work is presented at the end of each major chapter. Additional related work
that is relevant to proposed future directions of research is given in Chapter 8 along with
concluding remarks.
2 CONCURRENT ML
CML is a concurrent extension of Standard ML that utilizes synchronous message passing to enable the construction of synchronous communication protocols. Threads perform
send and recv operations on typed channels. These operations block until a matching action on the same channel is performed by another thread. The main contributions of CML
are its event framework and selective communication. In CML, an event is a first-class, abstract, synchronous operation that decouples the description of a synchronous action from
the synchronization, or discharge, of the action. For example, a send event that places a
value v on a channel c does not perform the deposit of v on c until a thread explicitly enacts
the event. Similarly to how function composition allows the creation of first-class computational units, event combinators provide a mechanism to build first-class communication
protocols.
The motivation behind first-class events is the desire to support selective communication, allowing a thread to choose between many potential communication partners. For
example, a consumer may wish to choose between a set of producers. The consumer will pick, from this set, any producer that is currently willing to send a value and synchronize with it. If no producer in the set is currently available for synchronization, the consumer will block until a producer becomes available. In CML selective
communication is provided through a choice operator that selects between a set of events.
Events and their combinators are necessary because λ-abstraction and function composition are not enough to support selective communication as functions hide the computation
they encapsulate. A choice operation cannot introspect the encapsulated computation and
thus cannot choose the communication action that has a waiting partner.
CML provides first-class synchronous events that abstract synchronous message-passing
operations. An event value of type ’a event when synchronized on yields a value of type
'a . An event value represents a potential computation, with latent effect until a thread synchronizes upon it by calling sync . The following equivalences therefore hold: send(c, v) ≡ sync(sendEvt(c,v)) and recv(c) ≡ sync(recvEvt(c)) .

spawn     : (unit -> 'a) -> threadID
sendEvt   : 'a chan * 'a -> unit Event
recvEvt   : 'a chan -> 'a Event
alwaysEvt : 'a -> 'a Event
neverEvt  : 'a Event
sync      : 'a Event -> 'a
wrap      : 'a Event * ('a -> 'b) -> 'b Event
guard     : (unit -> 'a Event) -> 'a Event
choose    : 'a Event list -> 'a Event

Figure 2.1. CML event operators.
Besides sendEvt and recvEvt , there are other base events provided by CML: an
alwaysEvt which contains a value and is always available for synchronization as well as
a neverEvt , which as its name suggests, is never available for synchronization. These
events are typically generated based on the (un)satisfiability of conditions or invariants that
can be subsequently used to influence the behavior of more complex events built from the
event combinators described below. For example, an always event is typically utilized to
provide a default behavior in conjunction with selective communication and simply returns
its argument when synchronized upon. Notably, thread creation is not encoded as an event
– the thread spawn primitive simply takes a thunk to evaluate as a separate thread, and
returns a thread identifier that allows access to the newly created thread’s state.
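For instance, one common idiom, sketched below, pairs an alwaysEvt with a receive inside a choice to obtain a polling receive. Here pollRecv is an illustrative name rather than a CML primitive, and which alternative is selected when both are available is governed by the choice semantics described next.

(* Sketch: synchronizes immediately, yielding SOME v if the receive on c is
   selected and NONE if the default alwaysEvt is. *)
fun pollRecv (c : 'a chan) : 'a option =
  sync (choose [wrap (recvEvt c, SOME),
                alwaysEvt NONE])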
Much of CML’s expressive power derives from event combinators that construct complex event values from other events. We list some of these combinators in Figure 2.1. The
expression wrap (ev, f) creates an event that when synchronized on applies the result
of synchronizing on event ev to function f . Conversely, guard(f) creates an event,
which when synchronized on, evaluates f() to yield event ev and then synchronizes on
ev . Wrap is thus utilized to provide post-synchronization actions and guard to specify pre-synchronization computation. The choose event combinator takes a list of events and constructs an event value that represents a non-deterministic choice among those events in the list that are available for synchronization. For example, choose[recvEvt(a),
sendEvt(b, v)] when synchronized on will either receive a unit value from channel a ,
or send value v on channel b if there is a sender available on a and a receiver on b . If
only one of the events has a matching participant, that event is selected. If neither have
matching participants the expression blocks until one becomes available. The choose event
is the fundamental building block for selective communication.
Selective communication provided by choice motivates the need for first-class events.
Composition of first-class functions prevents the expression of choice because function
abstraction does not allow operators like choose from synchronizing on events that may
be embedded within the function.
2.1 Programming with CML
In this section we provide a few examples to illustrate the expressivity of CML and
to provide additional insight on the behavior of relevant primitives. Consider encoding a
simple produce consumer pattern leveraging message passing. We can encode both the
producer and the consumer in separate threads of control and couple them through a shared
channel leveraged for communication of values from the producer to the consumer.
val c = channel()
val v = ()
fun producer() =
(send(c,v); producer())
fun consumer() =
(recv(c); consumer())
val tid1 = spawn(producer)
val tid2 = spawn(consumer)
In the code above, the function producer sends the unit value across a shared channel
c . Analogous to the producer function, the consumer function receives values from the
producer on the channel c . Notice that both threads are encoded as infinite loops. Since
the communication between both producer and consumer is synchronous the channel will
have at most one value stored on it. This allows us to express an infinite computation, or
an infinite stream, without having to buffer or potentially utilize infinite space. We call this
type of encoding a lightweight server. Lightweight servers produce values in a demand
driven fashion as they block until a request (matching receive) is available on the channel
over which they are defined.
We can further extend our example to leverage CML events with some minor changes:
val c = channel()
val v = ()
fun producer() =
(sync(sendEvt(c,v)); producer())
fun consumer() =
(sync(recvEvt(c)); consumer())
val tid1 = spawn(producer)
val tid2 = spawn(consumer)
In the code above, we have replaced the CML communication primitives send and
recv with their event counterparts. To trigger the computation of the events, the CML
primitive sync is used. This code behaves exactly as the previous definition as it does not
leverage the inserted events in any interesting way.
Let us consider if there were multiple producers and we wanted to augment our consumer to receive a value from any producer so long as the producer had a value available.
val c1 = channel()
val v1 = 1
val c2 = channel()
val v2 = 2
fun producer1() =
(sync(sendEvt(c1,v1)); producer1())
fun producer2() =
(sync(sendEvt(c2,v2)); producer2())
fun consumer() =
(sync(choose([recvEvt(c1), recvEvt(c2)]));
consumer())
val tid1 = spawn(producer1)
val tid2 = spawn(producer2)
val tid3 = spawn(consumer)
In the code above, we extended our previous encoding of the producer and consumer
communication pattern by adding an additional producer, producer2 . Each producer
generates values on a distinct channel, c1 for producer1 and c2 for producer2 . To
be able to select which producer the consumer receives a value from, we need to construct
a complex event from CML event combinators. The choose combinator picks an event
from a list of events based on its satisfiability. In this example we use choose to pick between receiving on channel c1 and receiving on channel c2 . If there is a value available on both
channels, choose non-deterministically picks one. If there is only one available value,
choose picks the available value. Similarly, if no values are available, choose blocks
until one becomes available and then picks that value.
Consider if we wanted to perform an action based on which value we received or whom
we might have received that value from. Both types of responses can be encoded by utilizing the wrap combinator in conjunction with the complex event we have already created.
fun consumer() =
(sync(wrap(choose([recvEvt(c1), recvEvt(c2)]),
fn x => if x
then ...
else ...));
consumer())
The code above wraps the complex event with a function that branches on the result of
the choice. In this way we can encode responses that are based on the value received by
the consumer. CML also provides us with a mechanism to craft a response based on which
channel we have received a value from, or more abstractly which event was chosen in the
choice. In this running example, a response based on which event was chosen corresponds to a response based on whom the consumer communicated with, as both producers
communicate over unique channels.
fun consumer() =
(sync(choose([wrap(recvEvt(c1),
fn x => response1),
wrap(recvEvt(c2),
fn x => response2)]));
consumer())
By wrapping the individual events occurring within the choice, we can designate a
response based on which event is picked by the choice when it is synchronized upon. In
this code we execute response1 if the receive on c1 is picked and response2 if the receive
on c2 is chosen.
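The examples above exercise wrap and choose; the sketch below illustrates guard, the pre-synchronization counterpart, by deciding which channel to receive from only at synchronization time. The load function is hypothetical and stands in for any computation one might wish to run before the event is built.

(* Sketch: the guarded thunk runs each time the event is synchronized on. *)
fun balancedRecvEvt () =
  guard (fn () =>
    if load c1 <= load c2        (* load is illustrative, not a CML primitive *)
    then recvEvt c1
    else recvEvt c2)

fun consumer () =
  (sync (balancedRecvEvt ()); consumer ())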
3 MLTON
The work presented in the following chapters has been done in the context of MLton, a
whole program optimizing compiler for Standard ML that uses a simply-typed first-order
intermediate language. MLton compiles the full SML 97 language [10], including modules
and functors. MLton’s approach is different from other functional language compilers. It
imposes significant constraints on the compiler, but yields many optimization opportunities
not available with other approaches.
There are numerous issues that arise when translating SML into a simply-typed IL. (Interested readers are directed to the following article on extending the MLton compiler, from which this short description is cited [11].)
First, how does one represent SML modules and functors in a simply-typed IL, since these
typically require much more complicated type systems? MLton’s answer: defunctorize the
program [12]. This transformation turns an SML program with modules into an equivalent one without modules by duplicating each functor at every application and eliminating
structures by renaming variables. Second, how does one represent SML’s polymorphic
types and polymorphic functions in a simply-typed IL? MLton’s answer: monomorphise
the program [13]. This transformation eliminates polymorphism from an SML program by
duplicating each polymorphic datatype and function at every type at which it is instantiated.
Third, how does one represent SML’s higher-order functions in a first-order IL? MLton’s
answer: defunctionalize the program [14]. This transformation replaces higher-order functions with data structures to represent them and first-order functions to apply them; the
resulting IL is a Static Single Assignment (SSA) form [15].
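As a small, self-contained illustration of the idea (not MLton's actual intermediate language or pass), defunctionalizing the program below replaces its two anonymous functions with a datatype that names each lambda and a first-order apply function:

(* Before: a higher-order map applied to two different lambdas. *)
fun map f []        = []
  | map f (x :: xs) = f x :: map f xs
val incs = map (fn x => x + 1) [1, 2, 3]
val dbls = map (fn x => x * 2) [1, 2, 3]

(* After defunctionalization: one constructor per lambda, plus a first-order apply. *)
datatype func = Inc | Dbl
fun apply Inc x = x + 1
  | apply Dbl x = x * 2
fun map' f []        = []
  | map' f (x :: xs) = apply f x :: map' f xs
val incs' = map' Inc [1, 2, 3]
val dbls' = map' Dbl [1, 2, 3]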
Because each of the above transformations requires matching a functor, function definition, or type definition with all possible uses, MLton is a whole-program compiler. As
such, MLton requires all source code for a given program to be present at compile-time and
does not support partial compilation. MLton's whole-program compilation strategy has a
number of implications. Most importantly, MLton's use of defunctorization means that the
placement of code in modules has little measurable effect on performance. The result of
this strategy is that modules are purely for the benefit of the programmer in structuring
code. Since MLton duplicates functors at each use, no run-time penalty is incurred for abstracting a module into a functor. The benefits of monomorphisation are similar. In MLton,
whole-program control-flow analysis based on 0CFA [16] is employed early in the compilation process, immediately after defunctorization and monomorphisation, and well before
any serious code motion or representation decisions are undertaken. Information computed
by the analysis is used in the defunctionalization pass to introduce dispatches at call sites
to the appropriate closure.
In addition to a highly optimized compiler, MLton provides a lightweight runtime layer
that supports threads and garbage collection. The MLton runtime permits concurrency
but does not allow for parallelism. MLton provides concurrency through user defined,
lightweight threads that are multiplexed on a single kernel thread. The primary concurrency
abstractions supported by MLton are those found in the CML library.
3.1 Multi-MLton
Multi-MLton extends MLton with multi-core support, library primitives for efficient
lightweight thread creation and management, as well as PCML [9], an optimized, parallel,
synchronous message passing extension to CML. In this section, we present the details of
Multi-MLton’s runtime.
A functional programming discipline, combined with explicit communication via messages (rather than implicit communication via shared-memory), and associated lightweight
concurrency primitives, results in an attractive programming model as it allows the programmer to reason about communication protocols. However, there are numerous challenges to realizing this model in practice on scalable multi- and manycore platforms, with
respect to both language abstractions and their implementation. It is an investigation of
these challenges that guides the design of Multi-MLton.
Figure 3.1. Abstractions found in Multi-MLton. Communication, either synchronous or asynchronous (expressed via asynchronous events as defined in Chapter 7), is depicted through arrows. Parasites are shown as raw stack frames that comprise a single runtime thread.
Multi-MLton defines a programming model in which threads primarily communicate
via message passing. It differs from other message passing systems insofar as the abstractions it aims to provide permit (a) the expression of isolation between explicitly annotated
groups of threads; (b) composable speculative actions that are message passing aware; (c)
the construction of asynchronous events that seamlessly integrate abstract asynchronous
communication protocols with abstract CML-style events, allowing the expression of heterogeneous protocols, and (d) deterministic concurrency within threads to enable the extraction of additional parallelism when feasible and profitable. A graphical overview of the features and goals of Multi-MLton is given in Figure 3.1. Asynchronous events, depicted in the figure, are introduced in Chapter 7, and salient details for parasitic threads are presented in Section 3.1.1.
3.1.1 Threading System
To support parallel execution, the Multi-MLton runtime utilizes pthreads [17]. Pthreads
are not managed by the programmer; they are created when the program starts. The programmer can pass a runtime argument to indicate how many pthreads to utilize; typically this is one pthread per processor or core. Each pthread manages a lightweight Multi-MLton thread queue. Lightweight threads are created explicitly through a spawn primitive.
Each pthread switches between lightweight MLton threads on its queue when it is preempted, as shown in Figure 3.2. Pthreads can be dynamically spawned and managed by Multi-MLton's runtime, and such functionality is leveraged by certain primitives.

Figure 3.2. The runtime architecture of Multi-MLton utilizes pthreads, each of which contains a queue of lightweight threads, the currently executing lightweight thread, and a pointer to a section of shared memory for allocation.
The MLton garbage collector was also modified to support parallel allocation. Associated with every processor is a memory region used by threads it executes; allocation within
this region requires no global synchronization. These regions are dynamic and growable,
accomplished by grabbing new chunks of memory from a free list. All pthreads must
synchronize when garbage collection is triggered as Multi-MLton does not yet support
concurrent collection.
Host threads
In Multi-MLton we support two types of user defined threads, host threads and parasitic
threads [18]. Host threads are analogous to MLton's lightweight threads. Parasitic threads are typically raw stack frames that, as their name suggests, temporarily execute on top of
a host thread. The implementation supports m host threads running on top of n pthreads,
where m is never less than n. Each processor runs a single pthread which maintains a queue
of host threads. Each host thread has a contiguous stack, allocated on the MLton heap,
which can dynamically grow and shrink during execution. The information associated
with the currently executing host thread is cached in the runtime state associated with each
processor to improve performance. On a context switch, this information is written back
to the host thread. The code to accomplish the context switch is generated by the compiler
and is highly optimized. A new host thread, created using spawn , is placed in a processor
queue in a round-robin fashion. When there are no host threads in a processor queue, the
pthread is suspended, to be woken up when a new host thread is added.
Parasitic threads
Unlike host threads, parasitic threads are implemented as raw stack frames. The expression parasite(e) pushes a new frame to evaluate expression e , similar to a function
call. We capture the stack top at the point of invocation. This corresponds to the caller’s
continuation and is a de facto boundary between the parasite and its host (or potentially
another parasite). If the parasite is not blocked or preempted, the computation runs to completion and control returns to the caller, just as if the caller made a non-tail procedure call.
If the parasitic thread is blocked, the frames associated with its execution are reified. When
the parasitic thread unblocks it is assigned a new host.
Parasitic threads are utilized to implement short-lived asynchronous actions. A parasitic
thread can be inflated into a host thread. However, once such inflation occurs, the newly
inflated host thread will always remain a host thread. Parasitic threads are inflated into host
threads if their execution is preempted.
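The following is a minimal sketch of how a short-lived asynchronous action might be expressed with a parasitic thread, assuming the parasite primitive described above takes a thunk of type unit -> unit; the name asyncSend is illustrative and not part of Multi-MLton's API.

(* Hypothetical sketch: an asynchronous send built from a parasite.
   The send executes on the caller's stack and is inflated into a
   host thread only if it blocks or is preempted. *)
fun asyncSend (ch, v) =
  parasite (fn () => CML.send (ch, v))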
3.1.2 Communication
The core communication primitives in Multi-MLton are those offered by the CML library. Threads communicate across synchronous first-class channels and can construct abstract protocols from the CML event framework. We extend the CML primitives and event
framework with support for asynchronous actions and asynchronous events as described in
Chapter 7.
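As a small illustration of this substrate (a sketch only, using the standard CML interface rather than any Multi-MLton extension), the following creates a synchronous channel, spawns a sender, and abstracts the receive side of the protocol as a first-class event:

(* A synchronous channel shared by two threads. *)
val c : int CML.chan = CML.channel ()
(* The sender blocks until a matching receive is ready. *)
val _ = CML.spawn (fn () => CML.send (c, 42))
(* A first-class event that post-processes the received value. *)
val protocol : string CML.event =
  CML.wrap (CML.recvEvt c, fn n => Int.toString (n + 1))
val answer = CML.sync protocol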
3.1.3 Concluding Remarks
Multi-MLton provides a programming model that leverages pervasive lightweight concurrency and robust communication protocols to specify concurrent and parallel interactions. To realize this programming model we introduce three features of Multi-MLton to
build robust, efficient, and expressive communication protocols. Stabilizers provide per-thread checkpointing and recovery for building robust, fault-tolerant communication abstractions. We introduce partial memoization to accelerate protocols through efficient code
reuse. Lastly we extend CML’s event framework with asynchronous events for building
first-class asynchronous and heterogeneous protocols, protocols which contain both synchronous and asynchronous communication actions.
4 SWERVE
Swerve [8] is an open-source, third-party web-server written in CML and is roughly 16K
lines of CML code. Swerve was originally developed for SML/NJ and later ported to
MLton. The server is composed of a number of modules and libraries. Communication
between modules makes extensive use of CML message passing. Threads communicate
over explicitly defined channels on which they can either send or receive values. Complex
communication patterns are built from CML events.
The web-server configuration is managed through a configuration file that the server
parses during bootstrapping. A folder is specified during configuration that the web-server
utilizes as a location for hosting files and CGI scripts. This folder is traversed and a representation of its file structure is generated that the web-server manipulates during hosting.
In addition, Swerve, like Apache, expects a MIME type configuration file that is also parsed
at bootstrapping.
After bootstrapping, there are five basic modules that govern the core of the server’s
functionality: Listener, File Processor, Network Processor, Timeout Manager,
and the Logger. The Listener module receives incoming HTTP requests and delegates
file serving requirements to concurrently executing processing threads. For each new connection, a new listener is spawned; thus, each connection has one main governing entity.
The Listener manages socket connections and parses incoming requests.
The File Processor module handles access to the underlying file system. Each file
that will be hosted is read by a file processor thread that chunks the file and sends it via
message passing to the Network Processor. The Network Processor, like the File
Processor, handles access to the network. The Network Processor receives the chunks
of the requested file from the File Processor, packetizes them, and sends them on the
network to the client which initiated the request. The File Processor and Network
Processor execute in lock-step, requiring the Network Processor to have completed
sending a chunk before the next one is read from disk. Notice that concurrent requests by
multiple clients can be processed in parallel as each individual request will have a thread
dedicated to file processing as well as a thread dedicated to network processing. These
threads are managed by the Listener module.
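The lock-step protocol can be pictured with the following hedged CML sketch; readChunk and writePacket are illustrative placeholders for Swerve's file and socket I/O, not its actual functions. Because the channel is synchronous, the file loop cannot read chunk n+1 until the network loop has accepted chunk n.

datatype msg = Chunk of Word8Vector.vector | Done

(* File-processing side: read chunks and hand them over one at a time. *)
fun fileLoop (strm, out) =
  case readChunk strm of
      NONE       => CML.send (out, Done)
    | SOME bytes => (CML.send (out, Chunk bytes); fileLoop (strm, out))

(* Network-processing side: packetize and transmit each chunk as it arrives. *)
fun netLoop inp =
  case CML.recv inp of
      Done        => ()
    | Chunk bytes => (writePacket bytes; netLoop inp)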
Timeouts are processed by the Timeout Manager through the use of timed events. The
File Processor, which starts the inter-module communication protocol after a connection and a request have been established, polls for timeouts. If one is detected, the File Processor notifies the other modules it is currently communicating with of the detection.
Those modules then take appropriate cleanup and recovery actions. Therefore, the error notification protocol mirrors the typical communication protocols between the various
modules which comprise Swerve. At each communication, the modules leverage pattern
matching to check if the message contains the data expected or a notification of a timeout.
Timeouts are the most frequent transient fault present in the server, and are difficult to
deal with naively. Indeed, the system’s author notes that handling timeouts in a modular
way is “tricky” and devotes an entire section of the user manual to explaining the pervasive
cross-module error handling in the implementation.
Other errors in the system are propagated using the same mechanism the Timeout
Manager uses to notify all modules of a timeout. Namely, a specific communication pattern
exists, in which interacting modules notify one another through communication primitives
that an error has been detected.
The Logger module is responsible for producing time-stamped messages and writing
them to a log file. In addition to logging, the Logger module is responsible for terminating
the server if a fatal error occurs. After the Logger module writes a time-stamped message
to the log file noting that such an error has occurred, it terminates the server. This allows for a
clean termination and guarantees a record of the error.
Besides the five main modules in the system, Swerve also contains libraries and smaller
modules for parsing and processing HTML, MIME type configuration parsing, an internal
representation of the file system that is hosted by the server, and file management.
4.0.4 Processing a Request
Consider the typical execution flow for processing an incoming request given in Figure 4.1. When a new request is received, the listener spawns a new thread for this connection that is responsible for hosting the requested page. This hosting thread first establishes
a timeout quantum with the timeout manager (1) and then notifies the file processor (2).
The hosting thread creates a file processing thread to process the request and is then notified that the file is ready to be chunked (2). The hosting thread passes to the file processing thread the channel on which it will receive its timeout notification (2). The file processing thread is now responsible for checking for explicit timeout notifications (3).
The file processing thread begins reading the file from disk and sending chunks to the
hosting thread. The hosting thread leverages the functionality of the Network Processor
to receive the chunks, packetize them, and send them on the network (2). Since the
communication is synchronous, the file processing thread will not read the next chunk from
disk until the hosting thread receives the chunk to be sent on the network. The Listener
then terminates the connection and performs clean up actions. If detailed logging is enabled, the Logger logs meta-data about the processed request.
Since a timeout can occur before a particular request starts processing a file (4) (e.g.
within the hosting thread defined by the Listener module) or during the processing of
a file (5) (e.g. within the File Processor), the resulting error handling code is cumbersome. Moreover, the detection of the timeout itself is handled by a third module, the
Timeout Manager. The result is a complicated message passing procedure that spans
multiple modules, each of which must figure out how to deal with timeouts appropriately.
The unfortunate side effect of such code organization is that modularity is compromised.
The code now contains implicit interactions that cannot be abstracted (6) (e.g. the File
Processor must explicitly notify the hosting thread of the timeout).
Figure 4.1. Swerve module interactions for processing a request (solid lines) and error handling control and data flow (dashed lines) for timeouts. The numbers above the lines indicate the order in which communication actions occur.
5 LIGHTWEIGHT CHECKPOINTING FOR CONCURRENT ML
In this chapter, we present a safe and efficient checkpointing mechanism for CML that can
be used to recover from transient faults. We adopt the following definition of transient
faults: a transient fault is an exceptional condition that can often be remedied through re-execution of the code in which it has occurred. Typically, these faults are caused by the
temporary unavailability of a resource. For example, a program that attempts to communicate through a network may encounter timeout exceptions because of high network load
at the time the request was issued. Transient faults may also arise because a resource is
inherently unreliable. In large-scale systems comprised of many independently executing
components, failure of one component may lead to transient faults in others even after the
failure is detected [19]. For example, a server application that enters an unrecoverable error
state may need to be rebooted. Here, the server behaves as a temporarily unavailable resource to its clients who must re-issue requests sent during the period the server was being
rebooted.
Other transient errors, which can potentially be remedied through re-execution, may
occur because program invariants are violated. As an example consider software transactional memory (STM), where regions of code execute with atomicity (all or nothing) and
isolation (no intermediate effects are witnessed) guarantees. Serializability violations that
occur in STMs [20, 21] are typically rectified by aborting the offending transaction and
having it re-execute.
A simple solution to recovery from transient faults and errors would be to capture the
global state of the program before an action executes that could trigger such a fault or
error. If the fault or error occurs and raises an exception, the handler only needs to restore
the previously saved program state. Unfortunately, transient faults often occur in long-running server applications that are inherently multi-threaded but which must nonetheless
exhibit good fault tolerance characteristics; capturing global program state is costly in these
environments. On the other hand, simply re-executing a computation without taking prior
thread interactions into account can result in an inconsistent program state and lead to
further errors.
Suppose two threads communicate synchronously over a shared channel and the sender
subsequently re-executes this code to recover from a transient fault. A spurious unhandled
execution of the (re)sent message may result because the receiver would have no knowledge that a re-execution of the sender has occurred. Thus, it has no need to expect retransmission of a previously executed message. In general, the problem of computing a
sensible checkpoint for a transient fault requires calculating the transitive closure of dependencies that manifest among threads and the section of code which must be re-executed.
To support the definition and restoration of safe and efficient checkpoints in concurrent
functional programs, we propose a new language abstraction called stabilizers. Stabilizers encapsulate three operations. The first initiates monitoring of code for communication
and thread creation events, and establishes thread-local checkpoints when monitored code
is evaluated. These thread-local checkpoints can be viewed as a restoration point for any
transient fault encountered during the execution of the monitored region. The second operation reverts control and state to a safe global checkpoint when a transient fault is detected.
The third operation allows previously established checkpoints to be reclaimed.
The checkpoints defined by stabilizers are first-class and composable: a monitored
procedure can freely create and return other monitored procedures and behaves like a
higher-order function. Stabilizers can be arbitrarily nested, and work in the presence of
a dynamically-varying number of threads and non-deterministic selective communication.
We demonstrate the use of stabilizers for several large applications written in CML.
As a more concrete example of exception recovery, consider a typical remote procedure call in CML. The code shown in Figure 5.1 depicts the server-side implementation of
the RPC. Suppose the request to the server is sent asynchronously, allowing the client to
compute other actions concurrently with the server; it eventually waits on replyCh for the
server’s answer. It may be the case that the server raises an exception while processing of
the client’s request. For example, a condition checked in a guarded event that is a part of
fun rpc-server (request, replyCh) =
let val ans = process request
in spawn(send(replyCh,ans))
end handle Exn => ...
Figure 5.1. A simple server-side RPC abstraction using synchronous communication.
a choice may fail [22]. In addition, the client may attempt to interact with the server while
it is awaiting the server’s response to its initial request. When this happens, how should
the client’s state be reverted to ensure it can retry its original request?
For example, if the client is waiting on the reply channel, the server must ensure exception handlers communicate information back on the channel, to make sure the client
does not deadlock waiting for a response. Moreover, if the client must retry its request,
any effects performed by its computation executed concurrently to its request must also
be reverted. Constructing fault remediation protocols that involve multiple communicating
threads can be complex and unwieldy. Stabilizers, on the other hand, provide the ability to
unroll cross-thread computation in the presence of exceptions quickly and efficiently. This
is especially useful for errors that can be remedied by re-execution.
Stabilizers provide a middle ground between the transparency afforded by operating
systems or compiler-injected checkpoints, and the precision afforded by user-injected checkpoints. In our approach, thread-local state immediately preceding a non-local action (e.g.,
thread communication) is regarded as a possible checkpoint for that thread. In addition, applications may explicitly identify program points where local checkpoints should be taken,
and can associate program regions with these specified points. When a rollback operation
occurs, control reverts to one of these saved checkpoints for each thread. Rollbacks are initiated to recover from transient faults. Applications must still detect such faults, and when
those faults are detected, applications can leverage the functionality provided by stabilizers.
The exact set of checkpoints chosen is determined by safety conditions that ensure a globally consistent state is preserved. When a thread is rolled back to a thread-local checkpoint
state C, our approach guarantees other threads with which the thread has communicated
will be placed in states consistent with C.
5.1 Programming Model
Stabilizers are created, reverted, and reclaimed through the use of three primitives with
the following signatures:
stable    : (’a -> ’b) -> ’a -> ’b
stabilize : unit -> ’a
cut       : unit -> unit
A stable section is a monitored section of code whose effects are guaranteed to be
reverted as a single unit. The primitive stable is used to define stable sections. Given
function f the evaluation of stable f yields a new function f’ identical to f except that
interesting communication, shared memory access, locks, and spawn events are monitored
and grouped based on the stable section in which they occurred. In addition to monitoring
communication actions that occur within stable sections, any communication between a
stable section and a thread not executing within a stable section is also monitored. Thus,
all actions within a stable section are associated with the same checkpoint.¹
The second primitive, stabilize, reverts execution to a dynamically calculated global
state; this state will always correspond to a program state that existed immediately prior
to the execution of a stable section, communication event, or thread spawn point for each
thread. We qualify this claim by observing that external irrevocable operations that occur
within a stable section that needs to be reverted (e.g., I/O, foreign function calls, etc.) must
be handled explicitly by the application prior to an invocation of a stabilize action in
much the same way as when restoring a checkpoint under an application level checkpointing scheme.
¹Mattern-style consistent cuts define a consistent state from a number of local checkpoints based on the ordering of communication events. In much the same way, stabilizers also create consistent global checkpoints from local checkpoints. However, stable sections allow the programmer to specify a series of communication actions that are treated as an atomic unit when calculating a global checkpoint.
Informally, a stabilize action reverts all effects performed within a stable section much
like an abort action reverts all effects within a transaction. However, whereas a transaction
enforces atomicity and isolation until a commit, stabilizers enforce these properties only
when a stabilize action occurs. Thus, the actions performed within a stable section are
immediately visible to the outside. When a stabilize action occurs these effects, along with
their witnesses, are reverted.
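To make the intended use of these primitives concrete, here is a minimal sketch built only from the signatures above; the names requests, replies, process, and the Transient exception are illustrative and not part of the stabilizer interface. The monitored function is obtained with stable and, on a transient fault, stabilize reverts the effects of the current attempt, including those of its communication partners as described below.

(* A monitored server loop: one request/reply exchange per invocation. *)
val serve =
  stable (fn () =>
    let val req = CML.recv requests
    in  CML.send (replies, process req)
    end handle Transient => stabilize ())

val _ = serve ()   (* on Transient, the effects of this exchange are unrolled *)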
The third primitive, cut, establishes a point beyond which stabilization cannot occur.
Cut points can be used to prevent the unrolling of irrevocable actions within a program (e.g.,
I/O). A cut prevents reversion to a checkpoint established logically prior to it; in the case of nested stable sections, it prevents reversion to checkpoints established by the outer stable
sections. Informally, a cut executed by a thread T requires that any checkpoint restored
for T be associated with a program point that logically follows the cut in program order.
Thus, if there is an irrevocable action A (e.g., ’launch missile’) that cannot be reverted, the
expression: atomic ( A ; cut() ) ensures that any subsequent stabilization action does
not cause control to revert to a stable section established prior to A . If such control transfer
and state restoration were permitted, it would (a) necessitate reversion of A’s effects, and (b)
allow A to be re-executed; neither of which is possible without some external compensation
mechanism. The execution of the irrevocable action A and the cut() must be atomic to
ensure that another thread does not perform a stabilization action in between the execution
of A and cut(). Careful consideration must be given to where cut() is used, as it affects
all nested stable sections and global checkpoints constructed from the section in which it
occurs, in much the same way as an irrevocable action is a part of such local and global
checkpoints.
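A hedged sketch of the atomic (A; cut()) idiom follows; the atomically wrapper and the log-writing action stand in for an application-specific atomic primitive and irrevocable effect, and are not part of the stabilizer interface itself.

(* Guarding an irrevocable action: once the log record is written, no
   later stabilization may revert control past this point. *)
fun commitIrrevocably (logStream, bytes) =
  atomically (fn () =>
    (BinIO.output (logStream, bytes);   (* the irrevocable action A *)
     cut ()))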
Unlike classical checkpointing schemes or exception handling mechanisms, the result
of invoking stabilize does not guarantee that control reverts to the state corresponding to
the dynamically-closest stable section. The choice of where control reverts depends upon
the actions undertaken by the thread within the stable section in which the stabilize call
was triggered (for examples see Section 5.1.1).
Figure 5.2. Interaction between stable sections. Clear circles indicate thread-local checkpoints, dark circles represent stabilization actions.
Composability is an important design feature of stabilizers. There is no a priori classification of the procedures that need to be monitored, nor is there any restriction against
nesting stable sections. Stabilizers separate the construction of monitored code regions
from the capture of state. When a monitored procedure is applied, or inter-thread communication action is performed, a potential thread-local restoration point is established. The
application of such a procedure may in turn result in the establishment of other independently constructed monitored procedures. In addition, these procedures may themselves
be applied and have program state saved appropriately. Thus, state saving and restoration
decisions are determined without prejudice to the behavior of other monitored procedures.
5.1.1 Interaction of Stable Sections
When a stabilize action occurs, matching inter-thread events are unrolled as pairs. If
a send is unrolled, the matching receive must also be reverted. If a thread spawns another
thread within a stable section that is being unrolled, this new thread (and all its actions)
must also be discarded. All threads which read from a shared variable must be reverted if
the thread that wrote the value is unrolled to a state prior to the write. A program state is
stable with respect to a statement if there is no thread executing in this state affected by the
statement (e.g., all threads are in a point within their execution prior to the execution of the
statement and its transitive effects).
For example, consider thread t1 that enters a stable section S1 and initiates a communication event with thread t2 (see Figure 5.2(a)). Suppose t1 subsequently enters another
stable section S2 , and again establishes a communication with thread t2 . Suppose further
that t2 receives these events within its own stable section S3 . The program states immediately prior to S1 and S2 represent feasible checkpoints as determined by the programmer,
depicted as white circles in the example. If a rollback is initiated within S2 , then a consistent global state would require that t2 revert back to the state associated with the start of S3
since it has received a communication from t1 initiated within S2 . However, discarding the
actions within S3 now obligates t1 to resume execution at the start of S1 since it initiated a
communication event within S1 to t2 (executing within S3 ). Such situations can also arise
without the presence of nested stable sections. Consider the example in Figure 5.2(b). Once
again, the program is obligated to revert t1 , since the stable section S3 spans communication
events from both S1 and S2 .
Consider the RPC example presented in the introduction rewritten to utilize stabilizers:
stable fn () => let fun rpc-server (request, replyCh) =
let val ans = process request
in spawn(send(replyCh,ans))
end handle Exn => ...
stabilize()
in rpc-server
end
If an exception occurs while the request is being processed, the request and the client are
unrolled to a state prior to the RPC. The client is free to retry the RPC, or perform some
other computation.
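A corresponding client-side sketch is shown below; requestCh is an illustrative name for the channel on which the server listens, and the synchronous send is a simplification of the asynchronous request described earlier. Because the request is issued from within a stable section, a server-initiated stabilize unrolls the client to the point just before the send, and the call can simply be retried.

fun rpcClient req =
  (stable (fn () =>
     let val replyCh = CML.channel ()
     in  CML.send (requestCh, (req, replyCh));
         CML.recv replyCh
     end)) ()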
5.2 Motivating Example
The Swerve design, presented in Chapter 4, illustrates the general problem of dealing
with transient faults in a complex concurrent system: how can we correctly handle faults
that span multiple modules without introducing explicit cross-module dependencies to handle each such fault? To motivate the use of stabilizers, we consider the interactions of three
of Swerve’s modules: the Listener, the File Processor, and the Timeout Manager.
More specifically, we consider how a timeout is handled by these three modules.
Figure 5.3 shows the definition of fileReader, a Swerve function in the file processing
module that sends a requested file to the hosting thread by chunking the file contents into a
series of smaller packets. The file is opened by BinIOReader, a utility function in the File
Processing module. The fileReader function must check in every iteration of the file
processing loop whether a timeout has occurred by calling the Timeout.expired function
due to the restriction that CML threads cannot be explicitly interrupted. If a timeout has
occurred, the procedure is obligated to notify the receiver (the hosting thread) through an explicit send of the value XferTimeout on the channel consumer; timeout information is
propagated from the Timeout module to the fileReader via the abort argument which
is polled.
Stabilizers allow us to abstract this explicit notification process by wrapping the file
processing logic in a stable section. Suppose a call to stabilize replaced the call to
CML.send(consumer, Timeout). This action would result in unrolling both the actions
of sendFile as well as the receiver, since the receiver is in the midst of receiving file
chunks. However, a cleaner solution presents itself. Suppose that we modify the definition
of the Timeout module to invoke stabilize, and wrap its operations within a stable section (see Figure 5.5). Now, there is no need for any thread to poll for the timeout event.
Since the hosting thread establishes a timeout quantum by communicating with Timeout
and passes this information to the file processor thread, any stabilize action performed
by the Timeout Manager will unroll all actions related to processing this file. This transformation, therefore, allows us to specify a timeout mechanism without having to embed
fun fileReader name abort consumer =
let fun loop strm =
if Timeout.expired abort
then CML.send(consumer, XferTimeout)
else let val chunk =
BinIO.inputN(strm, fileChunk)
in read a chunk of the file and send to receiver
loop strm)
end
in (case BinIOReader.openIt abort name
of NONE => ()
| SOME h => (loop (BinIOReader.get h);
BinIOReader.closeIt h))
end
fun fileReader name abort consumer =
let fun loop strm =
let val chunk =
BinIO.inputN(strm, fileChunk)
in read a chunk of the file and send to receiver
loop strm)
end
in stable fn() =>
(case BinIOReader.openIt abort name
of NONE
=>()
| SOME h =>(loop (BinIOReader.get h);
BinIOReader.closeIt h)) ()
end
Figure 5.3. An excerpt of the File Processing module in Swerve.
The code fragment displayed on the bottom shows the code modified
to use stabilizers. Italics mark areas in the original where the code is
changed.
fn () =>
let fun receiver() =
case recv(consumer)
of Info info => (sendInfo info; ...)
 | Chunk bytes => (sendBytes bytes; ...)
 | timeout => error handling code
 | Done => ...
...
in ... ; loop receiver
end
stable fn () =>
let fun receiver() =
case recv(consumer)
of Info info
=> (sendInfo info; ...)
| Chunk bytes
=> (sendBytes bytes; ...)
| Done
=> ...
...
in ... ; loop receiver
end
Figure 5.4. An excerpt of the Network Processor module in Swerve.
The main processing of the hosting thread, created by the Listener module is wrapped in a stable section and the timeout handling code can be
removed. The code fragment on the bottom shows the modifications made
to use stabilizers. Italics in the code fragment on the top mark areas in the
original where the code is removed in the version modified to use stabilizers.
let fun expired (chan) = isSome (CML.poll chan)
fun trigger (chan) = CML.send(chan, timeout)
...
in ...; trigger(chan)
end
let fun trigger (chan) = stabilize()
...
in stable (fn() => ... ; trigger(chan)) ()
end
Figure 5.5. An excerpt of the Timeout Manager module in Swerve. The
bottom code fragment shows the code modified to use stabilizers. The
expired function can be removed and trigger now calls stabilize. Italics
mark areas in the original where the code is changed.
non-local timeout handling logic within each thread that potentially could be affected. The
hosting thread itself is also simplified (as seen in Figure 5.4). By wrapping its logic within
a stable section, we can remove all of its timeout error handling code as well. A timeout is now handled completely through the use of stabilizers localized within the Timeout
module. This improved modularization of concerns does not lead to reduced functionality
or robustness. Indeed, a stabilize action causes the timed-out request to be transparently
re-processed with the file being resent, or allows the web-server to process a new request,
depending on the desired behavior.
5.2.1 Cut
The cut primitive can be used to delimit the effects of stabilize calls. Consider the
example presented in Figure 5.6, which depicts three separate servers operating in parallel.
A central coordinator dispatches requests to individual servers and acts as the front-end
for handling user requests. The dispatch code, presented below, contains a stable section
and each server has its request processing (as defined in the previous section) wrapped in
Figure 5.6. A multi-server implementation which utilizes a central coordinator and multiple servers. A series of requests is multiplexed between
the servers by the coordinator. Each server handles its own transient faults.
The shaded portions represent computation which is unrolled due to the
stabilize action performed by the server. Single arrows represent communication to servers and double arrows depict return communication. Circular wedges depict communications which are not considered because a
cut operation limits their effects.
stable sections. After each request is completed, the server establishes a cut point so that
the request is not repeated if an error is detected on a different server.
Servers utilize stabilizers to handle transient faults. Since the servers are independent
of one another, a transient fault local to one server should not affect another. Only the request allocated to that server must be re-executed.
When the coordinator discovers an error it calls stabilize to unroll request processing.
All requests which encountered an error will be unrolled and automatically retried. Those
which completed will not be affected.
fun multirequest(requestList) =
foreach
fn (request,replyCh) =>
let val serverCh = freeServer()
in spawn stable(
fn () => (send(serverCh, request);
let val reply = recv(serverCh)
in (send(replyCh,reply);
cut())
end))
end
requestList
The code above depicts a front-end function which handles multiple requests by dispatching
them among a number of servers. The function freeServer finds the next available server
to process the request. Once the front-end receives a reply from the server, subsequent
stabilize actions by other threads will not result in the revocation of previously satisfied
requests. This is because the cut() operation prevents rollback of any previously satisfied
request. If a stabilization action does occur, the cut() prevents the now-satisfied request to this server from being re-executed; only the server that raised the exception is unrolled.
5.3 Semantics
Our semantics is defined in terms of a core call-by-value functional language with
threading primitives. We present the syntax of the language in Figure 5.7 and provide
an operational semantics in Figure 5.8. We first present an interpretation of stabilizers in
which evaluation of stable sections immediately results in the capture of a consistent global
checkpoint. Furthermore, we restrict the language to capture checkpoints only upon entry
to stable sections, rather than at any communication or thread creation action. This semantics reflects a simpler characterization of checkpointing than the informal description
presented in Section 5.1. In Section 5.4, we refine this approach to construct checkpoints
incrementally, and to allow checkpoints to be captured at any communication or thread
creation action.
In the following, we use metavariables v to range over values, and δ to range over stable
sections or checkpoint identifiers. We also use P for thread terms, and e for expressions. We
use an over-bar to represent a finite ordered sequence; for instance, f̄ represents f1 f2 . . . fn.
SYNTAX:
P ::= ∅ | P ‖ P | t[e]_δ̄
e ::= x | l | λx.e | e(e) | mkCh() | send(e, e) | recv(e) | spawn(e)
    | stable(e) | stable(e) | stabilize() | cut()

EVALUATION CONTEXTS:
E ::= • | E(e) | v(E) | send(E, e) | send(l, E) | recv(E) | spawn(E) | stable(E) | stable(E)

LOCAL EVALUATION RULES:
λx.e(v) → e[v/x]
mkCh() → l, l fresh
stable(stable(λx.e)) → stable(λx.e)

PROGRAM STATES:
P ∈ Process
t ∈ Tid
x ∈ Var
l ∈ Channel
δ ∈ StableId
v ∈ Val = unit | λx.e | stable(λx.e) | l
α, β ∈ Op = {LR, SP(t, e), COMM(t, t′), SS, ST, ES, CUT}
Λ ∈ Process × StableMap
∆ ∈ StableMap = (StableId →fin Process × StableMap) + ⊥

Figure 5.7. Stabilizers Semantics – The syntax, evaluation contexts, and domain equations for a core call-by-value language for stabilizers.
(LOCAL)
e → e′
P ‖ t[E[e]]_δ̄, ∆  =⇒^LR  P ‖ t[E[e′]]_δ̄, ∆

(SPAWN)
t′ fresh
P ‖ t[E[spawn(λx.e)]]_δ̄, ∆  =⇒^SP(t′, e[unit/x])  P ‖ t[E[unit]]_δ̄ ‖ t′[e[unit/x]]_φ, ∆

(COMM)
P = P′ ‖ t[E[send(l, v)]]_δ̄ ‖ t′[E′[recv(l)]]_δ̄′
P, ∆  =⇒^COMM(t, t′)  P′ ‖ t[E[unit]]_δ̄ ‖ t′[E′[v]]_δ̄′, ∆

(CUT)
P = P′ ‖ t[E[cut()]]_δ̄    P″ = P′ ‖ t[E[unit]]_δ̄
P, ∆  =⇒^CUT  P″, ⊥

(STABLE)
δ′ fresh    ∀δ ∈ Dom(∆), δ′ ≥ δ    ∆′ = ∆[δ′ ↦ (P ‖ t[E[stable(λx.e)(v)]]_δ̄, ∆)]
Λ = ∆′(δ_min) where ∀δ ∈ Dom(∆′), δ_min ≤ δ    Λ′ = P ‖ t[E[stable(e[v/x])]]_δ′.δ̄, ∆[δ′ ↦ Λ]
P ‖ t[E[stable(λx.e)(v)]]_δ̄, ∆  =⇒^SS  Λ′

(STABLE-EXIT)
P ‖ t[E[stable(v)]]_δ.δ̄, ∆  =⇒^ES  P ‖ t[E[v]]_δ̄, ∆ − {δ}

(STABILIZE)
∆(δ) = (P′, ∆′)
P ‖ t[E[stabilize()]]_δ.δ̄, ∆  =⇒^ST  P′, ∆′

Figure 5.8. Stabilizers Semantics – The global evaluation rules for a core call-by-value language for stabilizers.
The term α.ᾱ denotes the prefix extension of the sequence ᾱ with a single element α, ᾱ.α the suffix extension, ᾱᾱ′ denotes sequence concatenation, φ denotes empty sequences and sets, and ᾱ ≤ ᾱ′ holds if ᾱ is a prefix of ᾱ′. We write |ᾱ| to denote the length of the sequence ᾱ.
Our communication model is a message passing system with synchronous send and
receive operations. We do not impose a strict ordering of communication actions on channels; communication actions on the same channel are paired non-deterministically. To
model asynchronous sends, we simply spawn a thread to perform the send. To this core
language we add three new primitives: stable, stabilize, and cut. When a stable function is applied, a global checkpoint is established, and its body, denoted as stable(e), is
evaluated in the context of this checkpoint. The second primitive, stabilize, is used to
initiate a rollback and the third, cut, prevents further rollback in the thread in which it
executes due to a stabilize action.
The syntax and semantics of the language are given in Figure 5.7 and Fig 5.8. Expressions include variables, locations that represent channels, λ-abstractions, function applications, thread creations, channel creations, communication actions that send and receive
messages on channels, or operations which define stable sections, stabilize global state to
a consistent checkpoint, or bound checkpoints. We do not consider references in this core
language as they can be modeled in terms of operations on channels.
A program is defined as a set of threads and we utilize φ to denote the empty program.
Each thread is uniquely identified, and is also associated with a stable section identifier (denoted by δ) that indicates the stable section the thread is currently executing within. Stable
section identifiers are ordered under a relation that allows us to compare them (e.g. they
could be thought of as integers incremented by a global counter). For convention we assume δs range from 0 to δmax , where δmax is the numerically largest identifier (e.g., the last
created identifier). Thus, we write t[e]δ if a thread with identifier t is executing expression
e in the context of the stable section with identifier δ. Since stable sections can be nested,
the notation generalizes to sequences of stable section identifiers with sequence order reflecting nesting relationship. We omit decorating a term with stable section identifiers when
not necessary. Our semantics is defined up to congruence of threads (P ‖ P′ ≡ P′ ‖ P). We write P \ {t[e]} to denote the set of threads that do not include a thread with identifier t, and P ⊕ {t[e]} to denote the set of threads that contain a thread executing expression e with identifier t. We use evaluation contexts to specify order of evaluation within a thread, and
to prevent premature evaluation of the expression encapsulated within a spawn expression.
A program state consists of a collection of evaluating threads (P) and a stable map
(∆) that defines a finite function associating stable section identifiers to states. A program
begins evaluation with an empty stable map (⊥). Program evaluation is specified by a global reduction relation, P, ∆ =⇒^α P′, ∆′, that maps a program state to a new program state. We tag each evaluation step with an action, α, that defines the effects induced by evaluating the expression. We write =⇒^ᾱ* to denote the reflexive, transitive closure of this relation.
The actions of interest are those that involve communication events, or manipulate stable sections. We use the label LR to denote local reduction actions, SP(t, e) to denote thread creation, COMM(t, t′) to denote communication, SS to indicate the start of a stable section, ST to indicate a stabilize operation, ES to denote the exit from a stable section, and CUT to indicate a cut action.
Local reductions within a thread are specified by an auxiliary relation, e → e′, that evaluates expression e within some thread to a new expression e′. The local evaluation
rules are standard: function application substitutes the value of the actual parameter for the
formal in the function body; channel creation results in the creation of a new location that
acts as a container for message transmission and receipt; and, supplying a stable function
as an argument to a stable expression simply yields the stable function.
There are seven global evaluation rules. The first (rule (LOCAL)) simply models global state change to reflect thread-local evaluation. The second (rule (SPAWN)) describes changes to the global state when a thread is created to evaluate a thunk (λx.e); the new thread evaluates e in a context without any stable identifier. The third (rule (COMM)) describes how
a communication event synchronously pairs a sender attempting to transmit a value along
a specific channel in one thread with a receiver waiting on the same channel in another
thread. Evaluating cut (rule (CUT)) discards the current global checkpoint. The existing
stable map is replaced by an empty one. This rule ensures that no subsequent stabilization
action will ever cause a thread to revert to a state that existed logically prior to the cut.
While certainly safe, the rule is also very conservative, affecting all threads even those that
have had no interaction (either directly or indirectly) with the thread performing the cut.
We present a more refined treatment in Section 5.4.
The remaining rules are ones involving stable sections. When a stable section is newly
entered (rule (STABLE)), a new stable section identifier is generated. These identifiers are
related under a total order that allows the semantics to express properties about lifetimes
and scopes of such sections. The newly created identifier is associated with its creating
thread. The checkpoint for this identifier is computed as either the current state if no
checkpoint exists, or the current checkpoint. In this case, our checkpointing scheme is
conservative: if a stable section begins execution, we assume it may have dependencies to
all other currently active stable sections. Therefore, we set the checkpoint for the newly
entered stable section to the checkpoint taken at the start of the oldest active stable section.
When a stable section exits (rule (STABLE-EXIT)), the thread context is appropriately updated to reflect that the state captured when this section was entered no longer represents an
interesting checkpoint; the stable section identifier is removed from its creating thread. A
stabilize action (rule (STABILIZE)) simply reverts the state to the current global checkpoint.
Note that the stack of stable section identifiers recorded as part of the thread context
is not strictly necessary since there is a unique global checkpoint that reverts the entire
program state upon stabilization. However, we introduce it here to help motivate our next
semantics that synthesizes global checkpoints from partial ones, and for which maintaining
such a stack is essential.
5.3.1 Example
Consider the example program shown in Fig 5.9. We illustrate how global checkpoints
would be constructed for this program in Figure 5.10. Initially, thread t1 spawns thread
t2 . Afterwards, t1 begins a stable section, creating a global checkpoint prior to the start
let fun f() = ...
fun g() = ... recv(c) ...
fun h() = ... send(c,v) ...
in spawn(stable h);
(stable g) (stable f ())
end
Figure 5.9. An example used to illustrate the interaction of inter-thread
communication and stable sections. The call to f establishes an initial
checkpoint. Although g and h do not interact directly with f, the checkpoint established by f may nonetheless be restored on a call to stabilize as
illustrated by Figure 5.10.
Figure 5.10. An example of global checkpoint construction where the inefficiency of global checkpointing causes the restoration of a checkpoint established prior to the stable section in which a call to stabilize occurs.
of the stable section. Additionally, it creates an identifier (δ1 ) for this stable section. We
establish a binding between δ1 and the global checkpoint in the stable map, ∆. Next,
thread t2 begins its stable section. Since ∆ is non-empty, t2 maps its identifier δ2 to the
checkpoint stored by the least δ, namely the checkpoint taken by δ1 , rather than creating a
new checkpoint. Then, thread t1 exits its stable section, removing the binding for δ1 from
∆. It subsequently begins execution within a new stable section with identifier δ3 . Again,
instead of taking a new global checkpoint, δ3 is mapped to the checkpoint taken by the least
δ, in this case δ2 . Notice that δ2 ’s checkpoint is the same as the one taken for δ1 . Lastly,
t1 and t2 communicate. Observe that the same state is restored regardless of whether we
revert to either δ2 or δ3 . In either case, the checkpoint that would be restored would be the
one initially created by δ1 . This checkpoint gets cleared only once no thread is executing
within a stable section.
5.3.2 Soundness
The soundness of the semantics is defined by an erasure property on stabilize actions.
Consider the sequence of actions α that comprise a potential execution of a program; initially, the program has not established any stable section, i.e., δ = φ. Suppose that there is a
stabilize operation that occurs after α. The effect of this operation is to revert the current
global program state to an earlier checkpoint. However, assuming that program execution
successfully continued after the stabilize call, it follows that there exists a sequence of
actions from the checkpoint state that yields the same state as the original, but which does
not involve execution of stabilize. In other words, stabilization actions can never manufacture new states nor restore to inconsistent states, and thus have no effect on the final
state of program evaluation.
Theorem [Safety]. Let E^φ_{t,P}[e], ∆ =⇒^ᾱ* P′, ∆′ =⇒^{ST.β̄}* P″ ‖ t[v], ∆_f. Then, there exists an equivalent evaluation E^φ_{t,P}[e], ∆ =⇒^{ᾱ′.β̄}* P″ ‖ t[v], ∆_f such that ᾱ′ ≤ ᾱ.

Proof sketch. By assumption and rules (STABLE) and (STABILIZE), there exist evaluation sequences of the form:

E^φ_{t,P}[e], ∆ =⇒^{ᾱ′}* P₁, ∆₁ =⇒^SS P₂, ∆₂

and

P′, ∆′ =⇒^ST P₁, ∆₁ =⇒^{β̄}* P″ ‖ t[v], ∆_f

Moreover, ᾱ′ ≤ ᾱ since the state recorded by the stable operation must precede the evaluation of the stabilize action that reverts to that state. □
5.4 Incremental Construction
While easily defined, the semantics is highly conservative because there may be check-
points that involve less unrolling that the semantics does not identify. Consider again the
example given in Figure 5.9. The global checkpoint calculation reverts execution to the
program state prior to execution of f even if f successfully completed. Furthermore,
communication events that establish inter-thread dependencies are not considered in the
checkpoint calculation. Thus, all threads, even those unaffected by effects that occur in the
interval between when the checkpoint is established and when it is restored, are unrolled.
A better alternative would restore thread state based on the actions witnessed by threads
within checkpoint intervals. If a thread T observes action α performed by thread T′ and T is restored to a state that precedes the execution of α, T′ can be restored to its latest
local checkpoint state that precedes its observance of α. If T witnesses no actions of other
threads, it is unaffected by any stabilize calls those threads might make. This strategy
leads to an improved checkpoint algorithm by reducing the cost of restoring a checkpoint,
limiting the impact to only those threads that witness global effects, and establishing their
rollback point to be as temporally close as possible to their current state.
In Figure 5.11 we provide additional syntax and domain equations for a semantics that
utilizes incremental checkpoint constructions. Figure 5.12 and Figure 5.13 present a refinement to the semantics that incrementally constructs a dependency graph as part of program
execution. Checkpointing is now defined with respect to the capture of the communica-
tion, spawn, and stable actions performed by threads within a graph data structure. This
structure consists of a set of nodes representing interesting program points, and edges that
connect nodes that have shared dependencies. Nodes are indexed by ordered node identifiers, and hold thread state and record the actions that resulted in their creation. We also
define maps to associate threads with nodes (η), and stable section identifiers with nodes
(σ) in the graph.
Informally, the actions of each thread in the graph are represented by a chain of nodes
that define temporal ordering on thread-local actions. Back-edges are established to nodes
representing stable sections; these nodes define possible per-thread checkpoints. Sources
of backedges are communication actions that occur within a stable section, or the exit of a
nested stable section. Edges also connect nodes belonging to different threads to capture
inter-thread communication events.
The evaluation relation P, G ;^α P′, G′ evaluates a process P executing action α with respect to a communication graph G to yield a new process P′ and a new graph G′. As usual, ;* denotes the reflexive, transitive closure of this relation. Programs initially begin evaluation with respect to an empty graph. The auxiliary relation t[e], α, G ⇓ G′ models intra-thread actions within the graph (see rules (BUILD)). It establishes a new node to capture
thread-local state, and sets the current node marker for the thread to this node. In addition,
if the action occurs within a stable section, a back-edge is established from that node to
this section. This back-edge is used to identify a potential rollback point. If a node has a
back-edge the restoration point will be determined by traversing these back-edges. Thus, it
is safe to not store thread contexts with such nodes (⊥ is stored in the node in that case).
New nodes added to the graph are created with a node identifier guaranteed to be greater
than any existing node.
When a new thread is spawned (rule (SPAWN)), a new node and a new stack for the
thread are created. An edge is established from the parent to the child thread in the graph.
When a communication action occurs (rule (COMM)), a bi-directional edge is added between the current node of the two threads participating in the communication.
When a cut action is evaluated (rule (CUT)), a new node is added to the graph that
records the action. A subsequent stabilization action that traverses the graph must not visit this node, which acts as a barrier to prevent restoration of thread state that existed before it. When a stable section is entered (rule (STABLE)), a new stable section identifier and a new node are created. A new graph that contains this node is constructed, and an association between the thread and this node is recorded. When a stable section exits (rule (STABLE-EXIT)), this association is discarded, although a node is also added to the graph.
Graph reachability is used to ascertain a global checkpoint when a stabilize action
is performed (rule (STABILIZE)). When thread T performs a stabilize call, all nodes
reachable from T ’s current node in the graph are examined, and the context associated with
the least such reachable node (as defined by the node’s index) for each thread is used as
the thread-local checkpoint for that thread. If a thread is not affected (transitively) by the
actions of the thread performing the rollback, it is not reverted to any earlier state. The
collective set of such checkpoints constitutes a global state. The graph resulting from a
stabilize action does not contain these reachable nodes; the expression G/n defines the
graph in which all nodes reachable from node n are removed from G. Here, n is the node
indexed by the most recent stable section (δ) in the thread performing the stabilization.
An important consistency condition imposed on the resulting graph is that it not contain a CUT node. This prevents stabilization from incorrectly reverting control to a stable section established prior to a cut. Besides threads that are affected by a stabilize action because of dependencies, there may be other threads that are unaffected. If P′ is the set of processes affected by a stabilize call, then Ps = P \ P′, the set difference between P and P′, represents the set of threads unaffected by a stabilize action; the set P′ ⊕ Ps is therefore
the set that, in addition to unaffected threads, also includes those thread states representing
globally consistent local checkpoints among threads affected by a stabilize call.
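The reachability computation described above can be made concrete with the following illustrative SML sketch; the node and edge representation, and the omission of the CUT-node check and of the actual saved continuations, are simplifications and do not reflect the Multi-MLton implementation.

(* A node records its creation index, owning thread, and (possibly absent) state. *)
type node = {id : int, tid : int, state : string option}

(* Successors of a node id in an edge list. *)
fun successors edges n =
  List.mapPartial (fn (src, dst) => if src = n then SOME dst else NONE) edges

(* All node ids reachable from the stabilizing thread's node, by depth-first search. *)
fun reachable edges start =
  let fun go (n, visited) =
        if List.exists (fn m => m = n) visited then visited
        else List.foldl go (n :: visited) (successors edges n)
  in go (start, []) end

(* For each thread owning a reachable node, its least-indexed reachable node is the
   rollback point; threads owning no reachable node are unaffected and are omitted. *)
fun rollbackPoints (nodes : node list) edges start =
  let val reach = reachable edges start
      val hit = List.filter (fn {id, ...} => List.exists (fn m => m = id) reach) nodes
      fun older (a : node, b : node) = if #id a < #id b then a else b
      fun insert (n : node, acc) =
        case List.partition (fn (m : node) => #tid m = #tid n) acc of
            ([m], rest) => older (n, m) :: rest
          | (_, rest)   => n :: rest
  in List.foldl insert [] hit end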
SYNTAX:
P ::= ∅ | P ‖ P | t[e]_δ̄

LOCAL EVALUATION RULES:
e → e′
E^δ̄_{t,P}[e], G  ;^LR  E^δ̄_{t,P}[e′], G

PROGRAM STATES:
n ∈ Node = NodeId × Tid × Op × (Process + ⊥)
n ↦ n′ ∈ Edge = Node × Node
δ ∈ StableID
η ∈ CurNode = Tid →fin Node
σ ∈ StableSections = StableID →fin Node
G ∈ Graph = NodeId × P(Node) × P(Edge) × CurNode × StableSections

Figure 5.11. Stabilizers Semantics – Additional syntax, local evaluation rules, as well as domain equations for a semantics that utilizes incremental checkpoint construction.
(BUILD)
n = (i + 1, t, α, t[E[e]]_φ)
G′ = ⟨ i + 1, N ∪ {n}, E ∪ {η(t) ↦ n}, η[t ↦ n], σ ⟩
t[E[e]]_φ, α, ⟨ i, N, E, η, σ ⟩ ⇓ G′

n = σ(δ)    n′ = (i + 1, t, α, ⊥)
G′ = ⟨ i + 1, N ∪ {n′}, E ∪ {η(t) ↦ n′, n′ ↦ n}, η[t ↦ n′], σ ⟩
t[E[e]]_δ.δ̄, α, ⟨ i, N, E, η, σ ⟩ ⇓ G′

(SPAWN)
t′ fresh
t[E[spawn(λx.e)]]_δ̄, SP(t′, e[unit/x]), G ⇓ ⟨ i, N, E, η, σ ⟩
n = (i, t, SP(t′, e[unit/x]), t′[e[unit/x]]_φ)
G′ = ⟨ i, N ∪ {n}, E ∪ {η(t) ↦ n}, η[t′ ↦ n], σ ⟩
P ‖ t[E[spawn(λx.e)]]_δ̄, G  ;^SP(t′, e[unit/x])  P ‖ t[E[unit]]_δ̄ ‖ t′[e[unit/x]]_φ, G′

(COMM)
P = P′ ‖ t[E[send(l, v)]]_δ̄ ‖ t′[E′[recv(l)]]_δ̄′
t[E[send(l, v)]]_δ̄, COMM(t, t′), G ⇓ G′
t′[E′[recv(l)]]_δ̄′, COMM(t, t′), G′ ⇓ G″
G″ = ⟨ i, N, E, η, σ ⟩    G‴ = ⟨ i, N, E ∪ {η(t) ↦ η(t′), η(t′) ↦ η(t)}, η, σ ⟩
P, G  ;^COMM(t, t′)  P′ ‖ t[E[unit]]_δ̄ ‖ t′[E′[v]]_δ̄′, G‴

(CUT)
t[E[unit]]_δ̄, CUT, G ⇓ G′
P ‖ t[E[cut()]]_δ̄, G  ;^CUT  P ‖ t[E[unit]]_δ̄, G′

Figure 5.12. Stabilizers Semantics – Global evaluation rules for a semantics that utilizes incremental checkpoint construction.
(STABLE)
G = ⟨ i, N, E, η, σ ⟩    δ fresh
n = (i + 1, t, SS, t[E[stable(λx.e)(v)]]_δ̄)
G′ = ⟨ i + 1, N, E ∪ {η(t) ↦ n}, η[t ↦ n], σ[δ ↦ n] ⟩
P ‖ t[E[stable(λx.e)(v)]]_δ̄, G  ;^SS  P ‖ t[E[stable(e[v/x])]]_δ.δ̄, G′

(STABLE-EXIT)
t[E[stable(v)]]_δ.δ̄, ES, G ⇓ G′
G′ = ⟨ i, N, E, η, σ ⟩    G″ = ⟨ i, N, E, η, σ − {δ} ⟩
P ‖ t[E[stable(v)]]_δ.δ̄, G  ;^ES  P ‖ t[E[v]]_δ̄, G″

(STABILIZE)
G = ⟨ i, N, E, η, σ ⟩    σ(δ) = n    G′ = G/n
⟨i, t, α, t[E[cut()]]⟩ ∉ G′
P′ = {t[e] | n = ⟨i, t, α, t[e]⟩ ∈ G′, i < j for all ⟨j, t, α′, t[e′]⟩ ∈ G′}
P″ = P′ ⊕ (P \ P′)
P ‖ t[E[stabilize()]]_δ.δ̄, G  ;^ST  P″, G′

Figure 5.13. Stabilizer Semantics – Global evaluation rules for a semantics that utilizes incremental checkpoint construction (continued).
5.4.1 Example
To illustrate the semantics, consider the sequence of actions shown in Figure 5.14 that is
based on the example given in Figure 5.9. Initially, thread t1 spawns the thread t2 , creating
a new node n2 for thread t2 and connecting it to node n1 with a directed edge. The node
n3 represents the start of the stable section monitoring function f with identifier δ1 . Next,
a monitored instantiation of h is called, and a new node (n4 ) associated with this context is
allocated in the graph and a new identifier is generated (δ2 ). No changes need to be made to
the graph when f exits its stable section. However, since δ1 cannot be restored by a stabilize
Figure 5.14. An example of incremental checkpoint construction for the code presented in Figure 5.9.
call within this thread – it is mapped to φ in σ. Monitoring of function g results in a new
node (n5 ) added to the graph. A backedge between n5 and n3 is not established because
control has exited from the stable section corresponding to n3. As before, we
generate a new identifier δ3 that becomes associated with n5 . Lastly, consider the exchange
of a value on channel c by the two threads. Nodes corresponding to the communication
actions are created, along with back-edges to their respective stable sections. Additionally,
a bi-directional edge is created between the two nodes.
Recall the global checkpointing scheme would restore to a global checkpoint created
at the point the monitored version of f was produced, regardless of where a stabilization
action took place. In contrast, a stabilize call occurring within the execution of either g
or h using this incremental scheme would restore the first thread to the continuation stored
in node n3 (corresponding to the context immediately preceding the call to g), and would
restore the second thread to the continuation stored in node n2 (corresponding to the context
immediately preceding the call to h). We formalize this intuition and prove this algorithm
correct in Section 5.5.
5.5 Efficiency
We have demonstrated the safety of stabilization actions for global checkpoints: the
state restored from a global checkpoint must have been previously encountered during execution of the program. We now introduce the notion of efficiency. Informally, incremental
checkpoints are more efficient than global ones because the amount of computation that
must be performed following restoration of an incremental checkpoint is less than the computation that must be performed following restoration of a global one. To prove this, we
show that from the state restored by a global checkpoint, we can take a sequence of evaluation steps that leads to the same state restored by an incremental checkpoint. Note that
efficiency also implies safety. Since the state restored by a global checkpoint can eventually lead to the state produced by an incremental one, and global checkpoints are safe (by
Theorem [Safety]), it follows that incremental ones must be safe as well.
The following lemma states that if a sequence of actions does not modify the dependency
graph, then all of those actions must have been LR.
Lemma 1. [Safety of LR]: If E^{t,P}_φ[e], G ↝*_α E^{t,P}_φ[e′], G, then α = LR.

The proof follows from the structure of the rules, since only global rules augment G and
local reductions do not. □
A thread’s execution, as defined by the semantics, corresponds to an ordered series
of nodes within the communication graph. As an example, consider Figure 5.14 which
illustrates how a graph is constructed from a series of evaluation steps. Threads t1 and t2
are represented as paths, [n1 , n3 , n5 , n7 ] for t1 and [n2 , n4 , n6 ] for t2 , in the graph depicted in
Figure 5.14(f).
We define a path in the graph G for a thread as a sequence of nodes, where (a) the first
node in the sequence either has no incoming edges, or a single spawn edge whose source
is a node from a different thread, and (b) the last node either has no outgoing edges, or a
single communication back-edge to another node. Thus, a path is a chain of nodes with the
same thread identifier. Then a graph is a set of paths connected with communication and
spawn edges. A well-formed graph is a set of unique paths, one for each thread. Each edge
in this path corresponds to a global action.
Let P^G_t be a path extracted from graph G for thread t. By the definition of ⇓, every node
in this path contains: (a) the identity of the thread which performed the action that led to
the insertion of the node in the graph; (b) the action performed by the thread that triggered
the insertion; and (c) the remaining computation for the thread at the point where the action
was performed. An action can be of the form SP(t′, e), indicating that a new thread t′ was
spawned with label (t′, e); COMM(t, t′), indicating that a communication action between the
current thread (t) and another thread (t′) has occurred; or SS, reflecting the entry into a
stable section by the executing thread. A schedule S^G_t is a temporally ordered sequence of
tuples extracted from P^G_t that represents all actions performed by t on G.
(SPAWN)
    T = (t, SP(t′, e″), e′).S ∪ T′
    ─────────────────────────────────────────────────────────────
    T, t[e] ‖ P  ↦_{G}^{SP(t′,e″)}  S^G_{t′} ∪ S ∪ T′, t[e′] ‖ t′[e″] ‖ P

(COMM)
    T = (t1, COMM(t1, t2), e′1).S1 ∪ (t2, COMM(t1, t2), e′2).S2 ∪ T′
    ─────────────────────────────────────────────────────────────
    T, t1[e1] ‖ t2[e2] ‖ P  ↦_{G}^{COMM(t1,t2)}  S1 ∪ S2 ∪ T′, t1[e′1] ‖ t2[e′2] ‖ P

(STABLE)
    T = (t, SS, e′).S ∪ T′
    ─────────────────────────────────────────────────────────────
    T, t[e] ‖ P  ↦_{G}^{SS}  S ∪ T′, t[e′] ‖ P

(EXIT STABLE)
    T = (t, ES, e′).S ∪ T′
    ─────────────────────────────────────────────────────────────
    T, t[e] ‖ P  ↦_{G}^{ES}  S ∪ T′, t[e′] ‖ P

(CUT)
    T = (t, CUT, e′).S ∪ T′
    ─────────────────────────────────────────────────────────────
    T, t[e] ‖ P  ↦_{G}^{CUT}  S ∪ T′, t[e′] ‖ P

Figure 5.15. The relation ↦ defines how to evaluate a schedule T derived from a graph G.
We now proceed to define a new semantic relation ↦ (see Figure 5.15) that takes a graph G,
a set of schedules T, and a given program state P, and produces a new set of thread
schedules T′ and a new program state P′. Informally, ↦ examines the continuations in
schedules obtained from the communication graph to define a specific evaluation sequence.
It operates over schedules based on the following observation: given an element π = (t, α, e)
in a schedule, in which an expression e represents the computation still to be performed by
t, the next element in the schedule π′ can be derived by performing the action α and some
number of thread-local reductions.
The rule (SPAWN) in Figure 5.15 applies to a thread t whose first action in its recorded
schedule within the communication graph is a spawn action. The rule performs this action
by yielding a new process state that includes the new thread, and a new schedule that
reflects the execution of the action. The rule for communication (rule (COMM)) takes the
schedules of two threads that were waiting to initiate a communication with one another,
and yields a new process state in which the effects of the communication are recorded.
Entry into a stable section (rule (STABLE)) establishes a thread-local checkpoint. Rules
(EXIT STABLE) and (CUT) install the operation's continuation and remove the action from
the schedule.
These rules skip local reductions. This is safe because if there existed a reduction that
augmented the graph it would be present in some schedule (such a reduction obviously
does not include stabilize). The following lemma formalizes this intuition. (If N is the set
of nodes reachable from the roots of G′, then G/G′ denotes the graph that results from
removing N and nodes reachable from N from G.)
Lemma 2. [Schedule Soundness]: Suppose there exist G and G′ such that P, G ↝*_α P′, G′
and ST ∉ α. Let T be the schedule derived from G″, where G″ = G′/G. Then T, P ↦*_{G″} φ, P′.
The proof is by induction on the size of G′/G. The base case has G = G′, and therefore
|G″| = 0. By Lemma 1, α = LR, which implies P = P′. Since a schedule only includes actions
derived from the graph G′/G, which is empty, T = φ. Suppose the theorem holds for |G″| = n.
To prove the inductive step, consider P, G ↝ P1, G1 ↝ P′, G′ where |G1/G| = n. By the
induction hypothesis, we know T, P ↦*_{G″} T′, P1 and T, P ↦*_{G1/G} φ, P1, both via the
sequence of actions α. Now, if α = LR, then by Lemma 1, G1 = G′, thus |G1/G| = n, and
T′ = φ. Otherwise, α ∈ {SS, ES, COMM(t, t′), SP(t, e)}. Since all the rules add a node to the
graph, we know by the definition of ↦ that there is a transition for each of these actions
that guarantees T′, P1 ↦*_{G″} φ, P′.
Lemma 3. [Adequacy]: Let G_s and G_f be two well-formed graphs, let G = G_f/G_s, and let
P and P′ be process states. If T is the schedule derived from G, and if T, P ↦*_G T′, P′ via a
sequence of actions α, then P, G_s ↝*_{α′} P′, G_f′ and |α| ≤ |α′|.

By the definition of G_f, G_s, and ↦, all tags in α are contained in α′. By Lemma 1, this
implies that |α| ≤ |α′|. □
Furthermore, both global and incremental checkpointing yield the same process states
in the absence of a stabilize action.
Lemma 4. [Correspondence]: If P, G ↝_α P′, G′ and ST ∉ α, then P, Δ =⇒_α P′, Δ.

The proof follows trivially from the definitions of ↝ and =⇒. □
Using these lemmas, we can formally characterize our notion of efficiency:
Theorem [Efficiency]: If

  E^{t,P}_φ[e], Δ =⇒*_{α.ST} P′, Δ′   and   E^{t,P}_φ[e], G_0 ↝*_{α.ST} P″, G″,

then there exists β such that P′, Δ′ =⇒*_β P″, Δ″.
The proof is by induction on the length of α. The base case considers sequences of length
one, since a stabilize action can only occur within the dynamic context of a stable
section (tag SS). Then P = P′ = P″, β = φ, and the theorem holds trivially.

Assume the theorem holds for sequences of length n − 1. Let α = β1.β2 with |β1| = n − m and
|β2| = m. By our hypothesis, we know

  E^{t,P}_φ[e], Δ =⇒*_{β1} P_{n−m}, Δ_{n−m} =⇒*_{β2.ST} P′, Δ′

and

  E^{t,P}_φ[e], G_0 ↝*_{β1} P_{n−m}, G_{n−m} ↝*_{β2.ST} P″, G″.

Without loss of generality, assume P_{n−m} = P′. Intuitively, any checkpoint restored by the
global checkpointing semantics corresponds to a state previously seen during evaluation.
Since both evaluations begin with the same α sequence, they must share the same program
states; thus we know P_{n−m} exists in both sequences.

By the definition of ↝, we know G″ and G_{n−m} are well formed. Let G = G″/G_{n−m}. G is
well formed since G_{n−m} and G″ are. Thus, there is a path P^G_t associated with every
thread t, and a schedule S^G_t that can be constructed from this path. Let T be the set of
schedules derived from G.

By Lemma 2, we know there is some sequence of actions α′ such that T, P′ ↦*_G φ, P″. By
Lemma 3, we know P′, G_{n−m} ↝*_β P″, G″ and |α′| ≤ |β|. By the definitions of ↦ and ↝ and
by Lemma 2, we know that ST ∉ β, since β differs from α′ only with respect to LR actions
and α′ does not contain any ST tags. By Lemma 4, we know P′, Δ′ =⇒*_β P″, Δ″. □
5.6 Implementation
The main changes to the underlying infrastructure were the insertion of write barriers to
track shared memory updates, and hooks to the CML library to update the communication
graph. State restoration is thus a combination of restoring continuations as well as reverting
references. The implementation is roughly 2K lines of code to support our data structures,
checkpointing, and restoration code, as well as roughly 200 lines of changes to CML. To
handle references, our implementation assumes race-free programs: every shared reference
is assumed to be consistently protected by the same set of locks. We believe this assumption
is not particularly onerous in a language like CML where references are generally used
sparingly.
5.6.1 Supporting First-Class Events
Because our implementation is an extension of the core CML library, it supports first-class
events [2] as well as channel-based communication. The handling of events is no different
from our treatment of messages. If a thread is blocked on an event with an associated
channel, we insert an edge from that thread's current node to the channel. We support
CML's selective communication with no change to the basic algorithm – recall that our
operations only update the checkpoint graph on base events; complex events such as choose,
wrap, or guard are thus unaffected. Since CML imposes a strict ordering of communication
events, each channel must be purged of spurious or dead data after a stabilize action. We
leverage the same process CML uses for clearing channels of spurious data after a selective
communication to deal with stabilize actions that roll back channel state. Spurious data
accumulates on channels in the presence of choice. When choosing between multiple
communication actions, all actions are placed on channels and are subsequently cleaned up
after one has been matched. In CML this is accomplished lazily by setting all the
non-matched communication actions to be invalid. We leverage this mechanism in
our stabilizer implementation.
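As an illustration, the following hedged sketch (not code from the dissertation's
implementation) shows a monitored loop built from first-class events; dataCh, ctlCh, and
the handlers are hypothetical, and stable is the primitive introduced earlier in this
chapter. Only the base receive that finally synchronizes adds a node to the checkpoint
graph; the wrap and select combinators themselves need no instrumentation.

  (* Hedged sketch: selective communication inside a stable section.
     dataCh/ctlCh are hypothetical channels; stable is this chapter's
     primitive, not part of stock CML. *)
  val dataCh : int CML.chan    = CML.channel ()
  val ctlCh  : string CML.chan = CML.channel ()

  val serviceLoop : unit -> unit =
    stable (fn () =>
      CML.select
        [ CML.wrap (CML.recvEvt dataCh, fn n => print (Int.toString n ^ "\n"))
        , CML.wrap (CML.recvEvt ctlCh,  fn s => print (s ^ "\n")) ])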
5.6.2 Handling References
We have thus far elided details on how to track shared memory access to properly
support state restoration actions in the presence of references. Naively tracking each read
and write separately would be inefficient. There are two problems that must be addressed:
(1) unnecessary writes should not be logged; and (2) spurious dependencies induced by
reads should be avoided.
Notice that for a given stable section, it is enough to monitor the first write to a given
memory location since each stable section is unrolled as a single unit. To monitor writes,
we create a log in which we store reference/value pairs. For each reference in the list,
its matching value corresponds to the value held in the reference prior to the execution of
the stable section. When the program enters a stable section, we create an empty log for
this section. When a write is encountered within a monitored procedure, a write barrier is
executed that checks if the reference being written is in the log maintained by the section.
If there is no entry for the reference, one is created, and the current value of the reference is
recorded. Otherwise, no action is required. To handle references occurring outside stable
sections, we create a log for the most recently taken checkpoint for the writing thread.
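A minimal sketch of this write-barrier bookkeeping is given below; the structure and names
are illustrative, not the actual Multi-MLton code. Each stable section keeps a list of
(reference, original value) pairs, the barrier records a reference only on its first write
within the section, and stabilization replays the log to revert the references.

  structure WriteLog :
  sig
    type 'a log
    val new     : unit -> 'a log
    val barrier : 'a log -> 'a ref * 'a -> unit   (* invoked in place of a raw write *)
    val undo    : 'a log -> unit                  (* revert logged references on stabilize *)
  end =
  struct
    type 'a log = ('a ref * 'a) list ref

    fun new () = ref []

    (* Record the original value only on the first write to r within this section. *)
    fun barrier log (r, newValue) =
      ( if List.exists (fn (r', _) => r' = r) (!log)
          then ()
          else log := (r, !r) :: !log
      ; r := newValue )

    (* Restore every logged reference to the value it held when the section began. *)
    fun undo log = List.app (fn (r, oldValue) => r := oldValue) (!log)
  end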
Until a nested stable section exits it is possible for a call to stabilize to unroll to the start
of this section. A nested section is created when a monitored procedure is defined within
the dynamic context of another monitored procedure. Nested sections require maintaining
their own logs. Log information in these sections must be propagated to the outer section
upon exit. However, the propagation of information from nested sections to outer ones is
not trivial. If the outer section has monitored a particular memory location that has also
been updated by the inner one, we only need to store the outer section’s log, and the value
preserved by the inner one can be discarded.
Efficiently monitoring read dependencies requires us to adopt a different methodology.
We assume read operations occur much more frequently than writes, and thus it would be
impractical to have barriers on all read operations record dependency information in the
communication graph. Based on our assumption of race-free programs, it is sufficient to
monitor lock acquires/releases to infer shared memory dependencies. By incorporating
happens-before dependency edges on lock operations [23], stabilize actions initiated
by a writer to a shared location can be effectively propagated to readers that mediate access
to that location via a common lock. A lock acquire is dependent on the previous acquisition
of the lock.
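A small sketch of this idea follows; the graph operations (currentNode, addEdge) and the
node type are hypothetical stand-ins for the stabilizer runtime's bookkeeping, and the lock
itself is encoded with a CML synchronous variable.

  type node = int                           (* stand-in for a dependency-graph node id *)
  fun currentNode () : node = 0             (* stub: the calling thread's latest node  *)
  fun addEdge (_ : node, _ : node) = ()     (* stub: record a dependency edge          *)

  datatype lock = LOCK of { mutex : unit SyncVar.mvar, lastAcq : node option ref }

  fun newLock () = LOCK { mutex = SyncVar.mVarInit (), lastAcq = ref NONE }

  (* Acquiring the lock makes the acquirer depend on the previous acquisition,
     so a stabilization by an earlier lock holder propagates to later holders. *)
  fun acquire (LOCK {mutex, lastAcq}) =
    ( SyncVar.mTake mutex
    ; Option.app (fn n => addEdge (n, currentNode ())) (!lastAcq)
    ; lastAcq := SOME (currentNode ()) )

  fun release (LOCK {mutex, ...}) = SyncVar.mPut (mutex, ())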
5.6.3 Graph Representation
The main challenge in the implementation was developing a compact representation
of the communication graph. We have implemented a number of node/edge compaction
algorithms allowing for fast culling of redundant information. For instance, any two nodes
that share a backedge can be collapsed into a single node. We also ensure that there is at
most one edge between any pair of nodes. Any addition to the graph affects at most two
threads. We use thread-local meta-data to find the most recent node for each thread. The
graph is thus never traversed in its entirety. The size of the communication graph grows
with the number of communication events, thread creation actions, lock acquires, and stable
sections entered. However, we do not need to store the entire graph for the duration of
program execution. The leaves of the graph data structure are distributed among threads.
Specifically, a thread has a reference to its current node (which is always a leaf). When
a new node is added, the reference is updated. Any node created within a stable section
establishes a backedge to the node that represents that section. Thus, any unreachable node
can be safely reclaimed by the garbage collector. As we describe below, memory overheads
are thus proportional to the amount of communication and are only reclaimed after stable
sections complete. Notice that long-lived stable sections prevent reclamation of the graph that
is reachable from those stable sections.
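One possible shape for the graph's nodes, consistent with the description above, is sketched
below; the constructor and field names are illustrative and are not taken from the actual
implementation.

  (* Hedged sketch: a node records who created it, why, the continuation to
     resume from, and its outgoing edges; a thread keeps a pointer to its
     current leaf node. *)
  datatype action = SPAWN | COMM | STABLE_ENTRY | STABLE_EXIT | CUT

  datatype node = NODE of { owner  : CML.thread_id,
                            action : action,
                            cont   : unit MLton.Cont.t option,  (* saved continuation *)
                            edges  : node list ref }            (* spawn/comm/back edges *)

  type threadMeta = { current : node ref }   (* thread-local pointer to the leaf node *)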
A stabilize action has complexity linear in the number of nodes and edges in the
graph. Our implementation utilizes a combination of depth-first search and bucket sorting
to calculate the resulting graph after a stabilize call in linear time. DFS identifies the
part of the graph which will be removed after the stabilize call. Only sections of the graph
reachable from the stabilize call are traversed, resulting in a fast restoration procedure.
5.6.4 Handling Exceptions
Special care must be taken to deal with exceptions since they can propagate out of stable
sections. When an exception is raised within a stable section but its handler’s scope encompasses the stable section itself, we must record this event in our graph. When an exception
propagates out of a stable section, the stable section is no longer active. To illustrate why
such tracking is required, consider the following example code given in Figure 5.16.

  let fun g () =
        let val f = stable (fn () =>
                      (... raise error ...))
        in ...; f () handle error => stabilize ()
        end
  in stable g ()
  end

  Figure 5.16. Sample code utilizing exceptions and stabilizers.
The example program consists of two functions f and g, both of which execute within
stable sections. Function f's execution is also wrapped in an exception handler, which
catches the error exception. Notice that this handler is outside of the scope of the stable
section for f. During execution, two checkpoints will be taken, one at the call site of g and
the other at f’s call site. When the exception error is raised and handled, the program
executes a call to stabilize . The correct checkpoint which should be restored is the
captured checkpoint at g’s call site. Without exception tracking, f’s checkpoint would be
incorrectly restored.
To implement exception tracking, we wrap stable sections with generic exception handlers. Such handlers catch all exceptions, modify our run-time graph, and finally re-raise
the exception to allow it to propagate to its appropriate handler. Exceptions that have handlers local to the stable section in which they occur are not affected. Modifications required
to the dependency graph are limited – they are just a stable section exit. Since an exception
propagating out of a stable section is modeled as a stable section exit, nesting does not
introduce additional complexity.
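The wrapper can be pictured as the following hedged sketch, where markStableExit is a
hypothetical hook that records a stable-section exit in the runtime graph:

  fun markStableExit () = ()   (* stub for the graph update performed by the runtime *)

  (* Catch any exception escaping the monitored body, record the section exit,
     and re-raise so the exception still reaches its real handler. *)
  fun stableWithExnTracking (f : 'a -> 'b) : 'a -> 'b =
    stable (fn x => f x handle exn => (markStableExit (); raise exn))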
5.6.5 The Rest of CML
Besides channel and event communication, the CML library offers many other primitives for thread communication and synchronization. The most notable of these are synchronous variables (M-vars, I-vars, and sync vars) which support put and get operations.
We instrumented stabilizers to monitor synchronous variables in much the same manner as
shared references. The rest of the CML primitives, such as IVars and MVars, are created
from either the basic channel and event building blocks or synchronous variables and do
not need special support in the context of stabilizers (if an application relies heavily on
such constructs, it may be more efficient to make stabilizers aware of the abstraction itself
instead of its component building blocks). CML also provides various communication
protocols, such as multicast, which are constructed from a specified collection of
channels and a series of communication actions. Again, by instrumenting the fundamental
building blocks of CML, no special hooks or monitoring infrastructure must be inserted.
5.7 Performance Results

To measure the cost of stabilizers with respect to various concurrent programming
paradigms, we present synthetic benchmarks to quantify pure memory and time overheads, and
examine several server-based open-source CML benchmarks to illustrate average overheads in
real programs.
Table 5.1
Benchmark characteristics and dynamic counts.

  Benchmark   LOC (incl. eXene)   Threads   Channels   Comm. Events   Shared Writes   Shared Reads
  Triangles   16501               205       79         187            88              88
  N-Body      16326               240       99         224            224             273
  Pretty      18400               801       340        950            602             840
  Swerve      16000               10532     231        29047          9339            80293
Table 5.2
Benchmark graph sizes and normalized overheads.

  Benchmark   Graph Size (MB)   Runtime Overhead (%)   Memory Overhead (%)
  Triangles   0.19              0.59                   8.62
  N-Body      0.29              0.81                   12.19
  Pretty      0.74              6.23                   20.00
  Swerve      5.43              2.85                   4.08
The benchmarks were run on an Intel P4 2.4 GHz machine
with 1 GB of RAM running Gentoo Linux, compiled and executed using MLton release
20041109.
To measure the costs of our abstraction, our benchmarks are executed in three different ways: one in which the benchmark is executed with no actions monitored, and no
checkpoints constructed; one in which the entire program is monitored, effectively wrapped
within a stable section, but in which no checkpoints are actually restored; and one in which
relevant sections of code are wrapped within stable sections, exception handlers dealing
with transient faults are augmented to invoke stabilize, and faults are dynamically injected to trigger restoration.
5.7.1 Synthetic Benchmarks
Our first benchmark, Asynchronous Communication, measures the costs of building and
maintaining our graph structure as well as the cost of stabilize actions in the presence
of asynchronous communication. The benchmark spawns two threads, a source and a sink,
that communicate asynchronously. We measure the cost of our abstraction with regard to an
ever increasing load of asynchronous communications. To measure overheads for recording
checkpoints, the source and sink threads are wrapped within a stable section. The source
thread spawns a number of new threads which all send values on a predefined channel. The
sink thread loops until all messages are received and then performs a stabilize action.
Since both threads are wrapped in stable sections, the effect of stabilization is to unroll the
entire program when stabilize is called from the source thread. Notice that if we called
stabilize from the sink, every worker thread that was spawned would be unrolled, but
the source thread would not since it does not directly communicate with the sink.
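A hedged sketch of this benchmark's structure follows (parameter names and the exact
protocol are illustrative); it assumes the stable and stabilize primitives of this chapter
are in scope.

  fun benchmark (n : int) : unit =
    let
      val ch : int CML.chan = CML.channel ()

      (* Source: spawn n workers, each of which sends one value on ch. *)
      val source =
        stable (fn () =>
          List.app (fn i => ignore (CML.spawn (fn () => CML.send (ch, i))))
                   (List.tabulate (n, fn i => i)))

      (* Sink: receive all n values, then trigger a stabilization. *)
      val sink =
        stable (fn () =>
          let fun loop 0 = stabilize ()
                | loop k = (ignore (CML.recv ch); loop (k - 1))
          in loop n end)
    in
      ignore (CML.spawn source);
      ignore (CML.spawn sink)
    end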
The second benchmark, Communication Pipeline, measures similar effects as the first,
but captures the behavior of computations that generate threads which communicate in a
synchronous pipeline fashion. The benchmark spawns a series of threads, each of which
defines a channel used to communicate with its predecessor. Each thread blocks until it
receives a value on its input channel and then sends an acknowledgment to the thread
spawned before it. The first and last threads in the chain are connected to form a circular
pipeline. This structure establishes a chain of communications over a number of channels,
all of which must be unrolled if a stabilize action is initiated. To correctly reach a
consistent checkpoint state, an initial thread, responsible for creating the threads in the
pipeline, is wrapped within a stable section. Since the initial thread begins by spawning
worker threads, unrolling it will lead to a state that does not reflect any communication
events or thread creation actions.

[Figure 5.17. Asynchronous Communication runtime overheads. Plot: % overhead against the
number of asynchronous communications (250 to 2000).]
These benchmarks measure the overhead of logging program state and communication
dependencies with no opportunity for amortizing these costs among other non-stabilizer
related operations. These numbers therefore represent worst case overheads for monitoring
thread interactions. The versions of the benchmarks with injected stabilizers were compared to a vanilla MLton implementation of CML with no additional instrumentation. On
average the programs take about 15-20% longer to run and consume about 50% more memory.
The runtime overheads for the synthetic benchmarks are presented in Figure 5.17 and
Figure 5.19, and the total allocation overheads are presented in Figure 5.18 and Figure 5.20.
As expected, the cost to simply maintain the graph grows linearly with the number of
communications performed and runtime overheads remain constant. There is a significant
initial memory and runtime cost because we pre-allocate hash tables used by the graph.
The protocols inherent in these benchmarks are captured at runtime via the communication graph. We present two sample graphs, one for each of the microbenchmarks, in
Figure 5.21 and Figure 5.22.

[Figure 5.18. Asynchronous Communication memory overheads. Plot: % overhead against the
number of asynchronous communications (250 to 2000).]

[Figure 5.19. Communication Pipeline runtime overheads. Plot: % overhead against the number
of communicating threads (250 to 2000).]

In the graph for asynchronous communication we notice a
complex tree-like communication structure generated from the single thread source communicating asynchronously with the sink. The branching structure occurs from the spawning of new threads, each of which communicates once with the sink. In the communication
pipeline we see a much different communication structure. The threads communicate in a
pre-defined order, creating a simple stream.

[Figure 5.20. Communication Pipeline memory overheads. Plot: % overhead against the number
of communicating threads (250 to 2000).]

Both graphs were generated from the stabilizer communication graph and fed to DOT to
generate the visual representation. (We have discovered that visualizing the communication
graph is a useful tool for software development and for debugging CML programs; we believe
stabilizers can be utilized during testing and development to assist the programmer in
constructing complex communication protocols.)
5.7.2 Open-Source Benchmarks
Our other benchmarks include several eXene [24] benchmarks: Triangles and Nbody,
mostly display programs that create threads to draw objects; and Pretty, a pretty printing
library written on top of eXene. The eXene toolkit is a library for X Windows that implements the functionality of xlib, written in CML and comprises roughly 16K lines of
Standard ML. Events from the X server and control messages between widgets are distributed in streams (coded as CML event values) through the window hierarchy. eXene
manages the X calls through a series of servers, dynamically spawned for each connection and screen. The last benchmark we consider is Swerve, a web-server written in CML
whose major modules communicate with one another using message passing channel
communication; it makes no use of eXene. All the benchmarks create various CML threads to
handle various events; communication occurs mainly through a combination of message passing
on channels, with occasional updates to shared data.
[Figure 5.21. Communication graphs for the Asynchronous Communication synthetic benchmark.]

[Figure 5.22. Communication graphs for the Communication Pipeline synthetic benchmark.]
Table 5.3
Restoration of the entire web-server.

  Reqs   Graph Size   Channels (Num)   Channels (Cleared)   Threads Affected   Runtime (ms)
  20     1130         85               42                   470                5
  40     2193         147              64                   928                19
  60     3231         207              84                   1376               53
  80     4251         256              93                   1792               94
  100    5027         296              95                   2194               132
For these benchmarks, stabilizers exhibit a runtime slowdown of up to 6% over a CML
program in which monitoring is not performed (see Table 5.1 and Table 5.2). For a highly
concurrent application like Swerve, the overheads are even smaller, on the order of 3%.
The cost of using stabilizers is only dependent on the number of inter-thread actions and
shared data dependencies that are logged. These overheads are well amortized over program execution. Table 5.1 provides dynamic counts from the stabilizer graph.
Memory overheads to maintain the communication graph are larger, although in absolute terms, they are quite small. Because we capture continuations prior to executing
communication events and entering stable sections, part of the memory cost is influenced
by representation choices made by the underlying compiler. Nonetheless, benchmarks such
as Swerve that create over 10K threads, and employ non-trivial communication patterns,
require only 5MB to store the communication graph, a roughly 4% overhead over the memory consumption of the original program.
To measure the cost of calculating and restoring a globally consistent checkpoint, we
consider three experiments. The first is a simple unrolling of Swerve (see Table 5.3),
in which a call to stabilize is inserted during the processing of a varying number of
concurrent web requests.
Table 5.4
Instrumented recovery.

  Benchmark   Channels (Num)   Channels (Cleared)   Threads (Total)   Threads (Affected)   Runtime (ms)
  Swerve      38               4                    896               8                    3
  eXene       158              27                   1023              236                  19
This measurement illustrates the cost of restoring to a consistent
global state that can potentially affect a large number of threads. Although we expect
large checkpoints to be rare, we note that restoration of such checkpoints is nonetheless
quite fast. The graph size is presented as the total number of nodes. Channels can be
affected by an unrolling in two different ways. A channel may contain a value sent on
it by a communicating thread but that has not been consumed by a receiver, or a channel
may connect two threads that have successfully exchanged a value. In the first case we
must clear the channel of the value if the thread which placed the value on the channel is
unrolled; in the latter case no direct processing on the channel is required. The table also
shows the total number of affected channels and those which must be cleared.
5.7.3 Case Studies: Injecting Stabilizers
To quantify the cost of using stabilizers in practice, we extended Swerve and eXene
and replaced some of their error handling mechanisms with stabilizers (see Table 5.4). For
Swerve, the implementation details are given in Section 5.2. Our benchmark manually
injects a timeout every ten requests, stabilizes the program, and re-requests the page.
For eXene, we augment a scrollbar widget used by the pretty printer. In eXene, the state
of a widget is defined by the state of its communicating threads, and no state is stored in
shared data. The scroll bar widget is composed of three threads which communicate over a
set of channels. The widget’s processing is split between two helper threads and one main
controller thread.

[Figure 5.23. Swerve file size overheads for Stabilizers. Plot: % overhead against file
size (KB).]

Any error handled by the controller thread must be communicated to the
helper threads and vice versa. We manually inject the loss of packets into the X server,
stabilize the widget, and wait for new interaction events. The loss of packets is injected
by simply dropping every tenth packet which is received from the X server. Ordinarily, if
eXene ever loses an X server packet, its default behavior is to terminate execution since
there is no easy mechanism available to restore the state of the widget to a globally consistent point. Using stabilizers, however, packet loss exceptions can be safely handled by
the widget. By stabilizing the widget, we return it to a state prior to the failed request.
Subsequent requests will redraw the widget as we would expect. Thus, stabilizers permit
the scroll bar widget to recover from a lost packet without pervasive modification to the
underlying eXene implementation.
Finally, to measure the sensitivity of stabilization to application-specific parameters, we
compare our stabilizer-enabled version of Swerve to the stock configuration by varying two
program attributes: file size and quantum. Since stabilizers eliminate the need for polling
during file processing, runtime costs would improve as file sizes increase. Our tests were
run on both versions of Swerve; for a given file size, 20 requests are processed. The results
(see Figure 5.23) indicate that for large file sizes (upward of 256KB) our implementation
is slightly more efficient than the original.

[Figure 5.24. Swerve quantum overheads for Stabilizers. Plot: % overhead against quantum (ms).]

Our slowdown for small file sizes (on the order
of 10KB) is proportional to our earlier benchmark results. We expect most web-servers to
host mostly small files.
Since our graph algorithm requires monitoring various communication events, lowering
the time quantum allocated to each thread may adversely affect performance, since the
overhead for monitoring the graph consumes a greater fraction of a thread’s computation
per quantum. Our tests compared the two versions of Swerve, keeping file size constant
at 10KB, but varying the allocated quantum (see Figure 5.24). Surprisingly, the results
indicate that stabilizer overheads become significant only when the quantum is less than 5
ms. As a point of comparison, CML’s default quantum is 20ms.
5.8 Related Work
Being able to checkpoint and rollback parts or the entirety of an execution has been
the focus of notable research in the database [25] as well as the parallel and distributed
computing communities [26–28]. Checkpoints have been used to provide fault tolerance for
long-lived applications, for example in scientific computing [29,30] but have been typically
regarded as heavyweight entities to construct and maintain.
Existing checkpoint approaches can be classified into four broad categories: (a) schemes
that require applications to provide their own specialized checkpoint and recovery mechanisms [31, 32]; (b) schemes in which the compiler determines where checkpoints can be
safely inserted [33]; (c) techniques that require operating system or hardware monitoring of
thread state [28, 34, 35]; and (d) library implementations that capture and restore state [36].
Checkpointing functionality provided by an application or a library relies on the programmer to define meaningful checkpoints. For many multi-threaded applications, determining
these points is non-trivial because it requires reasoning about global, rather than thread-local, invariants. Compiler and operating-system injected checkpoints are transparent to
the programmer. However, transparency comes at a notable cost: checkpoints may not be
semantically meaningful or efficient to construct.
The ability to revert to a prior point within a concurrent execution is essential to transaction systems [24, 37, 38]; outside of their role for database concurrency control, such
approaches can improve parallel program performance by profitably exploiting speculative
execution [39, 40]. Harris et al. propose a transactional memory system [41] for Haskell
that introduces a retry primitive to allow a transactional execution to safely abort and
be re-executed if desired resources are unavailable. However, this work does not propose
to track or revert effectful thread interactions within a transaction. In fact, such interactions are explicitly rejected by the Haskell type-system. There has also been recent interest
in providing transactional infrastructures for ML [42], and in exploring the interaction between transactional semantics and first-class synchronous operations [22,43,44]. Our work
shares obvious similarities with all these efforts insofar as stabilizers also require support
for logging and reverting program state.
In addition to stabilizers, functional language implementations have utilized continuations for similar tasks. For example, Tolmach and Appel [45] described a debugging mechanism for SML/NJ that utilized captured continuations to checkpoint the target program
at given time intervals. This work was later extended [46] to support multi-threading, and
was used to log non-deterministic thread events to provide replay abilities. The technique,
however, was never adapted to synchronous message passing or CML-style events.
Another possibility for fault recovery is micro-reboot [19], a fine-grained technique for
surgically recovering faulty application components which relies critically on the separation of data recovery and application recovery. Micro-reboot allows for a system to be
restarted without ever being shut down by rebooting separate components. Unlike checkpointing schemes, which attempt to restore a program to a consistent state within the running application, micro-reboot quickly restarts an application component, but the technique
itself is oblivious to program semantics.
Recent work in the programming languages community has explored abstractions and
mechanisms closely related to stabilizers and their implementation for maintaining consistent state in distributed environments [47], detecting deadlocks [48], and gracefully dealing with unexpected termination of communicating tasks in a concurrent environment [49].
For example, kill-safe thread abstractions [49] provide a mechanism to allow cooperating
threads to operate even in the presence of abnormal termination. Stabilizers can be used
for a similar goal, although the means by which this goal is achieved is quite different.
Stabilizers rely on unrolling thread dependencies of affected threads to ensure consistency
instead of employing specific runtime mechanisms to reclaim resources.
There have been a number of recent proposals dealing with safe futures, which bear some
similarity to stabilizers insofar as both provide a revocation mechanism based on tracking
dynamic data and control-flow. Futures are a program abstraction that express a simple
yet expressive form of fork-join parallelism. The expression future (e) declares that e
can be evaluated concurrently with the future’s continuation. The expression touch (p)
where p is a placeholder returned by evaluating a future, blocks until the result of evaluating
e is known. Safe futures provide additional deterministic guarantees on the concurrent
execution of the future with its continuation, ensuring that all data dependencies found in
the original (non-future annotated) version are respected.
Welc et al. [40] provide a dynamic analysis that enforces deterministic execution of
sequential Java programs. However, safe futures do not easily compose with other Java
concurrency primitives, and the criterion for revocation is automatically determined based
on dependency violations, and is thus not under user control. In sequential programs, static
analysis coupled with simple program transformations [50] can ensure deterministic parallelism by providing coordination between futures and their continuations in the presence
of mutable state. Unfortunately, neither approach provided safety in the presence of exceptions. This was remedied in [51], where the authors presented an implementation for
exception-handling in the presence of safe futures. Flanagan and Felleisen [52] presented a
formal semantics for futures, but did not consider how to enforce safety (i.e. determinism)
in the presence of mutable state. Navabi and Jagannathan [53] presented a formulation
of safe futures for a higher-order language with first-class exceptions and first-class references. Neither formulation considers the interaction of futures with explicit threads of
control. We believe that a combination of some of the approaches described above coupled with stabilizers can be leveraged to provide an implementation of safe futures in the
presence of explicit threads of control. We discuss this possibility in our future work in
Chapter 8.
Transactional Events [22, 43, 44] are an emerging trend in concurrent functional language design. Transactional events combine first-class message passing events with an
all-or-nothing semantics afforded by transactions. Transactional events provide the ability
to express three-way rendezvous and safe, guarded synchronous receives. Unlike stabilizers which provide atomicity properties on rollback, transactional events provide atomicity
on a communication protocol. To ensure that a communication protocol which spans multiple threads of control completes atomically in the presence of selective communication,
all possible permutations of the protocol may need to be explored. This is, in general,
difficult as many different threads can be potential participants in the protocol.
Transactional events rely on a dynamic search thread strategy to explore communication
patterns, thus guaranteeing atomicity of communication protocols. Only successful communication patterns are chosen. Currently, transactional events do a full state space exploration, similar to model checking, until a successful communication protocol is discovered.
Originally transactional events only supported synchronous communication. However, recent extensions have provided semantics for handling shared memory references [44]. To
avoid the complexity of reasoning about all possible interleavings of shared memory
operations, transactional events with support for shared memory only consider amalgamations
of operations bound by communication actions and synchronization points.
We believe stabilizers could be utilized by transactional events to implement optimistic
search thread strategies instead of full state space explorations. The monitoring and
rollback properties of stabilizers could be leveraged to create a more sophisticated search
mechanism that utilizes monitored information from previous searches to guide future ones;
we provide further details as part of our future work in Chapter 8.
5.9 Concluding Remarks
Stabilizers are a novel modular checkpointing abstraction for concurrent functional
programs. Unlike other checkpointing schemes, stabilizers are not only able to identify the
smallest subset of threads which must be unrolled, but also provide useful safety guarantees. As a language abstraction, stabilizers can be used to simplify program structure
especially with respect to error handling, debugging, and consistency management. Our
results indicate that stabilizers can be implemented with small overhead and thus serve as
an effective and promising checkpointing abstraction for high-level concurrent programs.
6 PARTIAL MEMOIZATION OF CONCURRENCY AND COMMUNICATION
Eliminating redundant computation is an important optimization supported by many language implementations. One important instance of this optimization class is memoization [54–56], a well-known dynamic technique that can be utilized to avoid performing a
function application by recording the arguments and results of previous calls. If a call is
supplied an argument that has been previously cached, the execution of the function body
can be elided, with the corresponding result immediately returned instead.
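For contrast with the concurrency-aware scheme developed in this chapter, a conventional
memoizer for a pure function can be written in a few lines (a hedged sketch, not code from
the dissertation):

  (* Cache previous (argument, result) pairs; on a hit the call is elided. *)
  fun memoize (f : int -> int) : int -> int =
    let
      val cache : (int * int) list ref = ref []
    in
      fn x =>
        case List.find (fn (x', _) => x' = x) (!cache) of
          SOME (_, y) => y
        | NONE => let val y = f x
                  in cache := (x, y) :: !cache; y end
    end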
When functions perform effectful computations, leveraging memoization becomes significantly more challenging. Two calls to a function f that performs some stateful computation need not generate the same result if the contents of the state f uses to produce its
result are different at the two call-sites.
Concurrency and communication introduce similar complications. If a thread calls a
function f that communicates with functions invoked in other threads, then memo information recorded with f must include the outcome of these actions. If f is subsequently
applied with a previously seen argument, and its communication actions at this call-site
are the same as its effects at the original application, re-evaluation of the pure computation in f ’s body can be avoided. Because of thread interleavings, synchronization, and
non-determinism introduced by scheduling choices, making such decisions is non-trivial.
Nonetheless, we believe memoization can be an important component in a concurrent
programming language runtime. Our belief is enforced by the widespread emergence of
multi-core platforms, and renewed interest in streaming [57], speculative [58] and transactional [20, 59] abstractions to program these architectures. For instance, optimistic concurrency abstractions rely on efficient control and state restoration mechanisms. When a
speculation fails because a previously available computation resource becomes unavailable,
or when a transaction aborts due to a serializability violation [24] and must be retried [41],
its effects must be undone. Failure represents wasted work, both in terms of the operations
performed whose effects must now be erased, and in terms of overheads incurred to
implement state restoration; these overheads include logging costs, read and write barriers, contention management, etc. One way to reduce this overhead is to avoid subsequent
re-execution of those function calls previously executed by the failed computation whose
results are unchanged. The key issue is understanding when utilizing memoized information is safe, given the possibility of concurrency, communication, and synchronization
among threads.
In this chapter, we consider the memoization problem for a higher-order concurrent
language in which threads communicate through synchronous message passing primitives
(e.g. Concurrent ML [2]). A synchronization event acknowledges the existence of an external action performed by another thread willing to send or receive data. If such events
occur within a function f whose applications are memoized, then avoiding re-execution at
a call-site c is only possible if these actions are guaranteed to succeed at c. In other words,
using memo information requires discovery of interleavings that satisfy the communication constraints imposed by a previous call. If we can identify a global state in which these
constraints are satisfied, the call to c can be avoided. We say that a constraint is satisfiable if there exists a thread willing to perform a matching action on the channel. Thus, if
our constraint embodied a send on a particular channel, for the constraint to be satisfiable,
there must exist a thread willing to receive from that channel. If there exists no such state,
then the call must be performed. Because finding such a state can be expensive (it may
require an unbounded state space search), we consider a weaker notion of memoization.
By recording the context in which a memoization constraint was generated, implementations can always choose to simply resume execution of the function at the program point
associated with the constraint using the saved context. In other words, rather than requiring
global execution to reach a state in which all constraints in a memoized application are
satisfied, partial memoization gives implementations the freedom to discharge some fraction of these constraints, performing the rest of the application as normal. Although our
description and formalization are developed in the context of message-based communication,
the applicability of our solution naturally extends to shared-memory communication
as well given the simple encoding of the latter in terms of the former [2].
Whenever a constraint built during memoization is discharged on a subsequent application, there is a side-effect that occurs; namely the execution of the action the constraint
desribes. For example, consider a communication constraint associated with a memoized
version of a function f that expects a thread T to receive data d on channel c. To use this
information at a subsequent call, we must identify the existence of T , and having done
so, must propagate d along c for T to consume. Thus, whenever a constraint is satisfied,
an effect that reflects the action represented by that constraint is performed. We consider
the set of constraints built during memoization as forming an ordered log, with each entry
in the log representing a condition that must be satisfied to utilize the memoized version,
and an effect that must be performed if the condition holds. The point of memoization for
our purposes is thus to avoid performing the pure computations that execute between these
effectful operations.
6.1 Programming Model
Our programming model is a simple synchronous message-passing dialect of ML similar
to CML. Deciding whether a function application can be avoided based on previously
recorded memo information depends upon the value of its arguments, its communication
actions, channels it creates, threads it spawns, and the return value it produces. Thus, the
memoized result of a call to a function f can be used at a subsequent call if (a) the argument given matches the argument previously supplied; (b) recipients for values sent by
f on channels in an earlier memoized call are still available on those channels; (c) values
consumed by f on channels in an earlier call are again ready to be sent other threads; (d)
channels created in an earlier call have the same actions performed on them, and (e) threads
created by f can be spawned with the same arguments supplied in the memoized version.
Ordering constraints on all sends and receives performed by the procedure must also be
enforced. We call the collection of such constraints for a given function application the
constraint log. A successful application of a memoized call yields a new state in which the
effects captured within the constraint log have been executed. Thus, the values sent by f are
received by waiting recipients, senders on channels from which f expects to receive values
propagate these values on those channels, and channels and threads that f is expected to
create are created.
To avoid making a call, a send action performed within the applied function, for example, will need to be paired with a receive operation executed by some other thread.
Unfortunately, there may be no thread currently scheduled that is waiting to receive on
this channel. Consider an application that calls a memoized function f which (a) creates a
thread T that receives a value on channel c, and (b) sends a value on c computed through
values received on other channels that is then consumed by T . To safely use the memoized
return value for f nonetheless still requires that T be instantiated, and that communication
events executed in the first call can still be satisfied (e.g., the values f previously read on
other channels are still available on those channels). Ensuring these actions can succeed
may involve an exhaustive exploration of the execution state space to derive a schedule
that allows us to consider the call in the context of a global state in which these conditions
are satisfied. Because such an exploration may be infeasible in practice, our formulation
also supports partial memoization. Rather than requiring global execution to reach a state
in which all constraints in a memoized application are satisfied, partial memoization gives
implementations the freedom to discharge some fraction of these constraints, performing
the rest of the application as normal.
6.2 Motivation
As a motivating example, we consider how memoization can be profitably utilized in a
concurrent message passing web-server to implement a file cache. In a typical web-server
a file cache is a well known optimization that allows the server to avoid re-reading a file
from disk when it is requested multiple times. This is typically accomplished by storing
the file in memory on the first request and accessing this memory on subsequent requests.
We can leverage memoization to accomplish this goal by memoizing the file reading
functionality provided by the File Processor in Swerve (see code given in Figure 5.3).
Recall that the File Processor sends the file to the Network Processor in a series of
synchronous communications, each of which sends a chunk of the file. To successfully
memoize this functionality, so that on a subsequent request for the file the memoized
version of the function correctly sends the file chunks to the Network Processor, we must
capture and store this information.
Our formulation of memoization creates constraints for every communication performed
within a given function. Thus, for every file chunk that the File Processor sends to the
Network Processor we create a constraint which captures the data that is sent. In this
manner, the constraints which are generated during memoization of the function contain
the data that corresponds to the contents of the file. We can think of the memo store associated with this function as the file cache for the server.
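The kind of loop being memoized can be pictured with the following hedged sketch, written
in the spirit of the file-processor code in Figure 5.3 (the real code differs; the names
here are illustrative). Each CML.send of a chunk becomes a constraint in the memo log, so
replaying the memoized call re-sends exactly the cached chunks.

  fun sendFile (outCh : Word8Vector.vector CML.chan) (path : string) =
    let
      val ins = BinIO.openIn path
      fun loop () =
        let val chunk = BinIO.inputN (ins, 8192)
        in if Word8Vector.length chunk = 0
             then BinIO.closeIn ins
             else (CML.send (outCh, chunk); loop ())   (* each send is logged as a constraint *)
        end
    in loop ()
    end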
However, naively memoizing the function in this way has some unfortunate consequences. Recall that the File Processor must poll for timeouts and notify other modules
if any errors have been detected. In this setting, typical file caches are not suitable. We
would have to construct a file cache that also polled for errors and encapsulated the error
handling protocols the File Processor is responsible for.
Instead, we observe that partial memoization provides a unique solution to this problem.
Namely, it provides the ability to stop the discharging of constraints and to resume normal
execution. Since the File Processor polls for timeouts on channels, this action is also
captured in a constraint. When a timeout or other error is detected, the constraint is said to
have failed (i.e. instead of the channel being empty, it contains a value). In this case normal
execution is resumed, thereby triggering the default error handling protocols.
6.3 Semantics
Our semantics is defined in terms of a core call-by-value functional language with
threading and communication primitives. Communication between threads is achieved
using synchronous channels. We first present a simple multi-threaded language with
synchronous channel-based communication. We then extend this core language with memoization
primitives, and subsequently consider refinements of this semantics to support partial
memoization. Although the core language has no support for selective communication,
extending it to support choice is straightforward. Memoization would simply record the
result of the choice, and replay would only be possible if the recorded choice was satisfiable.

SYNTAX:
  P ::= ∅ | P ‖ P | t[e]
  e ∈ Exp ::= v | e(e) | spawn(e) | mkCh() | send(e, e) | recv(e)
  v ∈ Val ::= unit | λx.e | l

EVALUATION CONTEXTS:
  E ::= [ ] | E(e) | v(E) | spawn(E) | send(E, e) | send(l, E) | recv(E)

PROGRAM STATES:
  P ∈ Process
  t ∈ Tid
  x, y ∈ Var
  l ∈ Channel
  α, β ∈ Tag = {App, Ch, Spn, Com}

Figure 6.1. Syntax, grammar, evaluation contexts, and domain equations for a concurrent
language with synchronous communication.
In the following, we write α to denote a sequence of zero or more elements, β.α to
denote sequence concatenation, and ∅ to denote an empty sequence. Metavariables x and y
range over variables, t ranges over thread identifiers, l ranges over channels, v ranges over
values, and α, β denote tags that label individual actions in a program’s execution. We use
P to denote a program state comprised of a collection of threads, E for evaluation contexts,
and e for expressions.

(FUNCTION APPLICATION)
  P ‖ t[E[(λx.e)(v)]]  ↦_{App}  P ‖ t[E[e[v/x]]]

(CHANNEL)
  l fresh
  P ‖ t[E[mkCh()]]  ↦_{Ch}  P ‖ t[E[l]]

(SPAWN)
  t′ fresh
  P ‖ t[E[spawn(λx.e)]]  ↦_{Spn}  P ‖ t[E[unit]] ‖ t′[e[unit/x]]

(COMMUNICATION)
  P = P′ ‖ t[E[send(l, v)]] ‖ t′[E′[recv(l)]]
  P  ↦_{Com}  P′ ‖ t[E[unit]] ‖ t′[E′[v]]

Figure 6.2. Operational semantics for a concurrent language with synchronous communication.
Our communication model is a message passing system with synchronous send and receive operations. We do not impose a strict ordering of communications on channels; communication actions on the same channel by different threads are paired non-deterministically.
To model asynchronous sends, we simply spawn a thread to perform the send. Spawning
an expression (that evaluates to a thunk) creates a new thread in which the application of
the thunk is performed.
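In CML terms, this encoding of an asynchronous send is a one-liner (a hedged sketch; the
core calculus itself has no CML library):

  (* Spawn a helper thread whose only job is to perform the synchronous send. *)
  fun asyncSend (ch : 'a CML.chan, v : 'a) : unit =
    ignore (CML.spawn (fn () => CML.send (ch, v)))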
6.3.1 Language
The syntax and semantics of the language are given in Figure 6.1. Expressions are
either variables, locations that represent channels, λ-abstractions, function applications,
thread creation operations, or communication actions that send and receive messages on
channels. We do not consider references in this core language as they can be modeled in
terms of operations on channels [2].
A thread context t[E[e]] denotes an expression e available for execution by a thread with
identifier t within context E. Evaluation is specified via a relation (↦) that maps a
program state (P) to another program state. A program state is a collection of thread
contexts. An evaluation step is marked with a tag that indicates the action performed by
that step. As shorthand, we write P ↦_α P′ to represent the sequence of actions α that
transforms P to P′.
The core rules are presented in Figure 6.2. Application (rule (FUNCTION APPLICATION))
substitutes the argument value for free occurrences of the parameter in the body of the
abstraction, and channel creation (rule (CHANNEL)) results in the creation of a new location
that acts as a container for message transmission and reception. A spawn action (rule
(SPAWN)), given an expression e that evaluates to a thunk, changes the global state to
include a new thread in which the thunk is applied. A communication event (rule
(COMMUNICATION)) synchronously pairs a sender attempting to transmit a value along a
specific channel in one thread with a receiver waiting on the same channel in another thread.
6.3.2 Partial Memoization
The core language presented above provides no facilities for memoization of the functions it executes. To support memoization, we must record, in addition to argument and
return values, synchronous communication actions, thread spawns, channel creation etc. as
part of the memoized state. These actions define a log of constraints (C) that must be satisfied at subsequent applications of a memoized function, and whose associated effects must
be discharged if the constraint is satisfied. To record constraints, we augment our semantics
to include a memo store (σ), a map that given a function identifier and an argument value,
returns the set of constraints and result value that was previously recorded for a call to that
function with that argument. If the set of constraints returned by the memo store is satisfied
in the current state (and their effects performed), then the return value can be used and the
application elided. The memo store contains only one function/value pair for simplicity of
the presentation. We can envision extending the memo store to contain multiple memoized
versions of a function based on its arguments or constraints. We utilize two thread contexts
t[e] and tC [e], the former to indicate that evaluation of terms should be captured within
the memo store, and the latter to indicate that previously recorded constraints should be
discharged. We elaborate on their use below.
The definition of the language augmented with memoization support is given in Figure 6.3 and operational semantics are provided in Figure 6.4 and Figure 6.5. We now define
evaluation using a new relation ( =⇒ ) defined over two configurations. In one case, it maps
a program state (P) and a memo store (σ) to a new program state and memo store. This
configuration defines evaluation that does not leverage memoized information. The second
configuration maps a thread id and constraint sequence pair ((t,C)), a program state (P),
and a memo store (σ) to a new thread id and constraint sequence pair, program state, and
memo store. Transitions of this form are used when evaluation is discharging constraints
recorded from a previous memoized application.
In addition, a thread state is augmented to hold an additional structure. The memo state
(θ) records the function identifier (δ), the argument (v) supplied to the call, the context (E)
in which the call is performed, and the sequence of constraints (C) that are built during the
evaluation of the application being memoized.
Constraints built during a memoized function application define actions that must be satisfied at subsequent call-sites in order to avoid complete re-evaluation of the function body. For a communication action, a constraint records the location being operated upon, the value sent or received, the action performed (R for receive and S for send), and the continuation captured immediately prior to the action being performed. The application resumes evaluation from this point if the corresponding constraint cannot be discharged. For a spawn operation, the constraint records the action (Sp), the expression being spawned, and the continuation immediately prior to the spawn. For a channel creation operation (Ch), the constraint records the location of the channel and the continuation immediately prior to the channel creation. For a function application (App), we record the continuation prior to the application. Returns are also modeled as constraints (Rt, v), where v is the return value of the application being memoized. Notice that we record continuations for all constraints; we do this to simplify our correctness proofs.
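The constraint forms just enumerated can be summarized by the following SML-style declarations (our own rendering for orientation; the formal syntax appears in Figure 6.3, and the stub types are placeholders only):

    type loc = string   type value = string   type exp = string   (* stubs *)
    datatype constraint =
        RecvC  of loc * value * exp   (* (R, l, v) with its resume continuation *)
      | SendC  of loc * value * exp   (* (S, l, v) with its resume continuation *)
      | SpawnC of exp * exp           (* (Sp, e)  with its resume continuation  *)
      | ChanC  of loc * exp           (* (Ch, l)  with its resume continuation  *)
      | AppC   of exp                 (* continuation prior to an inner call    *)
      | RetC   of value               (* (Rt, v): the memoized call's result    *)

    (* the memo store maps a (function identifier, argument) pair to the
       constraint sequence recorded for that application *)
    type memoStore = ((string * value) * constraint list) list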
Consider an application of function f to value v that has been memoized. Since subsequent calls to f with v may not be able to discharge all constraints, we need to record the
program points for all communication actions within f that represent potential resumption
points from which normal evaluation of the function body proceeds. These continuations
are recorded as part of any constraint that can fail (communication actions, and return constraints as described below). But, since the calling contexts at these other call-sites differ from the original, we must be careful not to include them within the saved continuations recorded within these constraints. Thus, the contexts recorded as part of a saved constraint during memoization only define the continuation of the action up to the return point of the function; the memo state (θ) stores the evaluation context representing the caller's continuation. This context is restored once the application completes (see rule (RETURN)).
If function f calls function g , then actions performed by g must be satisfiable in any
call that attempts to leverage the memoized version of f . Consider the following program
fragment:
let fun f(...) =
...
let fun g(...) =
...
send(c,v) ...
in ...
end
in ... f(...) ...
end
If we encounter an application of f after it has been memoized, then g ’s communication
action (i.e., the send of v on c ) must be satisfiable at the point of the application to
avoid performing the call. We therefore associate a call stack of constraints (θ) with each
thread that defines the constraints seen thus far, requiring the constraints computed for an
inner application to be satisfiable for any memoization of an outer one. The propagation of constraints to the memo states of all active calls is given by the constraint-addition operation shown in Figure 6.3.
Channels created within a memoized function must be recorded in the constraint sequence for that function (rule (CHANNEL)). Consider a function that creates a channel and subsequently initiates communication on that channel. If a call to this function was memoized, later applications that attempt to avail themselves of memo information must still ensure that the generative effect of creating the channel is not omitted. Function evaluation now associates a label with each λ-abstraction; this label is used to index the memo store (rule (FUNCTION)). If a new thread is spawned within a memoized application, a spawn constraint is added to the memo state, and a new global state is created that starts memoization of the actions performed by the newly spawned thread (rule (SPAWN)). A communication action performed by two functions currently being memoized is also appropriately recorded in the corresponding memo states of the threads that are executing these functions (rule (COMMUNICATION)). When a memoized application completes, its constraints are recorded in the memo store (rule (RETURN)).
When a function f is applied to argument v, and there exists no previous invocation of f to v recorded in the memo store, the function's effects are tracked and recorded (rule (APPLICATION)). Until an application of a function being memoized is complete, the constraints induced by its evaluation are not immediately added to the memo store. Instead, they are maintained as part of the memo state (θ) associated with the thread in which the application occurs.
The most interesting rule is the one that deals with determining how much of an application of a memoized function can be elided (rule (MEMO APPLICATION)). If an application of function f with argument v has been recorded in the memo store, then the application can potentially be avoided. If not, its evaluation is memoized by rule (APPLICATION). If a memoized call is applied, we must examine the set of associated constraints that can be discharged. To do so, we employ an auxiliary relation ℑ defined in Figure 6.6. Abstractly, ℑ checks the global state to determine which communication, channel creation, and spawn creation constraints (the possible effectful actions in our language) can be satisfied, and returns a set of failed constraints, representing those actions that could not be satisfied. The thread context (t_C[e]) is used to signal the utilization of memo information. The failed constraints are added to the original thread context.

The rule (MEMO APPLICATION) yields a new global configuration whose thread id and constraint sequence (t, C) corresponds to the constraints satisfiable in the current global state (defined as C″) for thread t, as determined by ℑ. These constraints, when discharged, will leave the thread performing the memoized call in a new state in which the evaluation of the call is the expression associated with the first failed constraint returned by ℑ. As we describe below in Section 6.3.3, there is always at least one such constraint, namely Rt, the return constraint, that holds the return value of the memoized call. We also introduce a rule to allow the completion of memo information use (rule (END MEMO)). The rule installs the continuation of the first currently unsatisfied constraint; no further constraints are subsequently examined. In this formulation, the other failed constraints are simply discarded. We present an extension of this semantics in Section 6.3.6 that makes use of them.
6.3.3 Constraint Matching
The constraints built as a result of evaluating these rules are discharged by the rules shown in Figure 6.7. Each rule in Figure 6.7 is defined with respect to a thread id and constraint sequence. Thus, at any given point in its execution, a thread is either building up memo constraints (i.e., the thread is of the form t[e]) within an application for subsequent calls to utilize, or attempting to discharge these constraints (i.e., the thread is of the form t_C[e]) for applications indexed in the memo store.
The function ℑ leverages the evaluation rules defined in Figure 6.7 by examining the global state and determining which constraints can be discharged, except for the return constraint. ℑ takes a constraint set (C) and a program state (P) and returns a sequence of unmatchable constraints (C′). Send and receive constraints are matched with threads blocked in the global program state on the opposite communication action. Once a thread has been matched with a constraint, it is no longer a candidate for future communication, since its communication action is consumed by the constraint. This guarantees that the candidate function will communicate at most once with each thread in the global state. Although a thread may in fact be able to communicate more than once with the candidate function, determining this requires arbitrary lookahead and is infeasible in practice. We discuss the implications of this restriction in Section 6.3.5.

SYNTAX:
  P ::= ∅ | P ∥ P | ⟨θ, t[e]⟩ | ⟨θ, t_C[e]⟩
  v ∈ Val ::= unit | λδ x.e | l
  E ∈ Context

CONSTRAINT ADDITION:
  θ′ = {(δ, v, E, C.C) | (δ, v, E, C) ∈ θ}
  θ, C  θ′

PROGRAM STATES:
  δ ∈ MemoId
  c ∈ FailableConstraint = ({R, S} × Loc × Val) + Rt
  C ∈ Constraint = (FailableConstraint × Exp) + ((Sp × Exp) × Exp)
                   + ((Ch × Location) × Exp) + ((App) × Exp)
  σ ∈ MemoStore  = MemoId × Val → Constraint*
  θ ∈ MemoState  = MemoId × Val × Context × Constraint*
  α, β ∈ Tag     = {Ch, Spn, Com, Fun, App, Ret, MCom, MCh, MSp, MemS, MemE, MemR, MemP}

Figure 6.3. Memoization Semantics – A concurrent language supporting memoization of synchronous communication and dynamic thread creation.
(CHANNEL)
  l fresh      θ, ((Ch, l), E[mkCh()])  θ′
  P ∥ ⟨θ, t[E[mkCh()]]⟩, σ  =⇒^Ch  P ∥ ⟨θ′, t[E[l]]⟩, σ

(FUNCTION)
  δ fresh
  P ∥ ⟨θ, t[E[λ x.e]]⟩, σ  =⇒^Fun  P ∥ ⟨θ, t[E[λδ x.e]]⟩, σ

(SPAWN)
  t′ fresh      θ, ((Sp, λδ x.e(unit)), E[spawn(λδ x.e)])  θ′
  tk = ⟨θ′, t[E[unit]]⟩      ts = ⟨∅, t′[e[unit/x]]⟩
  P ∥ ⟨θ, t[E[spawn(λδ x.e)]]⟩, σ  =⇒^Spn  P ∥ tk ∥ ts, σ

(COMMUNICATION)
  P = P′ ∥ ⟨θ, t[E[send(l, v)]]⟩ ∥ ⟨θ′, t′[E′[recv(l)]]⟩
  θ, ((S, l, v), E[send(l, v)])  θ″      θ′, ((R, l, v), E′[recv(l)])  θ‴
  P, σ  =⇒^Com  P′ ∥ ⟨θ″, t[E[unit]]⟩ ∥ ⟨θ‴, t′[E′[v]]⟩, σ

Figure 6.4. Memoization Semantics – A concurrent language supporting memoization of synchronous communication and dynamic thread creation.
A spawn constraint (rule (MEMO SPAWN)) is always satisfied, and leads to the creation of a new thread of control. Observe that the application evaluated by the new thread is now a candidate for memoization if the thunk was previously applied and its result is recorded in the memo store. A channel constraint of the form ((Ch, l), E[e]) (rule (MEMO CHANNEL)) creates a new channel location l′, and replaces all occurrences of l found in the remaining constraint sequence for this thread with l′. The channel location may be embedded within send and receive constraints, either as the target of the operation or as the argument value being sent or received. Thus, discharging a channel constraint ensures that the effect of creating a new channel performed within an earlier memoized call is preserved on subsequent applications. The renaming operation ensures that later send and receive constraints refer
to the new channel location. Both channel creation and thread creation actions never fail – they modify the global state with a new thread and channel, respectively, but impose no pre-conditions on the state in order for these actions to succeed.

(RETURN)
  θ = (δ, v, E, C)
  P ∥ ⟨θ.θ, t[v′]⟩, σ  =⇒^Ret  P ∥ ⟨θ, t[E[v′]]⟩, σ[(δ, v) ↦ C.(Rt, v′)]

(APPLICATION)
  (δ, v) ∉ Dom(σ)      θ = (δ, v, E, ∅)      θ, ((App), E[λδ x.e(v)])  θ′
  P ∥ ⟨θ, t[E[λδ x.e(v)]]⟩, σ  =⇒^App  P ∥ ⟨θ.θ′, t[e[v/x]]⟩, σ

(MEMO START)
  (δ, v) ∈ Dom(σ)      σ(δ, v) = C      ℑ(C, P) = C′      C = C″.C′
  P ∥ ⟨θ, t[E[λδ x.e(v)]]⟩, σ  =⇒^MemS  (t, C″), P ∥ ⟨θ, t_C′[E[λδ x.e(v)]]⟩, σ

(MEMO END)
  C = (c, e′)
  (t, ∅), P ∥ ⟨θ, t_C.C[E[λδ x.e(v)]]⟩, σ  =⇒^MemE  P ∥ ⟨θ, t[E[e′]]⟩, σ

Figure 6.5. Memoization Semantics – A concurrent language supporting memoization of synchronous communication and dynamic thread creation (continued).
There are two communication constraint matching rules (both tagged MCom), and both may indeed fail. If the current constraint expects to receive value v on channel l, and there exists
a thread able to send v on l , evaluation proceeds to a state in which the communication
succeeds (the receiving thread now evaluates in a context in which the receipt of the value
has occurred), and the constraint is removed from the set of constraints that need to be
matched (rule (MRECV)). Note also that the sender records the fact that a communication with a matching receive took place in the thread's memo state, and the receiver does likewise. Any memoization of the sender must consider the receive action that synchronized with the send, and the application in which the memoized call is being examined must record the successful discharge of the receive action. In this way, the semantics permits consideration of multiple nested memoization actions.

ℑ(((S, l, v), e).C, P ∥ ⟨θ, t[E[recv(l)]]⟩)    = ℑ(C, P)
ℑ(((R, l, v), e).C, P ∥ ⟨θ, t[E[send(l, v)]]⟩) = ℑ(C, P)
ℑ((Rt, v), P)                                  = (Rt, v)
ℑ(((Ch, l), e).C, P)                           = ℑ(C, P)
ℑ(((Sp, e′), e).C, P)                          = ℑ(C, P)
ℑ(((App), e).C, P)                             = ℑ(C, P)
ℑ(C, P)                                        = C, otherwise

Figure 6.6. Memoization Semantics – The function ℑ yields the set of constraints C which are not satisfiable in program state P.
If the current constraint expects to send a value v on channel l , and there exists a
thread waiting on l, the constraint is also satisfied (rule (MEMO SEND)). A send operation
can match with any waiting receive action on that channel. The semantics of synchronous
communication allows us the freedom to consider pairings of sends with receives other
than the one it communicated with in the original memoized execution. This is because a
receive action places no restriction on either the value it reads, or the specific sender that
provides the value. If these conditions do not hold, the constraint fails.
Observe that there is no evaluation rule for the Rt constraint that can consume it. This constraint contains the return value of the memoized function (see rule (RETURN)). If all other constraints have been satisfied, it is this return value that replaces the call in the current context (see the consequent of rule (MEMO APPLICATION)).
(MEMO CHANNEL)
  C = ((Ch, l), _)      l′ fresh      C″ = C[l′/l]      θ, C  θ′
  (t, C.C), P ∥ ⟨θ, t_C′[E[λδ x.e(v)]]⟩, σ  =⇒^MCh  (t, C″), P ∥ ⟨θ′, t_C′[E[λδ x.e(v)]]⟩, σ

(MEMO APPLICATION)
  C = ((App), _)      θ, C  θ′
  (t, C.C), P ∥ ⟨θ, t_C′[E[λδ x.e(v)]]⟩, σ  =⇒^MApp  (t, C), P ∥ ⟨θ′, t_C′[E[λδ x.e(v)]]⟩, σ

(MEMO SPAWN)
  C = ((Sp, e′), _)      t′ fresh      θ, C  θ′
  (t, C.C), P ∥ ⟨θ, t_C′[E[λδ x.e(v)]]⟩, σ  =⇒^MSp
      (t, C), P ∥ ⟨θ′, t_C′[E[λδ x.e(v)]]⟩ ∥ ⟨∅, t′[e′]⟩, σ

(MEMO RECEIVE)
  C = ((R, l, v), _)      ts = ⟨θ, t[E[send(l, v)]]⟩      tr = ⟨θ′, t′_C′[E′[λδ x.e(v)]]⟩
  θ, ((S, l, v), E[send(l, v)])  θ″      θ′, C  θ‴
  ts′ = ⟨θ″, t[E[unit]]⟩      tr′ = ⟨θ‴, t′_C′[E′[λδ x.e(v)]]⟩
  (t′, C.C), P ∥ ts ∥ tr, σ  =⇒^MCom  (t′, C), P ∥ ts′ ∥ tr′, σ

(MEMO SEND)
  C = ((S, l, v), _)      ts = ⟨θ′, t′_C′[E[λδ x.e(v)]]⟩      tr = ⟨θ, t[E′[recv(l)]]⟩
  θ, ((R, l, v), E′[recv(l)])  θ″      θ′, C  θ‴
  ts′ = ⟨θ‴, t′_C′[E[λδ x.e(v)]]⟩      tr′ = ⟨θ″, t[E′[v]]⟩
  (t′, C.C), P ∥ ts ∥ tr, σ  =⇒^MCom  (t′, C), P ∥ ts′ ∥ tr′, σ

Figure 6.7. Memoization Semantics – Constraint matching is defined by four rules. Communication constraints are matched with threads performing the opposite communication action of the constraint.
let val (c1, c2) = (channel(), channel())
    fun f () = (send(c1,v1); ...; recv(c2))
    fun g () = (recv(c1); ...; recv(c2); g())
    fun h () = (...; send(c2,v2); send(c2,v3); h())
    fun i () = (recv(c2); i())
in
  spawn(g); spawn(h); spawn(i);
  f(); ...; send(c2,v4); ...; f()
end

Figure 6.8. Determining if an application can fully leverage memo information may require examining an arbitrary number of possible thread interleavings.
Figure 6.9. The communication pattern of the code in Figure 6.8. Circles represent operations on channels. Gray circles are sends and white circles are receives. Double arrows represent communications that are captured as constraints during memoization.
6.3.4 Example
The program fragment shown in Figure 6.8 applies functions f, g, h and i. The calls to g, h, and i are evaluated within separate threads of control, while the applications of f take place in the original thread. These different threads communicate with one another over shared channels c1 and c2. The communication pattern is depicted in Figure 6.9. Separate threads of control are shown as rectangles. Communication actions are represented as circles; gray for send actions and white for receives. The channel on which the communication action takes place is annotated on the circle, and the value which is sent on the arrow. Double-arrow edges represent communication actions for which constraints are generated.
To determine whether the second call to f can be elided, we must examine the constraints that would be added to the thread state of the threads in which these functions are applied. First, spawn constraints would be added to the main thread for the threads executing g, h, and i. Second, a send constraint followed by a receive constraint, modeling the exchange of values v1 and either v2 or v3 on channels c1 and c2, would be included as well. For the sake of discussion, assume that the send of v2 by h was consumed by g and the send of v3 was paired with the receive in f when f() was originally executed.
Consider the memoizability constraints built during the first call to f. The send constraint on f's application can be satisfied by matching it with the corresponding receive constraint associated with the application of g; observe that g() loops forever, consuming values on channels c1 and c2. The receive constraint associated with f can be discharged if g receives the first send by h, and f receives the second. A schedule that orders the execution of f and g in this way, and additionally pairs i with a send operation on c2 in the let-body, would allow the second call to f to fully leverage the memo information recorded in the first. Doing so would enable eliding the pure computation in f (abstracted by . . .) in its definition, performing only the effects defined by the communication actions (i.e., the send of v1 on c1, and the receipt of v3 on c2).
6.3.5 Issues
As this example illustrates, utilizing memo information completely may require forcing a schedule that pairs communication actions in a specific way, making a solution that requires all constraints to be satisfied infeasible in practice. Hence, rule (MEMO APPLICATION) allows evaluation to continue within an application that has already been memoized once a constraint cannot be matched. As a result, if during the second call to f, i indeed received v3 from h, the constraint associated with the recv operation in f would not be satisfied, and the rules would obligate the call to block on the recv, waiting for h or the main body to send a value on c2.

let fun f() = (send(c,1); send(c,2))
    fun g() = (recv(c); recv(c))
in spawn(g); f(); ...; spawn(g); f()
end

Figure 6.10. The second application of f can only be partially memoized up to the second send since only the first receive made by g is blocked in the global state.
Nonetheless, the semantics as currently defined does have limitations. For example, the
function ℑ does not examine future actions of threads and thus can only match a constraint
with a thread if that thread is able to match the constraint in the current state. Hence,
the rules do not allow leveraging memoization information for function calls involved in
a producer/consumer relation. Consider the simple example given in Figure 6.10. The
second application of f can take advantage of memoized information only for the first
send on channel c. This is because the global state in which constraints are checked only
has the first recv made by g blocked on the channel. The second recv only occurs if the
first is successfully paired with a corresponding send. Although in this simple example the
second recv is guaranteed to occur, consider a variant in which g contains a branch that determines whether it consumes a second value from the channel c. In general, constraints can only be matched
against the current communication action of a thread.
Secondly, exploiting memoization may lead to starvation, since subsequent applications of the memoized call will be matched based on the constraints supplied by the initial call. Consider the example shown in Figure 6.11. If the initial application of f pairs with the send performed by g, subsequent calls to f that use this memoized version will also pair with g, since h produces different values. This leads to starvation of h. Although this behavior is certainly legal, one might reasonably expect a scheduler to interleave the sends of g and h.

let fun f() = recv(c)
    fun g() = (send(c,1); g())
    fun h() = (send(c,2); h())
in spawn(g); spawn(h); f(); ...; f()
end

Figure 6.11. Memoization of the function f can lead to the starvation of either g or h, depending on which value the original application of f consumed from channel c.
6.3.6 Schedule Aware Partial Memoization
To address the limitations described in the previous section, we define two new symmetric rules to pause and resume memoization (see Figure 6.12). Pausing memoization (rule (PAUSE MEMO)) is similar to the rule (MEMO END) in Figure 6.5, except that the failed constraints are not discarded and the thread context is not given an expression to evaluate. Instead, the thread retains its log of currently unsatisfied constraints, which prevents its further evaluation. This effectively pauses the evaluation of this thread but allows regular threads to continue normal evaluation. Notice that we only pause a thread utilizing memo information once it has correctly discharged its constraints. We could envision an alternative definition which pauses non-deterministically on any constraint and moves the non-discharged constraints back to the thread context which holds unsatisfied constraints. For the sake of simplicity, we opted for a greedy semantics which favors the utilization of memoization.
We can resume the paused thread, enabling it to discharge other constraints, using the rule (RESUME MEMO), which begins constraint discharge anew for a paused thread. Thus, if a thread context has a set of constraints that were not previously satisfied and evaluation is not utilizing memo information, we can once again apply our ℑ function. Note that the use of memo information can be ended at any time (rule (MEMO END) can be applied instead of (PAUSE MEMO)). We can, therefore, change a thread in a paused state into a bona fide thread by first applying (RESUME MEMO). If ℑ does not indicate we can discharge any additional constraints, we simply apply the rule (END MEMO).
We also extend our evaluation rules to allow constraints to be matched against other constraints (rule (MCOM)). This is accomplished by matching constraints between two paused threads. Of course, it is possible that two threads, both of which were paused on a constraint that was not satisfiable, may nonetheless satisfy one another. This happens when one thread is paused on a send constraint and another on a receive constraint, both of which match on the channel and value. In this case, the constraints on both sender and receiver can be safely discharged. This allows calls which attempt to use previously memoized constraints to match against constraints extant in other calls that also attempt to exploit memoized state.
6.4 Soundness
We can relate the states produced by memoized evaluation to the states constructed by the non-memoizing evaluator. To do so, we first define a transformation function T that transforms process states (and terms) defined under memoized evaluation into process states (and terms) defined under non-memoized evaluation (see Figure 6.13). Since memoized evaluation stores evaluation contexts in θ, they must be extracted and restored. This is done in the opposite order from that in which they were pushed onto the stack θ, since the top of the stack represents the most recent call. Functions currently in the process of utilizing memo information must be resumed from the expression captured in the first non-discharged constraint. Similarly, threads which are currently paused must also be resumed.
Our safety theorem ensures memoization does not yield states which could not be realized under non-memoized evaluation:

Theorem 6.4.1 (Safety) If
  P ∥ ⟨θ, t[E[λδ x.e(v)]]⟩, ∅  =⇒^β  P′ ∥ ⟨θ′, t[[v′]]⟩, σ
then there exist α1, . . . , αn ∈ {App, Ch, Spn, Com} such that
  T(P ∥ ⟨θ, t[E[λδ x.e(v)]]⟩)  ⟼^α1 · · · ⟼^αn  T(P′ ∥ ⟨θ′, t[[v′]]⟩)

(PAUSE MEMO)
  (t, ∅), P, σ  =⇒^MemP  P, σ

(RESUME MEMO)
  ℑ(C, P) = C′      C = C″.C′
  P ∥ ⟨θ, t_C[E[λδ x.e(v)]]⟩, σ  =⇒^MemR  (t, C″), P ∥ ⟨θ, t_C′[E[λδ x.e(v)]]⟩, σ

(MEMO COMMUNICATION)
  C = ((S, l, v), _)      C′ = ((R, l, v), _)
  ts = ⟨θ, t_C.C[λδ x.e(v)]⟩      tr = ⟨θ′, t′_C′.C′[λδ1 x.e′(v′)]⟩
  θ, C  θ″      θ′, C′  θ‴
  ts′ = ⟨θ″, t_C[λδ x.e(v)]⟩      tr′ = ⟨θ‴, t′_C′[λδ1 x.e′(v′)]⟩
  P ∥ ts ∥ tr, σ  =⇒^MCom  P ∥ ts′ ∥ tr′, σ

Figure 6.12. Memoization Semantics – Schedule Aware Partial Memoization.

We first introduce a corollary that shows that the empty memo store (∅) is a valid memo store. We then show that every β step taken under memoization corresponds to zero or one step under non-memoized evaluation: zero steps for returns and memo actions (e.g., MemS, MemE, MemP, and MemR), and one step for core evaluation and effectful actions (e.g., MCh, MApp, MSpawn, MRecv, MSend, and MCom). Although a function which is utilizing memoized information does not execute pure code (rule (App) under ⟼), it does, however, capture constraints for the elided applications. The expressions captured in these non-failing constraints allow us to construct a corresponding sequence of evaluation steps under ⟼.

T((t, C), P ∥ ⟨θ, t_C′[E[λδ x.e(v)]]⟩) = T(P ∥ ⟨θ, t_C.C′[E[λδ x.e(v)]]⟩)
T(P1 ∥ P2) = T(P1) ∥ T(P2)
T(⟨θ, t[e]⟩) = t[T(En[. . . E1[e]])]              where θi = (δi, vi, Ei, Ci) ∈ θ
T(⟨θ, t_(_,e′).C[e]⟩) = t[T(En[. . . E1[e′]])]    where θi = (δi, vi, Ei, Ci) ∈ θ
T(λδ x.e) = λ x.e
T(e1(e2)) = T(e1)(T(e2))
T(spawn(e)) = spawn(T(e))
T(send(e1, e2)) = send(T(e1), T(e2))
T(recv(e)) = recv(T(e))
T(e) = e   otherwise

Figure 6.13. T defines an erasure property on program states. The first four rules remove memo information and restore evaluation contexts.
Corollary 1: The empty memo store, ∅, is a valid memo store.

Proof by case analysis of the rules:

Case (Application): We have
  P ∥ ⟨θ, t[E[λδ x.e(v)]]⟩, ∅  =⇒^App  P ∥ ⟨θ.θ′, t[e[v/x]]⟩, ∅
where (δ, v) ∉ Dom(∅).

Case (Start Memo): We have
  P ∥ ⟨θ, t[E[λδ x.e(v)]]⟩, σ  =⇒^MemS  (t, C″), P ∥ ⟨θ, t_C′[E[λδ x.e(v)]]⟩, σ
where (δ, v) ∈ Dom(σ). Thus we know that the rule (Start Memo) cannot be applied with an empty memo store.

Case (Return): Based on the rule (Return) we know
  P ∥ ⟨θ.θ, t[v′]⟩, σ  =⇒^Ret  P ∥ ⟨θ, t[E[v′]]⟩, σ[(δ, v) ↦ C.(Rt, v′)]
We know: ∅[(δ, v) ↦ C.(Rt, v′)].

Other Cases: All other rules do not modify σ, nor do they leverage it. □
Lemma 1: If
  P ∥ ⟨θ, t[E[λδ x.e(v)]]⟩, ∅  =⇒^β  P′ ∥ ⟨θ′, t[E[e′]]⟩, σ
then there exist α1, . . . , αn ∈ {App, Ch, Spn, Com} such that
  T(P ∥ ⟨θ, t[E[λδ x.e(v)]]⟩)  ⟼^α1 · · · ⟼^αn  T(P′ ∥ ⟨θ′, t[E[e′]]⟩)
Proof by induction on the length of β. The base case is a sequence of length one; such sequences correspond to functions which simply return a value.
By definition of rule (Application) and Corollary 1 we know:
App
Pkhθ, t[E[λδ x.e(v)]]i, 0/ =⇒ Pkhθ.θ0 , t[e[v/x]]i, 0/
/
where: θ, ((App), E[λδ x.e(v)]) θ0 and: (δ, v) 6∈ Dom(σ) θ = (δ, v, E, 0)
By the transform T we know:
T (Pkhθ, t[E[λδ x.e(v)]]i) = T (P)kt[E[λ x.e]]
/ = T (P)kt[E[e[v/x]]]
T (Pkhθ.θ0 , t[e[v/x]]i, 0)
We know by the constraint capture rules (Channel), (Spawn), (Application), (Communication), and (Return):
App
T (P)kt[E[λ x.e(v)] 7−→ T (P)kt[E[e[v/x]]]
We also know that e can be a value, namely some v0 and that v0 [v/x] is simply v0 or e can be
x and that e[v/x] is simply v.
Inductive Case: Assume the Lemma holds for β sequences of length n, we show it holds
for sequences of length n + 1.
By the inductive hypothesis we know:
β
Pkhθ, t[E[λδ x.e(v)]]i, 0/ =⇒ P00 , σ
We examine transitions possible for the nth state (P00 , σ) by a case analysis on the n + 1
transition βn+1 .
Case (Channel): If βn+1 = Ch by rule (Channel) we know:
Ch
Pkhθ, t0 [E[mkCh()]]i, σ =⇒ Pkhθ0 , t0 [E[l]]i, σ
where: l fresh
By the transform T we know:
T (Pkhθ, t0 [E[mkCh()]]i) = T (P)kt0 [E[mkCh()]]
T (Pkhθ0 , t0 [E[l]]i) = T (P)kt0 [E[l]]
By the structure of the rules we know:
Ch
Pkt0 [E[mkCh()]] 7−→ Pkt0 [E[l]]
Case (Function): If βn+1 = Fun by rule (Function) we know:
Fun
Pkhθ, t0 [E[λ x.e]]i, σ =⇒ Pkhθ, t0 [E[λδ x.e]]i, σ
By the transform T we know:
T (Pkhθ, t0 [E[λ x.e]]i) = T (P)kt0 [E[λ x.e]]
T (Pkhθ, t0 [E[λδ x.e]]i) = T (P)kt0 [E[λ x.e]]
Case (Application): If βn+1 = App by rule (Application) we know:
App
Pkhθ, t0 [E[λδ x.e(v)]]i, σ =⇒ Pkhθ.θ, t0 [e[v/x]]i, σ
By the transform T we know:
T (Pkhθ, t0 [E[λδ x.e(v)]]i) = T (P)kt0 [E[λ x.e(v)]]
T (Pkhθ.θ, t0 [e[v/x]]i) = T (P)kt0 [E[e[v/x]]]
By the structure of the rules we know:
App
Pkt0 [E[λ x.e(v)]] 7−→ Pkt0 [E[e[v/x]]]
Case (Return): If βn+1 = Ret by rule (Return) we know:
Ret
Pkhθ.θ, t0 [v0 ]i, σ =⇒ Pkhθ, t0 [E[v0 ]]i, σ[(δ, v) 7→ C.(Rt, v0 )]
By the transform T we know:
T (Pkhθ.θ, t0 [v0 ]i) = T (P)kt0 [E[v0 ]]
T (Pkhθ, t0 [E[v0 ]]i) = T (P)kt0 [E[v0 ]]
Case (Spawn): If βn+1 = Spn by rule (Spawn) we know:
Spn
Pkhθ, t0 [E[spawn(λδ x.e)]]i, σ =⇒ Pktk kts , σ
where:
/ t00 [e[unit/x]]i t00 fresh
tk = hθ0 , t0 [E[unit]]i ts = h0,
By the transform T we know:
T (Pkhθ, t0 [E[spawn(λδ x.e)]]i) = T (P)kt0 [E[spawn(λ x.e)]]
T (Pktk kts ) = T (P)kt0 [E[unit]]kt00 [e[unit/x]]
By the structure of the rules we know:
Spn
Pkt0 [E[spawn(λ x.e)]] 7−→ Pkt0 [E[unit]]kt00 [e[unit/x]]
Case (Communication): If βn+1 = Com by rule (Comm) we know:
Com
P, σ =⇒ P0 khθ00 , t0 [E[unit]]ikhθ000 , t00 [E0 [v]]i, σ
where:
P = P0 khθ, t0 [E[send(l, v)]]ikhθ0 , t00 [E0 [recv(l)]]i
By the transform T we know:
T (P) = T (P0 )kt0 [E[send(l, v)]]kt00 [E0 [recv(l)]]
T (P0 khθ, t0 [E[send(l, v)]]ikhθ0 , t00 [E0 [recv(l)]]i) = T (P0 )kt0 [E[unit]]kt00 [E0 [v]]
By the structure of the rules we know:
Com
P0 kt0 [E[send(l, v)]]kt00 [E0 [recv(l)]] 7−→ P0 kt0 [E[unit]]kt00 [E0 [v]]
Case (Memo Channel): If βn+1 = MCh by rule (Memo Channel) we know:
MCh
(t0 ,C.C), Pkhθ, t0C0 [E[λδ x.e(v)]]i, σ =⇒
(t0 ,C00 ), Pkhθ0 , t0C0 [E[λδ x.e(v)]]i, σ
where: C = ((Ch, l), ) and l0 fresh
By the structure of the rules we know C was generated by rule (Channel). We know l0 fresh
and l 0 is substituted for l in the remaining constraints.
C0 = C[l0 /l] θ,C θ0
By the transform T we know:
T ((t0 ,C.C), Pkhθ, t0C0 [E[λδ x.e(v)]]i) = T (P)kt0 [E[E0 [mkCh()]]]
T ((t0 ,C0 ), Pkhθ0 , t0C0 [E[λδ x.e(v)]]i) = T (P)kt0 [E[e0 ]]
where by the structure of the rules:
C = (( ), E0 [mkCh()]) C0 = (( ), e0 ).C000
We know by the constraint capture rules (Channel), (Spawn), (Application), (Communication), and (Return): e0 = E0 [l0 ]
therefore:
Ch
T (P)kt0 [E[E0 [mkCh()]]] 7−→ T (P)kt0 [E[E0 [l]]]
Case (Memo Application): If βn+1 = MApp by rule (Memo Application) we know:
MApp
(t0 ,C.C), Pkhθ, t0C0 [E[λδ x.e(v)]]i, σ =⇒ (t0 ,C), Pkhθ0 , t0C0 [E[λδ x.e(v)]]i, σ
where: C = ((App), E0 [λδ x0 .e0 (v0 )])
By the structure of the rules we know C was generated by (Application).
By the transform T we know:
T ((t,C.C), Pkhθ, tC0 [E[λδ x.e(v)]]i, σ) = T (P)kt0 [E[E0 [λ x0 .e0 (v0 )]]]
T ((t,C), Pkhθ0 , tC0 [E[λδ x.e(v)]]i, σ) = T (P)kt0 [E[e00 ]]
where by the structure of the rules: C = (( ), e00 ).C00
We know by the constraint capture rules (Channel), (Spawn), (Application), (Communication), and (Return): e00 = E0 [e0 [v0 /x0 ]]
therefore:
App
T (P)kt0 [E[E0 [λ x0 .e0 (v0 )]]] 7−→ T (P)kt0 [E[E0 [e0 [v0 /x0 ]]]]
Case (Memo Spawn): If βn+1 = Com by rule (Memo Spawn) we know:
MSp
(t0 ,C.C), Pkhθ, tC0 0 [E[λδ x.e(v)]]i, σ =⇒
/ t00 [e0 [unit/x]]i, σ
(t0 ,C), Pkhθ0 , tC0 0 [E[λδ x.e(v)]]ikh0,
where: C = ((Sp, e0 ), E0 [spawn(λδ x.e0 )]) and t00 fresh
By the structure of the rules we know C was generated by (Spawn).
By the transform T we know:
T ((t0 ,C.C), Pkhθ, tC0 0 [E[λδ x.e(v)]]i) = T (P)kt0 [E[E0 [spawn(λδ x.e)]]]
/ t00 [e0 ]i) = T (P)kt0 [E[e00 ]]kt00 [e0 [unit/x]]
T ((t0 ,C), Pkhθ0 , t0C0 [E[λδ x.e(v)]]ikh0,
where by the structure of the rules: C = (( ), e00 ).C00
We know by the constraint capture rules (Channel), (Spawn), (Application), (Communication), and (Return): e00 = E0 [unit] therefore:
Spn
T (P)kt0 [E[E0 [spawn(λδ x.e0 )]]] 7−→ T (P)kt0 [E[E0 [unit]]]kt00 [e0 [unit/x]]
Case (Memo Receive): If βn+1 = Com by rule (Memo Receive) we know:
MCom
(t0 ,C.C), Pkts ktr , σ =⇒ (t0 ,C), Pkts0 ktr0 , σ
where:
ts = hθ, t00 [E[send(l, v)]]i tr = hθ0 , t0C0 [E0 [λδ x.e(v)]]i
ts0 = hθ00 , t00 [E[unit]]i tr0 = hθ000 , t0C0 [E0 [λδ x.e(v)]]i
By the structure of the rules we know C was generated by (Communication).
By the transform T we know:
T ((t0 ,C.C), Pkts ktr ) = T (P)kt00 [E[send(l, v)]]kt0 [E0 [E00 [recv(l)]]]
T ((t0 ,C), Pkts0 ktr0 ) = T (P)kt00 [E[unit]]kt0 [E0 [e0 ]
where by the structure of the rules:
C = ((R, l, v), E00 [recv(l)]) C = (( ), e0 ).C00
We know by the constraint capture rules (Channel), (Spawn), (Application), (Communication), and (Return): e0 = E00 [v] therefore:
Com
T (P)kt00 [E[send(l, v)]]kt0 [E0 [E00 [recv(l)]]] 7−→
T (P)kt00 [E[unit]]kt0 [E0 [E00 [v]]]
Case (Memo Send): If βn+1 = Com by rule (Memo Send) we know:
MCom
(t0 ,C.C), Pkts ktr , σ =⇒ (t0 ,C), Pkts0 ktr0 , σ
where:
ts = hθ0 , tC0 0 [E[λδ x.e(v)]]i tr = hθ, t00 [E0 [recv(l)]]i
ts0 = hθ000 , tC0 0 [E[λδ x.e(v)]]i tr0 = hθ00 , t00 [E0 [v]]i
By the structure of the rules we know C was generated by (Communication).
By the transform T we know:
T ((t0 ,C.C), Pkts ktr ) = T (P)kt0 [E[E00 [send(l, v)]]]kt00 [E0 [recv(l)]]
T ((t0 ,C), Pkts0 ktr0 ) = T (P)kt0 [E[e0 ]kt00 [E0 [v]]
where by the structure of the rules: C = (( ), e) C0 = (( ), e0 ).C00
We know by the constraint capture rules (Channel), (Spawn), (Application), (Communication), and (Return): e0 = E00 [unit] therefore:
Com
T (P)kt0 [E[E00 [send(l, v)]]kt00 [E0 [recv(l)]] 7−→ T (P)kt0 [E[E00 [unit]]]kt00 [E0 [v]]
Case (Memo Start): If βn+1 = MemS by rule (Memo Start) we know:
MemS
Pkhθ, t[E[λδ x.e(v)]]i, σ =⇒ (t,C00 ), Pkhθ, tc0 [E[λδ x.e(v)]]i, σ
where:
(δ, v) ∈ Dom(σ) σ(δ, v) = C ℑ(C, P) = C0 C = C00 .C0
C00 = C.C000 C = (( ), e0 )
By the transform T we know:
T (Pkhθ, t[E[λδ x.e(v)]]i) = T (P)kt[E[λ x.e(v)]]
T ((t,C00 ), Pkhθ, tc0 [E[λδ x.e(v)]]i) = T (P)kt[E[e0 ]]
We know by the constraint capture rules (Channel), (Spawn), (Application), (Communication), and (Return): e0 = e[v/x]
App
T (P)kt[E[λ x.e(v)]] 7−→ T (P)kt[E[e[v/x]]]
Case (Memo End):
If βn+1 = MemE then by our I.H. we know MemS ∈ β and thus by
rule (Memo End) we know:
MemE
0
/ Pkhθ, tC.C
(t0 , 0),
[E[λδ x.e(v)]]i, σ =⇒ Pkhθ, t0 [E[e0 ]]i, σ
where: C = (c, e0 )
By the transform T we know:
0
/ Pkhθ, tC.C
T ((t0 , 0),
[E[λδ x.e(v)]]i) = T (P)kt0 [E[e0 ]]
T (Pkhθ, t0 [E[e0 ]]i) = T (P)kt0 [E[e0 ]]
Case (Pause Memo): If βn+1 = MemP by our I.H. we know MemS ∈ β and thus by rule
(PauseMemo) we know:
MemP
/ P, σ =⇒ P, σ
(t0 , 0),
where:
P = P0 khθ, tC0 [E[λδ x.e(v)]]i
By the structure of σ, the definition of ℑ, and rule (EndMemo) we know:
C = C.C00 C = (( ), e0 )
By the transform T we know:
T (P0 khθ, tc0 [E[λδ x.e(v)]]i) = T (P0 )kt0 [E[e0 ]]
/
Therefore α = 0.
Case (Resume Memo):
If βn+1 = MemR by our I.H. we know MemP ∈ β and thus by
rule (Resume Memo) we know:
MemR
Pkhθ, tC0 [E[λδ x.e(v)]]i, σ =⇒ (t0 ,C00 ), Pkhθ, tC0 0 [E[λδ x.e(v)]]i, σ
where:
ℑ(C, P) = C0 C = C00 .C0
C = C.C000
C = (( ), e0 )
By the transform T we know:
T (Pkhθ, tC0 [E[λδ x.e(v)]]i) = T (P)kt0 [E[e0 ]]
T ((t0 ,C00 ), Pkhθ, tC0 0 [E[λδ x.e(v)]]i) = T (P)kt0 [E[e0 ]]
Case (Memo Communication): If βn+1 = MCom by our I.H. we know MemP ∈ β for
both threads tr and ts and thus by rule (Memo Communication) we know
MCom
Pkts ktr , σ =⇒ Pkts0 ktr0 , σ
where:
ts = hθ, t0C.C [λδ x.e(v)]i tr = hθ0 , t00C0 .C0 [λδ1 x.e0 (v0 )]i
ts0 = hθ00 , t0C [λδ x.e(v)]i tr0 = hθ000 , t00C0 [λδ1 x.e0 (v0 )]i
By the transform T we know:
T (Pkts ktr ) = T (P)kt 0 [es ]kt 00 [er ]
T (Pkts0 ktr0 ) = T (P)kt 0 [e0s ]kt 00 [e0r ]
where:
C = (( ), es ) C = C00 .C00 C00 = (( ), e0s )
C0 = (( ), er ) C0 = C000 .C000 C000 = (( ), e0r )
We know by the constraint capture rules (Channel), (Spawn), (Comm), and (Ret): es =
E00 [send(l, v)] and er = E000 [recv(l)]
e0s = E00 [unit] and e0r = E000 [v] therefore:
Com
T (P)kt 0 [es ]kt 00 [er ] 7−→ T (P)kt 0 [e0s ]kt 00 [e0r ]
Theorem 6.4.2 (Safety) If
  P ∥ ⟨θ, t[E[λδ x.e(v)]]⟩, ∅  =⇒^β  P′ ∥ ⟨θ′, t[E[v′]]⟩, σ
then there exist α1, . . . , αn ∈ {App, Ch, Spn, Com} such that
  T(P ∥ ⟨θ, t[E[λδ x.e(v)]]⟩)  ⟼^α1 · · · ⟼^αn  T(P′ ∥ ⟨θ′, t[E[v′]]⟩)

The proof follows directly from Lemma 1. By Lemma 1 we know that if
  P ∥ ⟨θ, t[E[λδ x.e(v)]]⟩, ∅  =⇒^β  P′ ∥ ⟨θ′, t[E[e′]]⟩, σ
then there exist α1, . . . , αn ∈ {App, Ch, Spn, Com} such that
  T(P ∥ ⟨θ, t[E[λδ x.e(v)]]⟩)  ⟼^α1 · · · ⟼^αn  T(P′ ∥ ⟨θ′, t[E[e′]]⟩)
By the grammar of the language we know every v is an e. □
6.5 Implementation
The main extensions to Multi-MLton to support partial memoization involve insertion
of read and write barriers to track accesses and updates of references, barriers to monitor
function arguments and return values, and hooks to the Concurrent ML library to monitor
channel-based communication.
6.5.1 Parallel CML and Hooks
Our parallel implementation of CML is based on Reppy’s parallel model of CML [9].
We utilize low-level locks, implemented with compare-and-swap, to provide guarded access
to channels. Whenever a thread wishes to perform an action on a channel, it must first
acquire the lock associated with the channel. Since a given thread may only utilize one
channel at a time, there is no possibility of deadlock.
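A minimal sketch of the locking discipline just described (illustrative only; compareAndSwap stands for an atomic compare-and-swap primitive assumed to be supplied by the runtime, and is not defined here):

    (* acquire: spin until the lock word changes from 0 (free) to 1 (held) *)
    fun acquire (compareAndSwap, lock : int ref) =
      if compareAndSwap (lock, 0, 1) = 0 then () else acquire (compareAndSwap, lock)

    (* release: the holder resets the lock word *)
    fun release (lock : int ref) = lock := 0

    (* a channel operation is bracketed by acquire ... release on the channel's
       lock, so at most one thread manipulates the channel at a time *)

Because a thread holds at most one channel lock at any time, lock acquisition cannot form a cycle.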
The underlying CML library was also modified to make memoization efficient. The bulk of the changes comprised hooks to monitor channel communication and spawns, additional channel queues to support constraint matching on synchronous operations, and support for logging successful communication (including selective communication and complex composed events).
The constraint matching engine required a modification to the channel structure. Each channel is augmented with two additional queues to hold send and receive constraints. When a constraint is being tested for satisfiability, the opposite queue is checked first (e.g., a send constraint would check the receive constraint queue). If no match is found, the regular queues are checked for satisfiability. If the constraint cannot be satisfied immediately, it is added to the appropriate queue.
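The matching order can be sketched as follows (a self-contained simplification; the real channel structure and queue types in Multi-MLton differ):

    (* each direction on a channel has a constraint queue and a queue of
       blocked regular operations; both are simplified here to value lists *)
    type 'a queues = {constraints : 'a list ref, blocked : 'a list ref}

    fun dequeue (q : 'a list ref) =
      case !q of [] => NONE | x :: xs => (q := xs; SOME x)

    (* to satisfy a constraint, first try the opposite constraint queue, then
       the opposite regular queue; otherwise enqueue the constraint itself *)
    fun trySatisfy (opposite : 'a queues, self : 'a queues, c : 'a) =
      case dequeue (#constraints opposite) of
        SOME partner => SOME partner
      | NONE =>
          (case dequeue (#blocked opposite) of
             SOME blockedOp => SOME blockedOp
           | NONE => (#constraints self := !(#constraints self) @ [c]; NONE))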
6.5.2 Supporting Memoization
Any SML function can be annotated as a candidate for memoization. For each annotated function, its arguments and return values at different call-sites, the communication it performs, and information about the threads it spawns are recorded in a memo table. Memoization information is logged through hooks to the CML runtime and stored by the underlying client code. In addition, to support partial memoization, the continuations of logged communication events are also saved.
Our memoization implementation extended CML channels to be aware of memoization constraints. Each channel structure contained a queue of constraints waiting to be solved on the channel. Because it is not readily apparent whether a memoized version of a CML function can be utilized at a call site, we delay a function application to see if its constraints can be matched. These constraints must be satisfied in the order in which they were generated.
Constraint matching can certainly fail on a receive constraint. A receive constraint obligates a receive operation to read a specific value from a channel. Since channel communication is blocking, a receive constraint that is being matched can choose from all values
whose senders are currently blocked on the channel. This does not violate the semantics of
CML since the values blocked on a channel cannot be dependent on one another. In other
words, a schedule must exist where the matched communication occurs prior to the first
value blocked on the channel.
Unlike a receive constraint, a send constraint can only fail if there are (a) no matching
receive constraints on the sending channel that expect the value being sent, or (b) no receive
operations on that same channel. A CML receive operation (as opposed to a receive constraint) is indifferent to the value it removes from a channel. Thus, any receive on a matching channel will satisfy a send constraint.
If no receives or sends are enqueued on a constraint’s target channel, a memoized execution of the function will block. Therefore, failure to fully discharge constraints by stalling
memoization on a presumed unsatisfiable constraint does not compromise global progress.
This observation is critical to keeping memoization overheads low.
Our memoization technique relies on efficient equality tests. We extend MLton’s polyequal function to support equality on reals and closures. Although equality on values of
type real is not algebraic, built-in compiler equality functions were sufficient for our needs.
To support efficient equality on functions, we approximate function equality as closure
equality. Unique identifiers are associated with every closure and recorded within their
environment; runtime equality tests on these identifiers are performed during memoization.
Memoization data is discarded during garbage collection. This prevents unnecessary
build up of memoization meta-data during execution. As a heuristic, we also enforce an
upper bound for the amount of memo data stored for each function, and the space that each
memo entry can take. A function that generates a set of constraints whose size exceeds the
memo entry space bound is not memoized. For each memoized function, we store a list of
memo meta-data. When the length of the list reaches the upper limit but new memo data
is acquired upon an application of the function to previously unseen arguments, one entry
from the list is removed at random.
6.6 Performance Results
We examined three benchmarks to measure the effectiveness of partial memoization in a parallel setting. The first benchmark is a streaming algorithm for approximating a k-clustering of points on a geometric plane. The second is a port of the STMBench7 benchmark [60]. STMBench7 utilizes channel-based communication instead of shared memory and bears resemblance to the red-black tree program presented in Section 6.2. The third is Swerve, a web-server that was described in Chapter 4. Our benchmarks were executed on a 16-way AMD Opteron 865 server with 8 processors, each containing two symmetric cores, and 32 GB of total memory, with each CPU having its own local memory of 4 GB. Access to non-local memory is mediated by a hyper-transport layer that coordinates memory requests between processors.
6.6.1 Synthetic Benchmarks
Similar to most streaming algorithms [61], a k-clustering application defines a number
of worker threads connected in a pipeline fashion. Each worker maintains a cluster of
points that sit on a geometric plane. A stream generator creates a randomized data stream
of points. A point is passed to the first worker in the pipeline. The worker computes a
convex hull of its cluster to determine if a smaller cluster could be constructed from the
newly received point. If the new point results in a smaller cluster, the outlier point from the
original cluster is passed to the next worker thread. On the other hand, if the received point
does not alter the configuration of the cluster, it is passed on to the next worker thread. The
result defines an approximation of n clusters (where n is the number of workers) of size k
(points that compose the cluster).
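The pipeline stage just described can be sketched in CML as follows (an illustrative skeleton under assumed types; insert stands in for the convex-hull update, which is not shown):

    (* each worker holds a cluster, receives a point, and forwards either the
       displaced outlier or the point itself to the next stage *)
    fun worker (inCh, outCh, insert) =
      let
        fun loop cluster =
          let
            val p = recv inCh
            (* insert returns the updated cluster and the rejected point *)
            val (cluster', rejected) = insert (cluster, p)
          in
            send (outCh, rejected);
            loop cluster'
          end
      in
        ignore (spawn (fn () => loop []))
      end

Chaining sixteen such workers, fed by a stream-generator thread and drained by a sink thread, reproduces the structure used in the benchmark.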
STMBench7 [60] is a comprehensive, tunable multi-threaded benchmark designed to compare different STM implementations and designs. Based on the well-known OO7 database benchmark [62], STMBench7 simulates data storage and access patterns of CAD/CAM applications that operate over complex geometric structures. At its core, STMBench7
builds a tree of assemblies whose leaves contain bags of components. These components
have a highly connected graph of atomic parts and design documents. Indices allow components, parts, and documents to be accessed via their properties and IDs. Traversals of this
graph can begin from the assembly root or any index and sometimes manipulate multiple
pieces of data.
STMBench7 was originally written in Java. Our port, besides translating the assembly tree to use a CML-based server abstraction (as discussed in Section 6.2), also involved
building an STM implementation to support atomic sections, loosely based on the techniques described in [59, 63]. All nodes in the complex assembly structure and atomic parts
graph are represented as servers with one receiving channel and handles to all other adjacent nodes. Handles to other nodes are simply the channels themselves. Each server thread
waits for a message to be received and then performs the requested computation. A transaction can thus be implemented as a series of communications with various server nodes.
To measure the effectiveness of our memoization technique, we executed two configurations (one memoized, and the other non-memoized) of our k-clustering algorithm and
STMBench7, and measured overheads and performance by averaging results over ten executions. The k-clustering algorithm utilizes memoization to avoid redundant computations based on previously witnessed points as well as redundant computations of clusters.
For STMBench7, the non-memoized configuration uses our STM implementation without any memoization, whereas the memoized configuration implements partial memoization of aborted transactions.
For k-clustering, we computed 16 clusters of size 60 out of a stream of 10K randomly
generated points. This resulted in the creation of 16 worker threads, one stream generating thread, and a sink thread which aggregates the computation results. STMBench7
was executed on a graph in which there were approximately 280k complex assemblies and
140k assemblies whose bags referenced one of 100 components; by default, each component contained a parts graph of 100 nodes. STMBench7 creates a number of threads
proportional to the number of nodes in the underlying data structure; this is roughly 400K
threads for our configuration. Our experiments on Swerve were executed using the default
server configuration and were exercised using httperf¹, a well-known tool for measuring web-server performance.
The benchmarks represent two very different programming models – pipelined stream-based parallelism and transactions – and leverage two very different execution models: k-clustering makes use of long-lived worker threads, while STMBench7 utilizes many lightweight server threads. Each run of both benchmarks has an execution time that ranges between 1 and 3 minutes.
For k-clustering we varied the number of repeated points generated by the stream. Configurations in which there is a high degree of repeated points offer the best performance
gain (see Figure 6.14). For example, an input in which 50% of the input points are repeated
yields roughly 50% performance gain. However, we also observe roughly 17% performance improvement even when all points are randomized. This is because the cluster’s
convex hull shrinks as the points which comprise the cluster become geometrically compact. Thus, as the convex hull of the cluster shrinks, the likelihood of a random point being
contained within the convex hull of the cluster is reduced. Memoization can take advantage of this phenomenon by avoiding recomputation of the convex hull as soon as it can
be determined that the input point resides outside the current cluster. Although we do not
1 http://www.hpl.hp.com/research/linux/httperf/
Figure 6.14. Normalized runtime percent speedup for the k-clustering benchmark of memoized evaluation compared to non-memoized execution (y-axis: % speedup; x-axis: % repeats).
envision workloads that have high degrees of repeatability, memoization nonetheless leads
to a 30% performance gain on a workload in which only 10% of the inputs are repeated.
In STMBench7, the utility of memoization is closely correlated with the number and frequency of aborted transactions. Our tests varied the read-only/read-write ratio (see Figure 6.15) within transactions. Only transactions that modify values can cause aborts. Thus, an execution where all transactions are read-only cannot be accelerated, but one in which transactions frequently abort (because of conflicts due to concurrent writes) offers potential opportunities for memoization. The cost to support memoization is seen when there are 100% read-only transactions; in this case, the overhead is roughly 13%. These overheads arise because of the cost to capture memo information (storing arguments, continuations, etc.) and the cost associated with trying to utilize the memo information (discharging constraints).
Notice that as the number of transactions which perform modifications to the underlying data structure increases, so do memoization gains. For example, when the percentage of
read-only transactions is 60%, we see an 18% improvement in runtime performance compared to a non-memoizing implementation for STMBench7. We expected to see roughly
Figure 6.15. Normalized runtime percent speedup for STMBench7 of memoized evaluation compared to non-memoized execution (y-axis: % speedup; x-axis: read/write ratio).
a linear trend correlated to the increase in transactions which perform an update. However, we noticed that performance degrades about 12% from a read/write ratio of 20 to a
read/write ratio of zero. This phenomenon occurs because memoized transactions are more
likely to complete on their first try when there are fewer modifications to the structure.
Since a non-memoized transaction requires longer to complete, it has a greater chance of
aborting when there are frequent updates. This difference becomes muted as the number of changes to the data structure increases.
For both benchmarks, memory overheads are proportional to cache sizes and averaged roughly 15% for caches of size eight. The cache size defines the number of different memoized calls maintained for a function. Thus, a cache size of eight means that every memoized function records effects for eight different arguments.
6.6.2 Constraint Matching and Discharge Overheads
To measure the overheads of constraint discharge, we executed a simple ping-pong microbenchmark that sent a twenty-eight-character string between the pinger and the ponger. The program offers no ability to leverage captured memo information profitably. The two
threads which comprise the program repeatedly execute the same function; one thread sends the string on a shared channel and the other receives from the channel. We executed four configurations of the benchmark. In the first we memoized the sending function, in the second the receiving function, and in the third both the sending and the receiving functions. These three configurations were normalized with respect to a fourth configuration that did not utilize memoization. The results are given in Figure 6.16.

Figure 6.16. Normalized percent runtime overhead for discharging send and receive constraints for ping-pong (y-axis: % overhead; x-axis: iterations; series: Memo Both, Memo Receive, Memo Send).
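The microbenchmark's structure is simple enough to sketch directly (an illustrative skeleton; the twenty-eight-character payload is abbreviated here):

    val c : string chan = channel ()

    (* the memoization candidates: a single send of the fixed payload
       and a single receive *)
    fun sendOnce payload = send (c, payload)
    fun recvOnce ()      = ignore (recv c)

    (* pinger and ponger repeatedly apply the candidate functions *)
    fun pinger (_, 0)       = ()
      | pinger (payload, n) = (sendOnce payload; pinger (payload, n - 1))
    fun ponger 0 = ()
      | ponger n = (recvOnce (); ponger (n - 1))

Each measured configuration memoizes sendOnce, recvOnce, or both, and is compared against a run with memoization disabled.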
The overheads comprise searching memo tables, constraint matching and discharge, increased contention, and propagation of return values. The increased contention on the channel occurs because the channel lock is held longer when matching and discharging a constraint than for a regular send or receive, thereby increasing the probability of contention. In the configuration where only the sending function is memoized, the overhead is roughly 47%. When the receiving function is memoized, the overhead grows to 55%; this occurs because, when matching the receive constraint, we must test for equality. When both the sending and receiving functions are memoized, the overhead jumps to 80%.
6.6.3 Case Study: File Caching
In Swerve there is no explicit support for caching files in memory. Therefore, repeated requests for a given file require fetching that file from disk. File caching is a well-known performance optimization that is leveraged by the majority of modern web-servers. We observe that we can leverage our partial memoization technique to implement an error- and timeout-aware file cache.
Recall that prior to processing a file chunk, the file processor checks for timeouts, other errors, and whether the user has canceled the request for the file. Naively memoizing the file processor would lead to a solution that is not responsive to errors: the entire file would be sent over the network prior to the discovery of the timeout or error. This occurs because the file processor notifies the network processor of any errors through an explicit error notification protocol. The use of partial memoization provides a mechanism that allows us to utilize information stored within our memo tables while preserving responsiveness to errors. Our implementation generates a constraint for the error-checking code in the file processor. If an error occurs, this corresponds to a failed memoization constraint, and execution is resumed in the error handler.
Since the file processor sends file chunks to the network processor over a shared channel, our memoization scheme generates a constraint for each file chunk sent. These constraints provide the file cache. When the file is subsequently requested, the constraints are discharged. Notice that a send constraint does not require an equality test, only that a matching receive takes place.
In Swerve, we observe an increase in performance correlated to the size of the file
being requested by httperf (see Figure 6.17). In each run of httperf, the files requested were
all of the same size. Performance gains are capped at roughly 80% for file sizes greater
than 8 MB. For each requested file, we build constraints corresponding to the file chunks
read from disk. As long as no errors are encountered, the Swerve file processor sends the
file chunks to be processed into HTTP packets by another module. After each chunk has
been read from disk the file processor polls other modules for timeouts and other error
Figure 6.17. Normalized runtime percent speedup for Swerve of memoized evaluation compared to non-memoized execution (y-axis: % speedup; x-axis: file size in MB).
conditions. If an error is encountered, subsequent file processing is stopped and the request
is terminated. Partial memoization is particularly well suited for Swerve’s file processing
semantics because control is reverted to the error handling mechanism precisely at the point
an error is detected. This corresponds to a failed constraint.
6.7 Related Work
Memoization, or function caching [54, 64–66], is a well-understood method to reduce
the overheads of redundant function execution. Memoization of functions in a concurrent
setting is significantly more difficult and usually highly constrained [58]. We are unaware
of any existing techniques or implementations that apply memoization to the problem of
reducing transaction overheads in languages that support selective communication and dynamic thread creation. Our approach also bears resemblance to the procedure summary
construction for concurrent programs [67]. However, these approaches tend to be based on
a static analysis (e.g., the cited reference leverages procedure summaries to improve the
efficiency of model checking) and thus are obligated to make use of memoization greedily.
Because our motivation is quite different, our approach can consider lazy alternatives, ones
that leverage synchronization points to stall memo information use, resulting in potentially
improved runtime benefit.
Over the last decade, transactional memory (TM) has emerged as an attractive alternative to lock-based abstractions by providing strong semantic guarantees [68] (atomicity
and isolation) as well as a simpler programming model. Transactional memory provides
serializability guarantees for any concurrently executing transactional regions, preventing
the programmer from having to reason about complex interleavings of such regions. Transactional memory also relieves the burden of reasoning about deadlocks and complex locking protocols. Additionally, TM has been utilized to extract fine-grain parallelism from critical sections. Transactional memory can be implemented in hardware [69], software [20, 63, 70], or both [71, 72].
Software transactional memory (STM) [20, 63, 70] systems provide scalable performance surpassing that of coarse-grain locks and a simpler, but competitive alternative to
hand-tuned fine-grain locks. Unfortunately, the exact semantics that are provided by STM
are highly dependent on the underlying implementation. For instance, STMs that provide
weak atomicity guarantees only consider interactions of transactions and do not provide
any guarantees if threads not encapsulated in a transaction access memory concurrently being accessed by a transaction. Similarly, there are pessimistic and optimistic transactional
systems. Pessimistic transactions afford less parallelism, but in some implementations do
not require rollback or state reversion [70]. On the other hand, optimistic transactions,
under certain workloads, can provide additional parallelism, but force the programmer to
reason about the effects of state reversion. Namely, the programmer must avoid performing
I/O operations or any actions that cannot be reverted in a transactional scope.
There has also been work on applying these techniques to a functional programming
setting [41, 42]. These proposals usually rely on an optimistic concurrency control model
that checks for serializability violations prior to committing the transaction, aborting when
a violation is detected. Our benchmark results suggest that partial memoization can help
reduce the overheads of aborting optimistic transactions.
123
Self-adjusting mechanisms [73–75] leverage memoization along with change propagation to automatically adapt a program’s execution to a change of inputs, given an existing execution run. Memoization is used to identify parts of the program which have not changed
from the previous execution while change propagation is harnessed to install changed values where memoization cannot be applied. There has also been recent work on using
change propagation in a parallel setting [76]. The programming model assumes fork/join
parallelism, and is therefore not suitable for the kinds of contexts we consider. We believe
our memoization technique is synergistic with current self-adjusting techniques and can be
leveraged along with self-adjusting computation to create self-adjusting programs which
utilize message passing.
Reppy and Xiao [77] present a program analysis for CML that analyzes communication
patterns to optimize message passing operations. A type-sensitive interprocedural controlflow analysis is used to specialize communication actions to improve performance. While
we also use CML as the underlying subject of interest, our memoization formulation is
orthogonal to their techniques.
Checkpointing and replay mechanisms [78] utilize light-weight state restoration to recover from errors or exceptional conditions. These mechanisms unroll a program’s execution to a safe point with respect to the error. We believe such checkpointing mechanisms can leverage memoization to avoid redundant computation from the checkpointed state.
Our technique also shares some similarity with transactional events [22,43,44] (for further details please see Section 5.8). Transactional events explore a state space of possible
communications finding matching communications to ensure atomicity of a collection of
actions. To do so, transactional events require arbitrary lookahead in evaluation to determine if a complex composed event can commit. Partial memoization also explores potential matching communication actions to satisfy memoization constraints. However, partial
memoization avoids the need for arbitrary lookahead – failure to discharge memoization
constraints simply causes execution to proceed as normal. Since transactional events can be
implemented in an optimistic manner instead of relying on an unbounded search strategy,
124
we believe they can leverage partial memoization in ways similar to software transactional
memory.
6.8
Concluding Remarks
We have provided a definition of partial memoization in the context of synchronous
message passing. We have formalized that definition in an operational semantics, which we have shown to be equivalent to an operational semantics that does not leverage memoization. The usefulness of the abstraction has been demonstrated on two synthetic benchmarks as
well as a web-server.
125
7
ASYNCHRONOUS CML
Software complexity is typically managed using programming abstractions that encapsulate
behaviors within modules. A module’s internal implementation can be changed without requiring changes to clients that use it, provided that these changes do not entail modifying its
signature. For concurrent programs, the specifications that can be described by an interface
are often too weak to enforce this kind of strong encapsulation in the presence of communication that spans abstraction boundaries. Consequently, changing the implementation
of a concurrency abstraction by adding, modifying, or removing behaviors often requires
pervasive change to the users of the abstraction. Modularity is thus compromised.
In particular, asynchronous behavior generated internally within a module defines two
temporally distinct sets of actions, neither of which is exposed in a module’s interface.
The first are post-creation actions – these are actions that must be executed after an asynchronous operation has been initiated, without taking into account whether the effects of
the operation have been witnessed by its recipients. For example, a post-creation action of
an asynchronous send on a channel might initiate another send on that same channel; the
second action should take place with the guarantee that the first has already deposited its
data on the channel. The second are post-consumption actions – these define actions that
must be executed only after the effect of an asynchronous operation has been witnessed.
For example, a post-consumption action might be a callback that is triggered when the
client retrieves data from a channel sent asynchronously.
In this chapter, we describe how to build, maintain, and expand asynchronous concurrency abstractions to give programmers the ability to express composable, yet modular, asynchronous protocols. By weaving protocols through the use of post-creation and
post-consumption computations, we achieve composability without sacrificing modularity,
enabling reasoning about loosely coupled communication partners that span abstraction
126
boundaries. To achieve this, we enable the construction of signatures and interfaces that
specify asynchronous behavior via abstract post-creation and post-consumption actions.
Supporting composable post-creation and post-consumption in an asynchronous setting is challenging because achieving such composability necessarily entails specifying the
behavior across two distinct threads of control – the thread that initiates the action (e.g.,
the thread that performs an asynchronous send on a channel C), and the thread that completes it (e.g., the thread that reads the value from channel C). The heart of the problem is a dichotomy in language abstractions: asynchrony is fundamentally expressed using distinct threads of control, yet composability is achieved through abstractions such as events
or callbacks, or operations like function composition, that are inherently thread-unaware.
To address this significant shortcoming, we introduce a family of asynchronous combinators and primitives that are freely composable, and whose signatures precisely capture
the desired post-creation and post-consumption behavior of an asynchronous action. We
present our extensions in the context of CML’s first-class communication events [2]. Just
as CML’s synchronous events provide a solution to composable, synchronous message
passing that could not be easily accommodated using λ-abstraction and application, the
asynchronous events defined here offer a solution for composable asynchronous message
passing not easily expressible using synchronous communication and explicit threading
abstractions.
There are three overarching goals of our design:
1. Asynchronous combinators should permit uniform composition of pre/post creation
and consumption actions. This means that protocols should be allowed to extend the
behavior of an asynchronous action both with respect to the computation performed
before and after it is created and consumed.
2. Asynchronous actions should provide sensible visibility and ordering guarantees. A
post-creation computation should execute with the guarantee that the asynchronous
action it follows has been created (e.g., an action has been deposited on a channel),
127
and the effects of consumed asynchronous actions should be consistent with the order
in which they were created.
3. Communication protocols should be agnostic with respect to the kinds of actions
they handle. Thus, both synchronous and asynchronous actions should be permitted
to operate over the same set of abstractions (e.g., communication channels).
7.1
Design Considerations
There are two fundamental ways to express asynchrony in modern programming lan-
guages: (1) through the use of explicit asynchronous primitives built into the language itself
or (2) through the use of lightweight threads encapsulating synchronous primitives. The use
of explicit asynchrony is present in languages such as Erlang [1], JoCaml [79], and libraries such as MPI. In contrast, synchronous communication is the fundamental building
block in Concurrent ML [2] and languages that have been extended to support CML-like
abstractions (e.g., Haskell [80, 81] or Scheme [49]). For the latter, asynchronous behavior
is typically expressed by wrapping synchronous actions within a thread.
It is difficult to express post-consumption actions [82] using typical asynchronous primitives without requiring some sort of notification from the consumer of the action to the initiator. Asynchronous actions fundamentally decouple the two parties in a communication
protocol (e.g., the sender of a message is unaware of when the receiver receives it), thus preventing the ability to define computations that are triggered when the communication fully
completes. 1 Having explicit threads that encapsulate synchronous actions alleviates this
problem since the desired post-creation and post-consumption behavior can be specified as
part of a user-defined asynchronous abstraction. But, threads do not provide any guarantee
on ordering or visibility of actions; the point at which an asynchronous communication
takes place depends on the behavior of the underlying thread scheduler. Additionally, once
a thread encapsulates a computation it can no longer be extended, limiting composability.
1 Note that the thread which initiates the action does not need to be aware of the completion of the communication.
128
Figure 7.1. Two server-client model based concurrency abstractions extended to utilize logging. In (a) asynchronous sends are depicted as solid lines, whereas in (b) synchronous sends are depicted as solid lines and lightweight threads as boxes. Logging actions are presented as dotted lines.
Such ordering guarantees are less stringent than those afforded by futures, as they provide
no guarantees between asynchronous actions that operate over different channels.
To illustrate these points, consider a channel abstraction that mediates access between
“clients” and a “server” – clients issue requests to servers by writing their request to the
channel, and servers reply to the request by depositing their results on a dedicated channel
provided by the client. Internally, the server implementation may initiate replies asynchronously without waiting for client acknowledgement (e.g., in response to a file request,
the server may send chunks asynchronously that are subsequently aggregated by the client).
Now, consider augmenting this implementation so that actions on the channel are logged.
For example, the desired new behavior might require internally monitoring the rate at which
chunks are consumed by a client so that the server can provide quality of service guarantees.
Given primitive support for asynchronous actions, requests by the client can be serviced
asynchronously by the server. The asynchronous replies generated by the server retain
the desired ordering guarantees since chunks are deposited on the channel in the order in
which the asynchronous sends are executed. However, extending the protocol to support
129
post-consumption actions that supplement the log is difficult to do without exposing the
desired additional functionality to the client. This, unfortunately, embeds the server’s
functionality into the client, which is not desirable. While the client can be extended to
modify shared or global state after each receipt as depicted in Figure 7.1(a), modularity is
compromised. For example, the client would have to distinguish between each potential
server it may communicate with to correctly log only those actions initiated by the particular server that desires it. Furthermore, changes in the structure and behavior of the server’s
log would necessitate changes in logging code of the client.
In contrast, through the use of lightweight threads and synchronous primitives, we
can encode a very simple post-consumption action by sequencing the action after a synchronous send (as shown in Figure 7.1(b)) that is executed by a newly-created thread of
control. Here, the server initiates an asynchronous reply by creating a thread that synchronously communicates with the client. The post-consumption action that adds to the
log is performed by the thread after the synchronous communication completes. The client
is unaware of the additional functionality; the logging action is entirely encapsulated by
the thread. However, this solution suffers from two related problems. First, the use of
explicit threads for initiating asynchronous computation fails to provide any post-creation
guarantees. A post-creation action, in this case the initiation of the second chunk of data,
can make no guarantees at the point it commences evaluation that the value of the first send
has actually been deposited on the channel since the encoding defines the creation point of
the asynchronous action to be the point at which the thread is created, not the point where
the data is placed on the channel. Second, because threads are inherently unordered with
respect to one another and agnostic to their payload, there is no transparent mechanism to
enforce a sensible ordering relationship on the communication actions they encapsulate.
This results in the second thread that is spawned by the server being able to execute prior
to the first. Consequently, the data the server produces can be received out of order. Dealing
with this issue again requires augmenting the client to correctly reassemble chunks.
As the example illustrates, there exists a dichotomy in language-based concurrency
abstractions for achieving asynchrony. Specialized asynchronous primitives ensure visibility and ordering guarantees but preclude the specification and composability of post-consumption actions, while synchronous primitives and threads provide a mechanism to encapsulate post-consumption actions but fail to preserve ordering (and thus cannot support composable post-creation actions), and suffer from the standard composability limitations of threads. The challenge in building expressive asynchronous communication abstractions is defining mechanisms that allow programmers to express composable post-creation and post-consumption behavior while also ensuring sensible ordering and visibility guarantees.
We implement our design in the context of Concurrent ML [2].
7.1.1
Putting it All Together
Although synchronous message passing alleviates the complexity of reasoning about
arbitrary thread interleavings, and enables composable synchronous communication protocols, using threads to encode asynchrony unfortunately re-introduces these complexities.
Having primitive asynchronous send and receive operations avoids the need for implicitly
created threads to encapsulate an asynchronous operation, but does not support composability. Our design equips asynchronous events with the following properties: (i) they are
extensible both with respect to pre- and post-creation as well as pre- and post-consumption
actions; (ii) they can operate over the same channels that synchronous events operate over;
meaning channels are agnostic to whether they are used synchronously or asynchronously,
allowing both kinds of events to seamlessly co-exist; and, (iii) their visibility, ordering, and
semantics are independent of the underlying runtime and scheduling infrastructure.
7.2
Asynchronous Events
In order to provide primitives that adhere to the properties outlined above, we extend
CML with the following two base events: aSendEvt and aRecvEvt , for creating an asynchronous send event and an asynchronous receive event respectively. Although similar,
asynchronous events are not syntactic sugar as they cannot be expressed using CML prim-
131
itives. The differences in their type signature from their synchronous counterparts reflect
the split in the creation and consumption of the communication action they define:
sendEvt : ’a chan * ’a -> unit Event
aSendEvt : ’a chan * ’a -> (unit, unit) AEvent
recvEvt : ’a chan -> ’a Event
aRecvEvt : ’a chan -> (unit, ’a) AEvent
An AEvent value is parametrized with respect to the type of the event’s post-creation
and post-consumption actions. In the case of aSendEvt , both actions are of type unit :
when synchronized on, the event immediately returns a unit value and places its ’a
argument value on the supplied channel. The post-consumption action also yields unit .
When synchronized on, an aRecvEvt returns unit ; the type of its post-consumption
action is ’a reflecting the type of value read from the channel when it is paired with a
send.
In conjunction with the new base events, we introduce a new synchronization primitive:
aSync , to synchronize asynchronous events. The aSync operation fires the computation
encapsulated by the asynchronous event of type (’a, ’b) AEvent and returns a value
of type ’a , corresponding to the return type of the event’s post-creation action (see Figure 7.2).
sync  : ’a Event -> ’a
aSync : (’a, ’b) AEvent -> ’a
Unlike their synchronous variants, asynchronous events do not block if no matching
communication is present. For example, executing an asynchronous send event on an empty
channel places the value being sent on the channel and then returns control to the executing
thread (see Figure 7.2(a)). In order to allow this non-blocking behavior, an implicit thread
of control is created for the asynchronous event when the event is paired, or consumed as
shown in Figure 7.2(b). If a receiver is present on the channel, the asynchronous send event
behaves similarly to a synchronous event; it passes the value to the receiver. However, it
132
Figure 7.2. The figure shows a complex asynchronous event ev , built
from a base event aSendEvt , being executed by Thread 1. (a) When the
event is synchronized via aSync , the value v is placed on channel c and
post-creation actions are executed. Afterwards, control returns to Thread
1. (b) When Thread 2 consumes the value v from channel c , an implicit
thread of control is created to execute any post-consumption actions.
still creates a new implicit thread of control if there are any post-consumption actions to be
executed.
Similarly, the synchronization of an asynchronous receive event does not yield the value
received. Instead, it simply enqueues the receiving action on the channel. Therefore, the
thread which synchronizes on an asynchronous receive always gets the value unit, even if
a matching send exists (see Figure 7.3(a)). The actual value consumed by the asynchronous
receive can be passed back to the thread which synchronized on the event through the use
of combinators that process post-consumption actions (see Figure 7.3(b)). This is particularly well suited to encode reactive programming idioms: the post-consumption actions
encapsulate a reactive computation.
To illustrate the differences between the primitives, consider the two functions f and
af shown below:
133
fun f () =
  (spawn (fn () => sync (sendEvt(c, v)));
   sync (sendEvt(c, v’));
   sync (recvEvt(c)))

fun af () =
  (spawn (fn () => sync (sendEvt(c, v)));
   aSync (aSendEvt(c, v’));
   sync (recvEvt(c)))
The function f , if executed in a system with no other threads, will always block because
there is no recipient available for the send of v’ on channel c . On the other hand, suppose
there was another thread willing to accept communication on channel c . In this case, the
only possible value that f could receive from c is v . This occurs because the receive will
only happen after the value v’ is consumed from the channel. Notice that if the spawned
thread enqueues v on the channel before v’ , the function f will block even if another
thread is willing to receive a value from the channel, since a thread cannot synchronize with itself.
The function af , on the other hand, will never block. The receive may see either the
value v or v’ since the asynchronous send event only asserts that the value v’ has been
placed on the channel and not that it has been consumed.
Consider the following refinement of af :
fun af’ () =
  (aSync (aSendEvt(c, v’));
   spawn (fn () => sync (sendEvt(c, v)));
   sync (recvEvt(c)))
Assuming no other threads exist that read from c , the receive in af’ can only witness
the value v’ . Although the spawn occurs before the synchronous receive, the channel c
is guaranteed to contain the value v’ prior to v . While asynchronous events do not block,
they still enforce ordering constraints that reflect the order in which they were created
within the same thread, based on their channels. This distinguishes their behavior from our
134
Figure 7.3. The figure shows a complex asynchronous event ev , built
from a base event aRecvEvt , being executed by Thread 1. (a) When the
event is synchronized via aSync , the receive action is placed on channel
c and post-creation actions are executed. Afterwards, control returns to
Thread 1. (b) When Thread 2 sends the value v to channel c , an implicit thread of control is created to execute any post-consumption actions
passing v as the argument.
initial definition of asynchronous events that explicitly encoded asynchronous behavior in
terms of threads.
7.2.1
Combinators
In CML, the wrap combinator allows for the specification of a post-synchronization
action. Once the event is completed the function wrapping the event is evaluated. For
asynchronous events, this means the wrapped function is executed after the action the event
encodes is placed on the channel and not necessarily after that action is consumed.
sWrap : (’a, ’b) AEvent * (’a -> ’c) -> (’c, ’b) AEvent
aWrap : (’a, ’b) AEvent * (’b -> ’c) -> (’a, ’c) AEvent
135
To allow for the specification of both post-creation and post-consumption actions for
asynchronous events, we introduce two new combinators: sWrap and aWrap . sWrap is
used to specify post-creation actions. The combinator aWrap , on the other hand, is used to
express post-consumption actions. We can apply sWrap and aWrap to an asynchronous
event in any order.
sWrap(aWrap(e, f), g) ≡ aWrap(sWrap(e, g), f)
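As an illustrative sketch of this equivalence, the two definitions below extend an asynchronous send with a post-creation action and a post-consumption action in either order; the channel c, the value v, and the functions logCreation and logConsumption are assumed names used only for illustration.

val e1 = sWrap (aWrap (aSendEvt (c, v), logConsumption), logCreation)
val e2 = aWrap (sWrap (aSendEvt (c, v), logCreation), logConsumption)
(* e1 and e2 describe the same protocol: logCreation runs after the value is
   deposited on c, and logConsumption runs in the implicit thread once the
   value is consumed. *)
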
Since post-creation actions have been studied in CML extensively, we focus our discussion on aWrap and the specification of post-consumption actions. Consider the following
program fragment:
fun f() =
  let val clocal = channel()
  in aSync (aWrap(aSendEvt(c, v), fn () => send(clocal, ())));
     g();
     recv(clocal);
     h()
  end
The function f first allocates a local channel clocal and then executes an asynchronous
send aWrap -ed with a function that sends on the local channel. The function f then
proceeds to execute functions g and h with a receive on the local channel between the
two function calls. We use the aWrap primitive to encode a simple barrier based on
the consumption of v . We are guaranteed that h executes in a context in which v has
been consumed. The function g , on the other hand, can make no assumptions on the
consumption of v . However, g is guaranteed that v is on the channel. Therefore, if g
consumes values from c , it can witness v and, similarly, if it places values on the channel,
it is guaranteed that v will be consumed prior to the values g produces. Note that v could
have been consumed prior to g ’s evaluation.
If the same code was written with a synchronous wrap, we would have no guarantee
about the consumption of v . In fact, the code would block, as the send encapsulated by the
wrap would be executed by the same thread of control executing f . Thus, the asynchronous
136
event implicitly creates a new evaluation context and a new thread of control. The wrapping
function is evaluated in this context, not the thread which performed the synchronization.
We can now encode a very basic callback mechanism using aWrap . The code shown
below performs an asynchronous receive and passes the result of the receive to its wrapped
function. The value received asynchronously is passed as an argument to h by sending on
the channel clocal .
let val clocal = channel()
in aSync (aWrap(aRecvEvt(c), fn x => send(clocal , x)));
...
h(recv(clocal ))
end
Although this implementation suffices as a basic callback, it is not particularly abstract
and cannot be composed with other asynchronous events. We can create an abstract callback mechanism using both sWrap and aWrap around an input event.
callbackEvt : (’a, ’c) AEvent * (’c -> ’b) -> (’b Event, ’c) AEvent

fun callbackEvt(ev, f) =
  let val clocal = channel()
  in sWrap(aWrap(ev, fn x => (aSync(aSendEvt(clocal, x)); x)),
           fn _ => wrap(recvEvt(clocal), f))
  end
If ev contains post-creation actions when the callback event is synchronized on, they
are executed, followed by execution of the sWrap as shown in Figure 7.4(a). It returns a
new event (call it ev’ ), which when synchronized on will first receive on the local channel
and then apply the function f to the value it receives from the local channel. Synchronizing on this event will block until the event ev is consumed. Once ev is consumed,
its post-consumption actions are executed in a new thread of control since ev is asynchronous (see Figure 7.4(b)). The body of the aWrap -ed function simply sends the result
137
Figure 7.4. The figure shows a callback event constructed from a complex asynchronous event ev and a callback function f . (a) When the
callback event is synchronized via aSync , the action associated with the
event ev is placed on channel c and post-creation actions are executed.
A new event, ev’ , is created and passed to Thread 1. (b) An implicit
thread of control is created after the base event of ev is consumed. Post-consumption actions are executed, passing v , the result of consuming the base event for ev , as an argument. The result of the post-consumption actions, v’ , is sent on clocal . (c) When ev’ is synchronized upon, f is
called with v’ .
of synchronizing on ev (call it v’ ) on the local channel and then passes the value v’ to
any further post-consumption actions. This is done asynchronously because the complex
event returned by callbackEvt can be further extended with additional post consumption
actions. Those actions should not be blocked if there is no thread willing to synchronize on
ev’ . Synchronizing on a callback event, thus, executes the base event associated with ev
and creates a new event as a post-creation action, which when synchronized on, executes
the callback function synchronously, returning the result of the callback.
We can think of the difference between a callback and an aWrap of an asynchronous
event in terms of the thread of control that executes them. Both specify a post-consumption
138
action for the asynchronous event, but the callback, when synchronized upon, is executed
potentially by an arbitrary thread whereas the aWrap is always executed in the implicit
thread created when the asynchronous event is consumed. Another difference is that the
callback can be postponed and only executes when two conditions are satisfied: (i) the
asynchronous event has completed and (ii) the callback is synchronized on. An aWrap
returns once it has been synchronized on, and does not need to wait for other asynchronous
events or post-consumption actions it encapsulates to complete.
A guard of an asynchronous event behaves much the same as a guard of a synchronous
event does; it specifies pre-synchronization actions (i.e. pre-creation computation):
aGuard : (unit -> (’a, ’b) AEvent) -> (’a, ’b) AEvent
To see how we might use asynchronous guards, notice our definition of callbackEvt
has the unfortunate drawback that it allocates a new local channel regardless of whether or
not the event is ever synchronized upon. The code below uses an aGuard to specify the
allocation of the local channel only when the event is synchronized on:
fun callbackEvt(ev, f) =
  aGuard(fn () =>
    let val clocal = channel()
    in sWrap(aWrap(ev, fn x => (aSync(aSendEvt(clocal, x)); x)),
             fn _ => wrap(recvEvt(clocal), f))
    end)
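The following usage sketch shows the intended idiom; the channel reqCh and the function process are assumed names, not part of the ACML interface.

val replyEvt = aSync (callbackEvt (aRecvEvt reqCh, fn reply => process reply))
(* ... other work proceeds here while the receive is outstanding ... *)
val result = sync replyEvt   (* blocks only until the receive on reqCh has been paired *)
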
One of the most powerful combinators provided by CML is a non-deterministic choice
over events. The combinator choose picks an active event from a list of events. If no
events are active, it waits until one becomes active. An active event is an event which is
available for synchronization. We define an asynchronous version of the choice combinator,
aChoose , that operates over asynchronous events. Since asynchronous events are nonblocking, all events in the list are considered active. Therefore, the asynchronous choice
always non-deterministically chooses from the list of available asynchronous events. We
also provide a blocking version of the asynchronous choice, sChoose , which blocks until
139
one of the asynchronous base events has been consumed. Post-creation actions are not
executed until the choice has been made. 2
choose : ’a Event list -> ’a Event
aChoose : (’a, ’b) AEvent list -> (’a, ’b) AEvent
sChoose : (’a, ’b) AEvent list -> (’a, ’b) AEvent
To illustrate the difference between aChoose and sChoose , consider a complex event
ev defined as follows:
val ev = aChoose[aSendEvt(c, v), aSendEvt(c’,v’)]
If there exists a thread only willing to receive from channel c as shown in Figure 7.5,
aChoose will nonetheless, with equal probability, execute the asynchronous send on c and
c’ . However, if we redefined ev to utilize sChoose instead, the behavior of the choice
changes:
val ev = sChoose[aSendEvt(c, v), aSendEvt(c’,v’)]
Since sChoose blocks until one of the base asynchronous events is satisfiable, if there
is only a thread willing to accept communication on c (as in Figure 7.5), the choice will
only select the event encoding the asynchronous send on c .
We have thus far provided a mechanism to choose between sets of synchronous events
and sets of asynchronous events. However, we would like to allow programmers to choose
between both synchronous and asynchronous events. Currently, their different type structure would prevent such a formulation. Notice, however, that an asynchronous event with
type (’a, ’b) AEvent and a synchronous event with type ’a Event both yield ’a in
the thread which synchronizes on them. Therefore, it is sensible to allow choice to operate over both asynchronous and synchronous events provided the type of the asynchronous
event’s post-creation action is the same as the type encapsulated by the synchronous event.
To facilitate this interoperability, we provide combinators to transform asynchronous event
types to synchronous event types and vice-versa:
aTrans : (’a, ’b) AEvent -> ’a Event
2 This behavior is equivalent to a scheduler not executing the thread which created the asynchronous action until it has been consumed.
140
Figure 7.5. The figure shows Thread 1 synchronizing on a complex asynchronous event ev , built from a choice between two base asynchronous
send events; one sending v on channel c and the other v’ on c’ . Thread
2 is willing to receive from channel c .
sTrans : ’a Event -> (unit, ’a) AEvent
The aTrans combinator takes an asynchronous event and creates a synchronous version by dropping the asynchronous portion of the event from the type (i.e. encapsulating it).
As a result, we can no longer specify post-consumption actions for the event. However, we
can still apply wrap to specify post-creation actions to the resulting synchronous portion
exposed by the ’a Event . Asynchronous events that have been transformed and are part
of a larger choose event are only selected if their base event is satisfiable. Therefore, the
following equivalence holds for two asynchronous events, aEvt1 and aEvt2 :
choose[aTrans(aEvt1), aTrans(aEvt2)] ≡ sChoose[aEvt1, aEvt2]
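Because both alternatives then yield the same type in the synchronizing thread, synchronous and asynchronous events can also be mixed within a single choice, as the sketch below illustrates; the channels c and c' and the value msg are assumed to exist.

val result =
  sync (choose [ wrap (recvEvt c, fn _ => ()),    (* synchronous receive on c          *)
                 aTrans (aSendEvt (c', msg)) ])   (* asynchronous send, selected only
                                                     if a receiver is ready on c'      *)

Both alternatives have type unit Event, so the choice is well typed; the transformed asynchronous send is selected only when its base event is satisfiable.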
The sTrans combinator takes a synchronous event and changes it into an asynchronous
event with no post-creation actions. The wrapped computation of the original event occurs
now as a post-consumption action. We can encode an asynchronous version of alwaysEvt
from its synchronous counterpart.
141
aAlwaysEvt : ’a -> (unit, ’a) AEvent
aNever     : (unit, ’a) AEvent

aAlwaysEvt(v) = sTrans (alwaysEvt(v))
aNever = sTrans never
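More generally, sTrans lets an existing wrapped synchronous event run its wrapped computation as a post-consumption action. In the sketch below, handleReply is an assumed function; once a sender pairs with the receive on c, handleReply runs in the implicit thread rather than in the thread that synchronized on the event.

val ev : (unit, unit) AEvent =
  sTrans (wrap (recvEvt c, fn reply => handleReply reply))
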
We provide a complete listing of the ACML interface in Figure 7.6.
7.2.2
Mailboxes and Multicast
Using asynchronous events, we encoded other CML structures such as mailboxes (i.e.,
buffered channels) and multicast channels, reducing code size and complexity. Mailboxes, or buffered asynchronous channels, are provided by the core CML implementation. A mailbox is a specialized channel that supports asynchronous sends and synchronous receives. However, mailboxes are not built directly on top of CML channels, requiring a specialized structure, on the order of 240 lines of CML code, to support asynchronous
sends. Using asynchronous events, we reduced the original CML mailbox implementation
from 240 LOC to 70 LOC, with a corresponding 52% improvement in performance on
synthetic stress tests exercising various producer/consumer configurations.
Asynchronous events provide the necessary components from which a mailbox structure can be defined, allowing the construction of mailboxes from regular CML channels,
and providing a facility to define asynchronous send events on the mailbox. Having an
asynchronous send event operation defined for mailboxes allows for their use in selective
communication. Additionally, asynchronous events now provide the ability for programmers to specify post-creation and post-consumption actions. The asynchronous send operator and asynchronous send event can be defined as follows:
fun send(mailbox, value) =
CML.aSync(CML.aSendEvt(mailbox, value))
fun sendEvt(mailbox, value) =
CML.aSendEvt(mailbox, value)
142
The synchronous receive and receive event are expressed in terms of regular CML primitives. This highlights the interoperability of asynchronous events with their synchronous
counterparts and provides programmers with a rich interface of combinators to use with mailboxes.
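A minimal sketch of the receive side is shown below, assuming (as described above) that a mailbox is represented by an ordinary CML channel; the asynchronous sends already provide the buffering, so the synchronous operations need no extra machinery.

fun recv mailbox = CML.recv mailbox         (* synchronous receive from the mailbox *)
fun recvEvt mailbox = CML.recvEvt mailbox   (* receive event, usable in choice      *)
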
Multicast channels in CML provide a mechanism to multicast a message to multiple
recipients. Such an operation is naturally expressed using asynchrony. Abstractly, we can
encode a multicast by asynchronously sending the message to all the recipients on the
multicast channel. Similarly to mailboxes, we expressed multicast channels in 60 LOC,
compared to 87 LOC in CML, with a roughly 19% improvement in performance.
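As a rough illustration of the multicast encoding, the sketch below asynchronously forwards a message to every recipient; representing the multicast channel as a list of per-recipient CML channels is an assumption made for illustration, not the actual implementation.

fun multicast (recipients : 'a CML.chan list, v : 'a) =
  (* asynchronously deposit v on each recipient's channel *)
  List.app (fn c => CML.aSync (CML.aSendEvt (c, v))) recipients

Because asynchronous sends issued by the same thread preserve per-channel ordering, successive multicasts are observed by each recipient in the order in which they were issued.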
7.3
Semantics
Our semantics (see Figure 7.11, Figure 7.12, and Figure 7.13) is defined in terms of
a core call-by-value functional language with threading and communication primitives.
Communication between threads is achieved using channels and events. In our syntax, v
ranges over values, p over primitive event constructors, and e over expressions. Besides
abstractions, a value can be a message identifier, used to track communication actions, a
channel identifier, or an event context. An event context (ε[]) demarcates event expressions
that are built from asynchronous events and their combinators 3 that are eventually supplied
as an argument to a synchronization action. The rules use function composition f ◦ g ≡
λx. f (g(x)) to sequence event actions and computations.
The semantics also includes a new expression form, {e1 , e2 } to denote asynchronous
communication actions; the expression e1 corresponds to the creation (and post-creation) of
the asynchronous event, while e2 corresponds to the consumption (and post-consumption)
of the asynchronous event. For convenience, both synchronous and asynchronous events
are expressed in this form. For a synchronous event, e2 simply corresponds to an uninteresting action. We refer to e1 as the synchronous portion of the event, the expression which is executed by the current thread, and e2 as the asynchronous portion of the event, the expression which is executed by a newly created thread (see rule (SyncEvent)).
3 We describe the necessity of a guarded event context when we introduce the combinators later in this section.
A program state consists of a set of threads (T ), a communication map (∆), and a channel map (C ). The communication map is used to track the state of an asynchronous action,
while the channel map records the state of channels with respect to waiting (blocked) actions. Evaluation is specified via a relation (→) that maps one program state to another.
Evaluation rules are applied up to commutativity of parallel composition (∥). The semantics makes use of two auxiliary relations (⇒) and (;).
7.3.1
Encoding Communication
A communication action is split into two message parts: one corresponding to a sender
and the other to a receiver. A send message part is, in turn, composed of two conceptual
primitive actions: a send act (sendAct(c, v)) and a send wait (sendWait):
sendAct: (ChannelId × Val) → MessageId → MessageId
sendWait: MessageId → Val
The send act primitive, when applied to a message identifier, places the value (v) on
the channel (c), while the send wait, when applied to a message identifier, blocks until the
value has been consumed off of the channel, returning unit when the message has been
consumed.
The message identifier m, generated for each base event (see rule (SyncEvent)), is used to correctly pair the “act” and the “wait”. Similarly, a receive message part is composed
of receive act (recvAct(c)) and a receive wait (recvWait) primitives:
recvAct: ChannelId → MessageId → MessageId
recvWait: MessageId → Val
A receive wait behaves as its send counterpart. A receive act removes a value from
the channel if a matching send action exists; if none exists, it simply records the intention
of performing the receive on the channel queue. We can think of computations occurring
144
after an act as post-creation actions and those occurring after a wait as post-consumption
actions. Splitting a communication message part into “act” and “wait” primitives allows for the expression of many types of message passing. For instance, a traditional synchronous send is simply the sequencing of a send act followed by a send wait:
sendWait ◦ sendAct(c, v). This encoding immediately causes the thread executing the
operation to block after the value has been deposited on a channel, unless there is a matching receive act currently available. A synchronous receive is encoded in much the same
manner, as recvWait ◦ recvAct(c); the receiving thread blocks until a matching send supplies a value.
We use the global communication map (∆) to track act and wait actions for a given
message identifier. A message id is created at a synchronization point, ensuring a unique
message identifier for each synchronized event. At a synchronization point, the message is
mapped to (⊥) in the communication map. Once a send or receive act occurs, ∆ is updated
to reflect the value yielded by the act by rule (Message), through an auxiliary relation (⇒). When a send act occurs the communication map will hold a binding to unit for the corresponding message, but when a receive act occurs the communication map binds the corresponding message to the value received. The values stored in the communication map are passed to the wait actions corresponding to the message by rules (SendWait) and (RecvWait). During the evaluation of a choice the communication map is updated with bindings of multiple message ids to a choice id (ω). When one of the messages is mapped to a value, all other messages which map to a choice id are mapped to (⊤) instead by rules (MessageChoice).
7.3.2
Base Events
There are four rules for creating base events, (SendEvent) and (RecvEvent) for synchronous events, and (ASendEvent) and (ARecvEvent) for their asynchronous
counterparts. From base act and wait actions, we define asynchronous events:
ε[{sendAct(c, v), sendWait}]
145
The first component of an asynchronous event is executed in the thread in which the expression evaluates, and is the target of synchronization (sync ), while the second component
defines the actual asynchronous computation. For asynchronous events we split the act
from the wait. Synchronous events can also be encoded using this notation: ε[{sendWait ◦
sendAct(c, v), λx.unit}]. In a synchronous event both the act and its corresponding wait
occur in the synchronous portion of the event. The base asynchronous portion is simply a
lambda that yields a unit value.
7.3.3
Event Evaluation
In rule (SyncEvent), events are deconstructed by the sync operator. It strips the event
context (ε[]), generates a new message identifier for the base event, creates a new thread of
control, and triggers the evaluation of the internal expressions. The asynchronous portion
of the event is wrapped in a new thread of control and placed in the regular pool of threads.
If the event abstraction being synchronized was generated by a base synchronous event, the
asynchronous portion is an uninteresting value (e.g., λx.unit). The newly created thread,
in the case of an asynchronous event, will not be able to be evaluated further as it blocks
until the corresponding act for the base event comprising the complex asynchronous event
is discharged.
7.3.4
Communication and Ordering
There are four rules for communicating over channels: (SendMatch), (RecvMatch), (SendBlock), and (RecvBlock). The channel map (C) encodes abstract channel states mapping a channel to a sequence of actions (A). This sequence encodes a FIFO queue and provides ordering between actions on the channel. The channel will have either a sequence of send acts (As) or receive acts (Ar), but never both at the same time. This is because if there are, for example, send acts enqueued on it, a receive action will immediately match the send, instead of needing to be enqueued, and vice versa (rules (SendMatch) and (RecvMatch) as well as the rules (EnqueueSend) and (EnqueueReceive)). If
146
a channel already has send acts enqueued on it, any thread wishing to send on the channel
will enqueue its act, and vice versa (rules (SendBlock) and (RecvBlock)). After enqueueing its action, a thread can proceed with its evaluation. The auxiliary relation (;) enqueues a given action on a channel and yields a new channel map with this change.
Ordering for asynchronous acts and their post consumption actions as well as blocking
of synchronous events is achieved by rules (SendWait) and (RecvWait). Both rules block the evaluation of a thread until the corresponding act has been evaluated. In the case of synchronous events, this thread is the one that initiated the act; in the case of an asynchronous event, the thread that creates the act is different from the one that waits on it, and the blocking rules only block the implicitly created thread. For example, the condition ∆(m) = unit in rule (SendWait) is established either by rule (SendMatch), in the case of a synchronous action (created by (SendEvent)), or by rules (SendBlock) and (RecvMatch) for an asynchronous one (created by (ASendEvent)).
7.3.5
Combinators
Complex events are built from the combinators described earlier; their definitions are
shown in Figure 7.14. We define two variants of wrap, sWrap for specifying extensions
to the synchronous portion of the event and aWrap for specifying extension to the asynchronous portion of the event. In the case of a synchronous event, we have sWrap extend the event with post-consumption actions as the base event will perform both the act
and wait in the synchronous portion of the event. Similarly, leveraging aWrap on a synchronous event allows for the specification of general asynchronous actions. If the base
event is asynchronous, sWrap expresses post creation actions and aWrap post consumption
actions.
The specification of the guard combinator is a bit more complex. Since a guard builds
an event expression out of a function, that when executed yields an event, the concrete
event is only generated at the synchronization point. This occurs because the guard is
only executed when synchronized upon. The rule (Guard) simply places the function
147
applied to a unit value (the function is really a thunk) in a specialized guarded event context
(ε[(λx.e) unit]g). The rule (SyncGuardedEvent) simply strips the guarded event
context and synchronizes on the encapsulated expression. This expression, when evaluated,
will yield an event. Guarded events cannot be immediately extended with an sWrap or
aWrap as the expression contained within a guarded event context is a function. Instead,
wrapping an event in a guarded context simply moves the wrap expression into the event
context.
7.3.6
Choose Events
There are two choose event constructors, given in Figure 7.15, aChoose, and sChoose.
Each event constructor takes a list of input events and performs a choice over the list. Since
aChoose picks an asynchronous event regardless of whether the asynchronous portion of the event is satisfiable, we can make the choice prior to synchronization. Notice that the rule (AChooseEvent) simply picks one of the asynchronous events. This formulation allows us to simplify the rules. The rule (SChooseEvent) creates a complex event that, when synchronized upon, will perform a non-deterministic choice between the satisfiable events. There is one rule for flattening choices: (SChooseFlatten). This rule is based on the following equivalence: choose(choose(e, e’), e’’) ≡ choose(e, e’, e’’). Flattening
choices provides further simplification of the rules, namely the rules for blocking until an
event within a choice becomes satisfiable.
7.3.7
Synchronization of Choice
The rules for evaluating sChoose are shown in Figure 7.16. When an sChoose event
is synchronized upon, the rule (SyncSChoose) generates one message id for each input event within the choice. These message ids are mapped to (⊥) in the communication map. If one of the events is immediately satisfiable, the sChoose complex event evaluates this base event in rule (SChoose). In this case a new thread of control is created to evaluate the asynchronous portion of the event. If none of the input events is satisfiable, the sChoose
148
event blocks. Both rules for blocking enqueue each of the input events’ actions on a channel through the relation (;). Each message id is mapped to the choice id generated for this choice. The rules (MessageChoice) ensure that only one of the actions associated with the message ids will be matched. When one of the input events becomes satisfiable, we create a new thread of control to evaluate the asynchronous portion by rule (SChooseUnblock). Since all the input events’ actions were enqueued on channels, these actions must be removed after the evaluation of the choice. The rule (ChannelClean) removes all actions from a channel if their associated message id maps to (⊤) in the communication map. The rules (MessageChoice) map all message ids associated with a particular choice id to (⊤). There are two versions of the block rule since we may have encoded a synchronous
or asynchronous variant of choice. The synchronous variant of choice corresponds to the
CML choose event and the asynchronous to our sChoose event.
7.3.8
sWrap, aWrap, and Guard of Choose Events
Notice that choose events introduce a new syntactic event form that contains base
events. The structure of asynchronous choice prevents the application of the sWrap and
aWrap combinators to the base asynchronous events comprising the asynchronous choose
event as both combinators expect a single asynchronous event. This leads us to define
sWrap and aWrap combinators for choose events in Figure 7.17. Abstractly, the sWrap and
aWrap rules for choose events apply the sWrapped or aWrapped function to all of the input
events of the choose event. Since only one of the events will be chosen, this provides the
desired semantics while still allowing us to flatten choose events.
Since guarded events have not yet been evaluated into events we cannot immediately
construct a sChoose event out of them. Thus, the expression is moved into the guarded
event context in much the same way as an aWrap or sWrap in rule (SChooseGuard).
When the guarded event is synchronized upon, the expression will be evaluated into an
event and then a complex sChoose event will be created.
149
7.4
Implementation
We implemented asynchronous events in Multi-MLton. Our implementation closely
follows the semantics given in Section 7.3. Allowing synchronous and asynchronous events
to seamlessly co-exist involved implementing a unified framework for events, which is agnostic to the underlying channel, scheduler, and runtime implementation. The implementation of asynchronous events is thus composed of four parts: (i) the definition of base event
values, (ii) the internal synchronization protocols for base events, (iii) synchronization protocols for choice, and (iv) the definition of various combinators. The implementation is
roughly 4K lines of ML code. In order to provide a uniform environment for both synchronous and asynchronous events, we retain the channel structure, scheduler and runtime
implementation of Parallel CML and have chosen to define an extended representation for
base event values and a new synchronization protocol.
In the implementation, asynchronous events are represented as a union type parametrized
by two polymorphic types. The first component of the pair represents the post creation action along with the base synchronous action while the second component represents the
post consumption action along with the base asynchronous event.
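A hypothetical sketch of this representation is shown below; the constructor and field names are illustrative only and do not reflect the actual Multi-MLton definitions, and the Event type is the CML event type used throughout this chapter.

datatype ('a, 'b) AEvent =
  AEvent of { creation    : 'a Event,      (* base synchronous action plus post-creation actions     *)
              consumption : unit -> 'b }   (* base asynchronous action plus post-consumption actions *)
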
7.5
Case Study: A Parallel Web-server
We briefly touch upon three aspects of Swerve’s design that were amenable to using
asynchronous events, and show how these changes lead to a substantial improvement in throughput and performance of over 4.7X. To better understand the utility of asynchronous
events, we consider the interactions of four of Swerve’s modules: the Listener, the File
Processor, the Network Processor, and the Timeout Manager.
7.5.1
Lock-step File and Network I/O
Swerve was engineered assuming lock-step file and network I/O. While adequate
under low request loads, this design has poor scalability characteristics. This is because (a)
150
file descriptors, a bounded resource, can remain open for potentially long periods of time,
as many different requests are multiplexed among a set of compute threads, and (b) for a
given request, a file chunk is read only after the network processor has sent the previous
chunk. Asynchronous events can be used to alleviate both bottlenecks.
To solve the problem of lockstep transfer of file chunks, we might consider using simple asynchronous sends. However, Swerve was engineered such that the file processor was
responsible for detecting timeouts. If a timeout occurs, the file processor sends a notification to the network processor on the same channel used to send file chunks. Therefore,
if asynchrony was used to simply buffer the file chunks, a timeout would not be detected
by the network processor until all the chunks were processed. Changing the communication structure to send timeout notifications on a separate channel would entail substantial
structural modifications to the code base.
The code shown below is a simplified version of the file processing module modified to use asynchronous events. It uses an arbitrator defined within the file processor to
manage the file chunks produced by the fileReader. Now, the fileReader sends file
chunks asynchronously to the arbitrator on the channel arIn (line 12). Note that each such
asynchronous send acts as an arbitrator for the next asynchronous send. The arbitrator
accepts file chunks from the fileReader on this channel and synchronously sends the file
chunks to the consumer as long as a timeout has not been detected. This is accomplished
by choosing between an abortEvt (used by the Timeout manager to signal a timeout)
and receiving a chunk from the file processing loop (lines 13-20). When a timeout is detected,
an asynchronous message is sent on channel arOut to notify the file processing loop of
this fact (line 9); subsequent file processing then stops. This loop synchronously chooses
between accepting a timeout notification (line 17), or asynchronously processing the next
chunk (lines 11 - 12). The arbitrator executes as a post-consumption action.
datatype Xfr = TIMEOUT | DONE | X of chunk

1.  fun fileReader name abortEvt consumer =
2.    let
3.      val (arIn, arOut) = (channel(), channel())
4.      fun arbitrator() = sync
5.        (choose [
6.           wrap (recvEvt arIn,
7.                 fn chunk => send (consumer, chunk)),
8.           wrap (abortEvt, fn () =>
9.                 (aSync(aSendEvt(arOut, ()));
10.                 send(consumer, TIMEOUT)))])
11.     fun sendChunk(chunk) =
12.       aSync(aWrap(aSendEvt(arIn, X(chunk)), arbitrator))
13.     fun loop strm =
14.       case BinIO.read (strm, size)
15.        of SOME chunk => sync
16.           (choose [
17.              recvEvt arOut,
18.              wrap(alwaysEvt,
19.                   fn () => (sendChunk(chunk);
20.                             loop strm))])
21.         | NONE => aSync(aSendEvt(arIn, DONE))
22.     val _ = aSync(aWrap(aRecvEvt(arIn),
23.               fn chunk => send(consumer, chunk)))
24.   in
25.     case BinIO.openIt name of
26.       NONE => ()
27.     | SOME strm => (loop strm; BinIO.closeIt strm)
28.   end
Since asynchronous events operate over regular CML channels, we were able to modify
the file processor to utilize asynchrony without having to change any of the other modules or the communication patterns and protocols they expect. Being able to choose between synchronous and asynchronous events in the fileReader function also allowed
us to create a buffer of file chunks, but stop file processing if a timeout was detected by
152
the arbitrator. Recall, each asynchronous send acts as an arbitrator for the next asynchronous send.
7.5.2
Underlying I/O and Logging
To improve scalability and responsiveness, we also implemented a non-blocking I/O
library composed of a language-level interface and associated runtime support. The library
implements all MLton I/O interfaces, but internally utilizes asynchronous events. The library is structured around callback events as defined in Section 7.2.1 operating over I/O
resource servers. Internally, all I/O requests are translated into a potential series of callback
events.
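To give a flavor of the intended structure (this is a speculative sketch, not the library's actual interface), an I/O request to a resource server might be phrased as an asynchronous send of a request descriptor followed by a callback event for the eventual reply; reqCh, replyCh, Read, and k are assumed names.

fun requestRead (reqCh, replyCh, descriptor, k) =
  (aSync (aSendEvt (reqCh, Read descriptor));      (* issue the request asynchronously *)
   aSync (callbackEvt (aRecvEvt replyCh, k)))      (* event for processing the reply   *)

Synchronizing on the returned event applies k to the data once the reply has been consumed from replyCh.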
Web-servers utilize logging for administrative purposes. For long running servers, logs
tend to grow quickly. Some web-servers (like Apache) solve this problem by using a rolling
log, which automatically opens a new log file after a set time period (usually a day). In
Swerve, all logging functions were done asynchronously. Using asynchronous events, we
were able to easily change the logging infrastructure to use rolling logs. Because asynchronous events preserve ordering guarantees, log entries reflect actual thread action order.
Post consumption actions were utilized to implement the rolling log functionality, by closing old logs and opening new logs after the appropriate time quantum.
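The sketch below illustrates the idea; logChan and rollIfExpired are assumed names, with rollIfExpired standing in for the code that closes the current log and opens a new one once the time quantum has elapsed.

fun logAsync (logChan, entry) =
  (* the roll runs as a post-consumption action, in the implicit thread,
     only after the log server has actually consumed the entry *)
  aSync (aWrap (aSendEvt (logChan, entry), fn () => rollIfExpired ()))
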
In addition, in Swerve the logging infrastructure is tasked with exiting the system if
a fatal error is detected. The log notes the occurrence of the error, flushes the log to disk, and then exits the system. This ensures that the log contains a record of the error prior to the system’s exit. Unfortunately, for the modules that utilize logging, this poses additional complexity and breaks modularity. Instead of logging the error at the point at which it occurred, the error must be logged after the module has performed any clean up
actions because of the synchronous communication protocol between the module and the
log. Thus, if the module logs any actions during the clean up phase, they will appear in
the log prior to the error. Instead, we can leverage our callback event to extend the module
without changing the communication protocol to the log.
let val logEvt  = aSendEvt(log, fatalErr)
    val logEvt' = callbackEvt(logEvt,
                    fn () => (Log.flush();
                              System.exit()))
    val exitEvt = aSync(logEvt')
in (clean up; sync(exitEvt))
end
In the code above, logEvt corresponds to an event that encapsulates the communication protocol the log expects: a simple asynchronous send on the log's input channel log. The event logEvt' contains a callback. When synchronized, this event performs an asynchronous send to the log and then returns a new event, exitEvt. When exitEvt is synchronized upon, we are guaranteed that the log has received the notification of the fatal error. With this change we can also simplify the log itself, removing both the checks that determine whether a logged message corresponds to a fatal error and the exit mechanism; logging and system exit are no longer conflated. A sketch of the resulting, simplified log server is given below.
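For illustration, a stripped-down log server consistent with this description might look as follows; Log.append and logCh are hypothetical names, and the actual Swerve logger naturally does more than append entries.

(* Hedged sketch: the simplified log server only appends entries; fatal-error *)
(* handling now lives entirely with the caller via the callback event.        *)
fun logServer logCh =
  let
    fun loop () = (Log.append (recv logCh); loop ())
  in
    spawn loop
  end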
7.5.3 Performance Results
To measure the efficiency of our changes in Swerve, we leveraged the server's internal timing and profiling output for per-module accounting. The benchmarks were run on an AMD Opteron 865 server with 8 processors, each containing two symmetric cores, and 32 GB of total memory, with each CPU having its own local 4 GB memory. The results, as well as the changes to the largest modules, are summarized in Table 7.1. Translating the implementation to use asynchronous events leads to a 4.7X performance improvement as well as a 15X reduction in client-side observed latency over the original, with only 103 lines of code changed out of 16 KLOC. To put these numbers in perspective, our modified version of Swerve with asynchronous events has a throughput within 10% of Apache 2.2.15 on workloads that establish up to 1000 concurrent connections and process small/medium files at a total rate of 2000 requests per second. For server performance measurements and workload generation we used httperf, a tool for measuring web-server performance.
Table 7.1
Per-module performance numbers for Swerve.

  Module              LOC      LOC modified   Improvement
  Listener            1188     11             2.15X
  File Processor      2519     35             19.23X
  Network Processor   2456     25             24.8X
  Timeout Manager     360      15             4.25X
  Swerve              16,000   103            4.7X
7.6 Related Work
Many functional programming languages, such as Erlang [1], JoCaml [79], and F# [83, 84], provide intrinsic support for asynchronous programming. In Erlang, message sends are inherently asynchronous. Unlike CML, in Erlang messages are sent between processes and are the only way two processes can communicate. Processes may be located in the same VM or distributed among numerous VMs and physically distinct computers. Erlang at its core does not have mutable state; instead, updates are typically encoded by passing arguments to recursive functions that run as servers. Such servers, whether in Erlang or CML, can be considered “reactive”: they only execute when another thread, or process, wishes to communicate. We believe that the combinators and programming idioms presented in this chapter can be applied to languages like Erlang.
There are a number of languages that are derived from the Join Calculus [85] that provide some intrinsic support for join patterns (we discuss them later in this section). The Join
Calculus is a process calculus aimed primarily at distributed and mobile programming, but
is equally well suited for concurrent programming. The Join Calculus is structured around
processes that communicate via messages over named channels. Messages are “consumed”
or matched through join patterns. A join pattern guards an expression that is executed once
the pattern is satisfied. The pattern itself specifies what messages and values it requires to
be satisfied and on what channels it expects the values. JoCaml is derived directly from
the Join Calculus and is used for both distributed and concurrent programming. In JoCaml,
complex asynchronous protocols are defined using join patterns [85, 86] that define synchronization protocols over asynchronous and synchronous channels. We can view a join
pattern as defining a post-consumption action for a set of communication actions (those
specified in the pattern itself). Notice, in this setting a given communication action may
have multiple different post-consumption actions specified for it based on which pattern it
participates in. In contrast, our combinators specify post-consumption actions regardless
of which thread the communication action is paired with. Our formulation allows us to
compose multiple post-consumption actions with a given event seamlessly.
In F#, asynchronous behavior is defined using asynchronous workflows that permit asynchronous objects to be created and synchronized. Convenient monadic-style let! syntax permits callbacks (i.e., continuations) to be created within an asynchronous computation. The callback defines the post-computation function for an asynchronous operation. While these abstractions and paradigms provide expressive ways to define asynchronous computations, they do not provide a convenient mechanism to specify composable asynchronous abstractions, especially with respect to asynchronous post-consumption actions. It is the investigation of this important aspect of asynchronous programming, and its incorporation within a CML-style event framework, that distinguishes the contributions of this chapter from these other efforts.
Reactive programming [87] is an important programming style, often found in systems programming, that uses event loops to react to outside events (typically related to I/O). In this context, events do not define abstract communication protocols (as they do in CML), but typically represent I/O actions delivered asynchronously by the underlying operating system, by sensors, or over the network. While understanding how reactive events and threads can co-exist is an important question, it is orthogonal to the focus of this work. Indeed, we can encode reactive-style programming idioms in ACML through the use of asynchronous receive events and/or lightweight servers.
There have also been efforts to simplify event-driven (reactive) asynchronous programming in imperative languages [82] by providing new primitives that are amenable to compiler analysis. Instead of having programmers weave complex asynchronous protocols and reason about non-local control flow, these approaches provide mechanisms to specify reactions whenever certain conditions hold. This is accomplished through specialized non-blocking function calls, high-level coordination primitives that make thread interactions explicit, and linearity obligations that couple one thread to each asynchronous function call. Although developed in a different context, we believe ACML is synergistic with such approaches, as ACML provides a robust programming model for explicitly defining thread interactions.
Other programming languages support different notions of events explicitly through asynchronous methods or similar constructs; we refer to these collectively as languages for event-based programming. Examples include EventJava [88], ECO [89], AmbientTalk [90], and JavaPS [91], as well as Actor-based languages and language extensions such as Erlang [1] or Scala Actors [92]. Event correlation is an important programming idiom that allows programmers to specify resulting actions based on a series of events. Most languages supporting event correlation, such as Polyphonic C# [93] (now integrated with Cω), JoinJava [94], or SCHOOL [95], and libraries such as those for Erlang [96] or Scala [92], are based on the Join Calculus [85]. We can think of event correlation as a pattern that specifies a given action to compute based on a collection of active events. From an event-based programming perspective, we can view CML [2] as a “staged” event-matching system, where the consumption of a first event is a pre-condition for subsequent matching. Namely, post-consumption actions, and any events they encapsulate, are executed after an event is satisfied, or paired.
There have been incarnations of CML in languages and systems other than ML (e.g., Haskell [80, 81], Scheme [49], and MPI [97]). There has also been much recent interest in extending CML with transactional support [22, 43, 44] (discussed in Chapter 5) and other flavors of parallelism [3]. We believe transactional events [22, 43, 44] provide an interesting platform upon which to implement non-blocking versions of sChoose that retain the same semantics. Additionally, we expect that previous work on specialization of CML primitives [77] can be applied to improve the performance of our asynchronous primitives.
7.7 Concluding Remarks
This chapter presents the design, rationale, and implementation of asynchronous events, a concurrency abstraction that generalizes the behavior of CML-based synchronous events to enable composable construction of asynchronous computations. Our experiments indicate that asynchronous events can seamlessly co-exist with other CML primitives and can be effectively leveraged to improve the performance of realistic, highly concurrent applications.
spawn       : (unit -> 'a) -> threadID
channel     : unit -> 'a chan
sendEvt     : 'a chan * 'a -> unit Event
recvEvt     : 'a chan -> 'a Event
send        : 'a chan * 'a -> unit
recv        : 'a chan -> 'a
never       : 'a Event
alwaysEvt   : 'a -> 'a Event
sync        : 'a Event -> 'a
wrap        : 'a Event * ('a -> 'b) -> 'b Event
guard       : (unit -> 'a Event) -> 'a Event
choose      : 'a Event list -> 'a Event

aSendEvt    : 'a chan * 'a -> (unit, unit) AEvent
aRecvEvt    : 'a chan -> (unit, 'a) AEvent
aSend       : 'a chan * 'a -> unit
aRecv       : 'a chan -> unit
aAlwaysEvt  : 'a -> (unit, 'a) AEvent
aNever      : (unit, 'a) AEvent
aSync       : ('a, 'b) AEvent -> 'a
sWrap       : ('a, 'b) AEvent * ('a -> 'c) -> ('c, 'b) AEvent
aWrap       : ('a, 'b) AEvent * ('b -> 'c) -> ('a, 'c) AEvent
aGuard      : (unit -> ('a, 'b) AEvent) -> ('a, 'b) AEvent
aChoose     : ('a, 'b) AEvent list -> ('a, 'b) AEvent
sChoose     : ('a, 'b) AEvent list -> ('a, 'b) AEvent
aTrans      : ('a, 'b) AEvent -> 'a Event
sTrans      : 'a Event -> (unit, 'a) AEvent
callbackEvt : ('a, 'c) AEvent * ('c -> 'b) -> ('b Event, 'c) AEvent
Figure 7.6. CML Event and AEvent operators.
type 'a mbox
val mailbox     : unit -> 'a mbox
val sameMailbox : 'a mbox * 'a mbox -> bool
val send        : 'a mbox * 'a -> unit
val recv        : 'a mbox -> 'a
val recvEvt     : 'a mbox -> 'a event
val recvPoll    : 'a mbox -> 'a option
Figure 7.7. CML mailbox structure.
fun sameMailbox(mailbox1, mailbox2) =
C.sameChannel(mailbox1, mailbox2)
fun send(mailbox, value) =
Channel.aSync(Channel.aSendEvt(mailbox, value))
fun sendEvt(mailbox, value) =
  Channel.aSendEvt(mailbox, value)
fun recv(mailbox) = Channel.sync(C.recvEvt(mailbox))
fun recvEvt(mailbox) = Channel.recvEvt(mailbox)
fun recvPoll(mailbox) = Channel.recvPoll(mailbox)
Figure 7.8. An excerpt of a CML mailbox structure implemented utilizing
asynchronous events.
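As a small usage sketch of the interface above (under the assumption that the structure is named Mailbox and behaves as described, with buffered asynchronous sends and blocking receives):

val mb : int Mailbox.mbox = Mailbox.mailbox ()
val _  = spawn (fn () => Mailbox.send (mb, 42))   (* asynchronous; never blocks  *)
val v  = Mailbox.recv mb                          (* blocks until 42 is buffered *)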
e ∈ Exp  := v | x | p e | e e
          | {e, e'} | spawn e | sync e | ch()
          | sendEvt(e, e) | recvEvt(e)
          | aSendEvt(e, e) | aRecvEvt(e)
          | aWrap(e, e) | sWrap(e, e) | guard(e)
          | aChoose(e1, ..., en) | sChoose(e1, ..., en)

v ∈ Val  := unit | c | m | λx.e | ε[e]
p ∈ Prim := sendAct(c, v) | sendWait | recvAct(c) | recvWait

E := • | E e | v E | p E | sync E
   | sendEvt(E, e) | sendEvt(c, E)
   | aSendEvt(E, e) | aSendEvt(c, E)
   | recvEvt(E) | aRecvEvt(E)
   | aWrap(E, e) | sWrap(E, e) | aWrap(v, E) | sWrap(v, E)
   | guard(E)
   | aChoose(E, e, ...) | aChoose(v, E, ...)
   | sChoose(E, e, ...) | sChoose(v, E, ...)
Figure 7.9. Syntax, grammar, and evaluation contexts for a core language
for asynchronous events.
m ∈ MessageId
c ∈ ChannelId
ω ∈ ChoiceId
ε[e], ε[e]_g ∈ Event

A^m   ∈ Action     := A_r^m | A_s^m
A_r^m ∈ ReceiveAct := R_c^m
A_s^m ∈ SendAct    := S_{c,v}^m

T ∈ Thread           := (t, e)
T ∈ ThreadCollection := ∅ | T | T || T

Δ ∈ CommMap := MessageId → Val + ChoiceId + ⊥ + ⊤
C ∈ ChanMap := ChannelId → Action
⟨T⟩_{Δ,C} ∈ State := ⟨T, CommMap, ChanMap⟩

→ ∈ State → State
⇒ ∈ CommMap × (SendAct + (ReceiveAct × Val)) → CommMap
; ∈ Exp × ChanMap → MessageId × ChanMap
Figure 7.10. Domain equations for a core language for asynchronous events.
(APP)
  ⟨(t, E[(λx.e) v]) || T⟩_{Δ,C} → ⟨(t, E[e[v/x]]) || T⟩_{Δ,C}

(CHANNEL)
  c fresh
  ⟨(t, E[ch()]) || T⟩_{Δ,C} → ⟨(t, E[c]) || T⟩_{Δ,C[c ↦ ∅]}

(SPAWN)
  t' fresh
  ⟨(t, E[spawn e]) || T⟩_{Δ,C} → ⟨(t', e) || (t, E[unit]) || T⟩_{Δ,C}

(SEND EVENT)
  ⟨(t, E[sendEvt(c, v)]) || T⟩_{Δ,C} →
    ⟨(t, E[ε[{sendWait ∘ sendAct(c, v), λx.unit}]]) || T⟩_{Δ,C}

(ASEND EVENT)
  ⟨(t, E[aSendEvt(c, v)]) || T⟩_{Δ,C} →
    ⟨(t, E[ε[{sendAct(c, v), sendWait}]]) || T⟩_{Δ,C}

(RECV EVENT)
  ⟨(t, E[recvEvt(c)]) || T⟩_{Δ,C} →
    ⟨(t, E[ε[{recvWait ∘ recvAct(c), λx.unit}]]) || T⟩_{Δ,C}

(ARECV EVENT)
  ⟨(t, E[aRecvEvt(c)]) || T⟩_{Δ,C} →
    ⟨(t, E[ε[{recvAct(c), recvWait}]]) || T⟩_{Δ,C}
Figure 7.11. A core language for asynchronous events – base events as
well as rules for spawn, function application, and channel creation.
(SYNC EVENT)
  m fresh    t' fresh
  ⟨(t, E[sync ε[{e, e'}]]) || T⟩_{Δ,C} → ⟨(t, E[e m]) || (t', e' m) || T⟩_{Δ[m ↦ ⊥],C}

(MESSAGE)
  Δ(m) = ⊥                                Δ(m) = ⊥
  Δ, S_{c,v}^m ⇒ Δ[m ↦ unit]              Δ, R_c^m, v ⇒ Δ[m ↦ v]

(MESSAGE CHOICE)
  Δ' = Δ[m' ↦ ⊤]  ∀ m' : Δ(m') = ω        Δ' = Δ[m' ↦ ⊤]  ∀ m' : Δ(m') = ω
  Δ[m ↦ ω], S_{c,v}^m ⇒ Δ'[m ↦ unit]      Δ[m ↦ ω], R_c^m, v ⇒ Δ'[m ↦ v]

(SEND MATCH)
  C(c) = R_c^{m'} : A_r^m
  Δ, S_{c,v}^m ⇒ Δ'    Δ', R_c^{m'}, v ⇒ Δ''
  ⟨(t, E[(sendAct(c, v)) m]) || T⟩_{Δ,C} → ⟨(t, E[m]) || T⟩_{Δ'',C[c ↦ A_r^m]}

(RECV MATCH)
  C(c) = S_{c,v}^{m'} : A_s^m
  Δ, S_{c,v}^{m'} ⇒ Δ'    Δ', R_c^m, v ⇒ Δ''
  ⟨(t, E[(recvAct(c)) m]) || T⟩_{Δ,C} → ⟨(t, E[m]) || T⟩_{Δ'',C[c ↦ A_s^m]}
Figure 7.12. A core language for asynchronous events – rules for matching
communication and ordering.
(SEND WAIT)
  Δ(m) = unit
  ⟨(t, E[sendWait m]) || T⟩_{Δ,C} → ⟨(t, E[unit]) || T⟩_{Δ,C}

(RECEIVE WAIT)
  Δ(m) = v
  ⟨(t, E[recvWait m]) || T⟩_{Δ,C} → ⟨(t, E[v]) || T⟩_{Δ,C}

(SEND BLOCK)
  (sendAct(c, v)) m, C ; m, C'
  ⟨(t, E[(sendAct(c, v)) m]) || T⟩_{Δ,C} → ⟨(t, E[m]) || T⟩_{Δ,C'}

(RECV BLOCK)
  (recvAct(c)) m, C ; m, C'
  ⟨(t, E[(recvAct(c)) m]) || T⟩_{Δ,C} → ⟨(t, E[m]) || T⟩_{Δ,C'}

(ENQUEUE SEND)
  C(c) = A_s^m    C' = C[c ↦ A_s^m : S_{c,v}^m]
  (sendAct(c, v)) m, C ; m, C'

(ENQUEUE RECEIVE)
  C(c) = A_r^m    C' = C[c ↦ A_r^m : R_c^m]
  (recvAct(c)) m, C ; m, C'
Figure 7.13. A core language for asynchronous events – rules for waiting,
blocking, and enqueueing.
(SWRAP)
  ⟨(t, E[sWrap(ε[{e, e'}], λx.e'')]) || T⟩_{Δ,C} →
    ⟨(t, E[ε[{λx.e'' ∘ e, e'}]]) || T⟩_{Δ,C}

(AWRAP)
  ⟨(t, E[aWrap(ε[{e, e'}], λx.e'')]) || T⟩_{Δ,C} →
    ⟨(t, E[ε[{e, λx.e'' ∘ e'}]]) || T⟩_{Δ,C}

(GUARD)
  ⟨(t, E[guard(λx.e)]) || T⟩_{Δ,C} → ⟨(t, E[ε[(λx.e) unit]_g]) || T⟩_{Δ,C}

(SYNC GUARDED EVENT)
  ⟨(t, E[sync ε[e]_g]) || T⟩_{Δ,C} → ⟨(t, E[sync e]) || T⟩_{Δ,C}

(SWRAP GUARDED EVENT)
  ⟨(t, E[sWrap(ε[e]_g, λx.e')]) || T⟩_{Δ,C} →
    ⟨(t, E[ε[sWrap(e, λx.e')]_g]) || T⟩_{Δ,C}

(AWRAP GUARDED EVENT)
  ⟨(t, E[aWrap(ε[e]_g, λx.e')]) || T⟩_{Δ,C} →
    ⟨(t, E[ε[aWrap(e, λx.e')]_g]) || T⟩_{Δ,C}
Figure 7.14. A core language for asynchronous events – combinator extensions for asynchronous events.
(ACHOOSE EVENT)
  0 ≤ i ≤ n
  ⟨(t, E[aChoose(ε[e_1], ..., ε[e_n])]) || T⟩_{Δ,C} → ⟨(t, E[ε[e_i]]) || T⟩_{Δ,C}

(SCHOOSE EVENT FLATTEN)
  ε[e_i] = ε[sChoose(e_i1, ..., e_im)]    0 ≤ i ≤ n
  ⟨(t, E[sChoose(ε[e_1], ..., ε[e_n])]) || T⟩_{Δ,C} →
    ⟨(t, E[sChoose(ε[e_1], ..., ε[e_{i-1}], ε[e_i1], ..., ε[e_im], ε[e_{i+1}], ..., ε[e_n])]) || T⟩_{Δ,C}

(SCHOOSE EVENT)
  ⟨(t, E[sChoose(ε[e_1], ..., ε[e_n])]) || T⟩_{Δ,C} →
    ⟨(t, E[ε[sChoose(e_1, ..., e_n)]]) || T⟩_{Δ,C}
Figure 7.15. A core language for asynchronous events – choose events and
choose event flattening.
(SYNC SCHOOSE)
  m_1, ..., m_n fresh
  ⟨(t, E[sync (ε[sChoose({e_1, e'_1}, ..., {e_n, e'_n})])]) || T⟩_{Δ,C} →
    ⟨(t, E[sChoose({e_1 m_1, e'_1 m_1}, ..., {e_n m_n, e'_n m_n})]) || T⟩_{Δ[m_1 ↦ ⊥, ..., m_n ↦ ⊥],C}

(SCHOOSE)
  e_i = {e m, e' m}    0 ≤ i ≤ n    t' fresh
  ⟨(t, E[e m]) || (t', e' m) || T⟩_{Δ,C} → ⟨(t, E[e'']) || (t', e''') || T⟩_{Δ[m ↦ v],C}
  ⟨(t, E[sChoose(e_1, ..., e_n)]) || T⟩_{Δ,C} →
    ⟨(t, E[e'']) || (t', e''') || T⟩_{Δ[m ↦ v],C}

(SCHOOSE BLOCK SYNC)
  ω fresh    Δ' = Δ[m_1 ↦ ω, ..., m_n ↦ ω]
  e_1, C ; m_1, C_1    ...    e_n, C_{n-1} ; m_n, C_n
  ⟨(t, E[sChoose({E_1[e_1], e'_1}, ..., {E_n[e_n], e'_n})]) || T⟩_{Δ,C} →
    ⟨(t, E[sChoose({E_1[m_1], e'_1}, ..., {E_n[m_n], e'_n})]) || T⟩_{Δ',C_n}

(SCHOOSE BLOCK ASYNC)
  ω fresh    Δ' = Δ[m_1 ↦ ω, ..., m_n ↦ ω]
  e'_1, C ; m_1, C_1    ...    e'_n, C_{n-1} ; m_n, C_n
  ⟨(t, E[sChoose({e_1, E_1[e'_1]}, ..., {e_n, E_n[e'_n]})]) || T⟩_{Δ,C} →
    ⟨(t, E[sChoose({e_1, E_1[m_1]}, ..., {e_n, E_n[m_n]})]) || T⟩_{Δ',C_n}

(SCHOOSE UNBLOCK)
  Δ(m_i) = v    0 ≤ i ≤ n    t' fresh
  ⟨(t, E[sChoose({E_1[m_1], e_1}, ..., {E_n[m_n], e_n})]) || T⟩_{Δ,C} →
    ⟨(t, E[E_i[m_i]]) || (t', e_i) || T⟩_{Δ,C}
Figure 7.16. A core language for asynchronous events – synchronizing
and evaluating sChoose events.
(CHANNEL CLEAN)
  C(c) = A^m : A^m    Δ(m) = ⊤
  ⟨T⟩_{Δ,C} → ⟨T⟩_{Δ,C[c ↦ A^m]}

(SCHOOSE SWRAP)
  ⟨(t, E[sWrap(ε[sChoose({e_1, e'_1}, ..., {e_n, e'_n})], λx.e'')]) || T⟩_{Δ,C} →
    ⟨(t, E[ε[sChoose({(λx.e'') e_1, e'_1}, ..., {(λx.e'') e_n, e'_n})]]) || T⟩_{Δ,C}

(SCHOOSE AWRAP)
  ⟨(t, E[aWrap(ε[sChoose({e_1, e'_1}, ..., {e_n, e'_n})], λx.e'')]) || T⟩_{Δ,C} →
    ⟨(t, E[ε[sChoose({e_1, (λx.e'') e'_1}, ..., {e_n, (λx.e'') e'_n})]]) || T⟩_{Δ,C}

(SCHOOSE GUARD)
  ⟨(t, E[sChoose(..., ε[e_n]_g, ...)]) || T⟩_{Δ,C} →
    ⟨(t, E[ε[sChoose(..., e_n, ...)]_g]) || T⟩_{Δ,C}
Figure 7.17. A core language for asynchronous events – combinators for
sChoose events.
8 CONCLUDING REMARKS AND FUTURE DIRECTIONS
Message-passing idioms are becoming more predominant, influencing language, operating-system, and hardware designs. In the context of CML, we have presented abstractions for robust higher-order message-based communication. We have shown how each abstraction can be leveraged to implement new functionality, increase modularity, and improve performance. Stabilizers are a novel lightweight checkpointing mechanism that provides atomicity properties on the rollback of a stable region of code. Partial memoization is a function-caching technique that allows the elision of subsequent function calls even in the presence of multiple communicating threads. Asynchronous CML is a language extension that seamlessly marries asynchronous and synchronous communication in the CML event framework, allowing for the modular creation, maintenance, and extension of asynchronous protocols through post-creation and post-consumption actions.
8.1 Future Directions
In this section we present planned future directions. We split the discussion based on
each language extension.
8.1.1 Stabilizers
There are a number of different future directions we wish to pursue with stabilizers. The first is to leverage their lightweight state-restoration functionality to implement lightweight speculative constructs and language abstractions that provide deterministic parallelism, such as safe futures. Futures are a well-known programming construct found in languages from Multilisp [98] to Java [99] (see Section 5.8 for a more in-depth description and related work). Many recent language proposals [100–102] incorporate future-like constructs.
We would like to incorporate safe futures in the context of ACML and multiple threads of
control. Unlike sequential formulations of safe futures [40], futures created from separate
threads of control can interact. Reversion of a given future may necessitate the reversion
of its communication partners when sequential semantics are violated. Stabilizers are a
natural candidate for providing state reversion in the presence of multiple communicating
threads. Additionally, the monitoring functionality inherent to stabilizers can be leveraged
to perform checks for safety violations.
There are other characterisations of deterministic parallelism and speculation that are
relevant to proposed future work. Although not described in the context of safe futures,
[103] proposed a type and effect system that simplifies parallel programming by enforcing
deterministic semantics. CoreDet is a C/C++ compiler that provides deterministic behavior for programs that leverage pthreads by tracking the ownership of memory locations by
threads, and leveraging a commit protocol to make changes to such locations visible to
other threads of control. Grace [104] is a highly scalable runtime system that eliminates
concurrency errors for programs with fork-join parallelism by enforcing a sequential commit protocol on threads which run as processes. Boudol and Petri [105] provide a definition
for valid speculative computations independent of any implementation technique.
Although our benchmarks indicate that stabilizers are a lightweight checkpointing mechanism, there are a number of optimizations we wish to pursue to limit the overheads of logging and re-execution. Our logging infrastructure would benefit from partial or incremental
continuation grabbing to limit memory overheads. During a stabilize action many threads
and computations may be reverted. However, only a small number of such computations
may actually change during their subsequent re-execution. Identifying such sections of
code could greatly reduce the cost of re-execution after a checkpoint is restored. Function
memoization in a multi-threaded program with channel based communication requires additional monitoring of thread interactions as described in Chapter 6. Such information is
already present within our communication graph and can be leveraged to assist function
memoization.
The communication graphs generated by stabilizers have been leveraged for debugging purposes. Although this was not their intended use, such graphs, coupled with additional debugging information, have proved useful for understanding, debugging, and extending programs that utilize on the order of hundreds of thousands of threads. More importantly, we would like to extend stabilizers to provide deterministic replay of monitored programs. Deterministic replay of both concurrent and parallel programs has been the topic of much interest [106–108]. We envision constructing a scheduler that leverages the communication graph to make scheduling decisions, allowing a programmer to replay a buggy program or a given component. The communication graph captures all important and side-effecting operations, but does not provide a concrete schedule. Abstractly, we can think of the graph as a representation of a set of schedules that produce a given outcome. This simplifies the amount of information needed to successfully replay a program. Similarly, we can leverage the information captured in the graph to reason about the equivalence of two different component implementations at the communication-protocol level.
Transactional events [44] provide additional expressivity to CML, namely the ability to provide safe guard events and three-way rendezvous. Stabilizers can also be utilized to encode safe guards, by wrapping the complex event and unrolling whenever an error condition is encountered. However, stabilizers do not provide the ability to express three-way rendezvous in CML. Current formulations of transactional events rely on exhaustive state-space exploration to implement their transactional behavior. We are working on a prototype implementation of transactional events that leverages stabilizers and a rollback/retry mechanism instead of state-space exploration. In the common case, where only one path is viable in a transactional event, exploring the state space is wasted work; this happens because often only one event in a complex choose event is satisfiable. Thus, stabilizers allow the characterisation of a depth-first search strategy for finding successful communication patterns in a transactional event. If, however, a transactional event is not completed, stabilizers offer the ability to roll back and retry the event. Moreover, the stabilizer graph can be leveraged to guide future executions of the aborted transactional event.
While the use of cut can delimit the extent to which control is reverted as a result of a stabilize call, a more general and robust approach would be to integrate a rational compensation semantics [109] for stabilizers in the presence of stateful operations. To that end, we provide additional extensions to stabilizers to support compensating actions. A compensating action is specified within a stable section; when that stable section is reverted, the compensating actions are executed in the order in which they were generated. Compensating actions can be utilized to revert I/O operations or to perform logging actions whenever a rollback occurs. In addition, we also provide a mechanism for a given thread to revert a given stable section. This allows a parent thread to revert poorly behaving child threads, allowing error handling to be localized in the parent. Such mechanisms are useful, and we would like to pursue them and others to develop coding patterns for the use of stabilizers.
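A minimal sketch of how such a compensating action might be written is given below. Here stable follows the stabilizer interface from Chapter 5, while compensate, the output-file handling, and writeContents are hypothetical names introduced only to illustrate the intended coding pattern.

(* Hedged sketch: compensation for a stateful action inside a stable section. *)
(* compensate and writeContents are hypothetical.                             *)
val writeFile =
  stable (fn (name, contents) =>
    let
      val strm = TextIO.openOut name
      (* the registered compensation runs if this stable section is reverted *)
      val _ = compensate (fn () => (TextIO.closeOut strm;
                                    OS.FileSys.remove name))
    in
      writeContents (strm, contents);
      TextIO.closeOut strm
    end)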
8.1.2 Partial Memoization
Our current formulation of partial memoization treats functions independently. As we
discussed above, combining partial memoization with stabilizers provides the ability for
stabilizers to reduce redundant computation after a checkpoint is restored. However, memoization can leverage the information stored within the communication graph to provide
memo information for computational units. For instance, if a producer and a consumer
were successfully memoized, instead of discharging the constraints for the producer and
the consumer separately, we can leverage the communication graph to discover that the
two are tightly coupled. This would help avoid the discharge of constraints, eliminating
equality checks and synchronization.
Selective memoization [56] is an important technique that provides a framework for
specializing memoization requirements. Consider the function f given below:
fun f(x) =
if x < 5
then 8
else 13
Notice that the resulting value of the function depends on its argument. However, the
branch inside the function depends only on whether x is less than 5. Thus, a call to f with a
supplied argument of 1, when a previous memoized call to f was executed with an argument
of 2, should be able to successfully leverage the memo information gathered at the first call
and elide the second call. Selective memoization provides such functionality. We believe
incorporating such work into our partial memoization strategy will improve memoization
benefits. More specifically, we can leverage selective memoization to improve constraint
generation for choose events. As an example consider the code below:
fun g() =
  sync(choose [recvEvt c, recvEvt c'])

The function g takes a unit argument and performs a selective communication, receiving either from channel c or from channel c'. If, on the first call to g, the value 13 is received from channel c, a constraint is generated for that value and channel in a memo table. Consider a subsequent call to g where another thread was willing to also send 13, but on channel c'. Leveraging selective memoization augmented with CML hooks would allow a successful eliding of the second call to g.
Self-adjusting computation [110, 111] is an emerging field of study in programming languages. It provides the ability for a computation to adjust quickly to small changes in its input, and has been utilized to solve and accelerate dynamic problems in computational geometry and statistical inference. At its core, self-adjusting computation relies on two key concepts: memoization, specifically selective memoization [56], and change propagation [112]. One aspect we would like to explore is leveraging our definition of partial memoization, and its ability to handle threading, synchronization, and communication, to provide self-adjusting computation in CML and, more importantly, self-adjusting computation on multi-core systems. To do this, we would need to integrate our memoization technique and provide a parallel, CML-aware version of change propagation. Change propagation relies heavily on dynamic dependence tracking through a dynamic dependence graph.
8.1.3 Asynchronous CML
One natural extension to ACML would be to add session types to help programmers
maintain protocols, check for component equivalence, and even improve performance. Session types have been proposed as a way to precisely capture complex interactions between
communicating parties [113, 114]. They describe interaction protocols by specifying the
type of messages exchanged between the participants. Implicit control flow information
such as branching and loops can also be enumerated using session types.
Session types [113, 114] allow precise specification of typed distributed interaction,
though we believe we can utilize them in much the same fashion in a multi-core setting. Neubauer and Thiemann first described the operational semantics for asynchronous
session types [115]. Early formulations only handled interaction between two parties,
which has been extended to multi-party interaction by Honda et al. [116] and Bonelli et
al. [117]. The work of Bejleri and Yoshida [118] extends that of Honda et al. [116] for
synchronous communication specification among multiple interacting peers. Session types have been applied to functional languages [119, 120], component-based systems [121], object-oriented languages [122, 123], and operating system services [124]. Asynchronous session types for
Java have been studied [125], extending previous work involving the same authors [126].
Bi-party session types have been implemented in Java [123].
There has also been much recent interest in leveraging information present in session types to guide optimization of communication protocols [127]. Although this has thus far been targeted at reducing the overhead of round-trip times in a distributed setting, it does pose an interesting opportunity in a multi-core environment. Such optimizations rely on batching [128] or chaining [129] of messages between communicating parties. Although we do not envision being able to leverage chaining of communication, batching can be utilized to reduce contention and synchronization costs on channels. Communication protocols often require multiple message exchanges between communicating threads, and each exchange requires synchronization on the channel that provides the conduit between the communicating threads. By amalgamating messages, we would be able to reduce contention and lock-acquisition costs, and to limit the size of channel queues.
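As a tiny illustration of the batching idea, requests destined for the same server could be amalgamated into a single message, so that one channel synchronization replaces many; reqCh, the request type, and the handler passed to recvBatch are hypothetical.

(* Hedged sketch: amalgamating messages to reduce per-message synchronization. *)
fun sendBatch (reqCh, reqs) = send (reqCh, reqs)     (* reqCh : req list chan *)
fun recvBatch (reqCh, handleReq) = app handleReq (recv reqCh)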
LIST OF REFERENCES
[1] Joe Armstrong, Robert Virding, Claes Wikstrom, and Mike Williams. Concurrent
Programming in Erlang. Prentice-Hall, 2nd edition, 1996.
[2] John H. Reppy. Concurrent Programming in ML. Cambridge University Press, New
York, NY, USA, 2007.
[3] Matthew Fluet, Mike Rainey, John Reppy, and Adam Shaw. Implicitly-threaded
parallelism in manticore. In Proceedings of the 13th ACM SIGPLAN International
Conference on Functional Programming, ICFP ’08, pages 119–130, New York, NY,
USA, 2008. ACM.
[4] Guodong Li, Michael Delisi, Ganesh Gopalakrishnan, and Robert M. Kirby. Formal
specification of the MPI-2.0 standard in TLA+. In Proceedings of the 13th ACM
SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP
’08, pages 283–284, New York, NY, USA, 2008. ACM.
[5] Richard Monson-Haefel and David Chappell. Java Message Service. O’Reilly &
Associates, Inc., Sebastopol, CA, USA, 2000.
[6] Intel. Single-Chip Cloud Computer.
Scale/1826.htm, 2010.
http://techresearch.intel.com/articles/Tera-
[7] Andrew Baumann, Paul Barham, Pierre-Evariste Dagand, Tim Harris, Rebecca
Isaacs, Simon Peter, Timothy Roscoe, Adrian Schüpbach, and Akhilesh Singhania.
The multikernel: A new OS architecture for scalable multicore systems. In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles, SOSP
’09, pages 29–44, New York, NY, USA, 2009. ACM.
[8] MLton. http://www.mlton.org.
[9] John Reppy, Claudio V. Russo, and Yingqi Xiao. Parallel concurrent ML. In ICFP
’09: Proceedings of the 14th ACM SIGPLAN International Conference on Functional Programming, pages 257–268, New York, NY, USA, 2009. ACM.
[10] Robin Milner, Mads Tofte, and David Macqueen. The Definition of Standard ML.
MIT Press, Cambridge, MA, USA, 1997.
[11] Lukasz Ziarek, Stephen Weeks, and Suresh Jagannathan. Flattening tuples in an SSA
intermediate representation. Higher-Order and Symbolic Computation, 21:333–358,
2008.
[12] Martin Elsman. Program Modules, Separate Compilation, and Intermodule Optimization. PhD thesis, University of Copenhagen, 1999.
[13] Andrew Tolmach and Dino P. Oliva. From ML to Ada: Strongly-typed language
interoperability via source translation. Journal of Functional Programing, 8(4):367–
412, 1998.
[14] John C. Reynolds. Definitional interpreters for higher-order programming languages. In Proceedings of 25th ACM National Conference, pages 717–740, Boston,
Massachusetts, 1972. Reprinted in Higher-Order and Symbolic Computation
11(4):363–397, 1998.
[15] Ron Cytron, Jeanne Ferrante, Barry K. Rosen, Mark N. Wegman, and F. Kenneth
Zadeck. Efficiently computing static single assignment form and the control dependence graph. ACM Transactions on Programing Languages and Systems, 13:451–
490, October 1991.
[16] Olin Shivers. Control flow analysis in Scheme. In Proceedings of the ACM SIGPLAN 1988 Conference on Programming Language Design and Implementation,
PLDI ’88, pages 164–174, New York, NY, USA, 1988. ACM.
[17] David R. Butenhof. Programming with POSIX threads. Addison-Wesley Longman
Publishing Co., Inc., Boston, MA, USA, 1997.
[18] KC Sivaramakrishnan, Lukasz Ziarek, Raghavendra Prasad, and Suresh Jagannathan. Lightweight asynchrony using parasitic threads. In DAMP ’10: Proceedings
of the 5th ACM SIGPLAN Workshop on Declarative Aspects of Multicore Programming, pages 63–72, New York, NY, USA, 2010. ACM.
[19] George Candea, Shinichi Kawamoto, Yuichi Fujiki, Greg Friedman, and Armando
Fox. Microreboot: A technique for cheap recovery. In Proceedings of the 6th Symposium on Opearting Systems Design & Implementation – Volume 6, pages 3–3,
Berkeley, CA, USA, 2004. USENIX Association.
[20] Tim Harris and Keir Fraser. Language support for lightweight transactions. In Proceedings of the 18th Annual ACM SIGPLAN Conference on Object-Oriented Programing, Systems, Languages, and Applications, OOPSLA ’03, pages 388–402,
New York, NY, USA, 2003. ACM.
[21] Maurice Herlihy, Victor Luchangco, Mark Moir, and William N. Scherer, III. Software transactional memory for dynamic-sized data structures. In Proceedings of the
22nd Annual Symposium on Principles of Distributed Computing, PODC ’03, pages
92–101, New York, NY, USA, 2003. ACM.
[22] Kevin Donnelly and Matthew Fluet. Transactional events. Journal of Functional
Programing, 18:649–706, September 2008.
[23] Jeremy Manson, William Pugh, and Sarita V. Adve. The Java memory model.
In Proceedings of the 32nd ACM SIGPLAN-SIGACT Symposium on Principles of
Programming Languages, POPL ’05, pages 378–391, New York, NY, USA, 2005.
ACM.
[24] Jim Gray and Andreas Reuter. Transaction Processing. Morgan-Kaufmann, 1993.
[25] Panos K. Chrysanthis and Krithi Ramamritham. ACTA: The SAGA continues. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1992.
[26] Mootaz Elnozahy, Lorenzo Alvisi, Yi-Min Wang, and David Johnson. A survey of
rollback-recovery protocols in message-passing systems. ACM Computing Surveys,
34(3):375–408, 2002.
[27] Mangesh Kasbekar and Chita Das. Selective checkpointing and rollbacks in multithreaded distributed systems. In Proceedings of the The 21st International Conference on Distributed Computing Systems, page 39, Washington, DC, USA, 2001.
IEEE Computer Society.
[28] Kester Li, Jeffrey Naughton, and James Plank. Real-time concurrent checkpoint
for parallel programs. In Proceedings of the 2nd ACM SIGPLAN Symposium on
Principles & Practice of Parallel Programming, PPOPP ’90, pages 79–88, New
York, NY, USA, 1990. ACM.
[29] Asser N. Tantawi and Manfred Ruschitzka. Performance analysis of checkpointing
strategies. ACM Transactions on Computer Systems, 2:123–144, May 1984.
[30] Saurabh Agarwal, Rahul Garg, Meeta S. Gupta, and Jose E. Moreira. Adaptive
incremental checkpointing for massively parallel systems. In ICS ’04: Proceedings
of the 18th Annual International Conference on Supercomputing, pages 277–286,
New York, NY, USA, 2004. ACM.
[31] Greg Bronevetsky, Daniel Marques, Keshav Pingali, and Paul Stodghill. Automated
application-level checkpointing of MPI programs. In PPoPP ’03: Proceedings of
the 9th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 84–94, New York, NY, USA, 2003. ACM.
[32] Greg Bronevetsky, Daniel Marques, Keshav Pingali, Peter Szwed, and Martin
Schulz. Application-level checkpointing for shared memory programs. SIGARCH
Computer Architecture News, 32(5):235–247, 2004.
[33] Micah Beck, James S. Plank, and Gerry Kingsley. Compiler-assisted checkpointing.
Technical report, University of Tennessee. Knoxville, TN, USA, 1994.
[34] Yuqun Chen, James S. Plank, and Kai Li. Clip: A checkpointing tool for messagepassing parallel programs. In Supercomputing ’97: Proceedings of the 1997
ACM/IEEE Conference on Supercomputing (CDROM), pages 1–11, New York, NY,
USA, 1997. ACM.
[35] Alan Dearie and David Hulse. On page-based optimistic process checkpointing. In
Proceedings of the 4th International Workshop on Object-Orientation in Operating
Systems, IWOOOS ’95, page 24, Washington, DC, USA, 1995. IEEE Computer
Society.
[36] William R. Dieter and James E. Lumpp Jr. A user-level checkpointing library for
POSIX threads programs. In FTCS ’99: Proceedings of the 29th Annual International Symposium on Fault-Tolerant Computing, page 224, Washington, DC, USA,
1999. IEEE Computer Society.
[37] Atul Adya, Robert Gruber, Barbara Liskov, and Umesh Maheshwari. Efficient optimistic concurrency control using loosely synchronized clocks. In SIGMOD ’95:
Proceedings of the 1995 ACM SIGMOD International Conference on Management
of Data, pages 23–34, New York, NY, USA, 1995. ACM.
[38] Hsiang-Tsung Kung and John Robinson. On optimistic methods for concurrency
control. ACM Transactions on Database Systems, 6:213–226, June 1981.
[39] Martin C. Rinard. Effective fine-grain synchronization for automatically parallelized
programs using optimistic synchronization primitives. ACM Transactions on Computer Systems, 17:337–371, November 1999.
[40] Adam Welc, Suresh Jagannathan, and Antony Hosking. Safe futures for Java. In
Proceedings of the 20th Annual ACM SIGPLAN Conference on Object-Oriented
Programming, Systems, Languages, and Applications, OOPSLA ’05, pages 439–
453, New York, NY, USA, 2005. ACM.
[41] Tim Harris, Simon Marlow, Simon Peyton-Jones, and Maurice Herlihy. Composable
memory transactions. In Proceedings of the 10th ACM SIGPLAN Symposium on
Principles and Practice of Parallel Programming. PPoPP ’05, pages 48–60, New
York, NY, USA, 2005. ACM.
[42] Michael F. Ringenburg and Dan Grossman. AtomCaml: First-class atomicity via
rollback. In Proceedings of the 10th ACM SIGPLAN International Conference on
Functional Programming, ICFP ’05, pages 92–104, New York, NY, USA, 2005.
ACM.
[43] Kevin Donnelly and Matthew Fluet. Transactional events. In Proceedings of the
11th ACM SIGPLAN International Conference on Functional Programming, ICFP
’06, pages 124–135, New York, NY, USA, 2006. ACM.
[44] Laura Effinger-Dean, Matthew Kehrt, and Dan Grossman. Transactional events for
ML. In Proceeding of the 13th ACM SIGPLAN International Conference on Functional Programming, ICFP ’08, pages 103–114, New York, NY, USA, 2008. ACM.
[45] Andrew P. Tolmach and Andrew W. Appel. Debugging standard ML without reverse
engineering. In Proceedings of the 1990 ACM Conference on LISP and Functional
Programming, LFP ’90, pages 1–12, New York, NY, USA, 1990. ACM.
[46] Andrew P. Tolmach and Andrew W. Appel. Debuggable concurrency extensions
for standard ML. In Proceedings of the 1991 ACM/ONR Workshop on Parallel and
Distributed Debugging, PADD ’91, pages 120–131, New York, NY, USA, 1991.
ACM.
[47] John Field and Carlos A. Varela. Transactors: A programming model for maintaining globally consistent distributed state in unreliable environments. In Proceedings of the 32nd ACM SIGPLAN-SIGACT Symposium on Principles of Programming
Languages, POPL ’05, pages 195–208, New York, NY, USA, 2005. ACM.
[48] Jan Christiansen and Frank Huch. Searching for deadlocks while debugging concurrent Haskell programs. In Proceedings of the 9th ACM SIGPLAN International
Conference on Functional Programming. ICFP ’04, pages 28–39, New York, NY,
USA, 2004. ACM.
[49] Matthew Flatt and Robert Bruce Findler. Kill-Safe synchronization abstractions. In
Proceedings of the ACM SIGPLAN 2004 Conference on Programming Language
Design and Implementation, PLDI ’04, pages 47–58, New York, NY, USA, 2004.
ACM.
[50] Armand Navabi, Xiangyu Zhang, and Suresh Jagannathan. Quasi-static scheduling
for safe futures. In Proceedings of the 13th ACM SIGPLAN Symposium on Principles
and Practice of Parallel Programming, PPoPP ’08, pages 23–32, New York, NY,
USA, 2008. ACM.
[51] Lingli Zhang, Chandra Krintz, and Priya Nagpurkar. Supporting exception handling
for futures in Java. In Proceedings of the 5th International Symposium on Principles
and Practice of Programming in Java, PPPJ ’07, pages 175–184, New York, NY,
USA, 2007. ACM.
[52] C. Flanagan and M. Felleisen. The semantics of future and an application. Journal
of Functional Programming, 9(1):1–31, January 1999.
[53] Armand Navabi and Suresh Jagannathan. Exceptionally safe futures. In Proceedings of the 11th International Conference on Coordination Models and Languages,
COORDINATION ’09, pages 47–65, Berlin, Heidelberg, 2009. Springer-Verlag.
[54] Yanhong A. Liu and Tim Teitelbaum. Caching intermediate results for program
improvement. In Proceedings of the 1995 ACM SIGPLAN Symposium on Partial
Evaluation and Semantics-Based Program Manipulation, PEPM ’95, pages 190–
201, New York, NY, USA, 1995. ACM.
[55] William Pugh and Tim Teitelbaum. Incremental computation via function caching.
In Proceedings of the 16th ACM SIGPLAN-SIGACT Symposium on Principles of
Programming Languages, POPL ’89, pages 315–328, New York, NY, USA, 1989.
ACM.
[56] Umut A. Acar, Guy E. Blelloch, and Robert Harper. Selective memoization. In
Proceedings of the 30th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL ’03, pages 14–25, New York, NY, USA, 2003. ACM.
[57] Michael I. Gordon, William Thies, and Saman Amarasinghe. Exploiting coarsegrained task, data, and pipeline parallelism in stream programs. In Proceedings of
the 12th International Conference on Architectural support for Programming Languages and Operating Systems, ASPLOS-XII, pages 151–162, New York, NY, USA,
2006. ACM.
[58] Christopher J. F. Pickett and Clark Verbrugge. Software Thread Level Speculation
for the Java Language and Virtual Machine Environment. In Proceedings of the
International Workshop on Languages and Compilers for Parallel Computing, 2005.
[59] Ali-Reza Adl-Tabatabai, Brian T. Lewis, Vijay Menon, Brian R. Murphy, Bratin
Saha, and Tatiana Shpeisman. Compiler and runtime support for efficient software
transactional memory. In Proceedings of the 2006 ACM SIGPLAN Conference on
Programming Language Design and Implementation, PLDI ’06, pages 26–37, New
York, NY, USA, 2006. ACM.
[60] Rachid Guerraoui, Michal Kapalka, and Jan Vitek. STMBench7: A benchmark for
software transactional memory. In Proceedings of the 2nd ACM SIGOPS/EuroSys
European Conference on Computer Systems 2007, EuroSys ’07, pages 315–324,
New York, NY, USA, 2007. ACM.
[61] Richard Matthew McCutchen and Samir Khuller. Streaming algorithms for k-center
clustering with outliers and with anonymity. In Proceedings of the 11th International
Workshop, APPROX 2008, and 12th International Workshop, RANDOM 2008 on
Approximation, Randomization and Combinatorial Optimization: Algorithms and
Techniques, APPROX ’08 / RANDOM ’08, pages 165–178, Berlin, Heidelberg,
2008. Springer-Verlag.
[62] Michael J. Carey, David J. DeWitt, and Jeffrey F. Naughton. The 007 benchmark. In
Proceedings of the 1993 ACM SIGMOD International Conference on Management
of Data, SIGMOD ’93, pages 12–21, New York, NY, USA, 1993. ACM.
[63] Bratin Saha, Ali-Reza Adl-Tabatabai, Richard L. Hudson, Chi Cao Minh, and Benjamin Hertzberg. McRT-STM: A high performance software transactional memory system for a multi-core runtime. In Proceedings of the 11th ACM SIGPLAN
Symposium on Principles and Practice of Parallel Programming, PPoPP ’06, pages
187–197, New York, NY, USA, 2006. ACM.
[64] William Pugh. An improved replacement strategy for function caching. In Proceedings of the 1988 ACM Conference on LISP and Functional Programming, LFP ’88,
pages 269–276, New York, NY, USA, 1988. ACM.
[65] Allan Heydon, Roy Levin, and Yuan Yu. Caching function calls using precise dependencies. In Proceedings of the ACM SIGPLAN 2000 Conference on Programming
Language Design and Implementation, PLDI ’00, pages 311–320, New York, NY,
USA, 2000. ACM.
[66] Kedar Swadi, Walid Taha, Oleg Kiselyov, and Emir Pasalic. A monadic approach
for avoiding code duplication when staging memoized functions. In Proceedings of
the 2006 ACM SIGPLAN Symposium on Partial Evaluation and Semantics-Based
Program Manipulation, PEPM ’06, pages 160–169, New York, NY, USA, 2006.
ACM.
[67] Shaz Qadeer, Sriram K. Rajamani, and Jakob Rehof. Summarizing procedures in
concurrent programs. In Proceedings of the 31st ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL ’04, pages 245–255, New
York, NY, USA, 2004. ACM.
[68] Eric Koskinen, Matthew Parkinson, and Maurice Herlihy. Coarse-grained transactions. In Proceedings of the 37th Annual ACM SIGPLAN-SIGACT Symposium on
Principles of Programming Languages, POPL ’10, pages 19–30, New York, NY,
USA, 2010. ACM.
[69] Dave Dice, Yossi Lev, Mark Moir, and Daniel Nussbaum. Early experience with
a commercial hardware transactional memory implementation. In Proceeding of
the 14th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS ’09, pages 157–168, New York, NY, USA,
2009. ACM.
[70] Lukasz Ziarek, Adam Welc, Ali-Reza Adl-Tabatabai, Vijay Menon, Tatiana Shpeisman, and Suresh Jagannathan. A uniform transactional execution environment for Java. In Proceedings of the 22nd European Conference on Object-Oriented Programming, ECOOP ’08, pages 129–154, Berlin, Heidelberg, 2008. Springer-Verlag.
[71] Chi Cao Minh, Martin Trautmann, JaeWoong Chung, Austen McDonald, Nathan
Bronson, Jared Casper, Christos Kozyrakis, and Kunle Olukotun. An effective hybrid transactional memory system with strong isolation guarantees. In Proceedings
of the 34th Annual International Symposium on Computer Architecture, ISCA ’07,
pages 69–80, New York, NY, USA, 2007. ACM.
[72] Arrvindh Shriraman, Michael F. Spear, Hemayet Hossain, Virendra J. Marathe,
Sandhya Dwarkadas, and Michael L. Scott. An integrated hardware-software approach to flexible transactional memory. In Proceedings of the 34th Annual International Symposium on Computer Architecture, ISCA ’07, pages 104–115, New York,
NY, USA, 2007. ACM.
[73] Umut A. Acar, Amal Ahmed, and Matthias Blume. Imperative self-adjusting computation. In Proceedings of the 35th Annual ACM SIGPLAN-SIGACT Symposium on
Principles of Programming Languages, POPL ’08, pages 309–322, New York, NY,
USA, 2008. ACM.
[74] Ruy Ley-Wild, Matthew Fluet, and Umut A. Acar. Compiling self-adjusting programs with continuations. In Proceeding of the 13th ACM SIGPLAN International
Conference on Functional Programming, ICFP ’08, pages 321–334, New York, NY,
USA, 2008. ACM.
[75] Umut A. Acar, Guy E. Blelloch, Matthias Blume, and Kanat Tangwongsan. An experimental analysis of self-adjusting computation. In Proceedings of the 2006 ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI
’06, pages 96–107, New York, NY, USA, 2006. ACM.
[76] Matthew Hammer, Umut A. Acar, Mohan Rajagopalan, and Anwar Ghuloum. A
proposal for parallel self-adjusting computation. In Proceedings of the 2007 Workshop on Declarative Aspects of Multicore Programming, DAMP ’07, pages 3–9,
New York, NY, USA, 2007. ACM.
[77] John Reppy and Yingqi Xiao. Specialization of CML message-passing primitives. In Proceedings of the 34th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL ’07, pages 315–326, New York, NY, USA,
2007. ACM.
[78] Lukasz Ziarek, Philip Schatz, and Suresh Jagannathan. Stabilizers: A modular
checkpointing abstraction for concurrent functional programs. In ICFP ’06: Proceedings of the 11th ACM SIGPLAN International Conference on Functional Programming, pages 136–147, New York, NY, USA, 2006. ACM.
[79] Cédric Fournet, Fabrice Le Fessant, Luc Maranget, and Alan Schmidt. JoCaml:
A Language for Concurrent Distributed and Mobile Programming. In Advanced
Functional Programming, pages 129–158. Springer-Verlag, 2002.
[80] Avik Chaudhuri. A concurrent ML library in concurrent Haskell. In Proceedings
of the 14th ACM SIGPLAN International Conference on Functional Programming,
ICFP ’09, pages 269–280, New York, NY, USA, 2009. ACM.
[81] George Russell. Events in Haskell, and how to implement them. In Proceedings
of the 6th ACM SIGPLAN International Conference on Functional Programming,
ICFP ’01, pages 157–168, New York, NY, USA, 2001. ACM.
[82] Prakash Chandrasekaran, Christopher L. Conway, Joseph M. Joy, and Sriram K.
Rajamani. Programming asynchronous layers with clarity. In Proceedings of the the
6th Joint Meeting of the European Software Engineering Conference and the ACM
SIGSOFT Symposium on The Foundations of Software Engineering, ESEC-FSE ’07,
pages 65–74, New York, NY, USA, 2007. ACM.
[83] Don Syme, Adam Granicz, and Antonio Cisternino. Expert F#. Apress, 2007.
[84] Robert Pickering. Foundations of F#. Apress, 2007.
[85] Cédric Fournet and Georges Gonthier. The reflexive CHAM and the join-calculus.
In Proceedings of the 23rd ACM SIGPLAN-SIGACT Symposium on Principles of
Programming Languages, POPL ’96, pages 372–385, New York, NY, USA, 1996.
ACM.
[86] Jean-Pierre Banâtre and Daniel Le Métayer. Programming by multiset transformation. Communications of the ACM, 36:98–111, January 1993.
[87] Peng Li and Steve Zdancewic. Combining events and threads for scalable network
services implementation and evaluation of monadic, application-level concurrency
primitives. In Proceedings of the 2007 ACM SIGPLAN Conference on Programming
Language Design and Implementation, PLDI ’07, pages 189–199, New York, NY,
USA, 2007. ACM.
[88] Patrick Eugster and K. R. Jayaram. EventJava: An extension of Java for event correlation. In Proceedings of the 23rd European Conference on Object-Oriented Programming, ECOOP ’09, pages 570–594, Berlin, Heidelberg, 2009. Springer-Verlag.
[89] Mads Haahr, René Meier, Paddy Nixon, Vinny Cahill, and Eric Jul. Filtering and
scalability in the ECO distributed event model. In Proceedings of the International
Symposium on Software Engineering for Parallel and Distributed Systems, page 83,
Washington, DC, USA, 2000. IEEE Computer Society.
[90] Jessie Dedecker, Tom Van Cutsem, Stijn Mostinckx, Theo D’Hondt, and Wolfgang
De Meuter. Ambient-oriented programming. In Companion to the 20th Annual ACM
SIGPLAN Conference on Object-oriented Programming, Systems, Languages, and
Applications, OOPSLA ’05, pages 31–40, New York, NY, USA, 2005. ACM.
[91] Patrick Eugster. Type-based publish/subscribe: Concepts and experiences. ACM
Transactions on Programing Languages and Systems, 29, January 2007.
[92] Philipp Haller and Tom Van Cutsem. Implementing joins using extensible pattern
matching. In Coordination Models and Languages, volume 5052 of Lecture Notes
in Computer Science, pages 135–152. Springer Berlin / Heidelberg, 2008.
[93] Nick Benton, Luca Cardelli, and Cédric Fournet. Modern Concurrency Abstractions
for C#. ACM Transactions on Programing Languages and Systems, 26(5):769–804,
2004.
[94] G Stewart Itzstein and David Kearney. Applications of Join Java. In Proceedings
of the 7th Asia-Pacific Conference on Computer Systems Architecture, CRPIT ’02,
pages 37–46, Darlinghurst, Australia, 2002. Australian Computer Society, Inc.
[95] Sophia Drossopoulou, Alexis Petrounias, Alex Buckley, and Susan Eisenbach. SCHOOL: A small chorded object-oriented language. Electronic Notes in Theoretical Computer Science, 135:37–47, March 2006.
[96] Huber Plociniczak and Susan Eisenbach. JErlang: Erlang with Joins. In 12th International Conference on Coordination Models and Languages, COORDINATION
’10, pages 61–75, June 2010.
[97] Erik Demaine. First class communication in MPI. In Proceedings of the Second MPI
Developers Conference, page 189, Washington, DC, USA, 1996. IEEE Computer
Society.
[98] Robert Halstead. Multilisp: A language for concurrent symbolic computation. ACM
Transactions on Programing Languages Systems, 7(4):501–538, October 1985.
[99] http://java.sun.com/j2se/1.5.0/docs/guide/concurrency.
[100] Eric Allen, David Chase, Joseph Hallett, Victor Luchangco, Jan-Willem Maessen, Sukyoung Ryu, Guy Steele, and Sam Tobin-Hochstadt. The Fortress language specification, version 1.0. Technical report, Sun Microsystems, Inc., May 2008.
[101] Philippe Charles, Christian Grothoff, Vijay Saraswat, Christopher Donawa, Allan
Kielstra, Kemal Ebcioglu, Christoph von Praun, and Vivek Sarkar. X10: An objectoriented approach to non-uniform cluster computing. In Proceedings of the 20th Annual ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications, OOPSLA ’05, pages 519–538, New York, NY, USA,
2005. ACM.
[102] Barbara Liskov and Liuba Shrira. Promises: Linguistic support for efficient asynchronous procedure calls in distributed systems. In Proceedings of the ACM SIGPLAN 1988 Conference on Programming Language Design and Implementation,
PLDI ’88, pages 260–267, New York, NY, USA, 1988. ACM.
[103] Robert L. Bocchino, Jr., Vikram S. Adve, Danny Dig, Sarita V. Adve, Stephen
Heumann, Rakesh Komuravelli, Jeffrey Overbey, Patrick Simmons, Hyojin Sung,
and Mohsen Vakilian. A type and effect system for deterministic parallel Java. In
Proceeding of the 24th ACM SIGPLAN Conference on Object-Oriented Programming Systems, Languages, and Applications, OOPSLA ’09, pages 97–116, New
York, NY, USA, 2009. ACM.
[104] Emery D. Berger, Ting Yang, Tongping Liu, and Gene Novark. Grace: Safe multithreaded programming for C/C++. In Proceeding of the 24th ACM SIGPLAN Conference on Object Oriented Programming Systems, Languages, and Applications,
OOPSLA ’09, pages 81–96, New York, NY, USA, 2009. ACM.
[105] Gerard Boudol and Gustavo Petri. A theory of speculative computation. In Programming Languages and Systems, Lecture Notes in Computer Science, pages 165–184,
Berlin / Heidelberg, 2010. Springer-Verlag.
[106] Mark Russinovich and Bryce Cogswell. Replay for concurrent non-deterministic
shared-memory applications. In Proceedings of the ACM SIGPLAN 1996 Conference on Programming Language Design and Implementation, PLDI ’96, pages 258–
266, New York, NY, USA, 1996. ACM.
[107] Harish Patil, Cristiano Pereira, Mack Stallcup, Gregory Lueck, and James Cownie.
PinPlay: A framework for deterministic replay and reproducible analysis of parallel
programs. In Proceedings of the 8th Annual IEEE/ACM International Symposium
on Code Generation and Optimization, CGO ’10, pages 2–11, New York, NY, USA,
2010. ACM.
[108] Pablo Montesinos, Matthew Hicks, Samuel T. King, and Josep Torrellas. Capo:
A software-hardware interface for practical deterministic multiprocessor replay. In
Proceeding of the 14th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS ’09, pages 73–84, New
York, NY, USA, 2009. ACM.
[109] Roberto Bruni, Hernán Melgratti, and Ugo Montanari. Theoretical foundations for
compensations in flow composition languages. In Proceedings of the 32nd ACM
SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL
’05, pages 209–220, New York, NY, USA, 2005. ACM.
[110] Umut A. Acar. Self-adjusting computation. PhD thesis, Pittsburgh, PA, USA, 2005.
Co-Chair-Guy Blelloch and Co-Chair-Robert Harper.
[111] Umut A. Acar, Guy E. Blelloch, Matthias Blume, Robert Harper, and Kanat Tangwongsan. An experimental analysis of self-adjusting computation. ACM Transactions on Programming Languages and Systems, 32:3:1–3:53, November 2009.
[112] Umut A. Acar, Guy E. Blelloch, and Robert Harper. Adaptive functional programming. In Proceedings of the 29th ACM SIGPLAN-SIGACT Symposium on Principles
of Pogramming Languages, POPL ’02, pages 247–259, New York, NY, USA, 2002.
ACM.
[113] Kaku Takeuchi, Kohei Honda, and Makoto Kubo. An interaction-based language
and its typing system. In PARLE ’94, volume 817 of LNCS, pages 398–413. Springer-Verlag, 1994.
[114] Kohei Honda, Vasco Thudichum Vasconcelos, and Makoto Kubo. Language primitives and type discipline for structured communication-based programming. In
Proceedings of the 7th European Symposium on Programming: Programming Languages and Systems, pages 122–138, London, UK, 1998. Springer-Verlag.
[115] Matthias Neubauer and Peter Thiemann. An implementation of session types. In
Practical Aspects of Declarative Languages, volume 3057 of Lecture Notes in Computer Science, pages 56–70. Springer Berlin / Heidelberg, 2004.
[116] Kohei Honda, Nobuko Yoshida, and Marco Carbone. Multiparty asynchronous session types. In Proceedings of the 35th Annual ACM SIGPLAN-SIGACT Symposium
on Principles of Programming Languages, POPL ’08, pages 273–284, New York,
NY, USA, 2008. ACM.
[117] Eduardo Bonelli and Adriana Compagnoni. Multipoint session types for a distributed calculus. In Proceedings of Trustworthy Global Computing, TGC ’07, LNCS,
pages 38–57. Springer-Verlag, 2007.
[118] Andi Bejleri and Nobuko Yoshida. Synchronous multiparty session types. Electronic
Notes in Theoretical Computer Science, 241:3–33, 2009.
[119] Simon Gay, Vasco Vasconcelos, and Antonio Ravara. Session types for inter-process
communication. Technical report, University of Glasgow, 2003.
[120] Riccardo Pucella and Jesse A. Tov. Haskell session types with (almost) no class. In
Haskell ’08: Proceedings of the 1st ACM SIGPLAN Symposium on Haskell, pages
25–36, New York, NY, USA, 2008. ACM.
[121] Antonio Vallecillo, Vasco T. Vasconcelos, and António Ravara. Typing the behavior
of software components using session types. Fundamenta Informaticae, 73(4):583–
598, 2006.
[122] Sara Capecchi, Mario Coppo, Mariangiola Dezani-Ciancaglini, Sophia
Drossopoulou, and Elena Giachino. Amalgamating sessions and methods in
object-oriented languages with generics. Theoretical Computer Science, 410(2–3):142–167, 2009.
[123] Raymond Hu, Nobuko Yoshida, and Kohei Honda. Session-based distributed programming in Java. In Proceedings of the 22nd European Conference on Object-Oriented Programming, ECOOP ’08, pages 516–541, Berlin, Heidelberg, 2008.
Springer-Verlag.
[124] Manuel Fähndrich, Mark Aiken, Chris Hawblitzel, Orion Hodson, Galen Hunt,
James R. Larus, and Steven Levi. Language support for fast and reliable message-based communication in Singularity OS. In EuroSys ’06: Proceedings of the 1st
ACM SIGOPS/EuroSys European Conference on Computer Systems 2006, pages
177–190, New York, NY, USA, 2006. ACM.
[125] Mario Coppo, Mariangiola Dezani-Ciancaglini, and Nobuko Yoshida. Asynchronous session types and progress for object oriented languages. In Proceedings of
the 9th IFIP WG 6.1 International Conference on Formal Methods for Open Object-Based Distributed Systems, FMOODS ’07, pages 1–31, Berlin / Heidelberg, 2007.
Springer-Verlag.
[126] Mariangiola Dezani-Ciancaglini, Dimitris Mostrous, Nobuko Yoshida, and Sophia
Drossopoulou. Session types for object-oriented languages. In Proceedings of
the 20th European Conference on Object-Oriented Programming, pages 328–352.
Springer-Verlag, 2006.
[127] K.C. Sivaramakrishnan, Karthik Nagaraj, Lukasz Ziarek, and Patrick Eugster. Efficient session type guided distributed interaction. In Coordination Models and
Languages, volume 6116 of Lecture Notes in Computer Science, pages 152–167.
Springer-Verlag, Berlin, Heidelberg, 2010.
[128] Kwok Cheung Yeung and Paul H. J. Kelly. Optimising Java RMI programs by
communication restructuring. In Middleware ’03: Proceedings of the ACM/IFIP/USENIX 2003 International Conference on Middleware, pages 324–343, New
York, NY, USA, 2003. Springer-Verlag New York, Inc.
[129] Yee Jiun Song, Marcos K. Aguilera, Ramakrishna Kotla, and Dahlia Malkhi. RPC
chains: Efficient client-server communication in geodistributed systems. In 6th
USENIX Symposium on Networked Systems Design and Implementation (NSDI
2009), pages 17–30, 2009.
VITA
Lukasz S. Ziarek was born on September 17, 1982, in Warszawa, Poland. He moved
to the United States in 1985. He attended St. Joseph High School in South Bend, Indiana.
He received his bachelor's degree in computer science from the University of Chicago in
December 2003. He began graduate school at Purdue University in January 2004. He
married his wife, Kayela, on July 17, 2010. He received his Ph.D. in computer science
in May 2011.