ABSTRACTIONS FOR ROBUST HIGHER-ORDER MESSAGE-BASED COMMUNICATION

A Dissertation Submitted to the Faculty of Purdue University

by

Lukasz S. Ziarek

In Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

May 2011

Purdue University
West Lafayette, Indiana

To my Wife and Parents...

ACKNOWLEDGMENTS

There are many people who have helped me in my journey to complete my dissertation. I want to first thank my wife for helping me through my defense and thesis writing. She always kept me focused and grounded. Without her this work would not have been a possibility.

I would also like to thank my advisor, Professor Suresh Jagannathan, who helped me develop, focus, and expand my research and ideas. Professor Jagannathan helped me at every stage of my graduate career, and our frequent meetings in his office were the springboard for this work. Without him this dissertation would not have seen the light of day.

I also want to thank my committee members for their insightful comments and critiques. Professors Vitek, Eugster, and Li helped me to refine this dissertation.

My fellow lab mates were always available for interesting discussions, critiquing ideas, and even proofreading. Many of them helped shape and refine the ideas that are present in this dissertation. I especially want to thank all of my co-authors on the various research papers that comprise the core of this dissertation. They were instrumental in the ideas, semantics, and implementations of the three language primitives found in this dissertation.

Lastly, I would like to thank my parents, who always believed in me. In their minds there was never any doubt that I would complete this journey, even when I doubted myself.

TABLE OF CONTENTS

LIST OF TABLES  vii
LIST OF FIGURES  viii
ABBREVIATIONS  xiii
ABSTRACT  xiv
1 INTRODUCTION  1
  1.1 Context  4
    1.1.1 Concurrent ML  4
    1.1.2 Semantics and Case Study  5
  1.2 Contributions and Outline  5
2 CONCURRENT ML  10
  2.1 Programming with CML  12
3 MLTON  16
  3.1 Multi-MLton  17
    3.1.1 Threading System  18
    3.1.2 Communication  21
    3.1.3 Concluding Remarks  21
4 SWERVE  22
  4.0.4 Processing a Request  24
5 LIGHTWEIGHT CHECKPOINTING FOR CONCURRENT ML  26
  5.1 Programming Model  29
    5.1.1 Interaction of Stable Sections  31
  5.2 Motivating Example  33
    5.2.1 Cut  36
  5.3 Semantics  38
    5.3.1 Example  43
    5.3.2 Soundness  45
  5.4 Incremental Construction  46
    5.4.1 Example  51
  5.5 Efficiency  53
  5.6 Implementation  58
    5.6.1 Supporting First-Class Events  58
    5.6.2 Handling References  59
    5.6.3 Graph Representation  60
    5.6.4 Handling Exceptions  61
    5.6.5 The Rest of CML  62
  5.7 Performance Results  62
    5.7.1 Synthetic Benchmarks  64
    5.7.2 Open-Source Benchmarks  67
    5.7.3 Case Studies: Injecting Stabilizers  71
  5.8 Related Work  73
  5.9 Concluding Remarks  77
6 PARTIAL MEMOIZATION OF CONCURRENCY AND COMMUNICATION  78
  6.1 Programming Model  80
  6.2 Motivation  81
  6.3 Semantics  82
    6.3.1 Language  84
    6.3.2 Partial Memoization  85
    6.3.3 Constraint Matching  89
    6.3.4 Example  95
    6.3.5 Issues  96
    6.3.6 Schedule Aware Partial Memoization  98
  6.4 Soundness  99
  6.5 Implementation  112
    6.5.1 Parallel CML and Hooks  112
    6.5.2 Supporting Memoization  112
  6.6 Performance Results  114
    6.6.1 Synthetic Benchmarks  114
    6.6.2 Constraint Matching and Discharge Overheads  118
    6.6.3 Case Study: File Caching  120
  6.7 Related Work  121
  6.8 Concluding Remarks  124
7 ASYNCHRONOUS CML  125
  7.1 Design Considerations  127
    7.1.1 Putting it All Together  130
  7.2 Asynchronous Events  130
    7.2.1 Combinators  134
    7.2.2 Mailboxes and Multicast  141
  7.3 Semantics  142
    7.3.1 Encoding Communication  143
    7.3.2 Base Events  144
    7.3.3 Event Evaluation  145
    7.3.4 Communication and Ordering  145
    7.3.5 Combinators  146
    7.3.6 Choose Events  147
    7.3.7 Synchronization of Choice  147
    7.3.8 sWrap, aWrap, and Guard of Choose Events  148
  7.4 Implementation  149
  7.5 Case Study: A Parallel Web-server  149
    7.5.1 Lock-step File and Network I/O  149
    7.5.2 Underlying I/O and Logging  152
    7.5.3 Performance Results  153
  7.6 Related Work  154
  7.7 Concluding Remarks  157
8 CONCLUDING REMARKS AND FUTURE DIRECTIONS  169
  8.1 Future Directions  169
    8.1.1 Stabilizers  169
    8.1.2 Partial Memoization  172
    8.1.3 Asynchronous CML  174
LIST OF REFERENCES  176
VITA  187

LIST OF TABLES

5.1 Benchmark characteristics and dynamic counts.  63
5.2 Benchmark graph sizes and normalized overheads.  63
5.3 Restoration of the entire web-server.  70
5.4 Instrumented recovery.  71
7.1 Per module performance numbers for Swerve.  154

LIST OF FIGURES

1.1 Single arrows depict communication actions and double arrows represent potential matching communication actions for a thread T1.  3
2.1 CML event operators.  11
3.1 Abstractions found in Multi-MLton. Communication, either synchronous or asynchronous (expressed via asynchronous events as defined in Chapter 7), is depicted through arrows. Parasites are shown as raw stack frames that comprise a single runtime thread.  18
3.2 The runtime architecture of Multi-MLton utilizes pthreads, each of which contains a queue of lightweight threads, the currently executing lightweight thread, and a pointer to a section of shared memory for allocation.  19
4.1 Swerve module interactions for processing a request (solid lines) and error handling control and data flow (dashed lines) for timeouts. The number above the lines indicates the order in which communication actions occur.  25
5.1 A simple server-side RPC abstraction using synchronous communication.  28
5.2 Interaction between stable sections. Clear circles indicate thread-local checkpoints, dark circles represent stabilization actions.  31
5.3 An excerpt of the File Processing module in Swerve. The code fragment displayed on the bottom shows the code modified to use stabilizers. Italics mark areas in the original where the code is changed.  34
5.4 An excerpt of the Network Processor module in Swerve. The main processing of the hosting thread, created by the Listener module, is wrapped in a stable section and the timeout handling code can be removed. The code fragment on the bottom shows the modifications made to use stabilizers. Italics in the code fragment on the top mark areas in the original where the code is removed in the version modified to use stabilizers.  35
5.5 An excerpt of the Timeout Manager module in Swerve. The bottom code fragment shows the code modified to use stabilizers. The expired function can be removed and trigger now calls stabilize. Italics mark areas in the original where the code is changed.  36
5.6 A multi-server implementation which utilizes a central coordinator and multiple servers. A series of requests is multiplexed between the servers by the coordinator. Each server handles its own transient faults. The shaded portions represent computation which is unrolled due to the stabilize action performed by the server. Single arrows represent communication to servers and double arrows depict return communication. Circular wedges depict communications which are not considered because a cut operation limits their effects.  37
5.7 Stabilizers Semantics – The syntax, evaluation contexts, and domain equations for a core call-by-value language for stabilizers.  39
5.8 Stabilizers Semantics – The global evaluation rules for a core call-by-value language for stabilizers.  40
5.9 An example used to illustrate the interaction of inter-thread communication and stable sections. The call to f establishes an initial checkpoint. Although g and h do not interact directly with f, the checkpoint established by f may nonetheless be restored on a call to stabilize, as illustrated by Figure 5.10.  44
5.10 An example of global checkpoint construction where the inefficiency of global checkpointing causes the restoration of a checkpoint established prior to the stable section in which a call to stabilize occurs.  44
5.11 Stabilizers Semantics – Additional syntax, local evaluation rules, as well as domain equations for a semantics that utilizes incremental checkpoint construction.  49
5.12 Stabilizers Semantics – Global evaluation rules for a semantics that utilizes incremental checkpoint construction.  50
5.13 Stabilizers Semantics – Global evaluation rules for a semantics that utilizes incremental checkpoint construction (continued).  51
5.14 An example of incremental checkpoint construction for the code presented in Figure 5.9.  52
5.15 The relation ↦ defines how to evaluate a schedule T derived from a graph G.  55
5.16 Sample code utilizing exceptions and stabilizers.  61
5.17 Asynchronous Communication runtime overheads.  65
5.18 Asynchronous Communication memory overheads.  66
5.19 Communication Pipeline runtime overheads.  66
5.20 Communication Pipeline memory overheads.  67
5.21 Communication graphs for the Asynchronous Communication synthetic benchmark.  68
5.22 Communication graphs for the Communication Pipeline synthetic benchmark.  69
5.23 Swerve file size overheads for Stabilizers.  72
5.24 Swerve quantum overheads for Stabilizers.  73
6.1 Syntax, grammar, evaluation contexts, and domain equations for a concurrent language with synchronous communication.  83
6.2 Operational semantics for a concurrent language with synchronous communication.  84
6.3 Memoization Semantics – A concurrent language supporting memoization of synchronous communication and dynamic thread creation.  90
6.4 Memoization Semantics – A concurrent language supporting memoization of synchronous communication and dynamic thread creation.  91
6.5 Memoization Semantics – A concurrent language supporting memoization of synchronous communication and dynamic thread creation (continued).  92
6.6 Memoization Semantics – The function ℑ yields the set of constraints C which are not satisfiable in program state P.  93
6.7 Memoization Semantics – Constraint matching is defined by four rules. Communication constraints are matched with threads performing the opposite communication action of the constraint.  94
6.8 Determining if an application can fully leverage memo information may require examining an arbitrary number of possible thread interleavings.  95
6.9 The communication pattern of the code in Figure 6.8. Circles represent operations on channels. Gray circles are sends and white circles are receives. Double arrows represent communications that are captured as constraints during memoization.  95
6.10 The second application of f can only be partially memoized up to the second send since only the first receive made by g is blocked in the global state.  97
6.11 Memoization of the function f can lead to the starvation of either g or h depending on which value the original application of f consumed from channel c.  98
6.12 Memoization Semantics – Schedule Aware Partial Memoization.  100
6.13 T defines an erasure property on program states. The first four rules remove memo information and restore evaluation contexts.  101
6.14 Normalized runtime percent speedup for the k-clustering benchmark of memoized evaluation compared to non-memoized execution.  117
6.15 Normalized runtime percent speedup for STMBench-7 of memoized evaluation compared to non-memoized execution.  118
6.16 Normalized percent runtime overhead for discharging send and receive constraints for ping-pong.  119
6.17 Normalized runtime percent speedup for Swerve of memoized evaluation compared to non-memoized execution.  121
7.1 Two server-client model based concurrency abstractions extended to utilize logging. In (a) asynchronous sends are depicted as solid lines, whereas in (b) synchronous sends are depicted as solid lines and lightweight threads as boxes. Logging actions are presented as dotted lines.  128
7.2 The figure shows a complex asynchronous event ev, built from a base event aSendEvt, being executed by Thread 1. (a) When the event is synchronized via aSync, the value v is placed on channel c and post-creation actions are executed. Afterwards, control returns to Thread 1. (b) When Thread 2 consumes the value v from channel c, an implicit thread of control is created to execute any post-consumption actions.  132
7.3 The figure shows a complex asynchronous event ev, built from a base event aRecvEvt, being executed by Thread 1. (a) When the event is synchronized via aSync, the receive action is placed on channel c and post-creation actions are executed. Afterwards, control returns to Thread 1. (b) When Thread 2 sends the value v to channel c, an implicit thread of control is created to execute any post-consumption actions passing v as the argument.  134
7.4 The figure shows a callback event constructed from a complex asynchronous event ev and a callback function f. (a) When the callback event is synchronized via aSync, the action associated with the event ev is placed on channel c and post-creation actions are executed. A new event, ev', is created and passed to Thread 1. (b) An implicit thread of control is created after the base event of ev is consumed. Post-consumption actions are executed passing v, the result of consuming the base event for ev, as an argument. The result of the post-consumption actions, v', is sent on clocal. (c) When ev' is synchronized upon, f is called with v'.  137
7.5 The figure shows Thread 1 synchronizing on a complex asynchronous event ev, built from a choice between two base asynchronous send events; one sending v on channel c and the other v' on c'. Thread 2 is willing to receive from channel c.  140
7.6 CML Event and AEvent operators.  158
7.7 CML mailbox structure.  159
7.8 An excerpt of a CML mailbox structure implemented utilizing asynchronous events.  159
7.9 Syntax, grammar, and evaluation contexts for a core language for asynchronous events.  160
7.10 Domain equations for a core language for asynchronous events.  161
7.11 A core language for asynchronous events – base events as well as rules for spawn, function application, and channel creation.  162
7.12 A core language for asynchronous events – rules for matching communication and ordering.  163
7.13 A core language for asynchronous events – rules for waiting, blocking, and enqueueing.  164
7.14 A core language for asynchronous events – combinator extensions for asynchronous events.  165
7.15 A core language for asynchronous events – choose events and choose event flattening.  166
7.16 A core language for asynchronous events – synchronizing and evaluating SCHOOSE events.  167
7.17 A core language for asynchronous events – combinators for SCHOOSE events.  168

ABBREVIATIONS

ACML  Asynchronous Concurrent ML
CML   Concurrent ML
DFS   Depth-First Search
Dom   Domain
GC    Garbage Collector
min   Minimum
MPI   Message Passing Interface
OS    Operating System
PML   Parallel ML
RPC   Remote Procedure Call
SCC   Single-Chip Cloud Computer
SML   Standard ML
STM   Software Transactional Memory
TM    Transactional Memory

ABSTRACT

Ziarek, Lukasz S. Ph.D., Purdue University, May 2011. Abstractions for Robust Higher-Order Message-Based Communication. Major Professor: Suresh Jagannathan.

Message passing programming idioms alleviate the burden of reasoning about implicit program interactions that can lead to deadlock or race conditions by explicitly defining the interactions between threads and modules. This simplicity comes at the cost of having to reason about global protocols that span multiple interacting threads and software components. Reasoning about a given thread or component requires reasoning about potential communication partners and the protocols in which the thread participates. Therefore, building modular yet composable communication abstractions is challenging.

In this dissertation we present three language abstractions for building robust and efficient communication protocols. We show how such language abstractions can be leveraged to simplify protocol design, improve protocol performance, and modularly extend protocols with asynchronous actions. Extending protocols with asynchronous actions, specifically asynchronous events, is currently not possible in the context of languages like Concurrent ML.

1 INTRODUCTION

The advent of multi-core and multi-processor systems into main-stream computing has posed numerous challenges for software development. Notably, the development and maintenance of software that utilizes these additional computing resources is notoriously difficult. In this dissertation we explore three language-based mechanisms aimed at simplifying writing and reasoning about multi-threaded programs.

Message passing is a useful abstraction for writing concurrent and parallel programs. In message passing languages threads communicate with each other by sending and receiving messages on channels. Message passing is the prevalent means of synchronization and communication between threads in languages such as Erlang [1], Concurrent ML [2], and Manticore [3], and is the basis for MPI [4] and JMS [5]. Message passing is also a key component of current trends in CMP processor and operating system design. For example, Intel's recently announced SCC [6] (Single-Chip Cloud Computer) is a manycore CPU that features 24 tiles comprised of dual-core x86 IA processors. Notably, there is no shared L2 cache among these tiles. Instead, communication across cores is via hardware-assisted message passing over a 2D high-bandwidth mesh network. Thus, the SCC does not support uniform memory access; application performance is dictated by the degree of affinity that exists between threads and the data they access. At the software level, Barrelfish [7] is a new operating system kernel design that treats the underlying machine as a network of independent cores, and assumes no inter-core sharing at the lowest level.
It recasts all OS services, including memory management and inter-core communication, in terms of message passing, arguing that such a reformulation leads to improved scalability and efficiency due to additional pipelining and batching opportunities.

In this dissertation we explore message passing as a programming style that is implemented on top of a shared memory system, instead of as a low-level communication mechanism. In this context, messages are data that threads pass to one another via global conduits, or channels. A message is transferred from one thread to another when a thread sends the message on a channel and another receives the message from the same channel. Communication actions can either be synchronous or asynchronous. The former requires a thread performing a communication action to block until another thread is willing to perform a matching communication action: either receive from or send to the waiting thread. Asynchronous communication, on the other hand, does not wait for a matching participant. For example, in the case of a send, the message is placed on the channel regardless of the existence of a communication partner. Messages, the data sent and received on channels, can be propagated either by copying or by reference. The former typically makes a deep copy of the data composing the message, so sender and recipient each execute with their own separate version of the data. Alternatively, messages can be passed by reference so that sender and receiver both have access to the same data in memory.

Languages structured around a message passing programming idiom offer a number of benefits over shared memory. Immutable data that can be witnessed by concurrently executing threads of control is made explicit in message passing languages through sends and receives of such data on channels. This relieves the programmer of having to reason about shared state: because the data is immutable, there is no sharing between threads to reason about. However, in the presence of mutable data, things become more complicated. If mutable data is passed by reference between two communicating threads, future updates to this data may suffer from data races. If mutable data, on the other hand, is passed by copy, no association is manifested between the various copies. Updates in one thread of control are not propagated to another.

Yet, even with strictly immutable state, message passing languages are no panacea. The ease of not having to reason about shared state comes at the cost of reasoning about global protocols. Message passing programs are typically structured around global protocols that govern the interactions between threads and software components. Reasoning about a given thread, or a given region of code, requires reasoning about which protocols that thread may participate in. This is typically difficult since a thread may potentially participate in many different protocols, and participating in any given protocol may preclude future communication actions, allow different communication actions, or generate new communication actions. This occurs because the control flow of a given thread, the creation of new threads, and communication actions can all depend upon the values a thread sends or receives.

Figure 1.1. Single arrows depict communication actions and double arrows represent potential matching communication actions for a thread T1.
To reason about a given region of code in a message passing language, one must take into account the possible candidates with whom that region might communicate. As an example, consider the program in Figure 1.1, which consists of some finite number of communicating threads. Communication actions on channels are represented by circles, gray for sends and white for receives. Potential communications are depicted by arrows. To reason about thread T1 we must consider its communication partners. Clearly, T1 can complete by matching its communication actions with T2. However, T1 may also be able to match with T3. Notice that T3 is able to send to T1 on channel C1. Thus, T1 can complete if there exists a thread (T4 ... Tn) which is also willing to send on channel C2. Alternatively, there could be a thread (T4 ... Tn) willing to receive on channel C1 from T2, which would allow the second communication action in T2 to match with the second receive in T1. Reasoning about the different communications for a given thread requires examining the whole program. Notably, a given protocol may be composed of many interacting threads.

Reasoning about the interaction between software components is a bit simpler, as abstraction boundaries effectively decouple participants in a protocol. Message passing explicitly defines the protocols that govern component interactions, simplifying software design and engineering. However, maintaining and augmenting a specific software component once again requires the programmer to reason about which cross-component protocols that component may interact with. Changes to cross-component protocols affect both parties.

In this dissertation we explore the design and maintenance of message passing programs through the definition of three language abstractions: stabilizers, partial memoization, and asynchronous events. We show how such language abstractions can be leveraged to simplify protocol design, improve protocol performance, and modularly extend protocols with asynchronous actions.

1.1 Context

The work presented in this dissertation has been done in MLton [8], a whole-program optimizing compiler for Standard ML. ML is a family of functional programming languages whose main dialects include Standard ML, Objective Caml, and F#. More specifically, the work focuses on Concurrent ML, a message passing language extension for Standard ML.

1.1.1 Concurrent ML

Concurrent ML is a concurrent extension of Standard ML that adds message passing primitives, and it has been extended to execute on parallel architectures [9]. Concurrent ML at its core is structured around synchronous message passing and is typically implemented using lightweight threads (i.e., green threads). In CML, channels are first-class entities and can be created dynamically. CML channels support both send and receive operations by a given thread. In addition to synchronous message passing, part of this dissertation is devoted to exploring an asynchronous extension to CML. We provide additional background on CML and its primitives in Chapter 2.
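As a brief preview (a minimal sketch using the CML primitives introduced in Chapter 2, with the CML structure assumed to be in scope), the following fragment creates a channel dynamically and performs a synchronous rendezvous between two threads:

    (* A spawned thread offers the value 42 on channel c; the send blocks
       until the main thread performs the matching recv. *)
    val c = channel()
    val _ = spawn(fn () => send(c, 42))
    val x = recv(c)    (* x is 42 once the rendezvous completes *)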
1.1.2 Semantics and Case Study

For each language abstraction we present a formal operational semantics and relevant case studies. All of the operational semantics begin with a core functional language with threading and communication primitives. We show how to extend such a language to support the various language abstractions presented in each chapter. This core functional language is augmented with additional primitives in specific chapters where necessary.

In each chapter we present a case study detailing the performance characteristics of each language abstraction on Swerve, a real-world third-party benchmark. Swerve is a web-server written entirely in CML; it is roughly 16,000 lines of CML code consisting of a number of modules and specialized libraries. The goal of Swerve was to highlight and utilize CML as the predominant concurrency abstraction. The overwhelming majority of the interactions between modules, as well as intra-module concurrency, is in fact achieved through CML primitives, combinators, and events. As such, we utilize Swerve as our main case study in each chapter. We augment the case study with smaller synthetic benchmarks throughout the chapters. We provide a detailed description of Swerve in Chapter 4.

1.2 Contributions and Outline

We begin by providing details on CML primitives in Chapter 2, as well as a few code examples to illustrate their use. We then introduce the MLton compiler and its multi-core extension Multi-MLton in Chapter 3. We present salient details and background information on the implementation environment and extend the description in specific chapters where additional details are necessary. Details on Swerve are given in Chapter 4.

In Chapter 5 we explore the effects of atomicity and state reversion in a message passing language through the definition of a lightweight checkpointing abstraction, called stabilizers, and show how it can be used to: a) simplify error handling code and b) recover from transient faults efficiently. Stabilizers ensure an annotated region of code executes atomically: either all of its effects are visible or none are. They do so through the introduction of a new keyword, stable, used to demarcate atomic regions of code. If a transient error is encountered during the execution of a stable region of code, the region is reverted to a safe checkpoint through the use of the primitive stabilize. Two atomic regions of code which have communicated via message passing are reverted as a single atomic unit. We demonstrate the usefulness of stabilizers by simplifying the handling of timeouts in Swerve. Chapter 5 makes the following contributions:

1. The design and semantics of stabilizers, a new modular language abstraction for transient fault recovery in concurrent programs. To the best of our knowledge, stabilizers are the first language-centric design of a checkpointing facility that provides global consistency and safety guarantees for transient fault recovery in programs with dynamic thread creation and selective message passing communication.

2. A lightweight dynamic monitoring algorithm, faithful to the semantics, that constructs efficient global checkpoints based on the context in which a restoration action is performed. Efficiency is defined with respect to the amount of rollback required to ensure that all threads resume execution in a consistent global state after a checkpoint is restored, as compared to a global checkpointing scheme.

3. A formal semantics along with soundness theorems that formalize the correctness and efficiency of our design.

4. A detailed explanation of an implementation built as an extension of the Concurrent ML library [2] within the MLton [8] compiler. The library includes support for synchronous, selective communication, threading primitives, exceptions, shared memory, as well as events.

5. An evaluation study that quantifies the cost of using stabilizers on various open-source server-style applications.
Our results reveal that the cost of defining and monitoring thread state is small, typically adding no more than 4–6% overhead to overall execution time. Memory overheads are equally modest.

In Chapter 6 we explore monitoring of message passing in our formulation of partial memoization and show how partial memoization can be leveraged to improve communication protocol performance. Partial memoization is an optimization technique that allows the omission of redundant computation by monitoring side-effecting actions at the first call of a function. Subsequent applications of the same function can be avoided if all side-effecting actions of the first call can be replayed. Such an optimization can be utilized to reduce the overheads of state reversion in a message passing language. To allow for the successful elimination of a redundant call, our partial memoization technique ensures that all spawns, communication, and channel creation actions are performed in the order witnessed by the prior execution of the candidate function. If this is not possible, partial memoization will resume execution of the computation from the first such action that cannot be performed. We demonstrate the usefulness of partial memoization by accelerating the performance of a wide variety of benchmarks, including a streaming benchmark and Swerve. The benchmarks leverage a number of different communication protocols and communication patterns. As a case study we show how to implement an error-aware file caching mechanism in Swerve. Chapter 6 makes the following contributions:

1. The definition of partial memoization for a core functional language with synchronous communication primitives as well as threading constructs.

2. A formal definition of partial memoization in terms of an operational semantics along with a safety theorem for partial memoization. We include a detailed proof of the safety theorem.

3. A description of partial memoization in the context of Multi-MLton, a parallel extension of MLton.

4. A detailed performance evaluation of partial memoization on three parallel benchmarks. We consider the effect of memoization on improving the performance of multi-threaded CML applications executing on a multi-core architecture. Our results indicate that memoization can lead to substantial runtime performance improvement of around 30% over a non-memoized version of the same program, with only modest increases in memory overhead (15% on average).

In Chapter 7 we study how to add explicit support for asynchronous events to the CML event framework and show how such events can be utilized to extend existing software in a composable and modular fashion. Asynchronous events interoperate fully with synchronous events, allowing programmers to specify both synchronous and asynchronous actions on a given channel. This uniformity allows the programmer to modify one component to utilize asynchrony without having to change the other participants in cross-component protocols. Asynchronous events provide the ability to utilize asynchrony in selective communication, which was not possible in CML without sacrificing ordering and visibility guarantees. Additionally, we provide a rich set of asynchronous event combinators for constructing first-class communication protocols. We demonstrate the usefulness of asynchronous events by augmenting Swerve with additional functionality and improving its performance on a multi-core processor. Chapter 7 makes the following contributions:
1. We present a comprehensive design for asynchronous events, and describe a family of combinators analogous to their synchronous variants available in CML. To the best of our knowledge, this is the first treatment to consider the meaningful integration of composable CML-style event abstractions with asynchronous functionality.

2. We provide implementations of useful asynchronous abstractions such as callbacks and mailboxes, along with a number of case studies extracted from realistic concurrent applications (e.g., a concurrent I/O library, concurrent file processing, etc.). Our abstractions operate over ordinary CML channels, enabling interoperability between synchronous and asynchronous protocols.

3. We discuss an implementation of asynchronous events that has been incorporated into Multi-MLton, a parallel extension of MLton [8], and present a detailed benchmark study that shows how asynchronous events can help improve the expression and performance of highly concurrent server applications.

Related work is presented at the end of each major chapter. Additional related work that is relevant to proposed future directions of research is given in Chapter 8, along with concluding remarks.

2 CONCURRENT ML

CML is a concurrent extension of Standard ML that utilizes synchronous message passing to enable the construction of synchronous communication protocols. Threads perform send and recv operations on typed channels. These operations block until a matching action on the same channel is performed by another thread.

The main contributions of CML are its event framework and selective communication. In CML, an event is a first-class, abstract, synchronous operation that decouples the description of a synchronous action from the synchronization, or discharge, of the action. For example, a send event that places a value v on a channel c does not perform the deposit of v on c until a thread explicitly enacts the event. Just as function composition allows the creation of first-class computational units, event combinators provide a mechanism to build first-class communication protocols.

The motivation behind first-class events is the desire to support selective communication, allowing a thread to choose between many potential communication partners. For example, a consumer may wish to choose between a set of producers. The consumer will pick any producer from this set that is currently willing to send a value and synchronize with it. If no producer from this set is currently available for synchronization, the consumer will block until one becomes available. In CML, selective communication is provided through a choice operator that selects between a set of events. Events and their combinators are necessary because λ-abstraction and function composition are not enough to support selective communication, as functions hide the computation they encapsulate. A choice operation cannot introspect the encapsulated computation and thus cannot choose the communication action that has a waiting partner.

CML provides first-class synchronous events that abstract synchronous message-passing operations. An event value of type 'a event, when synchronized on, yields a value of type 'a.
An event value represents a potential computation, with latent effect until a thread synchronizes upon it by calling sync.

    spawn     : (unit -> 'a) -> threadID
    sendEvt   : 'a chan * 'a -> unit Event
    recvEvt   : 'a chan -> 'a Event
    alwaysEvt : 'a -> 'a Event
    neverEvt  : 'a Event
    sync      : 'a Event -> 'a
    wrap      : 'a Event * ('a -> 'b) -> 'b Event
    guard     : (unit -> 'a Event) -> 'a Event
    choose    : 'a Event list -> 'a Event

    Figure 2.1. CML event operators.

The following equivalences therefore hold: send(c, v) ≡ sync(sendEvt(c, v)) and recv(c) ≡ sync(recvEvt(c)). Besides sendEvt and recvEvt, there are other base events provided by CML: an alwaysEvt, which contains a value and is always available for synchronization, as well as a neverEvt, which, as its name suggests, is never available for synchronization. These events are typically generated based on the (un)satisfiability of conditions or invariants that can be subsequently used to influence the behavior of more complex events built from the event combinators described below. For example, an always event is typically utilized to provide a default behavior in conjunction with selective communication and simply returns its argument when synchronized upon. Notably, thread creation is not encoded as an event; the thread spawn primitive simply takes a thunk to evaluate as a separate thread, and returns a thread identifier that allows access to the newly created thread's state.

Much of CML's expressive power derives from event combinators that construct complex event values from other events. We list some of these combinators in Figure 2.1. The expression wrap(ev, f) creates an event that, when synchronized on, applies the result of synchronizing on event ev to function f. Conversely, guard(f) creates an event which, when synchronized on, evaluates f() to yield event ev and then synchronizes on ev. Wrap is thus utilized to provide post-synchronization actions and guard to specify pre-synchronization computation. The choose event combinator takes a list of events and constructs an event value that represents a non-deterministic choice among those events in the list that are available for synchronization. For example, choose[recvEvt(a), sendEvt(b, v)], when synchronized on, will either receive a unit value from channel a or send value v on channel b, if there is a sender available on a and a receiver on b. If only one of the events has a matching participant, that event is selected. If neither has a matching participant, the expression blocks until one becomes available.

2.1 Programming with CML

In this section we provide a few examples to illustrate the expressivity of CML and to provide additional insight on the behavior of relevant primitives. Consider encoding a simple producer-consumer pattern leveraging message passing. We can encode the producer and the consumer in separate threads of control and couple them through a shared channel used to communicate values from the producer to the consumer.
    val c = channel()
    val v = ()

    fun producer() = (send(c, v); producer())
    fun consumer() = (recv(c); consumer())

    val tid1 = spawn(producer)
    val tid2 = spawn(consumer)

In the code above, the function producer sends the unit value across a shared channel c. Analogous to the producer function, the consumer function receives values from the producer on the channel c. Notice that both threads are encoded as infinite loops. Since the communication between the producer and consumer is synchronous, the channel will have at most one value stored on it. This allows us to express an infinite computation, or an infinite stream, without having to buffer or potentially utilize infinite space. We call this type of encoding a lightweight server. Lightweight servers produce values in a demand-driven fashion as they block until a request (matching receive) is available on the channel over which they are defined.

We can further extend our example to leverage CML events with some minor changes:

    val c = channel()
    val v = ()

    fun producer() = (sync(sendEvt(c, v)); producer())
    fun consumer() = (sync(recvEvt(c)); consumer())

    val tid1 = spawn(producer)
    val tid2 = spawn(consumer)

In the code above, we have replaced the CML communication primitives send and recv with their event counterparts. To trigger the computation of the events, the CML primitive sync is used. This code behaves exactly as the previous definition, as it does not leverage the inserted events in any interesting way. Let us now consider multiple producers, where we want to augment our consumer to receive a value from any producer, so long as that producer has a value available.

    val c1 = channel()
    val v1 = 1
    val c2 = channel()
    val v2 = 2

    fun producer1() = (sync(sendEvt(c1, v1)); producer1())
    fun producer2() = (sync(sendEvt(c2, v2)); producer2())

    fun consumer() =
        (sync(choose([recvEvt(c1), recvEvt(c2)]));
         consumer())

    val tid1 = spawn(producer1)
    val tid2 = spawn(producer2)
    val tid3 = spawn(consumer)

In the code above, we extended our previous encoding of the producer and consumer communication pattern by adding an additional producer, producer2. Each producer generates values on a distinct channel, c1 for producer1 and c2 for producer2. To be able to select which producer the consumer receives a value from, we need to construct a complex event from CML event combinators. The choose combinator picks an event from a list of events based on its satisfiability. In this example we use choose to pick between receiving on channel c1 and receiving on channel c2. If there is a value available on both channels, choose non-deterministically picks one. If there is only one available value, choose picks the available value. Similarly, if no values are available, choose blocks until one becomes available and then picks that value.

Consider performing an action based on which value we received, or on whom we received that value from. Both types of responses can be encoded by utilizing the wrap combinator in conjunction with the complex event we have already created.

    fun consumer() =
        (sync(wrap(choose([recvEvt(c1), recvEvt(c2)]),
                   fn x => if x then ... else ...));
         consumer())

The code above wraps the complex event with a function that branches on the result of the choice. In this way we can encode responses that are based on the value received by the consumer. CML also provides us with a mechanism to craft a response based on which channel we have received a value from, or, more abstractly, which event was chosen in the choice.
In this running example, a response based on which event was chosen corresponds to a response based on whom the consumer communicated with, as both producers communicate over unique channels.

    fun consumer() =
        (sync(choose([wrap(recvEvt(c1), fn x => response1),
                      wrap(recvEvt(c2), fn x => response2)]));
         consumer())

By wrapping the individual events occurring within the choice, we can designate a response based on which event is picked by the choice when it is synchronized upon. In this code we execute response1 if the receive on c1 is picked and response2 if the receive on c2 is chosen.

3 MLTON

The work presented in the following chapters has been done in the context of MLton, a whole-program optimizing compiler for Standard ML that uses a simply-typed first-order intermediate language. MLton compiles the full SML 97 language [10], including modules and functors. MLton's approach is different from that of other functional language compilers. It imposes significant constraints on the compiler, but yields many optimization opportunities not available with other approaches.

There are numerous issues that arise when translating SML into a simply-typed IL. (Interested readers are directed to [11], an article on extending the MLton compiler, from which this short description is cited.) First, how does one represent SML modules and functors in a simply-typed IL, since these typically require much more complicated type systems? MLton's answer: defunctorize the program [12]. This transformation turns an SML program with modules into an equivalent one without modules by duplicating each functor at every application and eliminating structures by renaming variables. Second, how does one represent SML's polymorphic types and polymorphic functions in a simply-typed IL? MLton's answer: monomorphise the program [13]. This transformation eliminates polymorphism from an SML program by duplicating each polymorphic datatype and function at every type at which it is instantiated. Third, how does one represent SML's higher-order functions in a first-order IL? MLton's answer: defunctionalize the program [14]. This transformation replaces higher-order functions with data structures to represent them and first-order functions to apply them; the resulting IL is a Static Single Assignment (SSA) form [15].

Because each of the above transformations requires matching a functor, function definition, or type definition with all possible uses, MLton is a whole-program compiler. As such, MLton requires all source code for a given program to be present at compile-time and does not support partial compilation. MLton's whole-program compilation strategy has a number of implications. Most importantly, MLton's use of defunctorization means that the placement of code in modules has little measurable effect on performance. The result of this strategy is that modules are purely for the benefit of the programmer in structuring code. Since MLton duplicates functors at each use, no run-time penalty is incurred for abstracting a module into a functor. The benefits of monomorphisation are similar.

In MLton, whole-program control-flow analysis based on 0CFA [16] is employed early in the compilation process, immediately after defunctorization and monomorphisation, and well before any serious code motion or representation decisions are undertaken. Information computed by the analysis is used in the defunctionalization pass to introduce dispatches at call sites to the appropriate closure.
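As a rough, hypothetical illustration of the monomorphisation step (this is not MLton's actual output, merely a sketch of the idea), a polymorphic function that is instantiated at two different types is duplicated into two first-order copies, one per type:

    (* Before monomorphisation: one polymorphic function, two instantiations. *)
    fun pair x = (x, x)
    val a = pair 1
    val b = pair "one"

    (* After monomorphisation (conceptually): one specialized copy per type. *)
    fun pair_int (x : int) = (x, x)
    fun pair_string (x : string) = (x, x)
    val a = pair_int 1
    val b = pair_string "one"

Defunctorization and defunctionalization follow the same duplicate-and-specialize spirit, applied to functors and to higher-order functions respectively.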
In addition to a highly optimized compiler, MLton provides a lightweight runtime layer that supports threads and garbage collection. The MLton runtime permits concurrency but does not allow for parallelism. MLton provides concurrency through user-defined, lightweight threads that are multiplexed on a single kernel thread. The primary concurrency abstractions supported by MLton are those found in the CML library.

3.1 Multi-MLton

Multi-MLton extends MLton with multi-core support, library primitives for efficient lightweight thread creation and management, as well as PCML [9], an optimized, parallel, synchronous message passing extension to CML. In this section, we present the details of Multi-MLton's runtime.

A functional programming discipline, combined with explicit communication via messages (rather than implicit communication via shared memory) and associated lightweight concurrency primitives, results in an attractive programming model, as it allows the programmer to reason about communication protocols. However, there are numerous challenges to realizing this model in practice on scalable multi- and manycore platforms, with respect to both language abstractions and their implementation. It is an investigation of these challenges that guides the design of Multi-MLton.

Figure 3.1. Abstractions found in Multi-MLton. Communication, either synchronous or asynchronous (expressed via asynchronous events as defined in Chapter 7), is depicted through arrows. Parasites are shown as raw stack frames that comprise a single runtime thread.

Multi-MLton defines a programming model in which threads primarily communicate via message passing. It differs from other message passing systems insofar as the abstractions it aims to provide permit (a) the expression of isolation between explicitly annotated groups of threads; (b) composable speculative actions that are message passing aware; (c) the construction of asynchronous events that seamlessly integrate abstract asynchronous communication protocols with abstract CML-style events, allowing the expression of heterogeneous protocols; and (d) deterministic concurrency within threads to enable the extraction of additional parallelism when feasible and profitable. A graphical overview of the features and goals of Multi-MLton is given in Figure 3.1. Asynchronous events, depicted in the figure, are introduced in Chapter 7, and salient details for parasitic threads are presented in Section 3.1.1.

3.1.1 Threading System

To support parallel execution, the Multi-MLton runtime utilizes pthreads [17]. Pthreads are not managed by the programmer; they are created when the program starts. The programmer can pass a runtime argument to indicate how many pthreads to utilize; typically this is one pthread per processor or core. Each pthread manages a lightweight Multi-MLton thread queue. Lightweight threads are created explicitly through a spawn primitive. Each pthread switches between lightweight MLton threads on its queue when it is preempted, as shown in Figure 3.2.

Figure 3.2. The runtime architecture of Multi-MLton utilizes pthreads, each of which contains a queue of lightweight threads, the currently executing lightweight thread, and a pointer to a section of shared memory for allocation.
Pthreads can be dynamically spawned and managed by Multi-MLton's runtime, and such functionality is leveraged by certain primitives.

The MLton garbage collector was also modified to support parallel allocation. Associated with every processor is a memory region used by the threads it executes; allocation within this region requires no global synchronization. These regions are dynamic and growable, accomplished by grabbing new chunks of memory from a free list. All pthreads must synchronize when garbage collection is triggered, as Multi-MLton does not yet support concurrent collection.

Host threads

In Multi-MLton we support two types of user-defined threads: host threads and parasitic threads [18]. Host threads are analogous to MLton's lightweight threads. Parasitic threads are typically raw stack frames that, as their name suggests, temporarily execute on top of a host thread. The implementation supports m host threads running on top of n pthreads, where m is never less than n. Each processor runs a single pthread which maintains a queue of host threads. Each host thread has a contiguous stack, allocated on the MLton heap, which can dynamically grow and shrink during execution. The information associated with the currently executing host thread is cached in the runtime state associated with each processor to improve performance. On a context switch, this information is written back to the host thread. The code to accomplish the context switch is generated by the compiler and is highly optimized. A new host thread, created using spawn, is placed in a processor queue in a round-robin fashion. When there are no host threads in a processor queue, the pthread is suspended, to be woken up when a new host thread is added.

Parasitic threads

Unlike host threads, parasitic threads are implemented as raw stack frames. The expression parasite(e) pushes a new frame to evaluate expression e, similar to a function call. We capture the stack top at the point of invocation. This corresponds to the caller's continuation and is a de facto boundary between the parasite and its host (or potentially another parasite). If the parasite is not blocked or preempted, the computation runs to completion and control returns to the caller, just as if the caller made a non-tail procedure call. If the parasitic thread is blocked, the frames associated with its execution are reified. When the parasitic thread unblocks, it is assigned a new host. Parasitic threads are utilized to implement short-lived asynchronous actions. A parasitic thread can be inflated into a host thread. However, once such inflation occurs, the newly inflated host thread will always remain a host thread. Parasitic threads are inflated into host threads if their execution is preempted.

3.1.2 Communication

The core communication primitives in Multi-MLton are those offered by the CML library. Threads communicate across synchronous first-class channels and can construct abstract protocols from the CML event framework. We extend the CML primitives and event framework with support for asynchronous actions and asynchronous events as described in Chapter 7.

3.1.3 Concluding Remarks

Multi-MLton provides a programming model that leverages pervasive lightweight concurrency and robust communication protocols to specify concurrent and parallel interactions. To realize this programming model we introduce three features of Multi-MLton to build robust, efficient, and expressive communication protocols.
Stabilizers provide per-thread checkpointing and recovery for building robust, fault-tolerant communication abstractions. We introduce partial memoization to accelerate protocols through efficient code reuse. Lastly, we extend CML's event framework with asynchronous events for building first-class asynchronous and heterogeneous protocols: protocols which contain both synchronous and asynchronous communication actions.

4 SWERVE

Swerve [8] is an open-source, third-party web-server written in CML; it is roughly 16K lines of CML code. Swerve was originally developed for SML/NJ and later ported to MLton. The server is composed of a number of modules and libraries. Communication between modules makes extensive use of CML message passing. Threads communicate over explicitly defined channels on which they can either send or receive values. Complex communication patterns are built from CML events.

The web-server configuration is managed through a configuration file that the server parses during bootstrapping. A folder is specified during configuration that the web-server utilizes as a location for hosting files and CGI scripts. This folder is traversed and a representation of its file structure is generated, which the web-server manipulates during hosting. In addition, Swerve, like Apache, expects a MIME type configuration file that is also parsed at bootstrapping.

After bootstrapping, there are five basic modules that govern the core of the server's functionality: Listener, File Processor, Network Processor, Timeout Manager, and the Logger. The Listener module receives incoming HTTP requests and delegates file serving requirements to concurrently executing processing threads. For each new connection, a new listener is spawned; thus, each connection has one main governing entity. The Listener manages socket connections and parses incoming requests. The File Processor module handles access to the underlying file system. Each file that will be hosted is read by a file processor thread that chunks the file and sends it via message passing to the Network Processor. The Network Processor handles access to the network, just as the File Processor handles access to the file system. The Network Processor receives the chunks of the requested file from the File Processor, packetizes them, and sends them on the network to the client which initiated the request. The File Processor and Network Processor execute in lock-step, requiring the Network Processor to have completed sending a chunk before the next one is read from disk. Notice that concurrent requests by multiple clients can be processed in parallel, as each individual request will have a thread dedicated to file processing as well as a thread dedicated to network processing. These threads are managed by the Listener module.

Timeouts are processed by the Timeout Manager through the use of timed events. The File Processor, which starts the inter-module communication protocol after a connection and a request have been established, polls for timeouts. If one is detected, the File Processor notifies the other modules it is currently communicating with of the detection. Those modules then take appropriate cleanup and recovery actions. Therefore, the error notification protocol mirrors the typical communication protocols between the various modules which comprise Swerve. At each communication, the modules leverage pattern matching to check whether the message contains the expected data or a notification of a timeout.
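The following sketch illustrates this pattern-matching idiom: every message either carries the expected data or a timeout notification, and each receive matches on both. The datatype and constructor names here are illustrative; they echo, but do not reproduce, the message types used in Swerve's modules.

  (* Illustrative sketch only: each message is either data or a timeout
     notification, and every receive must handle both possibilities. *)
  datatype xfer = Info of string                  (* meta-data about the transfer  *)
                | Chunk of Word8Vector.vector     (* a chunk of the requested file *)
                | XferTimeout                     (* timeout detected by a peer    *)
                | Done

  fun receiver (consumer : xfer CML.chan) =
    case CML.recv consumer of
        Info info   => receiver consumer          (* record and keep receiving     *)
      | Chunk bytes => receiver consumer          (* packetize, send, keep going   *)
      | XferTimeout => ()                         (* clean up this request         *)
      | Done        => ()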
Timeouts are the most frequent transient fault present in the server, and they are difficult to deal with naively. Indeed, the system's author notes that handling timeouts in a modular way is "tricky" and devotes an entire section of the user manual to explaining the pervasive cross-module error handling in the implementation. Other errors in the system are propagated using the same mechanism the Timeout Manager uses to notify all modules of a timeout. Namely, a specific communication pattern exists in which interacting modules notify one another through communication primitives that an error has been detected.

The Logger module is responsible for producing time-stamped messages and writing them to a log file. In addition to logging, the Logger module is responsible for terminating the server if a fatal error occurs. After the Logger module writes a time-stamped message to the log file recording that such an error has occurred, it terminates the server. This allows for a clean termination and guarantees a record of the error. Besides the five main modules in the system, Swerve also contains libraries and smaller modules for parsing and processing HTML, MIME type configuration parsing, an internal representation of the file system that is hosted by the server, and file management.

4.0.4 Processing a Request

Consider the typical execution flow for processing an incoming request given in Figure 4.1. When a new request is received, the listener spawns a new thread for this connection that is responsible for hosting the requested page. This hosting thread first establishes a timeout quantum with the timeout manager (1) and then notifies the file processor (2). The hosting thread creates a file processing thread to process the request, and the hosting thread is then notified that the file is ready to be chunked (2). The hosting thread passes to the file processing thread the channel on which it will receive its timeout notification (2). The file processing thread is now responsible for checking for explicit timeout notifications (3). The file processing thread begins reading the file from disk and sending chunks to the hosting thread. The hosting thread leverages the functionality of the Network Processor to receive the chunks, packetize them, and send them on the network (2). Since the communication is synchronous, the file processing thread will not read the next chunk from disk until the hosting thread receives the chunk to be sent on the network. The Listener then terminates the connection and performs clean-up actions. If detailed logging is enabled, the Logger logs meta-data about the processed request.

Since a timeout can occur before a particular request starts processing a file (4) (e.g., within the hosting thread defined by the Listener module) or during the processing of a file (5) (e.g., within the File Processor), the resulting error handling code is cumbersome. Moreover, the detection of the timeout itself is handled by a third module, the Timeout Manager. The result is a complicated message passing procedure that spans multiple modules, each of which must figure out how to deal with timeouts appropriately. The unfortunate side effect of such code organization is that modularity is compromised. The code now contains implicit interactions that cannot be abstracted (6) (e.g., the File Processor must explicitly notify the hosting thread of the timeout).
Figure 4.1. Swerve module interactions for processing a request (solid lines) and error handling control and data flow (dashed lines) for timeouts. The numbers above the lines indicate the order in which communication actions occur.

5 LIGHTWEIGHT CHECKPOINTING FOR CONCURRENT ML

In this chapter, we present a safe and efficient checkpointing mechanism for CML that can be used to recover from transient faults. We adopt the following definition of transient faults: a transient fault is an exceptional condition that can often be remedied through re-execution of the code in which it has occurred. Typically, these faults are caused by the temporary unavailability of a resource. For example, a program that attempts to communicate through a network may encounter timeout exceptions because of high network load at the time the request was issued. Transient faults may also arise because a resource is inherently unreliable. In large-scale systems comprised of many independently executing components, failure of one component may lead to transient faults in others even after the failure is detected [19]. For example, a server application that enters an unrecoverable error state may need to be rebooted. Here, the server behaves as a temporarily unavailable resource to its clients, which must re-issue requests sent during the period the server was being rebooted. Other transient errors, which can potentially be remedied through re-execution, may occur because program invariants are violated. As an example, consider software transactional memory (STM), where regions of code execute with atomicity (all or nothing) and isolation (no intermediate effects are witnessed) guarantees. Serializability violations that occur in STMs [20, 21] are typically rectified by aborting the offending transaction and having it re-execute.

A simple solution to recovery from transient faults and errors would be to capture the global state of the program before an action executes that could trigger such a fault or error. If the fault or error occurs and raises an exception, the handler only needs to restore the previously saved program state. Unfortunately, transient faults often occur in long-running server applications that are inherently multi-threaded but which must nonetheless exhibit good fault tolerance characteristics; capturing global program state is costly in these environments. On the other hand, simply re-executing a computation without taking prior thread interactions into account can result in an inconsistent program state and lead to further errors. Suppose two threads communicate synchronously over a shared channel and the sender subsequently re-executes this code to recover from a transient fault. A spurious unhandled execution of the (re)sent message may result because the receiver would have no knowledge that a re-execution of the sender has occurred. Thus, it has no reason to expect retransmission of a previously executed message. In general, the problem of computing a sensible checkpoint for a transient fault requires calculating the transitive closure of dependencies that manifest among threads and the section of code which must be re-executed.

To support the definition and restoration of safe and efficient checkpoints in concurrent functional programs, we propose a new language abstraction called stabilizers. Stabilizers encapsulate three operations. The first initiates monitoring of code for communication and thread creation events, and establishes thread-local checkpoints when monitored code is evaluated.
These thread-local checkpoints can be viewed as restoration points for any transient fault encountered during the execution of the monitored region. The second operation reverts control and state to a safe global checkpoint when a transient fault is detected. The third operation allows previously established checkpoints to be reclaimed.

The checkpoints defined by stabilizers are first-class and composable: a monitored procedure can freely create and return other monitored procedures and behaves like a higher-order function. Stabilizers can be arbitrarily nested, and work in the presence of a dynamically varying number of threads and non-deterministic selective communication. We demonstrate the use of stabilizers for several large applications written in CML.

As a more concrete example of exception recovery, consider a typical remote procedure call in CML. The code shown in Figure 5.1 depicts the server-side implementation of the RPC. Suppose the request to the server is sent asynchronously, allowing the client to compute other actions concurrently with the server; it eventually waits on replyCh for the server's answer. It may be the case that the server raises an exception while processing the client's request. For example, a condition checked in a guarded event that is part of a choice may fail [22]. In addition, the client may attempt to interact with the server while it is awaiting the server's response to its initial request.

  fun rpc-server (request, replyCh) =
    let val ans = process request
    in spawn(send(replyCh, ans))
    end handle Exn => ...

Figure 5.1. A simple server-side RPC abstraction using synchronous communication.

When this happens, how should the client's state be reverted to ensure it can retry its original request? For example, if the client is waiting on the reply channel, the server must ensure exception handlers communicate information back on the channel, to make sure the client does not deadlock waiting for a response. Moreover, if the client must retry its request, any effects performed by the computation it executed concurrently with its request must also be reverted. Constructing fault remediation protocols that involve multiple communicating threads can be complex and unwieldy. Stabilizers, on the other hand, provide the ability to unroll cross-thread computation in the presence of exceptions quickly and efficiently. This is especially useful for errors that can be remedied by re-execution.

Stabilizers provide a middle ground between the transparency afforded by operating systems or compiler-injected checkpoints, and the precision afforded by user-injected checkpoints. In our approach, thread-local state immediately preceding a non-local action (e.g., thread communication) is regarded as a possible checkpoint for that thread. In addition, applications may explicitly identify program points where local checkpoints should be taken, and can associate program regions with these specified points. When a rollback operation occurs, control reverts to one of these saved checkpoints for each thread. Rollbacks are initiated to recover from transient faults. Applications must still detect such faults, and when those faults are detected, applications can leverage the functionality provided by stabilizers. The exact set of checkpoints chosen is determined by safety conditions that ensure a globally consistent state is preserved.
When a thread is rolled back to a thread-local checkpoint state C, our approach guarantees that other threads with which the thread has communicated will be placed in states consistent with C.

5.1 Programming Model

Stabilizers are created, reverted, and reclaimed through the use of three primitives with the following signatures:

  stable    : ('a -> 'b) -> ('a -> 'b)
  stabilize : unit -> 'a
  cut       : unit -> unit

A stable section is a monitored section of code whose effects are guaranteed to be reverted as a single unit. The primitive stable is used to define stable sections. Given a function f, the evaluation of stable f yields a new function f' identical to f except that interesting communication, shared memory access, locks, and spawn events are monitored and grouped based on the stable section in which they occurred. In addition to monitoring communication actions that occur within stable sections, any communication between a stable section and a thread not executing within a stable section is also monitored. Thus, all actions within a stable section are associated with the same checkpoint.¹

The second primitive, stabilize, reverts execution to a dynamically calculated global state; this state will always correspond to a program state that existed immediately prior to the execution of a stable section, communication event, or thread spawn point for each thread. We qualify this claim by observing that external irrevocable operations that occur within a stable section that needs to be reverted (e.g., I/O, foreign function calls, etc.) must be handled explicitly by the application prior to an invocation of a stabilize action, in much the same way as when restoring a checkpoint under an application-level checkpointing scheme.

¹ Mattern-style consistent cuts define a consistent state from a number of local checkpoints based on the ordering of communication events. In much the same way, stabilizers also create consistent global checkpoints from local checkpoints. However, stable sections allow the programmer to specify a series of communication actions that are treated as an atomic unit when calculating a global checkpoint.

Informally, a stabilize action reverts all effects performed within a stable section, much like an abort action reverts all effects within a transaction. However, whereas a transaction enforces atomicity and isolation until a commit, stabilizers enforce these properties only when a stabilize action occurs. Thus, the actions performed within a stable section are immediately visible to the outside. When a stabilize action occurs, these effects, along with their witnesses, are reverted.

The third primitive, cut, establishes a point beyond which stabilization cannot occur. Cut points can be used to prevent the unrolling of irrevocable actions within a program (e.g., I/O). A cut prevents reversion to a checkpoint established logically prior to it; in the case of nested stable sections, it prevents reversion to checkpoints established by the outer stable sections. Informally, a cut executed by a thread T requires that any checkpoint restored for T be associated with a program point that logically follows the cut in program order. Thus, if there is an irrevocable action A (e.g., "launch missile") that cannot be reverted, the expression

  atomic ( A ; cut() )

ensures that any subsequent stabilization action does not cause control to revert to a stable section established prior to A.
If such control transfer and state restoration were permitted, it would (a) necessitate reversion of A's effects, and (b) allow A to be re-executed; neither of which is possible without some external compensation mechanism. The execution of the irrevocable action A and the cut() must be atomic to ensure that another thread does not perform a stabilization action in between the execution of A and cut(). Careful consideration must be given to where cut() is used, as it affects all nested stable sections and global checkpoints constructed from the section in which it occurs, in much the same way as an irrevocable action is part of such local and global checkpoints.

Unlike classical checkpointing schemes or exception handling mechanisms, the result of invoking stabilize does not guarantee that control reverts to the state corresponding to the dynamically-closest stable section. The choice of where control reverts depends upon the actions undertaken by the thread within the stable section in which the stabilize call was triggered (for examples see Section 5.1.1).

Figure 5.2. Interaction between stable sections. Clear circles indicate thread-local checkpoints; dark circles represent stabilization actions.

Composability is an important design feature of stabilizers. There is no a priori classification of the procedures that need to be monitored, nor is there any restriction against nesting stable sections. Stabilizers separate the construction of monitored code regions from the capture of state. When a monitored procedure is applied, or an inter-thread communication action is performed, a potential thread-local restoration point is established. The application of such a procedure may in turn result in the establishment of other independently constructed monitored procedures. In addition, these procedures may themselves be applied and have program state saved appropriately. Thus, state saving and restoration decisions are determined without prejudice to the behavior of other monitored procedures.

5.1.1 Interaction of Stable Sections

When a stabilize action occurs, matching inter-thread events are unrolled as pairs. If a send is unrolled, the matching receive must also be reverted. If a thread spawns another thread within a stable section that is being unrolled, this new thread (and all its actions) must also be discarded. All threads which read from a shared variable must be reverted if the thread that wrote the value is unrolled to a state prior to the write. A program state is stable with respect to a statement if there is no thread executing in this state affected by the statement (i.e., all threads are at a point within their execution prior to the execution of the statement and its transitive effects).

For example, consider thread t1 that enters a stable section S1 and initiates a communication event with thread t2 (see Figure 5.2(a)). Suppose t1 subsequently enters another stable section S2, and again establishes a communication with thread t2. Suppose further that t2 receives these events within its own stable section S3. The program states immediately prior to S1 and S2 represent feasible checkpoints as determined by the programmer, depicted as white circles in the example. If a rollback is initiated within S2, then a consistent global state would require that t2 revert to the state associated with the start of S3, since it has received a communication from t1 initiated within S2.
However, discarding the actions within S3 now obligates t1 to resume execution at the start of S1, since it initiated a communication event within S1 to t2 (executing within S3). Such situations can also arise without the presence of nested stable sections. Consider the example in Figure 5.2(b). Once again, the program is obligated to revert t1, since the stable section S3 spans communication events from both S1 and S2.

Consider the RPC example presented in the introduction rewritten to utilize stabilizers:

  stable fn () =>
    let fun rpc-server (request, replyCh) =
          let val ans = process request
          in spawn(send(replyCh, ans))
          end handle Exn => ... stabilize()
    in rpc-server
    end

If an exception occurs while the request is being processed, the request and the client are unrolled to a state prior to the RPC. The client is free to retry the RPC, or perform some other computation.

5.2 Motivating Example

The Swerve design, presented in Chapter 4, illustrates the general problem of dealing with transient faults in a complex concurrent system: how can we correctly handle faults that span multiple modules without introducing explicit cross-module dependencies to handle each such fault? To motivate the use of stabilizers, we consider the interactions of three of Swerve's modules: the Listener, the File Processor, and the Timeout Manager. More specifically, we consider how a timeout is handled by these three modules.

Figure 5.3 shows the definition of fileReader, a Swerve function in the file processing module that sends a requested file to the hosting thread by chunking the file contents into a series of smaller packets. The file is opened by BinIOReader, a utility function in the File Processing module. Because CML threads cannot be explicitly interrupted, the fileReader function must check in every iteration of the file processing loop whether a timeout has occurred, by calling the Timeout.expired function. If a timeout has occurred, the procedure is obligated to notify the receiver (the hosting thread) by explicitly sending the value XferTimeout on the channel consumer; timeout information is propagated from the Timeout module to fileReader via the abort argument, which is polled.

Stabilizers allow us to abstract this explicit notification process by wrapping the file processing logic in a stable section. Suppose a call to stabilize replaced the call to CML.send(consumer, Timeout). This action would result in unrolling both the actions of sendFile as well as the receiver, since the receiver is in the midst of receiving file chunks. However, a cleaner solution presents itself. Suppose that we modify the definition of the Timeout module to invoke stabilize, and wrap its operations within a stable section (see Figure 5.5). Now, there is no need for any thread to poll for the timeout event. Since the hosting thread establishes a timeout quantum by communicating with Timeout and passes this information to the file processor thread, any stabilize action performed by the Timeout Manager will unroll all actions related to processing this file.
This transformation, therefore, allows us to specify a timeout mechanism without having to embed non-local timeout handling logic within each thread that potentially could be affected.

  fun fileReader name abort consumer =
    let fun loop strm =
          if Timeout.expired abort
          then CML.send(consumer, XferTimeout)
          else let val chunk = BinIO.inputN(strm, fileChunk)
               in (* read a chunk of the file and send to receiver *)
                  ... ; loop strm
               end
    in case BinIOReader.openIt abort name of
          NONE => ()
        | SOME h => (loop (BinIOReader.get h);
                     BinIOReader.closeIt h)
    end

  fun fileReader name abort consumer =
    let fun loop strm =
          let val chunk = BinIO.inputN(strm, fileChunk)
          in (* read a chunk of the file and send to receiver *)
             ... ; loop strm
          end
    in stable (fn () =>
         (case BinIOReader.openIt abort name of
             NONE => ()
           | SOME h => (loop (BinIOReader.get h);
                        BinIOReader.closeIt h))) ()
    end

Figure 5.3. An excerpt of the File Processing module in Swerve. The code fragment displayed on the bottom shows the code modified to use stabilizers. Italics mark areas in the original where the code is changed.

  fn () =>
    let fun receiver() =
          case recv(consumer) of
              Info info   => (sendInfo info; ...)
            | Chunk bytes => (sendBytes bytes; ...)
            | timeout     => (* error handling code *)
            | Done        => ...
          ...
    in ... ; loop receiver
    end

  stable fn () =>
    let fun receiver() =
          case recv(consumer) of
              Info info   => (sendInfo info; ...)
            | Chunk bytes => (sendBytes bytes; ...)
            | Done        => ...
          ...
    in ... ; loop receiver
    end

Figure 5.4. An excerpt of the Network Processor module in Swerve. The main processing of the hosting thread, created by the Listener module, is wrapped in a stable section and the timeout handling code can be removed. The code fragment on the bottom shows the modifications made to use stabilizers. Italics in the code fragment on the top mark areas in the original where the code is removed in the version modified to use stabilizers.

  let fun expired (chan) = isSome (CML.poll chan)
      fun trigger (chan) = CML.send(chan, timeout)
      ...
  in ... ; trigger(chan)
  end

  let fun trigger (chan) = stabilize()
      ...
  in stable (fn () => ... ; trigger(chan)) ()
  end

Figure 5.5. An excerpt of the Timeout Manager module in Swerve. The bottom code fragment shows the code modified to use stabilizers. The expired function can be removed and trigger now calls stabilize. Italics mark areas in the original where the code is changed.

The hosting thread itself is also simplified (as seen in Figure 5.4). By wrapping its logic within a stable section, we can remove all of its timeout error handling code as well. A timeout is now handled completely through the use of stabilizers localized within the Timeout module. This improved modularization of concerns does not lead to reduced functionality or robustness. Indeed, a stabilize action causes the timed-out request to be transparently re-processed with the file being resent, or allows the web-server to process a new request, depending on the desired behavior.

5.2.1 Cut

The cut primitive can be used to delimit the effects of stabilize calls. Consider the example presented in Figure 5.6, which depicts three separate servers operating in parallel. A central coordinator dispatches requests to individual servers and acts as the front-end for handling user requests. The dispatch code, presented below, contains a stable section, and each server has its request processing (as defined in the previous section) wrapped in stable sections.

Figure 5.6. A multi-server implementation which utilizes a central coordinator and multiple servers.
A series of requests is multiplexed between the servers by the coordinator. Each server handles its own transient faults. The shaded portions represent computation which is unrolled due to the stabilize action performed by the server. Single arrows represent communication to servers and double arrows depict return communication. Circular wedges depict communications which are not considered because a cut operation limits their effects.

After each request is completed, the server establishes a cut point so that the request is not repeated if an error is detected on a different server. Servers utilize stabilizers to handle transient faults. Since the servers are independent of one another, a transient fault local to one server should not affect another. Only the request allocated to the faulting server must be re-executed. When the coordinator discovers an error, it calls stabilize to unroll request processing. All requests which encountered an error will be unrolled and automatically retried. Those which completed will not be affected.

  fun multirequest(requestList) =
    foreach (fn (request, replyCh) =>
              let val serverCh = freeServer()
              in spawn (stable (fn () =>
                   (send(serverCh, request);
                    let val reply = recv(serverCh)
                    in (send(replyCh, reply); cut())
                    end)))
              end)
            requestList

The code above depicts a front-end function which handles multiple requests by dispatching them among a number of servers. The function freeServer finds the next available server to process the request. Once the front-end receives a reply from the server, subsequent stabilize actions by other threads will not result in the revocation of previously satisfied requests. This is because the cut() operation prevents rollback of any previously satisfied request. If a stabilization action does occur, the cut() prevents the now-satisfied request to this server from being re-executed; only the server that raised the exception is unrolled.

5.3 Semantics

Our semantics is defined in terms of a core call-by-value functional language with threading primitives. We present the syntax of the language in Figure 5.7 and provide an operational semantics in Figure 5.8. We first present an interpretation of stabilizers in which evaluation of stable sections immediately results in the capture of a consistent global checkpoint. Furthermore, we restrict the language to capture checkpoints only upon entry to stable sections, rather than at any communication or thread creation action. This semantics reflects a simpler characterization of checkpointing than the informal description presented in Section 5.1. In Section 5.4, we refine this approach to construct checkpoints incrementally, and to allow checkpoints to be captured at any communication or thread creation action.

In the following, we use metavariables v to range over values, and δ to range over stable section or checkpoint identifiers. We also use P for thread terms, and e for expressions. We use an over-bar to represent a finite ordered sequence; for instance, f̄ represents f1 f2 . . . fn.
SYNTAX:

  P ::= ∅ | P ‖ P | t[e]δ
  e ::= x | l | λx.e | e(e) | mkCh() | send(e, e) | recv(e)
      | spawn(e) | stable(e) | stable(e) | stabilize() | cut()

EVALUATION CONTEXTS:

  E ::= • | E(e) | v(E) | send(E, e) | send(l, E) | recv(E)
      | spawn(E) | stable(E) | stable(E)

LOCAL EVALUATION RULES:

  λx.e(v) → e[v/x]
  mkCh() → l,  l fresh
  stable(stable(λx.e)) → stable(λx.e)

PROGRAM STATES:

  P ∈ Process
  t ∈ Tid
  x ∈ Var
  l ∈ Channel
  δ ∈ StableId
  v ∈ Val = unit | λx.e | stable(λx.e) | l
  α, β ∈ Op = {LR, SP(t, e), COMM(t, t′), SS, ST, ES, CUT}
  Λ ∈ Process × StableMap
  ∆ ∈ StableMap = (StableId ↦fin Process × StableMap) + ⊥

Figure 5.7. Stabilizers Semantics – The syntax, evaluation contexts, and domain equations for a core call-by-value language for stabilizers.

Figure 5.8. Stabilizers Semantics – The global evaluation rules for a core call-by-value language for stabilizers.

The term α.ᾱ denotes the prefix extension of the sequence ᾱ with a single element α, ᾱ.α the suffix extension, ᾱᾱ′ denotes sequence concatenation, φ denotes empty sequences and sets, and ᾱ ≤ ᾱ′ holds if ᾱ is a prefix of ᾱ′. We write |ᾱ| to denote the length of sequence ᾱ.

Our communication model is a message passing system with synchronous send and receive operations. We do not impose a strict ordering of communication actions on channels; communication actions on the same channel are paired non-deterministically. To model asynchronous sends, we simply spawn a thread to perform the send. To this core language we add three new primitives: stable, stabilize, and cut. When a stable function is applied, a global checkpoint is established, and its body, denoted as stable(e), is evaluated in the context of this checkpoint. The second primitive, stabilize, is used to initiate a rollback, and the third, cut, prevents further rollback in the thread in which it executes due to a stabilize action.

The syntax and semantics of the language are given in Figure 5.7 and Figure 5.8. Expressions include variables, locations that represent channels, λ-abstractions, function applications, thread creations, channel creations, communication actions that send and receive messages on channels, and operations which define stable sections, stabilize global state to a consistent checkpoint, or bound checkpoints. We do not consider references in this core language as they can be modeled in terms of operations on channels.

A program is defined as a set of threads, and we utilize φ to denote the empty program. Each thread is uniquely identified, and is also associated with a stable section identifier (denoted by δ) that indicates the stable section the thread is currently executing within. Stable section identifiers are ordered under a relation that allows us to compare them (e.g., they could be thought of as integers incremented by a global counter).
For convention we assume δs range from 0 to δmax , where δmax is the numerically largest identifier (e.g., the last created identifier). Thus, we write t[e]δ if a thread with identifier t is executing expression e in the context of the stable section with identifier δ. Since stable sections can be nested, the notation generalizes to sequences of stable section identifiers with sequence order reflecting nesting relationship. We omit decorating a term with stable section identifiers when 42 not necessary. Our semantics is defined up to congruence of threads (PkP0 ≡ P0 kP). We write P {t[e]} to denote the set of threads that do not include a thread with identifier t, and P ⊕ {t[e]} to denote the set of threads that contain a thread executing expression e with identifier t. We use evaluation contexts to specify order of evaluation within a thread, and to prevent premature evaluation of the expression encapsulated within a spawn expression. A program state consists of a collection of evaluating threads (P) and a stable map (∆) that defines a finite function associating stable section identifiers to states. A program begins evaluation with an empty stable map (⊥). Program evaluation is specified by a α global reduction relation, P, ∆ =⇒ P0 , ∆0 , that maps a program state to a new program state. We tag each evaluation step with an action, α, that defines the effects induced by evaluating α ∗ the expression. We write =⇒ to denote the reflexive, transitive closure of this relation. The actions of interest are those that involve communication events, or manipulate stable sections. We use labels LR to denote local reduction actions, SP (t, e) to denote thread creation, COMM(t, t0 ) to denote communication, SS to indicate the start of a stable section, ST to indicate a stabilize operation, ES to denote the exit from a stable section, and CUT to indicate a cut action. Local reductions within a thread are specified by an auxiliary relation, e → e0 that evaluates expression e within some thread to a new expression e0 . The local evaluation rules are standard: function application substitutes the value of the actual parameter for the formal in the function body; channel creation results in the creation of a new location that acts as a container for message transmission and receipt; and, supplying a stable function as an argument to a stable expression simply yields the stable function. There are seven global evaluation rules. The first (rule (L OCAL)) simply models global state change to reflect thread local evaluation. The second (rule (S PAWN)) describes changes to the global state when a thread is created to evaluate a thunk (λ x.e); the new thread evaluates e in a context without any stable identifier. The third (rule (C OMM)) describes how a communication event synchronously pairs a sender attempting to transmit a value along a specific channel in one thread with a receiver waiting on the same channel in another thread. Evaluating cut (rule (C UT)) discards the current global checkpoint. The existing 43 stable map is replaced by an empty one. This rule ensures that no subsequent stabilization action will ever cause a thread to revert to a state that existed logically prior to the cut. While certainly safe, the rule is also very conservative, affecting all threads even those that have had no interaction (either directly or indirectly) with the thread performing the cut. We present a more refined treatment in Section 5.4. The remaining rules are ones involving stable sections. 
When a stable section is newly entered (rule (STABLE)), a new stable section identifier is generated. These identifiers are related under a total order that allows the semantics to express properties about the lifetimes and scopes of such sections. The newly created identifier is associated with its creating thread. The checkpoint for this identifier is computed as either the current state, if no checkpoint exists, or the current checkpoint otherwise. In this case, our checkpointing scheme is conservative: if a stable section begins execution, we assume it may have dependencies to all other currently active stable sections. Therefore, we set the checkpoint for the newly entered stable section to the checkpoint taken at the start of the oldest active stable section.

When a stable section exits (rule (STABLE-EXIT)), the thread context is appropriately updated to reflect that the state captured when this section was entered no longer represents an interesting checkpoint; the stable section identifier is removed from its creating thread. A stabilize action (rule (STABILIZE)) simply reverts the state to the current global checkpoint. Note that the stack of stable section identifiers recorded as part of the thread context is not strictly necessary, since there is a unique global checkpoint that reverts the entire program state upon stabilization. However, we introduce it here to help motivate our next semantics, which synthesizes global checkpoints from partial ones, and for which maintaining such a stack is essential.

5.3.1 Example

Consider the example program shown in Figure 5.9. We illustrate how global checkpoints would be constructed for this program in Figure 5.10.

  let fun f() = ...
      fun g() = ... recv(c) ...
      fun h() = ... send(c,v) ...
  in spawn(stable h);
     (stable g) (stable f ())
  end

Figure 5.9. An example used to illustrate the interaction of inter-thread communication and stable sections. The call to f establishes an initial checkpoint. Although g and h do not interact directly with f, the checkpoint established by f may nonetheless be restored on a call to stabilize, as illustrated by Figure 5.10.

Figure 5.10. An example of global checkpoint construction where the inefficiency of global checkpointing causes the restoration of a checkpoint established prior to the stable section in which a call to stabilize occurs.

Initially, thread t1 spawns thread t2. Afterwards, t1 begins a stable section, creating a global checkpoint prior to the start of the stable section. Additionally, it creates an identifier (δ1) for this stable section. We establish a binding between δ1 and the global checkpoint in the stable map, ∆. Next, thread t2 begins its stable section. Since ∆ is non-empty, t2 maps its identifier δ2 to the checkpoint stored by the least δ, namely the checkpoint taken by δ1, rather than creating a new checkpoint. Then, thread t1 exits its stable section, removing the binding for δ1 from ∆. It subsequently begins execution within a new stable section with identifier δ3. Again, instead of taking a new global checkpoint, δ3 is mapped to the checkpoint taken by the least δ, in this case δ2. Notice that δ2's checkpoint is the same as the one taken for δ1. Lastly, t1 and t2 communicate. Observe that the same state is restored regardless of whether we revert to δ2 or δ3. In either case, the checkpoint that would be restored would be the one initially created by δ1.
This checkpoint gets cleared only once no thread is executing within a stable section.

5.3.2 Soundness

The soundness of the semantics is defined by an erasure property on stabilize actions. Consider the sequence of actions α that comprises a potential execution of a program; initially, the program has not established any stable section, i.e., δ = φ. Suppose that there is a stabilize operation that occurs after α. The effect of this operation is to revert the current global program state to an earlier checkpoint. However, assuming that program execution successfully continued after the stabilize call, it follows that there exists a sequence of actions from the checkpoint state that yields the same state as the original, but which does not involve execution of stabilize. In other words, stabilization actions can never manufacture new states nor restore to inconsistent states, and thus have no effect on the final state of program evaluation.

Theorem [Safety]: Let

  Eφt,P[e], ∆ =⇒α ∗ P′, ∆′ =⇒ST.β ∗ P″ ‖ t[v], ∆f .

Then, there exists an equivalent evaluation

  Eφt,P[e], ∆ =⇒α′.β ∗ P″ ‖ t[v], ∆f

such that α′ ≤ α.

Proof sketch. By assumption and rules (STABLE) and (STABILIZE), there exist evaluation sequences of the form:

  Eφt,P[e], ∆ =⇒α′ ∗ P1, ∆1 =⇒SS P2, ∆2

and

  P′, ∆′ =⇒ST P1, ∆1 =⇒β ∗ P″ ‖ t[v], ∆f .

Moreover, α′ ≤ α since the state recorded by the stable operation must precede the evaluation of the stabilize action that reverts to that state. □

5.4 Incremental Construction

While easily defined, the semantics is highly conservative because there may be checkpoints that involve less unrolling that the semantics does not identify. Consider again the example given in Figure 5.9. The global checkpoint calculation reverts execution to the program state prior to execution of f even if f successfully completed. Furthermore, communication events that establish inter-thread dependencies are not considered in the checkpoint calculation. Thus, all threads, even those unaffected by effects that occur in the interval between when the checkpoint is established and when it is restored, are unrolled.

A better alternative would restore thread state based on the actions witnessed by threads within checkpoint intervals. If a thread T observes action α performed by thread T′ and T is restored to a state that precedes the execution of α, T′ can be restored to its latest local checkpoint state that precedes its observance of α. If T witnesses no actions of other threads, it is unaffected by any stabilize calls those threads might make. This strategy leads to an improved checkpoint algorithm by reducing the cost of restoring a checkpoint, limiting the impact to only those threads that witness global effects, and establishing their rollback point to be as temporally close as possible to their current state.

In Figure 5.11 we provide additional syntax and domain equations for a semantics that utilizes incremental checkpoint construction. Figure 5.12 and Figure 5.13 present a refinement to the semantics that incrementally constructs a dependency graph as part of program execution. Checkpointing is now defined with respect to the capture of the communication, spawn, and stable actions performed by threads within a graph data structure. This structure consists of a set of nodes representing interesting program points, and edges that connect nodes that have shared dependencies.
Nodes are indexed by ordered node identifiers, and hold thread state and record the actions that resulted in their creation. We also define maps to associate threads with nodes (η), and stable section identifiers with nodes (σ) in the graph. Informally, the actions of each thread in the graph are represented by a chain of nodes that define temporal ordering on thread-local actions. Back-edges are established to nodes representing stable sections; these nodes define possible per-thread checkpoints. Sources of backedges are communication actions that occur within a stable section, or the exit of a nested stable section. Edges also connect nodes belonging to different threads to capture inter-thread communication events. The evaluation relation P, G ; P0 , G0 evaluates a process P executing action α with α respect to a communication graph G to yield a new process P0 and new graph G0 . As usual, ;∗ denotes the reflexive, transitive closure of this relation. Programs initially begin α evaluation with respect to an empty graph. The auxiliary relation t[e], G ⇓ G0 models intrathread actions within the graph (see rules (B UILD)). It establishes a new node to capture thread-local state, and sets the current node marker for the thread to this node. In addition, if the action occurs within a stable section, a back-edge is established from that node to this section. This back-edge is used to identify a potential rollback point. If a node has a back-edge the restoration point will be determined by traversing these back-edges. Thus, it is safe to not store thread contexts with such nodes (⊥ is stored in the node in that case). New nodes added to the graph are created with a node identifier guaranteed to be greater than any existing node. When a new thread is spawned (rule (S PAWN)), a new node and a new stack for the thread are created. An edge is established from the parent to the child thread in the graph. When a communication action occurs (rule (C OMM)) a bi-directional edge is added between the current node of the two threads participating in the communication. 48 When a cut action is evaluated (rule (CUT)), a new node is added to the graph that records the action. A subsequent stablization action that traverses the graph must not visit this node, which acts as a barrier to prevent restoration of thread state that existed before it. When a stable section is entered (rule (S TABLE)), a new stable section identifier and a new node are created. A new graph that contains this node is constructed, and an association between the thread and this node is recorded. When a stable section exits (rule (S TABLE E XIT)), this association is discarded, although a node is also added to the graph. Graph reachability is used to ascertain a global checkpoint when a stabilize action is performed (rule (S TABILIZE)). When thread T performs a stabilize call, all nodes reachable from T ’s current node in the graph are examined, and the context associated with the least such reachable node (as defined by the node’s index) for each thread is used as the thread-local checkpoint for that thread. If a thread is not affected (transitively) by the actions of the thread performing the rollback, it is not reverted to any earlier state. The collective set of such checkpoints constitutes a global state. The graph resulting from a stabilize action does not contain these reachable nodes; the expression G/n defines the graph in which all nodes reachable from node n are removed from G. 
Here, n is the node indexed by the most recent stable section (δ) in the thread performing the stabilization. An important consistency condition imposed on the resulting graph is that it not contain a CUT node. This prevents stabilization from incorrectly reverting control to a stable section established prior to a cut. Besides threads that are affected by a stabilize action because of dependencies, there may be other threads that are unaffected. If P′ is the set of processes affected by a stabilize call, then Ps = P \ P′, the set difference between P and P′, represents the set of threads unaffected by a stabilize action; the set P′ ⊕ Ps is therefore the set that, in addition to unaffected threads, also includes those thread states representing globally consistent local checkpoints among threads affected by a stabilize call.

SYNTAX:

  P ::= ∅ | P ‖ P | t[e]δ

LOCAL EVALUATION RULES:

  If e → e′ then Eδt,P[e], G ;LR Eδt,P[e′], G

PROGRAM STATES:

  n ∈ Node = NodeId × Tid × Op × (Process + ⊥)
  n ↦ n′ ∈ Edge = Node × Node
  δ ∈ StableId
  η ∈ CurNode = Tid →fin Node
  σ ∈ StableSections = StableId →fin Node
  G ∈ Graph = NodeId × P(Node) × P(Edge) × CurNode × StableSections

Figure 5.11. Stabilizers Semantics – Additional syntax, local evaluation rules, as well as domain equations for a semantics that utilizes incremental checkpoint construction.

Figure 5.12. Stabilizers Semantics – Global evaluation rules for a semantics that utilizes incremental checkpoint construction.

Figure 5.13. Stabilizers Semantics – Global evaluation rules for a semantics that utilizes incremental checkpoint construction (continued).

5.4.1 Example

To illustrate the semantics, consider the sequence of actions shown in Figure 5.14, which is based on the example given in Figure 5.9. Initially, thread t1 spawns thread t2, creating a new node n2 for thread t2 and connecting it to node n1 with a directed edge. The node n3 represents the start of the stable section monitoring function f, with identifier δ1.
Next, a monitored instantiation of h is called, and a new node (n4) associated with this context is allocated in the graph and a new identifier is generated (δ2). No changes need to be made to the graph when f exits its stable section. However, since δ1 cannot be restored by a stabilize call within this thread, it is mapped to φ in σ. Monitoring of function g results in a new node (n5) added to the graph. A back-edge between n5 and n3 is not established because control has exited from the stable section corresponding to n3. As before, we generate a new identifier δ3 that becomes associated with n5. Lastly, consider the exchange of a value on channel c by the two threads. Nodes corresponding to the communication actions are created, along with back-edges to their respective stable sections. Additionally, a bi-directional edge is created between the two nodes.

Figure 5.14. An example of incremental checkpoint construction for the code presented in Figure 5.9.

Recall that the global checkpointing scheme would restore to a global checkpoint created at the point the monitored version of f was produced, regardless of where a stabilization action took place. In contrast, a stabilize call occurring within the execution of either g or h using this incremental scheme would restore the first thread to the continuation stored in node n3 (corresponding to the context immediately preceding the call to g), and would restore the second thread to the continuation stored in node n2 (corresponding to the context immediately preceding the call to h). We formalize this intuition and prove this algorithm correct in Section 5.5.

5.5 Efficiency

We have demonstrated the safety of stabilization actions for global checkpoints: the state restored from a global checkpoint must have been previously encountered during execution of the program. We now introduce the notion of efficiency. Informally, incremental checkpoints are more efficient than global ones because the amount of computation that must be performed following restoration of an incremental checkpoint is less than the computation that must be performed following restoration of a global one. To prove this, we show that from the state restored by a global checkpoint, we can take a sequence of evaluation steps that leads to the same state restored by an incremental checkpoint. Note that efficiency also implies safety. Since the state restored by a global checkpoint can eventually lead to the state produced by an incremental one, and global checkpoints are safe (by Theorem [Safety]), it follows that incremental ones must be safe as well.

The following lemma states that if a sequence of actions does not modify the dependency graph, then all those actions must have been LR.

Lemma 1. [Safety of LR]: If Eφt,P[e], G ;α ∗ Eφt,P[e′], G, then α = LR.
The proof follows from the structure of the rules, since only global rules augment G, and local reductions do not. □

A thread's execution, as defined by the semantics, corresponds to an ordered series of nodes within the communication graph. As an example, consider Figure 5.14, which illustrates how a graph is constructed from a series of evaluation steps. Threads t1 and t2 are represented as paths, [n1, n3, n5, n7] for t1 and [n2, n4, n6] for t2, in the graph depicted in Figure 5.14(f).

We define a path in the graph G for a thread as a sequence of nodes, where (a) the first node in the sequence either has no incoming edges, or a single spawn edge whose source is a node from a different thread, and (b) the last node either has no outgoing edges, or a single communication back-edge to another node. Thus, a path is a chain of nodes with the same thread identifier. A graph, then, is a set of paths connected with communication and spawn edges. A well-formed graph is a set of unique paths, one for each thread. Each edge in such a path corresponds to a global action.

Let PtG be a path extracted from graph G for thread t. By the definition of ⇓, every node in this path contains: (a) the identity of the thread which performed the action that led to the insertion of the node in the graph; (b) the action performed by the thread that triggered the insertion; and (c) the remaining computation for the thread at the point where the action was performed. An action can be of the form SP(t′, e), indicating that a new thread t′ was spawned with label (t′, e); COMM(t, t′), indicating that a communication action between the current thread (t) and another thread (t′) has occurred; or SS, reflecting the entry of a stable section by the executing thread. A schedule StG is a temporally ordered sequence of tuples extracted from PtG that represents all actions performed by t on G.

Figure 5.15. The relation 7→ defines how to evaluate a schedule T derived from a graph G.

We now proceed to define a new semantic relation 7→ (see Figure 5.15) that takes a graph G, a set of schedules T, and a given program state P and produces a new set of thread schedules T′ and a new program state P′. Informally, 7→ examines the continuations in schedules obtained from the communication graph to define a specific evaluation sequence. It operates over schedules based on the following observation: given an element π = (t, α, e) in a schedule, in which an expression e represents the computation still to be performed by t, the next element in the schedule π′ can be derived by performing the action α, and some number of thread local reductions. The rule for (SPAWN) in Figure 5.15 applies to a thread t whose first action in its recorded schedule within the communication graph is a spawn action. The rule performs this action by yielding a new process state that includes the new thread, and a new schedule that reflects the execution of the action.
The rule for communication (rule (COMM)) takes the schedules of two threads that were waiting to initiate a communication with one another, and yields a new process state in which the effects of the communication are recorded. Entry into a stable section (rule (STABLE)) establishes a thread-local checkpoint. Rules (EXIT STABLE) and (CUT) install the operation's continuation and remove the action from the schedule. These rules skip local reductions. This is safe because any reduction that augmented the graph would be present in some schedule (and such a reduction obviously does not include stabilize). The following lemma formalizes this intuition. (If N is the set of nodes reachable from the roots of G′, then G/G′ denotes the graph that results from removing N and the nodes reachable from N from G.)

Lemma 2. [Schedule Soundness]: Suppose there exist G and G′ such that P, G ↝*_α P′, G′ and ST ∉ α. Let T be the schedule derived from G″, where G″ = G′/G. Then T, P ↦*_{G″} φ, P′.

The proof is by induction on the size of G′/G. The base case has G = G′, and therefore |G″| = 0. By Lemma 1, α = LR, which implies P = P′; since a schedule only includes actions derived from the graph G′/G, which is empty, T = φ. Suppose the theorem holds for |G″| = n. To prove the inductive step, consider P, G ↝* P1, G1 ↝_α P′, G′ where |G1/G| = n. By the induction hypothesis, we know T, P ↦*_{G″} T′, P1 and T, P ↦*_{G1/G} φ, P1. Now, if α = LR, then by Lemma 1, G1 = G′; thus |G″| = |G1/G| = n and T′ = φ. Otherwise, α ∈ {SS, ES, COMM(t, t′), SP(t, e)}. Since all of these rules add a node to the graph, we know by the definition of ↦ that there is a transition for each of these actions that guarantees T′, P1 ↦_{G″} φ, P′. □

Lemma 3. [Adequacy]: Let G_s and G_f be two well-formed graphs, G = G_f/G_s, and let P and P′ be process states. If T is the schedule derived from G, and T, P ↦*_G T′, P′ with action sequence α, then P, G_s ↝* P′, G′_f with some action sequence α′ such that |α| ≤ |α′|. The proof follows from the definitions of G_f, G_s, and ↦: all tags in α are contained in α′. By Lemma 1, this implies |α| ≤ |α′|. □

Furthermore, both global and incremental checkpointing yield the same process states in the absence of a stabilize action.

Lemma 4. [Correspondence]: If P, G ↝_α P′, G′ and ST ∉ α, then P, Δ ⟹_α P′, Δ. The proof follows directly from the definitions of ↝ and ⟹. □

Using these lemmas, we can formally characterize our notion of efficiency:

Theorem [Efficiency]: If E^φ_{t,P}[e], Δ ⟹*_{α.ST} P′, Δ′ and E^φ_{t,P}[e], G_0 ↝*_{α.ST} P″, G″, then there exists β such that P′, Δ′ ⟹*_β P″, Δ″.

The proof is by induction on the length of α. The base case considers sequences of length one, since a stabilize action can only occur within the dynamic context of a stable section (tag SS); then P = P′ = P″, β = φ, and the theorem holds trivially. Assume the theorem holds for sequences of length n − 1, and let α = β1.β2 with |β1| = n − m and |β2| = m. By our hypothesis, we know E^φ_{t,P}[e], Δ ⟹*_{β1} P_{n−m}, Δ_{n−m} ⟹*_{β2.ST} P′, Δ′ and E^φ_{t,P}[e], G_0 ↝*_{β1} P_{n−m}, G_{n−m} ↝*_{β2.ST} P″, G″. Without loss of generality, assume P_{n−m} = P′. Intuitively, any checkpoint restored by the global checkpointing semantics corresponds to a state previously seen during evaluation; since both evaluations begin with the same sequence α, they share the same program states, and thus P_{n−m} exists in both sequences. By the definition of ↝, we know G″ and G_{n−m} are well formed. Let G = G″/G_{n−m}; G is well formed since G_{n−m} and G″ are.
Thus, there is a path P^G_t associated with every thread t, and a schedule S^G_t that can be constructed from this path. Let T be the set of schedules derived from G. By Lemma 2, we know there is some sequence of actions α′ such that T, P′ ↦*_G φ, P″. By Lemma 3, we know P′, G_{n−m} ↝*_β P″, G″ and |α′| ≤ |β|. By the definitions of ↦ and ↝, and by Lemma 2, we know that ST ∉ β, since β differs from α′ only with respect to LR actions and α′ does not contain any ST tags. By Lemma 4, we conclude P′, Δ′ ⟹*_β P″, Δ″. □

5.6 Implementation

The main changes to the underlying infrastructure were the insertion of write barriers to track shared memory updates, and hooks into the CML library to update the communication graph. State restoration is thus a combination of restoring continuations and reverting references. The implementation comprises roughly 2K lines of code supporting our data structures, checkpointing, and restoration, as well as roughly 200 lines of changes to CML. To handle references, our implementation assumes race-free programs: every shared reference is assumed to be consistently protected by the same set of locks. We believe this assumption is not particularly onerous in a language like CML, where references are generally used sparingly.

5.6.1 Supporting First-Class Events

Because our implementation is an extension of the core CML library, it supports first-class events [2] as well as channel-based communication. The handling of events is no different from our treatment of messages. If a thread is blocked on an event with an associated channel, we insert an edge from that thread's current node to the channel. We support CML's selective communication with no change to the basic algorithm: recall that our operations only update the checkpoint graph on base events, so complex events such as choose, wrap, or guard are unaffected. Since CML imposes a strict ordering of communication events, each channel must be purged of spurious or dead data after a stabilize action. We leverage the same process CML uses for clearing channels of spurious data after a selective communication to deal with stabilize actions that roll back channel state. Spurious data accumulates on channels in the presence of choice: when choosing between multiple communication actions, all actions are placed on channels and are subsequently cleaned up after one has been matched. In CML this is accomplished lazily by marking all the non-matched communication actions as invalid; we leverage this mechanism in our stabilizer implementation.

5.6.2 Handling References

We have thus far elided details of how to track shared memory accesses to properly support state restoration in the presence of references. Naively tracking each read and write separately would be inefficient. There are two problems that must be addressed: (1) unnecessary writes should not be logged, and (2) spurious dependencies induced by reads should be avoided. Notice that for a given stable section it is enough to monitor the first write to a given memory location, since each stable section is unrolled as a single unit. To monitor writes, we create a log in which we store reference/value pairs. For each reference in the log, its matching value corresponds to the value held in the reference prior to the execution of the stable section. When the program enters a stable section, we create an empty log for this section.
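The following sketch illustrates one possible shape for this per-section log and its write barrier; the names, and the restriction to int refs, are simplifications made for illustration rather than the actual Multi-MLton code, and the barrier it shows is described in more detail next.

    (* Minimal sketch of a per-stable-section write log, assuming (for
       illustration) that all monitored cells are int refs.  Each entry pairs a
       reference with the value it held before the stable section began. *)
    type entry = int ref * int
    type log   = entry list ref

    (* Called on entry to a stable section: the section starts with an empty log. *)
    fun newLog () : log = ref []

    (* Write barrier: only the first write to a reference within the section
       needs to record the old value. *)
    fun logWrite (lg : log) (r : int ref) (newValue : int) =
      let
        val alreadyLogged = List.exists (fn (r', _) => r' = r) (!lg)
      in
        if alreadyLogged
          then ()                        (* already recorded in this section *)
          else lg := (r, !r) :: !lg;     (* remember the pre-section value *)
        r := newValue
      end

    (* On stabilization, revert every logged reference to its saved value. *)
    fun restore (lg : log) = List.app (fn (r, old) => r := old) (!lg)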
When a write is encountered within a monitored procedure, a write barrier is executed that checks whether the reference being written is already present in the log maintained by the section. If there is no entry for the reference, one is created and the current value of the reference is recorded; otherwise, no action is required. To handle references occurring outside stable sections, we create a log for the most recently taken checkpoint for the writing thread.

Until a nested stable section exits, it is possible for a call to stabilize to unroll to the start of that section. A nested section is created when a monitored procedure is defined within the dynamic context of another monitored procedure. Nested sections require maintaining their own logs, and log information in these sections must be propagated to the outer section upon exit. This propagation is not entirely trivial: if the outer section has monitored a particular memory location that has also been updated by the inner one, we need only retain the outer section's log entry, and the value preserved by the inner one can be discarded.

Efficiently monitoring read dependencies requires a different methodology. We assume read operations occur much more frequently than writes, so it would be impractical for barriers on all read operations to record dependency information in the communication graph. Based on our assumption of race-free programs, it is sufficient to monitor lock acquires and releases to infer shared memory dependencies. By incorporating happens-before dependency edges on lock operations [23], stabilize actions initiated by a writer to a shared location can be effectively propagated to readers that mediate access to that location via a common lock: a lock acquire is dependent on the previous acquisition of the lock.

5.6.3 Graph Representation

The main challenge in the implementation was developing a compact representation of the communication graph. We have implemented a number of node/edge compaction algorithms that allow fast culling of redundant information. For instance, any two nodes that share a back-edge can be collapsed into a single node, and we ensure that there is at most one edge between any pair of nodes. Any addition to the graph affects at most two threads. We use thread-local meta-data to find the most recent node for each thread, so the graph is never traversed in its entirety.

The size of the communication graph grows with the number of communication events, thread creation actions, lock acquires, and stable sections entered. However, we do not need to store the entire graph for the duration of program execution. The leaves of the graph data structure are distributed among threads: a thread holds a reference to its current node (which is always a leaf), and when a new node is added, the reference is updated. Any node created within a stable section establishes a back-edge to the node that represents that section. Thus, any unreachable node can be safely reclaimed by the garbage collector. As we describe below, memory overheads are therefore proportional to the amount of communication, and space is only reclaimed after stable sections complete. Notice that long-lived stable sections prevent reclamation of the portion of the graph reachable from them.
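The sketch below gives one possible shape for these nodes and the thread-local current-node reference; the datatype, the field names, and the use of a plain thunk in place of a captured continuation are illustrative assumptions, not the actual Multi-MLton data structures.

    (* Each node records the action that created it, a resumption point captured
       just before that action, and its outgoing edges (spawn, communication,
       and back-edges to enclosing stable sections). *)
    datatype action = SPAWN | COMM | STABLE_ENTRY

    datatype node = NODE of {
      action : action,
      resume : (unit -> unit) option,   (* stands in for a captured continuation *)
      edges  : node list ref
    }

    (* Per-thread pointer to that thread's current (leaf) node; shown here as a
       single ref for brevity. *)
    val currentNode : node option ref = ref NONE

    (* Adding a node only touches the current leaf: link it to the new node and
       advance the thread-local pointer, so older nodes become unreachable once
       no stable section refers back to them. *)
    fun addNode (act, k) =
      let
        val n = NODE {action = act, resume = k, edges = ref []}
      in
        (case !currentNode of
             NONE => ()
           | SOME (NODE {edges, ...}) => edges := n :: !edges);
        currentNode := SOME n;
        n
      end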
A stabilize action has complexity linear in the number of nodes and edges in the graph. Our implementation uses a combination of depth-first search and bucket sorting to calculate the resulting graph after a stabilize call in linear time: DFS identifies the portion of the graph that will be removed after the stabilize call, and only sections of the graph reachable from the stabilize call are traversed, resulting in a fast restoration procedure.

5.6.4 Handling Exceptions

Special care must be taken to deal with exceptions, since they can propagate out of stable sections. When an exception is raised within a stable section but its handler's scope encompasses the stable section itself, we must record this event in our graph, because once an exception propagates out of a stable section, that stable section is no longer active. To illustrate why such tracking is required, consider the example code given in Figure 5.16.

    let fun g () =
          let val f = stable (fn () => (... raise error ...))
          in ... ; f () handle error => stabilize ()
          end
    in stable g ()
    end

Figure 5.16. Sample code utilizing exceptions and stabilizers.

The example program consists of two functions, f and g, both of which execute within stable sections. Function f's execution is also wrapped in an exception handler that catches the error exception; notice that this handler is outside the scope of the stable section for f. During execution, two checkpoints will be taken, one at the call site of g and the other at f's call site. When the exception error is raised and handled, the program executes a call to stabilize. The correct checkpoint to restore is the one captured at g's call site; without exception tracking, f's checkpoint would be incorrectly restored.

To implement exception tracking, we wrap stable sections with generic exception handlers. Such handlers catch all exceptions, modify our run-time graph, and finally re-raise the exception to allow it to propagate to its appropriate handler. Exceptions that have handlers local to the stable section in which they occur are not affected. The modifications required to the dependency graph are limited: an exception propagating out of a stable section is simply modeled as a stable-section exit, so nesting does not introduce additional complexity.

5.6.5 The Rest of CML

Besides channel and event communication, the CML library offers many other primitives for thread communication and synchronization. The most notable of these are synchronous variables (M-vars, I-vars, and sync vars), which support put and get operations. We instrumented stabilizers to monitor synchronous variables in much the same manner as shared references. The remaining CML primitives are constructed from either the basic channel and event building blocks or from synchronous variables, and so do not need special support in the context of stabilizers. (If an application relies heavily on such constructs, it may be more efficient to make stabilizers aware of the abstraction itself instead of its component building blocks.) CML also provides various communication protocols, such as multicast, which are constructed from a specified collection of channels and a series of communication actions. Again, because we instrument the fundamental building blocks of CML, no special hooks or monitoring infrastructure must be inserted for them.

5.7 Performance Results

To measure the cost of stabilizers with respect to various concurrent programming paradigms, we present synthetic benchmarks that quantify pure memory and time overheads, and we examine several server-based open-source CML benchmarks to illustrate average overheads in real programs.

Table 5.1
Benchmark characteristics and dynamic counts.

Benchmark   LOC (incl. eXene)   Threads   Channels   Comm. Events   Shared Writes   Shared Reads
Triangles   16501               205       79         187            88              88
N-Body      16326               240       99         224            224             273
Pretty      18400               801       340        950            602             840
Swerve      16000               10532     231        29047          9339            80293
Table 5.2
Benchmark graph sizes and normalized overheads.

Benchmark   Graph Size (MB)   Runtime Overhead (%)   Memory Overhead (%)
Triangles   0.19              0.59                   8.62
N-Body      0.29              0.81                   12.19
Pretty      0.74              6.23                   20.00
Swerve      5.43              2.85                   4.08

The benchmarks were run on an Intel P4 2.4 GHz machine with 1 GB of RAM running Gentoo Linux, compiled and executed using MLton release 20041109. To measure the costs of our abstraction, each benchmark is executed in three different ways: one in which the benchmark runs with no actions monitored and no checkpoints constructed; one in which the entire program is monitored (effectively wrapped within a stable section) but no checkpoints are actually restored; and one in which relevant sections of code are wrapped within stable sections, exception handlers dealing with transient faults are augmented to invoke stabilize, and faults are dynamically injected to trigger restoration.

5.7.1 Synthetic Benchmarks

Our first benchmark, Asynchronous Communication, measures the cost of building and maintaining our graph structure as well as the cost of stabilize actions in the presence of asynchronous communication. The benchmark spawns two threads, a source and a sink, that communicate asynchronously, and we measure the cost of our abstraction under an increasing load of asynchronous communications. To measure the overheads of recording checkpoints, the source and sink threads are wrapped within stable sections. The source thread spawns a number of new threads which all send values on a predefined channel. The sink thread loops until all messages are received and then performs a stabilize action. Since both threads are wrapped in stable sections, the effect of stabilization, when stabilize is called from the source thread, is to unroll the entire program. Notice that if we called stabilize from the sink, every worker thread that was spawned would be unrolled, but the source thread would not be, since it does not directly communicate with the sink.

Figure 5.17. Asynchronous Communication runtime overheads.

The second benchmark, Communication Pipeline, measures similar effects as the first, but captures the behavior of computations that generate threads which communicate in a synchronous pipeline fashion. The benchmark spawns a series of threads, each of which defines a channel used to communicate with its predecessor. Each thread blocks until it receives a value on its input channel and then sends an acknowledgment to the thread spawned before it. The first and last threads in the chain are connected to form a circular pipeline. This structure establishes a chain of communications over a number of channels, all of which must be unrolled if a stabilize action is initiated. To correctly reach a consistent checkpoint state, an initial thread, responsible for creating the threads in the pipeline, is wrapped within a stable section. Since the initial thread begins by spawning worker threads, unrolling it will lead to a state that does not reflect any communication events or thread creation actions.
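To make the pipeline's structure concrete, the sketch below shows one way such a ring could be written with CML primitives and the stable annotation described earlier; it is an illustrative reconstruction under those assumptions, not the benchmark's actual source, and it omits the timing harness and the injected stabilize call.

    (* Build a ring of n threads.  Each spawned thread waits on its own channel
       and then acknowledges the previously created thread; the builder closes
       the ring by signalling the last thread and waiting on the first channel. *)
    fun pipeline n =
      let
        val first = CML.channel ()
        fun build (prev, 0) = (CML.send (prev, ()); CML.recv first)
          | build (prev, i) =
              let
                val ch = CML.channel ()       (* this stage's input channel *)
              in
                ignore (CML.spawn (fn () => (CML.recv ch; CML.send (prev, ()))));
                build (ch, i - 1)
              end
      in
        build (first, n)
      end

    (* The initial thread is monitored, so a stabilize unrolls the whole ring. *)
    val run = stable pipeline

Applying run to the desired ring size then executes the monitored pipeline; a stabilize initiated anywhere inside it reverts to the state captured before the call.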
These benchmarks measure the overhead of logging program state and communication dependencies with no opportunity to amortize these costs over other, non-stabilizer-related operations. These numbers therefore represent worst-case overheads for monitoring thread interactions. The versions of the benchmarks with injected stabilizers were compared to a vanilla MLton implementation of CML with no additional instrumentation. On average the programs take about 15-20% longer to run and consume about 50% more memory.

The runtime overheads for the synthetic benchmarks are presented in Figure 5.17 and Figure 5.19, and the total allocation overheads are presented in Figure 5.18 and Figure 5.20. As expected, the cost of simply maintaining the graph grows linearly with the number of communications performed, while runtime overheads remain constant. There is a significant initial memory and runtime cost because we pre-allocate the hash tables used by the graph.

Figure 5.18. Asynchronous Communication memory overheads.

Figure 5.19. Communication Pipeline runtime overheads.

Figure 5.20. Communication Pipeline memory overheads.

The protocols inherent in these benchmarks are captured at runtime via the communication graph. We present two sample graphs, one for each of the microbenchmarks, in Figure 5.21 and Figure 5.22. In the graph for asynchronous communication we see a complex tree-like communication structure generated from the single source thread communicating asynchronously with the sink; the branching structure arises from the spawning of new threads, each of which communicates once with the sink. The communication pipeline exhibits a much different structure: the threads communicate in a pre-defined order, creating a simple stream. Both graphs were generated from the stabilizer communication graph and fed to DOT to produce the visual representation. (We have found visualizing the communication graph to be a useful tool for developing and debugging CML programs, and we believe stabilizers can be utilized during testing and development to assist the programmer in constructing complex communication protocols.)

Figure 5.21. Communication graphs for the Asynchronous Communication synthetic benchmark.

Figure 5.22. Communication graphs for the Communication Pipeline synthetic benchmark.

5.7.2 Open-Source Benchmarks

Our other benchmarks include several eXene [24] benchmarks: Triangles and N-Body, mostly display programs that create threads to draw objects, and Pretty, a pretty-printing library written on top of eXene. The eXene toolkit is a library for X Windows, written in CML, that implements the functionality of xlib; it comprises roughly 16K lines of Standard ML. Events from the X server and control messages between widgets are distributed in streams (coded as CML event values) through the window hierarchy, and eXene manages the X calls through a series of servers dynamically spawned for each connection and screen. The last benchmark we consider is Swerve, a web server written in CML whose major modules communicate with one another using message-passing channel communication; it makes no use of eXene. All of the benchmarks create various CML threads to handle events; communication occurs mainly through a combination of message passing on channels, with occasional updates to shared data.
Table 5.3
Restoration of the entire web-server.

Reqs   Graph Size   Channels Num   Channels Cleared   Threads Affected   Runtime (ms)
20     1130         85             42                 470                5
40     2193         147            64                 928                19
60     3231         207            84                 1376               53
80     4251         256            93                 1792               94
100    5027         296            95                 2194               132

For these benchmarks, stabilizers exhibit a runtime slowdown of up to 6% over a CML program in which no monitoring is performed (see Table 5.1 and Table 5.2). For a highly concurrent application like Swerve, the overheads are even smaller, on the order of 3%. The cost of using stabilizers depends only on the number of inter-thread actions and shared data dependencies that are logged, and these overheads are well amortized over program execution. Table 5.1 provides dynamic counts from the stabilizer graph.

Memory overheads to maintain the communication graph are larger, although in absolute terms they are quite small. Because we capture continuations prior to executing communication events and entering stable sections, part of the memory cost is influenced by representation choices made by the underlying compiler. Nonetheless, a benchmark such as Swerve, which creates over 10K threads and employs non-trivial communication patterns, requires only 5MB to store the communication graph, a roughly 4% overhead over the memory consumption of the original program.

To measure the cost of calculating and restoring a globally consistent checkpoint, we consider three experiments. The first is a simple unrolling of Swerve (see Table 5.3), in which a call to stabilize is inserted during the processing of a varying number of concurrent web requests. This measurement illustrates the cost of restoring to a consistent global state that can potentially affect a large number of threads. Although we expect large checkpoints to be rare, we note that restoration of such checkpoints is nonetheless quite fast. The graph size is presented as the total number of nodes. Channels can be affected by an unrolling in two different ways: a channel may contain a value sent by a communicating thread that has not yet been consumed by a receiver, or a channel may connect two threads that have successfully exchanged a value. In the first case we must clear the channel of the value if the thread that placed the value on the channel is unrolled; in the latter case no direct processing on the channel is required. The table shows both the total number of affected channels and those that must be cleared.

5.7.3 Case Studies: Injecting Stabilizers

To quantify the cost of using stabilizers in practice, we extended Swerve and eXene and replaced some of their error-handling mechanisms with stabilizers (see Table 5.4).

Table 5.4
Instrumented recovery.

Benchmark   Channels Num   Channels Cleared   Threads Total   Threads Affected   Runtime (ms)
Swerve      38             4                  896             8                  3
eXene       158            27                 1023            236                19

For Swerve, the implementation details are given in Section 5.2. Our benchmark manually injects a timeout every ten requests, stabilizes the program, and re-requests the page. For eXene, we augment a scrollbar widget used by the pretty printer. In eXene, the state of a widget is defined by the state of its communicating threads, and no state is stored in shared data. The scroll bar widget is composed of three threads which communicate over a set of channels.
The widget's processing is split between two helper threads and one main controller thread. Any error handled by the controller thread must be communicated to the helper threads, and vice versa. We manually inject the loss of packets into the X server, stabilize the widget, and wait for new interaction events. The loss of packets is injected by simply dropping every tenth packet received from the X server. Ordinarily, if eXene ever loses an X server packet, its default behavior is to terminate execution, since there is no easy mechanism available to restore the state of the widget to a globally consistent point. Using stabilizers, however, packet-loss exceptions can be safely handled by the widget: by stabilizing the widget, we return it to a state prior to the failed request, and subsequent requests will redraw the widget as we would expect. Thus, stabilizers permit the scroll bar widget to recover from a lost packet without pervasive modification to the underlying eXene implementation.

Finally, to measure the sensitivity of stabilization to application-specific parameters, we compare our stabilizer-enabled version of Swerve to the stock configuration by varying two program attributes: file size and quantum. Since stabilizers eliminate the need for polling during file processing, runtime costs should improve as file sizes increase. Our tests were run on both versions of Swerve; for a given file size, 20 requests are processed. The results (see Figure 5.23) indicate that for large file sizes (upward of 256KB) our implementation is slightly more efficient than the original. Our slowdown for small file sizes (on the order of 10KB) is proportional to our earlier benchmark results. We expect most web servers to host mostly small files.

Figure 5.23. Swerve file size overheads for Stabilizers.

Since our graph algorithm requires monitoring various communication events, lowering the time quantum allocated to each thread may adversely affect performance, because the overhead of monitoring the graph then consumes a greater fraction of a thread's computation per quantum. Our tests compared the two versions of Swerve, keeping the file size constant at 10KB but varying the allocated quantum (see Figure 5.24). Surprisingly, the results indicate that stabilizer overheads become significant only when the quantum is less than 5 ms; as a point of comparison, CML's default quantum is 20 ms.

Figure 5.24. Swerve quantum overheads for Stabilizers.

5.8 Related Work

Being able to checkpoint and roll back parts or the entirety of an execution has been the focus of notable research in the database community [25] as well as the parallel and distributed computing communities [26–28]. Checkpoints have been used to provide fault tolerance for long-lived applications, for example in scientific computing [29, 30], but they have typically been regarded as heavyweight entities to construct and maintain.

Existing checkpoint approaches can be classified into four broad categories: (a) schemes that require applications to provide their own specialized checkpoint and recovery mechanisms [31, 32]; (b) schemes in which the compiler determines where checkpoints can be safely inserted [33]; (c) techniques that require operating system or hardware monitoring of thread state [28, 34, 35]; and (d) library implementations that capture and restore state [36].
Checkpointing functionality provided by an application or a library relies on the programmer to define meaningful checkpoints. For many multi-threaded applications, determining these points is non-trivial because it requires reasoning about global, rather than threadlocal, invariants. Compiler and operating-system injected checkpoints are transparent to the programmer. However, transparency comes at a notable cost: checkpoints may not be semantically meaningful or efficient to construct. The ability to revert to a prior point within a concurrent execution is essential to transaction systems [24, 37, 38]; outside of their role for database concurrency control, such approaches can improve parallel program performance by profitably exploiting speculative execution [39, 40]. Harris et al. proposes a transactional memory system [41] for Haskell that introduces a retry primitive to allow a transactional execution to safely abort and be re-executed if desired resources are unavailable. However, this work does not propose to track or revert effectful thread interactions within a transaction. In fact, such interactions are explicitly rejected by the Haskell type-system. There has also been recent interest in providing transactional infrastructures for ML [42], and in exploring the interaction between transactional semantics and first-class synchronous operations [22,43,44]. Our work shares obvious similarities with all these efforts insofar as stabilizers also require support for logging and revert program state. In addition to stabilizers, functional language implementations have utilized continuations for similar tasks. For example, Tolmach and Appel [45] described a debugging mechanism for SML/NJ that utilized captured continuations to checkpoint the target program at given time intervals. This work was later extended [46] to support multi-threading, and was used to log non-deterministic thread events to provide replay abilities. The technique, however, was never adopted to synchronous message passing or CML-style events. 75 Another possibility for fault recovery is micro-reboot [19], a fine-grained technique for surgically recovering faulty application components which relies critically on the separation of data recovery and application recovery. Micro-reboot allows for a system to be restarted without ever being shut down by rebooting separate components. Unlike checkpointing schemes, which attempt to restore a program to a consistent state within the running application, micro-reboot quickly restarts an application component, but the technique itself is oblivious to program semantics. Recent work in the programming languages community has explored abstractions and mechanisms closely related to stabilizers and their implementation for maintaining consistent state in distributed environments [47], detecting deadlocks [48], and gracefully dealing with unexpected termination of communicating tasks in a concurrent environment [49]. For example, kill-safe thread abstractions [49] provide a mechanism to allow cooperating threads to operate even in the presence of abnormal termination. Stabilizers can be used for a similar goal, although the means by which this goal is achieved is quite different. Stabilizers rely on unrolling thread dependencies of affected threads to ensure consistency instead of employing specific runtime mechanisms to reclaim resources. 
There has been a number of recent proposals dealing with safe futures, which bear some similarity to stabilizers insofar as both provide a revocation mechanism based on tracking dynamic data and control-flow. Futures are a program abstraction that express a simple yet expressive form of fork-join parallelism. The expression future (e) declares that e can be evaluated concurrently with the future’s continuation. The expression touch (p) where p is a placeholder returned by evaluating a future, blocks until the result of evaluating e is known. Safe futures provide additional deterministic guarantees on the concurrent execution of the future with its continuation, ensuring that all data dependencies found in the original (non-future annotated) version are respected. Welc et. al [40] provide a dynamic analysis that enforces deterministic execution of sequential Java programs. However, safe futures do not easily compose with other Java concurrency primitives, and the criteria for revocation is automatically determined based on dependency violations, and is thus not under user control. In sequential programs, static 76 analysis coupled with simple program transformations [50] can ensure deterministic parallelism by providing coordination between futures and their continuations in the presence of mutable state. Unfortunately neither approach provided safety in the presences of exceptions. This was remedied in [51], where the authors presented an implementation for exception-handling in the presence of safe futures. Flanagan and Felleisen [52] presented a formal semantics for futures, but did not consider how to enforce safety (i.e. determinism) in the presence of mutable state. Navabi and Jagannathan [53] presented a formulation of safe futures for a higher-order language with first-class exceptions and first-class references. Neither formulation consider the interaction of futures with explicit threads of control. We believe that a combination of some of the approaches described above coupled with stabilizers can be leveraged to provide an implementation of safe futures in the presence of explicit threads of control. We discuss this possibility in our future work in Chapter 8. Transactional Events [22, 43, 44] are an emerging trend in concurrent functional language design. Transactional events combine first-class message passing events with an all-or-nothing semantics afforded by transactions. Transactional events provide the ability to express three-way rendezvous and safe, guarded synchronous receives. Unlike stabilizers which provide atomicity properties on rollback, transactional events provide atomicity on a communication protocol. To ensure that a communication protocol which spans multiple threads of control completes atomically in the presences of selective communication, all possible permutations of the protocol may need to be explored. This is, in general, difficult as many different threads can be potential participants in the protocol. Transactional events rely on a dynamic search thread strategy to explore communication patterns, thus guaranteeing atomicity of communication protocols. Only successful communication patterns are chosen. Currently, transactional events do a full state space exploration, similar to model checking, until a successful communication protocol is discovered. Originally transactional events only supported synchronous communication. However, recent extensions have provided semantics for handling shared memory references [44]. 
To avoid the complexity of reasoning about all possible interleavings of shared memory operations, transactional events with support for shared memory only consider amalgamations of operations bounded by communication actions and synchronization points. We believe stabilizers could be utilized by transactional events to implement optimistic search-thread strategies instead of full state-space explorations: the monitoring and rollback properties of stabilizers could be leveraged to create a more sophisticated search mechanism that uses information monitored during previous searches to guide future ones. We provide further details as part of our future work in Chapter 8.

5.9 Concluding Remarks

Stabilizers are a novel modular checkpointing abstraction for concurrent functional programs. Unlike other checkpointing schemes, stabilizers are not only able to identify the smallest subset of threads which must be unrolled, but also provide useful safety guarantees. As a language abstraction, stabilizers can be used to simplify program structure, especially with respect to error handling, debugging, and consistency management. Our results indicate that stabilizers can be implemented with small overhead and thus serve as an effective and promising checkpointing abstraction for high-level concurrent programs.

6 PARTIAL MEMOIZATION OF CONCURRENCY AND COMMUNICATION

Eliminating redundant computation is an important optimization supported by many language implementations. One important instance of this optimization class is memoization [54–56], a well-known dynamic technique that can be used to avoid performing a function application by recording the arguments and results of previous calls. If a call is supplied an argument that has been previously cached, the execution of the function body can be elided and the corresponding result returned immediately instead.

When functions perform effectful computations, leveraging memoization becomes significantly more challenging. Two calls to a function f that performs some stateful computation need not generate the same result if the contents of the state f uses to produce its result differ at the two call-sites. Concurrency and communication introduce similar complications. If a thread calls a function f that communicates with functions invoked in other threads, then memo information recorded with f must include the outcome of these actions. If f is subsequently applied to a previously seen argument, and its communication actions at this call-site are the same as its effects at the original application, re-evaluation of the pure computation in f's body can be avoided. Because of thread interleavings, synchronization, and the non-determinism introduced by scheduling choices, making such decisions is non-trivial.

Nonetheless, we believe memoization can be an important component of a concurrent programming language runtime. Our belief is reinforced by the widespread emergence of multi-core platforms, and renewed interest in streaming [57], speculative [58] and transactional [20, 59] abstractions for programming these architectures. For instance, optimistic concurrency abstractions rely on efficient control and state restoration mechanisms: when a speculation fails because a previously available computation resource becomes unavailable, or when a transaction aborts due to a serializability violation [24] and must be retried [41], its effects must be undone.
Failure represents wasted work, both in terms of the opera- 79 tions performed whose effects must now be erased, and in terms of overheads incurred to implement state restoration; these overheads include logging costs, read and write barriers, contention management, etc. One way to reduce this overhead is to avoid subsequent re-execution of those function calls previously executed by the failed computation whose results are unchanged. The key issue is understanding when utilizing memoized information is safe, given the possibility of concurrency, communication, and synchronization among threads. In this chapter, we consider the memoization problem for a higher-order concurrent language in which threads communicate through synchronous message passing primitives (e.g. Concurrent ML [2]). A synchronization event acknowledges the existence of an external action performed by another thread willing to send or receive data. If such events occur within a function f whose applications are memoized, then avoiding re-execution at a call-site c is only possible if these actions are guaranteed to succeed at c. In other words, using memo information requires discovery of interleavings that satisfy the communication constraints imposed by a previous call. If we can identify a global state in which these constraints are satisfied, the call to c can be avoided. We say that a constraint is satisfiable if there exists a thread willing to perform a matching action on the channel. Thus, if our constraint embodied a send on a particular channel, for the constraint to be satisfiable, there must exist a thread willing to receive from that channel. If there exists no such state, then the call must be performed. Because finding such a state can be expensive (it may require an unbounded state space search), we consider a weaker notion of memoization. By recording the context in which a memoization constraint was generated, implementations can always choose to simply resume execution of the function at the program point associated with the constraint using the saved context. In other words, rather than requiring global execution to reach a state in which all constraints in a memoized application are satisfied, partial memoization gives implementations the freedom to discharge some fraction of these constraints, performing the rest of the application as normal. Although our description and formalization is developed in the context of message-based communica- 80 tion, the applicability of our solution naturally extends to shared-memory communication as well given the simple encoding of the latter in terms of the former [2]. Whenever a constraint built during memoization is discharged on a subsequent application, there is a side-effect that occurs; namely the execution of the action the constraint desribes. For example, consider a communication constraint associated with a memoized version of a function f that expects a thread T to receive data d on channel c. To use this information at a subsequent call, we must identify the existence of T , and having done so, must propagate d along c for T to consume. Thus, whenever a constraint is satisfied, an effect that reflects the action represented by that constraint is performed. We consider the set of constraints built during memoization as forming an ordered log, with each entry in the log representing a condition that must be satisfied to utilize the memoized version, and an effect that must be performed if the condition holds. 
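One way to picture such a log is sketched below in ML; the datatype and constructor names simply mirror the constraint forms described informally here (sends, receives, spawns, channel creations, and the final return value) and are illustrative, not the formal syntax introduced later in Section 6.3.

    (* A constraint log entry: the condition that must hold in the current global
       state, and the effect performed when the entry is discharged. *)
    datatype ('ch, 'v, 'e) constraint =
        SEND    of 'ch * 'v    (* some thread must be willing to receive on the
                                  channel; discharging it propagates the value *)
      | RECV    of 'ch * 'v    (* some thread must be willing to send the value *)
      | SPAWNED of 'e          (* re-create a thread spawned by the original call *)
      | CHANNEL of 'ch         (* re-create a channel allocated by the original call *)
      | RETURN  of 'v          (* final entry: the memoized result *)

    (* The memo information for one application is an ordered list of entries. *)
    type ('ch, 'v, 'e) log = ('ch, 'v, 'e) constraint list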
The point of memoization for our purposes is thus to avoid performing the pure computations that execute between these effectful operations. 6.1 Programming Model Our programming model is a simple synchronous message passing dialect of ML sim- ilar to CML. Deciding whether a function application can be avoided based on previously recorded memo information depends upon the value of its arguments, its communication actions, channels it creates, threads it spawns, and the return value it produces. Thus, the memoized result of a call to a function f can be used at a subsequent call if (a) the argument given matches the argument previously supplied; (b) recipients for values sent by f on channels in an earlier memoized call are still available on those channels; (c) values consumed by f on channels in an earlier call are again ready to be sent other threads; (d) channels created in an earlier call have the same actions performed on them, and (e) threads created by f can be spawned with the same arguments supplied in the memoized version. Ordering constraints on all sends and receives performed by the procedure must also be enforced. We call the collection of such constraints for a given function application the 81 constraint log. A successful application of a memoized call yields a new state in which the effects captured within the constraint log have been executed. Thus, the values sent by f are received by waiting recipients, senders on channels from which f expects to receive values propagate these values on those channels, and channels and threads that f is expected to create are created. To avoid making a call, a send action performed within the applied function, for example, will need to be paired with a receive operation executed by some other thread. Unfortunately, there may be no thread currently scheduled that is waiting to receive on this channel. Consider an application that calls a memoized function f which (a) creates a thread T that receives a value on channel c, and (b) sends a value on c computed through values received on other channels that is then consumed by T . To safely use the memoized return value for f nonetheless still requires that T be instantiated, and that communication events executed in the first call can still be satisfied (e.g., the values f previously read on other channels are still available on those channels). Ensuring these actions can succeed may involve an exhaustive exploration of the execution state space to derive a schedule that allows us to consider the call in the context of a global state in which these conditions are satisfied. Because such an exploration may be infeasible in practice, our formulation also supports partial memoization. Rather than requiring global execution to reach a state in which all constraints in a memoized application are satisfied, partial memoization gives implementations the freedom to discharge some fraction of these constraints, performing the rest of the application as normal. 6.2 Motivation As a motivating example, we consider how memoization can be profitably utilized in a concurrent message passing web-server to implement a file cache. In a typical web-server a file cache is a well known optimization that allows the server to avoid re-reading a file from disk when it is requested multiple times. This is typically accomplished by storing the file in memory on the first request and accessing this memory on subsequent requests. 
We can leverage memoization to accomplish this goal by memoizing the file-reading functionality provided by the File Processor in Swerve (see the code given in Figure 5.3). Recall that the File Processor sends the file to the Network Processor as a series of synchronous communications, each sending a chunk of the file. To memoize this functionality so that, on a subsequent request for the file, the memoized version of the function correctly sends the file chunks to the Network Processor, we must capture and store this information. Our formulation of memoization creates constraints for every communication performed within a given function. Thus, for every file chunk that the File Processor sends to the Network Processor, we create a constraint which captures the data that is sent. In this manner, the constraints generated during memoization of the function contain the data corresponding to the contents of the file. We can think of the memo store associated with this function as the file cache for the server.

However, naively memoizing the function in this way has some unfortunate consequences. Recall that the File Processor must poll for timeouts and notify other modules if any errors have been detected. In this setting, typical file caches are not suitable: we would have to construct a file cache that also polled for errors and encapsulated the error-handling protocols the File Processor is responsible for. Instead, we observe that partial memoization provides a unique solution to this problem; namely, it provides the ability to stop the discharging of constraints and to resume normal execution. Since the File Processor polls for timeouts on channels, this action is also captured in a constraint. When a timeout or other error is detected, the constraint is said to have failed (i.e., instead of the channel being empty, it contains a value). In this case normal execution is resumed, thereby triggering the default error-handling protocols.

6.3 Semantics

Our semantics is defined in terms of a core call-by-value functional language with threading and communication primitives. Communication between threads is achieved using synchronous channels.

Syntax:
  P ::= ∅ | P ‖ P | t[e]
  e ∈ Exp ::= v | e(e) | spawn(e) | mkCh() | send(e, e) | recv(e)
  v ∈ Val ::= unit | λx.e | l

Evaluation Contexts:
  E ::= [ ] | E(e) | v(E) | spawn(E) | send(E, e) | send(l, E) | recv(E)

Program States:
  P ∈ Process      t ∈ Tid      x, y ∈ Var      l ∈ Channel
  α, β ∈ Tag = {App, Ch, Spn, Com}

Figure 6.1. Syntax, grammar, evaluation contexts, and domain equations for a concurrent language with synchronous communication.

We first present a simple multi-threaded language with synchronous channel-based communication. We then extend this core language with memoization primitives, and subsequently consider refinements of this semantics to support partial memoization. Although the core language has no support for selective communication, extending it to support choice is straightforward: memoization would simply record the result of the choice, and replay would only be possible if the recorded choice was satisfiable.

In the following, we write α to denote a sequence of zero or more elements, β.α to denote sequence concatenation, and ∅ to denote an empty sequence. Metavariables x and y range over variables, t ranges over thread identifiers, l ranges over channels, v ranges over values, and α, β denote tags that label individual actions in a program's execution. We use P to denote a program state comprised of a collection of threads, E for evaluation contexts, and e for expressions.
We use 84 (F UNCTION A PPLICATION ) App Pkt[E[λ x.e(v)]] 7−→ Pkt[E[e[v/x]]] (C HANNEL ) l fresh Ch Pkt[E[mkCh()]] 7−→ Pkt[E[l]] (S PAWN ) t0 fresh Spn Pkt[E[spawn(λ x.e)]] 7−→ Pkt[E[unit]]kt0 [e[unit/x]] (C OMMUNICATION ) P = P0 kt[E[send(l, v)]]kt0 [E0 [recv(l)]] Com P 7−→ P0 kt[E[unit]]kt0 [E0 [v]] Figure 6.2. Operation semantics for a concurrent language with synchronous communication. P to denote a program state comprised of a collection of threads, E for evaluation contexts, and e for expressions. Our communication model is a message passing system with synchronous send and receive operations. We do not impose a strict ordering of communications on channels; communication actions on the same channel by different threads are paired non-deterministically. To model asynchronous sends, we simply spawn a thread to perform the send. Spawning an expression (that evaluates to a thunk) creates a new thread in which the application of the thunk is performed. 6.3.1 Language The syntax and semantics of the language are given in Figure 6.1. Expressions are either variables, locations that represent channels, λ-abstractions, function applications, 85 thread creation operations, or communication actions that send and receive messages on channels. We do not consider references in this core language as they can be modeled in terms of operations on channels [2]. A thread context t[E[e]] denotes an expression e available for execution by a thread with identifier t within context E. Evaluation is specified via a relation ( 7−→ ) that maps a program state (P) to another program state. A program state is a collection of thread contexts. An evaluation step is marked with a tag that indicates the action performed by α that step. As shorthand, we write P 7−→ P0 to represent the sequence of actions α that transforms P to P0 . The core rules are presented in Figure 6.2. Application (rule (F UNCTION A PPLICA TION )) substitutes the argument value for free occurrences of the parameter in the body of the abstraction, and channel creation (rule (C HANNEL)) results in the creation of a new location that acts as a container for message transmission and reception . A spawn action (rule (S PAWN)), given an expression e that evaluates to a thunk changes the global state to include a new thread in which the thunk is applied. A communication event (rule (C OM MUNICATION )) synchronously pairs a sender attempting to transmit a value along a specific channel in one thread with a receiver waiting on the same channel in another thread. 6.3.2 Partial Memoization The core language presented above provides no facilities for memoization of the functions it executes. To support memoization, we must record, in addition to argument and return values, synchronous communication actions, thread spawns, channel creation etc. as part of the memoized state. These actions define a log of constraints (C) that must be satisfied at subsequent applications of a memoized function, and whose associated effects must be discharged if the constraint is satisfied. To record constraints, we augment our semantics to include a memo store (σ), a map that given a function identifier and an argument value, returns the set of constraints and result value that was previously recorded for a call to that function with that argument. If the set of constraints returned by the memo store is satisfied 86 in the current state (and their effects performed), then the return value can be used and the application elided. 
The memo store contains only one function/value pair for simplicity of the presentation. We can envision extending the memo store to contain multiple memoized versions of a function based on its arguments or constraints. We utilize two thread contexts t[e] and tC [e], the former to indicate that evaluation of terms should be captured within the memo store, and the latter to indicate that previously recorded constraints should be discharged. We elaborate on their use below. The definition of the language augmented with memoization support is given in Figure 6.3 and operational semantics are provided in Figure 6.4 and Figure 6.5. We now define evaluation using a new relation ( =⇒ ) defined over two configurations. In one case, it maps a program state (P) and a memo store (σ) to a new program state and memo store. This configuration defines evaluation that does not leverage memoized information. The second configuration maps a thread id and constraint sequence pair ((t,C)), a program state (P), and a memo store (σ) to a new thread id and constraint sequence pair, program state, and memo store. Transitions of this form are used when evaluation is discharging constraints recorded from a previous memoized application. In addition, a thread state is augmented to hold an additional structure. The memo state (θ) records the function identifier (δ), the argument (v) supplied to the call, the context (E) in which the call is performed, and the sequence of constraints (C) that are built during the evaluation of the application being memoized. Constraints built during a memoized function application define actions that must be satisfied at subsequent call-sites in order to avoid complete re-evaluation of the function body. For a communication action, a constraint records the location being operated upon, the value sent or received, the action performed (R for receive and S for send), and the continuation captured immediately prior to the action being performed. The application resumes evaluation from this point if the corresponding constraint could not be discharged. For a spawn operation, the constraint records the action (Sp), the expression being spawned, and the continuation immediately prior to the spawn. For a channel creation operation (Ch), the constraint records the location of the channel and the continuation immediately prior to 87 the channel creation. For a function application (App), we record the continuation prior to the application. Returns are also modeled as constraints (Rt, v) where v is the return value of the application being memoized. Notice, that we record continuations for all constraints. We do this to simplify our correctness proofs. Consider an application of function f to value v that has been memoized. Since subsequent calls to f with v may not be able to discharge all constraints, we need to record the program points for all communication actions within f that represent potential resumption points from which normal evaluation of the function body proceeds. These continuations are recorded as part of any constraint that can fail (communication actions, and return constraints as described below). But, since the calling contexts at these other call-sites are different than the original, we must be careful to not include them within saved continuations recorded within these constraints. 
Thus, the contexts recorded as part of the saved constraint during memoization only define the continuation of the action up to the return point of the function; the memo state (θ) stores the evaluation context representing the caller’s continuation. This context is restored once the application completes (see rule (R ETURN)). If function f calls function g , then actions performed by g must be satisfiable in any call that attempts to leverage the memoized version of f . Consider the following program fragment: let fun f(...) = ... let fun g(...) = ... send(c,v) ... in ... end in ... f(...) ... end If we encounter an application of f after it has been memoized, then g ’s communication action (i.e., the send of v on c ) must be satisfiable at the point of the application to avoid performing the call. We therefore associate a call stack of constraints (θ) with each thread that defines the constraints seen thus far, requiring the constraints computed for an inner application to be satisfiable for any memoization of an outer one. The propagation 88 of constraints to the memo states of all active calls is given by the operation shown in Figure 6.3. Channels created within a memoized function must be recorded in the constraint sequence for that function (rule (C HANNEL)). Consider a function that creates a channel and subsequently initiates communication on that channel. If a call to this function was memoized, later applications that attempt to avail of memo information must still ensure that the generative effect of creating the channel is not omitted. Function evaluation now associates a label with function evaluation that is used to index the memo store (rule (F UNCTION)). If a new thread is spawned within a memoized application, a spawn constraint is added to the memo state, and a new global state is created that starts memoization of the actions performed by the newly spawned thread (rule (S PAWN)). A communication action performed by two functions currently being memoized are also appropriately recorded in the corresponding memo state of the threads that are executing these functions (rule (C OMMU NICATION )). When a memoized application completes, its constraints are recorded in the memo store (rule (R ETURN)). When a function f is applied to argument v, and there exists no previous invocation of f to v recorded in the memo store, the function’s effects are tracked and recorded (rule (A PPLICATION)). Until an application of a function being memoized is complete, the constraints induced by its evaluation are not immediately added to the memo store. Instead, they are maintained as part of the memo state (θ) associated with the thread in which the application occurs. The most interesting rule is the one that deals with determining how much of an application of a memoized function can be elided (rule (M EMO A PPLICATION)). If an application of function f with argument v has been recorded in the memo store, then the application can be potentially avoided. If not, its evaluation is memoized by (rule (A PPLICATION)). If a memoized call is applied, we must examine the set of associated constraints that can be discharged. To do so, we employ an auxiliary relation ℑ defined in Figure 6.6. Abstractly, ℑ checks the global state to determine which communication, channel creation, and spawn creation constraints (the possible effectful actions in our language) can be satisfied, and 89 returns a set of failed constraints, representing those actions that could not be satisfied. 
The thread context (tC [e]) is used to signal the utilization of memo information. The failed constraints are added to the original thread context. The rule (M EMO A PPLICATION) yields a new global configuration whose thread id and constraint sequence ((t,C)) corresponds to the constraints satisfiable in the current global state (defined as C00 ) for thread t as defined by ℑ. These constraints, when discharged, will leave the thread performing the memoized call in a new state in which the evaluation of the call is the expression associated with the first failed constraint returned by ℑ. As we describe below in Sec 6.3.3, there is always at least one such constraint, namely Rt , the return constraint, that holds the return value of the memoized call. We also introduce a rule to allow the completion of memo information use (rule (E ND M EMO)). The rule installs the continuation of the first currently unsatisfied constraint; no further constraints are subsequently examined. In this formulation, the other failed constraints are simply discarded. We present an extension of this semantics in Section. 6.3.6 that make use of them. 6.3.3 Constraint Matching The constraints built as a result of evaluating these rules are discharged by the rules shown in Figure 6.7. Each rule in Figure 6.7 is defined with respect to a thread id and constraint sequence. Thus, at any given point in its execution, a thread is either building up memo constraints (i.e., the thread is of the form t[e]) within an application for subsequent calls to utilize, or attempting to discharge these constraints (i.e., the thread is of the tC [e]) for applications indexed in the memo store. The function ℑ leverages the evaluation rules defined in Figure 6.7 by examining the global state and determining which constraints can be discharged, except for the return constraint. ℑ takes a constraint set (C) and a program state (P) and returns a sequence of unmatchable constraints (C0 ). Send and receive constraints are matched with threads blocked in the global program state on the opposite communication action. Once a thread 90 S YNTAX : P ::= 0/ | PkP | hθ, t[e]i | hθ, tC [e]i v ∈ Val ::= unit | λδ x.e | l E ∈ Context C ONSTRAINT A DDITION : θ0 = {(δ, v, E,C.C)|(δ, v, E,C) ∈ θ} θ,C θ0 P ROGRAM S TATES : δ ∈ MemoId c ∈ FailableConstraint= ({R, S} × Loc × Val) + Rt C ∈ Constraint = (FailableConstraint × Exp)+ ((Sp × Exp) × Exp) + ((Ch × Location) × Exp)+ ((App) × Exp) σ ∈ MemoStore = MemoId × Val → Constraint∗ θ ∈ MemoState = MemoId × Val × Context × Constraint∗ α, β ∈ Tag = {Ch, Spn, Com, Fun, App, Ret, MCom, MCh, MSp, MemS, MemE, MemR, MemP} Figure 6.3. Memoization Semantics – A concurrent language supporting memoization of synchronous communication and dynamic thread creation. has been matched with a constraint it is no longer a candidate for future communication since its communication action is consumed by the constraint. This guarantees that the candidate function will communicate at most once with each thread in the global state. Although a thread may in fact be able to communicate more than once with the candidate function, determining this requires arbitrary look ahead and is infeasible in practice. We discuss the implications of this restriction in Section 6.3.5. 
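As a rough illustration of how such a satisfiability check might proceed (independent of the actual definition of ℑ in Figure 6.6), the following SML sketch walks a recorded constraint sequence against a simplified view of the blocked threads in the global state. All names are ours, blocked threads are modelled only by the communication they offer, and constraints are reduced to the information needed for matching.

    (* Simplified constraints: only the information needed for matching. *)
    datatype constraint =
        RecvC of int * string   (* must receive the given value on the given channel *)
      | SendC of int * string   (* must send the given value on the given channel    *)
      | RetC  of string         (* return value of the memoized call; never consumed *)
      | EffC                    (* channel creation, spawn, nested call: never fail  *)

    (* Blocked threads in the global state, reduced to their pending offers. *)
    datatype offer =
        OfferSend of int * string  (* a thread blocked sending the value on the channel *)
      | OfferRecv of int           (* a thread blocked receiving on the channel         *)

    (* Remove the first element satisfying p, if any. *)
    fun takeFirst p [] = NONE
      | takeFirst p (x :: xs) =
          if p x then SOME xs
          else Option.map (fn xs' => x :: xs') (takeFirst p xs)

    (* Return the suffix of constraints starting at the first one that cannot
       be discharged against the current offers.  Each matched offer is
       consumed, so a blocked thread pairs with at most one constraint. *)
    fun unsatisfied ([], _) = []
      | unsatisfied (cs as (RetC _ :: _), _) = cs
      | unsatisfied (cs as (RecvC (l, v) :: rest), offers) =
          (case takeFirst (fn OfferSend (l', v') => l = l' andalso v = v'
                            | _ => false) offers of
               SOME offers' => unsatisfied (rest, offers')
             | NONE => cs)
      | unsatisfied (cs as (SendC (l, _) :: rest), offers) =
          (case takeFirst (fn OfferRecv l' => l = l' | _ => false) offers of
               SOME offers' => unsatisfied (rest, offers')
             | NONE => cs)
      | unsatisfied (EffC :: rest, offers) = unsatisfied (rest, offers)

Note that a receive constraint demands a blocked sender offering exactly the recorded value, whereas a send constraint is satisfied by any receiver blocked on the channel, mirroring the discussion below.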
91 (C HANNEL ) θ, ((Ch, l), E[mkCh()]) θ0 l fresh Ch Pkhθ, t[E[mkCh()]]i, σ =⇒ Pkhθ0 , t[E[l]]i, σ (F UNCTION ) δ fresh Fun Pkhθ, t[E[λ x.e]]i, σ =⇒ Pkhθ, t[E[λδ x.e]]i, σ (S PAWN ) t0 fresh θ, ((Sp, λδ x.e(unit)), E[spawn(λδ x.e)]) θ0 / t0 [e[unit/x]]i tk = hθ0 , t[E[unit]]i ts = h0, Spn Pkhθ, t[E[spawn(λδ x.e)]]i, σ =⇒ Pktk kts , σ (C OMMUNICATION ) P = P0 khθ, t[E[send(l, v)]]ikhθ0 , t0 [E0 [recv(l)]]i θ, ((S, l, v), E[send(l, v)]) θ00 θ0 , ((R, l, v), E0 [recv(l)]) θ000 Com P, σ =⇒ P0 khθ00 , t[E[unit]]ikhθ000 , t0 [E0 [v]]i, σ Figure 6.4. Memoization Semantics – A concurrent language supporting memoization of synchronous communication and dynamic thread creation. A spawn constraint (rule (M EMO S PAWN)) is always satisfied, and leads to the creation of a new thread of control. Observe that the application evaluated by the new thread is now a candidate for memoization if the thunk was previously applied and its result is recorded in the memo store. A channel constraint of the form ((Ch,l), E[e]) (rule (M EMO C HANNEL)) creates a new channel location l0 , and replaces all occurrences of l found in the remaining constraint sequence for this thread with l0 . The channel location may be embedded within send and receive constraints, either as the target of the operation, or as the argument value being sent or received. Thus, discharging a channel constraint ensures that the effect of creating a new channel performed within an earlier memoized call is preserved on subsequent applications. The renaming operation ensures that later send and receive constraints refer 92 (R ETURN ) θ = (δ, v, E,C) Ret Pkhθ.θ, t[v0 ]i, σ =⇒ Pkhθ, t[E[v0 ]]i, σ[(δ, v) 7→ C.(Rt, v0 )] (A PPLICATION ) θ, ((App), E[λδ x.e(v)]) θ0 / (δ, v) 6∈ Dom(σ) θ = (δ, v, E, 0) App Pkhθ, t[E[λδ x.e(v)]]i, σ =⇒ Pkhθ.θ0 , t[e[v/x]]i, σ (M EMO S TART ) (δ, v) ∈ Dom(σ) σ(δ, v) = C ℑ(C, P) = C0 C = C00 .C0 MemS Pkhθ, t[E[λδ x.e(v)]]i, σ =⇒ (t,C00 ), Pkhθ, tc0 [E[λδ x.e(v)]]i, σ (M EMO E ND ) C = (c, e0 ) MemE / Pkhθ, tC.C [E[λδ x.e(v)]]i, σ =⇒ Pkhθ, t[E[e0 ]]i, σ (t, 0), Figure 6.5. Memoization Semantics – A concurrent language supporting memoization of synchronous communication and dynamic thread creation (continued). to the new channel location. Both channel creation and thread creation actions never fail – they modify the global state with a new thread and channel, respectfully, but impose no pre-conditions on the state in order for these actions to succeed. MCom There are two communication constraint matching rules ( =⇒ ), both of which may indeed fail. If the current constraint expects to receive value v on channel l , and there exists a thread able to send v on l , evaluation proceeds to a state in which the communication succeeds (the receiving thread now evaluates in a context in which the receipt of the value has occurred), and the constraint is removed from the set of constraints that need to be 93 ℑ(((S, l, v), e).C, Pkhθ, t[E[recv(l)]]i) = ℑ(C, P) ℑ(((R, l, v), e).C, Pkhθ, t[E[send(l, v)]]i) = ℑ(C, P) ℑ((Rt, v), P) = (Rt, v) ℑ(((Ch, l), e).C, P) = ℑ(C, P) ℑ(((Sp, e0 ), e).C, P) = ℑ(C, P) ℑ(((App), e).C, P) = ℑ(C, P) ℑ(C, P) = C, otherwise Figure 6.6. Memoization Semantics – The function ℑ yields the set of constraints C which are not satisfiable in program state P. matched (rule (MR ECV)). Note also that the sender records the fact that a communication with a matching receive took place in the thread’s memo state, and the receiver does likewise. 
Any memoization of the sender must consider the receive action that synchronized with the send, and the application in which the memoized call is being examined must record the successful discharge of the receive action. In this way, the semantics permits consideration of multiple nested memoization actions. If the current constraint expects to send a value v on channel l , and there exists a thread waiting on l , the constraint is also satisfied (rule (M EMO S END)). A send operation can match with any waiting receive action on that channel. The semantics of synchronous communication allows us the freedom to consider pairings of sends with receives other than the one it communicated with in the original memoized execution. This is because a receive action places no restriction on either the value it reads, or the specific sender that provides the value. If these conditions do not hold, the constraint fails. Observe that there is no evaluation rule for the Rt constraint that can consume it. This constraint contains the return value of the memoized function (see rule (R ETURN)). If all other constraints have been satisfied, it is this return value that replaces the call in the current context (see the consequent of rule (M EMO A PPLICATION)). 94 (M EMO C HANNEL ) C = ((Ch, l), ) l0 fresh C00 = C[l0 /l] θ,C θ0 MCh (t,C.C), Pkhθ, tC0 [E[λδ x.e(v)]]i, σ =⇒ (t,C00 ), Pkhθ0 , tC0 [E[λδ x.e(v)]]i, σ (M EMO A PPLICATION ) C = ((App), ) θ,C θ0 MApp (t,C.C), Pkhθ, tC0 [E[λδ x.e(v)]]i, σ =⇒ (t,C), Pkhθ0 , tC0 [E[λδ x.e(v)]]i, σ (M EMO S PAWN ) C = ((Sp, e0 ), ) t0 fresh θ,C θ0 MSp (t,C.C), Pkhθ, tC0 [E[E[λδ x.e(v)]]]i, σ =⇒ / t0 [e0 ]i, σ (t,C), Pkhθ0 , tC0 [E[λδ x.e(v)]]ikh0, (M EMO R ECEIVE ) C = ((R, l, v), ) ts = hθ, t[E[send(l, v)]]i tr = hθ0 , t0C0 [E0 [λδ x.e(v)]]i θ0 ,C θ000 θ, ((S, l, v), E[send(l, v)]) θ00 ts0 = hθ00 , t[E[unit]]i tr0 = hθ000 , t0C0 [E0 [λδ x.e(v)]]i MCom (t0 ,C.C), Pkts ktr , σ =⇒ (t0 ,C), Pkts0 ktr0 , σ (M EMO S END ) C = ((S, l, v), ) ts = hθ0 , t0C0 [E[λδ x.e(v)]]i tr = hθ, t[E0 [recv(l)]]i θ0 ,C θ000 θ, ((R, l, v), E0 [recv(l)]) θ00 ts0 = hθ000 , tC0 0 [E[λδ x.e(v)]]i tr0 = hθ00 , t[E0 [v]]i MCom (t0 ,C.C), Pkts ktr , σ =⇒ (t0 ,C), Pkts0 ktr0 , σ Figure 6.7. Memoization Semantics – Constraint matching is defined by four rules. Communication constraints are matched with threads performing the opposite communication action of the constraint. 95 let val (c1,c2) = (channel(), channel()) fun f () = (send(c1,v1); ...; recv(c2)) fun g () = (recv(c1); ...; recv(c2); g()) fun h () = (...; send(c2,v2); send(c2,v3); h()); fun i () = (recv(c2); i()) in spawn(g); spawn(h); spawn(i); f(); ...; send(c2, v4); ...; f() end Figure 6.8. Determining if an application can fully leverage memo information may require examining an arbitrary number of possible thread interleavings. c1 v1 g() h() i() c1 c2 c2 f() v2 c2 c2 c2 v3 c2 v4 Figure 6.9. The communication pattern of the code in Figure 6.8. Circles represent operations on channels. Gray circles are sends and white circles are receives. Double arrows represent communications that are captured as constraints during memoization. 6.3.4 Example The program fragment shown in Figure 6.8 applies functions f, g, h and i. The calls to g, h, and i are evaluated within separate threads of control, while the applications of f 96 takes place in the original thread. These different threads communicate with one other over shared channels c1 and c2. The communication pattern is depicted in Figure 6.9. Separate threads of control are shown as rectangles. 
Communications actions are represented as circles; gray for send actions and white for receives. The channel on which the communication action takes place is annotated on the circle and the value which is sent on the arrow. Double arrows edges represent communication actions for which constraints are generated. To determine whether the second call to f can be elided we must examine the constraints that would be added to the thread state of the threads in which these functions are applied. First, spawn constraints would be added to the main thread for the threads executing g, h, and i. Second, a send constraint followed by a receive constraint, modeling the exchange of values v1 and either v2 or v3 on channels c1 and c2 would be included as well. For the sake of discussion, assume that the send of v2 by h was consumed by g and the send of v3 was paired with the receive in f when f() was originally executed. Consider the memoizability constraints built during the first call to f . The send constraint on f ’s application can be satisfied by matching it with the corresponding receive constraint associated with the application of g ; observe g() loops forever, consuming values on channels c1 and c2 . The receive constraint associated with f can be discharged if g receives the first send by h , and f receives the second. A schedule that orders the execution of f and g in this way, and additionally pairs i with a send operation on c2 in the let -body would allow the second call to f to fully leverage the memo information recorded in the first. Doing so would enable eliding the pure computation in f (abstracted by . . .) in its definition, performing only the effects defined by the communication actions (i.e., the send of v1 on c1 , and the receipt of v3 on c2 ). 6.3.5 Issues As this example illustrates, utilizing memo information completely may require forcing a schedule that pairs communication actions in a specific way, making a solution that requires all constraints to be satisfied infeasible in practice. Hence, rule (M EMO A PPLICA - 97 let fun f() = (send(c,1); send(c,2)) fun g() = (recv(c);recv(c)) in spawn(g); f(); ...; spawn(g); f() end Figure 6.10. The second application of f can only be partially memoized up to the second send since only the first receive made by g is blocked in the global state. TION ) allows evaluation to continue within an application that has already been memoized once a constraint cannot be matched. As a result, if during the second call to f , i indeed received v3 from h , the constraint associated with the recv operation in f would not be satisfied, and the rules would obligate the call to block on the recv , waiting for h or the main body to send a value on c2 . Nonetheless, the semantics as currently defined does have limitations. For example, the function ℑ does not examine future actions of threads and thus can only match a constraint with a thread if that thread is able to match the constraint in the current state. Hence, the rules do not allow leveraging memoization information for function calls involved in a producer/consumer relation. Consider the simple example given in Figure 6.10. The second application of f can take advantage of memoized information only for the first send on channel c. This is because the global state in which constraints are checked only has the first recv made by g blocked on the channel. The second recv only occurs if the first is successfully paired with a corresponding send. 
Although in this simple example the second recv is guaranteed to occur, consider if g contained a branch which determined if g consumed a second value from the channel c. In general, constraints can only be matched against the current communication action of a thread. Secondly, exploiting memoization may lead to starvation since subsequent applications of the memoized call will be matched based on the constraints supplied by the initial call. Consider the example shown in Figure 6.11. If the initial application of f pairs with the send performed by g, subsequent calls to f that use this memoized version will also pair 98 let fun f() = recv(c) fun g() = send(c,1);g() fun h() = send(c,2);h() in spawn(g); spawn(h); f(); ...; f() end Figure 6.11. Memoization of the function f can lead to the starvation of either g or h depending on which value the original application of f consumed from channel c. with g since h produces different values. This leads to starvation of h. Although this behavior is certainly legal, one might reasonably expect a scheduler to interleave the sends of g and h. 6.3.6 Schedule Aware Partial Memoization To address the limitations in the previous section, we define two new symmetric rules to pause and resume memoization (see Figure 6.12). Pausing memoization (rule (PAUSE M EMO)) is similar to the rule (M EMO E ND) in Figure 6.3 except the failed constraints are not discarded and the thread context is not given an expression to evaluate. Instead the thread retains its log of currently unsatisfied constraints which prevents its further evaluation. This effectively pauses the evaluation of this thread but allows regular threads to continue normal evaluation. Notice we only pause a thread utilizing memo information once it has correctly discharged its constraints. We could envision an alternative definition which pauses non-deterministically on any constraint and moves the non-discharged constraints back to the thread context which holds unsatisfied constraints. For the sake of simplicity we opted for greedy semantics which favors the utilization of memoization. We can resume the paused thread, enabling it to discharge other constraints using the rule (R ESUME M EMO), which begins constraint discharge anew for a paused thread. Thus, if a thread context has a set of constraints that were not previously satisfied and evaluation is not utilizing memo information, we can once again apply our ℑ function. Note that the use 99 of memo information can be ended at any time (rule (M EMO E ND) can be applied instead of (PAUSE M EMO)). We can, therefore, change a thread in a paused state into a bona fide thread by first applying (R ESUME M EMO). If ℑ does not indicate we can discharge any additional constraints, we simple apply the rule (E ND M EMO). We also extend our evaluation rules to allow constraints to be matched against other constraints (rule (MC OM)). This is accomplished by matching constraints between two paused threads. Of course, it is possible that two threads, both of which were paused on a constraint that was not satisfiable, may nonetheless satisfy one another. This happens when one thread is paused on a send constraint and another on a receive constraint both of which match on the channel and value. In this case, the constraints on both sender and receiver can be safely discharged. This allows calls which attempt to use previously memoized constraints to match against constraints extant in other calls that also attempt to exploit memoized state. 
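To make the effect of rule (Memo Communication) concrete, consider the following hypothetical fragment, written in the style of the earlier examples; it is ours, not one of the benchmark programs, and assumes producer and consumer are candidates for memoization.

    let val c = channel()
        fun producer () = send(c, 1)
        fun consumer () = ignore(recv(c))
    in
      spawn(producer); spawn(consumer);   (* original applications: constraints are recorded   *)
      spawn(producer); spawn(consumer)    (* later applications: both attempt to use memo data *)
    end

Assume both functions are memoized during the first round. In the second round, the send constraint recorded for producer and the receive constraint recorded for consumer may each fail against the global state if no ordinary thread happens to be blocked on c, leaving both calls paused. Rule (Memo Communication) then allows the two paused calls to discharge each other's constraints directly, since they match on both the channel and the value, after which each can complete using its recorded return constraint.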
6.4 Soundness We can relate the states produced by memoized evaluation to the states constructed by the non-memoizing evaluator. To do so, we first define a transformation function T that transforms process states (and terms) defined under memo evaluation to process states (and terms) defined under non-memoized evaluation (see Figure 6.13). Since memo evaluation stores evaluation contexts in θ they must be extracted and restored. This is done in the opposite order that they were pushed onto the stack θ since the top represents the most recent call. Functions currently in the process of utilizing memo information must be resumed from the expression captured in the first non-discharged constraint. Similarly threads which are currently paused must also be resumed. Our safety theorem ensures memoization does not yield states which could not be realized under non-memoized evaluation: Theorem 6.4.1 (Safety) If β Pkhθ, t[E[λδ x.e(v)]]i, 0/ =⇒ P0 khθ0 , t[[v0 ]]i, σ 100 (PAUSE M EMO ) MemP / P, σ =⇒ P, σ (t, 0), (R ESUME M EMO ) ℑ(C, P) = C0 C = C00 .C0 MemR Pkhθ, tC [E[λδ x.e(v)]]i, σ =⇒ (t,C00 ), Pkhθ, tC0 [E[λδ x.e(v)]]i, σ (M EMO C OMMUNICATION ) C = ((S, l, v), ) C0 = ((R, l, v), ) ts = hθ, tC.C [λδ x.e(v)]i tr = hθ0 , t0C0 .C0 [λδ1 x.e0 (v0 )]i θ0 ,C0 θ000 θ,C θ00 ts0 = hθ00 , tC [λδ x.e(v)]i tr0 = hθ000 , t0C0 [λδ1 x.e0 (v0 )]i MCom Pkts ktr , σ =⇒ Pkts0 ktr0 , σ Figure 6.12. Memoization Semantics – Schedule Aware Partial Memoization. then there exists α1 , . . . , αn ∈ {App, Ch, Spn, Com} s.t. α n 1 T (Pkhθ, t[E[λδ x.e(v)]]i) 7−→ . . . 7−→ T (P0 khθ0 , t[[v0 ]]i) α 2 / is a valid memo We first introduce a corollary that shows that the empty memo store (0) store. We then show that every β step taken under memoization corresponds to zero or one step under non-memoized evaluation: zero steps for returns and memo actions (e.g. M EM S, M EM E, M EM P, and M EM R), and one step for core evaluation, and effectful actions (e.g., MC H, MA PP, MS PAWN, MR ECV, MS END, and MC OM). Although a function which is utilizing memoized information does not execute pure code (rule (A PP) under 7−→ ), it does, however, capture constraints for the elided applications. The expressions captured in 101 T ((t,C), Pkhθ, tC0 [E[λδ x.e(v)]]i) = T (Pkhθ, tC.C0 [E[λδ x.e(v)]]i) T ((P1 kP2 )) = T (P1 )kT (P2 ) T (hθ, t[e]i) = t[T (En [. . . E1 [e]])] θi = (δi , vi , Ei ,Ci ) ∈ θ T (hθ, t( ,e0 ).C [e]i) = t[T (En [. . . E1 [e0 ]])] θi = (δi , vi , Ei ,Ci ) ∈ θ T (λδ x.e) = λ x.e T (e1 (e2 )) = T (e1 )(T (e2 )) T (spawn(e)) = spawn(T (e)) T (send(e1 , e2 )) = send(T (e1 ), T (e2 )) T (recv(e)) = recv(T (e)) otherwise e Figure 6.13. T defines an erasure property on program states. The first four rules remove memo information and restore evaluation contexts. these non-failing constraints allow us to construct a corresponding sequence of evaluation steps under 7−→ . / is a valid memostore Corollary 1: the empty memostore, 0, Proof by case analysis of the rules: Case (Application): We have: App Pkhθ, t[E[λδ x.e(v)]]i, 0/ =⇒ Pkhθ.θ0 , t[e[v/x]]i, 0/ where: / (δ, v) 6∈ Dom(0) Case (Start Memo): We have: MemS Pkhθ, t[E[λδ x.e(v)]]i, σ =⇒ (t,C00 ), Pkhθ, tc0 [E[λδ x.e(v)]]i, σ where: (δ, v) ∈ Dom(σ) 102 Thus we know that the rule (Start Memo) cannot be applied with an empty memo store. Case (Return): Based on the rule Return we know: Ret Pkhθ.θ, t[v0 ]i, σ =⇒ Pkhθ, t[E[v0 ]]i, σ[(δ, v) 7→ C.(Rt, v0 )] / We know: 0[(δ, v) 7→ C.(Rt, v0 )]. Other Cases: All other rules do not modify σ nor do they leverage it. 
2 Lemma 1: If β Pkhθ, t[E[λδ x.e(v)]]i, 0/ =⇒ P0 khθ0 , t[E[e0 ]]i, σ then there exists α1 , . . . , αn ∈ {App, Ch, Spn, Com} s.t. α n 1 T (Pkhθ, t[E[λδ x.e(v)]]i) 7−→ . . . 7−→ T (P0 khθ0 , t[E[e0 ]]i) α 2 Proof by induction on the length of β. Base case is sequences of length one. Such sequences correspond to functions which simply return a value. By definition of rule (Application) and Corollary 1 we know: App Pkhθ, t[E[λδ x.e(v)]]i, 0/ =⇒ Pkhθ.θ0 , t[e[v/x]]i, 0/ / where: θ, ((App), E[λδ x.e(v)]) θ0 and: (δ, v) 6∈ Dom(σ) θ = (δ, v, E, 0) By the transform T we know: T (Pkhθ, t[E[λδ x.e(v)]]i) = T (P)kt[E[λ x.e]] / = T (P)kt[E[e[v/x]]] T (Pkhθ.θ0 , t[e[v/x]]i, 0) 103 We know by the constraint capture rules (Channel), (Spawn), (Application), (Communication), and (Return): App T (P)kt[E[λ x.e(v)] 7−→ T (P)kt[E[e[v/x]]] We also know that e can be a value, namely some v0 and that v0 [v/x] is simply v0 or e can be x and that e[v/x] is simply v. Inductive Case: Assume the Lemma holds for β sequences of length n, we show it holds for sequences of length n + 1. By the inductive hypothesis we know: β Pkhθ, t[E[λδ x.e(v)]]i, 0/ =⇒ P00 , σ We examine transitions possible for the nth state (P00 , σ) by a case analysis on the n + 1 transition βn+1 . Case (Channel): If βn+1 = Ch by rule (Channel) we know: Ch Pkhθ, t0 [E[mkCh()]]i, σ =⇒ Pkhθ0 , t0 [E[l]]i, σ where: l fresh By the transform T we know: T (Pkhθ, t0 [E[mkCh()]]i) = T (P)kt0 [E[mkCh()]] T (Pkhθ0 , t0 [E[l]]i) = T (P)kt0 [E[l]] By the structure of the rules we know: Ch Pkt0 [E[mkCh()]] 7−→ Pkt0 [E[l]] 104 Case (Function): If βn+1 = Fun by rule (Function) we know: Fun Pkhθ, t0 [E[λ x.e]]i, σ =⇒ Pkhθ, t0 [E[λδ x.e]]i, σ By the transform T we know: T (Pkhθ, t0 [E[λ x.e]]i) = T (P)kt0 [E[λ x.e]] T (Pkhθ, t0 [E[λδ x.e]]i) = T (P)kt0 [E[λ x.e]] Case (Application): If βn+1 = App by rule (Application) we know: App Pkhθ, t0 [E[λδ x.e(v)]]i, σ =⇒ Pkhθ.θ, t0 [e[v/x]]i, σ By the transform T we know: T (Pkhθ, t0 [E[λδ x.e(v)]]i) = T (P)kt0 [E[λ x.e(v)]] T (Pkhθ.θ, t0 [e[v/x]]i) = T (P)kt0 [E[e[v/x]]] By the structure of the rules we know: App Pkt0 [E[λ x.e(v)]] 7−→ Pkt0 [E[e[v/x]]] Case (Return): If βn+1 = Ret by rule (Return) we know: Ret Pkhθ.θ, t0 [v0 ]i, σ =⇒ Pkhθ, t0 [E[v0 ]]i, σ[(δ, v) 7→ C.(Rt, v0 )] By the transform T we know: T (Pkhθ.θ, t0 [v0 ]i) = T (P)kt0 [E[v0 ]] T (Pkhθ, t0 [E[v0 ]]i) = T (P)kt0 [E[v0 ]] 105 Case (Spawn): If βn+1 = Spn by rule (Spawn) we know: Spn Pkhθ, t0 [E[spawn(λδ x.e)]]i, σ =⇒ Pktk kts , σ where: / t00 [e[unit/x]]i t00 fresh tk = hθ0 , t0 [E[unit]]i ts = h0, By the transform T we know: T (Pkhθ, t0 [E[spawn(λδ x.e)]]i) = T (P)kt0 [E[spawn(λ x.e)]] T (Pktk kts ) = T (P)kt0 [E[unit]]kt00 [e[unit/x]] By the structure of the rules we know: Spn Pkt0 [E[spawn(λ x.e)]] 7−→ Pkt0 [E[unit]]kt00 [e[unit/x]] Case (Communication): If βn+1 = Com by rule (Comm) we know: Com P, σ =⇒ P0 khθ00 , t0 [E[unit]]ikhθ000 , t00 [E0 [v]]i, σ where: P = P0 khθ, t0 [E[send(l, v)]]ikhθ0 , t00 [E0 [recv(l)]]i By the transform T we know: T (P) = T (P0 )kt0 [E[send(l, v)]]kt00 [E0 [recv(l)]] T (P0 khθ, t0 [E[send(l, v)]]ikhθ0 , t00 [E0 [recv(l)]]i) = T (P0 )kt0 [E[unit]]kt00 [E0 [v]] By the structure of the rules we know: Com P0 kt0 [E[send(l, v)]]kt00 [E0 [recv(l)]] 7−→ P0 kt0 [E[unit]]kt00 [E0 [v]] 106 Case (Memo Channel): If βn+1 = MCh by rule (Memo Channel) we know: MCh (t0 ,C.C), Pkhθ, t0C0 [E[λδ x.e(v)]]i, σ =⇒ (t0 ,C00 ), Pkhθ0 , t0C0 [E[λδ x.e(v)]]i, σ where: C = ((Ch, l), ) and l0 fresh By the structure of the rules we know C was generated by 
rule (Channel). We know l0 fresh and l 0 is substituted for l in the remaining constraints. C0 = C[l0 /l] θ,C θ0 By the transform T we know: T ((t0 ,C.C), Pkhθ, t0C0 [E[λδ x.e(v)]]i) = T (P)kt0 [E[E0 [mkCh()]]] T ((t0 ,C0 ), Pkhθ0 , t0C0 [E[λδ x.e(v)]]i) = T (P)kt0 [E[e0 ]] where by the structure of the rules: C = (( ), E0 [mkCh()]) C0 = (( ), e0 ).C000 We know by the constraint capture rules (Channel), (Spawn), (Application), (Communication), and (Return): e0 = E0 [l0 ] therefore: Ch T (P)kt0 [E[E0 [mkCh()]]] 7−→ T (P)kt0 [E[E0 [l]]] Case (Memo Application): If βn+1 = MApp by rule (Memo Application) we know: MApp (t0 ,C.C), Pkhθ, t0C0 [E[λδ x.e(v)]]i, σ =⇒ (t0 ,C), Pkhθ0 , t0C0 [E[λδ x.e(v)]]i, σ where: C = ((App), E0 [λδ x0 .e0 (v0 )]) By the structure of the rules we know C was generated by (Application). By the transform T we know: T ((t,C.C), Pkhθ, tC0 [E[λδ x.e(v)]]i, σ) = T (P)kt0 [E[E0 [λ x0 .e0 (v0 )]]] 107 T ((t,C), Pkhθ0 , tC0 [E[λδ x.e(v)]]i, σ) = T (P)kt0 [E[e00 ]] where by the structure of the rules: C = (( ), e00 ).C00 We know by the constraint capture rules (Channel), (Spawn), (Application), (Communication), and (Return): e00 = E0 [e0 [v0 /x0 ]] therefore: App T (P)kt0 [E[E0 [λ x0 .e0 (v0 )]]] 7−→ T (P)kt0 [E[E0 [e0 [v0 /x0 ]]]] Case (Memo Spawn): If βn+1 = Com by rule (Memo Spawn) we know: MSp (t0 ,C.C), Pkhθ, tC0 0 [E[λδ x.e(v)]]i, σ =⇒ / t00 [e0 [unit/x]]i, σ (t0 ,C), Pkhθ0 , tC0 0 [E[λδ x.e(v)]]ikh0, where: C = ((Sp, e0 ), E0 [spawn(λδ x.e0 )]) and t00 fresh By the structure of the rules we know C was generated by (Spawn). By the transform T we know: T ((t0 ,C.C), Pkhθ, tC0 0 [E[λδ x.e(v)]]i) = T (P)kt0 [E[E0 [spawn(λδ x.e)]]] / t00 [e0 ]i) = T (P)kt0 [E[e00 ]]kt00 [e0 [unit/x]] T ((t0 ,C), Pkhθ0 , t0C0 [E[λδ x.e(v)]]ikh0, where by the structure of the rules: C = (( ), e00 ).C00 We know by the constraint capture rules (Channel), (Spawn), (Application), (Communication), and (Return): e00 = E0 [unit] therefore: Spn T (P)kt0 [E[E0 [spawn(λδ x.e0 )]]] 7−→ T (P)kt0 [E[E0 [unit]]]kt00 [e0 [unit/x]] Case (Memo Receive): If βn+1 = Com by rule (Memo Receive) we know: MCom (t0 ,C.C), Pkts ktr , σ =⇒ (t0 ,C), Pkts0 ktr0 , σ where: ts = hθ, t00 [E[send(l, v)]]i tr = hθ0 , t0C0 [E0 [λδ x.e(v)]]i 108 ts0 = hθ00 , t00 [E[unit]]i tr0 = hθ000 , t0C0 [E0 [λδ x.e(v)]]i By the structure of the rules we know C was generated by (Communication). By the transform T we know: T ((t0 ,C.C), Pkts ktr ) = T (P)kt00 [E[send(l, v)]]kt0 [E0 [E00 [recv(l)]]] T ((t0 ,C), Pkts0 ktr0 ) = T (P)kt00 [E[unit]]kt0 [E0 [e0 ] where by the structure of the rules: C = ((R, l, v), E00 [recv(l)]) C = (( ), e0 ).C00 We know by the constraint capture rules (Channel), (Spawn), (Application), (Communication), and (Return): e0 = E00 [v] therefore: Com T (P)kt00 [E[send(l, v)]]kt0 [E0 [E00 [recv(l)]]] 7−→ T (P)kt00 [E[unit]]kt0 [E0 [E00 [v]]] Case (Memo Send): If βn+1 = Com by rule (Memo Send) we know: MCom (t0 ,C.C), Pkts ktr , σ =⇒ (t0 ,C), Pkts0 ktr0 , σ where: ts = hθ0 , tC0 0 [E[λδ x.e(v)]]i tr = hθ, t00 [E0 [recv(l)]]i ts0 = hθ000 , tC0 0 [E[λδ x.e(v)]]i tr0 = hθ00 , t00 [E0 [v]]i By the structure of the rules we know C was generated by (Communication). 
By the transform T we know: T ((t0 ,C.C), Pkts ktr ) = T (P)kt0 [E[E00 [send(l, v)]]]kt00 [E0 [recv(l)]] T ((t0 ,C), Pkts0 ktr0 ) = T (P)kt0 [E[e0 ]kt00 [E0 [v]] where by the structure of the rules: C = (( ), e) C0 = (( ), e0 ).C00 We know by the constraint capture rules (Channel), (Spawn), (Application), (Communication), and (Return): e0 = E00 [unit] therefore: Com T (P)kt0 [E[E00 [send(l, v)]]kt00 [E0 [recv(l)]] 7−→ T (P)kt0 [E[E00 [unit]]]kt00 [E0 [v]] 109 Case (Memo Start): If βn+1 = MemS by rule (Memo Start) we know: MemS Pkhθ, t[E[λδ x.e(v)]]i, σ =⇒ (t,C00 ), Pkhθ, tc0 [E[λδ x.e(v)]]i, σ where: (δ, v) ∈ Dom(σ) σ(δ, v) = C ℑ(C, P) = C0 C = C00 .C0 C00 = C.C000 C = (( ), e0 ) By the transform T we know: T (Pkhθ, t[E[λδ x.e(v)]]i) = T (P)kt[E[λ x.e(v)]] T ((t,C00 ), Pkhθ, tc0 [E[λδ x.e(v)]]i) = T (P)kt[E[e0 ]] We know by the constraint capture rules (Channel), (Spawn), (Application), (Communication), and (Return): e0 = e[v/x] App T (P)kt[E[λ x.e(v)]] 7−→ T (P)kt[E[e[v/x]]] Case (Memo End): If βn+1 = MemE then by our I.H. we know MemS ∈ β and thus by rule (Memo End) we know: MemE 0 / Pkhθ, tC.C (t0 , 0), [E[λδ x.e(v)]]i, σ =⇒ Pkhθ, t0 [E[e0 ]]i, σ where: C = (c, e0 ) By the transform T we know: 0 / Pkhθ, tC.C T ((t0 , 0), [E[λδ x.e(v)]]i) = T (P)kt0 [E[e0 ]] T (Pkhθ, t0 [E[e0 ]]i) = T (P)kt0 [E[e0 ]] Case (Pause Memo): If βn+1 = MemP by our I.H. we know MemS ∈ β and thus by rule (PauseMemo) we know: MemP / P, σ =⇒ P, σ (t0 , 0), where: P = P0 khθ, tC0 [E[λδ x.e(v)]]i 110 By the structure of σ, the definition of ℑ, and rule (EndMemo) we know: C = C.C00 C = (( ), e0 ) By the transform T we know: T (P0 khθ, tc0 [E[λδ x.e(v)]]i) = T (P0 )kt0 [E[e0 ]] / Therefore α = 0. Case (Resume Memo): If βn+1 = MemR by our I.H. we know MemP ∈ β and thus by rule (Resume Memo) we know: MemR Pkhθ, tC0 [E[λδ x.e(v)]]i, σ =⇒ (t0 ,C00 ), Pkhθ, tC0 0 [E[λδ x.e(v)]]i, σ where: ℑ(C, P) = C0 C = C00 .C0 C = C.C000 C = (( ), e0 ) By the transform T we know: T (Pkhθ, tC0 [E[λδ x.e(v)]]i) = T (P)kt0 [E[e0 ]] T ((t0 ,C00 ), Pkhθ, tC0 0 [E[λδ x.e(v)]]i) = T (P)kt0 [E[e0 ]] Case (Memo Communication): If βn+1 = MCom by our I.H. we know MemP ∈ β for both threads tr and ts and thus by rule (Memo Communication) we know MCom Pkts ktr , σ =⇒ Pkts0 ktr0 , σ where: ts = hθ, t0C.C [λδ x.e(v)]i tr = hθ0 , t00C0 .C0 [λδ1 x.e0 (v0 )]i ts0 = hθ00 , t0C [λδ x.e(v)]i tr0 = hθ000 , t00C0 [λδ1 x.e0 (v0 )]i 111 By the transform T we know: T (Pkts ktr ) = T (P)kt 0 [es ]kt 00 [er ] T (Pkts0 ktr0 ) = T (P)kt 0 [e0s ]kt 00 [e0r ] where: C = (( ), es ) C = C00 .C00 C00 = (( ), e0s ) C0 = (( ), er ) C0 = C000 .C000 C000 = (( ), e0r ) We know by the constraint capture rules (Channel), (Spawn), (Comm), and (Ret): es = E00 [send(l, v)] and er = E000 [recv(l)] e0s = E00 [unit] and e0r = E000 [v] therefore: Com T (P)kt 0 [es ]kt 00 [er ] 7−→ T (P)kt 0 [e0s ]kt 00 [e0r ] Theorem 6.4.2 (Safety) If β Pkhθ, t[E[λδ x.e(v)]]i, 0/ =⇒ P0 khθ0 , t[E[v0 ]]i, σ then there exists α1 , . . . , αn ∈ {App, Ch, Spn, Com} s.t. α n 1 T (Pkhθ, t[E[λδ x.e(v)]]i) 7−→ . . . 7−→ T (P0 khθ0 , t[E[v0 ]]i) α 2 The proof follows directly from Lemma 1. By Lemma 1 we know: β Pkhθ, t[E[λδ x.e(v)]]i, 0/ =⇒ P0 khθ0 , t[E[e0 ]]i, σ then there exists α1 , . . . , αn ∈ {App, Ch, Spn, Com} s.t. α n 1 T (Pkhθ, t[E[λδ x.e(v)]]i) 7−→ . . . 7−→ T (P0 khθ0 , t[E[e0 ]]i) α By the grammar of the language we know every v is an e. 
2 112 6.5 Implementation The main extensions to Multi-MLton to support partial memoization involve insertion of read and write barriers to track accesses and updates of references, barriers to monitor function arguments and return values, and hooks to the Concurrent ML library to monitor channel based communication. 6.5.1 Parallel CML and Hooks Our parallel implementation of CML is based on Reppy’s parallel model of CML [9]. We utilize low level locks implemented with compare and swap to provide guarded access to channels. Whenever a thread wishes to perform an action on a channel, it must first acquire the lock associated with the channel. Since a given thread may only utilize one channel at a time, there is no possibility of deadlock. The underlying CML library was also modified to make memoization efficient. The bulk of the changes were hooks to monitor channel communication and spawns, additional channel queues to support constraint matching on synchronous operations, and to log successful communication (including selective communication and complex composed events). The constraint matching engine required a modification to the channel structure. Each channel is augmented with two additional queues to hold send and receive constraints. When a constraint is being tested for satisfiability, the opposite queue is first checked (e.g. a send constraint would check the receive constraint queue). If no match is found, the regular queues are checked for satisfiability. If the constraint cannot be satisfied immediately it is added to the appropriate queue. 6.5.2 Supporting Memoization Any SML function can be annotated as a candidate for memoization. For such annotated functions, its arguments and return values at different call-sites, the communication 113 it performs, and information about the threads it spawns are recorded in a memo table. Memoization information is logged through hooks to the CML runtime and stored by the underlying client code. In addition, to support partial memoization, the continuations of logged communication events are also saved. Our memoization implementation extended CML channels to be aware of memoization constraints. Each channel structure contained a queue of constraints waiting to be solved on the channel. Because it will not be readily apparent if a memoized version of a CML function can be utilized at a call site, we delay a function application to see if its constraints can be matched. these constraints must be satisfied in the order in which they were generated. Constraint matching can certainly fail on a receive constraint. A receive constraint obligates a receive operation to read a specific value from a channel. Since channel communication is blocking, a receive constraint that is being matched can choose from all values whose senders are currently blocked on the channel. This does not violate the semantics of CML since the values blocked on a channel cannot be dependent on one another. In other words, a schedule must exist where the matched communication occurs prior to the first value blocked on the channel. Unlike a receive constraint, a send constraint can only fail if there are (a) no matching receive constraints on the sending channel that expect the value being sent, or (b) no receive operations on that same channel. A CML receive operation (not receive constraint) is ambivalent to the value it removes from a channel. Thus, any receive on a matching channel will satisfy a send constraint. 
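The channel extension described above (each channel carrying queues of pending send and receive constraints, with the opposite constraint queue consulted before the ordinary queues) might look roughly as follows. The record fields, names, list-based queues, and fixed string values are our simplifications, and locking is elided; the Multi-MLton implementation differs in detail.

    (* A simplified sketch of a channel augmented with constraint queues. *)
    type mchan = {
      sendq  : string list ref,   (* values offered by ordinary blocked senders    *)
      recvq  : int list ref,      (* ids of ordinary blocked receivers             *)
      csendq : string list ref,   (* pending send constraints (value to send)      *)
      crecvq : string list ref    (* pending receive constraints (value expected)  *)
    }

    (* Remove the first element satisfying p, if any. *)
    fun takeFirst p [] = NONE
      | takeFirst p (x :: xs) =
          if p x then SOME xs
          else Option.map (fn xs' => x :: xs') (takeFirst p xs)

    (* Test a receive constraint expecting value v: the opposite (send)
       constraint queue is consulted first, then the ordinary sender queue;
       both require an equality test on the value.  If neither yields a
       match the constraint is enqueued for later matching. *)
    fun tryRecvConstraint (ch : mchan, v) =
      case takeFirst (fn v' => v' = v) (!(#csendq ch)) of
          SOME rest => (#csendq ch := rest; true)
        | NONE =>
            (case takeFirst (fn v' => v' = v) (!(#sendq ch)) of
                 SOME rest => (#sendq ch := rest; true)
               | NONE => (#crecvq ch := !(#crecvq ch) @ [v]; false))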
If no receives or sends are enqueued on a constraint’s target channel, a memoized execution of the function will block. Therefore, failure to fully discharge constraints by stalling memoization on a presumed unsatisfiable constraint does not compromise global progress. This observation is critical to keeping memoization overheads low. Our memoization technique relies on efficient equality tests. We extend MLton’s polyequal function to support equality on reals and closures. Although equality on values of type real is not algebraic, built-in compiler equality functions were sufficient for our needs. To support efficient equality on functions, we approximate function equality as closure 114 equality. Unique identifiers are associated with every closure and recorded within their environment; runtime equality tests on these identifiers are performed during memoization. Memoization data is discarded during garbage collection. This prevents unnecessary build up of memoization meta-data during execution. As a heuristic, we also enforce an upper bound for the amount of memo data stored for each function, and the space that each memo entry can take. A function that generates a set of constraints whose size exceeds the memo entry space bound is not memoized. For each memoized function, we store a list of memo meta-data. When the length of the list reaches the upper limit but new memo data is acquired upon an application of the function to previously unseen arguments, one entry from the list is removed at random. 6.6 Performance Results We examined three benchmarks to measure the effectiveness of partial memoization in a parallel setting. The first benchmark is a streaming algorithm for approximating a kclustering of points on a geometric plane. The second is a port of the STMBench7 benchmark [60]. STMBench7 utilizes channel based communication instead of shared memory and bears resemblance to the red-black tree program presented in Section 6.2. The third is Swerve, a web-server that was described in Chapter 4. Our benchmarks were executed on a 16-way AMD Opteron 865 server with 8 processors, each containing two symmetric cores, and 32 GB of total memory, with each CPU having its own local memory of 4 GB. Access to non-local memory is mediated by a hyper-transport layer that coordinates memory requests between processors. 6.6.1 Synthetic Benchmarks Similar to most streaming algorithms [61], a k-clustering application defines a number of worker threads connected in a pipeline fashion. Each worker maintains a cluster of points that sit on a geometric plane. A stream generator creates a randomized data stream of points. A point is passed to the first worker in the pipeline. The worker computes a 115 convex hull of its cluster to determine if a smaller cluster could be constructed from the newly received point. If the new point results in a smaller cluster, the outlier point from the original cluster is passed to the next worker thread. On the other hand, if the received point does not alter the configuration of the cluster, it is passed on to the next worker thread. The result defines an approximation of n clusters (where n is the number of workers) of size k (points that compose the cluster). STMBench7 [60], is a comprehensive, tunable multi-threaded benchmark designed to compare different STM implementations and designs. Based on the well-known 007 database benchmark [62], STMBench7 simulates data storage and access patterns of CAD/CAM applications that operate over complex geometric structures. 
At its core, STMBench7 builds a tree of assemblies whose leaves contain bags of components. These components have a highly connected graph of atomic parts and design documents. Indices allow components, parts, and documents to be accessed via their properties and IDs. Traversals of this graph can begin from the assembly root or any index and sometimes manipulate multiple pieces of data.

STMBench7 was originally written in Java. Our port, besides translating the assembly tree to use a CML-based server abstraction (as discussed in Section 6.2), also involved building an STM implementation to support atomic sections, loosely based on the techniques described in [59, 63]. All nodes in the complex assembly structure and atomic parts graph are represented as servers with one receiving channel and handles to all other adjacent nodes. Handles to other nodes are simply the channels themselves. Each server thread waits for a message to be received and then performs the requested computation. A transaction can thus be implemented as a series of communications with various server nodes.

To measure the effectiveness of our memoization technique, we executed two configurations (one memoized, and the other non-memoized) of our k-clustering algorithm and STMBench7, and measured overheads and performance by averaging results over ten executions. The k-clustering algorithm utilizes memoization to avoid redundant computations based on previously witnessed points as well as redundant computations of clusters. For STMBench7 the non-memoized configuration uses our STM implementation without any memoization, whereas the memoized configuration implements partial memoization of aborted transactions.

For k-clustering, we computed 16 clusters of size 60 out of a stream of 10K randomly generated points. This resulted in the creation of 16 worker threads, one stream generating thread, and a sink thread which aggregates the computation results. STMBench7 was executed on a graph in which there were approximately 280k complex assemblies and 140k assemblies whose bags referenced one of 100 components; by default, each component contained a parts graph of 100 nodes. STMBench7 creates a number of threads proportional to the number of nodes in the underlying data structure; this is roughly 400K threads for our configuration. Our experiments on Swerve were executed using the default server configuration and were exercised using httperf (http://www.hpl.hp.com/research/linux/httperf/), a well-known tool for measuring web-server performance.

The benchmarks represent two very different programming models (pipeline stream-based parallelism and transactions) and leverage two very different execution models: k-clustering makes use of long-lived worker threads while STMBench7 utilizes many lightweight server threads. Each run of both benchmarks has an execution time that ranges between 1 and 3 minutes.

For k-clustering we varied the number of repeated points generated by the stream. Configurations in which there is a high degree of repeated points offer the best performance gain (see Figure 6.14). For example, an input in which 50% of the input points are repeated yields roughly a 50% performance gain. However, we also observe roughly a 17% performance improvement even when all points are randomized. This is because the cluster's convex hull shrinks as the points which comprise the cluster become geometrically compact. Thus, as the convex hull of the cluster shrinks, the likelihood of a random point being contained within the convex hull of the cluster is reduced.
Memoization can take advantage of this phenomenon by avoiding recomputation of the convex hull as soon as it can be determined that the input point resides outside the current cluster. Although we do not envision workloads that have high degrees of repeatability, memoization nonetheless leads to a 30% performance gain on a workload in which only 10% of the inputs are repeated.

Figure 6.14. Normalized runtime percent speedup for the k-clustering benchmark of memoized evaluation compared to non-memoized execution. (Axes: % Speedup vs. % Repeats.)

In STMBench7, the utility of memoization is closely correlated to the number and frequency of aborted transactions. Our tests varied the read-only/read-write ratio (see Figure 6.15) within transactions. Only transactions that modify values can cause aborts. Thus, an execution where all transactions are read-only cannot be accelerated, but one in which transactions can frequently abort (because of conflicts due to concurrent writes) offers potential opportunities for memoization. Thus, the cost to support memoization is seen when there are 100% read-only transactions; in this case, the overhead is roughly 13%. These overheads arise because of the cost to capture memo information (storing arguments, continuations, etc.) and the cost associated with trying to utilize the memo information (discharging constraints). Notice that as the number of transactions which perform modifications to the underlying data structure increases, so do memoization gains. For example, when the percentage of read-only transactions is 60%, we see an 18% improvement in runtime performance compared to a non-memoizing implementation for STMBench7. We expected to see roughly a linear trend correlated to the increase in transactions which perform an update. However, we noticed that performance degrades about 12% from a read/write ratio of 20 to a read/write ratio of zero. This phenomenon occurs because memoized transactions are more likely to complete on their first try when there are fewer modifications to the structure. Since a non-memoized transaction requires longer to complete, it has a greater chance of aborting when there are frequent updates. This difference becomes muted as the number of changes to the data structure increases.

Figure 6.15. Normalized runtime percent speedup for STMBench-7 of memoized evaluation compared to non-memoized execution. (Axes: % Speedup vs. Read/Write Ratio.)

For both benchmarks, memory overheads are proportional to cache sizes and averaged roughly 15% for caches of size eight. The cache size defines the number of different memoized calls maintained for a function. Thus, a cache size of eight means that every memoized function records effects for eight different arguments.

6.6.2 Constraint Matching and Discharge Overheads

To measure the overheads of constraint discharge we executed a simple ping-pong microbenchmark that sent a twenty-eight character string between the pinger and the ponger. The program offers no ability to leverage captured memo information profitably.

Figure 6.16. Normalized percent runtime overhead for discharging send and receive constraints for ping-pong. (Axes: % Overhead vs. Iterations; series: Memo Send, Memo Receive, Memo Both.)
The two threads which comprise the program repeatedly execute the same function; one thread sends the string on a shared channel and the other receives from the channel. We executed four configurations of the benchmark. In the first we memoized the sending function, in the second the receiving function, and in the third both the sending and the receiving functions. These three configurations were normalized with respect to a fourth configuration that did not utilize memoization. The results are given in Figure 6.16. The overheads are comprised of searching memo tables, constraint matching and discharge, increased contention, and propagation of return values. The increased contention on the channel occurs because the channel lock is held longer when matching and discharging a constraint than for a regular send or receive, thereby increasing the probability of contention. In the configuration where only the sending function is memoized the overhead is roughly 47%. When the receiving function is memoized the overhead grows to 55%; this occurs because when matching the receive constraint we must test for equality. When both the sending and receiving functions are memoized the overhead jumps to 80%.

6.6.3 Case Study: File Caching

In Swerve there is no explicit support for caching files in memory. Therefore, repeated requests for a given file will require fetching that file from disk. File caching is a well-known performance optimization that is leveraged by the majority of modern web-servers. We observe that we can leverage our partial memoization technique to implement an error- and timeout-aware file cache. Recall that prior to processing a file chunk, the file processor checks for timeouts and other errors, and whether the user has canceled the request for the file. Naively memoizing the file processor would lead to a solution that is not responsive to errors. Namely, the entire file would be sent over the network prior to the discovery of the timeout or error. This occurs because the file processor notifies the network processor of any errors through an explicit error notification protocol.

The use of partial memoization provides a mechanism that allows us to utilize information stored within our memo tables while preserving responsiveness to errors. Our implementation generates a constraint for the error-checking code in the file processor. If an error occurs, this corresponds to a failed memoization constraint and execution is resumed in the error handler. Since the file processor sends file chunks to the network processor over a shared channel, our memoization scheme generates a constraint for each file chunk sent. These constraints provide the file cache. When the file is subsequently requested, the constraints are discharged. Notice that a send constraint does not require an equality test, only that a matching receive takes place.

In Swerve, we observe an increase in performance correlated to the size of the file being requested by httperf (see Figure 6.17). In each run of httperf, the files requested were all of the same size. Performance gains are capped at roughly 80% for file sizes greater than 8 MB. For each requested file, we build constraints corresponding to the file chunks read from disk. As long as no errors are encountered, the Swerve file processor sends the file chunks to be processed into HTTP packets by another module.
After each chunk has been read from disk, the file processor polls other modules for timeouts and other error conditions. If an error is encountered, subsequent file processing is stopped and the request is terminated. Partial memoization is particularly well suited for Swerve's file processing semantics because control is reverted to the error handling mechanism precisely at the point an error is detected. This corresponds to a failed constraint.

Figure 6.17. Normalized runtime percent speedup for Swerve of memoized evaluation compared to non-memoized execution. (Axes: % Speedup vs. File Size (MB).)

6.7 Related Work

Memoization, or function caching [54, 64–66], is a well-understood method to reduce the overheads of redundant function execution. Memoization of functions in a concurrent setting is significantly more difficult and usually highly constrained [58]. We are unaware of any existing techniques or implementations that apply memoization to the problem of reducing transaction overheads in languages that support selective communication and dynamic thread creation. Our approach also bears resemblance to procedure summary construction for concurrent programs [67]. However, these approaches tend to be based on a static analysis (e.g., the cited reference leverages procedure summaries to improve the efficiency of model checking) and thus are obligated to make use of memoization greedily. Because our motivation is quite different, our approach can consider lazy alternatives, ones that leverage synchronization points to stall memo information use, resulting in potentially improved runtime benefit.

Over the last decade, transactional memory (TM) has emerged as an attractive alternative to lock-based abstractions by providing strong semantic guarantees [68] (atomicity and isolation) as well as a simpler programming model. Transactional memory provides serializability guarantees for any concurrently executing transactional regions, preventing the programmer from having to reason about complex interleavings of such regions. Transactional memory also relieves the burden of reasoning about deadlocks and complex locking protocols. Additionally, TM has also been utilized to extract fine-grain parallelism from critical sections. Transactional memory can be implemented in hardware [69], software [20, 63, 70], or both [71, 72]. Software transactional memory (STM) [20, 63, 70] systems provide scalable performance surpassing that of coarse-grain locks and a simpler, but competitive, alternative to hand-tuned fine-grain locks. Unfortunately, the exact semantics that are provided by STM are highly dependent on the underlying implementation. For instance, STMs that provide weak atomicity guarantees only consider interactions of transactions and do not provide any guarantees if threads not encapsulated in a transaction access memory concurrently being accessed by a transaction. Similarly, there are pessimistic and optimistic transactional systems. Pessimistic transactions afford less parallelism, but in some implementations do not require rollback or state reversion [70]. On the other hand, optimistic transactions, under certain workloads, can provide additional parallelism, but force the programmer to reason about the effects of state reversion. Namely, the programmer must avoid performing I/O operations or any actions that cannot be reverted in a transactional scope.
There has also been work on applying these techniques to a functional programming setting [41, 42]. These proposals usually rely on an optimistic concurrency control model that checks for serializability violations prior to committing the transaction, aborting when a violation is detected. Our benchmark results suggest that partial memoization can help reduce the overheads of aborting optimistic transactions.

Self-adjusting mechanisms [73–75] leverage memoization along with change propagation to automatically adjust a program's execution to a change in its inputs, given an existing execution run. Memoization is used to identify parts of the program which have not changed from the previous execution, while change propagation is harnessed to install changed values where memoization cannot be applied. There has also been recent work on using change propagation in a parallel setting [76]. The programming model assumes fork/join parallelism, and is therefore not suitable for the kinds of contexts we consider. We believe our memoization technique is synergistic with current self-adjusting techniques and can be leveraged along with self-adjusting computation to create self-adjusting programs which utilize message passing.

Reppy and Xiao [77] present a program analysis for CML that analyzes communication patterns to optimize message-passing operations. A type-sensitive interprocedural control-flow analysis is used to specialize communication actions to improve performance. While we also use CML as the underlying subject of interest, our memoization formulation is orthogonal to their techniques.

Checkpointing and replay mechanisms [78] utilize light-weight state restoration to recover from errors or exceptional conditions. These mechanisms unroll a program's execution to a safe point with respect to the error. We believe such checkpointing mechanisms can leverage memoization to avoid redundant computation from the checkpointed state.

Our technique also shares some similarity with transactional events [22, 43, 44] (for further details please see Section 5.8). Transactional events explore a state space of possible communications, finding matching communications to ensure atomicity of a collection of actions. To do so, transactional events require arbitrary lookahead in evaluation to determine if a complex composed event can commit. Partial memoization also explores potential matching communication actions to satisfy memoization constraints. However, partial memoization avoids the need for arbitrary lookahead – failure to discharge memoization constraints simply causes execution to proceed as normal. Since transactional events can be implemented in an optimistic manner instead of relying on an unbounded search strategy, we believe they can leverage partial memoization in ways similar to software transactional memory.

6.8 Concluding Remarks

We have provided a definition of partial memoization in the context of synchronous message passing. We have formalized that definition in an operational semantics, which we have shown to be equivalent to an operational semantics that does not leverage memoization. The usefulness of the abstraction has been shown on two synthetic benchmarks as well as a web-server.

7 ASYNCHRONOUS CML

Software complexity is typically managed using programming abstractions that encapsulate behaviors within modules. A module's internal implementation can be changed without requiring changes to clients that use it, provided that these changes do not entail modifying its signature.
For concurrent programs, the specifications that can be described by an interface are often too weak to enforce this kind of strong encapsulation in the presence of communication that spans abstraction boundaries. Consequently, changing the implementation of a concurrency abstraction by adding, modifying, or removing behaviors often requires pervasive change to the users of the abstraction. Modularity is thus compromised. In particular, asynchronous behavior generated internally within a module defines two temporally distinct sets of actions, neither of which are exposed in a module’s interface. The first are post-creation actions – these are actions that must be executed after an asynchronous operation has been initiated, without taking into account whether the effects of the operation have been witnessed by its recipients. For example, a post-creation action of an asynchronous send on a channel might initiate another send on that same channel; the second action should take place with the guarantee that the first has already deposited its data on the channel. The second are post-consumption actions – these define actions that must be executed only after the effect of an asynchronous operation has been witnessed. For example, a post-consumption action might be a callback that is triggered when the client retrieves data from a channel sent asynchronously. In this chapter, we describe how to build, maintain, and expand asynchronous concurrency abstractions to give programmers the ability to express composable, yet modular, asynchronous protocols. By weaving protocols through the use of post-creation and post-consumption computations, we achieve composability without sacrificing modularity, enabling reasoning about loosely coupled communication partners that span abstraction 126 boundaries. To achieve this, we enable the construction of signatures and interfaces that specify asynchronous behavior via abstract post-creation and post-consumption actions. Supporting composable post-creation and post-consumption in an asynchronous setting is challenging because achieving such composability necessarily entails specifying the behavior across two distinct threads of control – the thread that initiates the action (e.g., the thread that performs an asynchronous send on a channel C), and the thread that completes it (e.g, the thread that reads the value from channel C). The heart of the problem is a dichotomy in language abstractions: asynchrony is fundamentally expressed using distinct threads of control, yet composablity is achieved through abstractions such as events or callbacks, or operations like function composition, that are inherently thread-unaware. To address this significant shortcoming, we introduce a family of asynchronous combinators and primitives that are freely composable, and whose signatures precisely capture the desired post-creation and post-consumption behavior of an asynchronous action. We present our extensions in the context of CML’s first-class communication events [2]. Just as CML’s synchronous events provide a solution to composable, synchronous message passing that could not be easily accommodated using λ-abstraction and application, the asynchronous events defined here offer a solution for composable asynchronous message passing not easily expressible using synchronous communication and explicit threading abstractions. There are three overarching goals of our design: 1. 
Asynchronous combinators should permit uniform composition of pre/post creation and consumption actions. This means that protocols should be allowed to extend the behavior of an asynchronous action both with respect to the computation performed before and after it is created and consumed.

2. Asynchronous actions should provide sensible visibility and ordering guarantees. A post-creation computation should execute with the guarantee that the asynchronous action it follows has been created (e.g., an action has been deposited on a channel), and the effects of consumed asynchronous actions should be consistent with the order in which they were created.

3. Communication protocols should be agnostic with respect to the kinds of actions they handle. Thus, both synchronous and asynchronous actions should be permitted to operate over the same set of abstractions (e.g., communication channels).

7.1 Design Considerations

There are two fundamental ways to express asynchrony in modern programming languages: (1) through the use of explicit asynchronous primitives built into the language itself, or (2) through the use of lightweight threads encapsulating synchronous primitives. Explicit asynchrony is present in languages such as Erlang [1] and JoCaml [79], and in libraries such as MPI. In contrast, synchronous communication is the fundamental building block in Concurrent ML [2] and languages that have been extended to support CML-like abstractions (e.g., Haskell [80, 81] or Scheme [49]). For the latter, asynchronous behavior is typically expressed by wrapping synchronous actions within a thread.

It is difficult to express post-consumption actions [82] using typical asynchronous primitives without requiring some sort of notification from the consumer of the action to the initiator. Asynchronous actions fundamentally decouple the two parties in a communication protocol (e.g., the sender of a message is unaware of when the receiver receives it), thus preventing the ability to define computations that are triggered when the communication fully completes.1 Having explicit threads that encapsulate synchronous actions alleviates this problem since the desired post-creation and post-consumption behavior can be specified as part of a user-defined asynchronous abstraction. But threads do not provide any guarantee on the ordering or visibility of actions; the point at which an asynchronous communication takes place depends on the behavior of the underlying thread scheduler. Additionally, once a thread encapsulates a computation it can no longer be extended, limiting composability.

1 Note that the thread which initiates the action does not need to be aware of the completion of the communication.

Figure 7.1. Two server-client concurrency abstractions extended to utilize logging. In (a) asynchronous sends are depicted as solid lines, whereas in (b) synchronous sends are depicted as solid lines and lightweight threads as boxes. Logging actions are presented as dotted lines.

Such ordering guarantees are less stringent than those afforded by futures, as they provide no guarantees between asynchronous actions that operate over different channels.
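To make the thread-based encoding concrete, the following is a minimal sketch (not code from the implementation) of how an asynchronous send, and a post-consumption action, are typically expressed in CML by wrapping a synchronous send in a spawned thread; the helper names asyncSend and asyncSendThen are hypothetical.

    fun asyncSend (c, v) =
      (* the spawned thread blocks until a receiver takes v; the caller does not *)
      (spawn (fn () => sync (sendEvt (c, v))); ())

    fun asyncSendThen (c, v, postConsume) =
      (* postConsume runs only after the synchronous send completes,
         i.e. only after the value has been consumed by a receiver *)
      (spawn (fn () => (sync (sendEvt (c, v)); postConsume ())); ())

Two calls to asyncSend on the same channel give no ordering guarantee: the scheduler may run the two spawned threads in either order, which is precisely the weakness discussed above.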
To illustrate these points, consider a channel abstraction that mediates access between “clients” and a “server” – clients issue requests to servers by writing their request to the channel, and servers reply to the request by depositing their results on a dedicated channel provided by the client. Internally, the server implementation may initiate replies asynchronously without waiting for client acknowledgement (e.g., in response to a file request, the server may send chunks asynchronously that are subsequently aggregated by the client). Now, consider augmenting this implementation so that actions on the channel are logged. For example, the desired new behavior might require internally monitoring the rate at which chunks are consumed by a client so that the server can provide quality of service guarantees. Given primitive support for asynchronous actions, requests by the client can be serviced asynchronously by the server. The asynchronous replies generated by the server retain the desired ordering guarantees since chunks are deposited on the channel in the order in which the asynchronous sends are executed. However, extending the protocol to support 129 post-consumption actions that supplement the log is difficult to do without exposing the desired additional functionality to the client. This, unfortunately, embedded the server’s functionality into the client, which is not desirable. While the client can be extended to modify shared or global state after each receipt as depicted in Figure 7.1(a), modularity is compromised. For example, the client would have to distinguish between each potential server it may communicate with to correctly log only those actions initiated by the particular server that desires it. Furthermore, changes in the structure and behavior of the server’s log would necessitate changes in logging code of the client. In contrast, through the use of lightweight threads and synchronous primitives, we can encode a very simple post-consumption action by sequencing the action after a synchronous send (as shown in Figure 7.1(b)) that is executed by a newly-created thread of control. Here, the server initiates an asynchronous reply by creating a thread that synchronously communicates with the client. The post-consumption action that adds to the log is performed by the thread after the synchronous communication completes. The client is unaware of the additional functionality; the logging action is entirely encapsulated by the thread. However, this solution suffers from two related problems. First, the use of explicit threads for initiating asynchronous computation fails to provide any post-creation guarantees. A post-creation action, in this case the initiation of the second chunk of data, can make no guarantees at the point it commences evaluation that the value of the first send has actually been deposited on the channel since the encoding defines the creation point of the asynchronous action to be the point at which the thread is created, not the point where the data is placed on the channel. Second, because threads are inherently unordered with respect to one another and agnostic to their payload, there is no transparent mechanism to enforce a sensible ordering relationship on the communication actions they encapsulate. This results in the second thread that is spawned by the server being able to execute prior to the first. Consequently the data the server produces can be received out of order. 
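A rough sketch of the thread-per-reply encoding depicted in Figure 7.1(b) follows; the names consumer, chunk, and logChan are hypothetical stand-ins for the client's reply channel, a file chunk, and the log's input channel. Each reply is sent synchronously from a fresh thread, so the logging action runs only after the client has taken the chunk, and the client never sees it.

    fun replyWithLog (consumer, chunk, logChan) =
      (spawn (fn () =>
         (sync (sendEvt (consumer, chunk));   (* completes when the client receives *)
          send (logChan, chunk)));            (* post-consumption logging action *)
       ())

Because each reply lives in its own thread, two replies issued back-to-back may be delivered in either order, so the client can observe chunks out of order.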
Dealing with this issue again requires augmenting the client to correctly reassemble chunks. As the example illustrates, there exists a dichotomy in language-based concurrency abstractions for achieving asynchrony. Specialized asynchronous primitives ensure visibility and ordering guarantees but preclude the specification and composability of post-consumption actions, while synchronous primitives and threads provide a mechanism to encapsulate post-consumption actions but fail to preserve ordering (and thus cannot support composable post-creation actions), and suffer from the standard composability limitations of threads. The challenge in building expressive asynchronous communication abstractions is defining mechanisms that allow programmers to express both composable post-creation and post-consumption behavior, while also ensuring sensible ordering and visibility guarantees. We implement our design in the context of Concurrent ML [2].

7.1.1 Putting it All Together

Although synchronous message passing alleviates the complexity of reasoning about arbitrary thread interleavings, and enables composable synchronous communication protocols, using threads to encode asynchrony unfortunately re-introduces these complexities. Having primitive asynchronous send and receive operations avoids the need for implicitly created threads to encapsulate an asynchronous operation, but does not support composability. Our design equips asynchronous events with the following properties: (i) they are extensible both with respect to pre- and post-creation as well as pre- and post-consumption actions; (ii) they can operate over the same channels that synchronous events operate over, meaning channels are agnostic to whether they are used synchronously or asynchronously, allowing both kinds of events to seamlessly co-exist; and (iii) their visibility, ordering, and semantics are independent of the underlying runtime and scheduling infrastructure.

7.2 Asynchronous Events

In order to provide primitives that adhere to the properties outlined above, we extend CML with the following two base events: aSendEvt and aRecvEvt, for creating an asynchronous send event and an asynchronous receive event respectively. Although similar, asynchronous events are not syntactic sugar as they cannot be expressed using CML primitives. The differences in their type signatures from their synchronous counterparts reflect the split in the creation and consumption of the communication action they define:

    sendEvt  : 'a chan * 'a -> unit Event
    aSendEvt : 'a chan * 'a -> (unit, unit) AEvent

    recvEvt  : 'a chan -> 'a Event
    aRecvEvt : 'a chan -> (unit, 'a) AEvent

An AEvent value is parametrized with respect to the type of the event's post-creation and post-consumption actions. In the case of aSendEvt, both actions are of type unit: when synchronized on, the event immediately returns a unit value and places its 'a argument value on the supplied channel. The post-consumption action also yields unit. When synchronized on, an aRecvEvt returns unit; the type of its post-consumption action is 'a, reflecting the type of value read from the channel when it is paired with a send. In conjunction with the new base events, we introduce a new synchronization primitive, aSync, to synchronize asynchronous events. The aSync operation fires the computation encapsulated by the asynchronous event of type ('a, 'b) AEvent and returns a value of type 'a, corresponding to the return type of the event's post-creation action (see Figure 7.2).
sync aSync : ’a Event -> ’a : (’a, ’b) AEvent -> ’a Unlike their synchronous variants, asynchronous events do not block if no matching communication is present. For example, executing an asynchronous send event on an empty channel places the value being sent on the channel and then returns control to the executing thread (see Figure 7.2(a)). In order to allow this non-blocking behavior, an implicit thread of control is created for the asynchronous event when the event is paired, or consumed as shown in Figure 7.2(b). If a receiver is present on the channel, the asynchronous send event behaves similarly to a synchronous event; it passes the value to the receiver. However, it 132 (a) Thread 1 (b) Thread 2 aSync(ev) v ev v post creation actions c recv(c) c Implicit Thread post consumption actions Figure 7.2. The figure shows a complex asynchronous event ev , built from a base event aSendEvt , being executed by Thread 1. (a) When the event is synchronized via aSync , the value v is placed on channel c and post-creation actions are executed. Afterwards, control returns to Thread 1. (b) When Thread 2 consumes the value v from channel c , an implicit thread of control is created to execute any post-consumption actions. still creates a new implicit thread of control if there are any post-consumption actions to be executed. Similarly, the synchronization of an asynchronous receive event does not yield the value received. Instead, it simply enqueues the receiving action on the channel. Therefore, the thread which synchronizes on an asynchronous receive always gets the value unit, even if a matching send exists (see Figure 7.3(a)). The actual value consumed by the asynchronous receive can be passed back to the thread which synchronized on the event through the use of combinators that process post-consumption actions (see Figure 7.3(b)). This is particularly well suited to encode reactive programming idioms: the post-consumption actions encapsulate a reactive computation. To illustrate the differences between the primitives, consider the two functions f and af shown below: 133 fun f () = (spawn (fn () => sync (sendEvt(c, v))); sync (sendEvt(c, v’)); sync (recvEvt(c))) fun af () = (spawn (fn () => sync (sendEvt(c, v))); aSync (aSendEvt(c, v’)); sync (recvEvt(c))) The function f , if executed in a system with no other threads, will always block because there is no recipient available for the send of v’ on channel c . On the other hand, suppose there was another thread willing to accept communication on channel c . In this case, the only possible value that f could receive from c is v . This occurs because the receive will only happen after the value v’ is consumed from the channel. Notice that if the spawned thread enqueues v on the channel before v’ , the function f will block even if another thread is willing to receive a value from the channel, since a function cannot synchronize with itself. The function af , on the other hand will never block. The receive may see either the value v or v’ since the asynchronous send event only asserts that the value v’ has been placed on the channel and not that it has been consumed. Consider the following refinement of af : fun af’ () = (aSync (aSendEvt(c, v’)); spawn (fn () => sync (sendEvt(c, v))); sync (recvEvt(c))) Assuming no other threads exist that read from c , the receive in af’ can only witness the value v’ . Although the spawn occurs before the synchronous receive, the channel c is guaranteed to contain the value v’ prior to v . 
While asynchronous events do not block, they still enforce ordering constraints that reflect the order in which they were created within the same thread, based on their channels. This distinguishes their behavior from our 134 (a) Thread 1 (b) Thread 2 aSync(ev) v ev recv c post creation actions send(c, v) c v Implicit Thread v post consumption actions Figure 7.3. The figure shows a complex asynchronous event ev , built from a base event aRecvEvt , being executed by Thread 1. (a) When the event is synchronized via aSync , the receive action is placed on channel c and post-creation actions are executed. Afterwards, control returns to Thread 1. (b) When Thread 2 sends the value v to channel c , an implicit thread of control is created to execute any post-consumption actions passing v as the argument. initial definition of asynchronous events that explicitly encoded asynchronous behavior in terms of threads. 7.2.1 Combinators In CML, the wrap combinator allows for the specification of a post-synchronization action. Once the event is completed the function wrapping the event is evaluated. For asynchronous events, this means the wrapped function is executed after the action the event encodes is placed on the channel and not necessarily after that action is consumed. sWrap : (’a, ’b) AEvent * (’a -> ’c) -> (’c, ’b) AEvent aWrap : (’a, ’b) AEvent * (’b -> ’c) -> (’a, ’c) AEvent 135 To allow for the specification of both post-creation and post-consumption actions for asynchronous events, we introduce two new combinators: sWrap and aWrap . sWrap is used to specify post-creation actions. The combinator aWrap , on the other hand, is used to express post-consumption actions. We can apply sWrap and aWrap to an asynchronous event in any order. sWrap(aWrap(e, f) g) ≡ aWrap(sWrap(e, g), f) Since post-creation actions have been studied in CML extensively, we focus our discussion on aWrap and the specification of post-consumption actions. Consider the following program fragment: fun f() = let val clocal = channel() in aSync (aWrap(aSendEvt(c, v),fn () => send(clocal , ()))); g(); recv(clocal ); h() end The function f first allocates a local channel clocal and then executes an asynchronous send aWrap -ed with a function that sends on the local channel. The function f then proceeds to execute functions g and h with a receive on the local channel between the two function calls. We use the aWrap primitive to encode a simple barrier based on the consumption of v . We are guaranteed that h executes in a context in which v has been consumed. The function g , on the other hand, can make no assumptions on the consumption of v . However, g is guaranteed that v is on the channel. Therefore, if g consumes values from c , it can witness v and, similarly, if it places values on the channel, it is guaranteed that v will be consumed prior to the values g produces. Note that v could have been consumed prior to g ’s evaluation. If the same code was written with a synchronous wrap, we would have no guarantee about the consumption of v . In fact, the code would block, as the send encapsulated by the wrap would be executed by the same thread of control executing f . Thus, the asynchronous 136 event implicitly creates a new evaluation context and a new thread of control. The wrapping function is evaluated in this context, not the thread which performed the synchronization. We can now encode a very basic callback mechanism using aWrap . 
The code shown below performs an asynchronous receive and passes the result of the receive to its wrapped function. The value received asynchronously is passed as an argument to h by sending on the channel clocal . let val clocal = channel() in aSync (aWrap(aRecvEvt(c), fn x => send(clocal , x))); ... h(recv(clocal )) end Although this implementation suffices as a basic callback, it is not particularly abstract and cannot be composed with other asynchronous events. We can create an abstract callback mechanism using both sWrap and aWrap around an input event. callbackEvt : (’a, ’c) AEvent * (’c -> ’b) -> (’b Event, ’c) AEvent fun callbackEvt(ev, f) = let val clocal = channel() in sWrap(aWrap(ev, fn x => (aSync(aSendEvt(clocal , x)); x)), fn => wrap(recvEvt(clocal ), f)) end If ev contains post-creation actions when the callback event is synchronized on, they are executed, followed by execution of the sWrap as shown in Figure 7.4(a). It returns a new event (call it ev’ ), which when synchronized on will first receive on the local channel and then apply the function f to the value it receives from the local channel. Synchronizing on this event will block until the event ev is consumed. Once ev is consumed, its post-consumption actions are executed in a new thread of control since ev is asynchronous (see Figure 7.4(b)). The body of the aWrap -ed function simply sends the result 137 (a) aSync(callBackEvt(ev, f)) (b) consumption of ev base event Implicit Thread v Thread 1 post consumption actions ev base event v' c aSend(clocal, v') post creation actions ev' clocal (c) sync(ev') ev' v' clocal f(v') Figure 7.4. The figure shows a callback event constructed from a complex asynchronous event ev and a callback function f . (a) When the callback event is synchronized via aSync , the action associated with the event ev is placed on channel c and post-creation actions are executed. A new event, ev’ , is created and passed to Thread 1. (b) An implicit thread of control is created after the base event of ev is consumed. Postconsumption actions are executed passing v , the result of consuming the base event for ev , as an argument. The result of the post-consumption actions, v’ is sent on clocal . (c) When ev’ is synchronized upon, f is called with v’ . of synchronizing on ev (call it v’ ) on the local channel and then passes the value v’ to any further post-consumption actions. This is done asynchronously because the complex event returned by callbackEvt can be further extended with additional post consumption actions. Those actions should not be blocked if there is no thread willing to synchronize on ev’ . Synchronizing on a callback event, thus, executes the base event associated with ev and creates a new event as a post-creation action, which when synchronized on, executes the callback function synchronously, returning the result of the callback. We can think of the difference between a callback and an aWrap of an asynchronous event in terms of the thread of control that executes them. Both specify a post-consumption 138 action for the asynchronous event, but the callback, when synchronized upon, is executed potentially by an arbitrary thread whereas the aWrap is always executed in the implicit thread created when the asynchronous event is consumed. Another difference is that the callback can be postponed and only executes when two conditions are satisfied: (i) the asynchronous event has completed and (ii) the callback is synchronized on. 
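The following purely illustrative fragment (assuming a channel c : int chan and the callbackEvt defined above; nothing here is taken from the implementation) shows the two-step protocol: synchronizing on the callback event starts the asynchronous receive and immediately yields a new event, and synchronizing on that event later blocks until the receive has been paired, then runs the callback.

    let
      val cbEvt = callbackEvt (aRecvEvt (c), fn x => x + 1)
      val ev' = aSync (cbEvt)    (* non-blocking; ev' : int Event *)
      (* ... unrelated work can proceed here ... *)
    in
      sync (ev')                 (* blocks until a send on c is consumed; yields x + 1 *)
    end

The thread that performs sync (ev') may be any thread that holds ev', which is what distinguishes the callback from an aWrap-ed post-consumption action.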
An aWrap returns once it has been synchronized on, and does not need to wait for other asynchronous events or post-consumption actions it encapsulates to complete. A guard of an asynchronous event behaves much the same as a guard of a synchronous event does; it specifies pre-synchronization actions (i.e. pre-creation computation): aGuard : (unit -> (’a, ’b) AEvent) -> (’a, ’b) AEvent To see how we might use asynchronous guards, notice our definition of callbackEvt has the unfortunate drawback that it allocates a new local channel regardless of whether or not the event is ever synchronized upon. The code below uses an aGuard to specify the allocation of the local channel only when the event is synchronized on: fun callbackEvt(ev, f) = aGuard(fn () => let val clocal = channel() in sWrap(aWrap(ev, fn x => (aSync(aSendEvt(clocal , x));x)), fn => wrap(recvEvt(clocal ), f)) end) One of the most powerful combinators provided by CML is a non-deterministic choice over events. The combinator choose picks an active event from a list of events. If no events are active, it waits until one becomes active. An active event is an event which is available for synchronization. We define an asynchronous version of the choice combinator, aChoose , that operates over asynchronous events. Since asynchronous events are nonblocking, all events in the list are considered active. Therefore, the asynchronous choice always non-deterministically chooses from the list of available asynchronous events. We also provide a blocking version of the asynchronous choice, sChoose , which blocks until 139 one of the asynchronous base events has been consumed. Post-creation actions are not executed until the choice has been made. 2 choose : ’a Event list -> ’a Event aChoose : (’a, ’b) AEvent list -> (’a, ’b) AEvent sChoose : (’a, ’b) AEvent list -> (’a, ’b) AEvent To illustrate the difference between aChoose and sChoose , consider a complex event ev defined as follows: val ev = aChoose[aSendEvt(c, v), aSendEvt(c’,v’)] If there exists a thread only willing to receive from channel c as shown in Figure 7.5, aChoose will nonetheless, with equal probability, execute the asynchronous send on c and c’ . However, if we redefined ev to utilize sChoose instead, the behavior of the choice changes: val ev = sChoose[aSendEvt(c, v), aSendEvt(c’,v’)] Since sChoose blocks until one of the base asynchronous events is satisfiable, if there is only a thread willing to accept communication on c (as in Figure 7.5), the choice will only select the event encoding the asynchronous send on c . We have thus far provided a mechanism to choose between sets of synchronous events and sets of asynchronous events. However, we would like to allow programmers to choose between both synchronous and asynchronous events. Currently, their different type structure would prevent such a formulation. Notice, however, that an asynchronous event with type (’a, ’b) AEvent and a synchronous event with type ’a Event both yield ’a in the thread which synchronizes on them. Therefore, it is sensible to allow choice to operate over both asynchronous and synchronous events provided the type of the asynchronous event’s post-creation action is the same as the type encapsulated by the synchronous event. 
To facilitate this interoperability, we provide combinators to transform asynchronous event types to synchronous event types and vice-versa: aTrans : (’a, ’b) AEvent -> ’a Event 2 This behavior is equivalent to a scheduler not executing the thread which created the asynchronous action until it has been consumed. 140 Thread 1 Thread 2 aSync(ev) recv(c) ev aSendEvt(c,v) aSe c ndE vt(c ',v') c' Figure 7.5. The figure shows Thread 1 synchronizing on a complex asynchronous event ev , built from a choice between two base asynchronous send events; one sending v on channel c and the other v’ on c’ . Thread 2 is willing to receive from channel c . sTrans : ’a Event -> (unit, ’a) AEvent The aTrans combinator takes an asynchronous event and creates a synchronous version by dropping the asynchronous portion of the event from the type (i.e. encapsulating it). As a result, we can no longer specify post-consumption actions for the event. However, we can still apply wrap to specify post-creation actions to the resulting synchronous portion exposed by the ’a Event . Asynchronous events that have been transformed and are part of a larger choose event are only selected if their base event is satisfiable. Therefore, the following equivalence holds for two asynchronous events, aEvt1 and aEvt2 : choose[aTrans(aEvt1), aTrans(aEvt2)] ≡ aChoose[aEvt1, aEvt2] The sTrans combinator takes a synchronous event and changes it into an asynchronous event with no post-creation actions. The wrapped computation of the original event occurs now as a post-consumption action. We can encode an asynchronous version of alwaysEvt from its synchronous counterpart. 141 aAlwaysEvt : ’a -> (unit, ’a) AEvent aNever : (unit, ’a) AEvent aAlwaysEvt(v) = sTrans alwaysEvt(v) aNever = sTrans never We provide a complete listing of the ACML interface in Figure 7.6. 7.2.2 Mailboxes and Multicast Using asynchronous events, we encoded other CML structures such as mailboxes (i.e., buffered channels) and multicasts channels, reducing code size and complexity. Mailboxes, or buffered asynchronous channels, are provided by the core CML implementation. Mailboxes are a specialized channel that supports asynchronous sends and synchronous receives. However, mailboxes are not built directly on top of CML channels, requiring a specialized structure, on the order of a 240 lines of CML code, to support asynchronous sends. Using asynchronous events, we reduced the original CML mailbox implementation from 240 LOC to 70 LOC, with a corresponding 52% improvement in performance on synthetic stress tests exercising various producer/consumer configurations. Asynchronous events provide the necessary components from which a mailbox structure can be defined, allowing the construction of mailboxes from regular CML channels, and providing a facility to define asynchronous send events on the mailbox. Having an asynchronous send event operation defined for mailboxes allows for their use in selective communication. Additionally, asynchronous events now provide the ability for programmers to specify post-creation and post-consumption actions. The asynchronous send operator and asynchronous send event can be defined as follows: fun send(mailbox, value) = CML.aSync(CML.aSendEvt(mailbox, value)) fun sendEvt(mailbox, value) = CML.aSendEvt(mailbox, value) 142 The synchronous receive and receive event are expressed in terms of regular CML primitives. 
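To round out the excerpt above, the following is a minimal sketch of the remaining mailbox operations, assuming the mailbox is represented directly by an ordinary CML channel (the actual structure in the implementation is somewhat richer):

    fun mailbox ()         = CML.channel ()
    fun recv (mailbox)     = CML.sync (CML.recvEvt (mailbox))   (* synchronous receive *)
    fun recvEvt (mailbox)  = CML.recvEvt (mailbox)              (* plain CML receive event *)
    fun recvPoll (mailbox) = CML.recvPoll (mailbox)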
This highlights the interoperability of asynchronous events with their synchronous counterparts and provides programmers with a rich set of combinators to use with mailboxes.

Multicast channels in CML provide a mechanism to multicast a message to multiple recipients. Such an operation is naturally expressed using asynchrony. Abstractly, we can encode a multicast by asynchronously sending the message to all the recipients on the multicast channel. As with mailboxes, we expressed multicast channels in 60 LOC, compared to 87 LOC in CML, with a roughly 19% improvement in performance.

7.3 Semantics

Our semantics (see Figure 7.11, Figure 7.12, and Figure 7.13) is defined in terms of a core call-by-value functional language with threading and communication primitives. Communication between threads is achieved using channels and events. In our syntax, v ranges over values, p over primitive event constructors, and e over expressions. Besides abstractions, a value can be a message identifier, used to track communication actions, a channel identifier, or an event context. An event context (ε[]) demarcates event expressions that are built from asynchronous events and their combinators, and that are eventually supplied as an argument to a synchronization action. (We describe the necessity of a guarded event context when we introduce the combinators later in this section.) The rules use function composition f ◦ g ≡ λx. f(g(x)) to sequence event actions and computations. The semantics also includes a new expression form, {e1, e2}, to denote asynchronous communication actions; the expression e1 corresponds to the creation (and post-creation) of the asynchronous event, while e2 corresponds to the consumption (and post-consumption) of the asynchronous event. For convenience, both synchronous and asynchronous events are expressed in this form. For a synchronous event, e2 simply corresponds to an uninteresting action. We refer to e1 as the synchronous portion of the event, the expression which is executed by the current thread, and e2 as the asynchronous portion of the event, the expression which is executed by a newly created thread (see rule (SyncEvent)).

A program state consists of a set of threads (T), a communication map (∆), and a channel map (C). The communication map is used to track the state of an asynchronous action, while the channel map records the state of channels with respect to waiting (blocked) actions. Evaluation is specified via a relation (→) that maps one program state to another. Evaluation rules are applied up to commutativity of parallel composition (||). The semantics makes use of two auxiliary relations (⇒) and (;).

7.3.1 Encoding Communication

A communication action is split into two message parts: one corresponding to a sender and the other to a receiver. A send message part is, in turn, composed of two conceptual primitive actions: a send act (sendAct(c, v)) and a send wait (sendWait):

    sendAct: (ChannelId × Val) → MessageId → MessageId
    sendWait: MessageId → Val

The send act primitive, when applied to a message identifier, places the value (v) on the channel (c), while the send wait, when applied to a message identifier, blocks until the value has been consumed off of the channel, returning unit when the message has been consumed. The message identifier m, generated for each base event (see rule (SyncEvent)), is used to correctly pair the "act" and "wait".
Similarly, a receive message part is composed of receive act (recvAct(c)) and a receive wait (recvWait) primitives: recvAct: ChannelId → MessageId → MessageId recvWait: MessageId → Val A receive wait behaves as its send counterpart. A receive act removes a value from the channel if a matching send action exists; if none exists, it simply records the intention of performing the receive on the channel queue. We can think of computations occurring 144 after an act as post-creation actions and those occurring after a wait as post-consumption actions. Splitting a communication message part into an ”act” and a ”wait” primitive functions allows for the expression of many types of message passing. For instance, a traditional synchronous send is simply the sequencing of a send act followed by a send wait: sendWait ◦ sendAct(c, v). This encoding immediately causes the thread executing the operation to block after the value has been deposited on a channel, unless there is a matching receive act currently available. A synchronous receive is encoded in much the same manner. We use the global communication map (∆) to track act and wait actions for a given message identifier. A message id is created at a synchronization point, ensuring a unique message identifier for each synchronized event. At a synchronization point, the message is mapped to (⊥) in the communication map. Once a send or receive act occurs, ∆ is updated to reflect the value yielded by the act by rule (M ESSAGE), through an auxiliary relation (⇒). When a send act occurs the communication map will hold a binding to unit for the corresponding message, but when a receive act occurs the communication map binds the corresponding message to the value received. The values stored in the communication map are passed to the wait actions corresponding to the message by rules (S END WAIT) and (R ECV WAIT). During the evaluation of a choice the communication map is updated with bindings of multiple message ids to a choice id (ω). When one of the messages is mapped to a value, all other messages which map to a choice id are mapped to (>) instead by rules (M ESSAGE C HOICE). 7.3.2 Base Events There are four rules for creating base events, (S END E VENT) and (R ECV E VENT) for synchronous events, and (AS END E VENT) and (AR ECV E VENT) for their asynchronous counterparts. From base act and wait actions, we define asynchronous events: ε[{sendAct(c, v), sendWait}] 145 The first component of an asynchronous event is executed in the thread in which the expression evaluates, and is the target of synchronization (sync ), while the second component defines the actual asynchronous computation. For asynchronous events we split the act from the wait. Synchronous events can also be encoded using this notation: ε[{sendWait ◦ sendAct(c, v), λx.unit}]. In a synchronous event both the act and its corresponding wait occur in the synchronous portion of the event. The base asynchronous portion is simply a lambda that yields a unit value. 7.3.3 Event Evaluation In rule (S YNC E VENT), events are deconstructed by the sync operator. It strips the event context (ε[]), generates a new message identifier for the base event, creates a new thread of control, and triggers the evaluation of the internal expressions. The asynchronous portion of the event is wrapped in a new thread of control and placed in the regular pool of threads. If the event abstraction being synchronized was generated by a base synchronous event, the asynchronous portion is an uninteresting value (e.g. 
, λ x.unit). The newly created thread, in the case of an asynchronous event, will not be able to be evaluated further as it blocks until the corresponding act for the base event comprising the complex asynchronous event is discharged. 7.3.4 Communication and Ordering There are four rules for communicating over channels: (S END M ATCH), (R ECV M ATCH), (S END B LOCK), and (R ECV B LOCK). The channel map (C ) encodes abstract channel states mapping a channel to a sequence of actions (A ). This sequence encodes a FIFO queue and provides ordering between actions on the channel. The channel will have either a sequence of send acts (As ) or receive acts (Ar ), but never both at the same time. This is because if there are, for example, send acts enqueued on it, a receive action will immediately match the send, instead of needing to be enqueued and vice versa (rules (S END M ATCH) and (R ECV M ATCH) as well as the rules (E NQUEUE S END) and (E NQUEUE R ECEIVE)). If 146 a channel already has send acts enqueued on it, any thread wishing to send on the channel will enqueue its act and vice versa (rules (S END B LOCK) and (R ECV B LOCK)). After enqueueing its action, a thread can proceed with its evaluation. The auxiliary relation (;) enqueues a given action on channel and yields a new channel map with this change. Ordering for asynchronous acts and their post consumption actions as well as blocking of synchronous events is achieved by rules (S ENDWAIT) and (R ECV WAIT). Both rules block the evaluation of a thread until the corresponding act has been evaluated. In the case of synchronous events, this thread is the one that initiated the act; in the case of an asynchronous event, the thread that creates the act is different from the one that waits on it, and the blocking rules only block the implicitly created thread. For example, the condition ∆(m) = unit in rule (S ENDWAIT) is established either by rule (S END M ATCH), in the case of a synchronous action (created by (S END E VENT)), or rules (S END B LOCK) and (R ECV M ATCH) for an asynchronous one (created by (AS END E VENT)). 7.3.5 Combinators Complex events are built from the combinators described earlier; their definitions are shown in Figure 7.14. We define two variants of wrap, sWrap for specifying extensions to the synchronous portion of the event and aWrap for specifying extension to the asynchronous portion of the event. In the case of a synchronous event, we have sWrap extend the event with post-consumption actions as the base event will perform both the act and wait in the synchronous portion of the event. Similarly, leveraging aWrap on a synchronous event allows for the specification of general asynchronous actions. If the base event is asynchronous, sWrap expresses post creation actions and aWrap post consumption actions. The specification of the guard combinator is a bit more complex. Since a guard builds an event expression out of a function, that when executed yields an event, the concrete event is only generated at the synchronization point. This occurs because the guard is only executed when synchronized upon. The rule (G UARD) simply places the function 147 applied to a unit value (the function is really a thunk) in a specialized guarded event context (ε[(λx.e)unit]g ). The rule (S YNC G UARDED E VENT) simply strips the guarded event context and synchronizes on the encapsulated expression. This expression, when evaluated, will yield an event. 
Guarded events cannot be immediately extended with an sWrap or aWrap as the expression contained within a guarded event context is a function. Instead, wrapping an event in a guarded context simply moves the wrap expression into the event context. 7.3.6 Choose Events There are two choose event constructors, given in Figure 7.15, aChoose, and sChoose. Each event constructor takes a list of input events and performs a choice over the list. Since aChoose picks an asynchronous event regardless if the asynchronous portion of the event is satisfiable, we can make the choice prior to synchronization. Notice, the rule (AC HOOSE E VENT) simply picks one of the asynchronous events. This formulation allows us to simplify the rules. The rule (S C HOOSE E VENT) creates a complex event that when synchronized upon will perform a non-deterministic choice between the satisfiable events. There is one rule for flattening choices: (S C HOOSE F LATTEN). This rule is based on the following equivalence: choose(choose(e, e’), e’’) ≡ choose(e, e’, e’’). Flattening choices provides further simplification of the rules, namely the rules for blocking until an event within a choice becomes satisfiable. 7.3.7 Synchronization of Choice The rules for evaluating sChoose are shown in Figure 7.16. When an sChoose event is synchronized upon, the rule (S YNC S C HOOSE) generates one message id for each input event within the choice. These message ids are mapped to (⊥) in the communication map. If one of the events is immediately satisfiable, the sChoose complex event evaluates this base event in rule (S C HOOSE). In this case a new thread of control is created to evaluate the asynchronous portion of the event. If one of the input events is not satisfiable, the sChoose 148 event blocks. Both rules for blocking enqueue each of the input events’ actions on a channel through the relation (;). Each message id is mapped to the choice id generated for this choice. The rules (M ESSAGE C HOICE) ensure that only one of the actions associated with the message ids will be matched. When one of the input events becomes satisfiable, we create a new thread of control to evaluate the asynchronous portion by rule (S C HOOSE UN B LOCK). Since all the input events’ actions were enqueued on channels, these actions must be removed after the evaluation of the choice. The rule (C HANNEL C LEAN) removes all actions from a channel if their associated message id maps to (>) in the communication map. The rules (M ESSAGE C HOICE) map all message ids associated with a particular choice id to (>). There are two versions of the block rule since we may have encoded a synchronous or asynchronous variant of choice. The synchronous variant of choice corresponds to the CML choose event and the asynchronous to our sChoose event. 7.3.8 sWrap, aWrap, and Guard of Choose Events Notice that choose events introduce a new syntactic event form that contains base events. The structure of asynchronous choice prevents the application of the sWrap and aWrap combinators to the base asynchronous events comprising the asynchronous choose event as both combinators expect a single asynchronous event. This leads us to define sWrap and aWrap combinators for choose events in Figure 7.17. Abstractly, the sWrap and aWrap rules for choose events apply the sWrapped or aWrapped function to all of the input events of the choose event. Since only one of the events will be chosen, this provides the desired semantics while still allowing us to flatten choose events. 
Since guarded events have not yet been evaluated into events we cannot immediately construct a sChoose event out of them. Thus, the expression is moved into the guarded event context in much the same way as an aWrap or sWrap in rule (S C HOOSE G UARD). When the guarded event is synchronized upon, the expression will be evaluated into an event and then a complex sChoose event will be created. 149 7.4 Implementation We implemented asynchronous events in Multi-MLton. Our implementation closely follows the semantics given in Section 7.3. Allowing synchronous and asynchronous events to seamlessly co-exist involved implementing a unified framework for events, which is agnostic to the underlying channel, scheduler, and runtime implementation. The implementation of asynchronous events is thus composed of four parts: (i) the definition of base event values, (ii) the internal synchronization protocols for base events, (iii) synchronization protocols for choice, and (iv) the definition of various combinators. The implementation is roughly 4K lines of ML code. In order to provide a uniform environment for both synchronous and asynchronous events, we retain the channel structure, scheduler and runtime implementation of Parallel CML and have chosen to define an extended representation for base event values and a new synchronization protocol. In the implementation, asynchronous events are represented as a union type parametrized by two polymorphic types. The first component of the pair represents the post creation action along with the base synchronous action while the second component represents the post consumption action along with the base asynchronous event. 7.5 Case Study: A Parallel Web-server We briefly touch upon three aspects of Swerve’s design that were amenable to using asynchronous events, and show how these changes lead to substantial improvement in throughput and performance, over 4.7X. To better understand the utility of asynchronous events, we consider the interactions of four of Swerve’s modules: the Listener, the File Processor, the Network Processor, and the Timeout Manager. 7.5.1 Lock-step File and Network I/O Swerve was engineered assuming lock-step file and network I/O. While adequate when under low request loads, this design has poor scalability characteristics. This is because (a) 150 file descriptors, a bounded resource, can remain open for potentially long periods of time, as many different requests are multiplexed among a set of compute threads, and (b) for a given request, a file chunk is read only after the network processor has sent the previous chunk. Asynchronous events can be used to alleviate both bottlenecks. To solve the problem of lockstep transfer of file chunks, we might consider using simple asynchronous sends. However, Swerve was engineered such that the file processor was responsible for detecting timeouts. If a timeout occurs, the file processor sends a notification to the network processor on the same channel used to send file chunks. Therefore, if asynchrony was used to simply buffer the file chunks, a timeout would not be detected by the network processor until all the chunks were processed. Changing the communication structure to send timeout notifications on a separate channel would entail substantial structural modifications to the code base. The code shown below is a simplified version of the file processing module modified to use asynchronous events. 
It uses an arbitrator defined within the file processor to manage the file chunks produced by the fileReader. Now, the fileReader sends file chunks asynchronously to the arbitrator on the channel arIn (line 12). Note that each such asynchronous send acts as an arbitrator for the next asynchronous send. The arbitrator accepts file chunks from the fileReader on this channel and synchronously sends the file chunks to the consumer as long as a timeout has not been detected. This is accomplished by choosing between an abortEvt (used by the Timeout manager to signal a timeout) and receiving a chunk from file processing loop (lines 13-20). When a timeout is detected, an asynchronous message is sent on channel arOut to notify the file processing loop of this fact (line 9); subsequent file processing then stops. This loop synchronously chooses between accepting a timeout notification (line 17), or asynchronously processing the next chunk (lines 11 - 12). The arbitrator executes as a post-consumption action. datatype Xfr = TIMEOUT | DONE | X of chunk 1. fun fileReader name abortEvt consumer = 2. let 3. val (arIn, arOut) = (channel(), channel()) 151 4. fun arbitrator() = sync 5. 6. 7. 8. 9. 10. (choose [ wrap (recvEvt arIn, fn chunk => send (consumer, chunk)), wrap (abortEvt, fn () => (aSync(aSendEvt(arOut, ())); send(consumer, TIMEOUT)))]) 11. fun sendChunk(chunk) = 12. aSync(aWrap(aSendEvt(arIn, X(chunk)),arbitrator)) 13. fun loop strm = 14. case BinIO.read (strm, size) 15. of SOME chunk => sync 16. (choose [ 17. recvEvt arOut, 18. wrap(alwaysEvt, 19. fn () => (sendChunk(chunk); 20. loop strm))]) 21. | NONE => aSync(aSendEvt(arIn, DONE)) 22. val = aSync(aWrap(aRecvEvt(arIn), 23. fn chunk => send(consumer, chunk))) 24. in 25. case BinIO.openIt name of 26. 27. NONE => () | SOME strm => (loop strm; BinIO.closeIt strm) 28. end Since asynchronous events operate over regular CML channels, we were able to modify the file processor to utilize asynchrony without having to change any of the other modules or the communication patterns and protocols they expect. Being able to choose between synchronous and asynchronous events in the fileReader function also allowed us to create a buffer of file chunks, but stop file processing if a timeout was detected by 152 the arbitrator. Recall, each asynchronous send acts as an arbitrator for the next asynchronous send. 7.5.2 Underlying I/O and Logging To improve scalability and responsiveness, we also implemented a non-blocking I/O library composed of a language-level interface and associated runtime support. The library implements all MLton I/O interfaces, but internally utilizes asynchronous events. The library is structured around callback events as defined in Section 7.2.1 operating over I/O resource servers. Internally, all I/O requests are translated into a potential series of callback events. Web-servers utilize logging for administrative purposes. For long running servers, logs tend to grow quickly. Some web-servers (like Apache) solve this problem by using a rolling log, which automatically opens a new log file after a set time period (usually a day). In Swerve, all logging functions were done asynchronously. Using asynchronous events, we were able to easily change the logging infrastructure to use rolling logs. Because asynchronous events preserve ordering guarantees, log entries reflect actual thread action order. 
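One plausible shape for the rolling-log change is sketched below with hypothetical names (logChan for the log's input channel and rollIfExpired for the rotation check); this is an illustration of the idea rather than the code from Swerve. The rotation runs as a post-consumption action, so it executes only after the log server has taken the entry off the channel.

    fun logEntry (logChan, entry) =
      aSync (aWrap (aSendEvt (logChan, entry),
                    fn () => rollIfExpired ()))   (* close the old log and open a new one
                                                     once the time quantum has elapsed *)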
Post consumption actions were utilized to implement the rolling log functionality, by closing old logs and opening new logs after the appropriate time quantum. In addition, in Swerve the logging infrastructure is tasked with exiting the system if a fatal error is detected. The log notates that the occurrence of the error, flushes the log to disk, and then exits the system. This ensure that the log contains a record of the error prior to the system’s exit. Unfortunately, for the modules that utilize logging, this poses additional complexity and breaks modularity. Instead of logging the error at the point which it occurred, the error must be logged after the module has performed any clean up actions because of the synchronous communication protocol between the module and the log. Thus, if the module logs any actions during the clean up phase, they will appear in the log prior to the error. Instead, we can leverage our callback event to extend the module without changing the communication protocol to the log. 153 let val logEvt = aSendEvt(log, fatalErr) val logEvt’ = callbackEvt(logEvt, fn () => (Log.flush(); System.exit())) val exitEvt = aSync(logEvt’) in ( clean up; sync(exitEvt)) end In the code above logEvt corresponds to an event that encapsulates the communication protocol the log expects: a simple asynchronous send on the log’s input channel log. The event logEvt’ contains a callback. This event when synchronized will execute an asynchronous send to the log and then return a new event exitEvt. When exitEvt is synchronized upon, we are guaranteed that the log has received the notification of the fatal error. With this simplification we can also simplify the log by removing checks to see if a logged message corresponds to a fatal error and the exit mechanism; logging and system exit are now no longer conflated. 7.5.3 Performance Results To measure the efficiency of our changes in Swerve, we leveraged the server’s internal timing and profiling output for per-module accounting. The benchmarks were run on an AMD Opteron 865 server with 8 processors, each containing two symmetric cores, and 32 GB of total memory, with each CPU having its own local memory of 4 GB. The results as well as the changes to the largest modules are summarized in Table 7.1. Translating the implementation to use asynchronous events leads to a 4.7X performance improvement as well as a 15X reduction in client-side observed latency over the original, with only 103 lines of code changed out of 16KLOC. To put these numbers in perspective, our modified version of Swerve with asynchronous events has a throughput within 10% of Apache 2.2.15 on workloads that establish up to 1000 concurrent connections and process small/medium files at a total rate of 2000 requests per second. For server performance measurements and workload generation we used httperf – a tool for measuring web-server peformance. 154 Table 7.1 Per module performance numbers for Swerve. Module LOC LOC modified improvement Listener 1188 11 2.15 X File Processor 2519 35 19.23 X Network Processor 2456 25 24.8 X Timeout Manager 360 15 4.25 X 16,000 103 4.7 X Swerve 7.6 Related Work Many functional programming languages such as Erlang [1], JoCaml [79], and F# [83, 84] provide intrinsic support for asynchronous programming. In Erlang, message sends are inherently asynchronous. Unlike CML, in Erlang message are sent between processes and are the only way two processes can communicate. 
Processes may be located in the same VM or distributed among numerous VMs and physically distinct computers. Erlang at its core does not have mutable state. Instead updates are typically encoded by passing arguments to recursive functions that run as servers. Such servers, whether in Erlang or CML, can be considered “reactive”; they only executed whenever another thread, or process, wishes to communicate. We believe that the combinators and programming idioms presented in this chapter can be applied to language like Erlang. There are a number of languages that are derived from the Join Calculus [85] that provide some intrinsic support for join patterns (we discuss them later in this section). The Join Calculus is a process calculus aimed primarily at distributed and mobile programming, but is equally well suited for concurrent programming. The Join Calculus is structured around processes that communicate via messages over named channels. Messages are “consumed” or matched through join patterns. A join pattern guards an expression that is executed once the pattern is satisfied. The pattern itself specifies what messages and values it requires to 155 be satisfied and on what channels it expects the values. JoCaml is derived directly from the Join Calculus and is used for both distributed and concurrent programming. In JoCaml, complex asynchronous protocols are defined using join patterns [85, 86] that define synchronization protocols over asynchronous and synchronous channels. We can view a join pattern as defining a post-consumption action for a set of communication actions (those specified in the pattern itself). Notice, in this setting a given communication action may have multiple different post-consumption actions specified for it based on which pattern it participates in. In contrast, our combinators specify post-consumption actions regardless of which thread the communication action is paired with. Our formulation allows us to compose multiple post-consumption actions with a given event seamlessly. In F#, asynchronous behavior is defined using asynchronous work flows that permit asynchronous objects to be created and synchronized. Convenient monadic-style let! syntax permits callbacks (i.e., continuations) to be created within an asynchronous computation. The callback defines the post-computation function for an asynchronous operation. While these abstractions and paradigms provide expressive ways to define asynchronous computations, they do not provide a convenient mechanism to specify composable asynchronous abstractions, especially with respect to asynchronous post-consumption actions. It is the investigation of this important aspect of asynchronous programming, and its incorporation within a CML-style event framework that distinguishes the contributions of this chapter from these other efforts. Reactive programming [87] is an important programing style often found in systems programming that uses event loops to react to outside events (typically related to I/O). In this context, events do not define abstract communication protocols (as they do in CML), but typically represent I/O actions delivered asynchronously by the underlying operating system, sensors, or over the network. While understanding how reactive events and threads can co-exist is an important one, it is orthogonal to the focus of this work. Indeed we can encode reactive style programming idioms in ACML through the use of asynchronous receive events and/or lightweight servers. 
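As a small, hedged illustration of that last point (the names inputChan and handler are hypothetical, and this is not code from the implementation), a reactive handler can be written by re-arming an asynchronous receive from its own post-consumption action:

    fun react (inputChan, handler) =
      let
        fun arm () =
          aSync (aWrap (aRecvEvt (inputChan),
                        fn v => (handler v; arm ())))
      in
        arm ()
      end

Each message is handled in the implicit thread created when the receive is paired, and the handler re-registers itself once it has processed the message, so the thread that called react is never blocked.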
156 There have also been efforts to simplify event-driven (reactive) asynchronous programming in imperative languages [82] by providing new primitives that are amenable to compiler analysis. Instead of having programmers weave complex asynchronous protocols and reason about non-local control-flow, these approaches provide mechanism to specify reactions whenever certain conditions hold. This is accomplished through specialized nonblocking function calls, high-level coordination primitives that make thread interactions explicit, and a linearity obligations that couple one thread for each asynchronous function call. Although accomplished in a different context, we believe ACML is synergistic with such approaches as ACML provides a robust programming model for explicitly defining thread interactions. Other programming languages support different notions of events explicitly through asynchronous methods or similar constructs, we refer to these collectively as languages for event-based programming. Examples include EventJava [88], ECO [89], AmbientTalk [90], and JavaPS [91], or Actor-based languages and language extensions such as Erlang [1] or Scala Actors [92]. Event correlation is an important programming idiom that allows programmers to specify resulting actions based on a series of events. Most languages supporting event correlation, such as Polyphonic C# [93] (now integrated with Cω), JoinJava [94], or SCHOOL [95], and libraries such as for Erlang [96] or Scala [92] are based on the Join Calculus [85]. We can think of event correlation as a pattern that specifies a given action to compute based on a collection of active events. From an event-based programming perspective we can view CML [2] as a “staged” event matching system where the consumption of a first event is a pre-condition for subsequent matching. Namely, post-consumption actions, and any events they encapsulate, are executed after an event is satisfied, or paired. There have been incarnations of CML in languages and systems other than ML (e.g., Haskell [80, 81], Scheme [49], and MPI [97]) There has also been much recent interest in extending CML with transactional support [22, 43, 44] (discussed in Chapter 5) and other flavors of parallelism [3]. We believe transactional events [22, 43, 44] provide an interesting platform upon which to implement non-blocking versions of sChoose that retain 157 the same semantics. Additionally, we expect that previous work on specialization of CML primitives [77] can be applied to improve the performance of asynchronous primitives. 7.7 Concluding Remarks This chapter presents the design, rationale, and implementation for asynchronous events, a concurrency abstraction that generalizes the behavior of CML-based synchronous events to enable composable construction of asynchronous computations. Our experiments indicate that asynchronous events can seamlessly co-exist with other CML primitives, and can be effectively leveraged to improve performance of realistic highly-concurrent applications. 
spawn : (unit -> 'a) -> threadID
channel : unit -> 'a chan
sendEvt : 'a chan * 'a -> unit Event
recvEvt : 'a chan -> 'a Event
send : 'a chan * 'a -> unit
recv : 'a chan -> 'a
never : 'a Event
alwaysEvt : 'a -> 'a Event
sync : 'a Event -> 'a
wrap : 'a Event * ('a -> 'b) -> 'b Event
guard : (unit -> 'a Event) -> 'a Event
choose : 'a Event list -> 'a Event
aSendEvt : 'a chan * 'a -> (unit, unit) AEvent
aRecvEvt : 'a chan -> (unit, 'a) AEvent
aSend : 'a chan * 'a -> unit
aRecv : 'a chan -> unit
aAlwaysEvt : 'a -> (unit, 'a) AEvent
aNever : (unit, 'a) AEvent
aSync : ('a, 'b) AEvent -> 'a
sWrap : ('a, 'b) AEvent * ('a -> 'c) -> ('c, 'b) AEvent
aWrap : ('a, 'b) AEvent * ('b -> 'c) -> ('a, 'c) AEvent
aGuard : (unit -> ('a, 'b) AEvent) -> ('a, 'b) AEvent
aChoose : ('a, 'b) AEvent list -> ('a, 'b) AEvent
sChoose : ('a, 'b) AEvent list -> ('a, 'b) AEvent
aTrans : ('a, 'b) AEvent -> 'a Event
sTrans : 'a Event -> (unit, 'a) AEvent
callbackEvt : ('a, 'c) AEvent * ('c -> 'b) -> ('b Event, 'c) AEvent

Figure 7.6. CML Event and AEvent operators.

type 'a mbox
val mailbox : unit -> 'a mbox
val sameMailbox : 'a mbox * 'a mbox -> bool
val send : 'a mbox * 'a -> unit
val recv : 'a mbox -> 'a
val recvEvt : 'a mbox -> 'a event
val recvPoll : 'a mbox -> 'a option

Figure 7.7. CML mailbox structure.

fun sameMailbox(mailbox1, mailbox2) = C.sameChannel(mailbox1, mailbox2)
fun send(mailbox, value) = Channel.aSync(Channel.aSendEvt(mailbox, value))
fun sendEvt(mailbox, value) = Channel.aSendEvt(mailbox, value)
fun recv(mailbox) = Channel.sync(C.recvEvt(mailbox))
fun recvEvt(mailbox) = Channel.recvEvt(mailbox)
fun recvPoll(mailbox) = Channel.recvPoll(mailbox)

Figure 7.8. An excerpt of a CML mailbox structure implemented utilizing asynchronous events.

e ∈ Exp := v | x | p e | e e | {e, e'} | spawn e | sync e | ch()
        | sendEvt(e, e) | recvEvt(e) | aSendEvt(e, e) | aRecvEvt(e)
        | aWrap(e, e) | sWrap(e, e) | guard(e)
        | aChoose(e1, ..., en) | sChoose(e1, ..., en)
v ∈ Val := unit | c | m | λ x. e | ε[e]
p ∈ Prim := sendAct(c, v) | sendWait | recvAct(c) | recvWait
E := • | E e | v E | p E | sync E
   | sendEvt(E, e) | sendEvt(c, E) | aSendEvt(E, e) | aSendEvt(c, E)
   | recvEvt(E) | aRecvEvt(E)
   | aWrap(E, e) | sWrap(E, e) | aWrap(v, E) | sWrap(v, E) | guard(E)
   | aChoose(E, e, ...) | aChoose(v, E, ...) | sChoose(E, e, ...) | sChoose(v, E, ...)

Figure 7.9. Syntax, grammar, and evaluation contexts for a core language for asynchronous events.

m ∈ MessageId
c ∈ ChannelId
ω ∈ ChoiceId
ε[e], ε[e]g ∈ Event
A^m ∈ Action := Ar^m | As^m
Ar^m ∈ ReceiveAct := Rc^m
As^m ∈ SendAct := Sc,v^m
T ∈ Thread := (t, e)
T ∈ ThreadCollection := ∅ | T | T || T
∆ ∈ CommMap := MessageId → Val + ChoiceId + ⊥ + ⊤
C ∈ ChanMap := ChannelId → Action
⟨T⟩∆,C ∈ State := ⟨ThreadCollection, CommMap, ChanMap⟩
→ ∈ State → State
⇒ ∈ CommMap × (SendAct + (ReceiveAct × Val)) → CommMap
; ∈ Exp × ChanMap → MessageId × ChanMap

Figure 7.10. Domain equations for a core language for asynchronous events.
162 (A PP) h(t, E[(λx.e) v]) || Ti∆,C → h(t, E[e[v/x]]) || Ti∆,C (C HANNEL) c fresh h(t, E[ch()]) || Ti∆,C → h(t, E[c]) || Ti∆,C [c7→0] / (S PAWN) t0 f resh h(t, E[spawn e]) || Ti∆,C → h(t0 , e) || (t, E[unit]) || Ti∆,C (S END E VENT) h(t, E[sendEvt(c, v)]) || Ti∆,C → h(t, E[ε[{sendWait ◦ sendAct(c, v), λx.unit}]]) || Ti∆,C (AS END E VENT) h(t, E[aSendEvt(c, v)]) || Ti∆ → h(t, E[ε[{sendAct(c, v), sendWait}]]) || Ti∆,C (R ECV E VENT) h(t, E[recvEvt(c)]) || Ti∆,C → h(t, E[ε[{recvWait ◦ recvAct(c), λx.unit}]]) || Ti∆,C (AR ECV E VENT) h(t, E[aRecvEvt(c)]) || Ti∆,C → h(t, E[ε[{recvAct(c), recvWait}]]) || Ti∆,C Figure 7.11. A core language for asynchronous events – base events as well as rules for spawn, function application, and channel creation. 163 (S YNC E VENT) m f resh t0 f resh h(t, E[sync ε[{e, e0 }]]) || Ti∆,C → h(t, E[e m]) || (t0 , e0 m) || Ti∆[m7→⊥],C (M ESSAGE) ∆(m) =⊥ m ⇒ ∆[m 7→ unit] ∆, Sc,v ∆(m) =⊥ ∆, Rcm , v ⇒ ∆[m 7→ v] (M ESSAGE C HOICE) ∆0 = ∆[m0 7→ >] ∀ m0 : ∆(m0 ) = ω m ⇒ ∆0 [m 7→ unit] ∆[m 7→ ε], Sc,v ∆0 = ∆[m0 7→ >] ∀ m0 : ∆(m0 ) = ω ∆[m 7→ ε], Rcm , v ⇒ ∆0 [m 7→ v] (S END M ATCH) 0 C (c) = Rcm : Arm 0 m ⇒ ∆0 ∆0 , R m , v ⇒ ∆00 ∆, Sc,v c h(t, E[(sendAct(c, v)) m]) || Ti∆,C → h(t, E[m]) || Ti∆00 ,C [c7→A m ] r (R ECV M ATCH) 0 m : Am C (c) = Sc,v s 0 m ⇒ ∆0 ∆0 , R m , v ⇒ ∆00 ∆, Sc,v c h(t, E[(recvAct(c)) m]) || Ti∆,C → h(t, E[m]) || Ti∆00 ,C [c7→A m ] s Figure 7.12. A core language for asynchronous events – rules for matching communication and ordering. 164 (S ENDWAIT) ∆(m) = unit h(t, E[sendWait m]) || Ti∆,C → h(t, E[unit]) || Ti∆,C (R ECEIVE WAIT) ∆(m) = v h(t, E[recvWait m]) || Ti∆,C → h(t, E[v]) || Ti∆,C (S END B LOCK) (sendAct(c, v)) m, C ; m, C 0 h(t, E[(sendAct(c, v)) m]) || Ti∆,C → h(t, E[m]) || Ti∆,C 0 (R ECV B LOCK) (recvAct(c)) m, C ; m, C 0 h(t, E[(recvAct(c)) m]) || Ti∆,C → h(t, E[m]) || Ti∆,C 0 (E NQUEUE S END) m] C (c) = Asm C 0 = C [c 7→ Asm : Sc,v (sendAct(c, v)) m, C ; m, C 0 (E NQUEUE R ECEIVE) C (c) = Arm C 0 = C [c 7→ Arm : Rcm ] (recvAct(c)) m, C ; m, C 0 Figure 7.13. A core language for asynchronous events – rules for waiting, blocking, and enqueueing. 165 (S W RAP) h(t, E[sWrap(ε[{e, e0 }], λx.e00 )]) || Ti∆,C → h(t, E[ε[{λx.e00 ◦ e, e0 }]]) || Ti∆,C (AW RAP) h(t, E[aWrap(ε[{e, e0 }], λx.e00 )]) || Ti∆,C → h(t, E[ε[{e, λx.e00 ◦ e0 }]]) || Ti∆,C (G UARD) h(t, E[guard(λx.e)]) || Ti∆,C → h(t, E[ε[(λx.e) unit]g ]) || Ti∆,C (S YNC G UARDED E VENT) h(t, E[sync ε[e]g ]) || Ti∆,C → h(t, E[sync e]) || Ti∆,C (S W RAP G UARDED E VENT) h(t, E[sWrap(ε[e]g , λx.e0 )]) || Ti∆,C → h(t, E[ε[sWrap(e, λx.e0 )]g ]) || Ti∆,C (AW RAP G UARDED E VENT) h(t, E[aWrap(ε[e]g , λx.e0 )]) || Ti∆,C → h(t, E[ε[aWrap(e, λx.e0 )]g ]) || Ti∆,C Figure 7.14. A core language for asynchronous events – combinator extensions for asynchronous events. 166 (AC HOOSE E VENT) 0≤i≤n h(t, E[aChoose(ε[e1 ], ..., ε[en ])]) || Ti∆,C → h(t, E[ε[ei ]]) || Ti∆,C (S C HOOSE E VENT F LATTEN) ε[ei ] = ε[sChoose(ei1 , ..., eim )] 0 ≤ i ≤ n h(t, E[sChoose(ε[e1 ], ..., ε[en ])]) || Ti∆,C → h(t, E[sChoose(ε[e1 ], ..., ε[ei−1 ], ε[ei1 ], ..., ε[eim ], ε[ei+1 ], ..., ε[en ])]) || Ti∆,C (S C HOOSE E VENT) h(t, E[sChoose(ε[e1 ], ..., ε[en ])]) || Ti∆,C → h(t, E[ε[sChoose(e1 , ..., en )]]) || Ti∆,C Figure 7.15. A core language for asynchronous events – choose events and choose event flattening. 
167 (S YNC S C HOOSE) m1 , ..., mn f resh h(t, E[sync (ε[sChoose({e1 , e01 }, ..., {en , e0n })])]) || Ti∆,C → h(t, E[sChoose({e1 m1 , e01 m1 }, ..., {en mn , e0n mn })] ) || Ti∆[m1 7→⊥,...,mn 7→⊥],C (S C HOOSE) ei = {e m, e0 m} 0 ≤ i ≤ n t0 f resh h(t, E[e m]) || (t0 , e0 m) || Ti∆,C → h(t, E[e00 ]) || (t0 , e000 ) || Ti∆[m7→v],C h(t, E[sChoose(e1 , ..., en )]) || Ti∆,C → h(t, E[e00 ]) || (t0 , e000 ) || Ti∆[m7→v],C (S C HOOSE B LOCK S YNC) ω f resh ∆0 = ∆[m1 7→ ω, ..., mn 7→ ω] e1 , C ; m1 , C1 , ... en , Cn−1 ; mn , Cn h(t, E[sChoose({E1 [e1 ], e01 }, ..., {En [en ], e0n })]) || Ti∆,C → h(t, E[sChoose({E1 [m1 ], e01 }, ...., {En [mn ], e0n })]) || Ti∆0 ,Cn (S C HOOSE B LOCK A S YNC) ω f resh ∆0 = ∆[m1 7→ ω, ..., mn 7→ ω] e01 , C ; m1 , C1 , ... e0n , Cn−1 ; mn , Cn h(t, E[sChoose({e1 , E1 [e01 ]}, ..., {en , En [e0n ]})]) || Ti∆,C → h(t, E[sChoose({e1 , E1 [m1 ]}, ...., {en , En [mn ]})]) || Ti∆0 ,Cn (S C HOOSE UN B LOCK) ∆(mi ) = v 0 ≤ i ≤ n t0 f resh h(t, E[sChoose({E1 [m1 ], e1 }, ..., {En [mn ], en })]) || Ti∆,C → h(t, E[Ei [mi ]]) || (t0 , ei ) || Ti∆,C Figure 7.16. A core language for asynchronous events – synchronizing and evaluating S C HOOSE events. 168 (C HANNEL C LEAN) C (c) = A m : A m ∆(m) = > hTi∆,C → hTi∆,C [c7→A m ] c (S C HOOSE S W RAP) h(t, E[sWrap(ε[sChoose({e1 , e01 }, ..., {en , e0n })], λx.e00 )]) || Ti∆,C → h(t, E[ε[sChoose({(λx.e00 ) e1 , e01 }, ..., {(λx.e00 ) en , e0n })]]) || Ti∆,C (S C HOOSE AW RAP) h(t, E[aWrap(ε[sChoose({e1 , e01 }, ..., {en , e0n })], λx.e00 )]) || Ti∆,C → h(t, E[ε[sChoose({e1 , (λx.e00 ) e01 }, ..., {en , (λx.e00 ) e0n })]]) || Ti∆,C (S C HOOSE G UARD) h(t, E[sChoose(..., ε[en ]g , ...)]) || Ti∆,C → h(t, E[ε[sChoose(..., en , ...)]g ]) || Ti∆,C Figure 7.17. A core language for asynchronous events – combinators for S C HOOSE events. 169 8 CONCLUDING REMARKS AND FUTURE DIRECTIONS Message passing idioms are becoming more predominant, influencing languages, operating system, and hardware designs. In the context of CML we have presented abstractions for robust higher-order message-based communication. We have shown how each abstraction can be leveraged to implement new functionality, increase modularity, and improve performance. Stabilizers are a novel lightweight checkpointing mechanism that provides atomicity properties on the rollback of a stable region of code. Partial memoization is a function caching technique that allows the elision of subsequent function calls even in the presence of multiple communicating threads. Asynchronous CML is a language extension to seamlessly marry asynchronous and synchronous communication in the CML event framework, allowing for the modular creation, maintenance, and extension of asynchronous protocols through post-creation and post-consumption actions. 8.1 Future Directions In this section we present planned future directions. We split the discussion based on each language extension. 8.1.1 Stabilizers There are a number of different future directions we wish to pursue with stabilizers. The first is to explore their lightweight state restoration functionality to implement lightweight speculative constructs and language abstractions that provide deterministic parallelism, such as safe futures. Futures are a well known programming construct found in languages from Multilisp [98] to Java [99] (see Section 5.8 for a more in depth description and related work). Many recent language proposals [100–102] incorporate future-like constructs. 
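Since futures recur throughout the remainder of this discussion, the following is a minimal sketch of the classic (non-safe) encoding of a future in CML using spawn and a channel. The names future and touch are illustrative only and are not part of any library described in this dissertation.

(* A minimal sketch of a plain (non-safe) future encoded with CML primitives.
   Because CML channels are synchronous and each message is consumed once,
   this simple encoding supports only a single touch of each future. *)
fun future (f : unit -> 'a) : 'a chan =
  let
    val result = channel ()
  in
    spawn (fn () => send (result, f ()));    (* compute f () concurrently *)
    result
  end

(* touch blocks until the future's value has been produced. *)
fun touch (fut : 'a chan) : 'a = recv fut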
We would like to incorporate safe futures in the context of ACML and multiple threads of control. Unlike sequential formulations of safe futures [40], futures created from separate threads of control can interact. Reversion of a given future may necessitate the reversion of its communication partners when sequential semantics are violated. Stabilizers are a natural candidate for providing state reversion in the presence of multiple communicating threads. Additionally, the monitoring functionality inherent to stabilizers can be leveraged to perform checks for safety violations.

There are other characterisations of deterministic parallelism and speculation that are relevant to the proposed future work. Although not described in the context of safe futures, the work in [103] proposes a type and effect system that simplifies parallel programming by enforcing deterministic semantics. CoreDet is a C/C++ compiler that provides deterministic behavior for programs that leverage pthreads by tracking the ownership of memory locations by threads and leveraging a commit protocol to make changes to such locations visible to other threads of control. Grace [104] is a highly scalable runtime system that eliminates concurrency errors for programs with fork-join parallelism by enforcing a sequential commit protocol on threads, which run as processes. Boudol and Petri [105] provide a definition of valid speculative computations independent of any implementation technique.

Although our benchmarks indicate that stabilizers are a lightweight checkpointing mechanism, there are a number of optimizations we wish to pursue to limit the overheads of logging and re-execution. Our logging infrastructure would benefit from partial or incremental continuation grabbing to limit memory overheads. During a stabilize action many threads and computations may be reverted. However, only a small number of such computations may actually change during their subsequent re-execution. Identifying such sections of code could greatly reduce the cost of re-execution after a checkpoint is restored. Function memoization in a multi-threaded program with channel-based communication requires additional monitoring of thread interactions, as described in Chapter 6. Such information is already present within our communication graph and can be leveraged to assist function memoization.

The communication graphs generated by stabilizers have been leveraged for debugging purposes. Although this was not their intended use, such graphs, coupled with additional debugging information, have proved useful for understanding, debugging, and extending programs that utilize on the order of hundreds of thousands of threads. More importantly, we would like to extend stabilizers to provide deterministic replay of monitored programs. Deterministic replay of both concurrent and parallel programs has been the topic of much interest [106–108]. We envision constructing a scheduler that leverages the communication graph to make scheduling decisions, allowing a programmer to replay a buggy program or a given component. The communication graph captures all important and side-effecting operations, but does not provide a concrete schedule. Abstractly, we can think of the graph as a representation of a set of schedules that produce a given outcome. This simplifies the amount of information needed to successfully replay a program.
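To make this concrete, the following is a minimal, hypothetical sketch of how the nodes of such a communication graph might be represented for a replay scheduler. The type and constructor names below are illustrative only and do not correspond to the actual Multi-MLton data structures.

(* A hypothetical sketch of a communication-graph representation.
   Thread and channel identifiers are simple placeholders here. *)
type threadId = int
type channelId = int

datatype action =
    SPAWN of threadId             (* this thread created a new thread *)
  | SEND of channelId             (* a send that was paired on this channel *)
  | RECV of channelId             (* a receive that was paired on this channel *)

type node = {tid : threadId, act : action}
type graph = {nodes : node list, edges : (node * node) list}

(* Any total order on nodes that respects the edges is one member of the set
   of schedules the graph represents; a replay scheduler need only pick one. *)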
Similarly, we can leverage the information captured in the graph to reason about the equivalence of two different component implementations at the communication protocol level.

Transactional events [44] provide additional expressivity to CML, namely the ability to provide safe guard events and three-way rendezvous. Stabilizers can also be utilized to encode safe guards, by wrapping the complex event and unrolling whenever an error condition is encountered. However, stabilizers do not provide the ability to express three-way rendezvous in CML. Current formulations of transactional events rely on exhaustive state-space exploration to implement their transactional behavior. We are working on a prototype implementation of transactional events that leverages stabilizers and a rollback/retry mechanism instead of state-space exploration. In the common case, where only one path is viable in a transactional event, exploring the state space is wasted work. This happens because often only one event in a complex choose event is satisfiable. Thus, stabilizers allow the characterisation of a depth-first search strategy for finding successful communication patterns in a transactional event. If, however, a transactional event is not completed, stabilizers offer the ability to roll back and retry the event. Moreover, the stabilizer graph can be leveraged to guide future executions of the aborted transactional event.

While the use of cut can delimit the extent to which control is reverted as a result of a stabilize call, a more general and robust approach would be to integrate a rational compensation semantics [109] for stabilizers in the presence of stateful operations. To do just that, we provide additional extensions to stabilizers to support compensating actions. A compensating action is specified within a stable section; when that stable section is reverted, the compensating actions are executed in the order in which they were generated. Compensating actions can be utilized to revert I/O operations or to perform logging actions whenever a rollback occurs. In addition, we also provide a mechanism for a given thread to revert a given stable section. This allows a parent thread to revert poorly behaving child threads, allowing error handling to be localized in the parent. Such mechanisms are useful, and we would like to pursue them and others to develop coding patterns for the use of stabilizers.

8.1.2 Partial Memoization

Our current formulation of partial memoization treats functions independently. As we discussed above, combining partial memoization with stabilizers provides the ability for stabilizers to reduce redundant computation after a checkpoint is restored. However, memoization can leverage the information stored within the communication graph to provide memo information for computational units. For instance, if a producer and a consumer were successfully memoized, instead of discharging the constraints for the producer and the consumer separately, we can leverage the communication graph to discover that the two are tightly coupled. This would help avoid the discharge of constraints, eliminating equality checks and synchronization.

Selective memoization [56] is an important technique that provides a framework for specializing memoization requirements. Consider the function f given below:

fun f(x) = if x < 5 then 8 else 13

Notice that the resulting value of the function depends on its argument. However, the branch inside the function is only dependent on whether x is less than 5.
Thus, a call to f with a supplied argument of 1, when a previous memoized call to f was executed with an argument of 2, should be able to successfully leverage the memo information gathered at the first call and elide the second call. Selective memoization provides such functionality. We believe incorporating such work into our partial memoization strategy will improve memoization benefits. More specifically, we can leverage selective memoization to improve constraint generation for choose events. As an example, consider the code below:

fun g() = sync(choose([recvEvt(c), recvEvt(c')]))

The function g takes a unit argument and performs a selective communication, receiving either from channel c or c'. If, on the first call to g, the value 13 is received from channel c, a constraint is generated for that value and channel in a memo table. Consider a subsequent call to g where another thread is also willing to send 13, but on channel c'. Leveraging selective memoization augmented with CML hooks would allow the second call to g to be successfully elided.

Self-adjusting computation [110, 111] is an emerging field of study in programming languages. Self-adjusting computation provides the ability for a computation to adjust to small changes in input quickly, and has been utilized to solve and accelerate dynamic problems in computational geometry and statistical inference. At its core, self-adjusting computation relies on two key concepts: memoization, specifically selective memoization [56], and change propagation [112]. One aspect we would like to explore is leveraging our definition of partial memoization, and its ability to handle threading, synchronization, and communication, to provide self-adjusting computation in CML and, more importantly, self-adjusting computation on multi-core systems. To do this, we would need to integrate our memoization technique and provide a parallel, CML-aware version of change propagation. Change propagation relies heavily on dynamic dependence tracking through a dynamic dependence graph.

8.1.3 Asynchronous CML

One natural extension to ACML would be to add session types to help programmers maintain protocols, check for component equivalence, and even improve performance. Session types have been proposed as a way to precisely capture complex interactions between communicating parties [113, 114]. They describe interaction protocols by specifying the type of messages exchanged between the participants. Implicit control flow information such as branching and loops can also be enumerated using session types. Session types [113, 114] allow precise specification of typed distributed interaction, though we believe we can utilize them in much the same fashion in a multi-core setting. Neubauer and Thiemann first described the operational semantics for asynchronous session types [115]. Early formulations handled only interaction between two parties; this has been extended to multi-party interaction by Honda et al. [116] and Bonelli et al. [117]. The work of Bejleri and Yoshida [118] extends that of Honda et al. [116] for synchronous communication specification among multiple interacting peers. Session types have been applied to functional languages [119, 120], component-based systems [121], object-oriented languages [122, 123], and operating system services [124]. Asynchronous session types for Java have been studied [125], extending previous work involving the same authors [126]. Bi-party session types have been implemented in Java [123].
There has also been much recent interest in leveraging information present in session types to guide optimization of communication protocols [127]. Although this has thus far been targeted at reducing the overheads of round-trip times in a distributed setting, it does pose an interesting opportunity in a multi-core environment. Such optimizations rely on batching [128] or chaining [129] of messages between communicating parties. Although we do not envision being able to leverage chaining of communication, batching can be utilized to reduce contention and synchronization costs on channels. Communication protocols often require multiple message exchanges between communicating threads. Each communication requires synchronization on the channel that provides the conduit between the communicating threads. By amalgamating messages, we would be able to reduce contention and lock acquisition costs, and limit the size of channel queues.

LIST OF REFERENCES

[1] Joe Armstrong, Robert Virding, Claes Wikstrom, and Mike Williams. Concurrent Programming in Erlang. Prentice-Hall, 2nd edition, 1996.
[2] John H. Reppy. Concurrent Programming in ML. Cambridge University Press, New York, NY, USA, 2007.
[3] Matthew Fluet, Mike Rainey, John Reppy, and Adam Shaw. Implicitly-threaded parallelism in Manticore. In Proceedings of the 13th ACM SIGPLAN International Conference on Functional Programming, ICFP ’08, pages 119–130, New York, NY, USA, 2008. ACM.
[4] Guodong Li, Michael Delisi, Ganesh Gopalakrishnan, and Robert M. Kirby. Formal specification of the MPI-2.0 standard in TLA+. In Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP ’08, pages 283–284, New York, NY, USA, 2008. ACM.
[5] Richard Monson-Haefel and David Chappell. Java Message Service. O’Reilly & Associates, Inc., Sebastopol, CA, USA, 2000.
[6] Intel. Single-Chip Cloud Computer. http://techresearch.intel.com/articles/Tera-Scale/1826.htm, 2010.
[7] Andrew Baumann, Paul Barham, Pierre-Evariste Dagand, Tim Harris, Rebecca Isaacs, Simon Peter, Timothy Roscoe, Adrian Schüpbach, and Akhilesh Singhania. The multikernel: A new OS architecture for scalable multicore systems. In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles, SOSP ’09, pages 29–44, New York, NY, USA, 2009. ACM.
[8] MLton. http://www.mlton.org.
[9] John Reppy, Claudio V. Russo, and Yingqi Xiao. Parallel concurrent ML. In Proceedings of the 14th ACM SIGPLAN International Conference on Functional Programming, ICFP ’09, pages 257–268, New York, NY, USA, 2009. ACM.
[10] Robin Milner, Mads Tofte, and David Macqueen. The Definition of Standard ML. MIT Press, Cambridge, MA, USA, 1997.
[11] Lukasz Ziarek, Stephen Weeks, and Suresh Jagannathan. Flattening tuples in an SSA intermediate representation. Higher-Order and Symbolic Computation, 21:333–358, 2008.
[12] Martin Elsman. Program Modules, Separate Compilation, and Intermodule Optimization. PhD thesis, University of Copenhagen, 1999.
[13] Andrew Tolmach and Dino P. Oliva. From ML to Ada: Strongly-typed language interoperability via source translation. Journal of Functional Programming, 8(4):367–412, 1998.
[14] John C. Reynolds. Definitional interpreters for higher-order programming languages. In Proceedings of the 25th ACM National Conference, pages 717–740, Boston, Massachusetts, 1972. Reprinted in Higher-Order and Symbolic Computation, 11(4):363–397, 1998.
[15] Ron Cytron, Jeanne Ferrante, Barry K. Rosen, Mark N. Wegman, and F. Kenneth Zadeck.
Efficiently computing static single assignment form and the control dependence graph. ACM Transactions on Programing Languages and Systems, 13:451– 490, October 1991. [16] Olin Shivers. Control flow analysis in Scheme. In Proceedings of the ACM SIGPLAN 1988 Conference on Programming Language Design and Implementation, PLDI ’88, pages 164–174, New York, NY, USA, 1988. ACM. [17] David R. Butenhof. Programming with POSIX threads. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1997. [18] KC Sivaramakrishnan, Lukasz Ziarek, Raghavendra Prasad, and Suresh Jagannathan. Lightweight asynchrony using parasitic threads. In DAMP ’10: Proceedings of the 5th ACM SIGPLAN Workshop on Declarative Aspects of Multicore Programming, pages 63–72, New York, NY, USA, 2010. ACM. [19] George Candea, Shinichi Kawamoto, Yuichi Fujiki, Greg Friedman, and Armando Fox. Microreboot: A technique for cheap recovery. In Proceedings of the 6th Symposium on Opearting Systems Design & Implementation – Volume 6, pages 3–3, Berkeley, CA, USA, 2004. USENIX Association. [20] Tim Harris and Keir Fraser. Language support for lightweight transactions. In Proceedings of the 18th Annual ACM SIGPLAN Conference on Object-Oriented Programing, Systems, Languages, and Applications, OOPSLA ’03, pages 388–402, New York, NY, USA, 2003. ACM. [21] Maurice Herlihy, Victor Luchangco, Mark Moir, and William N. Scherer, III. Software transactional memory for dynamic-sized data structures. In Proceedings of the 22nd Annual Symposium on Principles of Distributed Computing, PODC ’03, pages 92–101, New York, NY, USA, 2003. ACM. [22] Kevin Donnelly and Matthew Fluet. Transactional events. Journal of Functional Programing, 18:649–706, September 2008. [23] Jeremy Manson, William Pugh, and Sarita V. Adve. The Java memory model. In Proceedings of the 32nd ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL ’05, pages 378–391, New York, NY, USA, 2005. ACM. [24] Jim Gray and Andreas Reuter. Transaction Processing. Morgan-Kaufmann, 1993. [25] Panos K. Chrysanthis and Krithi Ramamritham. ACTA: The SAGA continues. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1992. 178 [26] Mootaz Elnozahy, Lorenzo Alvisi, Yi-Min Wang, and David Johnson. A survey of rollback-recovery protocols in message-passing systems. ACM Computing Surveys, 34(3):375–408, 2002. [27] Mangesh Kasbekar and Chita Das. Selective checkpointing and rollbacks in multithreaded distributed systems. In Proceedings of the The 21st International Conference on Distributed Computing Systems, page 39, Washington, DC, USA, 2001. IEEE Computer Society. [28] Kester Li, Jeffrey Naughton, and James Plank. Real-time concurrent checkpoint for parallel programs. In Proceedings of the 2nd ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming, PPOPP ’90, pages 79–88, New York, NY, USA, 1990. ACM. [29] Asser N. Tantawi and Manfred Ruschitzka. Performance analysis of checkpointing strategies. ACM Transactions on Computer Systems, 2:123–144, May 1984. [30] Saurabh Agarwal, Rahul Garg, Meeta S. Gupta, and Jose E. Moreira. Adaptive incremental checkpointing for massively parallel systems. In ICS ’04: Proceedings of the 18th Annual International Conference on Supercomputing, pages 277–286, New York, NY, USA, 2004. ACM. [31] Greg Bronevetsky, Daniel Marques, Keshav Pingali, and Paul Stodghill. Automated application-level checkpointing of MPI programs. 
In PPoPP ’03: Proceedings of the 9th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 84–94, New York, NY, USA, 2003. ACM. [32] Greg Bronevetsky, Daniel Marques, Keshav Pingali, Peter Szwed, and Martin Schulz. Application-level checkpointing for shared memory programs. SIGARCH Computer Architecture News, 32(5):235–247, 2004. [33] Micah Beck, James S. Plank, and Gerry Kingsley. Compiler-assisted checkpointing. Technical report, University of Tennessee. Knoxville, TN, USA, 1994. [34] Yuqun Chen, James S. Plank, and Kai Li. Clip: A checkpointing tool for messagepassing parallel programs. In Supercomputing ’97: Proceedings of the 1997 ACM/IEEE Conference on Supercomputing (CDROM), pages 1–11, New York, NY, USA, 1997. ACM. [35] Alan Dearie and David Hulse. On page-based optimistic process checkpointing. In Proceedings of the 4th International Workshop on Object-Orientation in Operating Systems, IWOOOS ’95, page 24, Washington, DC, USA, 1995. IEEE Computer Society. [36] William R. Dieter and James E. Lumpp Jr. A user-level checkpointing library for POSIX threads programs. In FTCS ’99: Proceedings of the 29th Annual International Symposium on Fault-Tolerant Computing, page 224, Washington, DC, USA, 1999. IEEE Computer Society. [37] Atul Adya, Robert Gruber, Barbara Liskov, and Umesh Maheshwari. Efficient optimistic concurrency control using loosely synchronized clocks. In SIGMOD ’95: Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data, pages 23–34, New York, NY, USA, 1995. ACM. 179 [38] Hsiang-Tsung Kung and John Robinson. On optimistic methods for concurrency control. ACM Transactions on Database Systems, 6:213–226, June 1981. [39] Martin C. Rinard. Effective fine-grain synchronization for automatically parallelized programs using optimistic synchronization primitives. ACM Transactions on Computer Systems, 17:337–371, November 1999. [40] Adam Welc, Suresh Jagannathan, and Antony Hosking. Safe futures for Java. In Proceedings of the 20th Annual ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications, OOPSLA ’05, pages 439– 453, New York, NY, USA, 2005. ACM. [41] Tim Harris, Simon Marlow, Simon Peyton-Jones, and Maurice Herlihy. Composable memory transactions. In Proceedings of the 10th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. PPoPP ’05, pages 48–60, New York, NY, USA, 2005. ACM. [42] Michael F. Ringenburg and Dan Grossman. AtomCaml: First-class atomicity via rollback. In Proceedings of the 10th ACM SIGPLAN International Conference on Functional Programming, ICFP ’05, pages 92–104, New York, NY, USA, 2005. ACM. [43] Kevin Donnelly and Matthew Fluet. Transactional events. In Proceedings of the 11th ACM SIGPLAN International Conference on Functional Programming, ICFP ’06, pages 124–135, New York, NY, USA, 2006. ACM. [44] Laura Effinger-Dean, Matthew Kehrt, and Dan Grossman. Transactional events for ML. In Proceeding of the 13th ACM SIGPLAN International Conference on Functional Programming, ICFP ’08, pages 103–114, New York, NY, USA, 2008. ACM. [45] Andrew P. Tolmach and Andrew W. Appel. Debugging standard ML without reverse engineering. In Proceedings of the 1990 ACM Conference on LISP and Functional Programming, LFP ’90, pages 1–12, New York, NY, USA, 1990. ACM. [46] Andrew P. Tolmach and Andrew W. Appel. Debuggable concurrency extensions for standard ML. 
In Proceedings of the 1991 ACM/ONR Workshop on Parallel and Distributed Debugging, PADD ’91, pages 120–131, New York, NY, USA, 1991. ACM. [47] John Field and Carlos A. Varela. Transactors: A programming model for maintaining globally consistent distributed state in unreliable environments. In Proceedings of the 32nd ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL ’05, pages 195–208, New York, NY, USA, 2005. ACM. [48] Jan Christiansen and Frank Huch. Searching for deadlocks while debugging concurrent Haskell programs. In Proceedings of the 9th ACM SIGPLAN International Conference on Functional Programming. ICFP ’04, pages 28–39, New York, NY, USA, 2004. ACM. [49] Matthew Flatt and Robert Bruce Findler. Kill-Safe synchronization abstractions. In Proceedings of the ACM SIGPLAN 2004 Conference on Programming Language Design and Implementation, PLDI ’04, pages 47–58, New York, NY, USA, 2004. ACM. 180 [50] Armand Navabi, Xiangyu Zhang, and Suresh Jagannathan. Quasi-static scheduling for safe futures. In Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP ’08, pages 23–32, New York, NY, USA, 2008. ACM. [51] Lingli Zhang, Chandra Krintz, and Priya Nagpurkar. Supporting exception handling for futures in Java. In Proceedings of the 5th International Symposium on Principles and Practice of Programming in Java, PPPJ ’07, pages 175–184, New York, NY, USA, 2007. ACM. [52] C. Flanagan and M. Felleisen. The semantics of future and an application. Journal of Functional Programming, 9(1):1–31, January 1999. [53] Armand Navabi and Suresh Jagannathan. Exceptionally safe futures. In Proceedings of the 11th International Conference on Coordination Models and Languages, COORDINATION ’09, pages 47–65, Berlin, Heidelberg, 2009. Springer-Verlag. [54] Yanhong A. Liu and Tim Teitelbaum. Caching intermediate results for program improvement. In Proceedings of the 1995 ACM SIGPLAN Symposium on Partial Evaluation and Semantics-Based Program Manipulation, PEPM ’95, pages 190– 201, New York, NY, USA, 1995. ACM. [55] William Pugh and Tim Teitelbaum. Incremental computation via function caching. In Proceedings of the 16th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL ’89, pages 315–328, New York, NY, USA, 1989. ACM. [56] Umut A. Acar, Guy E. Blelloch, and Robert Harper. Selective memoization. In Proceedings of the 30th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL ’03, pages 14–25, New York, NY, USA, 2003. ACM. [57] Michael I. Gordon, William Thies, and Saman Amarasinghe. Exploiting coarsegrained task, data, and pipeline parallelism in stream programs. In Proceedings of the 12th International Conference on Architectural support for Programming Languages and Operating Systems, ASPLOS-XII, pages 151–162, New York, NY, USA, 2006. ACM. [58] Christopher J. F. Pickett and Clark Verbrugge. Software Thread Level Speculation for the Java Language and Virtual Machine Environment. In Proceedings of the International Workshop on Languages and Compilers for Parallel Computing, 2005. [59] Ali-Reza Adl-Tabatabai, Brian T. Lewis, Vijay Menon, Brian R. Murphy, Bratin Saha, and Tatiana Shpeisman. Compiler and runtime support for efficient software transactional memory. In Proceedings of the 2006 ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’06, pages 26–37, New York, NY, USA, 2006. ACM. [60] Rachid Guerraoui, Michal Kapalka, and Jan Vitek. 
STMBench7: A benchmark for software transactional memory. In Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007, EuroSys ’07, pages 315–324, New York, NY, USA, 2007. ACM. [61] Richard Matthew McCutchen and Samir Khuller. Streaming algorithms for k-center clustering with outliers and with anonymity. In Proceedings of the 11th International 181 Workshop, APPROX 2008, and 12th International Workshop, RANDOM 2008 on Approximation, Randomization and Combinatorial Optimization: Algorithms and Techniques, APPROX ’08 / RANDOM ’08, pages 165–178, Berlin, Heidelberg, 2008. Springer-Verlag. [62] Michael J. Carey, David J. DeWitt, and Jeffrey F. Naughton. The 007 benchmark. In Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, SIGMOD ’93, pages 12–21, New York, NY, USA, 1993. ACM. [63] Bratin Saha, Ali-Reza Adl-Tabatabai, Richard L. Hudson, Chi Cao Minh, and Benjamin Hertzberg. McRT-STM: A high performance software transactional memory system for a multi-core runtime. In Proceedings of the 11th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP ’06, pages 187–197, New York, NY, USA, 2006. ACM. [64] William Pugh. An improved replacement strategy for function caching. In Proceedings of the 1988 ACM Conference on LISP and Functional Programming, LFP ’88, pages 269–276, New York, NY, USA, 1988. ACM. [65] Allan Heydon, Roy Levin, and Yuan Yu. Caching function calls using precise dependencies. In Proceedings of the ACM SIGPLAN 2000 Conference on Programming Language Design and Implementation, PLDI ’00, pages 311–320, New York, NY, USA, 2000. ACM. [66] Kedar Swadi, Walid Taha, Oleg Kiselyov, and Emir Pasalic. A monadic approach for avoiding code duplication when staging memoized functions. In Proceedings of the 2006 ACM SIGPLAN Symposium on Partial Evaluation and Semantics-Based Program Manipulation, PEPM ’06, pages 160–169, New York, NY, USA, 2006. ACM. [67] Shaz Qadeer, Sriram K. Rajamani, and Jakob Rehof. Summarizing procedures in concurrent programs. In Proceedings of the 31st ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL ’04, pages 245–255, New York, NY, USA, 2004. ACM. [68] Eric Koskinen, Matthew Parkinson, and Maurice Herlihy. Coarse-grained transactions. In Proceedings of the 37th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL ’10, pages 19–30, New York, NY, USA, 2010. ACM. [69] Dave Dice, Yossi Lev, Mark Moir, and Daniel Nussbaum. Early experience with a commercial hardware transactional memory implementation. In Proceeding of the 14th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS ’09, pages 157–168, New York, NY, USA, 2009. ACM. [70] Ali-Reza Adl-Tabatabai Vijay Menon Tatiana Shpeisman Lukasz Ziarek, Adam Welc and Suresh Jagannathan. A uniform transactional execution environment for Java. In Proceedings of the 22nd European Conference on Object-Oriented Programming, ECOOP ’08, pages 129–154, Berlin, Heidelberg, 2008. SpringerVerlag. [71] Chi Cao Minh, Martin Trautmann, JaeWoong Chung, Austen McDonald, Nathan Bronson, Jared Casper, Christos Kozyrakis, and Kunle Olukotun. An effective hybrid transactional memory system with strong isolation guarantees. In Proceedings 182 of the 34th Annual International Symposium on Computer Architecture, ISCA ’07, pages 69–80, New York, NY, USA, 2007. ACM. [72] Arrvindh Shriraman, Michael F. Spear, Hemayet Hossain, Virendra J. 
Marathe, Sandhya Dwarkadas, and Michael L. Scott. An integrated hardware-software approach to flexible transactional memory. In Proceedings of the 34th Annual International Symposium on Computer Architecture, ISCA ’07, pages 104–115, New York, NY, USA, 2007. ACM. [73] Umut A. Acar, Amal Ahmed, and Blu Matthias. Imperative self-adjusting computation. In Proceedings of the 35th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL ’08, pages 309–322, New York, NY, USA, 2008. ACM. [74] Ruy Ley-Wild, Matthew Fluet, and Umut A. Acar. Compiling self-adjusting programs with continuations. In Proceeding of the 13th ACM SIGPLAN International Conference on Functional Programming, ICFP ’08, pages 321–334, New York, NY, USA, 2008. ACM. [75] Umut A. Acar, Guy E. Blelloch, Blu Matthias, and Kanat Tangwongsan. An experimental analysis of self-adjusting computation. In Proceedings of the 2006 ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’06, pages 96–107, New York, NY, USA, 2006. ACM. [76] Matthew Hammer, Umut A. Acar, Mohan Rajagopalan, and Anwar Ghuloum. A proposal for parallel self-adjusting computation. In Proceedings of the 2007 Workshop on Declarative Aspects of Multicore Programming, DAMP ’07, pages 3–9, New York, NY, USA, 2007. ACM. [77] John Reppy and Yingqi Xiao. Specialization of cml message-passing primitives. In Proceedings of the 34th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL ’s07, pages 315–326, New York, NY, USA, 2007. ACM. [78] Lukasz Ziarek, Philip Schatz, and Suresh Jagannathan. Stabilizers: A modular checkpointing abstraction for concurrent functional programs. In ICFP ’06: Proceedings of the 11th ACM SIGPLAN International Conference on Functional Programming, pages 136–147, New York, NY, USA, 2006. ACM. [79] Cédric Fournet, Fabrice Le Fessant, Luc Maranget, and Alan Schmidt. JoCaml: A Language for Concurrent Distributed and Mobile Programming. In Advanced Functional Programming, pages 129–158. Springer-Verlag, 2002. [80] Avik Chaudhuri. A concurrent ML library in concurrent Haskell. In Proceedings of the 14th ACM SIGPLAN International Conference on Functional Programming, ICFP ’09, pages 269–280, New York, NY, USA, 2009. ACM. [81] George Russell. Events in Haskell, and how to implement them. In Proceedings of the 6th ACM SIGPLAN International Conference on Functional Programming, ICFP ’01, pages 157–168, New York, NY, USA, 2001. ACM. [82] Prakash Chandrasekaran, Christopher L. Conway, Joseph M. Joy, and Sriram K. Rajamani. Programming asynchronous layers with clarity. In Proceedings of the the 6th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on The Foundations of Software Engineering, ESEC-FSE ’07, pages 65–74, New York, NY, USA, 2007. ACM. 183 [83] Don Syme, Adam Granicz, and Antonio Cisternino. Expert F#. Apress, 2007. [84] Robert Pickering. Foundations of F#. Apress, 2007. [85] Cédric Fournet and Georges Gonthier. The reflexive CHAM and the join-calculus. In Proceedings of the 23rd ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL ’96, pages 372–385, New York, NY, USA, 1996. ACM. [86] Jean-Pierre Banâtre and Daniel Le Métayer. Programming by multiset transformation. Communications of the ACM, 36:98–111, January 1993. [87] Peng Li and Steve Zdancewic. Combining events and threads for scalable network services implementation and evaluation of monadic, application-level concurrency primitives. 
In Proceedings of the 2007 ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’07, pages 189–199, New York, NY, USA, 2007. ACM. [88] Patrick Eugster and K. R. Jayaram. EventJava: An extension of Java for event correlation. In Proceedings of the 23rd European Conference on Object-Oriented Programming, ECOOP ’09, pages 570–594, Berlin, Heidelberg, 2009. Springer-Verlag. [89] Mads Haahr, René Meier, Paddy Nixon, Vinny Cahill, and Eric Jul. Filtering and scalability in the ECO distributed event model. In Proceedings of the International Symposium on Software Engineering for Parallel and Distributed Systems, page 83, Washington, DC, USA, 2000. IEEE Computer Society. [90] Jessie Dedecker, Tom Van Cutsem, Stijn Mostinckx, Theo D’Hondt, and Wolfgang De Meuter. Ambient-oriented programming. In Companion to the 20th Annual ACM SIGPLAN Conference on Object-oriented Programming, Systems, Languages, and Applications, OOPSLA ’05, pages 31–40, New York, NY, USA, 2005. ACM. [91] Patrick Eugster. Type-based publish/subscribe: Concepts and experiences. ACM Transactions on Programing Languages and Systems, 29, January 2007. [92] Philipp Haller and Tom Van Cutsem. Implementing joins using extensible pattern matching. In Coordination Models and Languages, volume 5052 of Lecture Notes in Computer Science, pages 135–152. Springer Berlin / Heidelberg, 2008. [93] Nick Benton, Luca Cardelli, and Cédric Fournet. Modern Concurrency Abstractions for C#. ACM Transactions on Programing Languages and Systems, 26(5):769–804, 2004. [94] G Stewart Itzstein and David Kearney. Applications of Join Java. In Proceedings of the 7th Asia-Pacific Conference on Computer Systems Architecture, CRPIT ’02, pages 37–46, Darlinghurst, Australia, 2002. Australian Computer Society, Inc. [95] Alex Buckley Sophia Drossopoulou, Alexis Petrounias and Susan Eisenbach. School: a small chorded object-oriented language. Electronic Notes on Theoretical Computer Science, 135:37–47, March 2006. [96] Huber Plociniczak and Susan Eisenbach. JErlang: Erlang with Joins. In 12th International Conference on Coordination Models and Languages, COORDINATION ’10, pages 61–75, June 2010. 184 [97] Erik Demaine. First class communication in MPI. In Proceedings of the Second MPI Developers Conference, page 189, Washington, DC, USA, 1996. IEEE Computer Society. [98] Robert Halstead. Multilisp: A language for concurrent symbolic computation. ACM Transactions on Programing Languages Systems, 7(4):501–538, October 1985. [99] http://java.sun.com/j2se/1.5.0/docs/guide/concurrency. [100] Joseph Hallett Victor Luchangco Jan-Willem Maessen Sukyoung Ryu Guy Steele Eric Allan, David Chase and Sam Tobin-Hochstadt. The Fortress language specification version 1.0. Technical report, Sun Microsystems, Inc., May 2008. [101] Philippe Charles, Christian Grothoff, Vijay Saraswat, Christopher Donawa, Allan Kielstra, Kemal Ebcioglu, Christoph von Praun, and Vivek Sarkar. X10: An objectoriented approach to non-uniform cluster computing. In Proceedings of the 20th Annual ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications, OOPSLA ’05, pages 519–538, New York, NY, USA, 2005. ACM. [102] Barbara Liskov and Liuba Shrira. Promises: Linguistic support for efficient asynchronous procedure calls in distributed systems. In Proceedings of the ACM SIGPLAN 1988 Conference on Programming Language Design and Implementation, PLDI ’88, pages 260–267, New York, NY, USA, 1988. ACM. [103] Robert L. Bocchino, Jr., Vikram S. 
Adve, Danny Dig, Sarita V. Adve, Stephen Heumann, Rakesh Komuravelli, Jeffrey Overbey, Patrick Simmons, Hyojin Sung, and Mohsen Vakilian. A type and effect system for deterministic parallel Java. In Proceeding of the 24th ACM SIGPLAN Conference on Object-Oriented Programming Systems, Languages, and Applications, OOPSLA ’09, pages 97–116, New York, NY, USA, 2009. ACM. [104] Emery D. Berger, Ting Yang, Tongping Liu, and Gene Novark. Grace: Safe multithreaded programming for C/C++. In Proceeding of the 24th ACM SIGPLAN Conference on Object Oriented Programming Systems, Languages, and Applications, OOPSLA ’09, pages 81–96, New York, NY, USA, 2009. ACM. [105] Gerard Boudol and Gustavo Petri. A theory of speculative computation. In Programming Languages and Systems, Lecture Notes in Computer Science, pages 165–184, Berlin / Heidelberg, 2010. Springer-Verlag. [106] Mark Russinovich and Bryce Cogswell. Replay for concurrent non-deterministic shared-memory applications. In Proceedings of the ACM SIGPLAN 1996 Conference on Programming Language Design and Implementation, PLDI ’96, pages 258– 266, New York, NY, USA, 1996. ACM. [107] Harish Patil, Cristiano Pereira, Mack Stallcup, Gregory Lueck, and James Cownie. PinPlay: A framework for deterministic replay and reproducible analysis of parallel programs. In Proceedings of the 8th Annual IEEE/ACM International Symposium on Code Generation and Optimization, CGO ’10, pages 2–11, New York, NY, USA, 2010. ACM. [108] Pablo Montesinos, Matthew Hicks, Samuel T. King, and Josep Torrellas. Capo: A software-hardware interface for practical deterministic multiprocessor replay. In 185 Proceeding of the 14th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS ’09, pages 73–84, New York, NY, USA, 2009. ACM. [109] Roberto Bruni, Hernán Melgratti, and Ugo Montanari. Theoretical foundations for compensations in flow composition languages. In Proceedings of the 32nd ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL ’05, pages 209–220, New York, NY, USA, 2005. ACM. [110] Umut A. Acar. Self-adjusting computation. PhD thesis, Pittsburgh, PA, USA, 2005. Co-Chair-Guy Blelloch and Co-Chair-Robert Harper. [111] Umut A. Acar, Guy E. Blelloch, Blu Matthias, Robert Harper, and Kanat Tangwongsan. An experimental analysis of self-adjusting computation. ACM Transactions on Programing Languages and Systems, 32:3:1–3:53, November 2009. [112] Umut A. Acar, Guy E. Blelloch, and Robert Harper. Adaptive functional programming. In Proceedings of the 29th ACM SIGPLAN-SIGACT Symposium on Principles of Pogramming Languages, POPL ’02, pages 247–259, New York, NY, USA, 2002. ACM. [113] Kaku Takeuchi, Kohei Honda, and Makoto Kubo. An interaction-based language and its typing system. In PARLE’94, volume 817 of LNCS, pages 398–413. SpringerVerlag, 1994. [114] Kohei Honda, Vasco Thudichum Vasconcelos, and Makoto Kubo. Language primitives and type discipline for structured communication-based programming. In Proceedings of the 7th European Symposium on Programming: Programming Languages and Systems, pages 122–138, London, UK, 1998. Springer-Verlag. [115] Matthias Neubauer and Peter Thiemann. An implementation of session types. In Practical Aspects of Declarative Languages, volume 3057 of Lecture Notes in Computer Science, pages 56–70. Springer Berlin / Heidelberg, 2004. [116] Kohei Honda, Nobuko Yoshida, and Marco Carbone. Multiparty asynchronous session types. 
In Proceedings of the 35th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL ’08, pages 273–284, New York, NY, USA, 2008. ACM. [117] Eduardo Bonelli and Adriana Compagnoni. Multisession session types for a distributed calculus. In Proceedings Trustworthy Global Computing: TGC ’07, LNCS, pages 38–57. Springer-Verlag, 2007. [118] Andi Bejleri and Nobuko Yoshida. Synchronous multiparty session types. Electronic Notes on Theoretical Computer Science, 241:3–33, 2009. [119] Simon Gay, Vasco Vasconcelos, and Antonio Ravara. Session types for inter-process communication. Technical report, University of Glasgow, 2003. [120] Riccardo Pucella and Jesse A. Tov. Haskell session types with (almost) no class. In Haskell ’08: Proceedings of the 1st ACM SIGPLAN Symposium on Haskell, pages 25–36, New York, NY, USA, 2008. ACM. [121] Antonio Vallecillo, Vasco T. Vasconcelos, and António Ravara. Typing the behavior of software components using session types. Fundamenta Informaticae, 73(4):583– 598, 2006. 186 [122] Sara Capecchi, Mario Coppo, Mariangiola Dezani-Ciancaglini, Sophia Drossopoulou, and Elena Giachino. Amalgamating sessions and methods in object-oriented languages with generics. Theoretical Computer Science, 410(23):142–167, 2009. [123] Raymond Hu, Nobuko Yoshida, and Kohei Honda. Session-based distributed programming in Java. In Proceedings of the 22nd European Conference on ObjectOriented Programming, ECOOP ’08, pages 516–541, Berlin, Heidelberg, 2008. Springer-Verlag. [124] Manuel Fähndrich, Mark Aiken, Chris Hawblitzel, Orion Hodson, Galen Hunt, James R. Larus, and Steven Levi. Language support for fast and reliable messagebased communication in singularity OS. In EuroSys ’06: Proceedings of the 1st ACM SIGOPS/EuroSys European Conference on Computer Systems 2006, pages 177–190, New York, NY, USA, 2006. ACM. [125] Mario Coppo, Mariangiola Dezani-Ciancaglini, and Nobuko Yoshida. Asynchronous session types and progress for object oriented languages. In Proceedings of the 9th IFIP WG 6.1 International Conference on Formal Methods for Open ObjectBased Distributed Systems, FMOODS’07, pages 1–31, Berlin / Heidelberg, 2007. Springer-Verlag. [126] Mariangiola Dezani-Ciancaglini, Dimitris Mostrous, Nobuko Yoshida, and Sophia Drossopoulou. Session types for object-oriented languages. In Proceedings of the 20th European Conference on Object-Oriented Programming, pages 328–352. Springer-Verlag, 2006. [127] K.C. Sivaramakrishnan, Karthik Nagaraj, Lukasz Ziarek, and Patrick Eugster. Efficient session type guided distributed interaction. In Coordination Models and Languages, volume 6116 of Lecture Notes in Computer Science, pages 152–167. Springer-Verlag, Berlin, Heidelberg, 2010. [128] Kwok Cheung Yeung and Paul H. J. Kelly. Optimising Java RMI programs by communication restructuring. In Middleware ’03: Proceedings of the ACM/IFIP/USENIX 2003 International Conference on Middleware, pages 324–343, New York, NY, USA, 2003. Springer-Verlag New York, Inc. [129] Yee Jiun Song, Marcos K Aguilera, Ramakrishna Kotla, and Dahlia Malkhi. RPC chains: Efficient client-server communication in geodistributed systems. In 6th USENIX Symposium on Networked Systems Design and Implementation (NSDI 2009), pages 17–30, 2009. VITA 187 VITA Lukasz S. Ziarek was born on September 17th 1982 in Warszawa, Poland. He moved to the United States in 1985. He attended St. Joseph High School in South Bend, Indiana. 
He received his bachelor's degree in computer science from the University of Chicago in December 2003. He began graduate school at Purdue University in January 2004. He married his wife, Kayela, on July 17th, 2010. He received his Ph.D. in computer science in May 2011.