Programming models for the Barrelfish multi-kernel operating system

Tim Harris

Based on joint work with Martín Abadi, Andrew Baumann, Paul Barham, Richard Black, Tim Harris, Orion Hodson, Rebecca Isaacs, Ross McIlroy, Simon Peter, Vijayan Prabhakaran, Timothy Roscoe, Adrian Schüpbach, Akhilesh Singhania

Cache-coherent multicore

(Diagram: four processor packages, each with six cores, a per-core L2 cache, a shared per-package L3 cache, and locally attached RAM.)

• AMD Istanbul: 6 cores, per-core L2, per-package L3

Single-chip cloud computer (SCC)

(Diagram: a grid of tiles, each holding two cores with L2 caches and a message-passing buffer (MPB), connected by routers to an on-chip mesh, with four memory controllers, a voltage regulator controller (VRC) and a system interface.)

• 24 * 2-core tiles
• On-chip mesh n/w
• Non-coherent caches
• Hardware supported messaging

MSR Beehive

(Diagram: RISC-N cores attached as modules to a ring interconnect carrying messages and locks, with a MemMux module connecting the ring to the DDR controller and the display controller over a pipelined bus to all cores.)

• Ring interconnect
• Message passing in h/w
• No cache coherence
• Split-phase memory access

IBM Cell

(Diagram: eight SXU cores, each with a local store (LS) and memory flow controller (MFC), alongside a PXU core with L1/L2 caches, sharing the memory and I/O interfaces.)

• 64-bit Power (PXU) and SIMD (SXU) core types

Messaging vs shared data as default

(Spectrum, from traditional operating systems at one end to the Barrelfish multikernel at the other: shared state with one big lock; fine-grained locking; clustered objects and partitioning; distributed state with replica maintenance.)

• Fundamental model is message based
• “It’s better to have shared memory and not need it than to need shared memory and not have it”

The Barrelfish multi-kernel OS

(Diagram: one application and one OS node, holding a state replica, per core; the cores (x64, x64, ARM, and an accelerator core) communicate only by message passing over the hardware interconnect.)

• System runs on heterogeneous hardware, currently supporting ARM, Beehive, SCC, x86 & x64
• Focus of this talk: system components, each local to a specific core, communicating using message passing
• User-mode programs: several models supported, including conventional shared-memory OpenMP & pthreads
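To make the replica-maintenance model concrete, the sketch below shows the shape of an update to replicated OS state: rather than locking a shared structure, the updating core sends a message to every OS node and each node applies the change to its own local replica. This is purely illustrative; update_msg_t, os_node_send and os_node_await_ack are invented names for this sketch, not Barrelfish APIs.

    /* Hypothetical message type and transport; not part of Barrelfish. */
    typedef struct { int key; int value; } update_msg_t;
    extern void os_node_send(int core, const update_msg_t *m);   /* assumed */
    extern void os_node_await_ack(int core);                     /* assumed */

    /* Update every per-core replica by message passing, with no shared locks. */
    static void update_replicas(int ncores, int key, int value) {
      update_msg_t m = { key, value };
      for (int c = 0; c < ncores; c++) {
        os_node_send(c, &m);          /* one update message per OS node */
      }
      for (int c = 0; c < ncores; c++) {
        os_node_await_ack(c);         /* wait until each replica has applied it */
      }
    }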
The Barrelfish messaging stack

Layers, from top to bottom, with the interface each exposes:
• High-level language runtime system: threads, join patterns, buffering, ...
• AC language extensions for C: simple synchronous send/receive interface, support for concurrency
• Event-based programming model: portable IDL, marshalling to/from C structs, events on send/receive possible
• H/W specific interconnect driver(s): h/w specific interface covering max message sizes, guarantees, flow control, ...

Outline

• Multi-core hardware & Barrelfish
• Event-based model
• Synchronous model
• Supporting concurrency
• Cancellation
• Performance

Event-based interface

    interface Echo {
      rpc ping(in int32 arg, out int32 result);
    }

A stub compiler turns this interface definition into a common event-based programming interface, implemented by interconnect-specific stubs: LMP (same core), UMP (shared memory), BMP (Beehive), and SCC.

    // Try to send => OK or ERR_TX_BUSY
    errval_t Echo_send_ping_call(Echo_binding *b, int arg);

    // Register callback when send worth re-trying
    errval_t Echo_register_send(Echo_binding *b, Echo_can_send_cb *cb);
    typedef void Echo_can_send_cb(Echo_binding *b);

    // Register callback on incoming “ping” response
    errval_t Echo_register_recv_ping_resp(Echo_binding *b, Echo_recv_ping_cb *cb);
    typedef void Echo_recv_ping_cb(Echo_binding *b, int result);

    // Wait for next callback to be ready, then execute it
    errval_t event_dispatch(void);

Client code using the event-based interface:

    Echo_binding *b;
    bool done = false;

    static void response_handler(Echo_binding *b, int val) {
      printf("Got response %d\n", val);
      done = true;
    }

    static void resend_handler(Echo_binding *b) {
      errval_t err = Echo_send_ping_call(b, *(int*)b->st);
      if (err == ERR_TX_BUSY) {
        err = Echo_register_send(b, &resend_handler);
        assert(err == ERR_OK);
      } else {
        free(b->st);
      }
    }

    ...
    Echo_register_recv_ping_resp(b, &response_handler);
    ...
    errval_t err = Echo_send_ping_call(b, 10);
    if (err == ERR_TX_BUSY) {
      b->st = malloc(sizeof(int));
      *(int*)b->st = 10;
      err = Echo_register_send(b, &resend_handler);
      assert(err == ERR_OK);
    }
    while (!done) { event_dispatch(); }
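The server side is symmetric: it registers a handler for incoming ping calls and sends back the response, again coping with ERR_TX_BUSY. The sketch below is hedged: the stub names Echo_register_recv_ping_call and Echo_send_ping_resp are guessed by analogy with the client-side stubs above, not taken from the talk.

    // Sketch only: the call-handler and response-send stub names are
    // assumed by analogy with the generated client-side Echo stubs.
    static void ping_call_handler(Echo_binding *b, int arg) {
      errval_t err = Echo_send_ping_resp(b, arg + 1);   // reply to the call
      assert(err == ERR_OK);  // a full server would re-register on ERR_TX_BUSY
    }

    static void serve(Echo_binding *b) {
      Echo_register_recv_ping_call(b, &ping_call_handler);
      for (;;) { event_dispatch(); }   // run handlers as events arrive
    }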
Why do it this way?

• Overlap computation and communication
  – Non-blocking send/receive operations allow the caller to continue with other work
• Remain responsive to multiple clients
  – Don’t end up “stuck”, blocked on a receive from one client while another client is ready for service
• Lightweight runtime system
  – No threads, no GC, etc.
  – Support on diverse hardware
  – Use within the implementation of runtime systems for higher-level languages

Goals for the synchronous model

• Cleaner programming model
• Integration in C
  – Resource consumption can be anticipated
• Low overhead over the underlying messaging primitives
  – Don’t want to harm speed
  – Don’t want to harm flexibility (e.g., the ability to compute while waiting for responses)
• Focus on concurrency between communicating processes
  – Everything runs in a single thread (unless the code says otherwise)
  – Execution is deterministic (modulo the timing and content of external inputs)

Synchronous message-passing interface

The same stub compiler generates synchronous message-passing stubs, layered on top of the common event-based programming interface and the interconnect-specific stubs:

    interface Echo {
      rpc ping(in int32 arg, out int32 result);
    }

    // Send "ping", block until complete
    void Echo_tx_ping(Echo_binding *b, int arg);

    // Wait for and receive response to "ping"
    void Echo_rx_ping(Echo_binding *b, int *result);

    // RPC send-receive pair
    void Echo_ping(Echo_binding *b, int arg, int *result);

Channel abstraction

• Pair of uni-directional channels between client-side and server-side, of bounded but unknown capacity
• Send is synchronous between the sender and their channel endpoint
• FIFO, lossless transmission, with unknown delay
• Receive is synchronous between receiver and head of channel
• Only whole-channel failures, e.g. if the other party exits
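Because a send is synchronous only with the sender’s own channel endpoint, Echo_tx_ping returns once the message is in the channel, so the split tx/rx pair can overlap local work with an in-flight request. A small sketch of this, using only the stubs above; compute_locally() is a hypothetical stand-in for application work:

    static int ping_with_overlap(Echo_binding *b, int arg) {
      int result;
      Echo_tx_ping(b, arg);      // returns once the message is in the channel
      compute_locally();         // hypothetical work, overlapped with the RPC
      Echo_rx_ping(b, &result);  // block until the response reaches us
      return result;
    }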
Two back-to-back RPCs

    static int total_rpc(Echo_binding *b1, Echo_binding *b2, int arg) {
      int result1, result2;
      Echo_ping(b1, arg, &result1);
      Echo_ping(b2, arg, &result2);
      return result1 + result2;
    }

• This looks cleaner but:
  – We’ve lost the ability to contact multiple servers concurrently
  – We’ve lost the ability to overlap computation with waiting

Adding asynchrony: async, do..finish

    static int total_rpc(Echo_binding *b1, Echo_binding *b2, int arg) {
      int result1, result2;
      do {
        async Echo_ping(b1, arg, &result1);
        async Echo_ping(b2, arg, &result2);
      } finish;
      return result1 + result2;
    }

• If the async code blocks, then resume after the async
• Wait until all the async work (dynamically) in the do..finish has completed

Example: same-core L4-style RPC

Running total_rpc when client and servers share a core:
1. Execute up to the first “ping” call as normal
2. The message send transitions directly to the recipient process
3. The response is returned to the caller
4. The caller proceeds through the second call in the same way

Example: cross-core RPC

Running total_rpc when the servers are on other cores:
1. The first call sends its message; this core is now idle
2. Resume after the async; send the second message
3. Resume after the async; block at the finish until all the work is done
4. Continue once both responses have been received
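The resume-after-async behaviour also lets a caller overlap its own computation with an outstanding RPC, not just issue several RPCs concurrently. A minimal sketch under the semantics just described; local_work() is a hypothetical application function:

    static int rpc_and_compute(Echo_binding *b, int arg) {
      int result;
      do {
        async Echo_ping(b, arg, &result);  // if this blocks, resume below
        local_work();                      // runs while the RPC is outstanding
      } finish;                            // wait for the RPC to complete
      return result;
    }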
Cancellation

• Send a request on n channels
• Receivers vote yes/no
• Wait for (n/2)+1 votes one way or the other

A first attempt: what should happen once a majority is reached?

    static bool tally_votes(Vote_binding *b[], int n) {
      int yes = 0, no = 0;
      int winning = (n/2) + 1;
      do {
        for (int i = 0; i < n; i++) {
          async {
            bool v;
            Vote_get_vote(b[i], &v);
            if (v) { yes++; } else { no++; }
            if (yes >= winning || no >= winning) { ??? }
          }
        }
      } finish;
      return (yes > no);
    }

With cancellation:

    static bool tally_votes(Vote_binding *b[], int n) {
      int yes = 0, no = 0;
      int winning = (n/2) + 1;
      do {
        for (int i = 0; i < n; i++) {
          async {
            bool v;
            if (Vote_get_vote_x(b[i], &v) != CANCELLED) {
              if (v) { yes++; } else { no++; }
              if (yes >= winning || no >= winning) cancel;
            }
          }
        }
      } finish;
      return (yes > no);
    }

• The “_x” suffix marks this as a cancelable function
• We block at the finish waiting for all the async work to be done; cancel makes the remaining cancelable calls return CANCELLED

Cancellation with deferred clean-up

Split the RPC into explicit send/receive operations to distinguish two cancellation points:

    ...
    async {
      bool v;
      if (Echo_tx_get_vote_x(b[i]) == OK) {
        if (Echo_rx_get_vote_x(b[i], &v) == OK) {
          if (v) { yes++; } else { no++; }
          if (yes >= winning || no >= winning) cancel;
        } else {
          push_work(q, [=]{ Echo_tx_apologise(b[i]); });
        }
      }
    }
    ...

• If cancelled after the tx but before the rx, push a clean-up operation onto a queue: a C++ closure which apologises for the lost vote

Implementation

(Diagram: a thread’s stack alongside its per-thread run queue, with a chain of finish blocks (FBs); each FB records its enclosing FB, a count of outstanding work, and a completion AWI.)

Techniques:
• Per-thread run queue of ready work items (“AWIs”)
  – Wait for the next I/O completion when the queue is empty
• Book-keeping data all stack allocated
• Perform work lazily when blocking: the overhead of “async X()” versus “X()” is a couple of cycles
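As an illustration of the per-thread run queue, the sketch below shows one plausible shape for it; the types, fields and function names are invented for exposition and are not the AC runtime’s actual definitions.

    #include <stddef.h>

    /* Illustrative sketch of a per-thread run queue of "AWIs" (async
     * work items); names are invented, not from the AC runtime. */
    typedef struct awi {
      void (*fn)(struct awi *self);   /* code to run when resumed */
      struct awi *next;
    } awi_t;

    typedef struct {
      awi_t *first, *last;            /* FIFO of ready work items */
    } run_queue_t;

    static void rq_push(run_queue_t *q, awi_t *a) {
      a->next = NULL;
      if (q->last) { q->last->next = a; } else { q->first = a; }
      q->last = a;
    }

    static void rq_dispatch(run_queue_t *q) {
      while (q->first) {
        awi_t *a = q->first;
        q->first = a->next;
        if (!q->first) { q->last = NULL; }
        a->fn(a);                     /* run the item; it may push more work */
      }
      /* queue empty: a real runtime would now wait for the next I/O completion */
    }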
Performance

Ping-pong test, minimum-sized messages, AMD 4 * 4-core machine, using cores sharing an L3 cache:

    Configuration                                   Ping-pong latency (cycles)
    Using UMP channel directly                       931
    Using event-based stubs                         1134
    Synchronous model (client only)                 1266
    Synchronous model (client and server)           1405
    MPI (Visual Studio 2008 + HPC-Pack 2008 SDK)    2780

    Call                                  Function call latency (cycles)
    Direct (normal function call)            8
    async foo() (foo does not block)        12
    async foo() (foo blocks)              1692

• “Do not fear async”
  – Think about correctness: if the callee doesn’t block then performance is basically unchanged

(Graph: software RAID0, 64KB transfers, on 2 * Intel X25-M SSDs, each 250MB/s; throughput (MB/s) against pipeline depth (2^x) for AC, AIO, Async and Sync variants.)

(Graph: 2-phase commit between cores; time per operation (cycles) against number of cores, 2 to 16, for AC, Async and Sync variants.)

Project status

• Papers:
  – MMCS’08, HotOS’09, PLOS’09, SOSP’09, HotPar’10
• Substantial prototype system:
  – 146 kLOC (new), 488 kLOC (including ported code)
• Establishing collaborations around this research platform
  – More than 20,000 downloads of the ETHZ source-code release
  – Intel: early access to SCC, other test systems
  – Interest from other HW vendors: Tilera, ARM, ...
  – Boston Uni, IBM Research: HPC work, BlueGene
  – KTH Sweden: Tilera port
  – Barcelona Supercomputing Center