Programming models for the
Barrelfish multi-kernel
operating system
Tim Harris
Based on joint work with Martín Abadi, Andrew Baumann,
Paul Barham, Richard Black, Tim Harris, Orion Hodson,
Rebecca Isaacs, Ross McIlroy, Simon Peter, Vijayan Prabhakaran,
Timothy Roscoe, Adrian Schüpbach, Akhilesh Singhania
Cache-coherent multicore

[Diagram: multi-package system; each package has cores with private L2 caches, a shared L3, and attached RAM]
AMD Istanbul: 6 cores, per-core L2, per-package L3
Single-chip cloud computer (SCC)

[Diagram: tiles of two cores, each core with L2 and a shared message-passing buffer (MPB), connected via routers to an on-chip mesh; memory controllers (MC-0..MC-4) and a system interface at the edges]
24 * 2-core tiles
On-chip mesh n/w
Non-coherent caches
Hardware supported messaging
MSR Beehive

[Diagram: RISCN core modules (cores 1..N) on a ring interconnect (RingIn[31:0], SlotTypeIn[3:0], SrcDestIn[3:0]); a MemMux module carries messages and locks between the cores, the DDR controller, and the display controller over a pipelined bus to all cores]
Ring interconnect
Message passing in h/w
No cache coherence
Split-phase memory access
IBM Cell

[Diagram: one Power core (PXU, with L1/L2) and eight SIMD cores (SXU, each with a local store (LS) and memory flow controller (MFC)), plus memory and I/O interfaces]
64-bit Power (PXU) and SIMD (SXU) core types
Messaging vs shared data as default

[Spectrum, from traditional operating systems to the Barrelfish multikernel:
shared state, one-big-lock → fine-grained locking → clustered objects, partitioning → distributed state, replica maintenance]

• Fundamental model is message based
• “It’s better to have shared memory and not need it than to need shared memory and not have it”
The Barrelfish multi-kernel OS

[Diagram: per-core OS nodes on x64, ARM, and accelerator cores, each holding its own state replica and hosting apps; the nodes communicate by message passing over the hardware interconnect]

• System runs on heterogeneous hardware, currently supporting ARM, Beehive, SCC, x86 & x64
• User-mode programs: several models supported, including conventional shared-memory OpenMP & pthreads
• Focus of this talk: system components, each local to a specific core, and using message passing
The Barrelfish messaging stack

• High-level language runtime system: threads, join patterns, buffering, ...
• AC language extensions for C: simple synchronous send/receive interface, support for concurrency
• Event-based programming model: portable IDL, marshalling to/from C structs, events on send/receive possible
• H/W specific interconnect driver(s): h/w specific interface exposing max message sizes, guarantees, flow control, ...
Multi-core hardware & Barrelfish
Event-based model
Synchronous model
Supporting concurrency
Cancellation
Performance
Event-based interface

interface Echo {
  rpc ping(in int32 arg, out int32 result);
}

The stub compiler generates a common event-based programming interface over interconnect-specific stubs (SCC, shared-mem, Beehive, same core)
Event-based interface

interface Echo {
  rpc ping(in int32 arg, out int32 result);
}

The stub compiler generates a common programming interface over the interconnect-specific stubs (LMP, SCC, UMP, BMP):

// Try to send => OK or ERR_TX_BUSY
errval_t Echo_send_ping_call (Echo_binding *b, int arg);

// Register callback when send worth re-trying
errval_t Echo_register_send (Echo_binding *b, Echo_can_send_cb *cb);
typedef void Echo_can_send_cb (Echo_binding *b);

// Register callback on incoming “ping” response
errval_t Echo_register_recv_ping_resp (Echo_binding *b, Echo_recv_ping_cb *cb);
typedef void Echo_recv_ping_cb (Echo_binding *b, int result);

// Wait for next callback to be ready, then execute it
errval_t event_dispatch (void);
Event-based interface

Echo_binding *b;
...
Echo_register_recv_ping_resp(b, &response_handler);
...
bool done = false;
err = Echo_send_ping_call(b, 10);
if (err == ERR_TX_BUSY) {
  b->st = malloc(sizeof(int));   // stash the argument so the resend handler can retry
  *(int*)b->st = 10;
  err = Echo_register_send(b, &resend_handler);
  assert(err == ERR_OK);
}
while (!done) {
  event_dispatch();
}

static void response_handler(Echo_binding *b, int val) {
  printf("Got response %d\n", val);
  done = true;
}

static void resend_handler(Echo_binding *b) {
  err = Echo_send_ping_call(b, *(int*)b->st);
  if (err == ERR_TX_BUSY) {
    err = Echo_register_send(b, &resend_handler);
    assert(err == ERR_OK);
  } else {
    free(b->st);
  }
}
Why do it this way?
• Overlap computation and communication
– Non-blocking send/receive operations allow the caller
to continue with other work
• Remain responsive to multiple clients
– Don’t end up “stuck” blocked for a receive from one
client while another client is ready for service
• Lightweight runtime system
– No threads, no GC, etc.
– Support on diverse hardware
– Use within the implementation of runtime systems for
higher-level languages
Multi-core hardware & Barrelfish
Event-based model
Synchronous model
Supporting concurrency
Cancellation
Performance
Goals for the synchronous model
• Cleaner programming model
• Integration in C
– Resource consumption can be anticipated
• Low overhead over the underlying messaging primitives
– Don’t want to harm speed
– Don’t want to harm flexibility (e.g., ability to compute while waiting
for responses)
• Focus on concurrency between communicating processes
– Everything runs in a single thread (unless the code says otherwise)
– Execution is deterministic (modulo the timing and content of
external inputs)
Event-based programming model

interface Echo {
  rpc ping(in int32 arg, out int32 result);
}

Stub compiler generates the common event-based programming interface over interconnect-specific stubs (SCC, shared-mem, same core)
Synchronous message-passing

interface Echo {
  rpc ping(in int32 arg, out int32 result);
}

The stub compiler now also generates a synchronous message-passing interface: synchronous message-passing stubs layered over the common event-based programming interface and the interconnect-specific stubs (SCC, shared-mem, same core)
Synchronous message-passing

interface Echo {
  rpc ping(in int32 arg, out int32 result);
}

Synchronous message-passing interface, layered over the event-based stubs (SCC, UMP, LMP):

// Send “ping”, block until complete
void Echo_tx_ping (Echo_binding *b, int arg);

// Wait for and receive response to “ping”
void Echo_rx_ping (Echo_binding *b, int *result);

// RPC send-receive pair
void Echo_ping (Echo_binding *b, int arg, int *result);
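The split tx/rx pair is what preserves flexibility: a caller can do local work while a reply is in flight. A minimal sketch, assuming a bound Echo_binding and a hypothetical compute_locally() helper:

Echo_binding *b = ...;     // bound as usual
Echo_tx_ping(b, 10);       // blocks only until the send completes
compute_locally();         // hypothetical local work, overlapped with the in-flight RPC
int result;
Echo_rx_ping(b, &result);  // blocks until the response arrives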
Channel abstraction

[Diagram: client-side and server-side endpoints joined by a pair of channels, messages in flight in both directions]

• Pair of uni-directional channels, of bounded but unknown capacity
• Send is synchronous between the sender and their channel endpoint
• FIFO, lossless transmission, with unknown delay
• Receive is synchronous between receiver and head of channel
• Only whole-channel failures, e.g. if other party exits
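To make these guarantees concrete, here is an illustrative single-producer/single-consumer ring buffer with the same shape; the names, the fixed CAP, and the busy-waiting (memory ordering elided) are assumptions for illustration, not the Barrelfish channel implementation:

#define CAP 16                      /* bounded (but, to clients, unknown) capacity */

typedef struct {
    int slots[CAP];
    volatile unsigned head, tail;   /* head: next to receive; tail: next to send */
} channel_t;

/* Send is synchronous between the sender and its endpoint:
   it returns once the message is in the channel, waiting if full. */
static void chan_send(channel_t *c, int msg) {
    while (c->tail - c->head == CAP) { /* endpoint full: wait */ }
    c->slots[c->tail % CAP] = msg;
    c->tail++;
}

/* Receive is synchronous with the head of the channel: FIFO, lossless. */
static int chan_recv(channel_t *c) {
    while (c->tail == c->head) { /* empty: wait */ }
    int msg = c->slots[c->head % CAP];
    c->head++;
    return msg;
}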
Two back-to-back RPCs
static int total_rpc(Echo_binding *b1,
Echo_binding *b2,
int arg) {
int result1, result2;
Echo_ping(b1, arg, &result1);
Echo_ping(b2, arg, &result2);
return result1+result2;
}
• This looks cleaner but:
– We’ve lost the ability to contact multiple servers concurrently
– We’ve lost the ability to overlap computation with waiting
Multi-core hardware & Barrelfish
Event-based model
Synchronous model
Supporting concurrency
Cancellation
Performance
Adding asynchrony: async, do..finish
static int total_rpc(Echo_binding *b1,
Echo_binding *b2,
int arg) {
int result1, result2;
do {
async Echo_ping(b1, arg, &result1);
async Echo_ping(b2, arg, &result2);
} finish;
return result1+result2;
}
• If the async code blocks, then execution resumes after the async
• finish: wait until all the async work started (dynamically) in the do..finish has completed
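These two rules are what let AC overlap communication with computation. A minimal sketch, assuming the Echo binding from earlier and a hypothetical local_work() helper:

do {
  async Echo_ping(b1, arg, &result1);  // if this blocks, execution falls through...
  local_work();                        // ...overlapping local computation with the RPC
} finish;                              // result1 is valid after the finish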
Example: same-core L4-style RPC

static int total_rpc(Echo_binding *b1,
                     Echo_binding *b2,
                     int arg) {
  int result1, result2;
  do {
    async Echo_ping(b1, arg, &result1);
    async Echo_ping(b2, arg, &result2);
  } finish;
  return result1+result2;
}

1. Execute to the “ping” call as normal
2. Message send transitions directly to the recipient process
3. Response returned to caller
4. Caller proceeds through the second call
Example: cross-core RPC

static int total_rpc(Echo_binding *b1,
                     Echo_binding *b2,
                     int arg) {
  int result1, result2;
  do {
    async Echo_ping(b1, arg, &result1);
    async Echo_ping(b2, arg, &result2);
  } finish;
  return result1+result2;
}

1. First call sends its message; this core is now idle
2. Resume after the async; send the second message
3. Resume after the async; block at the finish until done
4. Continue once both responses have been received
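The same pattern fans out naturally to n servers at once, which is the shape the cancellation example in the next part builds on. A sketch, where the bindings array b[] and the count n are assumptions for illustration:

static int fan_out(Echo_binding *b[], int n, int arg) {
  int results[n];
  do {
    for (int i = 0; i < n; i++) {
      async Echo_ping(b[i], arg, &results[i]);  // all n RPCs in flight at once
    }
  } finish;                                     // wait for every reply
  int total = 0;
  for (int i = 0; i < n; i++) total += results[i];
  return total;
}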
Multi-core hardware & Barrelfish
Event-based model
Synchronous model
Supporting concurrency
Cancellation
Performance
Cancellation

• Send a request on n channels
• Receivers vote yes/no
• Wait for (n/2)+1 votes one way or the other
Cancellation

static bool tally_votes(Vote_binding *b[], int n) {
  int yes=0, no=0;
  int winning = (n/2)+1;
  do {
    for (int i = 0; i < n; i ++) {
      async {
        bool v;
        Vote_get_vote(b[i], &v);
        if (v) { yes++; } else { no++; }
        if (yes>=winning || no >=winning) { ??? }
      }
    }
  } finish;
  return (yes>no);
}
Cancellation

static bool tally_votes(Vote_binding *b[], int n) {
  int yes=0, no=0; int winning = (n/2)+1;
  do {
    for (int i = 0; i < n; i ++) {
      async {
        bool v;
        if (Vote_get_vote_x(b[i], &v) != CANCELLED) {
          if (v) { yes++; } else { no++; }
          if (yes>=winning || no >=winning) cancel;
        }
      }
    }
  } finish;
  return (yes>no);
}

• “_x” suffix marks this as a cancelable function
• The finish blocks, waiting for all the async work to be done
Cancellation with deferred clean-up

Split the RPC into explicit send/receive, distinguishing two cancellation points:

...
async {
  bool v;
  if (Echo_tx_get_vote_x(b[i]) == OK) {
    if (Echo_rx_get_vote_x(b[i], &v) == OK) {
      if (v) { yes++; } else { no++; }
      if (yes>=winning || no >=winning) cancel;
    } else {
      // Cancelled after tx, before rx: push a clean-up operation to a queue;
      // the C++ closure apologises for the lost vote
      push_work(q, [=]{ Echo_tx_apologise(b[i]); });
    }
  }
}
...
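Once the do..finish completes, the queued clean-ups can be drained. A sketch, assuming a simple linked-list queue; only push_work(q, ...) appears on the slide, so the work-item type and drain loop here are illustrative:

typedef struct work_item {
    void (*fn)(void *env);      /* clean-up action, e.g. Echo_tx_apologise */
    void *env;                  /* captured state, e.g. the binding b[i]   */
    struct work_item *next;
} work_item;

static void drain_work(work_item **q) {
    while (*q != NULL) {
        work_item *w = *q;
        *q = w->next;
        w->fn(w->env);          /* run the queued apology/clean-up */
        free(w);
    }
}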
Multi-core hardware & Barrelfish
Event-based model
Synchronous model
Supporting concurrency
Cancellation
Performance
Implementation

[Diagram: a thread’s stack holds a chain of finish-block (FB) records (current FB, enclosing FB, count of outstanding asyncs, completion AWI, first/last work items), linked from the thread’s run queue]

Techniques:
- Per-thread run queue of ready work items (“AWIs”)
- Wait for next IO completion when queue empty
- Book-keeping data all stack allocated
- Perform work lazily when blocking: overhead of “async X()” versus “X()” is a couple of cycles
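An illustrative sketch of the shape these structures might take; the names and layout are assumptions, not the actual AC runtime:

/* Illustrative only: approximate book-keeping, all stack allocated. */
typedef struct awi {              /* “async work item”: a resumable continuation */
    void (*resume)(struct awi *);
    struct awi *next;             /* link in the per-thread run queue */
} awi_t;

typedef struct finish_block {     /* one record per do..finish */
    struct finish_block *enclosing;
    int count;                    /* asyncs started but not yet completed */
    awi_t completion;             /* resumed when count drops to zero */
} finish_block_t;

static awi_t *runq;               /* per-thread run queue of ready AWIs */

static void dispatch(void) {
    while (runq != NULL) {        /* run ready work items in order */
        awi_t *a = runq;
        runq = a->next;
        a->resume(a);
    }
    /* queue empty: block waiting for the next IO completion */
}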
Performance

Ping-pong test; minimum-sized messages;
AMD 4 * 4-core machine; using cores sharing L3 cache

                                              Ping-pong latency (cycles)
Using UMP channel directly                     931
Using event-based stubs                       1134
Synchronous model (client only)               1266
Synchronous model (client and server)         1405
MPI (Visual Studio 2008 + HPC-Pack 2008 SDK)  2780
Performance

                                  Function call latency (cycles)
Direct (normal function call)        8
async foo() (foo does not block)    12
async foo() (foo blocks)          1692

• “Do not fear async”
– Think about correctness: if the callee doesn’t block then perf is basically unchanged
Software RAID0, 64KB transfers
2 * Intel X25-M SSDs, each 250MB/s

[Chart: throughput (MB/s, 0–500) vs pipeline depth (2^x, x = 1..11), comparing AC, AIO, Async, and Sync]
2-phase commit between cores

[Chart: time per operation (cycles, 0–50000) vs number of cores (2–16), comparing AC, Async, and Sync]
Project status
• Papers:
– MMCS’08, HotOS’09, PLOS’09, SOSP’09, HotPar’10
• Substantial prototype system:
– 146kLOC (new), 488kLOC (including ported code)
• Establishing collaborations around this research platform
– More than 20,000 downloads of ETHZ source-code release
– Intel: early access to SCC, other test systems
– Interest from other HW vendors: Tilera, ARM, ...
– Boston Uni, IBM Research: HPC work, BlueGene
– KTH Sweden: Tilera port
– Barcelona Supercomputing Center