Idempotent Work Stealing

Maged M. Michael, Martin T. Vechev,
Vijay A. Saraswat
PPoPP’09





Memory Operations Reordering
Problem Definition – Idempotent Work-Stealing
The algorithms
Comparison to Previous Work
Summary



Some architectures reorder memory accesses to achieve faster execution
read(a)
read(b)
write(a,1)
write(b,2)
read(a)
write(b,2)
write(a,1)
read(b)
A good optimization for uniprocessors…
but potentially dangerous for multiprocessors
Memory:
a = 0;
b = 0;

P1:
L1: if (read(a) = 0) goto L1
print(read(b))

P2:
write(b, 7)
write(a, 1)

Expected output of P1?
What happens if the order of P2's memory stores is changed?
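The question above can be explored with a small model. This is only a sketch: it assumes P1 observes memory after some prefix of P2's stores has become visible, and the names possible_outputs, program_order and reordered are made up for this illustration.

```python
def possible_outputs(p2_stores):
    """Return every value P1 could print, given the order in which
    P2's stores become visible in memory."""
    outputs = set()
    # P1 may look at memory after 0, 1 or 2 of P2's stores are visible.
    for visible in range(len(p2_stores) + 1):
        memory = {"a": 0, "b": 0}
        for var, value in p2_stores[:visible]:
            memory[var] = value
        if memory["a"] != 0:          # P1 leaves its loop only once a != 0
            outputs.add(memory["b"])  # ... and then prints b
    return outputs

program_order = [("b", 7), ("a", 1)]   # stores visible as written in P2
reordered = [("a", 1), ("b", 7)]       # the two stores swapped

print(possible_outputs(program_order))  # {7}
print(possible_outputs(reordered))      # 0 becomes possible as well
```

With the stores in program order P1 can only print 7. Once the stores are swapped, P1 may leave the loop before b is written and print 0.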



Memory fences: operations that synchronize memory accesses
X-Y fence: all previous operations of type X must commit before any following operation of type Y starts
Example: store-load

read1
write1
store-load
write2
read2

store-store?
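The X-Y fence rule can be written down as a check on a commit order. The sketch below is an illustration only: allowed_by_fence and the (kind, name) encoding of operations are assumptions of this example, not part of any real API.

```python
def allowed_by_fence(before, after, execution, x, y):
    """Check a commit order against an X-Y fence.

    before / after: operations issued before / after the fence in
    program order, each a (kind, name) pair with kind "load" or "store".
    execution: the order in which the operations actually commit.
    """
    position = {op: i for i, op in enumerate(execution)}
    # Every X operation before the fence must commit before
    # every Y operation after the fence.
    return all(
        position[b] < position[a]
        for b in before if b[0] == x
        for a in after if a[0] == y
    )

before = [("load", "read1"), ("store", "write1")]
after = [("store", "write2"), ("load", "read2")]

in_order = before + after
read2_early = [("load", "read1"), ("load", "read2"),
               ("store", "write1"), ("store", "write2")]
stores_swapped = [("load", "read1"), ("store", "write2"),
                  ("store", "write1"), ("load", "read2")]

print(allowed_by_fence(before, after, in_order, "store", "load"))         # True
print(allowed_by_fence(before, after, read2_early, "store", "load"))      # False
print(allowed_by_fence(before, after, stores_swapped, "store", "load"))   # True
print(allowed_by_fence(before, after, stores_swapped, "store", "store"))  # False
```

The third and fourth calls show the difference between the two fence types: swapping write1 and write2 is permitted by a store-load fence but forbidden by a store-store fence.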

Memory:
a = 0;
b = 0;

P1:
L1: if (read(a) = 0) goto L1
print(read(b))

P2:
write(b, 1)
store-store
write(a, 7)

Sequential consistency: a model where:
◦ All processors see all memory operations in the same order
◦ That order must adhere to the program order (for each thread)

With reordering, memory operations are not sequentially consistent
This makes program verification a non-trivial task

Linearizability is stronger than sequential consistency
If operation A completes before operation B begins (in real time), then A precedes B in the order
(and not only for a single thread)





Problem Definition – Idempotent Work-Stealing

Idempotence – the property of certain operations that they can be applied multiple times without changing the result (Wikipedia)
In other words: f(f(x)) = f(x)

Examples:

1. The absolute value function: abs(abs(x)) = abs(x)
2. The number 1 is idempotent under multiplication: 1*1 = 1
3. An SQL query (without updates)
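The definition f(f(x)) = f(x) can be checked directly on sample inputs. A minimal sketch; is_idempotent is a name chosen for this example, and passing the check on a few samples is evidence, not a proof.

```python
def is_idempotent(f, samples):
    """Return True if f(f(x)) == f(x) holds for every sample x."""
    return all(f(f(x)) == f(x) for x in samples)

samples = [-3, -1, 0, 2, 5]

print(is_idempotent(abs, samples))              # True: abs(abs(x)) == abs(x)
print(is_idempotent(lambda x: x + 1, samples))  # False: incrementing twice differs
```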


Work stealing: a policy for dividing procedure executions (jobs/tasks) efficiently among multiple processors
Each processor has a deque (double-ended queue) of jobs
[Figure: processors P1, P2, …, Pk, each with its own deque of jobs]


Each processor can put a new job in its own queue
Each processor can take a job from its own queue
[Figure: processors P1, P2, …, Pk putting jobs into and taking jobs from their own deques]

A processor without work can steal jobs from another processor
[Figure: processors P1, P2, …, Pk; a processor with an empty deque steals a job from another deque]












Fibonacci numbers – fib(7)
P1 – take() -> fib(7)
P1 – put(fib(6)), put(fib(5))
P1 – take() -> fib(6)
P2 – steal(P1)
P2 – take() -> fib(5)
P1 – put(fib(5)), put(fib(4))
P2 – put(fib(4)), put(fib(3))
P1 – take() -> fib(5)
P3 – steal(P1)
P3 – take() -> fib(4)
P2 – take() -> fib(4)
…
[Figure: the deques of P1, P2 and P3 holding the pending fib(…) jobs]
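The trace above can be imitated with a sequential simulation. This is a sketch under several assumptions that are not in the slides: workers run round-robin, an idle worker steals from the first non-empty deque, owners take from the tail and thieves steal from the head. The name run_fib is hypothetical, and the interleaving it produces differs from the trace shown.

```python
from collections import deque

def run_fib(n, workers=3):
    """Compute fib(n) by splitting it into jobs spread over several deques.
    Returns (result, number_of_steals)."""
    queues = [deque() for _ in range(workers)]
    queues[0].append(n)                  # P1 starts with the job fib(n)
    total = 0
    steals = 0
    while any(queues):
        for me in range(workers):
            if queues[me]:
                job = queues[me].pop()   # take() from the owner's tail
            else:
                victim = next((q for q in queues if q), None)
                if victim is None:
                    continue             # nothing to steal right now
                job = victim.popleft()   # steal() from the victim's head
                steals += 1
            if job < 2:
                total += job             # fib(0) = 0, fib(1) = 1
            else:
                queues[me].append(job - 1)   # put(fib(n-1))
                queues[me].append(job - 2)   # put(fib(n-2))
    return total, steals

result, steals = run_fib(7)
print(result)   # 13
```

Every job is executed exactly once here because the simulation is sequential. The algorithms in this talk relax that guarantee to "at least once" for jobs that are idempotent.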



Work stealing seems like a good idea…
But it can be expensive…
Because of:
1. Using locks
2. Using atomic Read-Modify-Write operations
3. Using memory ordering fences

Previous work-stealing algorithms use strong synchronization primitives

Can work-stealing algorithms for idempotent tasks avoid using synchronization primitives?

Not exactly…

Our goal:
◦ Making work stealing cheap when jobs are idempotent

How?
◦ Making the owner’s operations (“put”, “take”) cheap, but “steal” remains expensive

A snippet of the Chase-Lev algorithm:
Task take() {
    b := bottom;
    CircularArray a = activeArray;
    b = b - 1;
    bottom = b;
    store-load
    t = top;
    …
}





The algorithms


We will see 3 algorithms
All algorithms insert (put) jobs at the tail
1. Idempotent LIFO – extracting tasks (take/steal) from the tail
2. Idempotent FIFO – extracting tasks (take/steal) from the head
3. Idempotent double-ended – the owner takes tasks from the tail, and the others steal from the head

Idempotent LIFO
Each processor has:
◦ A dynamic array of tasks
◦ A capacity variable
◦ An anchor (tail index)
insert – to tail
take/steal – from tail
[Figure: P1's queue: tasks is empty; capacity = 7; anchor = 0]
void put(Task task) {
1.  t := anchor;
2.  if (t = capacity) { expand(); goto 1; }
3.  tasks[t] := task;
    store-store
4.  anchor := t + 1;
}
[Figure: tasks = [task1]; capacity = 7; anchor: 0 → 1]
Task take() {
    t := anchor;
    if (t = 0) return EMPTY;
    task := tasks[t - 1];
    anchor := t - 1;
    return task;
}
[Figure: tasks = [task1, task2, task3]; capacity = 7; anchor: 3 → 2]
Task steal() {
1.  t := anchor;
2.  if (t = 0) return EMPTY;
3.  a := tasks;
4.  task := a[t - 1];
5.  if !CAS(anchor, t, t-1) goto 1;
6.  return task;
}
Fences: load-load, load-CAS
[Figure: tasks = [task1, task2, task3]; capacity = 7; anchor: 3 → 2]
Why must tasks be idempotent?
Owner:
Task take() {
    t := anchor;
    if (t = 0) return EMPTY;
    task := tasks[t - 1];
    anchor := t - 1;
    return task;
}
task = task3

Thief:
Task steal() {
1.  t := anchor;
2.  if (t = 0) return EMPTY;
3.  a := tasks;
4.  task := a[t - 1];
5.  if !CAS(anchor, t, t-1) goto 1;
6.  return task;
}
task = task3

[Figure: tasks = [task1, task2, task3]; the owner's t and the thief's t both point at task3; capacity = 7; anchor: 3 → 2]
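The duplicate extraction shown on this slide can be replayed step by step. The sketch below is a single-threaded model only: Python has no CAS or memory fences, so the CAS is modelled by a plain compare-then-write, and take() is split into two hypothetical halves (take_read, take_write) so that a steal can be placed between them.

```python
EMPTY = None

class IdempotentLifo:
    """Single-threaded model of the idempotent LIFO queue."""

    def __init__(self, capacity=7):
        self.tasks = [None] * capacity
        self.anchor = 0

    def put(self, task):
        t = self.anchor
        if t == len(self.tasks):
            self.tasks.extend([None] * len(self.tasks))  # expand()
        self.tasks[t] = task
        # On real hardware a store-store fence belongs here.
        self.anchor = t + 1

    def take_read(self):
        """First half of take(): read the anchor and the task under it."""
        t = self.anchor
        if t == 0:
            return None
        return t, self.tasks[t - 1]

    def take_write(self, t):
        """Second half of take(): a plain store, no CAS."""
        self.anchor = t - 1

    def steal(self):
        while True:
            t = self.anchor
            if t == 0:
                return EMPTY
            task = self.tasks[t - 1]
            if self.anchor == t:         # models CAS(anchor, t, t-1)
                self.anchor = t - 1
                return task

queue = IdempotentLifo()
for name in ("task1", "task2", "task3"):
    queue.put(name)

t, owner_task = queue.take_read()   # owner reads t = 3 and task3
thief_task = queue.steal()          # thief steals task3, anchor becomes 2
queue.take_write(t)                 # owner also stores anchor = 2

print(owner_task, thief_task, queue.anchor)   # task3 task3 2
```

Both the owner and the thief end up with task3, so the task runs twice. That is harmless only if the task is idempotent.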

How is ABA possible?

Thief:
Task steal() {
1.  t := anchor;
2.  if (t = 0) return EMPTY;
3.  a := tasks;
4.  task := a[t - 1];

5.  if !CAS(anchor, t, t-1) goto 1;
6.  return task;
}
task = task3

Owner (concurrently):
take();
put(taskX);
…
put(taskY);

taskX is lost!
[Figure: tasks = [task1, task2, task3 replaced by taskX]; the thief's t points at the third slot; capacity = 7; anchor moves between 3 and 2]
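One interleaving that produces this loss can be replayed sequentially. The sketch assumes that the thief's CAS runs at the point marked "…" in the owner's sequence; aba_scenario and its variables are names invented for this example, and the CAS is modelled by a plain comparison.

```python
def aba_scenario():
    """Replay the ABA interleaving with an untagged anchor.
    Returns (executed tasks, tasks still in the queue)."""
    tasks = ["task1", "task2", "task3", None, None, None, None]  # capacity = 7
    anchor = 3
    executed = []

    # Thief: lines 1-4 of steal()
    t = anchor
    stolen = tasks[t - 1]                 # task3

    # Owner: take()
    executed.append(tasks[anchor - 1])    # task3
    anchor -= 1                           # anchor = 2

    # Owner: put(taskX)
    tasks[anchor] = "taskX"
    anchor += 1                           # anchor = 3 again

    # Thief: line 5, CAS(anchor, t, t-1) compares only the index
    if anchor == t:
        anchor = t - 1                    # succeeds, anchor = 2
        executed.append(stolen)           # task3 runs a second time

    # Owner: put(taskY) overwrites the slot that held taskX
    tasks[anchor] = "taskY"
    anchor += 1

    return executed, tasks[:anchor]

executed, remaining = aba_scenario()
print(executed)    # ['task3', 'task3']
print(remaining)   # ['task1', 'task2', 'taskY']
```

taskX appears neither among the executed tasks nor in the queue. Running task3 twice is acceptable for idempotent tasks, but losing taskX is not.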

How can we prevent it?
anchor: <integer, integer>;   // <tail, tag>
void put(Task task) {
1.  <t,tag> := anchor;
2.  if (t = capacity) { expand(); goto 1; }
3.  tasks[t] := task;
4.  anchor := <t + 1, tag + 1>;
}
Task steal() {
1.  <t,tag> := anchor;
2.  if (t = 0) return EMPTY;
3.  a := tasks;
4.  task := a[t - 1];
5.  if !CAS(anchor, <t,tag>, <t-1,tag>) goto 1;
6.  return task;
}
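Replaying the ABA interleaving with the tagged anchor shows why the tag helps. This is a sequential sketch: the CAS is modelled by comparing tuples, tagged_scenario is a hypothetical name, and it assumes that take() leaves the tag unchanged, which this slide does not show.

```python
def tagged_scenario():
    """Replay the ABA interleaving with anchor = <tail, tag>.
    Returns (first CAS succeeded?, executed tasks, final anchor)."""
    tasks = ["task1", "task2", "task3", None, None, None, None]
    anchor = (3, 0)                       # <tail, tag>
    executed = []

    # Thief: lines 1-4 of steal()
    seen = anchor
    stolen = tasks[seen[0] - 1]           # task3

    # Owner: take() (assumed to keep the tag as it is)
    t, tag = anchor
    executed.append(tasks[t - 1])         # task3
    anchor = (t - 1, tag)                 # <2, 0>

    # Owner: put(taskX) increments the tag
    t, tag = anchor
    tasks[t] = "taskX"
    anchor = (t + 1, tag + 1)             # <3, 1>

    # Thief: line 5, CAS(anchor, <t,tag>, <t-1,tag>)
    first_cas_ok = (anchor == seen)       # <3, 1> != <3, 0>, so it fails
    if first_cas_ok:
        anchor = (seen[0] - 1, seen[1])
        executed.append(stolen)
    else:
        # goto 1: the thief retries and now sees taskX
        seen = anchor
        stolen = tasks[seen[0] - 1]
        if anchor == seen:
            anchor = (seen[0] - 1, seen[1])
            executed.append(stolen)

    return first_cas_ok, executed, anchor

first_cas_ok, executed, anchor = tagged_scenario()
print(first_cas_ok)   # False
print(executed)       # ['task3', 'taskX']
print(anchor)         # (2, 1)
```

The stale CAS fails because the tag changed, even though the tail index returned to 3, so the thief retries and steals taskX instead of dropping it.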

Idempotent FIFO
Each processor has:
◦ A dynamic cyclic array of tasks
◦ A capacity variable
◦ A head index (always increasing)
◦ A tail index (always increasing)
insert – to tail
take/steal – from head
[Figure: P1's queue: tasks = [task2, task3, task4]; capacity = 7; head = 1; tail = 4]
void put(Task task) {
1.  h := head;
2.  t := tail;
3.  if (t = h + tasks.capacity) { expand(); goto 1; }
4.  tasks.array[t % tasks.capacity] := task;
    store-store
5.  tail := t + 1;
}
[Figure: tasks = [task2, task3, task4, task5]; capacity = 7; head = 1; tail: 4 → 5]
Task take() {
    h := head;
    t := tail;
    if (h = t) return EMPTY;
    task := tasks.array[h % tasks.capacity];
    head := h + 1;
    return task;
}
[Figure: tasks = [task2, task3, task4, task5]; capacity = 7; head: 1 → 2; tail = 4]
Task steal() {
1.  h := head;
2.  t := tail;
3.  if (h = t) return EMPTY;
4.  a := tasks;
5.  task := a.array[h % a.capacity];
6.  if !CAS(head, h, h+1) goto 1;
7.  return task;
}
Fences: load-load, load-load, load-CAS
[Figure: tasks = [task2, task3, task4, task5]; capacity = 7; head: 1 → 2; tail = 4]
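The three FIFO operations can be collected into one single-threaded model. This is a sketch only: Python offers no CAS or fences, so the CAS on head is a plain compare-then-write, and the expand() shown here (double the array, copy by logical index) is an assumption of this example.

```python
EMPTY = None

class IdempotentFifo:
    """Single-threaded model of the idempotent FIFO queue."""

    def __init__(self, capacity=7):
        self.array = [None] * capacity
        self.head = 0                    # only ever increases
        self.tail = 0                    # only ever increases

    def _expand(self):
        old = self.array
        new = [None] * (2 * len(old))
        for i in range(self.head, self.tail):
            new[i % len(new)] = old[i % len(old)]
        self.array = new

    def put(self, task):
        h, t = self.head, self.tail
        if t == h + len(self.array):
            self._expand()
        self.array[t % len(self.array)] = task
        # On real hardware a store-store fence belongs here.
        self.tail = t + 1

    def take(self):
        h, t = self.head, self.tail
        if h == t:
            return EMPTY
        task = self.array[h % len(self.array)]
        self.head = h + 1                # plain store, no CAS
        return task

    def steal(self):
        while True:
            h, t = self.head, self.tail
            if h == t:
                return EMPTY
            task = self.array[h % len(self.array)]
            if self.head == h:           # models CAS(head, h, h+1)
                self.head = h + 1
                return task

q = IdempotentFifo(capacity=4)
for name in ("a", "b", "c", "d"):
    q.put(name)
first = q.take()                         # "a", taken from the head
second = q.steal()                       # "b", also from the head
for name in ("e", "f", "g"):             # wraps around, then expands
    q.put(name)
rest = [q.take() for _ in range(5)]
print(first, second, rest)               # a b ['c', 'd', 'e', 'f', 'g']
```

Because head and tail never decrease, the slot index is always taken modulo the capacity, and a stale head value can never match a later one.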

Idempotent double-ended
Each processor has:
◦ A dynamic cyclic array of tasks
◦ A capacity variable
◦ An anchor (head, size)
insert – to tail
take – from tail
steal – from head
[Figure: P1's queue: tasks = [task2, task3, task4]; capacity = 7; anchor = <1, 3>]
void put(Task task) {
1.  <h, s> := anchor;
2.  if (s = tasks.capacity) { expand(); goto 1; }
3.  tasks.array[(h+s) % tasks.capacity] := task;
    store-store
4.  anchor := <h, s + 1>;
}
[Figure: tasks = [task2, task3, task4, task5]; capacity = 7; anchor: <1, 3> → <1, 4>]
Task take() {
    <h, s> := anchor;
    if (s = 0) return EMPTY;
    task := tasks.array[(h+s-1) % tasks.capacity];
    anchor := <h, s - 1>;
    return task;
}
[Figure: tasks = [task2, task3, task4, task5]; capacity = 7; anchor: <1, 4> → <1, 3>]
Task steal() {
1.  <h, s> := anchor;
2.  if (s = 0) return EMPTY;
3.  a := tasks;
4.  task := a.array[h % a.capacity];
5.  h2 := (h + 1) % a.capacity;
6.  if !CAS(anchor, <h,s>, <h2,s-1>) goto 1;
7.  return task;
}
Fences: load-load, load-CAS
[Figure: tasks = [task2, task3, task4, task5]; capacity = 7; anchor: <1, 4> → <2, 3>]
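The double-ended variant can be modelled the same way. Again this is only a single-threaded sketch: the anchor is a Python tuple (head, size), the CAS is a tuple comparison followed by a write, and _expand is an assumed helper that doubles the array.

```python
EMPTY = None

class IdempotentDeque:
    """Single-threaded model of the idempotent double-ended queue."""

    def __init__(self, capacity=7):
        self.array = [None] * capacity
        self.anchor = (0, 0)             # <head, size>

    def _expand(self):
        h, s = self.anchor
        old = self.array
        new = [None] * (2 * len(old))
        for i in range(s):
            new[(h + i) % len(new)] = old[(h + i) % len(old)]
        self.array = new

    def put(self, task):
        h, s = self.anchor
        if s == len(self.array):
            self._expand()
        self.array[(h + s) % len(self.array)] = task
        # On real hardware a store-store fence belongs here.
        self.anchor = (h, s + 1)

    def take(self):
        h, s = self.anchor
        if s == 0:
            return EMPTY
        task = self.array[(h + s - 1) % len(self.array)]
        self.anchor = (h, s - 1)         # plain store, no CAS
        return task

    def steal(self):
        while True:
            h, s = self.anchor
            if s == 0:
                return EMPTY
            task = self.array[h % len(self.array)]
            h2 = (h + 1) % len(self.array)
            if self.anchor == (h, s):    # models CAS(anchor, <h,s>, <h2,s-1>)
                self.anchor = (h2, s - 1)
                return task

d = IdempotentDeque(capacity=4)
for name in ("a", "b", "c", "d"):
    d.put(name)
stolen = d.steal()        # "a": thieves remove from the head
d.put("e")                # wraps around into slot 0
taken = d.take()          # "e": the owner removes from the tail
print(stolen, taken, d.anchor)   # a e (1, 3)
```

Keeping head and size in one word lets the owner update both with a single plain store, while thieves update both with a single CAS.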





Comparison to Previous Work


Compared against the “Chase-Lev” and “Cilk THE” algorithms (after adding memory fences)
Benchmarks:
◦ Micro – the common case – take() and put()
◦ Irregular graph applications

2 scenarios:
◦ Both puts and takes (10^6 ops of each type)
◦ Only takes (10^6 ops) – pre-populating the work-queues


Based on the SIMPLE framework
2D torus graph:
◦ Vertices – on the torus
◦ Each vertex is connected to its 4 neighbors

Build a spanning tree
Up to 6% redundant work





Summary




Reordering memory operations improves execution times
Use with care on multiprocessors
“Idempotent Work-Stealing” is useful for some workloads
Idempotent LIFO gives good results for all benchmarks
Thank You!
Questions?