Wait-Free Queues with Multiple
Enqueuers and Dequeuers
Alex Kogan
Erez Petrank
Computer Science, Technion, Israel
FIFO queues

- One of the most fundamental and common data structures

[Figure: a FIFO queue holding 5, 3, 2, 9; elements are enqueued at the tail and dequeued from the head]
Concurrent FIFO queues

- A concurrent implementation supports "correct" concurrent adding and removing of elements
  - correct = linearizable
- Access to the shared memory must be synchronized

[Figure: several threads concurrently invoke enqueue and dequeue on a queue holding 3, 2, 9; one dequeue finds the queue empty]
Non-blocking synchronization

- No thread is blocked waiting for another thread to complete
  - e.g., no locks / critical sections
- Progress guarantees:
  - Obstruction-freedom: progress is guaranteed only in the eventual absence of interference
  - Lock-freedom: among all threads trying to apply an operation, one will succeed
  - Wait-freedom: a thread completes its operation in a bounded number of steps
Lock-freedom

- Among all threads trying to apply an operation, one will succeed
  - an opportunistic approach: make attempts until succeeding
  - guarantees global progress, but all but one thread may starve
- There are many efficient and scalable lock-free queue implementations
Wait-freedom

- A thread completes its operation in a bounded number of steps, regardless of what other threads are doing
- A highly desired property of any concurrent data structure
  - but commonly regarded as inefficient and too costly to achieve
- Particularly important in several domains:
  - real-time systems
  - systems operating under an SLA
  - heterogeneous environments
Related work: existing wait-free queues

- Limited concurrency:
  - one enqueuer and one dequeuer [Lamport'83]
  - multiple enqueuers, one concurrent dequeuer [David'04]
  - multiple dequeuers, one concurrent enqueuer [Jayanti & Petrovic'05]
- Universal constructions [Herlihy'91]:
  - a generic method to transform any (sequential) object into a lock-free / wait-free concurrent object
  - expensive, impractical implementations
- (Almost) no experimental results
Related work: lock-free queue [Michael & Scott'96]

- One of the most scalable and efficient lock-free implementations
- Widely adopted by industry
  - part of the Java concurrency package (java.util.concurrent)
- Relatively simple and intuitive implementation
- Based on a singly-linked list of nodes

[Figure: a linked list of nodes 12, 4, 17; head points to the first node, tail to the last]
MS-queue brief review: enqueue

[Figure: enqueue(9) appends a node with a CAS on the last node's next pointer, then a second CAS swings tail to the new node; a subsequent enqueue(5) repeats the same two CAS steps]
MS-queue brief review: dequeue

[Figure: dequeue reads the value of the first node after the dummy and CASes head one node forward; the node previously pointed to by head is discarded]
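To ground the review, here is a minimal Java sketch of the MS-queue as described above. It is our own rendering for illustration (the class and variable names are ours), not the authors' code.

```java
import java.util.concurrent.atomic.AtomicReference;

// A minimal sketch of the Michael & Scott lock-free queue reviewed above.
class MSQueue<T> {
    static final class Node<T> {
        final T value;
        final AtomicReference<Node<T>> next = new AtomicReference<>(null);
        Node(T value) { this.value = value; }
    }

    private final AtomicReference<Node<T>> head, tail;

    MSQueue() {
        Node<T> dummy = new Node<>(null);          // sentinel node
        head = new AtomicReference<>(dummy);
        tail = new AtomicReference<>(dummy);
    }

    void enqueue(T v) {
        Node<T> node = new Node<>(v);
        while (true) {
            Node<T> last = tail.get();
            Node<T> next = last.next.get();
            if (next == null) {
                // first CAS: link the new node after the last node
                if (last.next.compareAndSet(null, node)) {
                    // second CAS: swing tail (may also be done by another thread)
                    tail.compareAndSet(last, node);
                    return;
                }
            } else {
                tail.compareAndSet(last, next);    // help finish a stalled enqueue
            }
        }
    }

    T dequeue() {
        while (true) {
            Node<T> first = head.get();
            Node<T> last = tail.get();
            Node<T> next = first.next.get();
            if (first == last) {
                if (next == null) return null;     // queue is empty
                tail.compareAndSet(last, next);    // help a stalled enqueue
            } else {
                T v = next.value;
                // CAS head forward; the old dummy is reclaimed by the GC
                if (head.compareAndSet(first, next)) return v;
            }
        }
    }
}
```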
Our idea (in a nutshell)

- Based on the lock-free queue by Michael & Scott
- Helping mechanism
  - each operation is applied in a bounded time
- "Wait-free" implementation scheme
  - each operation is applied exactly once
Helping mechanism

- Each operation is assigned a dynamic age-based priority
  - inspired by the Doorway mechanism used in the Bakery mutex
- Each thread accessing the queue:
  - chooses a monotonically increasing phase number
  - writes down its phase and operation info in a special state array
  - helps all threads with a non-larger phase to apply their operations

State entry (one per thread):
  phase: long
  pending: boolean
  enqueue: boolean
  node: Node
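A direct Java rendering of this per-thread state entry might look as follows. This is a sketch: the class names OpDesc and StateArray and the fixed NUM_THREADS are our assumptions, and Node stands for the queue's list node (its fields appear on the "Internal structures" slides below).

```java
import java.util.concurrent.atomic.AtomicReferenceArray;

// One immutable descriptor per operation; a thread's entry in `state`
// is replaced wholesale with CAS.
class OpDesc {
    final long phase;        // age-based priority of the operation
    final boolean pending;   // is the operation still in progress?
    final boolean enqueue;   // true for enqueue, false for dequeue
    final Node node;         // queue node involved in the operation

    OpDesc(long phase, boolean pending, boolean enqueue, Node node) {
        this.phase = phase; this.pending = pending;
        this.enqueue = enqueue; this.node = node;
    }
}

class StateArray {
    static final int NUM_THREADS = 16;  // assumption: a fixed, known thread count
    final AtomicReferenceArray<OpDesc> state =
            new AtomicReferenceArray<>(NUM_THREADS);
}
```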
Helping mechanism in action

[Figure: the state array for four threads]
  phase:    4     9      9     3
  pending:  true  false  true  false
  enqueue:  true  true   true  true
  node:     ref   null   ref   ref
Helping mechanism in action

[Figure: the last thread chooses phase 10 and announces a pending enqueue; it must first help the pending operations with phase <= 10 ("I need to help!")]
  phase:    4     9      9     10
  pending:  true  false  true  true
  enqueue:  true  true   true  true
  node:     ref   null   ref   ref
Helping mechanism in action

[Figure: same state array; a helping thread skips pending operations whose phase is larger than its own ("I do not need to help!")]
Helping mechanism in action

[Figure: the third thread's operation has completed and it announces a new one, a dequeue with phase 11; it must help the pending operations with smaller phases, while they need not help it]
  phase:    4     9      11     10
  pending:  true  false  true   true
  enqueue:  true  true   false  true
  node:     ref   null   null   ref
Helping mechanism in action

[Figure: the same state array as above]

- The number of operations that may linearize before any given operation is bounded
  - hence, wait-freedom
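In code, the basic helping rule could look roughly like this (a sketch building on the state array above; helpEnq and helpDeq stand for the per-operation helpers sketched later):

```java
// Before executing its own operation, a thread with phase `phase` scans
// the state array and helps every pending operation whose phase is not
// larger than its own.
void help(long phase) {
    for (int tid = 0; tid < NUM_THREADS; tid++) {
        OpDesc desc = state.get(tid);
        if (desc.pending && desc.phase <= phase) {
            if (desc.enqueue) helpEnq(tid, phase);   // sketched later
            else              helpDeq(tid, phase);
        }
    }
}
```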
Optimized helping

- The basic scheme has two drawbacks:
  - the number of steps executed by each thread on every operation depends on n (the number of threads), even when there is no contention
  - it creates scenarios where many threads help the same operations, e.g., when many threads access the queue concurrently
    - a large amount of redundant work
- Optimization: help one thread at a time, in a cyclic manner (see the sketch below)
  - faster threads help slower peers in parallel
  - reduces the amount of redundant work
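A possible shape of this optimization, as referenced above (our sketch, not necessarily the authors' exact scheme):

```java
// Per-thread cursor into the state array (thread-local in a real implementation).
int nextToHelp = 0;

// Help at most one pending peer per own operation, then advance cyclically.
void helpOne(long phase) {
    OpDesc desc = state.get(nextToHelp);
    if (desc.pending && desc.phase <= phase) {
        if (desc.enqueue) helpEnq(nextToHelp, phase);
        else              helpDeq(nextToHelp, phase);
    }
    nextToHelp = (nextToHelp + 1) % NUM_THREADS;
}
```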
How to choose the phase numbers

- Every time a thread ti chooses a phase number, it is greater than the number chosen by any thread that made its choice before ti
  - defines a logical order on operations and provides wait-freedom
- Like in the Bakery mutex: scan through state and take the maximal phase value + 1
  - requires O(n) steps

[Figure: scanning a state array with phases 4, 3, 5 yields max 5, so the next phase is 6 ("6!")]

- Alternative: use an atomic counter
  - requires O(1) steps
  - (both options are sketched below)
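Both options in code (a sketch; state and NUM_THREADS are from the state-array sketch above):

```java
import java.util.concurrent.atomic.AtomicLong;

// (a) Bakery-style scan: O(n) steps. Two concurrent scans may pick equal
// values; as in the Bakery mutex, such ties are broken by thread ID.
long maxPhasePlusOne() {
    long max = -1;
    for (int tid = 0; tid < NUM_THREADS; tid++) {
        max = Math.max(max, state.get(tid).phase);
    }
    return max + 1;   // greater than every previously chosen phase
}

// (b) Shared atomic counter: O(1) steps, and no two threads ever
// obtain the same phase number.
final AtomicLong phaseCounter = new AtomicLong(0);
long nextPhase() { return phaseCounter.incrementAndGet(); }
```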
"Wait-free" design scheme

- Break each operation into three atomic steps, which:
  - can be executed by different threads
  - cannot be interleaved

1. Initial change of the internal structure
   - concurrent operations realize that there is an operation-in-progress
2. Updating the state of the operation-in-progress as being performed (linearized)
3. Fixing the internal structure
   - finalizing the operation-in-progress
Internal structures

[Figure: the queue is a linked list 1 -> 2 -> 4 with head and tail pointers, alongside the per-thread state array]
  phase:    9      4      9
  pending:  false  false  false
  enqueue:  false  true   true
  node:     null   null   null
Internal structures

- Each node carries an enqTid field (int): it holds the ID of the thread that performs / has performed the insertion of the node into the queue

[Figure: in the list 1 -> 2 -> 4, the enqTid stored in each node records who enqueued it: one element was enqueued by Thread 0, the other elements by Thread 1]
Internal structures

- Each node also carries a deqTid field (int): it holds the ID of the thread that performs / has performed the removal of the node from the queue

[Figure: in the same list, the element dequeued by Thread 1 has deqTid = 1; nodes still in the queue have deqTid = -1]
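Putting the two fields together, the list node could be sketched in Java as follows (field names follow the slides; the int value type is our simplification):

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

class Node {
    final int value;
    final AtomicReference<Node> next = new AtomicReference<>(null);
    final int enqTid;   // ID of the inserting thread; -1 for the initial dummy node
    // ID of the removing thread; stays -1 until a dequeuer claims this node with CAS
    final AtomicInteger deqTid = new AtomicInteger(-1);

    Node(int value, int enqTid) { this.value = value; this.enqTid = enqTid; }
}
```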
enqueue operation
Creating a new node

[Figure: the queue holds 12 (dummy), 4, 17; Thread 2 allocates a new node with value 6, enqTid = 2, deqTid = -1; all state entries are idle (pending = false)]
enqueue operation
Announcing a new operation

[Figure: Thread 2 raises its phase to 10 and writes {pending: true, enqueue: true, node: ref to the new node} into state[2]]
enqueue operation
Step 1: Initial change of the internal structure

[Figure: a CAS links the new node 6 after the last node by swapping node 17's next pointer from null to the new node]
enqueue operation
Step 2: Updating the state of the operation-in-progress as being performed

[Figure: a CAS replaces Thread 2's state entry with {phase: 10, pending: false, enqueue: true, node: ref}; the operation is now linearized]
enqueue operation
Step 3: Fixing the internal structure

[Figure: a CAS swings tail from node 17 to the new node 6, finalizing the operation]
enqueue operation
Step 1: Initial change of the internal structure

[Figure: a concurrent scenario: Thread 2 has performed step 1 (node 6 is linked after 17) but its operation is still pending when Thread 0 arrives wanting to enqueue 3]
enqueue operation
Creating a new node / Announcing a new operation

[Figure: Thread 0 allocates a node with value 3 (enqTid = 0), chooses phase 11, and publishes {pending: true, enqueue: true, node: ref} in state[0]]
enqueue operation
Step 2: Updating the state of the operation-in-progress as being performed

[Figure: a helper CASes Thread 2's state entry to pending: false (phase 10), linearizing the enqueue of 6 even though Thread 2 did not perform this step itself]
enqueue operation
Step 3: Fixing the internal structure

[Figure: a CAS swings tail from node 17 to node 6, completing Thread 2's operation]
enqueue operation
Step 1: Initial change of the internal structure

[Figure: with Thread 2's operation finalized, a CAS links Thread 0's node 3 after node 6; the three steps then repeat for Thread 0's operation]
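The enqueue walkthrough condenses into code roughly as follows. This is a hedged sketch of the three steps, not the authors' verbatim algorithm; isStillPending(tid, phase) is an assumed helper that re-reads state[tid] and checks that the operation is still pending with a phase at most `phase`.

```java
// Help thread `tid` complete its announced enqueue (any thread may run this).
void helpEnq(int tid, long phase) {
    while (isStillPending(tid, phase)) {
        Node last = tail.get();
        Node next = last.next.get();
        if (last == tail.get()) {                // tail still consistent?
            if (next == null) {
                // Step 1: link the announced node after the last node.
                OpDesc desc = state.get(tid);
                if (isStillPending(tid, phase)
                        && last.next.compareAndSet(null, desc.node)) {
                    helpFinishEnq();
                    return;
                }
            } else {
                helpFinishEnq();                 // finish the operation-in-progress first
            }
        }
    }
}

// Steps 2 and 3: linearize the pending enqueue, then swing tail.
void helpFinishEnq() {
    Node last = tail.get();
    Node next = last.next.get();
    if (next != null) {
        int tid = next.enqTid;                   // whose node was linked in step 1?
        OpDesc cur = state.get(tid);
        if (last == tail.get() && cur.node == next) {
            // Step 2: CAS the state entry to pending = false (linearization).
            OpDesc done = new OpDesc(cur.phase, false, true, next);
            state.compareAndSet(tid, cur, done);
            // Step 3: CAS tail to the new node.
            tail.compareAndSet(last, next);
        }
    }
}
```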
dequeue operation

[Figure: the queue holds 12 (dummy), 4, 17; all state entries are idle; Thread 2 is about to dequeue]
dequeue operation
Announcing a new operation

[Figure: Thread 2 raises its phase to 10 and writes {pending: true, enqueue: false, node: null} into state[2]]
dequeue operation
Updating state to refer to the first node

[Figure: a CAS stores a reference to the first (dummy) node in Thread 2's state entry, so helpers know which node the dequeue operates on]
dequeue operation
Step 1: Initial change of the internal structure

[Figure: a CAS writes Thread 2's ID into the first node's deqTid (from -1 to 2), claiming the removal]
dequeue operation
Step 2: Updating the state of the operation-in-progress as being performed

[Figure: a CAS sets Thread 2's state entry to pending: false (phase 10); the dequeue is now linearized]
dequeue operation
Step 3: Fixing the internal structure

[Figure: a CAS swings head one node forward; the dequeued value (4) is read from the node that becomes the new dummy]
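Steps 2 and 3 of the dequeue can be sketched symmetrically (our rendering; step 1 is the CAS first.deqTid.compareAndSet(-1, tid) shown above):

```java
// Finish a claimed dequeue: linearize it and advance head.
void helpFinishDeq() {
    Node first = head.get();
    Node next = first.next.get();
    int tid = first.deqTid.get();        // -1 means no dequeue has claimed this node
    if (tid != -1) {
        OpDesc cur = state.get(tid);
        if (first == head.get() && next != null) {
            // Step 2: CAS the state entry to pending = false (linearization).
            OpDesc done = new OpDesc(cur.phase, false, false, cur.node);
            state.compareAndSet(tid, cur, done);
            // Step 3: CAS head forward; `next` becomes the new dummy,
            // and its value is the one returned to thread `tid`.
            head.compareAndSet(first, next);
        }
    }
}
```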
Performance evaluation

  Architecture:  two 2.5 GHz quad-core Xeon E5420 processors / two 1.6 GHz quad-core Xeon E5310 processors
  # threads:     8 (per machine)
  RAM:           16GB (per machine)
  OS:            CentOS 5.5 Server / Ubuntu 8.10 Server / RedHat Enterprise 5.3 Server
  Java:          Sun's Java SE Runtime 1.6.0 update 22, 64-bit Server VM
Benchmarks

- Enqueue-Dequeue benchmark
  - the queue is initially empty
  - each thread iteratively performs an enqueue and then a dequeue
  - 1,000,000 iterations per thread
- 50%-Enqueue benchmark
  - the queue is initialized with 1000 elements
  - each thread decides uniformly at random which operation to perform, with equal odds for enqueue and dequeue
  - 1,000,000 operations per thread
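A minimal harness for the Enqueue-Dequeue benchmark could look like this (our own sketch; MSQueue stands for whichever queue implementation is under test):

```java
// Runs `threads` workers, each doing 1,000,000 enqueue-then-dequeue pairs,
// and reports the completion time.
static void enqueueDequeueBenchmark(MSQueue<Integer> queue, int threads)
        throws InterruptedException {
    final int ITERATIONS = 1_000_000;
    Thread[] workers = new Thread[threads];
    long start = System.nanoTime();
    for (int t = 0; t < threads; t++) {
        workers[t] = new Thread(() -> {
            for (int i = 0; i < ITERATIONS; i++) {
                queue.enqueue(i);    // each iteration: one enqueue, then one dequeue
                queue.dequeue();
            }
        });
        workers[t].start();
    }
    for (Thread w : workers) w.join();
    System.out.printf("%d threads: %d ms%n",
            threads, (System.nanoTime() - start) / 1_000_000);
}
```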
Tested algorithms

- Compared implementations:
  - MS-queue
  - Base wait-free queue
  - Optimized wait-free queue
    - Opt 1: optimized helping (help one thread at a time)
    - Opt 2: atomic counter-based phase calculation
- We measure completion time as a function of the number of threads
Enqueue-Dequeue benchmark

TBD: add figures
The impact of optimizations

TBD: add figures
Optimizing further: false sharing

- False sharing is created on accesses to the state array
- Resolved by stretching the state entries with dummy pads (a sketch follows)

TBD: add figures
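One conventional way to add such pads in Java (a sketch: the seven longs assume 64-byte cache lines; the real layout is JVM-dependent, and the authors' exact padding is not shown here):

```java
// Dummy pads around the payload so that neighboring state entries do not
// share a cache line (field layout is JVM-dependent; this is a common idiom).
class PaddedOpDesc extends OpDesc {
    long p0, p1, p2, p3, p4, p5, p6;
    PaddedOpDesc(long phase, boolean pending, boolean enqueue, Node node) {
        super(phase, pending, enqueue, node);
    }
}
```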
Optimizing further: memory management

- Every attempt to update state is preceded by the allocation of a new record
  - these records can be reused when the attempt fails
  - (more) validation checks can be performed to reduce the number of failed attempts
- When an operation is finished, remove the reference from state to a list node
  - helps the garbage collector
Implementing the queue without GC

- Apply the Hazard Pointers technique [Michael'04]
  - each thread is associated with hazard pointers
    - single-writer multi-reader registers
    - used by threads to point to objects they may access later
  - when an object should be deleted, a thread stores its address in a special stack
  - once in a while, it scans the stack and recycles objects only if no hazard pointers point to them
- In our case, the technique can be applied with a slight modification in the dequeue method (a minimal sketch follows)
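A minimal sketch of the hazard-pointer protocol just described, heavily simplified for illustration: one hazard slot per thread and a shared retired list, whereas real implementations keep per-thread retired lists and amortize the scan.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicReferenceArray;

class HazardPointers {
    final AtomicReferenceArray<Object> hazard;      // slot i: object thread i may still access
    final List<Object> retired = new ArrayList<>(); // simplified: should be per-thread

    HazardPointers(int threads) { hazard = new AtomicReferenceArray<>(threads); }

    void protect(int tid, Object o) { hazard.set(tid, o); }  // single-writer register
    void clear(int tid)             { hazard.set(tid, null); }

    // Called when an object is unlinked; recycle only unprotected objects
    // ("recycle" here simply means dropping our last reference).
    void retire(Object o) {
        retired.add(o);
        retired.removeIf(r -> !isProtected(r));
    }

    private boolean isProtected(Object o) {
        for (int i = 0; i < hazard.length(); i++)
            if (hazard.get(i) == o) return true;
        return false;
    }
}
```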
Summary

- The first wait-free queue implementation supporting multiple enqueuers and dequeuers
- Wait-freedom incurs an inherent trade-off:
  - it bounds the completion time of a single operation
  - but has a cost in the "typical" case
- The additional cost can be reduced to a tolerable level
- The proposed design scheme might be applicable to other wait-free data structures

Thank you!
Questions?