Wait-Free Queues with Multiple Enqueuers and Dequeuers
Alex Kogan and Erez Petrank
Computer Science, Technion, Israel

FIFO queues
- One of the most fundamental and common data structures
[Figure: a FIFO queue holding 5, 3, 2, 9; enqueue adds an element at the tail, dequeue removes one from the head]

Concurrent FIFO queues
- A concurrent implementation supports "correct" concurrent adding and removing of elements
  - correct = linearizable
- Access to the shared memory must be synchronized
[Figure: several threads run enqueue and dequeue concurrently; dequeues return 3, 2 and 9, and a dequeue on an empty queue reports "empty!"]

Non-blocking synchronization
- No thread is blocked waiting for another thread to complete
  - e.g., no locks / critical sections
- Progress guarantees:
  - Obstruction-freedom: progress is guaranteed only in the eventual absence of interference
  - Lock-freedom: among all threads trying to apply an operation, one will succeed
  - Wait-freedom: a thread completes its operation in a bounded number of steps

Lock-freedom
- Among all threads trying to apply an operation, one will succeed
  - an opportunistic approach: make attempts until succeeding
  - global progress, but all but one thread may starve
- Many efficient and scalable lock-free queue implementations exist

Wait-freedom
- A thread completes its operation in a bounded number of steps, regardless of what other threads are doing
- A highly desired property of any concurrent data structure
  - but commonly regarded as inefficient and too costly to achieve
- Particularly important in several domains:
  - real-time systems
  - systems operating under an SLA
  - heterogeneous environments

Related work: existing wait-free queues
- Limited concurrency:
  - one enqueuer and one dequeuer [Lamport'83]
  - multiple enqueuers, one concurrent dequeuer [David'04]
  - multiple dequeuers, one concurrent enqueuer [Jayanti & Petrovic'05]
- Universal constructions [Herlihy'91]:
  - a generic method to transform any (sequential) object into a lock-free/wait-free concurrent object
  - expensive, impractical implementations
- (Almost) no experimental results

Related work: lock-free queue [Michael & Scott'96]
- One of the most scalable and efficient lock-free implementations
- Widely adopted by industry
  - part of the Java Concurrency package
- Relatively simple and intuitive implementation
- Based on a singly-linked list of nodes
[Figure: a list of nodes 12 -> 4 -> 17, with head pointing to the first node and tail to the last]

MS-queue brief review: enqueue
[Figure: enqueue(9) links a new node after 17 with a CAS on the last node's next pointer, then advances tail to the new node with a second CAS; a concurrent enqueue(5) that finds tail lagging behind first helps advance it]

MS-queue brief review: dequeue
[Figure: dequeue returns 4, the value stored in the node following the dummy node 12, and advances head to that node with a CAS; the node holding 4 becomes the new dummy]

Our idea (in a nutshell)
- Based on the lock-free queue by Michael & Scott
- A helping mechanism
  - each operation is applied in a bounded time
- A "wait-free" implementation scheme
  - each operation is applied exactly once

Helping mechanism
- Each operation is assigned a dynamic age-based priority
  - inspired by the doorway mechanism used in the Bakery mutex
- Each thread accessing the queue:
  - chooses a monotonically increasing phase number
  - writes down its phase and operation info in a special state array, with one entry per thread (phase: long, pending: boolean, enqueue: boolean, node: Node)
  - helps all threads with a non-larger phase to apply their operations

Helping mechanism in action
[Figure: a four-entry state array; with pending operations announced at phases 4 and 9, a thread that picks phase 10 must help both ("I need to help!"), but it need not help an operation announced later with phase 11 ("I do not need to help!")]
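To make the state array concrete, here is a minimal Java sketch of the per-thread entry and the helping loop. It follows the scheme on these slides, but the names (OpDesc, WFQueueSketch, NUM_THREADS, helpEnqueue, helpDequeue) are illustrative rather than the authors' published code; Node is the queue's list node, whose full layout is sketched later with the internal structures.

```java
import java.util.concurrent.atomic.AtomicReferenceArray;

// Immutable per-thread descriptor published in state[tid].
class OpDesc {
    final long phase;       // age-based priority chosen in the doorway
    final boolean pending;  // true until the operation has been linearized
    final boolean enqueue;  // true for an enqueue, false for a dequeue
    final Node node;        // node to insert / node being removed (null if none)

    OpDesc(long phase, boolean pending, boolean enqueue, Node node) {
        this.phase = phase;
        this.pending = pending;
        this.enqueue = enqueue;
        this.node = node;
    }
}

abstract class WFQueueSketch {
    static final int NUM_THREADS = 16;  // illustrative bound on thread IDs

    // One entry per thread; helpers replace entries only with CAS.
    final AtomicReferenceArray<OpDesc> state =
            new AtomicReferenceArray<>(NUM_THREADS);

    WFQueueSketch() {
        for (int i = 0; i < NUM_THREADS; i++)  // nobody is pending initially
            state.set(i, new OpDesc(-1, false, true, null));
    }

    // After announcing its own operation with phase myPhase, a thread scans
    // state and helps every pending operation with a non-larger phase.
    void help(long myPhase) {
        for (int tid = 0; tid < state.length(); tid++) {
            OpDesc desc = state.get(tid);
            if (desc.pending && desc.phase <= myPhase) {
                if (desc.enqueue) helpEnqueue(tid, myPhase);
                else              helpDequeue(tid, myPhase);
            }
        }
    }

    // These apply the three-step scheme described next; sketches of their
    // bodies appear after the corresponding walkthroughs below.
    abstract void helpEnqueue(int tid, long phase);
    abstract void helpDequeue(int tid, long phase);
}
```

Because OpDesc is immutable, helpers can change an operation's state only by replacing the whole entry with a single CAS, so concurrent helpers of the same operation can never leave it half-updated.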
Helping mechanism in action (cont.)
- The number of operations that may linearize before any given operation is bounded
  - hence, wait-freedom

Optimized helping
- The basic scheme has two drawbacks:
  - the number of steps executed by each thread on every operation depends on n (the number of threads), even when there is no contention
  - it creates scenarios where many threads help the same operation, e.g., when many threads access the queue concurrently, resulting in large redundant work
- Optimization: help one thread at a time, in a cyclic manner
  - faster threads help slower peers in parallel
  - reduces the amount of redundant work

How to choose the phase numbers
- Every time a thread ti chooses a phase number, it is greater than the number of any thread that made its choice before ti
  - this defines a logical order on operations and provides wait-freedom
- Like in the Bakery mutex: scan through state and take the maximal phase value + 1
  - requires O(n) steps
[Figure: scanning entries with phases 4, 3 and 5 yields the new phase 6]
- Alternative: use an atomic counter
  - requires O(1) steps

"Wait-free" design scheme
- Break each operation into three atomic steps that:
  - can be executed by different threads
  - cannot be interleaved
- Step 1: Initial change of the internal structure
  - concurrent operations realize that there is an operation-in-progress
- Step 2: Updating the state of the operation-in-progress as being performed (linearized)
- Step 3: Fixing the internal structure
  - finalizing the operation-in-progress

Internal structures
- A singly-linked list of nodes with head and tail pointers, as in the MS-queue, plus the state array
- Each node carries an enqTid: int field, holding the ID of the thread that performs / has performed the insertion of the node into the queue
- Each node also carries a deqTid: int field, holding the ID of the thread that performs / has performed the removal of the node from the queue (-1 while the node has not been removed)
[Figure: a queue where one element's enqTid marks it as enqueued by Thread 0 and another's as enqueued by Thread 1; a deqTid of 1 marks an element dequeued by Thread 1]

enqueue operation (walkthrough: thread 2 performs enqueue(6) on the queue 12 -> 4 -> 17)
- Creating a new node: thread 2 allocates a node holding 6, with enqTid = 2 and deqTid = -1
- Announcing the new operation: thread 2 writes a new state entry with a higher phase (10), pending = true, enqueue = true, and a reference to the new node
- Step 1: Initial change of the internal structure: a CAS links the new node after the last node (17)
- Step 2: Updating the state of the operation-in-progress as being performed: a CAS on thread 2's state entry resets pending to false (the phase stays 10)
- Step 3: Fixing the internal structure: a CAS advances tail to the new node
[Figure: the queue 12 -> 4 -> 17 becomes 12 -> 4 -> 17 -> 6 across the three steps]
- Helping in action: if thread 0 announces enqueue(3) with phase 11 while the enqueue of 6 is still between Steps 1 and 3, thread 0's own Step 1 cannot succeed; the appended node's enqTid tells thread 0 whose operation is in progress, so it first completes Steps 2 and 3 on thread 2's behalf, and only then applies its own Step 1
[Figure: the same three steps repeated for thread 0's enqueue(3), partially executed by helping threads]
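Continuing the sketch, the node layout and the three steps of enqueue might look as follows in Java. Again this is illustrative, not the published code: maxPhase, isStillPending and finishEnqueue are assumed names, and the fragment is meant to live inside the WFQueueSketch class above.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

// Node layout from the internal-structures slide.
class Node {
    final int value;
    final AtomicReference<Node> next = new AtomicReference<>(null);
    final int enqTid;                                    // who inserts this node
    final AtomicInteger deqTid = new AtomicInteger(-1);  // who removes it; -1 = nobody yet

    Node(int value, int enqTid) {
        this.value = value;
        this.enqTid = enqTid;
    }
}

// Inside WFQueueSketch (continuing the earlier sketch): the enqueue side.
final AtomicReference<Node> head, tail;  // both start at a shared dummy node
{
    Node dummy = new Node(0, -1);
    head = new AtomicReference<>(dummy);
    tail = new AtomicReference<>(dummy);
}

// Doorway: pick a phase larger than every phase chosen so far (O(n) scan;
// an atomic counter's getAndIncrement() is the O(1) alternative).
long maxPhase() {
    long max = -1;
    for (int i = 0; i < state.length(); i++)
        max = Math.max(max, state.get(i).phase);
    return max;
}

boolean isStillPending(int tid, long phase) {
    OpDesc d = state.get(tid);
    return d.pending && d.phase <= phase;
}

void enqueue(int tid, int value) {
    long phase = maxPhase() + 1;  // announce the operation with a fresh phase
    state.set(tid, new OpDesc(phase, true, true, new Node(value, tid)));
    help(phase);
    finishEnqueue();              // make sure tail is fixed before returning
}

void helpEnqueue(int tid, long phase) {
    while (isStillPending(tid, phase)) {
        Node last = tail.get();
        Node next = last.next.get();
        if (last == tail.get()) {
            if (next == null) {
                // Step 1: link the announced node after the current last node.
                if (isStillPending(tid, phase)
                        && last.next.compareAndSet(null, state.get(tid).node)) {
                    finishEnqueue();
                    return;
                }
            } else {
                finishEnqueue();  // another enqueue is in progress: finish it first
            }
        }
    }
}

void finishEnqueue() {
    Node last = tail.get();
    Node next = last.next.get();
    if (next != null) {
        int tid = next.enqTid;    // enqTid tells helpers whose operation this is
        OpDesc cur = state.get(tid);
        if (last == tail.get() && cur.node == next) {
            // Step 2: mark the operation as performed (its linearization).
            OpDesc done = new OpDesc(cur.phase, false, true, next);
            state.compareAndSet(tid, cur, done);
        }
        tail.compareAndSet(last, next);  // Step 3: fix tail
    }
}
```

Note how Steps 2 and 3 live in finishEnqueue, which any thread may call: the enqTid stored in the appended node is what lets a helper find the owner's state entry and linearize the operation on its behalf.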
dequeue operation (walkthrough: thread 2 performs dequeue() on the queue 12 -> 4 -> 17, where 12 is the dummy node)
- Announcing a new operation: thread 2 writes a new state entry with a higher phase (10), pending = true, enqueue = false, and node = null
- Updating state to refer to the first node: a CAS on thread 2's state entry sets its node field to the current dummy node (12)
- Step 1: Initial change of the internal structure: a CAS changes the dummy node's deqTid from -1 to 2
- Step 2: Updating the state of the operation-in-progress as being performed: a CAS on thread 2's state entry resets pending to false
- Step 3: Fixing the internal structure: a CAS advances head to the next node, which becomes the new dummy; the returned value is 4
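The dequeue side of the sketch mirrors this walkthrough. As before, this continues the hypothetical WFQueueSketch class and is illustrative only; IllegalStateException stands in for whatever empty-queue signal the caller expects.

```java
// Inside WFQueueSketch (continuing the sketch): the dequeue side.
int dequeue(int tid) {
    long phase = maxPhase() + 1;  // announce the operation with a fresh phase
    state.set(tid, new OpDesc(phase, true, false, null));
    help(phase);
    finishDequeue();
    Node node = state.get(tid).node;
    if (node == null) throw new IllegalStateException("empty queue");
    return node.next.get().value; // node is the old dummy; its successor holds the value
}

void helpDequeue(int tid, long phase) {
    while (isStillPending(tid, phase)) {
        Node first = head.get();
        Node last = tail.get();
        Node next = first.next.get();
        if (first != head.get()) continue;  // inconsistent snapshot: retry
        if (first == last) {
            if (next == null) {
                // The queue looks empty: record that (node stays null).
                OpDesc cur = state.get(tid);
                if (last == tail.get() && isStillPending(tid, phase)) {
                    OpDesc done = new OpDesc(cur.phase, false, false, null);
                    state.compareAndSet(tid, cur, done);
                }
            } else {
                finishEnqueue();  // an enqueue is in progress: help it first
            }
        } else {
            OpDesc cur = state.get(tid);
            if (!isStillPending(tid, phase)) break;
            if (first == head.get() && cur.node != first) {
                // Point the state entry at the current dummy node.
                OpDesc prep = new OpDesc(cur.phase, true, false, first);
                if (!state.compareAndSet(tid, cur, prep)) continue;
            }
            // Step 1: stamp the dummy node with the dequeuer's ID.
            first.deqTid.compareAndSet(-1, tid);
            finishDequeue();
        }
    }
}

void finishDequeue() {
    Node first = head.get();
    Node next = first.next.get();
    int tid = first.deqTid.get();
    if (tid != -1) {
        OpDesc cur = state.get(tid);
        if (first == head.get() && next != null) {
            // Step 2: mark the operation as performed (its linearization).
            OpDesc done = new OpDesc(cur.phase, false, false, cur.node);
            state.compareAndSet(tid, cur, done);
            // Step 3: fix head; the successor becomes the new dummy.
            head.compareAndSet(first, next);
        }
    }
}
```

The extra CAS that points the state entry at the dummy node is what makes Step 1 unambiguous: once deqTid is stamped, every helper agrees on which thread's dequeue owns the removed node.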
Performance evaluation

Architecture                                  # threads  RAM   OS
two 2.5 GHz quad-core Xeon E5420 processors   8          16GB  CentOS 5.5 Server
two 1.6 GHz quad-core Xeon E5310 processors   8          16GB  Ubuntu 8.10 Server
(third configuration not recoverable)         8          16GB  RedHat Enterprise 5.3 Server

Java: Sun's Java SE Runtime 1.6.0, update 22, 64-bit Server VM (all configurations)

Benchmarks
- Enqueue-Dequeue benchmark:
  - the queue is initially empty
  - each thread iteratively performs an enqueue and then a dequeue
  - 1,000,000 iterations per thread
- 50%-Enqueue benchmark:
  - the queue is initialized with 1000 elements
  - each thread decides uniformly at random which operation to perform, with equal odds for enqueue and dequeue
  - 1,000,000 operations per thread

Tested algorithms
- Compared implementations:
  - MS-queue
  - the base wait-free queue
  - the optimized wait-free queue
    - Opt 1: optimized helping (help one thread at a time)
    - Opt 2: atomic counter-based phase calculation
- We measure completion time as a function of the number of threads

Enqueue-Dequeue benchmark
TBD: add figures

The impact of optimizations
TBD: add figures

Optimizing further: false sharing
- Created on accesses to the state array
- Resolved by stretching the state entries with dummy pads
TBD: add figures

Optimizing further: memory management
- Every attempt to update state is preceded by the allocation of a new record
  - these records can be reused when the attempt fails
  - (more) validation checks can be performed to reduce the number of failed attempts
- When an operation is finished, remove the reference from state to the list node
  - helps the garbage collector

Implementing the queue without GC
- Apply the Hazard Pointers technique [Michael'04]:
  - each thread is associated with hazard pointers: single-writer multi-reader registers used by threads to point to objects they may access later
  - when an object should be deleted, a thread stores its address in a special stack
  - once in a while, the thread scans the stack and recycles an object only if no hazard pointers point to it
- In our case, the technique can be applied with a slight modification in the dequeue method

Summary
- The first wait-free queue implementation supporting multiple enqueuers and dequeuers
- Wait-freedom incurs an inherent trade-off:
  - it bounds the completion time of a single operation
  - but it has a cost in the "typical" case
- The additional cost can be reduced and become tolerable
- The proposed design scheme might be applicable to other wait-free data structures

Thank you! Questions?