CONCURRENT
PROGRAMMING
Introduction to Locks and Lock-free data structures
Agenda
• Concurrency and mutual exclusion
• Mutual exclusion without hardware primitives
• Mutual exclusion using locks and critical sections
• Lock-based stack
• Lock freedom
• Reasoning about concurrency: linearizability
• Disadvantages of lock-based data structures
• A lock-free stack using CAS
• The ABA problem in the stack we just implemented, and a fix
• Other problems with CAS
• We need better hardware primitives: transactional memory
Mutual Exclusion
• Mutual exclusion aims to avoid the simultaneous use of a common resource.
  • E.g., global variables, databases, etc.
• Solutions:
  • Software: Peterson's algorithm, Dekker's algorithm, the Bakery algorithm, etc.
  • Hardware: atomic test-and-set, compare-and-set, LL/SC, etc.
Using the Hardware Instruction Test-and-Set
• Test-and-Set (from here on, TS) on a boolean variable flag:

  #atomic
  // The two lines below execute one after the other without
  // interruption; TS returns true iff it set the flag.
  if (flag == false)
      { flag = true; return true; }
  return false;
  #end atomic

• A spin-lock built from TS:

  bool lock = false;        // shared lock variable

  // Process i
  init i;
  while (true) {
      while (!TS(lock)) {}  // entry protocol: retry until TS succeeds
      critical section #i;
      lock = false;         // exit protocol
      // remainder of code
  }
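The TS-based spin-lock above can be sketched as runnable C. This is a minimal sketch, assuming the GCC builtin __sync_lock_test_and_set as the atomic test-and-set primitive; the function names spin_lock/spin_unlock and run_two_threads are illustrative, not from the slides:

```c
// Sketch of a TS-based spin-lock in C using GCC atomic builtins.
#include <pthread.h>

static volatile int lock = 0;   // shared lock variable, 0 = free
static long counter = 0;        // the shared resource

static void spin_lock(volatile int *l) {
    // __sync_lock_test_and_set atomically writes 1 and returns the old
    // value; spin until the old value was 0, i.e. until we acquired it.
    while (__sync_lock_test_and_set(l, 1)) {}
}

static void spin_unlock(volatile int *l) {
    __sync_lock_release(l);     // atomically write 0, release semantics
}

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        spin_lock(&lock);       // entry protocol
        counter++;              // critical section
        spin_unlock(&lock);     // exit protocol
    }
    return NULL;
}

long run_two_threads(void) {
    pthread_t t1, t2;
    counter = 0;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return counter;             // 200000 iff mutual exclusion held
}
```

If mutual exclusion were violated, some increments would be lost and the final count would fall short of 200000.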
Software Solution: Peterson's Algorithm
• One of the purely software solutions to the mutual exclusion problem, based on shared memory.
• A simple solution for two processes, P0 and P1, that would like to share the use of a single resource R.
• More rigorously: P1 shouldn't have access to R while P0 is modifying/reading R, and vice versa.
[Diagram: P0 and P1 both contending for the shared resource R]
Peterson's Algorithm: Two-Process Version
• Requires one global int variable (turn), and one bool variable (flag) per process.
• The global variable turn arbitrates; each process signals with its own flag variable.
• flag[0] = true is process P0's signal that it wants to enter the critical section.
• turn = 0 says that it is process P0's turn to enter the critical section.
• Can be extended to N processes.
How to Think About It
• Imagine you are in a hallway that is only wide enough for one person to walk.
• However, you see a guy walking in the opposite direction.
• Once you approach him, you have two options:
  • Be a gentleman and step to the side so that he may walk first, and continue after he passes (Peterson's algorithm).
  • Beat him up and walk over him (critical section violation).
The Algorithm in Code
// Shared variables
bool flag[2] = {false, false};
int turn = 0;

// Process 0
init;
while (true) {
    // entry protocol
    flag[0] = true;
    turn = 1;
    while (flag[1] && turn == 1) {}
    critical section #0;
    // exit protocol
    flag[0] = false;
    // remainder code
}

// Process 1
init;
while (true) {
    // entry protocol
    flag[1] = true;
    turn = 0;
    while (flag[0] && turn == 0) {}
    critical section #1;
    // exit protocol
    flag[1] = false;
    // remainder code
}
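Peterson's algorithm can be sketched as runnable C. This is a sketch under one loud assumption: the __sync_synchronize() full memory barriers are my addition, because real hardware reorders the store to flag with the later load of the other process's flag, and Peterson's is not correct without some such fence:

```c
// Sketch of two-process Peterson's algorithm with explicit memory fences.
#include <pthread.h>

static volatile int flag[2] = {0, 0};  // shared: each process's intent
static volatile int turn = 0;          // shared: whose turn it is
static long counter = 0;               // the shared resource R

static void enter(int self) {
    int other = 1 - self;
    flag[self] = 1;                    // signal intent to enter
    turn = other;                      // give the other process priority
    __sync_synchronize();              // make the stores visible before reading
    while (flag[other] && turn == other) {}   // entry protocol: busy-wait
    __sync_synchronize();              // keep the critical section after the wait
}

static void leave(int self) {
    __sync_synchronize();              // finish critical-section writes first
    flag[self] = 0;                    // exit protocol
}

static void *proc(void *arg) {
    int self = *(int *)arg;
    for (int i = 0; i < 100000; i++) {
        enter(self);
        counter++;                     // critical section
        leave(self);
    }
    return NULL;
}

long peterson_demo(void) {
    pthread_t t0, t1;
    int id0 = 0, id1 = 1;
    counter = 0;
    pthread_create(&t0, NULL, proc, &id0);
    pthread_create(&t1, NULL, proc, &id1);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    return counter;                    // 200000 iff mutual exclusion held
}
```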
Requirements for Peterson's
• Reads and writes have to be atomic.
• No reordering of instructions or memory accesses:
  • Even in-order processors sometimes reorder memory accesses, even though they don't reorder instructions.
  • In that case one needs to use memory barrier instructions.
• Visibility: any change to a variable has to take immediate effect, so that every process knows about it.
  • Keyword volatile in Java.
So Why Don't People Use Peterson's?
• Notice the while loop in the algorithm:

  while (flag[1] && turn == 1) {}

• If process 0 waits a long time to enter the critical section, it continually checks flag and turn to see whether it can enter, while not doing any useful work.
• This is termed busy waiting, and locking mechanisms like Peterson's have a major disadvantage in that regard.
• Locks that employ a continuous checking mechanism for a flag are called spin-locks.
• Spin-locks are good when you know that the wait will be short.
Properties of Peterson's Algorithm
• Mutual exclusion.
• Absence of livelocks and deadlocks:
  • A livelock is similar to a deadlock, except that the competing processes continually change their states, yet neither makes any progress.
• Eventual entry: guaranteed even if the scheduling policy is only weakly fair.
  • A weakly fair scheduling policy guarantees that if a process requests to enter its critical section (and does not withdraw the request), the process will eventually enter its critical section.
Comparison with Test and Set

                               Test and Set                  Peterson's algorithm
Mutual exclusion               Yes                           Yes
Absence of deadlocks           Yes                           Yes
Absence of unnecessary delay   Yes                           Yes
Eventual entry                 Strongly fair scheduling      Weakly fair scheduling
                               policy required               policy suffices
Practical issues               Special instructions;         Standard instructions;
                               easy to implement for any     > 2 processes becomes
                               number of processors          complex but doable
Putting It All Together: A Lock-Based Stack
• Stack: a list- or array-based data structure that enforces last-in-first-out ordering of elements.
• Operations:
  • void push(T data): pushes the variable data onto the stack.
  • T pop(): removes the last item that was pushed onto the stack. Throws a StackEmptyException if the stack is empty.
  • int size(): returns the size of the stack.
• All operations are synchronized using one common lock object.
Code: Java

class Stack<T> {
    ArrayList<T> _container = new ArrayList<T>();
    ReentrantLock _lock = new ReentrantLock();

    public void push(T data) {
        _lock.lock();
        _container.add(data);
        _lock.unlock();
    }

    public int size() {
        int retVal;
        _lock.lock();
        retVal = _container.size();
        _lock.unlock();
        return retVal;
    }

    public T pop() {
        _lock.lock();
        if (_container.isEmpty()) {
            _lock.unlock();
            throw new RuntimeException("Stack Empty");
        }
        T retVal = _container.remove(_container.size() - 1);
        _lock.unlock();
        return retVal;
    }
}
Problems with Locks
• The stack is simple enough: there is only one lock, and the overhead isn't that much. But there are data structures that could have multiple locks.
• Problems with locking:
  • Deadlock
  • Priority inversion
  • Convoying
  • Kill-tolerant availability
  • Preemption tolerance
  • Overall performance
Problems with Locking 2
• Priority inversion:
  • Assume two threads:
    • T1 with very low priority
    • T2 with very high priority
  • Both need to access a shared resource R, but T1 holds the lock to R.
  • T1 takes a long time to complete its operation, leaving the higher-priority thread waiting; by extension, T2 has effectively achieved T1's lower priority.
• Possible solution: priority inheritance.
Problems with Locking 3
• Deadlock: processes can't proceed because each of them is waiting for the other to release a needed resource.
• Scenario:
  • There are two locks, A and B.
  • Process 1 needs A and B, in that order, to safely execute.
  • Process 2 needs B and A, in that order, to safely execute.
  • Process 1 acquires A, and Process 2 acquires B.
  • Now Process 1 is waiting for Process 2 to release B, and Process 2 is waiting for Process 1 to release A.
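The standard fix for this scenario is to impose a global order on locks. A minimal sketch in C, where both processes take A before B so the circular wait can never form (the lock names mirror the scenario above; the demo function is illustrative):

```c
// Deadlock avoidance by global lock ordering: every thread acquires
// A before B, so no thread ever holds B while waiting for A.
#include <pthread.h>

static pthread_mutex_t A = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t B = PTHREAD_MUTEX_INITIALIZER;
static int work_done = 0;

static void *process(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000; i++) {
        pthread_mutex_lock(&A);   // both processes take A first...
        pthread_mutex_lock(&B);   // ...then B: circular wait is impossible
        work_done++;              // work that needs both resources
        pthread_mutex_unlock(&B);
        pthread_mutex_unlock(&A);
    }
    return NULL;
}

int ordered_locks_demo(void) {
    pthread_t p1, p2;
    work_done = 0;
    pthread_create(&p1, NULL, process, NULL);
    pthread_create(&p2, NULL, process, NULL);
    pthread_join(p1, NULL);
    pthread_join(p2, NULL);
    return work_done;             // 2000: both processes finished, no deadlock
}
```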
Problems with Locking 4
• Convoying: all the processes need a lock A to proceed, but a lower-priority process acquires A first. Then all the other processes slow down to the speed of the lower-priority process.
• Think of a freeway:
  • You are driving an Aston Martin, but you are stuck behind a beat-up old pickup truck that is moving very slowly, and there is no way to overtake it.
Problems with Locking 5
• Kill tolerance:
  • What happens when a process holding a lock is killed?
  • Everybody else waiting for the lock may never end up getting it, and would wait forever.
• 'Async-signal safety':
  • Signal handlers can't use lock-based primitives. Why?
  • Suppose a thread receives a signal while holding a user-level lock in the memory allocator.
  • The signal handler executes, calls malloc, and wants the same lock, which will never be released.
Problems with Locking 6
• Overall performance:
  • Arguable: efficient lock-based algorithms exist.
  • There is a constant struggle between simplicity and efficiency.
  • Example: a thread-safe linked list with lots of nodes.
    • Lock the whole list for every operation?
    • Reader/writer locks?
    • Allow locking individual elements of the list?
A Possible Solution
Lock-free Programming
Lock-free Data Structures
• A data structure wherein there are no explicit locks used for achieving synchronization between multiple threads, and the progress of one thread doesn't block/impede the progress of another.
• Doesn't imply starvation freedom (meaning one thread could potentially wait forever), but nobody starves in practice.
• Advantages: you don't run into all the problems with locks listed earlier.
• Disadvantages: to be discussed later.
Lock-free Programming
• Think in terms of Algorithms + Data Structures = Programs.
• Thread-safe access to shared data without the use of locks, mutexes, etc.
• Possible, but not practical/feasible in the absence of hardware support.
• So what do we need?
  • A compare-and-set primitive from the hardware guys, abbreviated CAS (discussed on the next slide).
• Interesting tidbit:
  • Lots of music sharing and streaming applications use lock-free data structures: PortAudio, PortMidi, and SuperCollider.
Lock-free Programming
• Compare-and-set primitive:

  boolean cas(int *valueToChange, int valueToSetTo, int valueToCompareTo)

• Semantics: the pseudocode below executes atomically, without interruption:

  if (*valueToChange == valueToCompareTo) {
      *valueToChange = valueToSetTo;
      return true;
  } else {
      return false;
  }

• This function is exposed in Java through the atomic package; in C++, depending on the OS and architecture, you find libraries.
• CAS is all you need for lock-free queues, stacks, linked lists, and sets.
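A sketch of CAS in practice, using the GCC builtin __sync_bool_compare_and_swap (its argument order is (ptr, expected, new), unlike the cas() above). The classic lock-free pattern is the read-compute-CAS retry loop; the function names here are illustrative:

```c
// Lock-free increment via a CAS retry loop.
#include <pthread.h>

static int counter = 0;   // shared variable updated without any lock

static void atomic_increment(int *p) {
    int old;
    do {
        old = *p;         // read the current value
        // CAS succeeds only if *p still equals old, i.e. nobody raced us;
        // on failure we re-read and retry.
    } while (!__sync_bool_compare_and_swap(p, old, old + 1));
}

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++)
        atomic_increment(&counter);
    return NULL;
}

int cas_demo(void) {
    pthread_t t1, t2;
    counter = 0;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return counter;       // 200000 iff no increments were lost
}
```

With a plain `(*p)++` instead of the CAS loop, two racing threads would lose updates; the CAS loop detects the race and retries.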
Trick to Building Lock-free Data Structures
• Limit the scope of changes to a single atomic variable:
  • Stack: head
  • Queue: head or tail, depending on enqueue or dequeue
A Simple Lock-free Example
• A lock-free stack, adapted from Geoff Langdale at CMU.
  • Intended to illustrate the design of lock-free data structures and the problems with lock-free synchronization.
• There is a primitive operation we need:
  • CAS, or compare and set.
  • Available on most modern machines:
    • x86 assembly: cmpxchg (with the lock prefix)
    • PowerPC assembly: LL (load linked) / SC (store conditional)
Lock-free Stack with Ints in C
• A stack based on a singly linked list. Not a particularly good design!

  typedef struct Node {
      int data;
      struct Node *next;
  } Node;

  Node *head;   // the head of the list

• Now that we have the nodes, let us proceed to the meat of the stack.
Lock-free Stack: Push

void push(int t) {
    Node *node = malloc(sizeof(Node));
    node->data = t;
    do {
        node->next = head;
    } while (!cas(&head, node, node->next));
}

Let us see how this works!
Push in Action
[Diagram: Head -> 6 -> 10]
• Currently Head points to the node containing data 6.
Push in Action
[Diagram: Head -> 6 -> 10]
• Two threads, T1 and T2, come along wanting to push 7 and 8 respectively, by calling the push function:
  T1: push(7);    T2: push(8);
Push in Action
[Diagram: Head -> 6 -> 10, with T1's new node (7) and T2's new node (8) off to the side]
• Both threads execute:
  Node *node = malloc(sizeof(Node));
  node->data = 7;   // T1 (node->data = 8 for T2)
• Two new node structs will be created on the heap, in parallel, after the execution of the code shown.
Push in Action
[Diagram: T1's node (7) and T2's node (8) both have next pointing at the node containing 6; Head -> 6 -> 10]
• Both threads execute:
  do {
      node->next = head;
  } while (!cas(&head, node, node->next));
• The code means: set the newly created node's next to head; if head still points to 6, then change the head pointer to point to the new node.
• Both threads try to execute this portion of the code. But only one will succeed.
Push in Action
• Let us assume T1 succeeds; T1 therefore exits the while loop and, consequently, push().
[Diagram: Head -> 7 -> 6 -> 10; T2's node (8) still points at 6]
• T2's CAS failed. Why? Hint: look at the picture; head no longer points to 6.
• T2 has no choice but to try again.
Push in Action
• Assume T2 succeeds this time, because no one else is trying to push.
[Diagram: Head -> 8 -> 7 -> 6 -> 10]
Pop()

bool pop(int& t) {
    Node* current = head;
    while (current) {
        if (cas(&head, current->next, current)) {
            t = current->data; // problem?
            return true;
        }
        current = head;
    }
    return false;
}

There is something wrong with this code. It is very subtle. Can you figure it out? Most of the time this piece of code will work.
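Putting push and pop together gives a runnable single-threaded sanity check. This sketch implements cas() with the GCC builtin __sync_bool_compare_and_swap, adapting the argument order to match the slides; the free() in pop is safe only because this check is single-threaded, which is exactly the hazard the following slides explore:

```c
// Single-threaded sanity check of the CAS-based stack.
#include <stdlib.h>
#include <stdbool.h>

typedef struct Node {
    int data;
    struct Node *next;
} Node;

static Node *head = NULL;

// cas(address, newValue, expectedValue), matching the slides' argument
// order; the GCC builtin's order is (address, expected, new).
static bool cas(Node **addr, Node *newval, Node *expected) {
    return __sync_bool_compare_and_swap(addr, expected, newval);
}

void push(int t) {
    Node *node = malloc(sizeof(Node));
    node->data = t;
    do {
        node->next = head;
    } while (!cas(&head, node, node->next));
}

bool pop(int *t) {
    Node *current = head;
    while (current) {
        if (cas(&head, current->next, current)) {
            *t = current->data;
            free(current);   // fine single-threaded; dangerous concurrently
            return true;
        }
        current = head;
    }
    return false;
}
```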
It is called the ABA problem
• While a thread tries to modify A, what happens if A gets changed to B, then back to A?
• malloc recycles addresses. It has to, eventually.
• Now imagine this scenario. The curly braces contain the address of each node:
[Diagram: Head -> 6 {0x90} -> 10 {0x89}]
ABA problem illustration: Step 1
[Diagram: Head -> 6 {0x90} -> 10 {0x89}]
• Assume two threads, T1 and T2.
• T1 calls pop() to delete the node at 0x90, but before it has a chance to CAS, there is a context switch and T1 goes to sleep.

bool pop(int& t) {
    Node* current = head;
    while (current) {
        if (cas(&head, current->next, current)) {
            t = current->data; // problem?
            delete current;    // the node's memory goes back to the allocator
            return true;
        }
        current = head;
    }
    return false;
}
ABA problem illustration: Step 2
[Diagram: Head -> 6 {0x90} -> 10 {0x89}]
• The following happens while T1 is asleep:
ABA problem illustration: Step 3
[Diagram: Head -> 10 {0x89}; the node at 0x90 is gone]
• The following happens while T1 is asleep:
  • T2 calls pop(); the node at 0x90 is deleted.
ABA problem illustration: Step 4
[Diagram: the list is now empty; both nodes are gone]
• The following happens while T1 is asleep:
  • T2 calls pop(); the node at 0x90 is deleted.
  • T2 calls pop(); the node at 0x89 is deleted.
ABA problem illustration: Step 5
[Diagram: a new node containing 11 now occupies the recycled address 0x90]
• The following happens while T1 is asleep:
  • T2 calls pop(); the node at 0x90 is deleted.
  • T2 calls pop(); the node at 0x89 is deleted.
  • T2 calls push(11), but malloc has recycled the memory 0x90 while allocating space for the new node.
ABA problem illustration: Step 6
[Diagram: T1's stale CAS has redirected Head to the freed node at 0x89]
• Replace 10 and 6 with B and A, and now you know where the name "ABA" comes from.
• The following happened while T1 was asleep:
  • T2 called pop(); the node at 0x90 was deleted.
  • T2 called pop(); the node at 0x89 was deleted.
  • T2 called push(11), but malloc recycled the memory 0x90 while allocating space for the new node.
• T1 now wakes up, and its CAS succeeds, because head once again contains the address 0x90.
• Head is now pointing to illegal memory!
Solutions
• Double-word compare and set:
  • One 32-bit word for the address.
  • One 32-bit word for an update count, which is incremented every time a node is updated.
  • Compare and set iff both of the above match.
  • Java provides AtomicStampedReference.
• Use the lower address bits of the pointer (if the memory is 4/8-byte aligned) to keep the update counter:
  • But the probability of a false positive is higher than with double-word compare-and-set, because instead of 2^32 choices for the counter you have only 2^2 or 2^3 choices.
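The counter/stamp idea can be sketched in C by packing a 32-bit value and a 32-bit update count into one 64-bit word and CASing the pair together, which is roughly what Java's AtomicStampedReference does for references. The function names and the aba_demo scenario below are illustrative:

```c
// Stamped CAS: a stale CAS fails even if the value went A -> B -> A,
// because the update count has moved on.
#include <stdint.h>
#include <stdbool.h>

static uint64_t pack(uint32_t value, uint32_t count) {
    return ((uint64_t)count << 32) | value;
}
static uint32_t count_of(uint64_t w) { return (uint32_t)(w >> 32); }

// Install newValue only if both the value and its stamp are unchanged;
// every successful update bumps the stamp.
static bool stamped_cas(uint64_t *w, uint64_t expected, uint32_t newValue) {
    uint64_t desired = pack(newValue, count_of(expected) + 1);
    return __sync_bool_compare_and_swap(w, expected, desired);
}

// Replay the slides' scenario: T1 reads A, T2 changes A -> B -> A,
// then T1's stale CAS must be rejected.
int aba_demo(void) {
    uint64_t shared = pack(6, 0);      // value A = 6, count = 0
    uint64_t stale = shared;           // T1's snapshot before it sleeps

    stamped_cas(&shared, shared, 11);  // T2: A -> B   (count = 1)
    stamped_cas(&shared, shared, 6);   // T2: B -> A   (count = 2)

    // T1 wakes up: the value matches, but the count does not.
    return stamped_cas(&shared, stale, 42) ? 1 : 0;   // 0 = correctly rejected
}
```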
Disadvantages of Lock-free Data Structures
• Current hardware limits the number of bits available in a CAS operation to 32/64.
• Data structures like BSTs pose a problem:
  • When you need to balance a tree, you need to update several nodes all at once.
• A way to get around it:
  • Transactional memory based systems.
Language Support
• C++0x (the new C++): atomic_compare_exchange()
• Current C++: pthreads library
• GCC:
  • type __sync_val_compare_and_swap(type *ptr, type oldval, type newval)
  • More info: http://gcc.gnu.org/onlinedocs/gcc/Atomic-Builtins.html
• Java:
  • Package java.util.concurrent.atomic.*
    • AtomicInteger, AtomicBoolean, etc.: atomic access to a single int, boolean, etc.
    • AtomicStampedReference: associates an int with a reference.
    • AtomicMarkableReference: associates a boolean with a reference.
• Your own CAS:
  • Write an inline assembly function that uses CMPXCHG or LL/SC, depending on the hardware platform.
Performance of Lock-based vs. Lock-free under Moderate Contention
[Graph omitted]

Performance of Lock-based vs. Lock-free under Very High Contention (almost unrealistic)
[Graph omitted]
Ensuring Correctness of Concurrent Objects
• We want to ensure two properties of concurrent objects (e.g., FIFO queues):
  • Safety: object behavior is correct as per the specification.
    • Behavior of a FIFO queue:
      • If two enqueues, x and y, happen in parallel (assume the queue is initially empty), then the next dequeue should return either x or y, never some other z.
      • If enqueues y and z happened one after the other in real time, then deq() should return y first and z second.
  • Overall progress: the conditions under which at least one thread will make progress.
Ensuring Correctness in Concurrent Implementations
• Linearizability:
  • Each method call (enq and deq, in the case of a queue) should appear to take place instantaneously, sometime between the start and the end of the method call.
    • Meaning no other thread can see the change to the data structure happen in a step-by-step fashion.
  • In English: if the concurrent execution can be mapped to a valid (meaning correct) sequential execution on the object, then we assume that it is correct.
  • Moreover, this can be used as an intuitive way to reason about concurrent objects.
  • You already know it, because you use it unknowingly:
    • Think of a shared single-lock FIFO queue.
Linearizability: Intuitively
• Consider the deq() method for a queue that uses a single shared lock for mutual exclusion:

public T deq() throws EmptyException {
    lock.lock();
    try {
        if (tail == head)
            throw new EmptyException();
        T x = items[head % items.length];
        head++;
        return x;
    } finally {
        lock.unlock();
    }
}

• All modifications of the queue are done mutually exclusively, and therefore essentially happen in sequence.
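The same single-lock discipline can be sketched in C with a pthread mutex: one lock guards head, tail, and the items array, so every enq/deq takes effect atomically between lock and unlock, and the queue's behavior is trivially a sequence. The bounded-buffer layout and function names are my assumptions:

```c
// Single-lock FIFO queue: every operation holds the one mutex, so the
// concurrent behavior maps directly onto a sequential execution.
#include <pthread.h>
#include <stdbool.h>

#define CAP 16

static int items[CAP];
static int head = 0, tail = 0;   // tail == head means empty
static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;

bool enq(int x) {
    bool ok = false;
    pthread_mutex_lock(&qlock);
    if (tail - head < CAP) {     // not full
        items[tail % CAP] = x;
        tail++;
        ok = true;
    }
    pthread_mutex_unlock(&qlock);
    return ok;
}

bool deq(int *x) {
    bool ok = false;
    pthread_mutex_lock(&qlock);
    if (tail != head) {          // not empty
        *x = items[head % CAP];
        head++;
        ok = true;
    }
    pthread_mutex_unlock(&qlock);
    return ok;
}
```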
Linearizability: Intuitively, for the Single-Lock Queue
[Timeline diagram: q.enq(x) executes lock(), enq, unlock(); q.deq(x) then executes lock(), deq, unlock(). The lock forces the method bodies into sequence; the points where each method takes effect are its linearization points. Since enq(x) precedes deq(x), the "sequential" behavior enq-then-deq is correct for q.]
(Art of Multiprocessor Programming by Maurice Herlihy)
Linearizability: Intuitively
• Each method of the object should "take effect":
  • Instantaneously
  • Between the invocation and the response of the method call
• The object is correct if this "sequential" behavior is correct.
  • Generalization: this can happen with or without mutual exclusion (that is an implementation detail).
• Any such concurrent object is linearizable.
(Art of Multiprocessor Programming by Maurice Herlihy)