Inside Synchronization Jeff Chase Duke University

Threads and blocking
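[Figure: layered thread stack with a thread state diagram, marked “PG-13”.]
• thread API: e.g., pthreads or Java threads
• thread library: threads, mutexes, condition variables…
• kernel interface for thread libs (not for users)
• kernel thread support: raw “vessels”, e.g., Linux CLONE_THREAD + “futex”
• thread states: active (ready or running) and blocked; wait moves a thread from active to blocked, and wakeup/signal moves it back.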
Threads can enter the kernel
(fault or trap) and block.
This slide applies to the process
abstraction too, or, more precisely,
to the main thread of a process.
Blocking
When a thread is blocked
on a synchronization object
(a mutex or CV) its TCB is
placed on a sleep queue
of threads waiting for an
event on that object.
How to synchronize thread
queues and sleep/wakeup
inside the kernel?
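[Figure: thread state diagram. sleep/wait moves a thread from active (ready or running) to blocked; wakeup/signal moves it back. The kernel TCB of a blocked thread sits on a sleep queue; the TCB of a ready thread sits on the ready queue.]
Interrupts drive many wakeup events.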
Overview
• Consider multicore synchronization (inside the kernel)
from first principles.
• Details vary from system to system and machine to
machine….
• I’m picking and choosing.
Spinlock: a first try
int s = 0;
lock() {
while (s == 1)
{};
ASSERT (s == 0);
s = 1;
}
unlock() {
ASSERT(s == 1);
s = 0;
}
Spinlocks provide mutual exclusion
among cores without blocking.
Global spinlock variable
Busy-wait until lock is free.
Spinlocks are useful for lightly
contended critical sections where
there is no risk of preemption of a
thread while it is holding the lock, i.e.,
in the lowest levels of the kernel.
Spinlock: what went wrong
int s = 0;
lock() {
while (s == 1)
{};
s = 1;
}
unlock() {
s = 0;
}
Race to acquire.
Two (or more) cores see s == 0.
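The race can be replayed deterministically. The sketch below is a single-threaded simulation, not real multicore code: it splits the broken lock() into its load/test step and its store step and runs the bad schedule by hand. The names `struct core`, `load_and_test`, `store_and_enter`, and `cores_in_critical_section` are hypothetical and do not come from the slides.

```c
#include <assert.h>

/* Shared lock flag from the slide: 0 = free, 1 = held. */
static int s = 0;

/* One simulated core's private view while it runs the broken lock(). */
struct core {
    int observed;   /* value of s loaded by the "while (s == 1)" test */
    int in_cs;      /* 1 once this core believes it holds the lock */
};

/* Step 1: load s so it can be tested. */
static void load_and_test(struct core *c) { c->observed = s; }

/* Step 2: if the test saw 0, store 1 and enter the critical section. */
static void store_and_enter(struct core *c) {
    if (c->observed == 0) {
        s = 1;
        c->in_cs = 1;
    }
}

/* Replay the bad schedule: both loads happen before either store.
 * Returns how many cores ended up inside the critical section. */
int cores_in_critical_section(void) {
    struct core a = {0, 0}, b = {0, 0};
    s = 0;
    load_and_test(&a);      /* core A sees s == 0 */
    load_and_test(&b);      /* core B also sees s == 0 */
    store_and_enter(&a);    /* A sets s = 1 and enters */
    store_and_enter(&b);    /* B sets s = 1 again and enters too */
    return a.in_cs + b.in_cs;
}
```

Under this schedule both simulated cores enter, which is exactly the mutual exclusion failure the slide describes.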
We need an atomic “toehold”
• To implement safe mutual exclusion, we need support
for some sort of “magic toehold” for synchronization.
– The lock primitives themselves have critical sections to test
and/or set the lock flags.
• Safe mutual exclusion on multicore systems requires
some hardware support: atomic instructions
– Examples: test-and-set, compare-and-swap, fetch-and-add.
– These instructions perform an atomic read-modify-write of a
memory location. We use them to implement locks.
– If we have any of those, we can build higher-level
synchronization objects like monitors or semaphores.
– Note: we also must be careful of interrupt handlers….
– They are expensive, but necessary.
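As one illustration of such a toehold, C11 exposes test-and-set through `atomic_flag`. The sketch below assumes a C11 compiler and POSIX threads; `spin_lock`, `spin_unlock`, `worker`, and `run_spinlock_demo` are hypothetical names. It runs in user space only for demonstration, where a lock holder can be preempted, so it is not a recommendation to spin in application code.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static atomic_flag held = ATOMIC_FLAG_INIT;   /* clear = free, set = held */
static long counter = 0;                      /* protected by the spinlock */

static void spin_lock(void) {
    /* test-and-set atomically sets the flag and returns its old value;
     * keep spinning while the old value says the lock was already held. */
    while (atomic_flag_test_and_set_explicit(&held, memory_order_acquire)) { }
}

static void spin_unlock(void) {
    atomic_flag_clear_explicit(&held, memory_order_release);
}

enum { NTHREADS = 4, NITERS = 10000 };

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < NITERS; i++) {
        spin_lock();
        counter++;              /* critical section */
        spin_unlock();
    }
    return NULL;
}

/* Runs NTHREADS workers and returns the final counter value. */
long run_spinlock_demo(void) {
    pthread_t t[NTHREADS];
    counter = 0;
    for (int i = 0; i < NTHREADS; i++) {
        int rc = pthread_create(&t[i], NULL, worker, NULL);
        assert(rc == 0);
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    return counter;
}
```

Because every increment happens while the lock is held, the result is NTHREADS * NITERS regardless of how the threads interleave.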
Atomic instructions: Test-and-Set
Spinlock::Acquire () {
while(held);
held = 1;
}
[Figure: two cores each run load, test, store on the held flag, and the two sequences interleave.]
Problem: interleaved load/test/store.
Solution: TSL atomically sets the flag and leaves the old value in a register.
Wrong
load 4(SP), R2      ; load “this”
busywait:
load 4(R2), R3      ; load “held” flag
bnz R3, busywait    ; spin if held wasn’t zero
store #1, 4(R2)     ; held = 1
Right
load 4(SP), R2      ; load “this”
busywait:
tsl 4(R2), R3       ; test-and-set this->held
bnz R3, busywait    ; spin if held wasn’t zero
One example: tsl, test-and-set-lock (from an old machine)
Spinlock: IA32
Idle the core for a contended lock.
Atomic exchange to ensure safe acquire of an uncontended lock.
Spin_Lock:
CMP lockvar, 0      ; Check if lock is free
JE Get_Lock
PAUSE               ; Short delay
JMP Spin_Lock
Get_Lock:
MOV EAX, 1
XCHG EAX, lockvar   ; Try to get lock
CMP EAX, 0          ; Test if successful
JNE Spin_Lock
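XCHG is an unconditional atomic exchange: it swaps a register with memory location y and leaves the old value of *y in the register.
Compare-and-swap (CMPXCHG on IA32) is the conditional relative: compare x to the value in memory location y; if x == *y then set *y = z. Report success/failure.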
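The same test-then-exchange structure can be written portably with C11 atomics. This is a sketch under the assumption that `atomic_exchange` stands in for XCHG. `lockvar` is taken from the listing, while `try_get_lock`, `spin_lock`, and `spin_unlock` are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_int lockvar = 0;   /* 0 = free, 1 = held */

/* One pass through the Spin_Lock / Get_Lock logic.
 * Returns true if this attempt acquired the lock. */
bool try_get_lock(void) {
    /* CMP lockvar, 0: a plain read first, so a busy lock is not hammered
     * with atomic read-modify-write operations. */
    if (atomic_load_explicit(&lockvar, memory_order_relaxed) != 0)
        return false;
    /* XCHG EAX, lockvar: unconditionally store 1 and fetch the old value. */
    int old = atomic_exchange_explicit(&lockvar, 1, memory_order_acquire);
    /* CMP EAX, 0: we won only if the lock was still free at the exchange. */
    return old == 0;
}

void spin_lock(void) {
    while (!try_get_lock()) {
        /* a real implementation would execute PAUSE (or similar) here */
    }
}

void spin_unlock(void) {
    atomic_store_explicit(&lockvar, 0, memory_order_release);
}
```

The plain read before the exchange is only an optimization; correctness rests on the exchange returning the old value.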
Synchronization accesses
• Atomic instructions also impose orderings on memory
accesses.
• Their execution informs the machine that
synchronization is occurring.
• Cores synchronize with one another by accessing a
shared memory location with atomic instructions.
• When cores synchronize, they establish happens-before ordering relationships among their accesses to
other shared memory locations.
• The machine must ensure a consistent view of memory
that respects these happens-before orderings.
7.1. LOCKED ATOMIC OPERATIONS
The 32-bit IA-32 processors support locked atomic operations on
locations in system memory. These operations are typically used to
manage shared data structures (such as semaphores, segment
descriptors, system segments, or page tables) in which two or more
processors may try simultaneously to modify the same field or flag….
Note that the mechanisms for handling locked atomic operations
have evolved as the complexity of IA-32 processors has evolved….
Synchronization mechanisms in multiple-processor systems may
depend upon a strong memory-ordering model. Here, a program
can use a locking instruction such as the XCHG instruction or the
LOCK prefix to insure that a read-modify-write operation on memory
is carried out atomically. Locking operations typically operate like I/O
operations in that they wait for all previous instructions to complete
and for all buffered writes to drain to memory….
This is just an example of a principle on a particular
machine (IA32): these details aren’t important.
A peek at some deep tech
An execution schedule defines a partial order of program events. The ordering relation (<) is called happens-before.
[Figure: two threads each execute the critical section below; the release in one happens-before (<) the acquire in the other.]
mx->Acquire();
x = x + 1;
mx->Release();
Just three rules govern happens-before order:
1. Events within a thread are ordered.
2. Mutex handoff orders events across threads: the release #N happens-before acquire #N+1.
3. Happens-before is transitive: if (A < B) and (B < C) then A < C.
Two events are concurrent if neither happens-before the other. They might execute in some order, but only by luck. The next schedule may reorder them.
Machines may reorder concurrent events, but they always respect happens-before ordering.
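Rule 2 can be seen with an ordinary POSIX mutex. The sketch below assumes pthreads; `increment` and `run_handoff_demo` are hypothetical names. Whichever thread acquires `mx` second is ordered after the first thread's release, so its read of `x` sees the first increment.

```c
#include <assert.h>
#include <pthread.h>

static pthread_mutex_t mx = PTHREAD_MUTEX_INITIALIZER;
static int x = 0;

static void *increment(void *arg) {
    (void)arg;
    pthread_mutex_lock(&mx);     /* mx->Acquire() */
    x = x + 1;
    pthread_mutex_unlock(&mx);   /* mx->Release() */
    return NULL;
}

/* Runs the critical section in two threads and returns the final x. */
int run_handoff_demo(void) {
    pthread_t t1, t2;
    x = 0;
    pthread_create(&t1, NULL, increment, NULL);
    pthread_create(&t2, NULL, increment, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return x;   /* the joins order both workers before this read */
}
```

The two critical sections may run in either order, but the mutex handoff orders them relative to each other, so the result is 2 either way.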
Happens-before and causality
• We humans have a natural notion of causality.
– Event A caused event B if B happened as a result of A, or A was a
factor in B, or knowledge of A was necessary for B to occur….
• Naturally, event A can cause event B only if A < B!
– (A caused B) ⇒ (A happens-before B), i.e., A precedes B
– This is obvious: events cannot change the past.
• Of course, the converse is not always true.
– It is not true in general that (A < B) ⇒ (A caused B).
– Always be careful in inferring causality.
Causality and inconsistency
• If A caused B, and some thread T observes event B before event A,
then T sees an “inconsistent” event timeline.
– Example: Facebook never shows you a reply to a post before showing
you the post itself. Never happens. It would be too weird.
• That kind of inconsistency might cause a program to fail.
– We’re talking about events that matter for thread interactions at the
machine level: load and store on the shared memory.
Memory ordering
• Shared memory is complex on multicore systems.
• Does a load from a memory location (address) return the
latest value written to that memory location by a store?
• What does “latest” mean in a parallel system?
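[Figure: T1 executes W(x)=1 then R(y); T2 executes W(y)=1 then R(x), both against shared memory M. Each write returns OK and each read returns 1.]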
It is common to presume
that load and store ops
execute sequentially on a
shared memory, and a
store is immediately and
simultaneously visible to
load at all other threads.
But not on real machines.
Memory ordering
• A load might fetch from the local cache and not from memory.
• A store may buffer a value in a local cache before draining the
value to memory, where other cores can access it.
• Therefore, a load from one core does not necessarily return
the “latest” value written by a store from another core.
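[Figure: T1 executes W(x)=1 then R(y); T2 executes W(y)=1 then R(x), both against shared memory M. Each write returns OK, but each read may return 0.]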
A trick called Dekker’s
algorithm supports mutual
exclusion on multi-core
without using atomic
instructions. It assumes
that load and store ops
on a given location
execute sequentially.
But they don’t.
Memory accesses from T1
have no happens-before
ordering defined relative to
the accesses from T2,
unless the program uses
synchronization (e.g., a
mutex handoff) to impose
an ordering.
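The T1/T2 pattern in the figure is often called a store-buffering test. The sketch below assumes C11 atomics and pthreads; `t1`, `t2`, and `count_both_zero` are hypothetical names. It uses the default sequentially consistent atomics, under which both reads returning 0 is forbidden. With plain variables or relaxed atomics, the language and many machines permit that outcome.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static atomic_int x, y;   /* shared locations from the figure */
static int r1, r2;        /* what each thread's read returned */

static void *t1(void *arg) {
    (void)arg;
    atomic_store(&x, 1);      /* W(x)=1, sequentially consistent */
    r1 = atomic_load(&y);     /* R(y) */
    return NULL;
}

static void *t2(void *arg) {
    (void)arg;
    atomic_store(&y, 1);      /* W(y)=1, sequentially consistent */
    r2 = atomic_load(&x);     /* R(x) */
    return NULL;
}

/* Runs the test `rounds` times; returns how often both reads saw 0. */
int count_both_zero(int rounds) {
    int both_zero = 0;
    for (int i = 0; i < rounds; i++) {
        pthread_t a, b;
        atomic_store(&x, 0);
        atomic_store(&y, 0);
        pthread_create(&a, NULL, t1, NULL);
        pthread_create(&b, NULL, t2, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        if (r1 == 0 && r2 == 0)
            both_zero++;
    }
    return both_zero;
}
```

Sequential consistency puts all four accesses in one total order, so whichever store comes first in that order is visible to the other thread's later load.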
Memory Models
A Case for Rethinking Parallel Languages and Hardware.
Sarita Adve and Hans Boehm, Communications of the ACM, Aug 2010, Vol. 53 Issue 8
sequential
A compiler might reorder the two
independent assignments to hide the
latency of loading Y or X.
Modern processors may use a store buffer
to avoid waiting for stores to complete.
Reordering any pair of accesses, reading values
from write buffers, register promotion, common
subexpression elimination, redundant read
elimination: all may violate sequential consistency.
The point of happens-before
• For consistency, we want a load from a location to return the value
written by the “latest” store to that location.
• But what does “latest” mean? It means the load returns the value
from the last store that happens-before the load.
• Machines are free to reorder concurrent accesses.
– Concurrent events have no restriction on their ordering: no happens-before relation. Your program’s correctness cannot depend on the
ordering the machine picks for concurrent events: if the interleaving
matters to you, then you should have used a mutex.
– If there is no mutex, then the events are concurrent, and the machine is
free to choose whatever order is convenient for speed, e.g., it may leave
“old” data in caches and not propagate more “recent” data.
The first thing to understand about
memory behavior on multi-core systems
• Cores must see a “consistent” view of shared memory
for programs to work properly. But what does it mean?
– Answer: it depends. Machines vary.
– But they always respect causality: that is a minimal requirement.
– And since machines don’t know what events really cause others
in a program, they play it safe and respect happens-before.
• Synchronization accesses tell the machine that ordering matters: a
happens-before relationship exists. Machines always respect that.
– Modern machines work for race-free programs.
– Otherwise, all bets are off. Synchronize!
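[Figure: T1 executes W(x)=1 and R(y); T2 executes W(y)=1 and R(x), against shared memory M. T1 passes a lock to T2 (“pass lock”). One read is labeled “0??” and the other “1”.]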
The most you should
assume is that any
memory store before a
lock release is visible to a
load on a core that has
subsequently acquired the
same lock.
The point of all that
• We use special atomic instructions to implement locks.
• E.g., a TSL or CMPXCHG on a lock variable lockvar is a
synchronization access.
• Synchronization accesses also have special behavior with respect
to the memory system.
– Suppose core C1 executes a synchronization access to lockvar at time
t1, and then core C2 executes a synchronization access to lockvar at
time t2.
– Then t1<t2: every memory store that happens-before t1 must be
visible to any load on the same location after t2.
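The visibility guarantee can be sketched with C11 release/acquire atomics, which are one way a language exposes synchronization accesses. This assumes pthreads. Here `ready` plays the role of lockvar, and `payload`, `producer`, `consumer`, and `run_publish_demo` are hypothetical names.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static int payload;         /* ordinary, non-atomic shared data */
static atomic_int ready;    /* the location used for synchronization */

static void *producer(void *arg) {
    (void)arg;
    payload = 42;           /* store that happens-before the sync access */
    /* synchronization access at "t1": release publishes earlier stores */
    atomic_store_explicit(&ready, 1, memory_order_release);
    return NULL;
}

static void *consumer(void *arg) {
    int *seen = arg;
    /* synchronization access at "t2": spin until the release is observed */
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0) { }
    *seen = payload;        /* load after t2: must observe 42 */
    return NULL;
}

/* Returns the payload value observed by the consumer. */
int run_publish_demo(void) {
    pthread_t p, c;
    int seen = -1;
    payload = 0;
    atomic_store(&ready, 0);
    pthread_create(&c, NULL, consumer, &seen);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return seen;
}
```

Without the release/acquire pair on `ready`, the read of `payload` would be a data race and no value could be assumed.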
• If memory always had this expensive sequential behavior, i.e., every
access is a synchronization access, then we would not need atomic
instructions: we could use “Dekker’s algorithm”.
• We do not discuss Dekker’s algorithm because it is not applicable to
modern machines. (Look it up on wikipedia if interested.)
Where are we
• We now have basic mutual exclusion that is very useful
inside the kernel, e.g., for access to thread queues.
– Spinlocks based on atomic instructions.
– Can synchronize access to sleep/ready queues used to
implement higher-level synchronization objects.
• Don’t use spinlocks from user space! A thread holding a
spinlock could be preempted at any time.
– If a thread is preempted while holding a spinlock, then other
threads/cores may waste many cycles spinning on the lock.
– That’s a kernel/thread library integration issue: fast spinlock
synchronization in user space is a research topic.
• But spinlocks are very useful in the kernel, esp. for
synchronizing with interrupt handlers!
Wakeup from interrupt handler
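[Figure: kernel thread flow. A trap or fault enters the kernel; sleep puts the thread on a sleep queue; an interrupt triggers wakeup, which moves the thread to the ready queue; switch dispatches a ready thread, which then returns to user mode.]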
Examples?
Note: interrupt handlers do not block: typically there is a single interrupt stack
for each core that can take interrupts. If an interrupt arrived while another
handler was sleeping, it would corrupt the interrupt stack.
How should an interrupt handler wake up a thread? Condition variable
signal? Semaphore V?
Interrupts
An arriving interrupt transfers control immediately to the
corresponding handler (Interrupt Service Routine).
ISR runs kernel code in kernel mode in kernel space.
Interrupts may be nested according to priority.
[Figure: nested interrupts. A low-priority handler (ISR) interrupts the executing thread, and a high-priority ISR interrupts the low-priority handler.]
Interrupt priority: rough sketch
• N interrupt priority classes
• When an ISR at priority p runs, CPU blocks interrupts of priority p or lower.
• Kernel software can query/raise/lower the CPU interrupt priority level (IPL).
– Defer or mask delivery of interrupts at that IPL or lower.
– Avoid races with higher-priority ISR by raising CPU IPL to that priority.
– e.g., BSD Unix spl*/splx primitives.
• Summary: Kernel code can enable/disable interrupts as needed.
[Figure: IPL scale from low to high: spl0, splnet, splbio, splimp, clock; splx(s) restores a saved level.]
BSD example
int s;
s = splhigh();
/* all interrupts disabled */
splx(s);
/* IPL is restored to s */
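The raise/restore discipline can be modeled in user space. The sketch below is a single-threaded simulation and not kernel code: every name in it (`sim_splhigh`, `sim_splx`, `sim_raise_interrupt`, `sim_handled`, and the `IPL_*` levels) is hypothetical. It only models the rule that an interrupt is deferred while the current IPL is at or above its priority.

```c
#include <assert.h>

/* Hypothetical priority levels, lowest to highest. */
enum { IPL_0 = 0, IPL_NET = 1, IPL_BIO = 2, IPL_CLOCK = 3, IPL_HIGH = 4, NLEVELS = 5 };

static int cur_ipl = IPL_0;     /* simulated CPU interrupt priority level */
static int pending[NLEVELS];    /* interrupts deferred at each priority */
static int handled[NLEVELS];    /* how many ISRs have run at each priority */

/* Run the ISR for priority prio; while it runs, the IPL is prio. */
static void run_isr(int prio) {
    int saved = cur_ipl;
    cur_ipl = prio;             /* blocks interrupts of priority prio or lower */
    handled[prio]++;
    cur_ipl = saved;
}

/* Deliver every pending interrupt whose priority exceeds the current IPL. */
static void deliver_pending(void) {
    for (int p = NLEVELS - 1; p > cur_ipl; p--) {
        while (pending[p] > 0) {
            pending[p]--;
            run_isr(p);
        }
    }
}

/* A device raises an interrupt: it runs now or waits until the IPL drops. */
void sim_raise_interrupt(int prio) {
    pending[prio]++;
    deliver_pending();
}

/* Raise the IPL to the highest level and return the previous level. */
int sim_splhigh(void) {
    int s = cur_ipl;
    cur_ipl = IPL_HIGH;
    return s;
}

/* Restore a saved IPL and deliver anything that was deferred meanwhile. */
void sim_splx(int s) {
    cur_ipl = s;
    deliver_pending();
}

int sim_handled(int prio) { return handled[prio]; }
```

A caller brackets its critical section with `s = sim_splhigh();` and `sim_splx(s);`, mirroring the BSD example. An interrupt raised in between is handled only once `sim_splx` lowers the level.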
What ISRs do
• Interrupt handlers:
– bump counters, set flags
– throw packets on queues
– …
– wakeup waiting threads
• Wakeup puts a thread on the ready queue.
• Use spinlocks for the queues
• But how do we synchronize with interrupt handlers?
Synchronizing with ISRs
• Interrupt delivery can cause a race if the ISR shares data
(e.g., a thread queue) with the interrupted code.
• Example: Core at IPL=0 (thread context) holds spinlock,
interrupt is raised, ISR attempts to acquire spinlock….
• That would be bad. Disable interrupts.
[Figure: an executing thread (IPL 0) in kernel mode disables interrupts for its critical section.]
int s;
s = splhigh();
/* critical section */
splx(s);
Obviously this is just example detail from a particular machine (IA32): the details aren’t important.
Obviously this is just example
detail from a particular OS
(Windows): the details aren’t
important.
A Rough Idea
Yield() {
next = FindNextToRun();
ReadyToRun(this);
Switch(this, next);
}
Sleep() {
this->status = BLOCKED;
next = FindNextToRun();
Switch(this, next);
}
Issues to resolve:
What if there are no ready threads?
How does a thread terminate?
How does the first thread start?
A Rough Idea
Thread.Sleep(SleepQueue q) {
    lock and disable interrupts;
    this.status = BLOCKED;
    q.AddToQ(this);
    next = sched.GetNextThreadToRun();
    unlock and enable;
    Switch(this, next);
}
Thread.Wakeup(SleepQueue q) {
    lock and disable;
    q.RemoveFromQ(this);
    this.status = READY;
    sched.AddToReadyQ(this);
    unlock and enable;
}
This is pretty rough
The sleep and wakeup primitives must be used to implement
synchronization objects like mutexes and CVs. And we are waving our
hands at how that will work. Actually, P/V operations on a dedicated per-thread semaphore would be better than sleep/wakeup.
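One plausible reason a per-thread semaphore helps is that a V issued before the matching P is remembered rather than lost. The sketch below assumes POSIX unnamed semaphores (`sem_init`, available on Linux). `struct thread`, `thread_init`, `thread_sleep`, `thread_wakeup`, and `thread_wakeup_pending` are hypothetical names, not the slides' API.

```c
#include <assert.h>
#include <errno.h>
#include <semaphore.h>

/* Hypothetical thread object with a dedicated semaphore, initially 0. */
struct thread {
    sem_t wake;
};

/* Returns 0 on success, like sem_init. */
int thread_init(struct thread *t) {
    return sem_init(&t->wake, 0, 0);
}

/* P: block until a wakeup has been posted, then consume it. */
void thread_sleep(struct thread *t) {
    /* retry only if the wait was interrupted by a signal handler */
    while (sem_wait(&t->wake) == -1 && errno == EINTR) { }
}

/* V: post one wakeup; it is remembered even if the thread is not yet asleep. */
void thread_wakeup(struct thread *t) {
    sem_post(&t->wake);
}

/* Non-blocking probe for the demo: returns 1 and consumes a wakeup if one
 * was pending, otherwise returns 0. */
int thread_wakeup_pending(struct thread *t) {
    return sem_trywait(&t->wake) == 0;
}
```

If `thread_wakeup` runs before `thread_sleep`, the sleep returns immediately instead of blocking forever, which is the ordering a bare sleep/wakeup pair has to prevent with extra locking.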
There is some hidden synchronization: as soon as sleep unlocks,
another sleep (or yield) on another core may try to switch into
the sleeping thread before it switches out. And we have to worry
about interrupts during context switch.
Example: Unix Sleep (BSD)
sleep (void* event, int sleep_priority)
{
    struct proc *p = curproc;
    int s;
    s = splhigh();                      /* disable all interrupts */
    p->p_wchan = event;                 /* what are we waiting for */
    p->p_priority = sleep_priority;     /* wakeup scheduler priority */
    p->p_stat = SSLEEP;                 /* transition curproc to sleep state */
    INSERTQ(&slpque[HASH(event)], p);   /* fiddle sleep queue */
    splx(s);                            /* enable interrupts */
    mi_switch();                        /* context switch */
    /* we’re back... */
}
Illustration Only
/*
* Save context of the calling thread (old), restore registers of
* the next thread to run (new), and return in context of new.
*/
switch/MIPS (old, new) {
old->stackTop = SP;
save RA in old->MachineState[PC];
save callee registers in old->MachineState
restore callee registers from new->MachineState
RA = new->MachineState[PC];
SP = new->stackTop;
return (to RA)
}
This example (from the old MIPS ISA) illustrates how context
switch saves/restores the user register context for a thread,
efficiently and without assigning a value directly into the PC.
Example: Switch()
switch/MIPS (old, new) {
    /* Save current stack pointer and caller’s return address in old thread object. */
    old->stackTop = SP;
    save RA in old->MachineState[PC];
    /* Caller-saved registers (if needed) are already saved on its stack, and restored automatically on return. */
    save callee registers in old->MachineState
    restore callee registers from new->MachineState
    RA = new->MachineState[PC];
    /* Switch off of old stack and over to new stack. */
    SP = new->stackTop;
    /* Return to procedure that called switch in new thread. */
    return (to RA)
}
RA is the return address register. It contains the address that a procedure return instruction branches to.
What to know about context switch
• The Switch/MIPS example is an illustration for those of you who are
interested. It is not required to study it. But you should understand
how a thread system would use it (refer to state transition diagram):
• Switch() is a procedure that returns immediately, but it returns onto
the stack of new thread, and not in the old thread that called it.
• Switch() is called from internal routines to sleep or yield (or exit).
• Therefore, every thread in the blocked or ready state has a frame for
Switch() on top of its stack: it was the last frame pushed on the stack
before the thread switched out. (Need per-thread stacks to block.)
• The thread create primitive seeds a Switch() frame manually on the
stack of the new thread, since it is too young to have switched before.
• When a thread switches into the running state, it always returns
immediately from Switch() back to the internal sleep or yield routine,
and from there back on its way to wherever it goes next.
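The "returns onto another stack" behavior can be observed in user space with the POSIX ucontext calls, where `swapcontext` plays the role of Switch() and `makecontext` seeds the first frame of a new thread. This is a sketch assuming a platform that still provides `<ucontext.h>` (e.g., Linux with glibc); `worker`, `note`, and `run_switch_demo` are hypothetical names.

```c
#include <assert.h>
#include <string.h>
#include <ucontext.h>

static ucontext_t main_ctx, worker_ctx;   /* saved register contexts */
static char worker_stack[64 * 1024];      /* per-thread stack for the worker */
static char trace[16];                    /* records the order of events */

/* Append one character to the trace (trace is zeroed before each run). */
static void note(char c) {
    size_t n = strlen(trace);
    if (n + 1 < sizeof trace)
        trace[n] = c;
}

static void worker(void) {
    note('b');
    swapcontext(&worker_ctx, &main_ctx);   /* yield: switch back to main */
    note('d');
    /* returning ends this context; uc_link resumes main_ctx */
}

/* Runs the demo and returns the recorded trace. */
const char *run_switch_demo(void) {
    memset(trace, 0, sizeof trace);
    getcontext(&worker_ctx);
    worker_ctx.uc_stack.ss_sp = worker_stack;
    worker_ctx.uc_stack.ss_size = sizeof worker_stack;
    worker_ctx.uc_link = &main_ctx;        /* where to go when worker returns */
    makecontext(&worker_ctx, worker, 0);   /* seed the new thread's first frame */

    note('a');
    swapcontext(&main_ctx, &worker_ctx);   /* returns only when worker yields */
    note('c');
    swapcontext(&main_ctx, &worker_ctx);   /* resume worker after its yield */
    note('e');
    return trace;
}
```

Each `swapcontext` call returns in a different context from the one that called it, and the caller resumes only when something later switches back to it.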
Implementing Sleep on a Multiprocessor
sleep (void* event, int sleep_priority)
{
    struct proc *p = curproc;
    int s;
    s = splhigh();                      /* disable all interrupts */
    p->p_wchan = event;                 /* what are we waiting for */
    p->p_priority = sleep_priority;     /* wakeup scheduler priority */
    p->p_stat = SSLEEP;                 /* transition curproc to sleep state */
    INSERTQ(&slpque[HASH(event)], p);   /* fiddle sleep queue */
    splx(s);                            /* enable interrupts */
    mi_switch();                        /* context switch */
    /* we’re back... */
}
What if another CPU takes an interrupt and calls wakeup?
What if another CPU is handling a syscall and calls sleep or wakeup?
What if another CPU tries to wake up curproc before it has completed mi_switch?
Illustration Only
Using Spinlocks in Sleep: First Try
sleep (void* event, int sleep_priority)
{
    struct proc *p = curproc;
    int s;
    lock spinlock;                      /* prevent another CPU from racing with us */
    p->p_wchan = event;                 /* what are we waiting for */
    p->p_priority = sleep_priority;     /* wakeup scheduler priority */
    p->p_stat = SSLEEP;                 /* transition curproc to sleep state */
    INSERTQ(&slpque[HASH(event)], p);   /* fiddle sleep queue */
    unlock spinlock;
    mi_switch();                        /* context switch */
    /* we’re back */
}
Grab spinlock to prevent another CPU from racing with us.
Wakeup (or any other related critical section code) will use the same spinlock, guaranteeing mutual exclusion.
Illustration Only
Sleep with Spinlocks: What Went Wrong
sleep (void* event, int sleep_priority)
{
    struct proc *p = curproc;
    int s;
    lock spinlock;
    p->p_wchan = event;                 /* what are we waiting for */
    p->p_priority = sleep_priority;     /* wakeup scheduler priority */
    p->p_stat = SSLEEP;                 /* transition curproc to sleep state */
    INSERTQ(&slpque[HASH(event)], p);   /* fiddle sleep queue */
    unlock spinlock;
    mi_switch();                        /* context switch */
    /* we’re back */
}
Potential deadlock: what if we take an interrupt on this processor, and call wakeup while the lock is held?
Potential doubly scheduled thread: what if another CPU calls wakeup to wake us up before we’re finished with mi_switch on this CPU?
Illustration Only
Using Spinlocks in Sleep: Second Try
sleep (void* event, int sleep_priority)
{
    struct proc *p = curproc;
    int s;
    s = splhigh();                      /* disable interrupts first... */
    lock spinlock;                      /* ...then grab the spinlock */
    p->p_wchan = event;                 /* what are we waiting for */
    p->p_priority = sleep_priority;     /* wakeup scheduler priority */
    p->p_stat = SSLEEP;                 /* transition curproc to sleep state */
    INSERTQ(&slpque[HASH(event)], p);   /* fiddle sleep queue */
    unlock spinlock;
    splx(s);
    mi_switch();                        /* context switch */
    /* we’re back */
}
Grab spinlock and disable interrupts.
Illustration Only
Recap
• An OS implements synchronization objects using a
combination of elements:
– Basic sleep/wakeup primitives of some form.
– Sleep places the thread TCB on a sleep queue and does a
context switch to the next ready thread.
– Wakeup places each awakened thread on a ready queue, from
which the ready thread is dispatched to a core.
– Synchronization for the thread queues uses spinlocks based on
atomic instructions, together with interrupt enable/disable.
– The low-level details are tricky and machine-dependent.
– The atomic instructions (synchronization accesses) also drive
memory consistency behaviors in the machine, e.g., a safe
memory model for fully synchronized race-free programs.
CMPXCHG
If our CPU loses the ‘race’, because another CPU changed ‘cmos_lock’ to
some non-zero value after we had fetched our copy of it, then the (now
non-zero) value from the ‘cmos_lock’ destination-operand will have been
copied into EAX, and so the final conditional-jump shown above will take
our CPU back into the spin-loop, where it will resume busy-waiting until the
‘winner’ of the race clears ‘cmos_lock’.
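The assembly this slide refers to is not shown here, but the behavior it describes matches C11 `atomic_compare_exchange_strong`: on failure, the value actually found in memory is written back to the caller's "expected" variable, much as CMPXCHG leaves it in EAX. In the sketch below `cmos_lock` is the name used on the slide, while `try_acquire` and `release` are hypothetical, and storing 1 as the "held" value is an assumption.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_int cmos_lock = 0;   /* 0 = free, non-zero = held */

/* One compare-and-swap attempt. Returns true if we acquired the lock.
 * *seen receives the value found in cmos_lock (the role EAX plays). */
bool try_acquire(int *seen) {
    int expected = 0;              /* we hope to find the lock free */
    bool won = atomic_compare_exchange_strong(&cmos_lock, &expected, 1);
    *seen = expected;              /* 0 if we won, the holder's value if we lost */
    return won;
}

/* The winner clears the lock so that spinning cores can retry. */
void release(void) {
    atomic_store(&cmos_lock, 0);
}
```

A spin loop would call `try_acquire` repeatedly until it returns true, which corresponds to jumping back into the spin-loop whenever the fetched value is non-zero.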