Inside Synchronization
Jeff Chase
Duke University

Threads and blocking

[Diagram: layers of thread support]
• thread API, e.g., pthreads or Java threads
• thread library: threads, mutexes, condition variables…
• kernel interface for thread libs (not for users)
• kernel thread support: raw “vessels”, e.g., Linux CLONE_THREAD + “futex”
[Diagram labels: PG-13; active (ready or running); blocked; wait; wakeup/signal]

Threads can enter the kernel (fault or trap) and block.
This slide applies to the process abstraction too, or, more precisely, to the main thread of a process.

Blocking

When a thread is blocked on a synchronization object (a mutex or CV) its TCB is placed on a sleep queue of threads waiting for an event on that object.
How to synchronize thread queues and sleep/wakeup inside the kernel?
Interrupts drive many wakeup events.
[Diagram labels: active (ready or running); blocked; sleep/wait; wakeup/signal; kernel TCB; sleep queue; ready queue]

Overview

• Consider multicore synchronization (inside the kernel) from first principles.
• Details vary from system to system and machine to machine….
• I’m picking and choosing.

Spinlock: a first try

    int s = 0;                 /* global spinlock variable */

    lock() {
        while (s == 1) {};     /* busy-wait until lock is free */
        ASSERT(s == 0);
        s = 1;
    }

    unlock() {
        ASSERT(s == 1);
        s = 0;
    }

Spinlocks provide mutual exclusion among cores without blocking.
Spinlocks are useful for lightly contended critical sections where there is no risk of preemption of a thread while it is holding the lock, i.e., in the lowest levels of the kernel.

Spinlock: what went wrong

    int s = 0;

    lock() {
        while (s == 1) {};
        s = 1;
    }

    unlock() {
        s = 0;
    }

Race to acquire. Two (or more) cores see s == 0.

We need an atomic “toehold”

• To implement safe mutual exclusion, we need support for some sort of “magic toehold” for synchronization.
– The lock primitives themselves have critical sections to test and/or set the lock flags.
• Safe mutual exclusion on multicore systems requires some hardware support: atomic instructions
– Examples: test-and-set, compare-and-swap, fetch-and-add.
– These instructions perform an atomic read-modify-write of a memory location. We use them to implement locks.
– Atomic instructions are expensive, but necessary.
– If we have any of those, we can build higher-level synchronization objects like monitors or semaphores.
– Note: we also must be careful of interrupt handlers….

Atomic instructions: Test-and-Set

    Spinlock::Acquire () {
        while (held);
        held = 1;
    }

Problem: interleaved load/test/store. Two cores can each run load, test, store, and their steps can interleave.
Solution: TSL atomically sets the flag and leaves the old value in a register.

Wrong:

        load  4(SP), R2       ; load “this”
    busywait:
        load  4(R2), R3       ; load “held” flag
        bnz   R3, busywait    ; spin if held wasn’t zero
        store #1, 4(R2)       ; held = 1

Right:

        load  4(SP), R2       ; load “this”
    busywait:
        tsl   4(R2), R3       ; test-and-set this->held
        bnz   R3, busywait    ; spin if held wasn’t zero

One example: tsl, test-and-set-lock (from an old machine).

Spinlock: IA32

    Spin_Lock:
        CMP  lockvar, 0       ; Check if lock is free
        JE   Get_Lock
        PAUSE                 ; Short delay
        JMP  Spin_Lock
    Get_Lock:
        MOV  EAX, 1
        XCHG EAX, lockvar     ; Try to get lock
        CMP  EAX, 0           ; Test if successful
        JNE  Spin_Lock

The PAUSE loop idles the core for a contended lock.
The atomic exchange ensures safe acquire of an uncontended lock.
XCHG atomically swaps a register with a memory location, unconditionally. It is a simpler relative of compare-and-swap (CMPXCHG on IA32): compare x to the value in memory location y; if x == *y then set *y = z. Report success/failure.

Synchronization accesses

• Atomic instructions also impose orderings on memory accesses.
• Their execution informs the machine that synchronization is occurring.
• Cores synchronize with one another by accessing a shared memory location with atomic instructions.
• When cores synchronize, they establish happens-before ordering relationships among their accesses to other shared memory locations.
• The machine must ensure a consistent view of memory that respects these happens-before orderings.

7.1. LOCKED ATOMIC OPERATIONS

The 32-bit IA-32 processors support locked atomic operations on locations in system memory.
These operations are typically used to manage shared data structures (such as semaphores, segment descriptors, system segments, or page tables) in which two or more processors may try simultaneously to modify the same field or flag….

Note that the mechanisms for handling locked atomic operations have evolved as the complexity of IA-32 processors has evolved….

Synchronization mechanisms in multiple-processor systems may depend upon a strong memory-ordering model. Here, a program can use a locking instruction such as the XCHG instruction or the LOCK prefix to insure that a read-modify-write operation on memory is carried out atomically. Locking operations typically operate like I/O operations in that they wait for all previous instructions to complete and for all buffered writes to drain to memory….

This is just an example of a principle on a particular machine (IA32): these details aren’t important.

A peek at some deep tech

An execution schedule defines a partial order of program events. The ordering relation (<) is called happens-before.

[Diagram: two threads each execute the sequence below; an arrow labeled “happens before (<)” links the two sequences.]

    mx->Acquire();
    x = x + 1;
    mx->Release();

Two events are concurrent if neither happens-before the other. They might execute in some order, but only by luck. The next schedule may reorder them.

Just three rules govern happens-before order:
1. Events within a thread are ordered.
2. Mutex handoff orders events across threads: the release #N happens-before acquire #N+1.
3. Happens-before is transitive: if (A < B) and (B < C) then A < C.

Machines may reorder concurrent events, but they always respect happens-before ordering.

Happens-before and causality

• We humans have a natural notion of causality.
– Event A caused event B if B happened as a result of A, or A was a factor in B, or knowledge of A was necessary for B to occur….
• Naturally, event A can cause event B only if A < B!
– (A caused B) ⇒ (A happens-before B), i.e., A precedes B
– This is obvious: events cannot change the past.
• Of course, the converse is not always true.
– It is not true in general that (A < B) ⇒ (A caused B).
– Always be careful in inferring causality.

Causality and inconsistency

• If A caused B, and some thread T observes event B before event A, then T sees an “inconsistent” event timeline.
– Example: Facebook never shows you a reply to a post before showing you the post itself. Never happens. It would be too weird.
• That kind of inconsistency might cause a program to fail.
– We’re talking about events that matter for thread interactions at the machine level: load and store on the shared memory.

Memory ordering

• Shared memory is complex on multicore systems.
• Does a load from a memory location (address) return the latest value written to that memory location by a store?
• What does “latest” mean in a parallel system?

[Diagram: T1 and T2 share memory M. T1: W(x)=1 (OK), then R(y). T2: W(y)=1 (OK), then R(x). Both reads return 1.]

It is common to presume that load and store ops execute sequentially on a shared memory, and a store is immediately and simultaneously visible to loads at all other threads. But not on real machines.

Memory ordering

• A load might fetch from the local cache and not from memory.
• A store may buffer a value in a local cache before draining the value to memory, where other cores can access it.
• Therefore, a load from one core does not necessarily return the “latest” value written by a store from another core.

[Diagram: T1 and T2 share memory M. T1: W(x)=1 (OK), then R(y). T2: W(y)=1 (OK), then R(x). The reads may return 0??]

A trick called Dekker’s algorithm supports mutual exclusion on multi-core without using atomic instructions. It assumes that load and store ops on a given location execute sequentially. But they don’t.

Memory ordering

• A load might fetch from the local cache and not from memory.
• A store may buffer a value in a local cache before draining the value to memory, where other cores can access it.
• Therefore, a load from one core does not necessarily return the “latest” value written by a store from another core.

[Diagram: T1 and T2 share memory M. T1: W(x)=1 (OK), then R(y). T2: W(y)=1 (OK), then R(x). The reads may return 0??]

Memory accesses from T1 have no happens-before ordering defined relative to the accesses from T2, unless the program uses synchronization (e.g., a mutex handoff) to impose an ordering.

Memory Models

A Case for Rethinking Parallel Languages and Hardware. Sarita Adve and Hans Boehm, Communications of the ACM, Aug 2010, Vol. 53 Issue 8

A compiler might reorder the two independent assignments to hide the latency of loading Y or X.
Modern processors may use a store buffer to avoid waiting for stores to complete.
Reordering any pair of accesses, reading values from write buffers, register promotion, common subexpression elimination, redundant read elimination: all may violate sequential consistency.

The point of happens-before

• For consistency, we want a load from a location to return the value written by the “latest” store to that location.
• But what does “latest” mean? It means the load returns the value from the last store that happens-before the load.
• Machines are free to reorder concurrent accesses.
– Concurrent events have no restriction on their ordering: no happens-before relation. Your program’s correctness cannot depend on the ordering the machine picks for concurrent events: if the interleaving matters to you, then you should have used a mutex.
– If there is no mutex, then the events are concurrent, and the machine is free to choose whatever order is convenient for speed, e.g., it may leave “old” data in caches and not propagate more “recent” data.

The first thing to understand about memory behavior on multi-core systems

• Cores must see a “consistent” view of shared memory for programs to work properly. But what does it mean?
– Answer: it depends. Machines vary.
– But they always respect causality: that is a minimal requirement.
– And since machines don’t know what events really cause others in a program, they play it safe and respect happens-before.
• Synchronization accesses tell the machine that ordering matters: a happens-before relationship exists. Machines always respect that.
– Modern machines work for race-free programs.
– Otherwise, all bets are off. Synchronize!

[Diagram: T1 and T2 share memory M. T1: W(x)=1 (OK), then R(y). T2: W(y)=1 (OK), then R(x). T1 passes the lock to T2. Read results shown: 0??, 1.]

The most you should assume is that any memory store before a lock release is visible to a load on a core that has subsequently acquired the same lock.

The point of all that

• We use special atomic instructions to implement locks.
• E.g., a TSL or CMPXCHG on a lock variable lockvar is a synchronization access.
• Synchronization accesses also have special behavior with respect to the memory system.
– Suppose core C1 executes a synchronization access to lockvar at time t1, and then core C2 executes a synchronization access to lockvar at time t2.
– Then t1<t2: every memory store that happens-before t1 must be visible to any load on the same location after t2.
• If memory always had this expensive sequential behavior, i.e., every access is a synchronization access, then we would not need atomic instructions: we could use “Dekker’s algorithm”.
• We do not discuss Dekker’s algorithm because it is not applicable to modern machines. (Look it up on Wikipedia if interested.)

Where are we

• We now have basic mutual exclusion that is very useful inside the kernel, e.g., for access to thread queues.
– Spinlocks based on atomic instructions.
– Can synchronize access to sleep/ready queues used to implement higher-level synchronization objects.
• Don’t use spinlocks from user space! A thread holding a spinlock could be preempted at any time.
– If a thread is preempted while holding a spinlock, then other threads/cores may waste many cycles spinning on the lock.
– That’s a kernel/thread library integration issue: fast spinlock synchronization in user space is a research topic.
• But spinlocks are very useful in the kernel, esp. for synchronizing with interrupt handlers!

Wakeup from interrupt handler

[Diagram labels: return to user mode; trap or fault; sleep; sleep queue; wakeup; ready queue; switch; interrupt]

Examples?
Note: interrupt handlers do not block: typically there is a single interrupt stack for each core that can take interrupts. If an interrupt arrived while another handler was sleeping, it would corrupt the interrupt stack.
How should an interrupt handler wake up a thread? Condition variable signal? Semaphore V?

Interrupts

An arriving interrupt transfers control immediately to the corresponding handler (Interrupt Service Routine).
ISR runs kernel code in kernel mode in kernel space.
Interrupts may be nested according to priority.
[Diagram labels: executing thread; low-priority handler (ISR); high-priority ISR]

Interrupt priority: rough sketch

• N interrupt priority classes
• When an ISR at priority p runs, CPU blocks interrupts of priority p or lower.
• Kernel software can query/raise/lower the CPU interrupt priority level (IPL).
– Defer or mask delivery of interrupts at that IPL or lower.
– Avoid races with higher-priority ISR by raising CPU IPL to that priority.
– e.g., BSD Unix spl*/splx primitives.
• Summary: Kernel code can enable/disable interrupts as needed.

[Diagram: priority levels from low to high: spl0, splnet, splbio, splimp, clock]

BSD example:

    int s;
    s = splhigh();
    /* all interrupts disabled */
    splx(s);
    /* IPL is restored to s */

What ISRs do

• Interrupt handlers:
– bump counters, set flags
– throw packets on queues
– …
– wakeup waiting threads
• Wakeup puts a thread on the ready queue.
• Use spinlocks for the queues
• But how do we synchronize with interrupt handlers?

Synchronizing with ISRs

• Interrupt delivery can cause a race if the ISR shares data (e.g., a thread queue) with the interrupted code.
• Example: Core at IPL=0 (thread context) holds spinlock, interrupt is raised, ISR attempts to acquire spinlock….
• That would be bad. Disable interrupts.

[Diagram: executing thread (IPL 0) in kernel mode disables interrupts for critical section]

    int s;
    s = splhigh();
    /* critical section */
    splx(s);

Obviously this is just example detail from a particular machine (IA32): the details aren’t important.
Obviously this is just example detail from a particular OS (Windows): the details aren’t important.

A Rough Idea

    Yield() {
        next = FindNextToRun();
        ReadyToRun(this);
        Switch(this, next);
    }

    Sleep() {
        this->status = BLOCKED;
        next = FindNextToRun();
        Switch(this, next);
    }

Issues to resolve:
What if there are no ready threads?
How does a thread terminate?
How does the first thread start?

A Rough Idea

    Thread.Sleep(SleepQueue q) {
        lock and disable interrupts;
        this.status = BLOCKED;
        q.AddToQ(this);
        next = sched.GetNextThreadToRun();
        unlock and enable;
        Switch(this, next);
    }

    Thread.Wakeup(SleepQueue q) {
        lock and disable;
        q.RemoveFromQ(this);
        this.status = READY;
        sched.AddToReadyQ(this);
        unlock and enable;
    }

This is pretty rough

The sleep and wakeup primitives must be used to implement synchronization objects like mutexes and CVs. And we are waving our hands at how that will work.
Actually, P/V operations on a dedicated per-thread semaphore would be better than sleep/wakeup.
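The queue bookkeeping behind the rough Sleep/Wakeup idea can be sketched in plain C. This is only an illustration under stated assumptions: the names `struct tcb`, `struct queue`, `thread_sleep`, and `thread_wakeup` are hypothetical, the sketch wakes the sleeper at the head of the queue rather than a specific thread, and it leaves out the spinlock, the interrupt disabling, and the context switch that a real kernel needs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical thread states and TCB, for illustration only. */
enum tstate { RUNNING, READY, BLOCKED };

struct tcb {
    int id;
    enum tstate status;
    struct tcb *next;      /* link for whichever queue holds this TCB */
};

/* Simple FIFO queue of TCBs (used for both sleep and ready queues). */
struct queue {
    struct tcb *head, *tail;
};

static void enqueue(struct queue *q, struct tcb *t) {
    t->next = NULL;
    if (q->tail) q->tail->next = t; else q->head = t;
    q->tail = t;
}

static struct tcb *dequeue(struct queue *q) {
    struct tcb *t = q->head;
    if (t) {
        q->head = t->next;
        if (!q->head) q->tail = NULL;
        t->next = NULL;
    }
    return t;
}

/* Sleep: block 'self' on sleep queue q and choose the next ready thread.
 * A real kernel would do this under a spinlock with interrupts disabled,
 * then context switch to the returned thread. */
static struct tcb *thread_sleep(struct tcb *self, struct queue *q,
                                struct queue *ready) {
    assert(self->status == RUNNING);
    self->status = BLOCKED;
    enqueue(q, self);
    struct tcb *next = dequeue(ready);
    if (next) next->status = RUNNING;
    return next;           /* NULL: no ready thread, an open issue above */
}

/* Wakeup: move the first sleeper on q (if any) to the ready queue. */
static struct tcb *thread_wakeup(struct queue *q, struct queue *ready) {
    struct tcb *t = dequeue(q);
    if (t) {
        assert(t->status == BLOCKED);
        t->status = READY;
        enqueue(ready, t);
    }
    return t;
}
```

A caller would put the running thread to sleep with `thread_sleep(&a, &sleepq, &readyq)` and later, for example from an interrupt handler, call `thread_wakeup(&sleepq, &readyq)` to make it ready again.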
A Rough Idea

    Thread.Sleep(SleepQueue q) {
        lock and disable interrupts;
        this.status = BLOCKED;
        q.AddToQ(this);
        next = sched.GetNextThreadToRun();
        unlock and enable;
        Switch(this, next);
    }

    Thread.Wakeup(SleepQueue q) {
        lock and disable;
        q.RemoveFromQ(this);
        this.status = READY;
        sched.AddToReadyQ(this);
        unlock and enable;
    }

This is pretty rough

There is some hidden synchronization: as soon as sleep unlocks, another sleep (or yield) on another core may try to switch into the sleeping thread before it switches out.
And we have to worry about interrupts during context switch.

Example: Unix Sleep (BSD)

    sleep (void* event, int sleep_priority)
    {
        struct proc *p = curproc;
        int s;

        s = splhigh();                      /* disable all interrupts */
        p->p_wchan = event;                 /* what are we waiting for */
        p->p_priority = sleep_priority;     /* wakeup scheduler priority */
        p->p_stat = SSLEEP;                 /* transition curproc to sleep state */
        INSERTQ(&slpque[HASH(event)], p);   /* fiddle sleep queue */
        splx(s);                            /* enable interrupts */
        mi_switch();                        /* context switch */
        /* we’re back... */
    }

Illustration Only

    /*
     * Save context of the calling thread (old), restore registers of
     * the next thread to run (new), and return in context of new.
     */
    switch/MIPS (old, new) {
        old->stackTop = SP;
        save RA in old->MachineState[PC];
        save callee registers in old->MachineState
        restore callee registers from new->MachineState
        RA = new->MachineState[PC];
        SP = new->stackTop;
        return (to RA)
    }

This example (from the old MIPS ISA) illustrates how context switch saves/restores the user register context for a thread, efficiently and without assigning a value directly into the PC.

Example: Switch()

Save current stack pointer and caller’s return address in old thread object.
    switch/MIPS (old, new) {
        old->stackTop = SP;
        save RA in old->MachineState[PC];
        save callee registers in old->MachineState
        restore callee registers from new->MachineState
        RA = new->MachineState[PC];
        SP = new->stackTop;
        return (to RA)
    }

Caller-saved registers (if needed) are already saved on its stack, and restored automatically on return.
RA is the return address register. It contains the address that a procedure return instruction branches to.
Switch off of old stack and over to new stack.
Return to procedure that called switch in new thread.

What to know about context switch

• The Switch/MIPS example is an illustration for those of you who are interested. It is not required to study it. But you should understand how a thread system would use it (refer to state transition diagram):
• Switch() is a procedure that returns immediately, but it returns onto the stack of new thread, and not in the old thread that called it.
• Switch() is called from internal routines to sleep or yield (or exit).
• Therefore, every thread in the blocked or ready state has a frame for Switch() on top of its stack: it was the last frame pushed on the stack before the thread switched out. (Need per-thread stacks to block.)
• The thread create primitive seeds a Switch() frame manually on the stack of the new thread, since it is too young to have switched before.
• When a thread switches into the running state, it always returns immediately from Switch() back to the internal sleep or yield routine, and from there back on its way to wherever it goes next.

Implementing Sleep on a Multiprocessor

What if another CPU takes an interrupt and calls wakeup?

    sleep (void* event, int sleep_priority)
    {
        struct proc *p = curproc;
        int s;
        s = splhigh();                      /* disable all interrupts */
        p->p_wchan = event;                 /* what are we waiting for */
        p->p_priority = sleep_priority;     /* wakeup scheduler priority */
        p->p_stat = SSLEEP;                 /* transition curproc to sleep state */
        INSERTQ(&slpque[HASH(event)], p);   /* fiddle sleep queue */
        splx(s);                            /* enable interrupts */
        mi_switch();                        /* context switch */
        /* we’re back... */
    }

What if another CPU is handling a syscall and calls sleep or wakeup?
What if another CPU tries to wakeup curproc before it has completed mi_switch?

Illustration Only

Using Spinlocks in Sleep: First Try

    sleep (void* event, int sleep_priority)
    {
        struct proc *p = curproc;
        int s;

        lock spinlock;
        p->p_wchan = event;                 /* what are we waiting for */
        p->p_priority = sleep_priority;     /* wakeup scheduler priority */
        p->p_stat = SSLEEP;                 /* transition curproc to sleep state */
        INSERTQ(&slpque[HASH(event)], p);   /* fiddle sleep queue */
        unlock spinlock;
        mi_switch();                        /* context switch */
        /* we’re back */
    }

Grab spinlock to prevent another CPU from racing with us.
Wakeup (or any other related critical section code) will use the same spinlock, guaranteeing mutual exclusion.

Illustration Only

Sleep with Spinlocks: What Went Wrong

    sleep (void* event, int sleep_priority)
    {
        struct proc *p = curproc;
        int s;

        lock spinlock;
        p->p_wchan = event;                 /* what are we waiting for */
        p->p_priority = sleep_priority;     /* wakeup scheduler priority */
        p->p_stat = SSLEEP;                 /* transition curproc to sleep state */
        INSERTQ(&slpque[HASH(event)], p);   /* fiddle sleep queue */
        unlock spinlock;
        mi_switch();                        /* context switch */
        /* we’re back */
    }

Potential deadlock: what if we take an interrupt on this processor, and call wakeup while the lock is held?
Potential doubly scheduled thread: what if another CPU calls wakeup to wake us up before we’re finished with mi_switch on this CPU?

Illustration Only

Using Spinlocks in Sleep: Second Try

Grab spinlock and disable interrupts.

    sleep (void* event, int sleep_priority)
    {
        struct proc *p = curproc;
        int s;
        s = splhigh();
        lock spinlock;
        p->p_wchan = event;                 /* what are we waiting for */
        p->p_priority = sleep_priority;     /* wakeup scheduler priority */
        p->p_stat = SSLEEP;                 /* transition curproc to sleep state */
        INSERTQ(&slpque[HASH(event)], p);   /* fiddle sleep queue */
        unlock spinlock;
        splx(s);
        mi_switch();                        /* context switch */
        /* we’re back */
    }

Illustration Only

Recap

• An OS implements synchronization objects using a combination of elements:
– Basic sleep/wakeup primitives of some form.
– Sleep places the thread TCB on a sleep queue and does a context switch to the next ready thread.
– Wakeup places each awakened thread on a ready queue, from which the ready thread is dispatched to a core.
– Synchronization for the thread queues uses spinlocks based on atomic instructions, together with interrupt enable/disable.
– The low-level details are tricky and machine-dependent.
– The atomic instructions (synchronization accesses) also drive memory consistency behaviors in the machine, e.g., a safe memory model for fully synchronized race-free programs.

CMPXCHG

If our CPU loses the ‘race’, because another CPU changed ‘cmos_lock’ to some non-zero value after we had fetched our copy of it, then the (now non-zero) value from the ‘cmos_lock’ destination-operand will have been copied into EAX, and so the final conditional-jump shown above will take our CPU back into the spin-loop, where it will resume busy-waiting until the ‘winner’ of the race clears ‘cmos_lock’.
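The compare-and-swap spin loop described above can be sketched portably with C11 atomics. This is a minimal sketch, not the code from any particular kernel: the type `spinlock_t` and the functions `spin_init`, `spin_trylock`, `spin_lock`, and `spin_unlock` are hypothetical names chosen for illustration, and the sketch does not disable interrupts, which kernel code sharing the lock with an ISR would also need to do.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical spinlock type: 0 = free, 1 = held. */
typedef struct {
    atomic_int held;
} spinlock_t;

static void spin_init(spinlock_t *l) {
    atomic_init(&l->held, 0);
}

/* One acquire attempt: atomically change held from 0 to 1.
 * Returns true on success. On failure, 'expected' is overwritten with
 * the current non-zero value, much as CMPXCHG leaves it in EAX. */
static bool spin_trylock(spinlock_t *l) {
    int expected = 0;
    return atomic_compare_exchange_strong_explicit(
        &l->held, &expected, 1,
        memory_order_acquire,   /* success: acquire ordering */
        memory_order_relaxed);  /* failure: no ordering needed */
}

static void spin_lock(spinlock_t *l) {
    for (;;) {
        /* Busy-wait with plain loads until the lock looks free ... */
        while (atomic_load_explicit(&l->held, memory_order_relaxed) != 0) {
        }
        /* ... then attempt the atomic read-modify-write. If another
         * core won the race, go back to spinning. */
        if (spin_trylock(l))
            return;
    }
}

static void spin_unlock(spinlock_t *l) {
    assert(atomic_load_explicit(&l->held, memory_order_relaxed) == 1);
    /* Release ordering: stores made in the critical section become
     * visible to the core that next acquires the lock. */
    atomic_store_explicit(&l->held, 0, memory_order_release);
}
```

The acquire/release pairing is what gives the happens-before guarantee discussed earlier: stores before `spin_unlock` on one core are visible to loads after a later successful `spin_lock` on another core.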