W4118 Operating Systems Instructor: Junfeng Yang Logistics Homework 2 grading: by demos Each TA grades 1/3 of all groups TA will set up a time with you using doodle During grading: • You apply patch in TA’s VM (same as the class VM) • You compile, run testcase, etc • TA may ask question to anyone in your group – Wrong answer all group members lose points – So make sure everyone understands everything of the assignment Homework 3 out Implementing semaphore Implementing barrier (NOTE: different from the memory barriers we talk about today) 2 Barriers Use of a barrier. (a) Processes approaching a barrier. (b) All processes but one blocked at the barrier. (c) When the last process arrives at the barrier, all of them are let through. Example from Tanenbaum, Modern Operating Systems 3 e, (c) 2008 Prentice-Hall, Inc. All rights reserved. 0-13-6006639 3 Last lecture Synchronization Locks: to avoid spin-wait or context switch overhead, use queues to control who gets lock next Semaphore • Purposes: mutual exclusion and scheduling order • Solving producer-consumer problem with semaphores Monitor • Concurrency meets OOP • Condition variables to control order • Solving producer-consumer problem with monitors 4 Today: synchronization in Linux Low-level atomic operations: Memory barrier Atomic operations Interrupt/softirq disabling/enabling Spin locks • avoids compile-time, or runtime instruction re-ordering • memory bus lock, read-modify-write ops • Local, global • general, read/write High-level synchronization primitives: Semaphores • general, read/write Mutex Completion 5 Goals Different flavors of synchronization primivites and when to use them, in the context of Linux kernel How synchronization primitives are implemented for real “Portable” tricks: useful in other context as well (when you write a high performance server) Optimize for common case 6 Synchronization is complex and subtle Already learned this from the implementation examples we’ve seen before Kernel synchronization is even more complex and subtle Higher requirements: performance, protection … Code heavily optimized, “fast path” often in assembly, fit within one cache line 7 Choosing Synch Primitives Avoid synch if possible! (clever instruction ordering) Use atomics or rw spinlocks if possible Low overhead: fast path is a few instructions In interrupt context, use spinlock_irq* if possible Use semaphores if you need to sleep Example: inserting in linked list (needs memory barrier still) Can’t sleep in interrupt context Don’t sleep holding a spinlock! Why? Complicated matrix of choices for protecting data structures accessed by deferred functions 8 Architectural Dependence The implementation of the synchronization primitives is extremely architecture dependent. This is because only the hardware can guarantee atomicity of an operation. Each architecture must provide a mechanism for doing an operation that can examine and modify a storage location atomically. Some architectures do not guarantee atomicity, but inform whether the operation attempted was atomic. 9 Memory Barriers: Motivation The compiler can: Reorder code as long as it correctly maintains data flow dependencies within a function and with called functions Reorder the execution of code to optimize performance The processor can: Reorder instruction execution as long as it correctly maintains register flow dependencies Reorder memory modification as long as it correctly maintains data flow dependencies Reorder the execution of instructions (for performance optimization) 10 Memory Barriers: Definition Memory Barriers are used to prevent a processor and/or the compiler from reordering instruction execution and memory modification. Memory Barriers are instructions to hardware and/or compiler to complete all pending accesses before issuing any more read memory barrier – acts on read requests write memory barrier – acts on write requests Intel – certain instructions act as barriers: lock, iret, control regs rmb – asm volatile("lock;addl $0,0(%%esp)":::"memory") • add 0 to top of stack with lock prefix wmb – Intel never re-orders writes, just for compiler 11 Barrier Operations barrier – prevent only compiler reordering mb – prevents load and store reordering rmb – prevents load reordering wmb – prevents store reordering smp_mb – prevent load and store reordering only in SMP kernel smp_rmb – prevent load reordering only in SMP kernels smp_wmb – prevent store reordering only in SMP kernels set_mb – performs assignment and prevents load and store reordering 12 Atomic Operations Many instructions not atomic in hardware (smp) Read-modify-write instructions: inc, test-and-set, swap unaligned memory access even i++ is not necessarily atomic! Compiler may not generate atomic code If the data that must be protected is a single word, atomic operations can be used. These functions examine and modify the word atomically. The atomic data type is atomic_t. Intel implementation lock prefix byte 0xf0 – locks memory bus Speed: memory speed, (roughly same as load/store) 13 Atomic Operations ATOMIC_INIT – initialize an atomic_t variable atomic_read – examine value atomically atomic_set – change value atomically atomic_inc – increment value atomically atomic_dec – decrement value atomically atomic_add - add to value atomically atomic_sub – subtract from value atomically atomic_inc_and_test – increment value and test for zero atomic_dec_and_test – decrement value and test for zero atomic_sub_and_test – subtract from value and test for zero atomic_set_mask – mask bits atomically atomic_clear_mask – clear bits atomically 14 Serializing with Interrupts Basic primitive in original UNIX Doesn’t protect against other CPUs Intel: “interrupts enabled bit” cli to clear (disable), sti to set (enable) Enabling is often wrong; need to restore local_irq_save() local_irq_restore() • Why? 15 Interrupt Operations Services used to serialize with interrupts are: Dealing with the full interrupt state of the system is officially discouraged. Locks should be used. local_irq_disable - disables interrupts on the current CPU local_irq_enable - enable interrupts on the current CPU local_save_flags - return the interrupt state of the processor local_restore_flags - restore the interrupt state of the processor 16 Disabling Deferred Functions Disabling interrupts disables deferred functions Possible to disable deferred functions but not all interrupts Operations (macros): local_bh_disable() local_bh_enable() 17 Spin Locks A spin lock is a data structure (spinlock_t ) that is used to synchronize access to critical sections. Only one thread can be holding a spin lock at any moment. All other threads trying to get the lock will “spin” (loop while checking the lock status). Spin locks should not be held for long periods because waiting tasks on other CPUs are spinning, and thus wasting CPU execution time. Why use spin locks? 18 __raw_spin_lock_string 1: lock; decb %0 jns 3f 2: rep; nop cmpb $0, %0 jle 2b jmp 1b 3: # # # # # # # atomically decrement if clear sign bit jump forward to 3 wait spin – compare to 0 go back to 2 if <= 0 (locked) unlocked; go back to 1 to try again we have acquired the lock … From linux/include/asm-i386/spinlock.h spin_unlock merely writes 1 into the lock field. 19 Spin Lock Operations Functions used to work with spin locks: spin_lock_init – initialize a spin lock before using it for the first time spin_lock – acquire a spin lock, spin waiting if it is not available spin_unlock – release a spin lock spin_unlock_wait – spin waiting for spin lock to become available, but don't acquire it spin_trylock – acquire a spin lock if it is currently free, otherwise return error spin_is_locked – return spin lock state 20 Spin Locks & Interrupts The spin lock services also provide interfaces that serialize with interrupts (on the current processor): spin_lock_irq - acquire spin lock and disable interrupts spin_unlock_irq - release spin lock and reenable spin_lock_irqsave - acquire spin lock, save interrupt state, and disable spin_unlock_irqrestore - release spin lock and restore interrupt state Files: include/linux/spinlock.h kernel/spinlock.c 21 RW Spin Locks A read/write spin lock is a data structure that allows multiple tasks to hold it in "read" state or one task to hold it in "write" state (but not both conditions at the same time). This is convenient when multiple tasks wish to examine a data structure, but don't want to see it in an inconsistent state. A lock may not be held in read state when requesting it for write state. The data type for a read/write spin lock is rwlock_t. (include/asm-i386/spinlock.h) Writers can starve waiting behind readers. 22 RW Spin Lock Operations Several functions are used to work with read/write spin locks: rwlock_init – initialize a read/write lock before using it for the first time read_lock – get a read/write lock for read write_lock – get a read/write lock for write read_unlock – release a read/write lock that was held for read write_unlock – release a read/write lock that was held for write read_trylock, write_trylock – acquire a read/write lock if it is currently free, otherwise return error 23 RW Spin Locks & Interrupts The read/write lock services also provide interfaces that serialize with interrupts (on the current processor): read_lock_irq - acquire lock for read and disable interrupts read_unlock_irq - release read lock and reenable read_lock_irqsave - acquire lock for read, save interrupt state, and disable read_unlock_irqrestore - release read lock and restore interrupt state Corresponding functions for write exist as well (e.g., write_lock_irqsave). 24 Semaphores A semaphore is a data structure that is used to synchronize access to critical sections or other resources. A semaphore allows a fixed number of tasks (generally one for critical sections) to "hold" the semaphore at one time. Any more tasks requesting to hold the semaphore are blocked (put to sleep). A semaphore can be used for serialization only in code that is allowed to block. 25 Semaphore Operations Operations for manipulating semaphores: up – release the semaphore down – get the semaphore (can block) down_interruptible – get the semaphore, can be woken up if interrupt arrives down_trylock – try to get the semaphore without blocking, otherwise return an error Files: include/asm-i386/semaphore.h arch/i386/kernel/semaphore.c Goal: optimize for uncontended case 26 Semaphore Structure Struct semaphore • > 0: free; • = 0: in use, no waiters; • < 0: in use, waiters wait: wait queue sleepers: wait: wait queue count (atomic_t): • 0 (none), • 1 (some), occasionally 2 implementation requires lower-level synch struct semaphore atomic_t count int sleepers wait_queue_head_t wait lock prev next atomic updates, spinlock, interrupt disabling 27 Semaphores optimized assembly code for uncontended case (down()) up() is easy atomically decrement; continue contended down() is really complex! atomically increment; wake_up() if necessary uncontended down() is easy C code for slower “contended” case (__down()) basically increment sleepers and sleep loop because of potentially concurrent ups/downs still in down() path when lock is acquired 28 A hypothetical down() spin_lock_irqsave(&sem->wait.lock, flags); add_wait_queue_exclusive_locked(&sem->wait, &wait); for (;;;) { my_count = sem->count--; If (my_count >= 0) { break; } spin_unlock_irqrestore(&sem->wait.lock, flags); tsk->state = TASK_UNINTERRUPTIBLE; schedule(); spin_lock_irqsave(&sem->wait.lock, flags); } remove_wait_queue_locked(&sem->wait, &wait); wake_up_locked(&sem->wait); spin_unlock_irqrestore(&sem->wait.lock, flags); Note: not real code! 29 The Real down(), down_failed() inline down: movl $sem, %ecx # why does this work? lock; decl (%ecx)# atomically decr sem count jns 1f # if not negative jump to 1 lea %ecx, %eax # move into eax call __down_failed # 1: # we have the semaphore down_failed: pushl %edx pushl %ecx call __down popl %ecx popl %edx ret # push edx onto stack (C) # push ecx onto stack # call C function # pop ecx # pop edx from include/asm-i386/semaphore.h and arch/i386/kernel/semaphore.c) 30 __down() tsk->state = TASK_UNINTERRUPTIBLE; spin_lock_irqsave(&sem->wait.lock, flags); add_wait_queue_exclusive_locked(&sem->wait, &wait); sem->sleepers++; for (;;) { int sleepers = sem->sleepers; /* * Add "everybody else" into it. They aren't playing, * because we own the spinlock in the wait_queue head */ if (!atomic_add_negative(sleepers - 1, &sem->count)) { sem->sleepers = 0; break; } sem->sleepers = 1; /* us - see -1 above */ spin_unlock_irqrestore(&sem->wait.lock, flags); schedule(); spin_lock_irqsave(&sem->wait.lock, flags); tsk->state = TASK_UNINTERRUPTIBLE; } remove_wait_queue_locked(&sem->wait, &wait); wake_up_locked(&sem->wait); spin_unlock_irqrestore(&sem->wait.lock, flags); tsk->state = TASK_RUNNING; From linux/ lib/ semaphoresleepers.c 31 RW Semaphores A rw_semaphore is a semaphore that allows either one writer or any number of readers (but not both at the same time) to hold it. Any writer requesting to hold the rw_semaphore is blocked when there are readers holding it. A rw_semaphore can be used for serialization only in code that is allowed to block. Both types of semaphores are the only synchronization objects that should be held when blocking. Writers will not starve: once a writer arrives, readers queue behind it Increases concurrency; introduced in 2.4 Files: include/linux/rwsem.h include/asm-i386/rwsem.h 32 lib/rwsem.c RW Semaphore Operations Operations for manipulating semaphores: up_read – release a rw_semaphore held for read. up_write – release a rw_semaphore held for write. down_read – get a rw_semaphore for read (can block, if a writer is holding it) down_write – get a rw_semaphore for write (can block, if one or more readers are holding it) 33 More RW Semaphore Ops Operations for manipulating semaphores: down_read_trylock – try to get a rw_semaphore for read without blocking, otherwise return an error down_write_trylock – try to get a rw_semaphore for write without blocking, otherwise return an error downgrade_write – atomically release a rw_semaphore for write and acquire it for read (can't block) 34 Mutexes A mutex is a data structure that is also used to synchronize access to critical sections or other resources, introduced in 2.6.16. Why? (linux kernel mailing list, Documentation/mutexdesign.txt) 90% of semaphores in kernel are used as mutexes, 9% of semaphores should be spin_locks (Andrew Morton). Slow paths are more critical for highly contended semaphores on SMP Mutexes are simpler (lighter weight) • tighter code • slightly faster, better scalability • no fastpath tradeoffs • debug support – strict checking of adhering to semantics 35 Mutexes (Continued) Mutex -- why not? not the same as semaphores cannot be used from interrupt context cannot be unlocked from a different context than that in which they were locked 36 Mutex Operations Operations for manipulating mutexes: mutex_unlock – release the mutex mutex_lock – get the mutex (can block) mutex_lock_interruptible – get the mutex, but allow interrupts mutex_trylock – try to get the mutex without blocking, otherwise return an error mutex_is_locked – determine if mutex is locked 37 Completions Slightly higher-level, FIFO semaphores Up/down may execute concurrently Solves a subtle synch problem on SMP This is a good thing (when possible) Operations: complete(), wait_for_complete() Spinlock and wait_queue Spinlock serializes ops Wait_queue enforces FIFO 38 Backup Slides The Big Reader Lock Reader optimized RW spinlock RW spinlock suffers cache contention Per-CPU, cache-aligned lock arrays One for reader portion, another for writer portion To read: set bit in reader array, spin on writer On lock and unlock because of write to rwlock_t Acquire when writer lock free; very fast! To write: set bit and scan ALL reader bits Acquire when reader bits all free; very slow! 40 Kernel Synchronization A piece of code is considered reentrant if two tasks can be running in the code at the same time without behaving incorrectly. In the uniprocessor (UP) kernel, when a task is running anywhere in the kernel, no other task can be executing anywhere in the kernel. A piece of code can allow another task to run somewhere in the kernel by blocking or sleeping (saving the task state and running other tasks) in the kernel, waiting for some event to occur. 41 MP Kernel Synchronization On an MP system, multiple tasks can be running in the kernel at the same time. Thus, all code must assume that another task can attempt to run in the code. Any code segment that cannot tolerate multiple tasks executing in it is called a critical section. A critical section must block all tasks but one that attempt to execute it. 42 The Big Kernel Lock (BKL) For serialization that is not performance sensitive, the big kernel lock (BKL) was used This mechanism is historical and should generally be avoided. The function lock_kernel gets the big kernel lock. The function unlock_kernel releases the big kernel lock. The function kernel_locked returns whether the kernel lock is currently held by the current task. The big kernel lock itself is a simple lock called kernel_flag. 43 When Synch Is Not Necessary simplifying assumptions (2.4 uniprocessor) kernel is non pre-emptive process holds cpu while in kernel • unless it blocks, relinquishes cpu, or interrupt locking only needed against interrupts smp kernels require locking smp locks compile out for uniprocessor kernels • no overhead! 2.5 introduces limited kernel pre-emption to reduce scheduler, interrupt latency make Linux more responsive for real-time apps introduces preemption points in kernel code 44 What about ‘cli’? Disabling interrupts will stop time-sharing among tasks on a uniprocessor system But it would be unfair in to allow this in a multi-user system (monopolize the CPU) So cli is a privileged instruction: it cannot normally be executed by user-mode tasks It won’t work for a multiprocessor system 45 Special x86 instructions Need to use x86 assembly-language to implement atomic sync operations Several instruction-choices are possible, but ‘btr’ and ‘bts’ are simplest to use: ‘btr’ means ‘bit-test-and-reset’ ‘bts’ means ‘bit-test-and’set’ Syntax and semantics: asm(“ btr $0, lock “); asm(“ bts $0, lock “); // acquire the lock // release the lock 46 The x86 ‘lock’ prefix In order for the ‘btr’ instruction to perform an ‘atomic’ update (when multiple CPUs are using the same bus to access memory simultaneously), it is necessary to insert an x86 ‘lock’ prefix, like this: asm(“ spin: lock btr $0, mutex “); This instruction ‘locks’ the shared system-bus during this instruction execution -- so another CPU cannot intervene 47 Reentrancy More than one process (or processor) can be safely executing reentrant concurrently It needs to obey two cardinal rules: It contains no ‘self-modifying’ instructions Access to shared variables is ‘exclusive’ 48 Program A int flag1 = 0, flag2 = 0; void p1 (void flag1 = 1; if (!flag2) } void p2 (void flag2 = 1; if (!flag1) } *ignored) { { /* critical section */ } *ignored) { { /* critical section */ } Can both critical sections run? 49 Program B int data = 0, ready = 0; void p1 (void *ignored) { data = 2000; ready = 1; } void p2 (void *ignored) { while (!ready) ; use (data); } Can use be called with value 0? 50 Program C int a = 0, b = 0; void p1 (void *ignored) { a = 1; } void p2 (void *ignored) { if (a == 1) b = 1; } void p3 (void *ignored) { if (b == 1) use (a); } Can use be called with value 0? 51 Correct Answers Program A: I don’t know Program B: I don’t know Program C: I don’t know Why? It depends on your hardware If it provides sequential consistency, then answers all No But not all hardware provides sequential consistency [BTW, examples and some other slide content from excellent Tech Report by Adve & Gharachorloo] 52 Sequential Consistency (SC) Sequential consistency: The result of execution is as if all operations were executed in some sequential order, and the operations of each processor occurred in the order specified by the program. [Lamport] Boils down to two requirements: Maintaining program order on individual processors Ensuring write atomicity Why doesn’t all hardware support sequential consistency? 53 SC Limits HW Optimizations Write buffers Overlapping write operations can be reordered Concurrent writes to different memory modules Coalescing writes to same cache line Non-blocking reads E.g., read flag n before flag (2 − n) written through in Program A E.g., speculatively prefetch data in Program B Cache coherence Write completion only after invalidation/update (Program B) Can’t have overlapping updates (Program C) 54 SC Thwarts Compiler Opts Code motion Caching values in registers E.g., ready flag in Program B Common subexpression elimination Loop blocking Software pipelining 55