W4118 Operating Systems
Instructor: Junfeng Yang

Logistics
Homework 2 update
  Assembly code to call a hook function written in C
    • syscall_fail(long syscall_nr)
  Clarifications on sys_fail (int ith, int ncall, struct syscall_failures calls)
    • Only fail system calls from the current process
    • Each fail() injects only one failure
    • If a system call matches one of the system call numbers specified in the calls argument, count it as one matching call toward ith
Mac users: talk to TA to get access to VMware Fusion

Last lecture
Processes in Linux
  Context switch on x86
    • Kernel stack captures all state
    • switch stack = switch process
Threads: good at expressing concurrency efficiently
Multithreading models: different tradeoffs
Race conditions

Recall: banking example

    #include <pthread.h>
    #include <stdio.h>

    int balance = 1000;

    void* deposit(void *arg)
    {
        int i;
        for(i=0; i<1e7; ++i)
            ++ balance;     // not atomic: load, add, store
        return NULL;
    }

    void* withdraw(void *arg)
    {
        int i;
        for(i=0; i<1e7; ++i)
            -- balance;     // not atomic: load, sub, store
        return NULL;
    }

    int main()
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, deposit, (void*)1);
        pthread_create(&t2, NULL, withdraw, (void*)2);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("all done: balance = %d\n", balance);
        return 0;
    }

Recall: a closer look at the banking example

    $ objdump -d bank
    …
    08048464 <deposit>:
    …
     8048473:  a1 80 97 04 08    mov  0x8049780,%eax    // ++ balance
     8048478:  83 c0 01          add  $0x1,%eax
     804847b:  a3 80 97 04 08    mov  %eax,0x8049780
    …
    0804849b <withdraw>:
    …
     80484aa:  a1 80 97 04 08    mov  0x8049780,%eax    // -- balance
     80484af:  83 e8 01          sub  $0x1,%eax
     80484b2:  a3 80 97 04 08    mov  %eax,0x8049780
    …

Avoiding Race Conditions
Race condition: a timing-dependent error involving shared state
Critical section: a segment of code that accesses a shared variable (or resource) and must not be concurrently executed by more than one thread

    // ++ balance                 // -- balance
    mov  0x8049780,%eax           mov  0x8049780,%eax
    add  $0x1,%eax                sub  $0x1,%eax
    mov  %eax,0x8049780           mov  %eax,0x8049780
    …                             …

How to implement critical sections?
Atomic operations: no other instructions can be interleaved; executed "as a unit", "all or none", guaranteed by hardware
A possible solution: create a super instruction that does what we want atomically
    • add $0x1, 0x8049780
Problem
    • Can't anticipate every possible way we want atomicity
    • Increases hardware complexity, slows down other instructions

    // ++ balance                 // -- balance
    mov  0x8049780,%eax           mov  0x8049780,%eax
    add  $0x1,%eax                sub  $0x1,%eax
    mov  %eax,0x8049780           mov  %eax,0x8049780
    …                             …

Layered approach to synchronization
Hardware provides simple low-level atomic operations, upon which we can build high-level atomic operations, upon which we can implement critical sections and build correct multi-threaded/multi-process programs
Layers, top to bottom:
    Properly synchronized application
    High-level synchronization primitives
    Hardware-provided low-level atomic operations

Example low-level atomic operations and high-level synchronization primitives
Low-level atomic operations
  On uniprocessor, disable/enable interrupt
  x86 load and store of words
  Special instructions:
    • Test-and-set
High-level synchronization primitives
  Lock
  Semaphore
  Monitor
Will look at them all. Start with lock.

Locks (or Mutex: Mutual exclusion)
Two common operations
  lock(): acquire lock exclusively; wait if not available
  unlock(): release exclusive access to lock
pthread example

    pthread_mutex_t l = PTHREAD_MUTEX_INITIALIZER;

    void* deposit(void *arg)
    {
        int i;
        for(i=0; i<1e7; ++i) {
            pthread_mutex_lock(&l);
            ++ balance;
            pthread_mutex_unlock(&l);
        }
        return NULL;
    }

    void* withdraw(void *arg)
    {
        int i;
        for(i=0; i<1e7; ++i) {
            pthread_mutex_lock(&l);
            -- balance;
            pthread_mutex_unlock(&l);
        }
        return NULL;
    }

Critical Section Goals
Requirements
  Safety (aka mutual exclusion): no more than one thread in critical section at a time.
  Liveness (aka progress):
    • If multiple threads simultaneously request to enter critical section, must allow one to proceed
    • Must not depend on threads outside critical section
  Bounded waiting (aka starvation-free)
    • Must eventually allow waiting thread to proceed
  Makes no assumptions about the speed and number of CPUs
    • However, assumes each thread makes progress
Desirable properties
  Efficient: don't consume too many resources while waiting
    • Don't busy wait (spin wait). Better to relinquish CPU and let other thread run
  Fair: don't make some threads wait longer than others. Hard to do efficiently
  Simple: should be easy to use

Implementing Locks: version 1
Can cheat on uniprocessor: implement locks by disabling and enabling interrupts

    lock()
    {
        disable_interrupt();
    }

    unlock()
    {
        enable_interrupt();
    }

Linux kernel heavily used this trick in single core days

    cli(): __asm__ __volatile__("cli": : :"memory")
    sti(): __asm__ __volatile__("sti": : :"memory")

Good: simple!
Bad:
  Both operations are privileged, can't let user programs use them
  Doesn't work on multiprocessors

Implementing Locks: version 2
Peterson's algorithm: software-based lock implementation
Good: doesn't require much from hardware
Only assumptions:
  Loads and stores are atomic
  They execute in order
  Does not require special hardware instructions

Software-based lock: 1st attempt

    // 0: lock is available, 1: lock is held by a thread
    int flag = 0;

    lock()
    {
        while (flag == 1)
            ; // spin wait
        flag = 1;
    }

    unlock()
    {
        flag = 0;
    }

Idea: use one flag, test then set; if unavailable, spin-wait (or busy-wait)
Why doesn't it work?
  Not safe: both threads can be in critical section
  Not efficient: busy wait, particularly bad on uniprocessor (will solve this later)

Software-based locks: 2nd attempt

    // 1: a thread wants to enter critical section, 0: it doesn't
    int flag[2] = {0, 0};

    lock()
    {
        flag[self] = 1; // I need lock
        while (flag[1 - self] == 1)
            ; // spin wait
    }

    unlock()
    {
        // not any more
        flag[self] = 0;
    }

Idea: use per thread flags, set then test, to achieve mutual exclusion
Why doesn't it work?
  Not live: can deadlock

Software-based locks: 3rd attempt

    // whose turn is it?
    int turn = 0;

    lock()
    {
        // wait for my turn
        while (turn == 1 - self)
            ; // spin wait
    }

    unlock()
    {
        // I'm done. your turn
        turn = 1 - self;
    }

Idea: strict alternation to achieve mutual exclusion
Why doesn't it work?
  Not live: depends on threads outside critical section

Software-based locks: final attempt (Peterson's algorithm)

    // whose turn is it?
    int turn = 0;
    // 1: a thread wants to enter critical section, 0: it doesn't
    int flag[2] = {0, 0};

    lock()
    {
        flag[self] = 1; // I need lock
        turn = 1 - self;
        // wait for my turn
        while (flag[1 - self] == 1
               && turn == 1 - self)
            ; // spin wait while the
              // other thread has intent
              // AND it is the other
              // thread's turn
    }

    unlock()
    {
        // not any more
        flag[self] = 0;
    }

Why does it work?
  Safe?
  Live?
  Bounded wait?

Notes on Peterson's algorithm
Great way to start thinking of multi-threaded programming
  Scheduler is "malicious"
The algorithm is useful in other contexts as well
Problem: doesn't work with N>2 threads
    • Obvious extension (N flags, turn = 0, 1, …, N-1) doesn't work
    • Leslie Lamport's Bakery algorithm handles N threads
More importantly, doesn't really work on modern out-of-order processors
Next: implement locks with hardware support

Implementing locks: version 3
Problem with the test-then-set approach: test and set are not atomic
Special atomic operation
  test_and_set address, register
    • load *address to register, and set *address to 1

    // 0: lock is available, 1: lock is held by a thread
    int flag = 0;

    lock()
    {
        while(test_and_set(&flag))
            ;
    }

    unlock()
    {
        flag = 0;
    }

Why does it work?
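The version 3 lock can be tried out in portable C. The following is a minimal sketch, assuming a C11 compiler with <stdatomic.h> and pthreads; atomic_exchange stands in for the hardware test_and_set, and the names spin_lock, spin_unlock, ROUNDS and run_bank_demo are hypothetical, chosen only for this illustration. It works because the exchange returns the old value and writes 1 in one indivisible step, so at most one thread can ever observe the 0 that means "available".

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define ROUNDS 100000           /* hypothetical iteration count for the demo */

/* 0: lock is available, 1: lock is held by a thread */
static atomic_int flag = 0;
static int balance = 1000;

/* Atomically store 1 into *addr and return the previous value. */
static int test_and_set(atomic_int *addr)
{
    return atomic_exchange(addr, 1);
}

static void spin_lock(void)
{
    while (test_and_set(&flag))
        ;                       /* old value was 1: someone else holds it */
}

static void spin_unlock(void)
{
    atomic_store(&flag, 0);
}

static void *deposit(void *arg)
{
    (void)arg;
    for (int i = 0; i < ROUNDS; ++i) {
        spin_lock();
        ++balance;              /* critical section */
        spin_unlock();
    }
    return NULL;
}

static void *withdraw(void *arg)
{
    (void)arg;
    for (int i = 0; i < ROUNDS; ++i) {
        spin_lock();
        --balance;              /* critical section */
        spin_unlock();
    }
    return NULL;
}

/* Runs one depositor and one withdrawer; returns the final balance. */
int run_bank_demo(void)
{
    pthread_t t1, t2;
    balance = 1000;
    pthread_create(&t1, NULL, deposit, NULL);
    pthread_create(&t2, NULL, withdraw, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return balance;
}
```

Calling run_bank_demo() from main (compile with -pthread) should give back the starting balance, since every increment and decrement now runs under mutual exclusion. The lock still busy-waits, so it keeps the efficiency problem noted for the earlier attempts.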
test_and_set on x86
xchg reg, addr: atomically swaps *addr and reg
spin_lock in Linux can be implemented using this instruction (include/asm-i386/spin_lock.h)
Another way: put a "lock" prefix before an instruction
  Examples in include/asm-i386/atomic.h

    int test_and_set(volatile int *lock)
    {
        int old;
        asm("xchgl %0, %1"
            : "=r"(old), "+m"(*lock) // output
            : "0"(1)                 // input
            : "memory"               // can clobber anything in
                                     // memory, so gcc won't
                                     // reorder this statement with others
        );
        return old;
    }

Spin-wait or block
So far the lock implementations we've seen are busy-wait or spin-wait locks: endlessly checking the lock flag without yielding CPU
Problem: waste CPU cycles
  Worst case: prev thread holding a busy-wait lock gets preempted, other threads try to acquire the same lock
On uniprocessor: should not use spin-lock
  Yield CPU when lock not available (need OS support)
On multi-processor
  Thread holding lock gets preempted ???
  Correct action depends on how long before lock release
    • Lock released "quickly": spin-wait
    • Lock released "slowly": block
    • Quick or slow is relative to context-switch overhead
Good plan: spin a bit, then block

Problem with simple yield

    lock()
    {
        while(test_and_set(&flag))
            yield();
    }

Problem:
  Still a lot of context switches
  Starvation possible
Why? No control over who gets the lock next
Need explicit control over who gets the lock

Implementing locks: version 4

    typedef struct __mutex_t {
        int flag;    // 0: mutex is available, 1: mutex is not available
        int guard;   // guard lock to internal mutex data structure
        queue_t *q;  // queue of waiting threads
    } mutex_t;

    void lock(mutex_t *m)
    {
        while (test_and_set(&m->guard))
            ; // acquire guard lock by spinning
        if (m->flag == 0) {
            m->flag = 1; // acquire mutex
            m->guard = 0;
        } else {
            enqueue(m->q, self);
            m->guard = 0;
            yield();
        }
    }

    void unlock(mutex_t *m)
    {
        while (test_and_set(&m->guard))
            ;
        if (queue_empty(m->q))
            // release mutex; no one wants mutex
            m->flag = 0;
        else
            // hold mutex (for next thread!)
            wakeup(dequeue(m->q));
        m->guard = 0;
    }

Add thread to queue when lock unavailable

Semaphore
Synchronization tool that does not require busy waiting
Semaphore S: integer variable
Two standard operations modify S: acquire() and release()
  Originally called P() and V()
    • from Dutch Proberen and Verhogen (Dijkstra)
  Also called down() and up()
  And even wait() and signal()
Higher-level abstraction, less complicated
Can only be accessed via two indivisible (atomic) operations

    P(S) {
        while (S <= 0)
            ;
        S--;
    }

    V(S) {
        S++;
    }

Semaphore Types
Counting semaphore: integer value can range over an unrestricted domain
  Used for synchronization
Binary semaphore: integer value can range only between 0 and 1; can be simpler to implement
  Used for mutual exclusion: same as mutex

    Process i
        P(S);
        Critical Section
        V(S);
        Remainder Section

Semaphore Implementation
Must guarantee that no two processes can execute P() / acquire() and V() / release() on the same semaphore at the same time
Thus, implementation of these operations becomes the critical section problem again, where the acquire and release code are placed inside the critical section.
  Could now have busy waiting in critical section implementation
But if we know we can't acquire the semaphore, should we busy wait and burn up the CPU?
  Note that applications may spend lots of time in critical sections and therefore this is not a good solution.
  We'd like a semaphore that sleeps (or at least lets someone else run)

Semaphore Implementation with no Busy waiting
With each semaphore there is an associated waiting queue. Each entry in a waiting queue has two data items:
  value (of type integer)
  pointer to next record in the list
Two operations:
  block: place the process invoking the operation on the appropriate waiting queue.
  wakeup: remove one of the processes in the waiting queue and place it in the ready queue.
Potential queuing policies: FIFO, LIFO, undef

Semaphore Implementation with no Busy waiting (Cont.)
Implementation of acquire():
Implementation of release():

Producer-Consumer Problem
Bounded buffer: size 'N'
  Access entry 0 … N-1, then "wrap around" to 0 again
Producer process writes data to buffer
  Must not write more than 'N' items more than consumer "ate"
Consumer process reads data from buffer
  Should not try to consume if there is no data
[Figure: circular buffer with slots 0, 1, …, N-1 and In / Out pointers]

Solving Producer-Consumer Problem
Solving with semaphores
  We'll use two kinds of semaphores
We'll use counters to track how much data is in the buffer
    • One counter counts as we add data and stops the producer if there are N objects in the buffer
    • A second counter counts as we remove data and stops a consumer if there are 0 in the buffer
Idea: since general semaphores can count for us, we don't need a separate counter variable
Why do we need a second kind of semaphore?
  We'll also need a mutex semaphore

Producer-Consumer Problem

    Shared: Semaphores mutex, empty, full;
    Init:   mutex = 1;  /* for mutual exclusion */
            empty = N;  /* number empty buf entries */
            full = 0;   /* number full buf entries */

    Producer
        do {
            ...
            // produce an item in nextp
            ...
            P(empty);
            P(mutex);
            ...
            // add nextp to buffer
            ...
            V(mutex);
            V(full);
        } while (true);

    Consumer
        do {
            P(full);
            P(mutex);
            ...
            // remove item to nextc
            ...
            V(mutex);
            V(empty);
            ...
            // consume item in nextc
            ...
        } while (true);

Common Errors with Semaphores

    Process i
        P(S)
        CS
        P(S)
    A typo. Process i will get stuck (forever) the second time it does the P() operation. Moreover, every other process will freeze up too when trying to enter the critical section!

    Process j
        V(S)
        CS
        V(S)
    A typo. Process j won't respect mutual exclusion even if the other processes follow the rules correctly. Worse still, once we've done two "extra" V() operations this way, other processes might get into the CS inappropriately!

    Process k
        P(S)
        CS
    Whoever next calls P() will freeze up. The bug might be confusing because that other process could be perfectly correct code, yet that's the one you'll see hung when you use the debugger to look at its state!

    Process l
        P(S)
        if (something) return;
        CS
        V(S)
    Someone forgot to release the semaphore before returning! The next caller will get stuck.
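The producer-consumer pseudocode from the earlier slide maps almost line for line onto POSIX semaphores, with sem_wait playing the role of P() and sem_post the role of V(). The following is a minimal sketch under stated assumptions: unnamed semaphores via sem_init (available on Linux, not supported on macOS), one producer and one consumer, and hypothetical names N, ITEMS and run_bounded_buffer that are not part of the lecture. Note that each thread takes the counting semaphore before mutex; swapping the two P() calls could deadlock with a full or empty buffer.

```c
#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

#define N 4             /* buffer capacity (hypothetical value) */
#define ITEMS 1000      /* number of items to pass through (hypothetical) */

static int buffer[N];
static int in = 0, out = 0;         /* next slot to write / read */
static sem_t mutex, empty, full;    /* same roles as on the slide */
static long consumed_sum;           /* written only by the consumer */

static void *producer(void *arg)
{
    (void)arg;
    for (int i = 1; i <= ITEMS; i++) {
        sem_wait(&empty);           /* P(empty): wait for a free slot */
        sem_wait(&mutex);           /* P(mutex) */
        buffer[in] = i;             /* add item to buffer */
        in = (in + 1) % N;          /* wrap around to 0 */
        sem_post(&mutex);           /* V(mutex) */
        sem_post(&full);            /* V(full): one more full slot */
    }
    return NULL;
}

static void *consumer(void *arg)
{
    (void)arg;
    for (int i = 0; i < ITEMS; i++) {
        sem_wait(&full);            /* P(full): wait for data */
        sem_wait(&mutex);           /* P(mutex) */
        int item = buffer[out];     /* remove item from buffer */
        out = (out + 1) % N;
        sem_post(&mutex);           /* V(mutex) */
        sem_post(&empty);           /* V(empty): one more free slot */
        consumed_sum += item;       /* consume outside the lock */
    }
    return NULL;
}

/* Runs one producer and one consumer; returns the sum of consumed items. */
long run_bounded_buffer(void)
{
    pthread_t p, c;
    in = out = 0;
    consumed_sum = 0;
    sem_init(&mutex, 0, 1);         /* for mutual exclusion */
    sem_init(&empty, 0, N);         /* number of empty entries */
    sem_init(&full, 0, 0);          /* number of full entries */
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    sem_destroy(&mutex);
    sem_destroy(&empty);
    sem_destroy(&full);
    return consumed_sum;
}
```

With ITEMS set to 1000 the producer sends 1, 2, …, 1000, so run_bounded_buffer() should return 500500 if no item is lost or read twice. Dropping a sem_post or duplicating a sem_wait in this sketch is an easy way to reproduce the errors listed under "Common Errors with Semaphores".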