CS194-24
Advanced Operating Systems Structures and Implementation
Lecture 6: Parallelism and Synchronization
February 11th, 2013
Prof. John Kubiatowicz
http://inst.eecs.berkeley.edu/~cs194-24

Goals for Today
• Multithreading/POSIX support for threads
• Interprocess Communication
• Synchronization
Interactive is important! Ask Questions!
Note: Some slides and/or pictures in the following are adapted from slides ©2013
2/13/13 Kubiatowicz CS194-24 ©UCB Fall 2013 Lec 6.2

Recall: Process Scheduling
• PCBs move from queue to queue as they change state
  – Decisions about which order to remove from queues are scheduling decisions
  – Many algorithms possible (a few weeks from now)

Recall: Example of fork()
  int main(int argc, char **argv)
  {
    char *name = argv[0];
    int child_pid = fork();
    if (child_pid == 0) {
      printf("Child of %s sees PID of %d\n", name, child_pid);
      return 0;
    } else {
      printf("I am the parent %s. My child is %d\n", name, child_pid);
      return 0;
    }
  }
  _______________________________
  % ./forktest
  Child of forktest sees PID of 0
  I am the parent forktest. My child is 486

A (more compelling?) example of fork(): Web Server
• Serial Version:
  int main() {
    int listen_fd = listen_for_clients();
    while (1) {
      int client_fd = accept(listen_fd);
      handle_client_request(client_fd);
      close(client_fd);
    }
  }
• Process Per Request:
  int main() {
    int listen_fd = listen_for_clients();
    while (1) {
      int client_fd = accept(listen_fd);
      if (fork() == 0) {
        handle_client_request(client_fd);
        close(client_fd);  // Close FD in child when done
        exit(0);
      } else {
        close(client_fd);  // Close FD in parent
        // Let exited children rest in peace!
        while (waitpid(-1, &status, WNOHANG) > 0);
      }
    }
  }

Recall: Parent-Child relationship
[Figure: typical process tree for a Solaris system]
• Every thread (and/or process) has a parentage
  – A "parent" is a thread that creates another thread
  – A child of a parent was created by that parent

Multiple Processes Collaborate on a Task
[Figure: Proc 1, Proc 2, Proc 3]
• (Relatively) high creation/memory overhead
• (Relatively) high context-switch overhead
• Need a communication mechanism, since separate address spaces isolate processes:
  – Shared-memory mapping
    » Accomplished by mapping addresses to common DRAM
    » Read and write through memory
  – Message passing
    » send() and receive() messages
    » Works across a network
  – Pipes, sockets, signals, synchronization primitives, …

Message queues
• What are they?
  – Similar to FIFO pipes, except that a tag (type) is matched when reading/writing
    » Allows cutting in line ("I am only interested in a particular type of message")
    » Equivalent to merging multiple FIFO pipes into one
• Creating a message queue:
  – int msgget(key_t key, int msgflag);
  – Key can be any large number. But to avoid using conflicting keys in different programs, derive the key with ftok() (the "key master"):
    » key_t ftok(const char *path, int id);
      • path points to a file that the process can stat
      • id: project ID; only the last 8 bits are used
• Message queue operations:
    int msgget(key_t key, int flag);
    int msgctl(int msgid, int cmd, struct msqid_ds *buf);
    int msgsnd(int msgid, const void *ptr, size_t nbytes, int flag);
    int msgrcv(int msgid, void *ptr, size_t nbytes, long type, int flag);
• Performance advantage is no longer there in newer systems (compared with pipes)

Shared Memory Communication
[Figure: two virtual address spaces (Prog 1 and Prog 2), each with its own code, data, heap, and stack, plus a shared region mapped into both]
• Communication occurs by "simply" reading/writing to a shared address page
  – Really low-overhead communication
  – Introduces complex synchronization problems

Shared Memory
• Common chunk of read/write memory among processes
[Figure: a shared-memory segment (unique key, addresses 0…MAX) is created by one process; Procs 1–5 attach it and access it through pointers]
Creating Shared Memory
  // Create new segment
  int shmget(key_t key, size_t size, int shmflg);
Example:
  key_t key;
  int shmid;
  key = ftok("<somefile>", 'A');
  shmid = shmget(key, 1024, 0644 | IPC_CREAT);
• Special key: IPC_PRIVATE (create new segment)
• Flags:
  – IPC_CREAT (create new segment)
  – IPC_EXCL (fail if segment with key already exists)
  – lower 9 bits: permissions to use on the new segment

Attach and Detach Shared Memory
  // Attach
  void *shmat(int shmid, void *shmaddr, int shmflg);
  // Detach
  int shmdt(void *shmaddr);
Example:
  key_t key;
  int shmid;
  char *data;
  key = ftok("<somefile>", 'A');
  shmid = shmget(key, 1024, 0644);
  data = shmat(shmid, (void *)0, 0);
  shmdt(data);
• Flags: SHM_RDONLY, SHM_REMAP

Administrivia
• In the news: new Apple product rumor: iWatch
  – Is it real? Is it desirable?
  – Needs *REALLY* long battery life
  – Can't require keyboard entry
  – What might it do?
  – Supposedly running iOS?
• Make sure you update info on Redmine
  – Some of you still have cs194-xx as your name!
  – Update email address, etc.
• Groups posted on website and on Piazza
• Problems with infrastructure?
• Developing FAQ – please tell us about problems
• Design Document
  – What is in a Design Document?
  – BDD => "No Design Document"!?

Thread Level Parallelism (TLP)
• In modern processors, Instruction Level Parallelism (ILP) exploits implicit parallel operations within a loop or straight-line code segment
• Thread Level Parallelism (TLP) is explicitly represented by the use of multiple threads of execution that are inherently parallel
  – Threads can be on a single processor
  – Or, on multiple processors
• Concurrency vs Parallelism
  – Concurrency is when two tasks can start, run, and complete in overlapping time periods. It doesn't necessarily mean they'll ever both be running at the same instant.
    » For instance, multitasking on a single-threaded machine
  – Parallelism is when tasks literally run at the same time, e.g. on a multicore processor
• Goal: use multiple instruction streams to improve
  – Throughput of computers that run many programs
  – Execution time of multi-threaded programs

Multiprocessing vs Multiprogramming
• Remember definitions:
  – Multiprocessing ≡ multiple CPUs
  – Multiprogramming ≡ multiple jobs or processes
  – Multithreading ≡ multiple threads per process
• What does it mean to run two threads "concurrently"?
  – Scheduler is free to run threads in any order and interleaving: FIFO, random, …
  – Dispatcher can choose to run each thread to completion or time-slice in big chunks or small chunks
[Figure: with multiprocessing, threads A, B, C run simultaneously on different CPUs; with multiprogramming, their executions are interleaved on one CPU (e.g., A B C A B C B)]

Correctness for systems with concurrent threads
• If the dispatcher can schedule threads in any way, programs must work under all circumstances
  – Can you test for this?
  – How can you know if your program works?
• Independent threads:
  – No state shared with other threads
  – Deterministic: input state determines results
  – Reproducible: can recreate starting conditions, I/O
  – Scheduling order doesn't matter (if switch() works!!!)
• Cooperating threads:
  – Shared state between multiple threads
  – Non-deterministic
  – Non-reproducible
• Non-deterministic and non-reproducible means that bugs can be intermittent
  – Sometimes called "Heisenbugs"

Interactions Complicate Debugging
• Is any program truly independent?
  – Every process shares the file system, OS resources, network, etc.
  – Extreme example: a buggy device driver causes thread A to crash "independent" thread B
• You probably don't realize how much you depend on reproducibility:
  – Example: Evil C compiler
    » Modifies files behind your back by inserting errors into a C program unless you insert debugging code
  – Example: Debugging statements can overrun the stack
• Non-deterministic errors are really difficult to find
  – Example: Memory layout of kernel+user programs
    » Depends on scheduling, which depends on timer/other things
    » Original UNIX had a bunch of non-deterministic errors
  – Example: Something which does interesting I/O
    » User typing of letters used to help generate secure keys

Why allow cooperating threads?
• People cooperate; computers help/enhance people's lives, so computers must cooperate
  – By analogy, the non-reproducibility/non-determinism of people is a notable problem for "carefully laid plans"
• Advantage 1: Share resources
  – One computer, many users
  – One bank balance, many ATMs
    » What if ATMs were only updated at night?
  – Embedded systems (robot control: coordinate arm & hand)
• Advantage 2: Speedup
  – Overlap I/O and computation
    » Many different file systems do read-ahead
  – Multiprocessors – chop up program into parallel pieces
• Advantage 3: Modularity
  – More important than you might think
  – Chop large problem up into simpler pieces
    » To compile, for instance, gcc calls cpp | cc1 | cc2 | as | ld
    » Makes system easier to extend

High-level Example: Web Server
• Server must handle many requests
• Non-cooperating version:
  serverLoop() {
    con = AcceptCon();
    ProcessFork(ServiceWebPage(), con);
  }
• What are some disadvantages of this technique?
Threaded Web Server
• Now, use a single process
• Multithreaded (cooperating) version:
  serverLoop() {
    connection = AcceptCon();
    ThreadFork(ServiceWebPage(), connection);
  }
• Looks almost the same, but has many advantages:
  – Can share file caches kept in memory, results of CGI scripts, other things
  – Threads are much cheaper to create than processes, so this has a lower per-request overhead
• Question: would a user-level (say one-to-many) thread package make sense here?
  – When one request blocks on disk, all block…
• What about Denial of Service attacks or digg / Slashdot effects?

Thread Pools
• Problem with previous version: unbounded threads
  – When the web site becomes too popular – throughput sinks
• Instead, allocate a bounded "pool" of worker threads, representing the maximum level of multiprogramming
[Figure: a master thread enqueues connections on a queue serviced by a pool of worker threads]
  master() {
    allocThreads(worker, queue);
    while(TRUE) {
      con = AcceptCon();
      Enqueue(queue, con);
      wakeUp(queue);
    }
  }
  worker(queue) {
    while(TRUE) {
      con = Dequeue(queue);
      if (con == null)
        sleepOn(queue);
      else
        ServiceWebPage(con);
    }
  }

Common Notions of Thread Creation
• cobegin/coend
    cobegin
      job1(a1);
      job2(a2);
    coend
  – Statements in block may run in parallel
  – cobegins may be nested
  – Scoped, so you cannot have a missing coend
• fork/join
    tid1 = fork(job1, a1);
    job2(a2);
    join tid1;
  – Forked procedure runs in parallel
  – Wait at join point if it's not finished
• future
    v = future(job1(a1));
    … = …v…;
  – Future possibly evaluated in parallel
  – Attempt to use return value will wait
• forall
    forall(I from 1 to N)
      C[I] = A[I] + B[I]
    end
  – Separate thread launched for each iteration
  – Implicit join at end
• Threads expressed in the code may not turn into independent computations
  – Only create threads if processors idle
  – Example: thread-stealing runtimes such as Cilk
Overview of POSIX Threads
• Pthreads: the POSIX threading interface
  – System calls to create and synchronize threads
  – Should be relatively uniform across UNIX-like OS platforms
  – Originally IEEE POSIX 1003.1c
• Pthreads contain support for
  – Creating parallelism
  – Synchronizing
  – No explicit support for communication, because shared memory is implicit; a pointer to shared data is passed to a thread
    » Only for the HEAP! Stacks are not shared

Forking POSIX Threads
• Signature:
  int pthread_create(pthread_t *,
                     const pthread_attr_t *,
                     void * (*)(void *),
                     void *);
• Example call:
  errcode = pthread_create(&thread_id, &thread_attribute,
                           &thread_fun, &fun_arg);
• thread_id is the thread id or handle (used to halt, etc.)
• thread_attribute: various attributes
  – Standard default values obtained by passing a NULL pointer
  – Sample attribute: minimum stack size
• thread_fun: the function to be run (takes and returns void*)
• fun_arg: an argument can be passed to thread_fun when it starts
• errcode will be set nonzero if the create operation fails

Simple Threading Example (pThreads)
  void* SayHello(void *foo) {
    printf( "Hello, world!\n" );
    return NULL;
  }

  int main() {
    pthread_t threads[16];
    int tn;
    for(tn=0; tn<16; tn++) {
      pthread_create(&threads[tn], NULL, SayHello, NULL);
    }
    for(tn=0; tn<16; tn++) {
      pthread_join(threads[tn], NULL);
    }
    return 0;
  }
• E.g., compile using gcc -lpthread

Shared Data and Threads
• Variables declared outside of main are shared
• Objects allocated on the heap may be shared (if a pointer is passed)
• Variables on the stack are private: passing pointers to these around to other threads can cause problems
• Often done by creating a large "thread data" struct, which is passed into all threads as an argument
  char *message = "Hello World!\n";
  pthread_create(&thread1, NULL, print_fun, (void*) message);
Loop Level Parallelism
• Many applications have parallelism in loops
  double stuff[n][n];
  for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++)
      … pthread_create(…, update_stuff, …, &stuff[i][j]);
• But the overhead of thread creation is nontrivial
  – update_stuff should have a significant amount of work
• Common performance pitfall: too many threads
  – The cost of creating a thread is 10s of thousands of cycles on modern architectures
  – Solution: thread blocking: use a small # of threads, often equal to the number of cores/processors or hardware threads

Thread Scheduling
[Figure: a main thread forks threads A–E, whose executions interleave over time]
• Once created, when will a given thread run?
  – It is up to the operating system or hardware, but it will run eventually, even if you have more threads than cores
  – But – scheduling may be non-ideal for your application
• Programmer can provide hints or affinity in some cases
  – E.g., create exactly P threads and assign to P cores
• Can provide user-level scheduling for some systems
  – Application-specific tuning based on programming model
  – Work in the ParLab on making user-level scheduling easy to do (Lithe, PULSE)

Review: Synchronization problem with Threads
• One thread per transaction, each running:
  Deposit(acctId, amount) {
    acct = GetAccount(acctId);  /* May use disk I/O */
    acct->balance += amount;
    StoreAccount(acct);         /* Involves disk I/O */
  }
• Unfortunately, shared state can get corrupted by an interleaving like:
  Thread 1: load r1, acct->balance
  Thread 2: load r1, acct->balance
  Thread 2: add r1, amount2
  Thread 2: store r1, acct->balance
  Thread 1: add r1, amount1
  Thread 1: store r1, acct->balance
• Atomic Operation: an operation that always runs to completion or not at all
  – It is indivisible: it cannot be stopped in the middle and state cannot be modified by someone else in the middle
Locks by Disabling Interrupts
• Key idea: maintain a lock variable and impose mutual exclusion only during operations on that variable
  int value = FREE;

  Acquire() {
    disable interrupts;
    if (value == BUSY) {
      put thread on wait queue;
      Go to sleep();
      // Enable interrupts?
    } else {
      value = BUSY;
    }
    enable interrupts;
  }

  Release() {
    disable interrupts;
    if (anyone on wait queue) {
      take thread off wait queue
      Place on ready queue;
    } else {
      value = FREE;
    }
    enable interrupts;
  }

How to implement locks? Atomic Read-Modify-Write instructions
• Problem with previous solution?
  – Can't let users disable interrupts! (Why?)
  – Doesn't work well on a multiprocessor
    » Disabling interrupts on all processors requires messages and would be very time consuming
• Alternative: atomic instruction sequences
  – These instructions read a value from memory and write a new value atomically
  – Hardware is responsible for implementing this correctly
    » On both uniprocessors (not too hard)
    » And multiprocessors (requires help from the cache coherence protocol)
  – Unlike disabling interrupts, can be used on both uniprocessors and multiprocessors

Examples of Read-Modify-Write
• test&set (&address) {            /* most architectures */
    result = M[address];
    M[address] = 1;
    return result;
  }
• swap (&address, register) {      /* x86 */
    temp = M[address];
    M[address] = register;
    register = temp;
  }
• compare&swap (&address, reg1, reg2) {  /* 68000 */
    if (reg1 == M[address]) {
      M[address] = reg2;
      return success;
    } else {
      return failure;
    }
  }
• load-linked & store-conditional (&address) {  /* R4000, Alpha */
    loop:
      ll   r1, M[address];
      movi r2, 1;     /* Can do arbitrary computation */
      sc   r2, M[address];
      beqz r2, loop;
  }

Implementing Locks with test&set
• Another flawed, but simple solution:
  int value = 0; // Free
  Acquire() {
    while (test&set(value)); // while busy
  }
  Release() {
    value = 0;
  }
• Simple explanation:
  – If the lock is free, test&set reads 0 and sets value=1, so the lock is now busy. It returns 0, so the while exits.
  – If the lock is busy, test&set reads 1 and sets value=1 (no change). It returns 1, so the while loop continues.
  – When we set value = 0, someone else can get the lock.
• Busy-waiting: thread consumes cycles while waiting

Better Locks using test&set
• Can we build test&set locks without busy-waiting?
  – Can't entirely, but can minimize!
  – Idea: only busy-wait to atomically check the lock value
  int guard = 0;
  int value = FREE;

  Acquire() {
    // Short busy-wait time
    while (test&set(guard));
    if (value == BUSY) {
      put thread on wait queue;
      go to sleep() & guard = 0;
    } else {
      value = BUSY;
      guard = 0;
    }
  }

  Release() {
    // Short busy-wait time
    while (test&set(guard));
    if (anyone on wait queue) {
      take thread off wait queue
      Place on ready queue;
    } else {
      value = FREE;
    }
    guard = 0;
  }
• Note: sleep has to be sure to reset the guard variable
  – Why can't we do it just before or just after the sleep?

Higher-level Primitives than Locks
• What is the right abstraction for synchronizing threads that share memory?
  – Want as high-level a primitive as possible
• Good primitives and practices are important!
  – Since execution is not entirely sequential, it is really hard to find bugs, since they happen rarely
  – UNIX is pretty stable now, but up until about the mid-80s (10 years after it started), systems running UNIX would crash every week or so – concurrency bugs
• Synchronization is a way of coordinating multiple concurrent activities that are using shared state
  – This lecture and the next present a couple of ways of structuring the sharing

Recall: Semaphores
• Semaphores are a kind of generalized lock
  – First defined by Dijkstra in the late 60s
  – Main synchronization primitive used in original UNIX
• Definition: a Semaphore has a non-negative integer value and supports the following two operations:
  – P(): an atomic operation that waits for the semaphore to become positive, then decrements it by 1
    » Think of this as the wait() operation
  – V(): an atomic operation that increments the semaphore by 1, waking up a waiting P, if any
    » Think of this as the signal() operation
  – Note that P() stands for "proberen" (to test) and V() stands for "verhogen" (to increment) in Dutch

Semaphores Like Integers Except
• Semaphores are like integers, except
  – No negative values
  – Only operations allowed are P and V – can't read or write the value, except to set it initially
  – Operations must be atomic
    » Two P's together can't decrement the value below zero
    » Similarly, a thread going to sleep in P won't miss a wakeup from V – even if they both happen at the same time
• Semaphore from railway analogy
  – Here is a semaphore initialized to 2 for resource control:
[Figure: railway analogy — the semaphore's value falls from 2 to 0 as two trains enter the controlled section, and rises back to 1 as one leaves]

Two Uses of Semaphores
• Mutual exclusion (initial value = 1)
  – Also called a "Binary Semaphore"
  – Can be used for mutual exclusion:
    semaphore.P();
    // Critical section goes here
    semaphore.V();
• Scheduling constraints (initial value = 0)
  – Locks are fine for mutual exclusion, but what if you want a thread to wait for something?
  – Example: suppose you had to implement ThreadJoin, which must wait for a thread to terminate:
    Initial value of semaphore = 0
    ThreadJoin {
      semaphore.P();
    }
    ThreadFinish {
      semaphore.V();
    }

Monitor with Condition Variables
• Lock: the lock provides mutual exclusion to shared data
  – Always acquire before accessing the shared data structure
  – Always release after finishing with the shared data
  – Lock initially free
• Condition Variable: a queue of threads waiting for something inside a critical section
  – Key idea: make it possible to go to sleep inside a critical section by atomically releasing the lock at the time we go to sleep
  – Contrast to semaphores: can't wait inside a critical section

Simple Monitor Example
• Here is an (infinite) synchronized queue:
  Lock lock;
  Condition dataready;
  Queue queue;

  AddToQueue(item) {
    lock.Acquire();       // Get Lock
    queue.enqueue(item);  // Add item
    dataready.signal();   // Signal any waiters
    lock.Release();       // Release Lock
  }

  RemoveFromQueue() {
    lock.Acquire();            // Get Lock
    while (queue.isEmpty()) {
      dataready.wait(&lock);   // If nothing, sleep
    }
    item = queue.dequeue();    // Get next item
    lock.Release();            // Release Lock
    return(item);
  }

Problem: Busy-Waiting for Lock
• Positives for this solution
  – Machine can receive interrupts
  – User code can use this lock
  – Works on a multiprocessor
• Negatives
  – This is very inefficient because the busy-waiting thread will consume cycles waiting
  – Waiting thread may take cycles away from the thread holding the lock (no one wins!)
  – Priority Inversion: if a busy-waiting thread has higher priority than the thread holding the lock => no progress!
• Priority Inversion problem with original Martian rover
• For semaphores and monitors, a waiting thread may wait for an arbitrary length of time!
  – Thus even if busy-waiting were OK for locks, it is definitely not OK for other primitives
  – Homework/exam solutions should not have busy-waiting!

Busy-wait vs Blocking
• Busy-wait, i.e. spin lock:
  – Keep trying to acquire the lock until ready
  – Very low latency/processor overhead!
  – Very high system overhead!
    » Causes stress on the network while spinning
    » Processor is not doing anything else useful
• Blocking:
  – If can't acquire the lock, deschedule the process (i.e. unload its state)
  – Higher latency/processor overhead (1000s of cycles?)
    » Takes time to unload/restart the task
    » Notification mechanism needed
  – Low system overhead
    » No stress on the network
    » Processor does something useful
• Hybrid:
  – Spin for a while, then block
  – 2-competitive: spin until you have waited the blocking time

Summary
• Important concept: Atomic Operations
  – An operation that runs to completion or not at all
  – These are the primitives on which to construct various synchronization primitives
• Talked about hardware atomicity primitives:
  – Disabling of interrupts, test&set, swap, compare&swap, load-linked/store-conditional
• Showed several constructions of locks
  – Must be very careful not to waste/tie up machine resources
    » Shouldn't disable interrupts for long
    » Shouldn't spin wait for long
  – Key idea: separate lock variable, use hardware mechanisms to protect modifications of that variable
• Talked about semaphores, monitors, and condition variables
  – Higher-level constructs that are harder to "screw up"