Duke Systems Servers and Threads, Continued Jeff Chase Duke University Processes and threads Each process has a virtual address space (VAS): a private name space for the virtual memory it uses. The VAS is both a “sandbox” and a “lockbox”: it limits what the process can see/do, and protects its data from others. Each process has a main thread bound to the VAS, with stacks (user and kernel); other threads are optional. From now on, we suppose that a process could have additional threads. If we say a process does something, we really mean its thread does it. We are not concerned with how to implement them, but we presume that they can all make system calls and block independently. The kernel can suspend/restart a thread wherever and whenever it wants. Inside your Web server Server application (Apache, Tomcat/Java, etc.) with accept queue, packet queues, listen queue, and disk queue. Server operations: create socket(s); bind to port number(s); listen to advertise the port; wait for a client to arrive on a port (select/poll/epoll of ports); accept the client connection; read or recv the request; write or send the response; close the client socket. Web server: handling a request Accept Client Connection (may block waiting on network) Read HTTP Request Header Find File (may block waiting on disk I/O) Send HTTP Response Header Read File Send Data We want to be able to process requests concurrently. Multi-programmed server: idealized A magic elastic worker pool resizes to match the incoming request load: create/destroy workers as needed. Idle workers wait in the pool for the next request dispatch. Workers could be processes or threads. Worker loop: take a request from the incoming request queue, handle it, blocking as necessary; when the request is complete, return to the worker pool. 
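The sequence of server operations listed above (create, bind, listen, accept, recv, send, close) can be sketched in a few lines of Python. This is a minimal single-request illustration, not a real server: the loopback client thread exists only to exercise the server once, and the port is chosen by the OS.

```python
import socket
import threading

def client(port, reply):
    # Illustrative client used only to drive the server below.
    c = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    c.connect(("127.0.0.1", port))
    c.sendall(b"GET / HTTP/1.0\r\n\r\n")   # send a request
    reply.append(c.recv(4096))             # read the response
    c.close()

# Server operations, in the order listed above:
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)  # create socket
srv.bind(("127.0.0.1", 0))    # bind; port 0 lets the OS pick a free port
srv.listen(5)                 # listen to advertise the port
port = srv.getsockname()[1]

reply = []
t = threading.Thread(target=client, args=(port, reply))
t.start()

conn, addr = srv.accept()     # wait for a client to arrive (blocks)
request = conn.recv(4096)     # read/recv the request
conn.sendall(b"HTTP/1.0 200 OK\r\n\r\n")  # write/send the response
conn.close()                  # close the client socket
t.join()
srv.close()
```

Note that accept() and recv() both block; this is exactly why a server that must handle many clients needs some form of concurrency, which the following slides develop.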
Multi-process server architecture Process 1 Accept Conn Read Request Find File Send Header Read File Send Data … separate address spaces Process N Accept Conn Read Request Find File Send Header Read File Send Data Multi-process server architecture • Each of P processes can execute one request at a time, concurrently with other processes. • If a process blocks, the other processes may still make progress on other requests. • Max # requests in service concurrently == P • The processes may loop and handle multiple requests serially, or can fork a process per request. – Tradeoffs? • Examples: – inetd “internet daemon” for standard /etc/services – Design pattern for (Web) servers: “prefork” a fixed number of worker processes. Example: inetd • Classic Unix systems run an inetd “internet daemon”. • Inetd receives requests for standard services. – Standard services and ports listed in /etc/services. – inetd listens on the ports and accepts connections. • For each connection, inetd forks a child process. • Child execs the service configured for the port. • Child executes the request, then exits. [Apache Modeling Project: http://www.fmc-modeling.org/projects/apache] Children of init: inetd New child processes are created to run network services. They may be created on demand on connect attempts from the network for designated service ports. Should they run as root? Prefork In the Apache MPM “prefork” option, only one child polls or accepts at a time: the child at the head of a queue. Avoid “thundering herd”. [Apache Modeling Project: http://www.fmc-modeling.org/projects/apache] Details, details “Scoreboard” keeps track of child/worker activity, so parent can manage an elastic worker pool. 
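The inetd pattern above (fork a child per connection; the child serves one request and then exits) can be sketched as follows. This is a minimal, Unix-only illustration: a pipe and an in-process handle() function stand in for the real socket and the exec'd service.

```python
import os

def handle(request):
    # Stand-in for the service the child would exec (illustrative only).
    return request.upper().encode()

requests = ["get /a", "get /b"]
results = []
for req in requests:
    r, w = os.pipe()              # channel for the child's result
    pid = os.fork()               # one child per "connection", inetd-style
    if pid == 0:
        os.close(r)
        os.write(w, handle(req))  # child serves its one request...
        os._exit(0)               # ...then exits
    os.close(w)
    results.append(os.read(r, 4096).decode())
    os.waitpid(pid, 0)            # parent reaps the child
    os.close(r)
```

The per-request fork/exec/exit cost is the main tradeoff this design accepts; prefork amortizes it by creating the worker processes once, up front.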
Multi-threaded server architecture Thread 1 … Thread N: each thread runs Accept Conn, Read Request, Find File, Send Header, Read File, Send Data. This structure might have lower cost than the multi-process architecture if threads are “cheaper” than processes. Server structure, recap • The server structure discussion motivates threads, and illustrates the need for concurrency management. – We return later to performance impacts and effective I/O overlap. • A continuing theme of the class presentation: Unix systems fall short of the idealized model. – Thundering herd problem when multiple workers wake up and contend for an arriving request: one worker wins and consumes the request, the others go back to sleep – their work was wasted. Recent fix in Linux. – Separation of poll/select and accept in the Unix syscall interface: multiple workers wake up when a socket has new data, but only one can accept the request: thundering herd again; it requires an API change to fix. – There is no easy way to manage an elastic worker pool. • Real servers (e.g., Apache/MPM) incorporate lots of complexity to overcome these problems. We skip this topic. Threads • We now enter the topic of threads and concurrency control. – This will be a focus for several lectures. – We start by introducing more detail on thread management, and the problem of nondeterminism in concurrent execution schedules. • The server structure discussion motivates threads, but there are other motivations. – Harnessing parallel computing power in the multicore era – Managing concurrent I/O streams – Organizing/structuring processing for user interfaces (UI) – Threading and concurrency management are fundamental to OS kernel implementation: processes/threads execute concurrently in the kernel address space for system calls and fault handling. The kernel is a multithreaded program. • So let’s get to it…. 
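The multi-threaded worker-pool structure can be sketched with a fixed pool of threads sharing a dispatch queue. This is a minimal illustration, assuming threads as workers; uppercasing a string stands in for real request handling, and None is used as a shutdown sentinel.

```python
import queue
import threading

def worker(requests, results):
    # Worker loop: wait for the next request, handle it, repeat.
    while True:
        req = requests.get()
        if req is None:            # sentinel: no more work, exit the loop
            break
        results.put(req.upper())   # "handle" the request

requests, results = queue.Queue(), queue.Queue()
pool = [threading.Thread(target=worker, args=(requests, results))
        for _ in range(4)]         # fixed pool of 4 worker threads
for t in pool:
    t.start()
for r in ["a", "b", "c", "d", "e"]:
    requests.put(r)                # dispatch incoming requests
for _ in pool:
    requests.put(None)             # one sentinel per worker
for t in pool:
    t.join()
handled = sorted(results.queue)    # completion order is nondeterministic
```

The queue is the synchronization point: idle workers block in requests.get(), which is exactly the "workers wait here for next request dispatch" behavior of the idealized pool.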
The theater analogy script context (stage) Threads Program Address space Running a program is like performing a play. [lpcox] A Thread “fencepost” Thread* t name/status etc machine state 0xdeadbeef unused low Stack stack top high thread object or thread control block (TCB) int stack[StackSize] ucontext_t Example: pthreads pthread_t threads[N]; int rc; int t = …; rc = pthread_create(&threads[t], NULL, PrintHello, (void *)t); if (rc) error…. void *PrintHello(void *threadid) { long tid; tid = (long)threadid; printf("Hello World! It's me, thread #%ld!\n", tid); pthread_exit(NULL); } [http://computing.llnl.gov/tutorials/pthreads/] Example: Java Threads (1) class PrimeThread extends Thread { long minPrime; PrimeThread(long minPrime) { this.minPrime = minPrime; } public void run() { // compute primes larger than minPrime ... } } PrimeThread p = new PrimeThread(143); p.start(); [http://download.oracle.com/javase/6/docs/api/java/lang/Thread.html] Example: Java Threads (2) class PrimeRun implements Runnable { long minPrime; PrimeRun(long minPrime) { this.minPrime = minPrime; } public void run() { // compute primes larger than minPrime ... } } PrimeRun p = new PrimeRun(143); new Thread(p).start(); [http://download.oracle.com/javase/6/docs/api/java/lang/Thread.html] Thread states and transitions exit running wakeup wait, STOP, read, write, listen, receive, etc. STOP wait EXIT yield The kernel process/thread scheduler governs these transitions. sleep blocked exited ready Sleep and wakeup are internal primitives. Wakeup adds a thread to the scheduler’s ready pool: a set of threads in the ready state. Two threads sharing a CPU concept reality context switch CPU Scheduling 101 The OS scheduler makes a sequence of “moves”. – Next move: if a CPU core is idle, pick a ready thread t from the ready pool and dispatch it (run it). 
– Scheduler’s choice is “nondeterministic” – Scheduler’s choice determines the interleaving of execution. Blocked threads wake up into the ready pool; if the timer expires, or on wait/yield/terminate: GetNextToRun() and Switch(). A Rough Idea

Yield() {
  disable;
  next = FindNextToRun();
  ReadyToRun(this);
  Switch(this, next);
  enable;
}

Sleep() {
  disable;
  this->status = BLOCKED;
  next = FindNextToRun();
  Switch(this, next);
  enable;
}

Issues to resolve: What if there are no ready threads? How does a thread terminate? How does the first thread start?

/*
 * Save context of the calling thread (old), restore registers of
 * the next thread to run (new), and return in context of new.
 */
switch/MIPS (old, new) {
  old->stackTop = SP;
  save RA in old->MachineState[PC];
  save callee registers in old->MachineState;

  restore callee registers from new->MachineState;
  RA = new->MachineState[PC];
  SP = new->stackTop;
  return (to RA)
}

This example (from the old MIPS ISA) illustrates how context switch saves/restores the user register context for a thread, efficiently and without assigning a value directly into the PC. Example: Switch() Save the current stack pointer and the caller’s return address in the old thread object. Caller-saved registers (if needed) are already saved on the old thread’s stack, and are restored automatically on return. RA is the return address register: it contains the address that a procedure return instruction branches to. Switch off of the old stack and over to the new stack, and return to the procedure that called Switch in the new thread. What to know about context switch • The Switch/MIPS example is an illustration for those of you who are interested. It is not required to study it. 
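The Yield()/Switch() discipline can be simulated with Python generators: each yield saves the thread's position and hands control back to a tiny round-robin scheduler, much as Switch() returns onto another thread's stack. This is a sketch of the idea only, not a real context switch; names like thread_body and the trace list are illustrative.

```python
import collections

def thread_body(name, n, trace):
    # Each yield plays the role of Switch(): it saves where this
    # "thread" is and gives the CPU back to the scheduler (Yield()).
    for i in range(n):
        trace.append((name, i))
        yield

trace = []
ready = collections.deque()            # the ready pool
ready.append(thread_body("A", 2, trace))
ready.append(thread_body("B", 2, trace))
while ready:                           # scheduler loop
    t = ready.popleft()                # GetNextToRun()
    try:
        next(t)                        # dispatch: run until it switches out
        ready.append(t)                # back to the ready pool
    except StopIteration:
        pass                           # the thread exited
```

Running this interleaves A and B round-robin: the trace records (A,0), (B,0), (A,1), (B,1), and the loop drains once both generators exit, which mirrors the "what if there are no ready threads / how does a thread terminate" questions above.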
But you should understand how a thread system would use it (refer to the state transition diagram): • Switch() is a procedure that returns immediately, but it returns onto the stack of the new thread, and not in the old thread that called it. • Switch() is called from internal routines to sleep or yield (or exit). • Therefore, every thread in the blocked or ready state has a frame for Switch() on top of its stack: it was the last frame pushed on the stack before the thread switched out. (We need per-thread stacks to block.) • The thread create primitive seeds a Switch() frame manually on the stack of the new thread, since it is too young to have switched before. • When a thread switches into the running state, it always returns immediately from Switch() back to the internal sleep or yield routine, and from there back on its way to wherever it goes next. Creating a new thread Also called “forking” a thread. Idea: create the initial state and put it on the ready queue. 1. Allocate and initialize a new TCB. 2. Allocate a new stack. 3. Make it look like the thread was about to call a function: PC points to the first instruction in the function, SP points to the new stack, and the stack contains the arguments passed to the function. 4. Add the thread to the ready queue. [Diagram: an address space with code, per-thread stacks, and thread control blocks (each TCB holding PC, SP, and registers); thread 1 is running with its state loaded on the CPU while TCB2 and TCB3 wait on the ready queue.] Kernel threads (“native”): each thread (PC, SP, …) is known to and scheduled by the in-kernel scheduler. User-level threads (“green”): threads are multiplexed by a user-mode scheduler on top of the kernel. Andrew Birrell Bob Taylor Concurrency: An Example int counters[N]; int total; /* * Increment a counter by a specified value, and keep a running sum. 
*/
void TouchCount(int tid, int value) {
  counters[tid] += value;
  total += value;
}

Reading Between the Lines of C
/* counters and total are global data; tid and value are local data. */
counters[tid] += value;
total += value;

load counters, R1  ; load counters base
load 8(SP), R2     ; load tid index
shl R2, #2, R2     ; index = index * sizeof(int)
add R1, R2, R1     ; compute index to array
load (R1), R2      ; load counters[tid]
load 4(SP), R3     ; load value
add R2, R3, R2     ; counters[tid] += value
store R2, (R1)     ; store back to counters[tid]
load total, R2     ; load total
add R2, R3, R2     ; total += value
store R2, total    ; store total

Reading Between the Lines of C Two executions of this code, so two values are added to total:

load total, R2     ; load total
add R2, R3, R2     ; total += value
store R2, total    ; store total
load total, R2
add R2, R3, R2
store R2, total

Interleaving matters In this schedule, only one value is added to total: last writer wins. The scheduler made a legal move that broke this program:

A: load total, R2
B: load total, R2
A: add R2, R3, R2
A: store R2, total
B: add R2, R3, R2
B: store R2, total

Non-determinism and ordering Thread A Thread B Thread C Global ordering Why do we care about the global ordering? There might be dependencies between events, and different orderings can produce different results. Why is this ordering unpredictable? We can’t predict how fast processors will run. Non-determinism example y=10; Thread A: x = y+1; Thread B: y = y*2; Possible results? A goes first: x = 11 and y = 20. B goes first: y = 20 and x = 21. What is shared between threads? Variable y. Another example Two threads (A and B): A tries to increment i, B tries to decrement i. Thread A: i = 0; while (i < 10){ i++; } print “A done.” Thread B: i = 0; while (i > -10){ i--; } print “B done.” Example continued Who wins? Does someone have to win? 
Thread A: i = 0; while (i < 10){ i++; } print “A done.” Thread B: i = 0; while (i > -10){ i--; } print “B done.” Debugging non-determinism Requires worst-case reasoning Eliminate all ways for the program to break Debugging is hard Can’t test all possible interleavings Bugs may only happen sometimes Heisenbug Re-running the program may make the bug disappear Doesn’t mean it isn’t still there!
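The lost-update schedules above can be replayed deterministically. The sketch below is a tiny interpreter, assuming value = 1, that executes the load/add/store steps of "total += value" for two threads in a chosen order: the serial schedule adds both values, while the interleaved one loses an update (last writer wins).

```python
def run(schedule):
    # Interpret load/add/store steps against a shared "total";
    # regs[tid] models each thread's private register R2.
    total = 0
    regs = {}
    for tid, op in schedule:
        if op == "load":
            regs[tid] = total    # load total, R2
        elif op == "add":
            regs[tid] += 1       # add R2, R3, R2  (value = 1)
        elif op == "store":
            total = regs[tid]    # store R2, total
    return total

serial = [("A", "load"), ("A", "add"), ("A", "store"),
          ("B", "load"), ("B", "add"), ("B", "store")]
racy   = [("A", "load"), ("B", "load"), ("A", "add"),
          ("A", "store"), ("B", "add"), ("B", "store")]
```

run(serial) yields 2, but run(racy) yields 1: B's store overwrites A's, exactly the "last writer wins" outcome the slides describe, and why such bugs only appear under some schedules.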