The Kernel and the Unix Process API Jeff Chase Duke University The story so far: process and kernel • A (classical) OS lets us run programs as processes. A process is a running program instance (with a thread). – Program code runs with the core in untrusted user mode. • Processes are protected/isolated. – Virtual address space is a “fenced pasture” – Sandbox: can’t get out. Lockbox: nobody else can get in. • The OS kernel controls everything. – Kernel code runs with the core in trusted kernel mode. • OS offers system call APIs. – Create processes – Control processes – Monitor process execution Operating Systems: The Classical View Programs run as independent processes. data Each process has a private virtual address space and one or more threads. data Protected system calls ...and upcalls (e.g., signals) Protected OS kernel mediates access to shared resources. The kernel code and data are protected from untrusted code. Threads enter the kernel for OS services. “Limited direct execution” Any kind of machine exception transfers control to a registered (trusted) kernel handler running in a protected CPU mode. syscall trap u-start fault u-return u-start fault u-return kernel “top half” kernel “bottom half” (interrupt handlers) clock interrupt user mode kernel mode interrupt return boot Kernel handler manipulates CPU register context to return to selected user context. Exceptions: trap, fault, interrupt synchronous caused by an instruction asynchronous caused by some other event intentional unintentional happens every time contributing factors trap: system call fault open, close, read, write, fork, exec, exit, wait, kill, etc. invalid or protected address or opcode, page fault, overflow, etc. “software interrupt” software requests an interrupt to be delivered at a later time interrupt caused by an external event: I/O op completed, clock tick, power fail, etc. Example: Syscall traps • Programs in C, C++, etc. invoke system calls by linking to a standard library written in assembly. – The library defines a stub or wrapper routine for each syscall. – Stub executes a special trap instruction (e.g., chmk or callsys or syscall instruction) to change mode to kernel. – Syscall arguments/results are passed in registers (or user stack). – OS defines Application Binary Interface (ABI). read() in Unix libc.a Alpha library (executes in user mode): #define SYSCALL_READ 27 move arg0…argn, a0…an move SYSCALL_READ, v0 callsys move r1, _errno return Alpha CPU ISA (defunct) # op ID for a read system call # syscall args in registers A0..AN # syscall dispatch index in V0 # kernel trap # errno = return status MacOS x86-64 syscall example section .data hello_world db section .text global start start: mov rax, 0x2000004 mov rdi, 1 mov rsi, hello_world mov rdx, 14 syscall mov rax, 0x2000001 mov rdi, 0 syscall "Hello World!", 0x0a Illustration only: this program writes “Hello World!” to standard output. ; System call write = 4 ; Write to standard out = 1 ; The address of hello_world string ; The size to write ; Invoke the kernel ; System call number for exit = 1 ; Exit success = 0 ; Invoke the kernel http://thexploit.com/secdev/mac-os-x-64-bit-assembly-system-calls/ Thread states and transitions We will presume that these transitions occur only in kernel mode. This is true in classical Unix and in systems with pure kernel-based threads. Before a thread can sleep, it must first enter the kernel via trap (syscall) or fault. Before a thread can yield, it must enter the kernel, or the core must take an interrupt to return control to the kernel. running yield preempt sleep blocked STOP wait wakeup dispatch ready In the running state, kernel code decides if/when/how to enter user mode, and sets up a suitable context. Kernel Stacks and Trap/Fault Handling Threads execute user code on a user stack in the user virtual memory in the process virtual address space. Each thread has a second kernel stack in kernel space (VM accessible only in kernel mode). data stack stack stack syscall dispatch table stack System calls and faults run in kernel mode on a kernel stack. Kernel code running in P’s process context has access to P’s virtual memory. The syscall handler makes an indirect call through the system call dispatch table to the handler registered for the specific system call. Let’s get down to it UNIX PROCESS API CACM 1974 Thompson and Ritchie Turing Award ‘83 Unix: A lasting achievement? “Perhaps the most important achievement of Unix is to demonstrate that a powerful operating system for interactive use need not be expensive…it can run on hardware costing as little as $40,000.” DEC PDP-11/24 The UNIX Time-Sharing System* D. M. Ritchie and K. Thompson 1974 http://histoire.info.online.fr/pdp11.html Unix fork/exit syscalls int pid = fork(); Create a new process that is a clone of its parent, return new process ID (pid) to parent, return 0 to child. fork parent child exit(status); Exit with status, destroying the process. Note: this is not the only way for a process to exit! time data exit data p exit pid: 5587 pid: 5588 fork The fork syscall returns twice: int pid; int status = 0; if (pid = fork()) { /* parent */ ….. } else { /* child */ ….. exit(status); } It returns a zero in the context of the new child process. It returns the new child process ID (pid) in the context of the parent. fork • Child can’t be an exact copy – Is distinguished by one variable (the return value of fork) if (fork () == 0) { /* child */ execute new program } else { /* parent */ carry on } exit syscall fork (original concept) fork in action today void dofork() { int rc = fork(); if (rc < 0) { perror("fork failed: "); exit(1); } else if (rc == 0) { child(); } else { parent(rc); } } Fork is conceptually difficult but syntactically clean and simple. I don’t have to say anything about what the new child process “looks like”: it is an exact clone of the parent! The child has a new thread executing at the same point in the same program. The child is a new instance of the running program: it has a “copy” of the entire address space. The “only” change is the process ID and return code rc! The parent thread continues on its way. The child thread continues on its way. A simple program: forkdeep int count = 0; int level = 0; void child() { level++; output pids if (level < count) dofork(); if (level == count) sleep(3); } void parent(int childpid) { output pids wait for child to finish How? } main(int argc, char *argv[]) { count = atoi(argv[1]); dofork(); output pid } chase$ ./forkdeep 4 30866-> 30867 30867 30867-> 30868 30868 30868-> 30869 30869 30869-> 30870 30870 30870 30869 30868 30867 30866 chase$ Wait Note: in modern systems the wait syscall has many variants and options. Unix process API int pid; int status = 0; if (pid = fork()) { /* parent */ ….. pid = wait(&status); } else { /* child */ ….. exit(status); } Parent uses wait to sleep until the child exits; wait returns child pid and status. Wait variants allow wait on a specific child, or notification of stops and other “signals”. But how do I run a new program? But how do I run a new program? • The child, or any process really, can replace its program in midstream. • exec* system call: “forget everything in my address space and reinitialize my entire address space with stuff from a named program file.” • The exec system call never returns: the new program executes in the calling process until it dies (exits). – The code from the parent program runs in the child process and controls its future. The parent program selects the program that the child process will run (via exec), and sets up its connections to the outside world. The child program doesn’t even know! exec (original concept) A simple program: forkexec … main(int argc, char *argv[]) { int status; int rc = fork(); if (rc < 0) { perror("fork failed: "); exit(1); } else if (rc == 0) { argv++; execve(argv[0], argv, 0); } else { waitpid(rc, &status, 0); printf("child %d exited with status %d\n", rc, WEXITSTATUS(status)); } } A simple program: prog0 int main() { } chase$ cc –o forkexec forkexec.c chase$ cc –o prog0 prog0.c chase$ ./forkexec prog0 child 19175 exited with status 0 chase$ Simple I/O: args and printf #include <stdio.h> int main(int argc, char* argv[]) { int i; printf("arguments: %d\n", argc); for (i=0; i<argc; i++) { printf("%d: %s\n", i, argv[i]); } } chase$ cc –o prog1 prog1.c chase$ ./forkexec prog1 arguments: 1 0: prog1 child 19178 exited with status 0 chase$ ./forkexec prog1 one 2 3 arguments: 4 0: prog1 1: one 2: 2 3: 3 child 19181 exited with status 0 Environment variables (rough) #include <stdio.h> #include <stdlib.h> int main(int argc, char* argv[], char* envp[]) { int i; int count = atoi(argv[1]); for (i=0; i < count; i++) { printf("env %d: %s\n", i, envp[i]); } } Environment variables (rough) chase$ cc –o env0 env0.c chase$ ./env0 Segmentation fault: 11 chase$ ./env0 12 env 0: TERM_PROGRAM=Apple_Terminal env 1: TERM=xterm-256color env 2: SHELL=/bin/bash env 3: TMPDIR=/var/folders/td/ng76cpqn4zl1wrs57hldf1vm0000gn/T/ env 4: Apple_PubSub_Socket_Render=/tmp/launch-OtU5Bb/Render env 5: TERM_PROGRAM_VERSION=309 env 6: OLDPWD=/Users/chase/c210-stuff env 7: TERM_SESSION_ID=FFCE3A14-1D4B-4B08… env 8: USER=chase env 9: COMMAND_MODE=unix2003 env 10: SSH_AUTH_SOCK=/tmp/launch-W03wn2/Listeners env 11: __CF_USER_TEXT_ENCODING=0x1F5:0:0 chase$ Environment variables (safer) #include <stdio.h> #include <stdlib.h> int main(int argc, char* argv[], char* envp[]) { int i; int count; if (argc < 2) { fprintf(stderr, "Usage: %s <count>\n", argv[0]); exit(1); } count = atoi(argv[1]); for (i=0; i < count; i++) { if (envp == 0) { printf("env %d: nothing!\n", i); exit(1); } else if (envp[i] == 0) { printf("env %d: null!\n", i); exit(1); } else printf("env %d: %s\n", i, envp[i]); } } Where do environment variables come from? chase$ cc –o env env.c chase$ ./env chase$ ./forkexec env Usage: env <count> child 19195 exited with status 1 chase$ ./forkexec env 1 env 0: null! child 19263 exited with status 1 chase$ forkexec revisited char *lala = "lalala\n"; char *nothing = 0; … main(int argc, char *argv[]) { int status; int rc = fork(); if (rc < 0) { … } else if (rc == 0) { argv++; execve(argv[0], argv, &lala); } else { … } chase$ cc –o fel forkexec-lala.c chase$ ./fel env 1 env 0: lalala child 19276 exited with status 0 chase$ forkexec revisited again … main(int argc, char *argv[], char *envp[]) { int status; int rc = fork(); if (rc < 0) { … } else if (rc == 0) { argv++; execve(argv[0], argv, envp); } else { … } chase$ cc –o fe forkexec1.c chase$ ./fe env 3 env 0: TERM_PROGRAM=Apple_Terminal env 1: TERM=xterm-256color env 2: SHELL=/bin/bash child 19290 exited with status 0 chase$ How about this? chase$ ./fe fe fe fe fe fe fe fe fe fe fe fe env 3 <???> How about this? chase$ ./fe fe fe fe fe fe fe fe fe fe fe fe env 3 env 0: TERM_PROGRAM=Apple_Terminal env 1: TERM=xterm-256color env 2: SHELL=/bin/bash It is easy for children to inherit child 19303 exited with status 0 environment variables from their child 19302 exited with status 0 parents. child 19301 exited with status 0 child 19300 exited with status 0 child 19299 exited with status 0 child 19298 exited with status 0 child 19297 exited with status 0 child 19296 exited with status 0 child 19295 exited with status 0 child 19294 exited with status 0 child 19293 exited with status 0 child 19292 exited with status 0 chase$ Exec* enables the parent to control the environment variables and arguments passed to the children. The child process passes the environment variables “to itself” but the parent program controls it. Unix fork/exec/exit/wait syscalls fork child fork parent parent program initializes child context exec int pid = fork(); Create a new process that is a clone of its parent. exec*(“program” [argvp, envp]); Overlay the calling process with a new program, and transfer control to it, passing arguments and environment. exit(status); Exit with status, destroying the process. wait exit int pid = wait*(&status); Wait for exit (or other status change) of a child, and “reap” its exit status. Recommended: use waitpid(). But how is the first process made? Init and Descendents Kernel “handcrafts” initial process to run “init” program. Other processes descend from init, including login program. Login runs user shell in a child process after user authenticates. User shell runs user commands as child processes. The Shell • Users may select from a range of command interpreter (“shell”) programs available. (Or even write their own!) – csh, sh, ksh, tcsh, bash: choose your flavor… • Shells execute commands composed of program filenames, args, and I/O redirection symbols. – Fork, exec, wait, etc., etc. – Can coordinate multiple child processes that run together as a process group or job. – Shells can run files of commands (scripts) for more complex tasks, e.g., by redirecting I/O channels (descriptors). – Shell behavior is guided by environment variables, e.g., $PATH – Parent may control/monitor all aspects of child execution. Unix process: parents rule Created with fork by parent program running in parent process. Parent program running in child process, or exec’d program chosen by parent program. Inherited from parent process, or modified by parent program in child (see next lecture) Virtual address space (Virtual Memory, VM) Process text data heap Thread Program kernel state Clone of parent VM. Environment (argv[] and envp[]) is configured by parent program on exec. Unix process view: data A process has multiple channels for data movement in and out of the process (I/O). I/O channels (“file descriptors”) stdin stdout tty Process stderr The parent process and parent program set up and control the channels for a child (until exec). pipe Thread socket Files Program Standard I/O descriptors I/O channels (“file descriptors”) stdin stdout tty Open files or other I/O channels are named within the process by an integer file descriptor value. stderr Standard descriptors for primary input (stdin=0), primary output (stdout=1), error/status (stderr=2). These are inherited from the parent process and/or set by the parent program. By default they are bound to the controlling terminal. count = read(0, buf, count); if (count == -1) { perror(“read failed”); /* writes to stderr */ exit(1); } count = write(1, buf, count); if (count == -1) { perror(“write failed”); /* writes to stderr */ Shell: foreground command Shell waits in read call for next command from tty, forks child, and exec*s the named command program in child. dsh “echo y” fork wait If a child is running in the foreground, the parent shell blocks in a wait* system call until the child changes state (e.g., exits). exec argv[0]=“echo” argv[1]=“y” argc = 2 Child process continues on to execute “echo” program independently of parent. Unix: signals • A signal is a typed upcall event delivered to a process by kernel. – Process P may use kill* system call to request signal delivery to Q. – Process may register signal handlers for signal types. A signal handler is a procedure that is invoked when a signal of a given type is delivered. Runs in user mode, then returns to normal control flow. – A signal delivered to a registered handler is said to be caught. If there is no registered handler then the signal triggers a default action (e.g., “die”). • A process lives until it exits voluntarily or receives a fatal signal. data Protected system calls data ...and upcalls (signals) Kernel sends signals, e.g., to notify processes of faults. Unix signals • Unix signals are “musty”. We don’t teach/test the details. – Sometimes you can’t ignore them! – (See next lecture) • A signal is an “upcall” event delivered to a process. – Sent by kernel, or by another process via kill* syscall. – An arriving signal transfers control to a handler routine in receiving process, designated to handle that type of signal. – The handler routine returns to normal control flow when done. • Or: if no designated handler, take a standard action. – e.g., DEFAULT, IGNORE, STOP, CONTINUE – Default action is often for receiving process to exit. – An exit from an unhandled signal is reported to parent via wait*. Signals and dsh • Your shell must set up itself and its children properly for “stuff to work like it should”. – We have provided you code to initialize shell, spawn job. – You do not need to understand that code – You should not need to mess with that code. – Please don’t mess with that code. – If you are messing with that code, please talk to us. – What that code does: transfer control of terminal to foreground child job, using process groups and signals, prepare for proper handling of various events that might occur. Process status change: STOP dsh “echo y” A wait returns when the status of the child changes (e.g., STOP). The awakened shell blocks in read to accept the next command. What if a foreground job exits before the parent calls wait? fork exec wait “<ctrl-z>” SIGSTOP STOP “fg” kill SIGCONT wait exit EXIT Child blocks when it receives a STOP signal (e.g., resulting from ctrl-z on keyboard). Child wakes up when it receives a CONTINUE signal (e.g., sent by shell on fg builtin command). I hope we get to here EXTRA SLIDES Mode Changes for Fork/Exit • Syscall traps and “returns” are not always paired. • Fork “returns” (to child) from a trap that “never happened” • Exit system call trap never returns • System may switch processes between trap and return parent Fork call Fork return Wait call Wait return Exec enters the child by doctoring up a saved user context to “return” through. child Fork entry to user space Exit call transition from user to kernel mode (callsys) transition from kernel to user mode (retsys) Using the shell • Commands: ls, cat, and all that • Current directory: cd and pwd • Arguments: echo • Signals: ctrl-c • Job control, foreground, and background: &, ctrl-z, bg, fg • Environment variables: printenv and setenv • Most commands are programs: which, $PATH, and /bin • Shells are commands: sh, csh, ksh, tcsh, bash • Pipes and redirection: ls | grep a • Files and I/O: open, read, write, lseek, close • stdin, stdout, stderr • Users and groups: whoami, sudo, groups The Kernel • Today, all “real” operating systems have protected kernels. The kernel resides in a well-known file: the “machine” automatically loads it into memory (boots) on power-on/reset. Our “kernel” is called the executive in some systems (e.g., Windows). • The kernel is (mostly) a library of service procedures shared by all user programs, but the kernel is protected: User code cannot access internal kernel data structures directly. User code can invoke the kernel only at well-defined entry points (system calls). • Kernel code is “just like” user code, but the kernel is privileged: The kernel has direct access to all hardware functions, and defines the handler entry points for interrupts and exceptions. Protecting Entry to the Kernel Protected events and kernel mode are the architectural foundations of kernel-based OS (Unix, Windows, etc). – The machine defines a small set of exceptional event types. – The machine defines what conditions raise each event. – The kernel installs handlers for each event at boot time. e.g., a table in kernel memory read by the machine The machine transitions to kernel mode only on an exceptional event. The kernel defines the event handlers. Therefore the kernel chooses what code will execute in kernel mode, and when. user trap/return interrupt or fault kernel interrupt or fault The Role of Events • A CPU event (an interrupt or exception, i.e., a trap or fault) is an “unnatural” change in control flow. • Like a procedure call, an event changes the PC. • Also changes mode or context (current stack), or both. • Events do not change the current space! • On boot, the kernel defines a handler routine for each event type. • The machine defines the event types. • Event handlers execute in kernel mode. control flow exception.cc • Every kernel entry results from an event. • Enter at the handler for the event. In some sense, the whole kernel is a “big event handler.” event handler (e.g., ISR: Interrupt Service Routine) Examples • Illegal operation – Reserved opcode, divide-by-zero, illegal access – That’s a fault! Kernel generates a signal, e.g., to kill process or invoke PL exception handlers. • Page fault – Fetch and install page, maybe block process – Nothing illegal about it: “transparent” to faulting process • I/O completion, arriving input, clock ticks. – These external events are interrupts. – Include power fail etc. – Kernel services interrupt in handler. – May wakeup blocked processes, but no blocking. Faults • Faults are similar to system calls in some respects: – Faults occur as a result of a process executing an instruction. • Fault handlers execute on the process kernel stack; the fault handler may block (sleep) in the kernel. – The completed fault handler may return to the faulted context. • But faults are different from syscall traps in other respects: – Syscalls are deliberate, but faults are “accidents”. • divide-by-zero, dereference invalid pointer, memory page fault – Not every execution of the faulting instruction results in a fault. • may depend on memory state or register contents Note • The following material on resource management was not discussed in class. • I’m leaving it in for completeness. We’ll come back to it later in the semester. • But this basic stuff should be easily understandable, given the reading and what we have covered in class. • So: please understand it. Resource management • This is a fast first course in computer systems. We will skimp a little on classical resource management. A couple of points to remember: • In general, resource management is a quantity allocation problem: e.g., how many CPU cycles and how many pages of memory to give to a process? • Resources are allocated to some resource group or unit of accounting: per process or per user account, or some other accounting domain. • Time matters: domains consume an allocation over time. The allocation at any point in time may vary. The total consumption is determined by the magnitude and its integral over time: it’s like power vs. energy. • The rate of resource flow over time matters for performance, especially response time for apps that are interactive (like web services). • A key desired property: performance isolation. The mechanisms must keep each process/account/domain within its allocation: attempts to overuse resources by one user/process should not affect others. You already know how this works. The policies choose how much to allocate and when. The kernel syscall trap/return fault/return system call layer: files, processes, IPC, thread syscalls fault entry: VM page faults, signals, etc. thread/CPU/core management: sleep and ready queues memory management: block/page cache, VM maps sleep queue I/O completions ready queue interrupt/return timer ticks The kernel syscall trap/return fault/return system call layer: files, processes, IPC, thread syscalls fault entry: VM page faults, signals, etc. thread/CPU/core management: sleep and ready queues memory management: block/page cache policy sleep queue I/O completions ready queue interrupt/return policy timer ticks Separation of policy and mechanism • Every OS platform has mechanisms that enable it to mediate access to machine resources. E.g: – Gain control of core by timer interrupts – Fault on access to non-resident virtual memory – I/O through system call traps – Internal protected (kernel) code and data structures to track resource usage and allocate resources • But the mechanisms do not and must/should not determine the policy. • We might want to change the policy! Time sharing vs. space sharing space Two common modes of resource allocation. What kinds of resources do these work for? time Example: Processor Allocation The key issue is: how should an OS allocate its CPU resources among contending demands? – We are concerned with resource allocation policy: how the OS uses underlying mechanisms to meet design goals. – Focus on OS kernel : user code can decide how to use the processor time it is given. – Which thread to run on a free core? – For how long? When to take the core back and give it to some other thread? (timeslice or quantum) – What are the policy goals? CPU Scheduling 101 The OS scheduler makes a sequence of “moves”. – Next move: if a CPU core is idle, pick a ready thread t from the ready pool and dispatch it (run it). – Scheduler’s choice is “nondeterministic” – Scheduler’s choice determines interleaving of execution blocked threads Wakeup ready pool If timer expires, or wait/yield/terminate GetNextToRun SWITCH() Goals of policy • Share resources fairly. • Use machine resources efficiently. • Be responsive to user interaction. But what do these things mean? How do we know if a policy is good or not? What are the metrics? What do we assume about the workload?