Duke Systems

CPS 210
Unix and All That

Jeff Chase
Duke University
http://www.cs.duke.edu/~chase/cps210

Unix: A lasting achievement?

“Perhaps the most important achievement of Unix is to demonstrate that a powerful operating system for interactive use need not be expensive…it can run on hardware costing as little as $40,000.”

The UNIX Time-Sharing System*
D. M. Ritchie and K. Thompson, 1974

[Figure: DEC PDP-11/24. http://histoire.info.online.fr/pdp11.html]

Let’s pause a moment to reflect...

[Figure: processor performance relative to the VAX-11/780 (core rate, SPECint, log scale), 1978 to 2006, with annotated growth rates of 25%/year, 52%/year, and ??%/year. From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006.]

Today Unix runs embedded in devices costing < $100.

Small is beautiful?

[RT74]: historical hardware details
• [Ritchie/Thompson74] is the classic reference on Unix.
• In 1974, the advances we take for granted were in the future.
• They had to prove it on the hardware they had at the time.
• Many specific implementation choices have changed.
  – 14-character file names
  – assembly language C
  – 7 protection bits on files
  – i-numbers and i-list
  – 512-byte blocks
  – ppt is “paper tape”???
  – vowel embargo

Some lessons of history
• At the time it was created, Unix was the “simplest multi-user OS people could imagine.”
  – It’s in the name: Unix vs. Multics
• Simple abstractions can deliver a lot of power.
  – Many people have been inspired by the power of Unix.
• The community spent four decades making Unix complex again....but the essence is unchanged.
• Unix is a simple context to study core issues for classical OS design. “It’s in there.”
• Unix variants continue to be in wide use.
• They serve as a foundation for advances.

Abstraction. Innovation. Simple?
• users
• files
• processes
• pipes
  – which “look like” files

Users and files persist across reboots. They have symbolic names (you choose them) and internal IDs (the system chooses them).

Processes and pipes exist within a running system, and they are transient: they disappear on a crash or reboot. They have internal IDs.

Unix supports dynamic create/destroy of these objects. It manages the various name spaces. It has system calls to access these objects. It checks permissions.

Unix: some key concepts
• Names and namespaces
  – directories and pathnames
  – name tree and subtree grafting (mount)
  – root directory and current directory
  – path prefix list
  – resolution
  – links (aliases) and reference counting
• Access control by tags and labels
  – inheritance of tags and labels
• Context manipulation
  – fork vs. exec

Files: hierarchical name space

[Figure: a name tree with the root directory, applications etc., user home directory, and a mount point for an external media volume or network storage.]

“Everything is a file”

[Figure: the universal set of “files” includes regular files, special files, and directories.]

File I/O

    char buf[BUFSIZE];
    int fd;
    ssize_t n;

    /* Pathnames may be relative to the process current directory. */
    if ((fd = open("../zot", O_TRUNC | O_RDWR)) == -1) {
        perror("open failed");
        exit(1);    /* pass status back to the parent to report failure */
    }
    /* No file offset is passed: the system remembers it. */
    while ((n = read(0, buf, BUFSIZE)) > 0) {
        /* Write only the bytes that read actually returned. */
        if (write(fd, buf, n) != n) {
            perror("write failed");
            exit(1);
        }
    }

• Open files are named within the process by an integer file descriptor.
• Pathnames may be relative to the process current directory.
• The process passes status back to its parent on exit, to report success/failure.
• The process does not specify the current file offset: the system remembers it.
• Standard descriptors (0, 1, 2) are for input, output, and error messages (stdin, stdout, stderr).

“Components in context”

[Figure: a thread executes a program within a context (domain).]

A context defines an isolated sandbox for a running program, so that it can use only the data and resources that the OS grants it.
For our purposes, an operating system is a platform that supports protection and isolation: every component runs within a context. Program, context and thread are OS abstractions.

Running a program

[Figure: a Program and its parts: code, constants, initialized data, imports/exports, symbols, types/interfaces, data.]

“Unix Classic” simplifications
Context == process == (1 VAS + 1 thread + ...)
Each process runs exactly one program/component instance (at a time).
IPC channels are pipes.
All I/O is based on a simple common abstraction: file / stream.

The theater analogy

[Figure: theater analogy, with labels script, context (stage), Threads, Program, Address space.]

Running a program is like performing a play. [lpcox]

Processes and the kernel
Programs run as independent processes.
Each process has a private virtual address space and one thread.
The protected OS kernel mediates access to shared resources.
Threads enter the kernel for OS services, through protected system calls...and upcalls (e.g., signals).
The kernel is a separate component/context with enforced modularity.
The kernel syscall interface supports processes, files, pipes, and signals.

Enforced modularity
An important theme from Monday’s class

[Figure: two components connected by a pipe (or other channel).]

By putting each component instance in a separate context, we can enforce modularity boundaries among components. Each component runs in a sandbox; components can interact only through pipes. Neither can access the internals of the other.

Unix defines uniform, modular ways to combine programs to build up more complex functionality.

[Figure: layered view of Unix: Hardware, then the Kernel, then programs such as sh, nroff, who, cpp, a.out, date, comp, cc, wc, as, ld, grep, vi, ed, and other application programs.]

A key idea: Unix pipes
[http://www.bell-labs.com/history/unix/philosophy.html]

Unix programming environment
Standard unix programs read a byte stream from standard input (stdin, fd==0).
They write their output to standard output (stdout, fd==1).
Stdin or stdout might be bound to a file, pipe, device, or network socket.
If the parent sets it up, the program doesn’t even have to know.
That style makes it easy to combine simple programs using pipes or files.

Unix fork/exec/exit/wait syscalls

[Figure: the parent forks a child; the child initializes its context, calls exec, and eventually exits; the parent waits.]

    int pid = fork();
Create a new process that is a clone of its parent.

    exec*("program" [, argvp, envp]);
Overlay the calling process with a new program, and transfer control to it.

    exit(status);
Exit with status, destroying the process. Note: this is not the only way for a process to exit!

    int pid = wait*(&status);
Wait for exit (or other status change) of a child, and “reap” its exit status. Note: child may have exited before parent calls wait!

Unix: users and their namespaces
• A unix system has a set of user accounts.
  – identities, principals
  – often correspond to real users, but not always
• Each account has a username.
  – a human-readable character string: “chase”
  – also called a symbolic name
• Each account has a userID
  – a number for internal use
• These namespaces are flat.
• The system keeps a bidirectional map:
  – f(username) = userID or

Protection Systems 101

[Figure: a reference monitor at an isolation boundary. Example: Unix kernel. From Principles of Computer System Design, Saltzer & Kaashoek, 2009.]

Labels and access control
Every file and every process is labeled/tagged with a user ID.
A privileged process may set its user ID.
A process inherits its userID from its parent process.
A file inherits its owner userID from its creating process.

[Figure: Alice logs in; login does fork, setuid(“alice”), exec shell; the shell does fork/exec of a tool (uid=“alice”), which does creat(“foo”), write, close, producing file foo (owner=“alice”). Bob logs in; login does fork, setuid(“bob”), exec shell; the shell does fork/exec of a tool (uid=“bob”), which does open(“foo”), read.]

Labels and access control
Every system defines rules for assigning security labels to subjects (e.g., Bob’s process) and objects (e.g., file foo).
Every system defines rules to compare the security labels to authorize attempted accesses.
[Figure: from Alice’s login shell, a tool (uid=“alice”) does creat(“foo”), write, close on file foo (owner=“alice”); from Bob’s login shell, a tool (uid=“bob”) does open(“foo”), read.]

Should processes running with Bob’s userID be permitted to open file foo?

Post-note
• We talked about access policy in vanilla Unix.
• The owner of a Unix file may tag it with additional status specifying access rights for subjects.
  – Access types = {read, write, execute} [3 bits]
  – Subject types = {owner, group, other/anyone} [3 bits]
  – If the file is executed, should the system setuid the process to the userID of the file’s owner? [1 bit]
  – 10 bits total: (3x3)+1. Usually given in octal: e.g., “777” means 9 bits set: anyone can r/w/x the file, but no setuid.
  – It is a very simple form of an access control list (ACL). Later systems like AFS have richer ACLs.
• Unix provides a syscall and shell command for the owner to set the permission bits on each file (inode).
• “Group” was added later and is a little more complicated: a user may belong to multiple groups.

Init and Descendants
Kernel “handcrafts” initial process to run “init” program.
Other processes descend from init, and also run as root, including user login guards.
Login invokes a setuid system call to run user shell in a child process after user authenticates.
Children of user shell inherit the user’s identity (uid).

Processes: A Closer Look

virtual address space
The address space is a private name space for a set of memory segments used by the process. The kernel must initialize the process memory for the program to run.

+ thread (with its stack)
Each process has a thread bound to the VAS. The thread has a stack addressable through the VAS. The kernel can suspend/restart the thread wherever and whenever it wants.

+ process descriptor (PCB): user ID, process ID, parent PID, sibling links, children, resources
The OS maintains some state for each process in the kernel’s internal data structures: a file descriptor table, links to maintain the process tree, and a place to store the exit status.
VAS example (32-bit)
• An addressable array of bytes…
• Containing every instruction the process thread can execute…
• And every piece of data those instructions can read/write…
  – i.e., read/write == load/store
• Partitioned into logical segments with distinct purpose and use.
• Every memory reference by a thread is interpreted in its VAS context.
  – Resolve to a location in machine memory
• A given address in different VAS may resolve to different locations.

[Figure: address space layout from 0x0 up to 0x7fffffff: Text (code), Static data, Dynamic data (heap/BSS), Stack, Reserved.]

64 bytes: 3 ways

[Figure: the same bytes starting at address p, offsets 0x0 to 0x1f, viewed as int p[] / int* p, as char p[] / char *p, and as char* p[] / char** p.]

Pointers (addresses) are 8 bytes on a 64-bit machine.

Alignment

[Figure: the same three views (int p[] / int* p, char p[] / char *p, char* p[] / char** p), with misaligned placements marked X.]

The machine requires that an n-byte value is aligned on an n-byte boundary, where n = 2^i.

Heap allocation
A contiguous chunk of memory obtained from OS kernel. E.g., with Unix sbrk() system call.
A runtime library obtains the block and manages it as a “heap” for use by the programming language environment, to store dynamic objects. E.g., with Unix malloc and free library calls.
Allocated heap blocks for structs or objects. Align!

Alternative: block maps
The storage in a heap block is contiguous in the VAS. C and other PL environments require this. That complicates the heap manager because the heap blocks may be different sizes.
Idea: use a level of indirection through a map to assemble a storage object from “scraps” of storage in different locations.
The “scraps” can be fixed-size slots: that makes allocation easy because they are interchangeable.
Example: page tables that implement a VAS.

[Figure: a map providing indirection from a storage object to its scattered slots.]

Variable Partitioning
Variable partitioning is the strategy of parking differently sized cars along a street with no marked parking space dividers.
[Figure: cars 1, 2, 3 parked with gaps between them. Wasted space: external fragmentation.]

Fixed Partitioning

[Figure: cars parked in fixed-size marked spaces. Wasted space: internal fragmentation.]

“Classic Linux Address Space”
http://duartes.org/gustavo/blog/category/linux

What’s in an Object File or Executable?
• header: “magic number” indicates type of image.
• section table: an array of (offset, len, startVA)
• program sections:
  – text: program instructions (p)
  – data:
      idata: immutable data (constants) (“hello\n”)
      wdata: writable global/static data (j, s)
• symbol table (j, s, p, sbuf) and relocation records: used by linker; may be removed after final link step and strip.

    int j = 327;
    char* s = "hello\n";
    char sbuf[512];

    int p() {
        int k = 0;
        j = write(1, s, 6);
        return(j);
    }

A Peek Inside a Running Program

[Figure: the CPU registers (R0 ... Rn, PC, SP) refer into “memory”, an address space (virtual or physical) running from 0 to high: your program code, common runtime library, your data, heap, and stack.]

Process Creation in Unix

    int pid;
    int status = 0;

    if (pid = fork()) {
        /* parent (a failed fork, pid == -1, also takes this branch) */
        …..
        pid = wait(&status);
    } else {
        /* child */
        …..
        exit(status);
    }

The fork syscall returns twice: it returns a zero to the child and the child process ID (pid) to the parent.
Parent uses wait to sleep until the child exits; wait returns child pid and status.
Wait variants allow wait on a specific child, or notification of stops and other signals.

The Shell
• Users may select from a range of interpreter programs available
  – or even write their own (to add to the confusion)
  – csh, sh, ksh, tcsh, bash: choose your flavor…
• Shells execute commands composed of program filenames, args, and I/O redirection symbols.
  – Shells can run files of commands (scripts) for more complex tasks, e.g., by redirecting shell’s stdin.
  – Shell’s behavior is guided by environment variables.
      E.g., $PATH

Using the shell
• Commands: ls, cat, and all that
• Current directory: cd and pwd
• Arguments: echo
• Signals: ctrl-c
• Job control, foreground, and background: &, ctrl-z, bg, fg
• Environment variables: printenv and setenv
• Most commands are programs: which, $PATH, and /bin
• Shells are commands: sh, csh, ksh, tcsh, bash
• Pipes and redirection: ls | grep a
• Files and I/O: open, read, write, lseek, close
• stdin, stdout, stderr
• Users and groups: whoami, sudo, groups