Duke Systems
CPS 210: Unix and Beyond
Jeff Chase, Duke University
http://www.cs.duke.edu/~chase/cps210

“Just make it”
• To get started on the heap manager, download the files and type “make”.
– The download provides a script to build the heap manager test programs on Linux or MacOS.
• This lab is just a taste of system programming in C.
• The classic text is CS:APP (http://csapp.cs.cmu.edu).
• Also see the PDF “What every computer systems student should know about computers” on the course website.
• You may think of it as notes from CS:APP. It covers background from Computer Architecture and also some material for this class.

64 bytes: 3 ways
[Figure: a region of memory starting at address p (offsets 0x0 through 0x1f shown), viewed three ways: as int p[] (int* p), as char p[] (char *p), and as char* p[] (char** p).]
Pointers (addresses) are 8 bytes on a 64-bit machine.

Alignment
[Figure: the same three views of p, with misaligned values marked X.]
The machine requires that an n-byte value be aligned on an n-byte boundary, where n = 2^i.

Heap allocation
The heap is a contiguous chunk of memory obtained from the OS kernel, e.g., with the Unix sbrk() system call.
A runtime library obtains the block and manages it as a “heap” for use by the programming language environment, to store dynamic objects, e.g., with the Unix malloc and free library calls.
Allocated heap blocks hold structs or objects. Align!

Variable Partitioning
Variable partitioning is the strategy of parking differently sized cars along a street with no marked parking space dividers.
[Figure: cars 1, 2, 3 parked along a street. The wasted space between them is external fragmentation.]

Alternative: block maps
The storage in a heap block is contiguous in the virtual address space (VAS). C and other PL environments require this. That complicates the heap manager because the heap blocks may be different sizes.
Idea: use a level of indirection through a map to assemble a storage object from “scraps” of storage in different locations.
The “scraps” can be fixed-size slots: that makes allocation easy because they are interchangeable.
Example: page tables that implement a VAS.
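The block-map idea above can be sketched in a few lines of C. This is only an illustration under assumed sizes: SLOT_SIZE, NUM_SLOTS, MAX_BLOCKS and the names pool, struct object, object_init, and object_addr are hypothetical and are not part of the lab code. Each object is a small map from logical block number to a fixed-size slot, so an object's storage need not be contiguous in the pool.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SLOT_SIZE  64   /* bytes per fixed-size slot (assumed value) */
#define NUM_SLOTS  16   /* slots in the backing pool (assumed value) */
#define MAX_BLOCKS 8    /* logical blocks per object (assumed value) */
#define NO_SLOT    (-1)

static char pool[NUM_SLOTS][SLOT_SIZE];  /* the "scraps" of storage */
static int slot_used[NUM_SLOTS];         /* 1 if the slot is allocated */

/* A storage object is a map from logical block number to slot number. */
struct object {
    int map[MAX_BLOCKS];
};

void object_init(struct object *o) {
    assert(o != NULL);
    for (int i = 0; i < MAX_BLOCKS; i++)
        o->map[i] = NO_SLOT;
}

/* Any free slot will do, because slots are interchangeable. */
static int alloc_slot(void) {
    for (int s = 0; s < NUM_SLOTS; s++) {
        if (!slot_used[s]) {
            slot_used[s] = 1;
            memset(pool[s], 0, SLOT_SIZE);
            return s;
        }
    }
    return NO_SLOT;
}

/* Translate a logical byte offset within the object to a storage
 * location, allocating a slot the first time a block is touched.
 * Returns NULL if the offset is out of range or the pool is full. */
char *object_addr(struct object *o, size_t offset) {
    size_t block = offset / SLOT_SIZE;
    if (block >= MAX_BLOCKS)
        return NULL;
    if (o->map[block] == NO_SLOT) {
        o->map[block] = alloc_slot();
        if (o->map[block] == NO_SLOT)
            return NULL;
    }
    return &pool[o->map[block]][offset % SLOT_SIZE];
}
```

If two objects are touched alternately, their logical blocks end up interleaved in the pool, yet each object still looks contiguous to code that addresses it through object_addr. A page table plays the same role for a VAS, with hardware doing the translation.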
Fixed Partitioning
[Figure labels: Indirection; wasted space = internal fragmentation.]

Post-note
• We spent much of the class talking about some general issues for naming, illustrated in Unix.
• Block maps and other indexed maps are a common structure for implementing “machine” name spaces:
– sequences of logical blocks, e.g., virtual address spaces, files
– process IDs, etc.
– For sparse block spaces we may use a tree hierarchy of block maps (e.g., inode maps or 2-level page tables, later).
– Storage system software is full of these maps.
• Symbolic name spaces use different kinds of maps.
– They are sparse and require matching, which is more expensive.
– Trees of maps create nested namespaces, e.g., the file tree.

Files: hierarchical name space
[Figure: a file tree with labels: root directory; applications etc.; mount point; user home directory; external media volume or network storage.]

File I/O
    /* Fragment: needs <fcntl.h>, <stdio.h>, <stdlib.h>, <unistd.h>;
       BUFSIZE is defined elsewhere. */
    char buf[BUFSIZE];
    int fd;
    ssize_t n;

    /* Parenthesize the assignment so fd gets the descriptor, not the
       result of the comparison. */
    if ((fd = open("../zot", O_TRUNC | O_RDWR)) == -1) {
        perror("open failed");
        exit(1);
    }
    /* Copy standard input to the file, writing only the bytes read. */
    while ((n = read(0, buf, BUFSIZE)) > 0) {
        if (write(fd, buf, n) != n) {
            perror("write failed");
            exit(1);
        }
    }
    if (n == -1) {
        perror("read failed");
        exit(1);
    }
Notes on the code:
• Pathnames are translated through the directory tree, starting at the root directory or current directory.
• Every system call should check for errors and handle them appropriately.
• The file grows as the process writes to it: the system must allocate space dynamically.
• The system finds the physical disk locations of the file’s logical blocks by indexing a block map (the file’s index node or “inode”).

A filesystem on disk
[Figure: a toy on-disk layout. Fixed locations on disk: inode 0 is the allocation bitmap file and inode 1 is the root directory. Directory blocks hold entries such as wind: 18, snow: 62, rain: 32, hail: 48. The allocation bitmap file’s blocks hold bit patterns such as 11100010 00101101 10111101. A regular file (inode) points to file blocks holding text such as “once upon a time /n in a l” and “and far far away , lived th”.]
This is a toy example (Nachos).
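The allocation bitmap in the toy layout above can be sketched as follows. This is a minimal illustration, not the Nachos code: NUM_BLOCKS, the bit order within each byte (least significant bit first), and the function names block_alloc and block_free are assumptions made for the example. When a file grows, the file system would call something like block_alloc and record the returned block number in the file's block map.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_BLOCKS 64   /* illustrative disk size in blocks (assumed) */

/* One bit per disk block: 1 = allocated, 0 = free. */
static uint8_t bitmap[NUM_BLOCKS / 8];

static int bit_test(int b) {
    return (bitmap[b / 8] >> (b % 8)) & 1;
}

static void bit_set(int b) {
    bitmap[b / 8] |= (uint8_t)(1u << (b % 8));
}

static void bit_clear(int b) {
    bitmap[b / 8] &= (uint8_t)~(1u << (b % 8));
}

/* Find a free block by linear scan and mark it allocated.
 * Returns the block number, or -1 if the disk is full. */
int block_alloc(void) {
    for (int b = 0; b < NUM_BLOCKS; b++) {
        if (!bit_test(b)) {
            bit_set(b);
            return b;
        }
    }
    return -1;
}

/* Return a previously allocated block to the free pool. */
void block_free(int b) {
    assert(b >= 0 && b < NUM_BLOCKS && bit_test(b));
    bit_clear(b);
}
```

Because all blocks are the same size, any free block satisfies any request, which is why a single bit per block is enough bookkeeping.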
Names and layers
[Figure: layers and the names used at each boundary.]
User view: notes in a notebook file
Application: notefile
    fd, byte range* (fd, bytes, block#)
File System
    device, block #
Disk Subsystem
    surface, cylinder, sector
Add more layers as needed.

Directories
[Figure: a directory inode and its directory block, with entries wind: 18, snow: 62, rain: 32, hail: 48. Other labels: 0, lblock 32.]
A creat operation must scan the directory to ensure that creates are exclusive.
There can be no duplicate names: the name mapping is a function.
Entries or free slots are typically found by a linear scan.
Note: implementations vary. Large directories are problematic.

Operations on Directories (UNIX)
• Link - make entry pointing to file
• Unlink - remove entry pointing to file
• Rename
• Mkdir - create a directory
• Rmdir - remove a directory

Links
[Figure: directory usr with subdirectories Lynn and Marty, and files foo and bar. Commands shown: ln -s /usr/Marty/bar bar; creat foo; unlink foo; creat bar; ln /usr/Lynn/foo bar; unlink bar.]

Unix File Naming (Hard Links)
A Unix file may have multiple names.
Each directory entry naming the file is called a hard link.
Each inode contains a reference count showing how many hard links name it.
[Figure: directories A and B with entries rain: 32, wind: 18, hail: 48, sleet: 48. The entries hail and sleet both name inode 48, whose link count = 2.]

link system call: link(existing name, new name)
– create a new name for an existing file
– increment inode link count

unlink system call (“remove”): unlink(name)
– destroy directory entry
– decrement inode link count
– if count == 0 and file is not in active use, free blocks (recursively) and on-disk inode

Illustrates: garbage collection by reference counting.

Unix Symbolic (Soft) Links
A soft link is a file containing a pathname of some other file.

symlink system call: symlink(existing name, new name)
– allocate a new file (inode) with type symlink
– initialize file contents with existing name
– create directory entry for new file with new name

[Figure: directories A and B with entries rain: 32, wind: 18, hail: 48, sleet: 67. The entry sleet names inode 67, a symlink whose contents are the pathname ../A/hail/0; hail names inode 48. Label: inode link count = 1.]
The target of the link may be removed at any time, leaving a dangling reference.
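The link, unlink, and symlink behavior described above can be observed from a user program. The sketch below uses only standard POSIX calls (open, link, symlink, unlink, stat, lstat, mkdtemp); the scratch directory under /tmp, the file names, and the helper names link_count and links_demo are assumptions made for the example.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Return the inode link count for path, or -1 on error. */
long link_count(const char *path) {
    struct stat st;
    assert(path != NULL);
    if (stat(path, &st) == -1)
        return -1;
    return (long)st.st_nlink;
}

/* Walk through create, link, symlink, unlink in a scratch directory.
 * Returns 0 if every step behaved as the slides describe, otherwise
 * the number of the first step that did not. */
int links_demo(void) {
    char dir[] = "/tmp/linksXXXXXX";   /* hypothetical scratch location */
    char hail[64], sleet[64], soft[64];
    struct stat st;
    int fd;

    if (mkdtemp(dir) == NULL) return 1;
    snprintf(hail, sizeof hail, "%s/hail", dir);
    snprintf(sleet, sizeof sleet, "%s/sleet", dir);
    snprintf(soft, sizeof soft, "%s/soft", dir);

    /* Create a file: one directory entry, link count 1. */
    if ((fd = open(hail, O_CREAT | O_EXCL | O_WRONLY, 0600)) == -1) return 2;
    close(fd);
    if (link_count(hail) != 1) return 3;

    /* Hard link: a second name for the same inode, link count 2. */
    if (link(hail, sleet) == -1) return 4;
    if (link_count(hail) != 2) return 5;

    /* Soft link: a new file holding a pathname; the target's
     * link count does not change. */
    if (symlink(hail, soft) == -1) return 6;
    if (link_count(hail) != 2) return 7;

    /* Remove one name: the inode survives under its other name. */
    if (unlink(hail) == -1) return 8;
    if (link_count(sleet) != 1) return 9;

    /* The symlink itself still exists, but its target name is gone:
     * lstat succeeds while stat fails with ENOENT (dangling reference). */
    if (lstat(soft, &st) == -1 || !S_ISLNK(st.st_mode)) return 10;
    if (stat(soft, &st) != -1 || errno != ENOENT) return 11;

    /* Clean up. */
    unlink(sleet);
    unlink(soft);
    rmdir(dir);
    return 0;
}
```

The hard link keeps the file alive after one name is removed because the count lives in the inode, while the soft link binds to a name at resolve time and so can dangle.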
How should the kernel handle recursive soft links?

Concepts
• Reference counting and reclamation
• Redirection/indirection
• Dangling reference
• Binding time (create time vs. resolve time)
• Referential integrity

Processes and the kernel
[Figure: processes, each with its own data, above a protected OS kernel. Arrows: protected system calls (into the kernel) ...and upcalls (e.g., signals).]
Programs run as independent processes.
Each process has a private virtual address space and one thread.
Threads enter the kernel for OS services.
The protected OS kernel mediates access to shared resources.
The kernel is a separate component/context with enforced modularity.
The kernel syscall interface supports processes, files, pipes, and signals.

[Figure: GS4. Layered systems. Garlan and Shaw, An Introduction to Software Architecture, 1994.]

Processes: A Closer Look
A process = virtual address space + thread (with stack) + process descriptor (PCB).
• Virtual address space: The address space is a private name space for a set of memory segments used by the process. The kernel must initialize the process memory for the program to run.
• Thread: Each process has a thread bound to the VAS. The thread has a stack addressable through the VAS. The kernel can suspend/restart the thread wherever and whenever it wants.
• Process descriptor (PCB): The OS maintains some state for each process in the kernel’s internal data structures: a file descriptor table, links to maintain the process tree, and a place to store the exit status. Fields shown: user ID, process ID, parent PID, sibling links, children, resources.

VAS example (32-bit)
[Figure labels, top of the address space: 0x7fffffff, Reserved, Stack.]
• An addressable array of bytes…
• Containing every instruction the process thread can execute…
• And every piece of data those instructions can read/write…
– i.e., read/write == load/store
• Partitioned into logical segments with distinct purpose and use.
• Every memory reference by a thread is interpreted in its VAS context.
– Resolve to a location in machine memory
• A given address in different VAS may resolve to different locations.
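The last bullet can be checked with a short experiment. The sketch below uses the Unix fork() call (introduced later in these slides) to create a second process, and a pipe to report back. The variable name shared_name and the function name are hypothetical. The child stores to a variable and reports its address; the parent then checks that the address matches its own while its own copy is unchanged.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int shared_name = 1;   /* same virtual address in parent and child */

/* Returns 1 if the child used the same virtual address for shared_name
 * but its store did not affect the parent's copy, 0 if not,
 * and -1 if a system call failed. */
int same_address_different_location(void) {
    int fds[2];
    if (pipe(fds) == -1)
        return -1;

    pid_t pid = fork();
    if (pid == -1)
        return -1;

    if (pid == 0) {                        /* child */
        close(fds[0]);
        shared_name = 2;                   /* store through the same address */
        uintptr_t addr = (uintptr_t)&shared_name;
        ssize_t w = write(fds[1], &addr, sizeof addr);
        _exit(w == (ssize_t)sizeof addr ? 0 : 1);
    }

    /* parent */
    close(fds[1]);
    uintptr_t child_addr = 0;
    ssize_t r = read(fds[0], &child_addr, sizeof child_addr);
    close(fds[0]);

    int status;
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    if (r != (ssize_t)sizeof child_addr)
        return -1;

    return child_addr == (uintptr_t)&shared_name   /* same name (address) */
        && shared_name == 1;                       /* different location */
}
```

The address is a name that is only meaningful within one VAS: the child's store went to the child's private copy, so the parent still reads 1.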
[Figure labels, continuing down to address 0x0: Dynamic data (heap/BSS), Static data, Text (code).]

A Peek Inside a Running Program
[Figure: a CPU with registers (R0 ... Rn, PC, SP) beside an address space (virtual or physical) running from 0 to high “memory”. The address space holds your program code, the common runtime library, your data, the heap, and the stack. Registers x and y point into the address space.]

Unix File Descriptors Illustrated
[Figure: in user space, a process; in the kernel, its file descriptor table, pointing into the open file table, whose entries refer to objects such as a file, pipe, socket, or tty.]
Processes may share open files (“objects”), but the binding of file descriptors to objects is specific to each process, e.g., see the dup system call.
Disclaimer: this drawing is oversimplified.

Networking
[Figure: node A and node B joined by a channel (binding, connection). Each end is an endpoint (port). Operations: advertise (bind), listen, connect (bind), close, write/send, read/receive.]
Some IPC mechanisms allow communication across a network, e.g., sockets using Internet communication protocols (TCP/IP).
Each endpoint on a node (host) has a port number.
Each node has one or more interfaces, each on at most one network.
Each interface may be reachable on its network by one or more names, e.g., an IP address and an (optional) DNS name.

Networking stack

What is a distributed system?
"A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable." -- Leslie Lamport

Example: browser
[Figure: GS6. Interpreter. Garlan and Shaw, An Introduction to Software Architecture, 1994.]

Interpreter: example
An interpreter controls how a program executes and what it sees.
An interpreter can “sandbox” a program for isolation.

Processes in the browser

Threads: a familiar metaphor
1. Page links and the back button navigate a “stack” of pages in each tab.
2. Each tab has its own stack. One tab is active at any given time. You create/destroy tabs as needed. You switch between tabs at your whim.
3. Similarly, each thread has a separate stack. The OS switches between threads at its whim. One thread is active per CPU core at any given time.
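The claim that each thread has a separate stack can be observed with POSIX threads. This is a small sketch that assumes a pthreads implementation is available (on some systems it must be built with -pthread); the function names are hypothetical. Each thread records the address of one of its own local variables, and the caller compares the addresses.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Thread body: record the address of a local variable, which lives
 * on the stack of whichever thread runs this function. */
static void *record_stack_address(void *out) {
    int local = 0;
    assert(out != NULL);
    *(uintptr_t *)out = (uintptr_t)&local;
    return NULL;
}

/* Returns 1 if two threads and the calling thread each placed a local
 * variable at a distinct address, 0 if not, -1 if thread creation failed. */
int threads_have_separate_stacks(void) {
    pthread_t t1, t2;
    uintptr_t a1 = 0, a2 = 0;
    int mine = 0;                      /* lives on the caller's stack */

    if (pthread_create(&t1, NULL, record_stack_address, &a1) != 0)
        return -1;
    if (pthread_create(&t2, NULL, record_stack_address, &a2) != 0) {
        pthread_join(t1, NULL);
        return -1;
    }
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    return a1 != a2
        && a1 != (uintptr_t)&mine
        && a2 != (uintptr_t)&mine;
}
```

All three stacks live in the same VAS, so any thread could in principle read another thread's stack through a pointer. The separation is a convention of where each thread's SP points, not a protection boundary.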
Fork
• Child can’t be an exact copy
• Is distinguished by one variable (the return value of fork)

    if (fork() == 0) {
        /* child: execute new program */
    } else {
        /* parent: carry on */
    }

Memory and fragmentation

An advantage of address spaces

Enforced modularity

Concept: garbage collection

Managing the pointers

Post-note: understand garbage collection
• Garbage collection: the language runtime system calls the underlying heap manager to free unused heap blocks automatically; the program itself does not have to do it.
– Java does it for you, but C does not.
• A heap block is “garbage” only when there are no references to the block, e.g., no pointers to the object that lives in that block.
– A reference is a stored name. The garbage collector counts these references, and marks a block as garbage when all references to it are gone. To do that it must find/identify all stored references.
• Java knows the types of all of a program’s data objects, so it can find stored references and identify their targets.
• A language that supports garbage collection may also move objects around to compact the heap to reduce fragmentation.
• Weakly typed languages like C cannot do this for you.
Q: can a file system garbage collect or compact stored data on disk?

Post-note
• Next slide gives more detail on fork/exit.
• We will discuss kernel protection and kernel entry and exit more later.

Mode Changes for Fork/Exit
• Syscall traps and “returns” are not always paired.
• Fork “returns” (to child) from a trap that “never happened”
• Exit system call trap never returns
• System may switch processes between trap and return
[Figure: timelines for parent and child. Parent: Fork call, Fork return, Wait call, Wait return. Child: Fork entry to user space, Exit call. Legend: transition from user to kernel mode (callsys); transition from kernel to user mode (retsys).]
Exec enters the child by doctoring up a saved user context to “return” through.

Example: System Call Traps
• Programs in C, C++, etc.
invoke system calls by linking to a standard library of procedures written in assembly language.
– the library defines a stub or wrapper routine for each syscall
– stub executes a special trap instruction (e.g., chmk or callsys or int)
– syscall arguments/results passed in registers or user stack

read() in Unix libc.a Alpha library (executes in user mode), Alpha CPU architecture:

    #define SYSCALL_READ 27        # op ID for a read system call

    move arg0…argn, a0…an          # syscall args in registers A0..AN
    move SYSCALL_READ, v0          # syscall dispatch index in V0
    callsys                        # kernel trap
    move r1, _errno                # errno = return status
    return

Representing a File On Disk
[Figure: an “inode” holding file attributes and a block map, pointing to three data blocks. Logical block 0: “once upo n a time /nin a l”. Logical block 1: “and far far away ,/nlived t”. Logical block 2: “he wise and sage wizard.”]
File attributes: may include owner, access control list, time of create/modify/access, etc.
Block map: index by logical block number. Physical block pointers in the block map are sector IDs or physical block numbers.

Post-note
• The following slides were presented in the next class (on Android) as intro to motivate Android.
• Android keeps the Unix (Linux) kernel, but replaces the entire application framework.
– Shell is gone. App execution is controlled by a trusted system-wide server process, which is part of the system TCB.
– Pipes are gone. Apps interact through system events (intents) and service bindings (binder RPC).
– There is only one user, but each app has its own userID.
– Each app has at most one instance, with its private files.
– Terminals are gone: user opens screens (activities) to interact with apps. The system keeps an activity stack with a “back” button.
• foreground and background activities?
– System launches app components and reclaims them at suitable times. They don’t “exit”.

Unix, looking backward: UI+IPC
• Conceived around keystrokes and byte streams
– User-visible environment is centered on a text-based command shell.
• Limited view of how programs interact
– files: byte streams in a shared name space
– pipes: byte streams between pairs of sibling processes

Unix, looking backward: upcalls
• Limited view of how programs interact with the OS.
– The kernel directs control flow into the user process at a fixed entry point: e.g., entry for exec() is _crt0 or “main”.
– Process may also register signal handlers for events relating to the process, (generally) signalled by the kernel.
– Process lives until it exits voluntarily or fails
• “receives an unhandled signal that is fatal by default”.

X Windows (1985)
Big change: GUI.
1. Windows
2. Window server
3. App events
4. Widget toolkit

Unix, looking backward: security
• Presumes multiple users sharing a machine.
• Each user has a userID.
– UserID owns all files created by all programs the user runs.
– Any program can access any file owned by userID.
• Each user trusts all programs it chooses to run.
– We “deputize” every program.
– Some deputies get confused.
– Result: decades of confused deputy security problems.
• Contrary view: give programs the privileges they need, and nothing more.
– Principle of Least Privilege