The Machine and the Kernel
Mode, space, and context: the basics
Jeff Chase, Duke University

64 bytes: 3 ways
Memory is "fungible": the same block of bytes can be viewed through different types. A block starting at p + 0x0 can be treated as an array of ints (int p[] / int *p), an array of chars (char p[] / char *p), or an array of pointers (char *p[] / char **p). Pointers (addresses) are 8 bytes on a 64-bit machine.

Endianness
Lilliput and Blefuscu are at war over which end of a soft-boiled egg to crack (Gulliver's Travels, 1726). A silly difference among machine architectures creates a need for byte swapping when unlike machines exchange data over a network. x86 is little-endian. Little-endian: the lowest-numbered byte of a word (or longword or quadword) is the least significant. Example: the characters 'h' (0x68), 'i' (0x69), '!' (0x21) stored at increasing addresses read back as the integer 0x216968 on a little-endian machine:
  chase$ cc -o heap heap.c
  chase$ ./heap
  hi! 0x216968
  chase$

Network messages
https://developers.google.com/protocol-buffers/docs/overview

Byte swapping: example (buggyserver.c)
  struct sockaddr_in socket_addr;
  sock = socket(PF_INET, SOCK_STREAM, 0);
  memset(&socket_addr, 0, sizeof socket_addr);
  socket_addr.sin_family = PF_INET;
  socket_addr.sin_port = htons(port);
  socket_addr.sin_addr.s_addr = htonl(INADDR_ANY);
  if (bind(sock, (struct sockaddr *) &socket_addr, sizeof socket_addr) < 0) {
      perror("couldn't bind");
      exit(1);
  }
  listen(sock, 10);

Heap: dynamic memory
The heap is a contiguous chunk of memory obtained from the OS kernel, e.g., with the Unix sbrk() system call. A runtime library obtains the block and manages it as a "heap" for use by the programming language environment, to store dynamic objects, e.g., with the Unix malloc and free library calls. Allocated heap blocks hold structs or objects. Align!

Heap manager policy
• The heap manager must find a suitable free block to return for each call to malloc().
  – No byte can be part of two simultaneously allocated heap blocks! If any byte of memory is doubly allocated, programs will fail. We test for this!
• A heap manager has a policy algorithm to identify a suitable free block within the heap.
  – Last fit, first fit, best fit, worst fit: choose your favorite!
  – Goals: be quick, and use memory efficiently.
  – Behavior depends on the workload: the pattern of malloc/free requests.
• This is an old problem in computer science, and it occurs in many settings: variable partitioning.

Variable Partitioning
Variable partitioning is the strategy of parking differently sized cars along a street with no marked parking space dividers. The wasted space between allocations is external fragmentation.

Fixed Partitioning
With fixed-size partitions, the wasted space inside each partition is internal fragmentation.

Time sharing vs. space sharing
Two common modes of resource allocation: dividing a resource up over time, or dividing it up in space. What kinds of resources do these work for?

Operating Systems: The Classical View
Programs run as independent processes. Each process has a private virtual address space and one or more threads. The protected OS kernel mediates access to shared resources: threads enter the kernel for OS services via protected system calls, and the kernel delivers upcalls (e.g., signals) to processes. The kernel code and data are protected from untrusted processes.

"Classic Linux Address Space" (see http://duartes.org/gustavo/blog/category/linux; the Windows IA-32 layout is similar, with a kernel region). From low to high addresses, 0x0 up toward 0x7fffffff: Text (code), Static data, Dynamic data (heap/BSS), then the Stack growing down from the top, with a Reserved region above it.

Processes: A Closer Look
The virtual address space is a private name space for a set of memory segments used by the process. The kernel must initialize the process memory for the program to run, and keeps a process descriptor (PCB) for it. Each process has a thread bound to the VAS; the thread has a stack addressable through the VAS. The kernel can suspend/restart the thread wherever and whenever it wants.
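To make the "private name space" point concrete, here is a minimal sketch (not from the slides; the variable names are illustrative) showing that a child process created with fork() gets its own copy of the parent's memory, so a write in the child is invisible to the parent:

  /* Sketch: each process has a private virtual address space. */
  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/types.h>
  #include <sys/wait.h>
  #include <unistd.h>

  int x = 1;                          /* static data: one copy per process */

  int main(void) {
      pid_t pid = fork();             /* create a new process with its own VAS */
      if (pid == 0) {
          x = 100;                    /* modifies only the child's copy */
          printf("child:  x = %d\n", x);
          exit(0);
      }
      waitpid(pid, NULL, 0);          /* parent collects the child's exit status */
      printf("parent: x = %d\n", x);  /* still 1: the parent's copy is untouched */
      return 0;
  }

The parent prints x = 1 even though the child set its copy to 100: the child's write lands in its own address space.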
The OS maintains some state for each process in the kernel's internal data structures (the process descriptor, or PCB): a user ID, a process ID, the parent PID, sibling links and children (to maintain the process tree), and resources such as the file descriptor table and a place to store the exit status.

A process can have multiple threads
  volatile int counter = 0;
  int loops;

  void *worker(void *arg) {
      int i;
      for (i = 0; i < loops; i++) {
          counter++;
      }
      pthread_exit(NULL);
  }

  int main(int argc, char *argv[]) {
      if (argc != 2) {
          fprintf(stderr, "usage: threads <loops>\n");
          exit(1);
      }
      loops = atoi(argv[1]);
      pthread_t p1, p2;
      printf("Initial value : %d\n", counter);
      pthread_create(&p1, NULL, worker, NULL);
      pthread_create(&p2, NULL, worker, NULL);
      pthread_join(p1, NULL);
      pthread_join(p2, NULL);
      printf("Final value : %d\n", counter);
      return 0;
  }
Both threads update the shared global counter in the process data segment. Much more on this later!

Key Concepts for Classical OS
• kernel: the software component that controls the hardware directly and implements the core privileged OS functions. Modern hardware has features that allow the OS kernel to protect itself from untrusted user code.
• thread: an executing instruction path and its CPU register state.
• virtual address space: an execution context for thread(s) defining a name space for executing instructions to address data and code.
• process: an execution of a program, consisting of a virtual address space, one or more threads, and some OS kernel state.

The theater analogy
Running a program is like performing a play. [lpcox] The program is the script, the virtual memory (address space) is the stage, and the threads are the actors.

The sheep analogy
The machine has a bank of CPU cores for threads to run on. The OS allocates cores to threads. Cores are hardware: they go where the driver tells them. Threads drive cores (Core #1, Core #2), and the drivers can be switched at any time.

What was the point of that whole thing with the electric sheep actors?
• A process is a running program.
• A running program (a process) has at least one thread ("main"), but it may (optionally) create other threads.
• The threads execute the program ("perform the script").
• The threads execute on the "stage" of the process virtual memory, with access to a private instance of the program's code and data.
• A thread can access any virtual memory in its process, but is contained by the "fence" of the process virtual address space.
• Threads run on cores: a thread's core executes instructions for it.
• Sometimes threads idle to wait for a free core, or for some event. Sometimes cores idle to wait for a ready thread to run.
• The operating system kernel shares/multiplexes the computer's memory and cores among the virtual memories and threads.

Processes and threads
Each process has a virtual address space (VAS): a private name space for the virtual memory it uses. The VAS is both a "sandbox" and a "lockbox": it limits what the process can see/do, and protects its data from others. Each process has a main thread bound to the VAS, with stacks (user and kernel), and optionally other threads. If we say a process does something, we really mean its thread does it. The kernel can suspend/restart a thread wherever and whenever it wants. From now on, we suppose that a process could have multiple threads. We presume that they can all make system calls and block independently.
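Since we now assume threads can make system calls and block independently, here is a minimal sketch (not from the slides; the thread names and timings are illustrative) of one thread blocking in the kernel on sleep() while another thread in the same process keeps running. Compile with -pthread.

  /* Sketch: threads in one process block independently. */
  #include <pthread.h>
  #include <stdio.h>
  #include <unistd.h>

  void *sleeper(void *arg) {
      printf("sleeper: blocking in the kernel for 2 seconds\n");
      sleep(2);                      /* this thread blocks; the other keeps running */
      printf("sleeper: awake\n");
      return NULL;
  }

  void *worker(void *arg) {
      for (int i = 0; i < 5; i++) {
          printf("worker: still running (%d)\n", i);
          usleep(200000);            /* about 200 ms between messages */
      }
      return NULL;
  }

  int main(void) {
      pthread_t s, w;
      pthread_create(&s, NULL, sleeper, NULL);
      pthread_create(&w, NULL, worker, NULL);
      pthread_join(s, NULL);
      pthread_join(w, NULL);
      return 0;
  }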
A thread running in a process VAS
[Figure: a CPU core, with registers R0..Rn plus PC and SP, runs a thread over an address space (virtual or physical) containing your program code, your data, the heap, the common runtime library, and the stack, e.g., a virtual memory for a process.]

Thread context
• Each thread has a context (exactly one).
  – Context == the values in the thread's registers.
  – Including a (protected) identifier naming its VAS.
  – And a pointer to the thread's stack in the VAS/memory.
• Each CPU core has a context (at least one).
  – Context == a register set that can hold values.
  – The register set is baked into the hardware.
• A core can change "drivers": a context switch.
  – Save the running thread's register values into memory.
  – Load the new thread's register values from memory.
  – (Think of driver settings for the seat, mirrors, audio…)
  – Enables time slicing or time sharing of the machine.

Programs gone wild
  int main() { while(1); }
Can you hear the fans blow? How does the OS regain control of the core from this program? How to "make" the process save its context and give some other process a chance to run? How to "make" processes share machine resources fairly?

Timer interrupts, faults, etc.
• When a processor core is running a user program, the user program/thread controls ("drives") the core.
• The hardware has a timer device that interrupts the core after a given interval of time.
• The interrupt transfers control back to the OS kernel, which may switch the core to another thread, or resume.
• Other events also return control to the kernel:
  – Wild pointers
  – Divide by zero
  – Other program actions
  – Page faults

Entry to the kernel
Every entry to the kernel is the result of a trap, fault, or interrupt. The core switches to kernel mode and transfers control to a handler routine. The kernel code and data implement system calls (files, process fork/exit/wait, pipes, binder IPC, low-level thread support, etc.) and virtual memory management (page faults, etc.). Entries arrive as syscall trap/return, fault/return, and interrupt/return (e.g., I/O completions and timer ticks). The handler accesses the core register context to read the details of the exception (trap, fault, or interrupt). It may call other kernel routines.

CPU mode: User and Kernel
The CPU mode (a field in some status register) indicates whether a machine CPU (core) is running in a user program or in the protected kernel (protected mode). Some instructions or register accesses are legal only when the CPU (core) is executing in kernel mode. The CPU mode transitions to kernel mode only on machine exception events (trap, fault, interrupt), which transfer control to a trusted handler routine registered with the machine at kernel boot time. So only the kernel program chooses what code ever runs in kernel mode (or so we hope and intend). A kernel handler can read the user register values at the time of the event, and modify them arbitrarily before (optionally) returning to user mode.

Exceptions: trap, fault, interrupt
Exceptions are synchronous (caused by an instruction) or asynchronous (caused by some other event), and intentional or unintentional:
• trap (synchronous, intentional: happens every time the instruction executes): a system call such as open, close, read, write, fork, exec, exit, wait, kill, etc.
• fault (synchronous, unintentional: depends on contributing factors): an invalid or protected address or opcode, a page fault, overflow, etc.
• "software interrupt" (asynchronous, intentional): software requests an interrupt to be delivered at a later time.
• interrupt (asynchronous, unintentional): caused by an external event: an I/O op completed, a clock tick, a power failure, etc.
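Signals give user code a rough analogue of this event machinery. As a minimal sketch (not from the slides; the handler name and the 100 ms interval are arbitrary, and user-level signals are only an analogy for hardware interrupts handled in kernel mode), an interval timer can "interrupt" the spinning loop from the "Programs gone wild" slide, much as the hardware timer lets the kernel regain control of a core:

  /* Sketch: a periodic timer signal "interrupting" a spinning loop. */
  #include <signal.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/time.h>

  static volatile sig_atomic_t ticks = 0;

  static void on_tick(int sig) {
      ticks++;                        /* the "timer interrupt", delivered to user code */
  }

  int main(void) {
      struct sigaction sa;
      sa.sa_handler = on_tick;
      sigemptyset(&sa.sa_mask);
      sa.sa_flags = 0;
      sigaction(SIGALRM, &sa, NULL);

      struct itimerval it = { {0, 100000}, {0, 100000} };  /* fire every 100 ms */
      setitimer(ITIMER_REAL, &it, NULL);

      while (1) {                     /* the "program gone wild" */
          if (ticks >= 10) {          /* after about a second, give up the core */
              printf("interrupted %d times; exiting\n", (int) ticks);
              exit(0);
          }
      }
  }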
Kernel Stacks and Trap/Fault Handling
Threads execute user code on a user stack in the user virtual memory of the process virtual address space. Each thread also has a second kernel stack in kernel space (virtual memory accessible only in kernel mode). System calls and faults run in kernel mode on the kernel stack. Kernel code running in process P's context has access to P's virtual memory. The syscall handler makes an indirect call through the system call dispatch table to the handler registered for the specific system call.

Virtual resource sharing
Understand that the OS kernel implements resource allocation (memory, CPU, …) by manipulating the name spaces and contexts visible to user code. The kernel retains control of user contexts and address spaces via the machine's limited direct execution model, based on protected mode and exceptions.

"Limited direct execution"
Any kind of machine exception transfers control to a registered (trusted) kernel handler running in a protected CPU mode. From boot onward, execution alternates between user mode and kernel mode: user code starts (u-start) and runs until a syscall trap, a fault, or a clock interrupt drops into the kernel "top half" or the kernel "bottom half" (interrupt handlers), and then control returns to user mode (u-return / interrupt return). The kernel handler manipulates the CPU register context to return to a selected user context.

Example: Syscall traps
• Programs in C, C++, etc. invoke system calls by linking to a standard library written in assembly.
  – The library defines a stub or wrapper routine for each syscall.
  – The stub executes a special trap instruction (e.g., a chmk, callsys, or syscall instruction) to change mode to kernel.
  – Syscall arguments/results are passed in registers (or on the user stack).
  – The OS defines the Application Binary Interface (ABI).

read() in the Unix libc.a library for the Alpha CPU ISA (defunct); executes in user mode:
  #define SYSCALL_READ 27      # op ID for a read system call
  move arg0…argn, a0…an        # syscall args in registers A0..AN
  move SYSCALL_READ, v0        # syscall dispatch index in V0
  callsys                      # kernel trap
  move r1, _errno              # errno = return status
  return

Linux x64 syscall conventions: the syscall number goes in %rax, arguments in %rdi, %rsi, %rdx, %r10, %r8, %r9, and the return value comes back in %rax; the trap instruction is syscall.

MacOS x86-64 syscall example (illustration only: this program writes "Hello World!" to standard output; from http://thexploit.com/secdev/mac-os-x-64-bit-assembly-system-calls/):
  section .data
  hello_world db "Hello World!", 0x0a

  section .text
  global start

  start:
      mov rax, 0x2000004      ; system call write = 4
      mov rdi, 1              ; write to standard out = 1
      mov rsi, hello_world    ; the address of the hello_world string
      mov rdx, 14             ; the size to write
      syscall                 ; invoke the kernel
      mov rax, 0x2000001      ; system call number for exit = 1
      mov rdi, 0              ; exit success = 0
      syscall                 ; invoke the kernel

A thread running in a process VAS
(The CPU/address-space figure from earlier, repeated: a core's registers R0..Rn, PC, and SP run a thread over the code, data, heap, and stack of a process virtual memory.)

Messing with the context
  #include <ucontext.h>

  int count = 0;
  ucontext_t context;

  int main() {
      int i = 0;
      getcontext(&context);
      count += 1;
      i += 1;
      sleep(2);
      printf("…", count, i);
      setcontext(&context);
  }

ucontext
Standard C library routines to:
• Save the current register context (from the core) to a block of memory: getcontext.
• Load/restore the current register context from a block of memory: setcontext.
• Also: makecontext, swapcontext.
Details of the saved context (the ucontext_t structure) are machine-dependent.
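The printf format string in this fragment is elided on the slides; a compilable reconstruction, with the format string borrowed from the "Something wild (5)" fragment at the end of this section, might look like the following. Build it without optimization to match the behavior discussed next (and note the slides' warning that ucontext is deprecated on MacOS).

  /* context0.c: a compilable sketch of the fragment above (format string assumed). */
  #include <stdio.h>
  #include <unistd.h>
  #include <ucontext.h>

  int count = 0;
  ucontext_t context;

  int main(void) {
      int i = 0;
      getcontext(&context);           /* save the core's register context here */
      count += 1;                     /* global: addressed relative to the PC (%rip) */
      i += 1;                         /* local: addressed relative to the frame (%rbp) */
      sleep(2);
      printf("%4d %4d\n", count, i);
      setcontext(&context);           /* reload the saved context: jump back */
      return 0;                       /* never reached */
  }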
Messing with the context (2)
The same program, now with sleep(1). getcontext saves the core context to memory; setcontext loads the core context from memory. Loading the saved context transfers control back to this block of code. (Why?) What about the stack?
  #include <ucontext.h>

  int count = 0;
  ucontext_t context;

  int main() {
      int i = 0;
      getcontext(&context);     /* save core context to memory */
      count += 1;
      i += 1;
      sleep(1);
      printf("…", count, i);
      setcontext(&context);     /* load core context from memory */
  }

Messing with the context (3)
  chase$ cc -o context0 context0.c
  < warnings: ucontext deprecated on MacOS >
  chase$ ./context0
  1 1
  2 2
  3 3
  4 4
  5 5
  6 6
  7 7
  …

Reading behind the C
Disassembled code for "count += 1; i += 1;" on this machine, with this cc. (On MacOS: man otool; otool -vt context0.)
  movl 0x0000017a(%rip),%ecx
  addl $0x00000001,%ecx
  movl %ecx,0x0000016e(%rip)
The static global _count is addressed relative to the location of the code itself, as given by the PC register (%rip is the instruction pointer register).
  movl 0xfc(%rbp),%ecx
  addl $0x00000001,%ecx
  movl %ecx,0xfc(%rbp)
The local variable i is addressed as an offset from the stack frame (%rbp is the stack frame base pointer). If %rip and %rbp are set "right", then these references "work".

Messing with the context (4)
  chase$ cc -O2 -o context0 context0.c
  < warnings: ucontext deprecated on MacOS >
  chase$ ./context0
  1 1
  2 1
  3 1
  4 1
  5 1
  6 1
  7 1
  …
What happened?

The point of ucontext
• The system can use the ucontext routines to:
  – "Freeze" execution at a point in time.
  – Restart execution from a frozen moment in time.
  – Execution continues where it left off… if the memory state is right.
• The system can implement multiple independent threads of execution within the same address space.
  – Create a context for a new thread with makecontext.
  – Modify saved contexts at will.
  – Context switch with swapcontext: transfer a core from one thread to another ("change drivers").
  – There is much more to this picture: per-thread stacks, kernel support, suspend/sleep, controlled ordering, etc. (A small makecontext/swapcontext sketch appears below, after the state diagram.)

Two threads: closer look
[Figure: two threads share one address space over the program code, library, and data. One thread is running on the CPU core, whose registers R0..Rn, PC, and SP hold its context; the other is "on deck" and ready to run, its context saved and its own stack sitting in the address space.]

Thread context switch
To switch the core from one thread to another: 1. save the outgoing thread's registers into memory ("switch out"); 2. load the incoming thread's registers from memory ("switch in"). Both threads' stacks remain in the address space.

A metaphor: context/switching
1. Page links and the back button navigate a "stack" of pages in each browser tab.
2. Each tab has its own stack. One tab is active at any given time. You create/destroy tabs as needed. You switch between tabs at your whim.
3. Similarly, each thread has a separate stack. The OS switches between threads at its whim. One thread is active per CPU core at any given time.

Messing with the context (5)
The same getcontext/setcontext loop one more time: what does this do?

Thread/process states and transitions
A thread is running ("driving a car"), ready ("requesting a car"), or blocked ("waiting for someplace to go"). The scheduler governs the transitions: dispatch (ready to running), yield (running to ready), sleep (running to blocked, e.g., in wait, STOP, read, write, listen, receive, etc.), and wakeup (blocked to ready). Sleep and wakeup are internal primitives. Wakeup adds a thread to the scheduler's ready pool: a set of threads in the ready state.
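Picking up the makecontext/swapcontext bullets above: here is a minimal user-level "two thread" sketch, not from the slides, with a private stack per context (the function names, stack size, and alternation order are all illustrative, and the ucontext routines are deprecated on some systems). Each swapcontext call saves the caller's registers and loads the other context's registers: a context switch in miniature.

  /* Sketch: two user-level contexts taking turns on one core. */
  #include <stdio.h>
  #include <stdlib.h>
  #include <ucontext.h>

  #define STACK_SIZE (64 * 1024)

  static ucontext_t main_ctx, t1_ctx, t2_ctx;

  static void thread1(void) {
      for (int i = 0; i < 3; i++) {
          printf("thread 1, step %d\n", i);
          swapcontext(&t1_ctx, &t2_ctx);   /* switch out thread 1, switch in thread 2 */
      }
      swapcontext(&t1_ctx, &main_ctx);     /* done: give the core back to main */
  }

  static void thread2(void) {
      for (int i = 0; i < 3; i++) {
          printf("thread 2, step %d\n", i);
          swapcontext(&t2_ctx, &t1_ctx);   /* switch out thread 2, switch in thread 1 */
      }
      swapcontext(&t2_ctx, &main_ctx);
  }

  int main(void) {
      getcontext(&t1_ctx);                          /* give each "thread" its own stack, */
      t1_ctx.uc_stack.ss_sp = malloc(STACK_SIZE);   /* then bind it to an entry function */
      t1_ctx.uc_stack.ss_size = STACK_SIZE;
      t1_ctx.uc_link = &main_ctx;
      makecontext(&t1_ctx, thread1, 0);

      getcontext(&t2_ctx);
      t2_ctx.uc_stack.ss_sp = malloc(STACK_SIZE);
      t2_ctx.uc_stack.ss_size = STACK_SIZE;
      t2_ctx.uc_link = &main_ctx;
      makecontext(&t2_ctx, thread2, 0);

      swapcontext(&main_ctx, &t1_ctx);              /* dispatch thread 1 */
      printf("back in main\n");
      return 0;
  }

The output interleaves "thread 1, step k" and "thread 2, step k" deterministically, because the contexts themselves choose when to switch; a real kernel would preempt them with timer interrupts instead.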
BLOCK MAPS AND PAGE TABLES

Blocks are contiguous
The storage in a heap block is contiguous in the virtual address space. The term block always refers to a contiguous sequence of bytes suitable for base+offset addressing. C and other programming language environments require this: e.g., the C compiler determines the offsets for named fields in a struct and "bakes" them into the code. This requirement complicates the heap manager, because heap blocks may be different sizes.

Block maps
Large data objects may be mapped so they don't have to be stored contiguously in machine memory (e.g., files, segments). Idea: use a level of indirection through a map to assemble a storage object from "scraps" of storage in different locations. The "scraps" can be fixed-size slots: that makes allocation easy because the slots are interchangeable (fixed partitioning, whose wasted space is internal fragmentation). Example: the page tables that implement a VAS.

x64, x86-64, AMD64: VM layout
The virtual memory page map. (Source: System V Application Binary Interface, AMD64 Architecture Processor Supplement, 2005.)

Names and maps
• Block maps and other indexed maps are a common structure for implementing "machine" name spaces:
  – sequences of logical blocks, e.g., virtual address spaces and files
  – process IDs, etc.
  – For sparse block spaces we may use a tree hierarchy of block maps (e.g., inode maps or two-level page tables, later).
  – Storage system software is full of these maps.
• Symbolic name spaces use different kinds of maps.
  – They are sparse, and matching is more expensive.
  – Property lists, key/value hash tables.
  – Trees of maps create nested namespaces, e.g., the file tree.

I hope we get to here.

EXTRA SLIDES

The Kernel
• Today, all "real" operating systems have protected kernels. The kernel resides in a well-known file: the "machine" automatically loads it into memory (boots) on power-on/reset. Our "kernel" is called the executive in some systems (e.g., Windows).
• The kernel is (mostly) a library of service procedures shared by all user programs, but the kernel is protected: user code cannot access internal kernel data structures directly, and can invoke the kernel only at well-defined entry points (system calls).
• Kernel code is "just like" user code, but the kernel is privileged: the kernel has direct access to all hardware functions, and defines the handler entry points for interrupts and exceptions.

Protecting Entry to the Kernel
Protected events and kernel mode are the architectural foundations of kernel-based OS (Unix, Windows, etc.).
– The machine defines a small set of exceptional event types.
– The machine defines what conditions raise each event.
– The kernel installs handlers for each event at boot time, e.g., in a table in kernel memory read by the machine.
The machine transitions from user mode to kernel mode only on an exceptional event (trap, interrupt, or fault), and the kernel defines the event handlers. Therefore the kernel chooses what code will execute in kernel mode, and when.

The Role of Events
• A CPU event (an interrupt or exception, i.e., a trap or fault) is an "unnatural" change in control flow.
• Like a procedure call, an event changes the PC register.
• It also changes the mode or the context (current stack), or both.
• Events do not change the current space!
• On boot, the kernel defines a handler routine for each event type.
• The machine defines the event types.
• Event handlers execute in kernel mode.
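User code can observe a slice of this event machinery indirectly: when a fault occurs, the kernel's handler may reflect it back to the process as a signal, as the fault examples below note. A minimal sketch for a POSIX system (not from the slides; the handler name is illustrative, and the null-pointer store exists only to provoke the fault):

  /* Sketch: a fault enters the kernel, which posts a signal back to the process. */
  #include <signal.h>
  #include <unistd.h>

  static void on_fault(int sig) {
      /* Only async-signal-safe calls belong here. */
      const char msg[] = "caught SIGSEGV: the kernel reflected the fault as a signal\n";
      write(STDERR_FILENO, msg, sizeof msg - 1);
      _exit(1);                       /* don't return: the faulting store would just retry */
  }

  int main(void) {
      struct sigaction sa;
      sa.sa_handler = on_fault;
      sigemptyset(&sa.sa_mask);
      sa.sa_flags = 0;
      sigaction(SIGSEGV, &sa, NULL);

      int *volatile p = NULL;         /* volatile so the compiler keeps the bad store */
      *p = 42;                        /* invalid address: fault -> kernel -> SIGSEGV upcall */
      return 0;                       /* never reached */
  }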
Control flow on a kernel entry (exception.cc)
• Every kernel entry results from an event.
• Control enters at the handler for that event, e.g., an ISR (Interrupt Service Routine).
• In some sense, the whole kernel is a "big event handler."

Examples
• Illegal operation
  – Reserved opcode, divide-by-zero, illegal access.
  – That's a fault! The kernel generates a signal, e.g., to kill the process or invoke programming-language exception handlers.
• Page fault
  – Fetch and install the page, maybe block the process.
  – Nothing illegal about it: "transparent" to the faulting process.
• I/O completion, arriving input, clock ticks
  – These external events are interrupts (including power fail, etc.).
  – The kernel services the interrupt in the handler.
  – It may wake up blocked processes, but there is no blocking in the handler itself.

Faults
• Faults are similar to system calls in some respects:
  – Faults occur as a result of a process executing an instruction.
  – Fault handlers execute on the process kernel stack; the fault handler may block (sleep) in the kernel.
  – The completed fault handler may return to the faulted context.
• But faults are different from syscall traps in other respects:
  – Syscalls are deliberate, but faults are "accidents": divide-by-zero, dereference of an invalid pointer, a memory page fault.
  – Not every execution of the faulting instruction results in a fault: it may depend on memory state or register contents.

Note: Something Wild
The "Something Wild" example that follows was an earlier version of "Messing with the context". It was not discussed in class. "Messing with the context" simplifies the example, but keeps all the essential info. "Something Wild" brings it just a little closer to coroutines: a context switch from one thread to another.

Something wild (1)
  #include <ucontext.h>

  int count = 0;
  int set = 0;
  ucontext_t contexts[2];

  void proc() {
      int i = 0;
      if (!set) {
          getcontext(&contexts[count]);
      }
      printf("…", count, i);
      count += 1;
      i += 1;
      if (set) {
          setcontext(&contexts[count & 0x1]);
      }
  }

  int main() {
      set = 0;
      proc();
      proc();
      set = 1;
      proc();
  }

Something wild (2)
The part of proc() that runs while set == 0: save a context into contexts[count], print, and increment.
  void proc() {
      int i = 0;
      getcontext(&contexts[count]);
      printf("…", count, i);
      count += 1;
      i += 1;
  }
  int main() { set = 0; proc(); proc(); … }

Something wild (3)
The part of proc() that runs once main sets set = 1: print, increment, sleep, then load one of the saved contexts.
  void proc() {
      int i = 0;
      printf("…", count, i);
      count += 1;
      i += 1;
      sleep(1);
      setcontext(&contexts[count & 0x1]);
  }
  int main() { … set = 1; proc(); }

Something wild (4)
We have a pair of register contexts that were saved at the getcontext point in the code. If we load either of the saved contexts, it transfers control back to that block of code. (Why?) What about the stack? The setcontext call switches to the other saved register context: alternate the "even" and "odd" contexts. Lather, rinse, repeat. What will it print? The count is a global variable… but what about i?

Something wild (5)
  void proc() {
      int i = 0;
      printf("%4d %4d\n", count, i);
      count += 1;
      i += 1;
      sleep(1);
      setcontext(…);
  }
What does this do?
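Those fragments are pieces of one program. For anyone who wants to run it and check a prediction, here is a compilable reconstruction: the format string comes from fragment (5), the sleep(1) placement from fragment (3), and the rest from fragment (1). The behavior is deliberately "wild": the saved contexts refer to stack frames that have already returned, so what i prints depends on what happens to be left in that reused stack memory.

  /* somethingwild.c: reconstruction of the fragments above (deliberately "wild"). */
  #include <stdio.h>
  #include <unistd.h>
  #include <ucontext.h>

  int count = 0;
  int set = 0;
  ucontext_t contexts[2];

  void proc(void) {
      int i = 0;
      if (!set) {
          getcontext(&contexts[count]);           /* save a context on the first two calls */
      }
      printf("%4d %4d\n", count, i);
      count += 1;
      i += 1;
      if (set) {
          sleep(1);
          setcontext(&contexts[count & 0x1]);     /* alternate the "even" and "odd" contexts */
      }
  }

  int main(void) {
      set = 0;
      proc();          /* saves contexts[0] */
      proc();          /* saves contexts[1] */
      set = 1;
      proc();          /* bounces between the saved contexts forever */
      return 0;        /* never reached */
  }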