Jeff Chase
Duke University
API
API
Protection boundary
Applications /services.
May interact and serve one another.
Libraries/frameworks : packaged code used by multiple applications
OS platform : same for all applications on a system
E,g,, classical OS kernel
OS mediates access to shared resources.
That requires protection and isolation.
[RAD Lab]
Programs run as independent processes. data data
Each process has a private virtual address space and one or more threads.
Protected system calls
...and upcalls
(e.g., signals)
Protected OS kernel mediates access to shared resources.
The kernel code and data are protected from untrusted processes.
Threads enter the kernel for
OS services.
Android Security Architecture
[http://developer.android.com/guide/topics/security/permissions.html]
“A central design point of the Android security architecture is that no application, by default, has permission to perform any operations that would adversely impact other applications, the operating system, or the user. This includes reading or writing the user's private data (such as contacts or emails), reading or writing another application's files, performing network access, keeping the device awake, and so on.
Because each Android application operates in a process sandbox , applications must explicitly share resources and data. They do this by declaring the permissions they need for additional capabilities not provided by the basic sandbox. Applications statically declare the permissions they require, and the Android system prompts the user for consent at the time the application is installed. Android has no mechanism for granting permissions dynamically (at run-time) because it complicates the user experience to the detriment of security .”
sections code (“text”) constants initialized data segments data
Program mapped file regions
Process with virtual memory
When a program launches, the OS initializes a process with a virtual memory to store the running program’s code and data. Typically it sets up the segments by memory-mapping sections of the executable file.
Header “ magic number ” indicates type of file/image.
Section table an array of (offset, len, startVA) sections
Used by linker; may be removed after final link step and strip. Also includes info for debugger.
header text program instructions p wdata immutable data (constants)
“ hello\n ” writable global/static data j, s symbol table j, s ,p,sbuf int j = 327; char* s = “ hello\n ” ; char sbuf[512]; relocation records int p() { int k = 0; j = write(1, s, 6); return(j);
}
chase:lab1> make gcc -I. -Wall -lm -DNDEBUG -c dmm.c
… gcc -I. -Wall -lm -DNDEBUG -o test_basic test_basic.c dmm.o
gcc -I. -Wall -lm -DNDEBUG -o test_coalesce test_coalesce.c dmm.o
gcc -I. -Wall -lm -DNDEBUG -o test_stress1 test_stress1.c dmm.o
gcc -I. -Wall -lm -DNDEBUG -o test_stress2 test_stress2.c dmm.o
chase:lab1> chase:lab1> ./test_basic calling malloc(10) call to dmalloc() failed chase:lab1>
myprogram.c
int j; char* s = “ hello\n ” ; assembler myprogram.o
object data file int p() { j = write(1, s, 6); return(j);
} compiler
…..
p: store this store that push jsr _write ret etc.
myprogram.s
linker data data data libraries and other object files or archives program header files myprogram
(executable file)
• Globals:
– Fixed-size segment
– Writable by user program
– May have initial values
• Text (instructions)
– Fixed-size segment
– Executable
– Not writable
• Heap and stack
– Variable-size segments
– Writable
– Zero-filled on demand
RCX
PC/RIP
SP/RBP x y registers
CPU core globals text heap stack segments
0x400000 text
Example: the details aren’t important.
r-x
0x600000
0x601000
0x1299000 idata data heap r-rwrw- [anon]
Program
0x2ba976c30000 lib lib r-x
N high addresses r-x
0x7fff1373b000
0x7fff1375c000 stack 64K rw- [anon] libc.so
shared library
• The operating system kernel!
– What is it?
– Where is it?
– How do we get there?
– How is it protected?
– How does it control resources?
– How does it control access to data?
– How does it keep control?
User processes/segme nts kernel user space kernel space
• Today, all “real” operating systems have protected kernels.
• The kernel resides in a well-known file: the machine automatically loads it into memory and starts it ( boot ) on power-on/reset.
• The kernel is (mostly) a library of service procedures shared by all user programs, but the kernel is protected :
• User code cannot access internal kernel data structures directly.
• User code can invoke the kernel only at well-defined entry points
( system calls ).
• Kernel code is “just like” user code, but the kernel is privileged :
• The kernel has direct access to all machine functions, and defines the handler entry points for CPU events: trap, fault, interrupt.
• Once booted, the kernel acts as one big event handler.
• The kernel is just a program: a collection of modules and their state.
• E.g., it may be written in C and compiled/linked a little differently.
– E.g., linked with –static option: no dynamic libs
• At runtime, kernel code and data reside in a protected range of virtual addresses.
– The range ( kernel space ) is “part of” every VAS.
– VPN->PFN translations for kernel space are global .
• (Details vary by machine and OS configuration)
– Access to kernel space is denied for user programs.
– Portions of kernel space may be non-pageable and/or direct-mapped to machine memory.
VAS kernel code kernel data
0x0 user space kernel space high
User spaces one per
VAS occupies
“low half” of
VAS (2GB) kernel space high-order bit set in virtual address
Alternative configuration allows user spaces larger than 2GB kernel space two highest bits are set
(0xc00..)
The point is:
There are lots of different regions within kernel space to meet internal OS needs.
- page tables for various VAS
- page table for kernel space itself
- file block cache
- internal data structures
The details aren’t important.
The current mode of a CPU core is represented by a field in a protected register.
We consider only two possible values: user mode or kernel mode (also called protected mode or supervisor mode ).
If the core is in protected mode then it can:
access kernel space
access certain control registers
execute certain special instructions
CPU core mode
R0
Rn
PC x registers
U/K
If software attempts to do any of these things when the core is in user mode, then the core raises a CPU exception (a fault ).
The details aren’t important.
See [en.wikipedia.org/wiki/Control_register]
• Suppose a CPU core is running user code in user mode:
– The user program controls the core.
– The core goes where the program code takes it…
– …as determined by its register state ( context ) and the values encountered in memory.
• How does the OS get control back? How does the core switch to kernel mode?
– CPU exception: trap, fault, interrupt
• On an exception, the CPU transitions to kernel mode and resets the PC and SP registers.
– Set the PC to execute a pre-designated handler routine for that exception type.
– Set the SP to a pre-designated kernel stack .
user space
Safe control transfer kernel code kernel data kernel space
synchronous caused by an instruction asynchronous caused by some other event intentional happens every time trap: system call open, close, read, write, fork, exec, exit, wait, kill, etc.
“ software interrupt ” software requests an interrupt to be delivered at a later time unintentional contributing factors fault invalid or protected address or opcode, page fault, overflow, etc.
interrupt caused by an external event: I/O op completed, clock tick, power fail, etc.
Usage note : some sources say that exceptions only occur as a result of executing an instruction, and so interrupts are not exceptions. But they are all examples of exceptional changes in control flow due to an event.
Every entry to the kernel is the result of a trap , fault , or interrupt . The core switches to kernel mode and transfers control to a handler routine.
syscall trap/return fault/return
OS kernel code and data for system calls (files, process fork/exit/wait, pipes, binder IPC, low-level thread support, etc.) and virtual memory management (page faults, etc.)
I/O completions interrupt/return timer ticks
The handler accesses the core register context to read the details of the exception (trap, fault, or interrupt). It may call other kernel routines.
• Programs in C, C++, etc. invoke system calls by linking to a standard library (libc) written in assembly.
– The library defines a stub or wrapper routine for each syscall.
– Stub executes a special trap instruction (e.g., chmk or callsys or syscall/sysenter instruction) to change mode to kernel.
– Syscall arguments/results are passed in registers (or user stack).
– OS+machine defines Application Binary Interface (ABI).
read() in Unix libc.a Alpha library ( executes in user mode ):
#define SYSCALL_READ 27 # op ID for a read system call move arg0…argn, a0…an
# syscall args in registers A0..AN
move SYSCALL_READ, v0 # syscall dispatch index in V0 callsys move r1, _errno return
# kernel trap
# errno = return status
Example read syscall stub for Alpha CPU ISA (defunct)
section .data
hello_world db "Hello World!", 0x0a section .text
global start
Illustration only : this program writes
“Hello World!” to standard output (fd == 1), ignores the syscall error return, and exits.
start: mov rax, 0x2000004 ; System call write = 4 mov rdi, 1 ; Write to standard out = 1 mov rsi, hello_world ; The address of hello_world string mov rdx, 14 syscall
; The size to write
; Invoke the kernel mov rax, 0x2000001 ; System call number for exit = 1 mov rdi, 0 ; Exit success = 0 syscall ; Invoke the kernel
Illustration only: the details aren’t important.
http://thexploit.com/secdev/mac-os-x-64-bit-assembly-system-calls/
(user buffer addresses)
Illustration only: the details aren’t important.
space
Understand that the OS kernel implements resource allocation
(memory, CPU,…) by manipulating name spaces and contexts visible to user code.
The kernel retains control of user contexts and address spaces via the machine’s limited direct execution model, based on protected mode and exceptions . time
How does the OS regain control of the core from this program?
No system calls! No faults!
How to give someone else a chance to run?
How to “make” processes share machine resources fairly?
while(1); … user mode resume time u-start boot kernel “top half” kernel “bottom half” (interrupt handlers) clock interrupt interrupt return kernel mode
Enables timeslicing
The system clock (timer) interrupts periodically, giving control back to the kernel.
The kernel can do whatever it wants, e.g., switch threads.
time
How should an OS allocate its memory resources among contending demands?
– Virtual address spaces: fork, exec, sbrk, page fault.
– The kernel controls how many machine memory frames back the pages of each virtual address space.
– The kernel can take memory away from a VAS at any time.
– The kernel always gets control if a VAS (or rather a thread running within a VAS) asks for more.
– The kernel controls how much machine memory to use as a cache for data blocks whose home is on slow storage.
– Policy choices: which pages or blocks to keep in memory? And which ones to evict from memory to make room for others?
• Protection domain
– A “ sandbox ” for threads that limits what memory they can access for read/write/execute.
– A “ lockbox ” that limits which threads can access any given segment of virtual memory.
• Uniform name space
– Threads access their code and data items without caring where they are in machine memory, or even if they are resident in memory at all.
• A set of V P translations
– A level of indirection mapping virtual pages to page frames.
– The OS kernel controls the translations in effect at any time.
Example only : a typical 32-bit architecture with 4KB pages.
virtual address
{
0 VPN
12 offset
Virtual address translation maps a virtual page number (VPN) to a page frame number (PFN) in machine memory: the rest is easy.
address translation
Deliver fault to
OS if translation is not valid and accessible in requested mode.
machine address
{
PFN
+ offset
• Machine memory is “just a cache ” over files and segments: a page fault is “just a cache miss”.
– Machine passes faulting address to kernel (e.g., x86 control register CR2) with fault type and faulting PC.
– Kernel knows which virtual space is active on the core (e.g., x86 control register CR3).
– Kernel consults other data structures related to virtual memory to figure out how to resolve the fault.
– If the fault indicates an error, then signal/kill the process.
– Else construct (or obtain) a frame containing the missing page, install the missing translation in the page table, and resume the user code, restarting the faulting instruction.
The x86 details are examples: not important.
MMU start here probe page table load
TLB access physical memory yes miss probe
TLB hit access valid?
no raise exception load
TLB fetch from disk yes zero-fill no (first reference) page on disk?
OS
(lookup and/or) allocate frame legal reference page fault?
illegal reference kill
How to monitor page reference events/frequency along the fast path?
user mode syscall trap fault fault time u-start u-return u-start u-return kernel “top half” kernel “bottom half” (interrupt handlers) interrupt interrupt return boot
User code runs on a CPU core in user mode in a user space. If it tries to do anything weird, the core transitions to the kernel, which takes over.
kernel mode
The kernel executes a special instruction to transition to user mode (labeled as “u-return”), with selected values in CPU registers.
• Each thread/context transfers control from user process/mode to kernel and back again.
• User can juggle ball (execute) before choosing to hit it back.
• But kernel can force user to return the ball at any time.
• Kernel can juggle or hide the ball
(switch thread out) before hitting it back to user.
• Kernel can drop ball at any time.
• Kernel is a multi-armed robot who plays many users at once.
• At most one ball in play for each core/slot at any given time.
Secure kernels handle system calls verrry carefully.
Syscalls indirect through syscall dispatch table by syscall number. No direct calls to kernel routines from user space!
User program / user space user buffers trap copyout copyin
What about references to kernel data objects passed as syscall arguments (e.g., file to read or write)? read () {…} write () {…} kernel
Kernel copies all arguments into kernel space and validates them.
Kernel interprets pointer arguments in context of the user
VAS, and copies the data in/out of kernel space (e.g., for read and write syscalls).
Use an integer index into a kernel table that points at the data object. The value is called a handle or descriptor . No direct pointers to kernel data from user space!
Process
(running program)
File system calls (e.g., open/read/write)
Files on “disk”
Thread globals text heap
Memory-mapped sections of program file
Program register context
Anonymous
Segments
(zero-fill)
Per-file inodes indexed with logical blockID #.
stack
Segments (regions) in Virtual Address Space
Read disk block address from map entry.
• When processor core is running a user program, the user program/thread controls (“drives”) the core.
• The hardware has a timer device that interrupts the core after a given interval of time.
• Interrupt transfers control back to the OS kernel, which may switch the core to another thread, or resume.
• Other events also return control to the kernel.
– Wild pointers
– Divide by zero
– Other program actions
– Page faults
Know how a classical OS uses the hardware to protect itself and implement a limited direct execution model for untrusted user code.
• Virtual addressing . Applications run in sandboxes that prevent them from calling procedures in the kernel or accessing kernel data directly (unless the kernel chooses to allow it).
• Events . The OS kernel installs handlers for various machine events when it boots (starts up). These events include machine exceptions
( faults ), which may be caused by errant code, interrupts from the clock or external devices (e.g., network packet arrives), and deliberate kernel calls ( traps ) caused by programs requesting service from the kernel through its API.
• Designated handlers . All of these machine events make safe control transfers into the kernel handler for the named event. In fact, once the system is booted, these events are the only ways to ever enter the kernel, i.e., to run code in the kernel.
I hope we get to here
Butler Lampson’s definition: “I am isolated if anything that goes wrong is my fault (or my program’s fault).”
Three dimensions of isolation for protected contexts (e.g., processes):
• Fault isolation . One app or app instance (process) can fail independently of others. If it runs amok, the OS can kill it and reclaim its memory, etc.
• Performance isolation . The OS manages resources (“metal and glass”: computing power, memory, disk space, I/O bandwidth, etc.). Each instance needs the “right amount” of resources to run properly. The OS prevents apps from impacting the performance of other apps.
• Security . An app may contain malware that tries to corrupt the system, steal data, or otherwise compromise the integrity of the system. The OS uses protected contexts and a reference monitor to check and authorize all accesses to data or objects.
• A CPU event (an interrupt or exception , i.e., a trap or fault ) is an
“unnatural” change in control flow.
• Like a procedure call, an event changes the PC register.
• Also changes mode or context (current stack), or both.
• Events do not change the current space!
• On boot, the kernel defines a handler routine for each event type.
• The machine defines the event types.
• Event handlers execute in kernel mode.
control flow exception.cc
• Every kernel entry results from an event.
• Enter at the handler for the event.
In some sense, the whole kernel is a “big event handler.” event handler
(e.g., ISR: I nterrupt
S ervice R outine)
Protected events and kernel mode are the architectural foundations of kernel-based OS (Unix, Windows, etc).
– The machine defines a small set of exceptional event types.
– The machine defines what conditions raise each event.
– The kernel installs handlers for each event at boot time.
e.g., a table in kernel memory read by the machine
The machine transitions to kernel mode only on an exceptional event.
The kernel defines the event handlers.
Therefore the kernel chooses what code will execute in kernel mode, and when.
user trap/return interrupt or fault kernel interrupt or fault
• Illegal operation
– Reserved opcode, divide-by-zero, illegal access
– That’s a fault ! Kernel generates a signal to user program, e.g., to kill it or invoke an application’s exception handler.
• Page fault
– Case 1 : Fetch page (or zero it), map it in PTE, restart instruction
– Case 2 : Signal error (e.g., “segmentation fault”)
• Interrupts
– I/O completion, e.g., disk read complete: resume a program
– Arriving network packet, etc.: kick the network stack
– Clock ticks (timer interrupt): maybe do a context switch
– Power fail etc.: save state
char buf[BUFSIZE]; int fd;
An open file is represented by an integer file descriptor value returned by the kernel.
}
} if ((fd = open ( “ ../zot ” , O_TRUNC | O_RDWR) == -1) { perror( “ open failed ” ); exit(1);
Pass the file descriptor value back to kernel to reference the open file while( read (0, buf, BUFSIZE)) { on subsequent syscalls.
} if ( write (fd, buf, BUFSIZE) != BUFSIZE) { perror( “ write failed ” ); exit(1);
Read/write syscalls pass virtual address of a user-space buffer .
For a write , the kernel retrieves data from the buffer and copies it in to kernel space. For a read , the kernel copies the data out of kernel space and places it into the buffer.
user space kernel space int fd pointer per-process descriptor table system-wide open file table file pipe socket tty
Disclaimer: this drawing is oversimplified
(and we will talk about pipes, sockets, and tty later)
Processes often reference OS kernel objects with integers that index into a table of pointers in the kernel. Windows calls them handles .
Example : a Unix file descriptor is a value stored in an ordinary integer variable in a user program. The kernel chooses the value for the descriptor: when the program opens a file, the kernel selects a free entry in the descriptor table and returns its index as the value. The program remembers the number and uses it to name the open file for subsequent system calls.
2. Enter kernel for read syscall.
3. Figure out what disk blocks to fetch, and fetch them into kernel buffers.
5. Copy data from kernel buffer to user buffer in read .
(kernel mode)
6.
Return to user mode.
1. Compute
(user mode)
4. sleep for I/O ( stall ) CPU seek transfer (DMA) Disk
Time
This slide clarifies the safe copy primitives used by kernel syscall handlers. The names may be confusing to some of us (;-). Copyin to the kernel, copyout from the kernel.
This is an example from BSD Unix systems: the details aren’t important, but note the safety features. [From http://www.unix.com/man-page/FreeBSD/9/copyout/] copyin() Copies len bytes of data from the user-space address uaddr to the kernel-space address kaddr.
copyout() <copyout copies out of the kernel and in to the user-space buffer. > copyinstr() Copies a NUL-terminated string, at most len bytes long, from user-space address uaddr to kernel-space address kaddr. The number of bytes actually copied, including the terminating NUL, is returned in * done…
RETURN VALUES
The copy functions return 0 on success or EFAULT if a bad address is encountered. In addition, the copystr(), and copyinstr() functions return ENAMETOOLONG if the string is longer than len bytes.
The kernel keeps a vm_map for each VAS.
The vm_map is a linked list of map entries, one for each segment, sorted by starting virtual address.
Each map entry points to a descriptor for the segment (a vm_object).
(heap)
This data structure is used in Mach-derived kernels, including BSD
Unix and Mac OSX.
“Vnode” refers to the inode for the underlying
(backing) file.
The triangles represent VM objects
(segments). The dots represent pages within segments. A segment may have any number of pages resident.
[http://manrix.sourceforge.net/microkernelservice.htm]
Text and initialized static data are mapped from the executable file.
Missing pages may be fetched from the
(backing) file on demand.
(heap)
Pages from anonymous segments are initialized to zero, but the process may write to them. If they are evicted from memory the contents must be stored somewhere on disk.
The stack and heap are zero-filled virtual memory: called anonymous because the backing file has no name (i.e., no links: it is destroyed if the process dies).
[http://manrix.sourceforge.net/microkernelservice.htm]
• Kernel searches maps for the object mapped at that address. No object? Then it ’s an error, e.g., segmentation fault .
• Kernel checks intended mode of access for the object (rwx). Access not allowed? Then it’s an error, e.g., protection fault .
vm_map
1. Run down the vm_map for the
VAS to find the segment/region containing the faulting address.
2. If we find the segment, check the protection to see if the access is legal.
vm_object
3. If the access is legal, identify the backing object containing the page.
• Is the missing page (object/offset) in a memory frame somewhere, but just missing from the page table?
– Index page cache (object/offset hash table) to find out. (The page could be resident if the segment or backing object is shared, and the page is resident in memory on behalf of some other process.)
• If not, then find a free frame of memory to hold the missing page.
• Is the missing page in an object on backing storage? Figure out where: index the inode block map. Fetch page into the frame.
• Or: is it the first reference to a page in a zero-fill object (e.g., stack or heap)? Then fill the frame with zeros.
• So far so good? Install a translation in the page table entry (pte) mapping the faulted virtual page to its frame.
• Adjust PC to restart faulted instruction, and return to user mode.
slabinfo - version: 1.1
kmem_cache 59 78 100 2 2 1 ip_fib_hash ip_conntrack
10 113 32 1 1 1
0 0 384 0 0 1 urb_priv clip_arp_cache ip_mrt_cache
0 0 64 0 0 1
0 0 128 0 0 1
0 0 96 0 0 1 tcp_tw_bucket tcp_bind_bucket tcp_open_request inet_peer_cache ip_dst_cache arp_cache
0 30 128 0 1 1
5 113 32 1 1 1
0 0 96 0 0 1
0 0 64 0 0 1
23 100 192 5 5 1
2 30 128 1 1 1 blkdev_requests 256 520 96 7 13 1 dnotify cache 0 0 20 0 0 1 file lock cache 2 42 92 1 1 1 fasync cache uid_cache
1 202 16 1 1 1
4 113 32 1 1 1 skbuff_head_cache 93 96 160 4 4 1 sock 115 126 1280 40 42 1 sigqueue cdev_cache bdev_cache
0 29 132 0 1 1
156 177 64 3 3 1
69 118 64 2 2 1 mnt_cache inode_cache dentry_cache dquot
13 40 96 1 1 1
5561 5580 416 619 620 1
7599 7620 128 254 254 1
0 0 128 0 0 1 filp 1249 1280 96 32 32 1 names_cache 0 8 4096 0 8 1 buffer_head mm_struct
15303 16920 96 422 423 1
47 72 160 2 3 1 vm_area_struct 1954 2183 64 34 37 1 fs_cache files_cache
46 59 64 1 1 1
46 54 416 6 6 1
The columns are cache name, active objects, total number of objects, object size, number of full or partial pages, total allocated pages, and pages per slab.
Lookup:
HASH( blockID ) free/eviction list
This is what a software-based cache looks like.
Each frame/buffer of memory is described by a meta-object ( header ).
Resident pages/blocks are accessible for lookup in a global hash table.
An ordered list of eviction candidates winds through the hash chains.
Hash table bucket array bucket lists
Policy choices : which pages or blocks to keep in memory? Which to evict from memory to make room for others? How to handle writes?