The Machine and the Kernel
Mode, space, and context: the basics
Jeff Chase
Duke University
64 bytes: 3 ways
[Figure: the same block of memory, starting at address 0x0, viewed three ways: as an int array (int p[] / int* p), as a char array (char p[] / char *p), and as an array of pointers (char* p[] / char** p).]
Memory is “fungible”.
Pointers (addresses) are 8 bytes on a 64-bit machine.
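A small sketch (not the slide's code) of viewing the same bytes the three ways the figure shows; the sizes printed at the end are the point:

/* Sketch: one 64-byte region, three interpretations. */
#include <stdio.h>

int main(void) {
    char block[64];

    int   *ip = (int *)  block;   /* ints: 4 bytes each */
    char  *cp = block;            /* chars: 1 byte each */
    char **pp = (char **) block;  /* pointers: 8 bytes each on a 64-bit machine */

    printf("%zu %zu %zu\n", sizeof *ip, sizeof *cp, sizeof *pp);  /* prints: 4 1 8 */
    return 0;
}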
Endianness
Lilliput and Blefuscu are at war over which end of a soft-boiled egg to crack.
— Gulliver’s Travels, 1726
A silly difference among machine architectures creates a need for
byte swapping when unlike machines exchange data over a network.
x86 is little-endian.
Little-endian: the lowest-numbered byte of a word (or longword or quadword) is the least significant.
[Figure: the string "hi!" stored in memory: h=0x68, i=0x69, !=0x21, then a 0 terminator; a char pointer and an int pointer reference the same bytes.]
chase$ cc -o heap heap.c
chase$ ./heap
hi!
0x216968
chase$
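The heap.c source is not shown on the slide; here is a minimal sketch of a program consistent with that output (an assumption, not the original file), illustrating the little-endian layout:

/* Sketch: print "hi!" as a string, then reinterpret its bytes as an int. */
#include <stdio.h>
#include <string.h>

int main(void) {
    char buf[8];
    strcpy(buf, "hi!");          /* bytes: 0x68 0x69 0x21 0x00 */
    printf("%s\n", buf);         /* prints: hi! */

    int x;
    memcpy(&x, buf, sizeof x);   /* reinterpret the first 4 bytes as an int */
    printf("0x%x\n", x);         /* on little-endian x86: 0x216968 */
    return 0;
}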
Network messages
https://developers.google.com/protocol-buffers/docs/overview
Byte swapping: example
struct sockaddr_in socket_addr;

sock = socket(PF_INET, SOCK_STREAM, 0);
memset(&socket_addr, 0, sizeof socket_addr);
socket_addr.sin_family = PF_INET;
socket_addr.sin_port = htons(port);              /* host to network byte order (short) */
socket_addr.sin_addr.s_addr = htonl(INADDR_ANY); /* host to network byte order (long) */
if (bind(sock, (struct sockaddr *) &socket_addr, sizeof socket_addr) < 0) {
    perror("couldn't bind");
    exit(1);
}
listen(sock, 10);
buggyserver.c
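As a small aside (not part of buggyserver.c), a sketch of what htons() actually does to the bytes on a little-endian machine:

/* Sketch: the same 16-bit port value in host order vs. network order. */
#include <stdio.h>
#include <stdint.h>
#include <arpa/inet.h>

int main(void) {
    uint16_t port = 0x1F90;          /* port 8080 */
    uint16_t net  = htons(port);     /* network (big-endian) byte order */

    unsigned char *h = (unsigned char *) &port;
    unsigned char *n = (unsigned char *) &net;
    printf("host order bytes:    %02x %02x\n", h[0], h[1]);  /* 90 1f on x86 */
    printf("network order bytes: %02x %02x\n", n[0], n[1]);  /* 1f 90 */
    return 0;
}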
Heap: dynamic memory
A contiguous chunk of memory obtained from the OS kernel, e.g., with the Unix sbrk() system call.
A runtime library obtains the block and manages it as a “heap” for use by the programming language environment, to store dynamic objects, e.g., with the Unix malloc and free library calls.
Allocated heap blocks for structs or objects. Align!
Heap manager policy
• The heap manager must find a suitable free block to
return for each call to malloc().
– No byte can be part of two simultaneously allocated heap blocks!
If any byte of memory is doubly allocated, programs will fail.
We test for this!
• A heap manager has a policy algorithm to identify a suitable free block within the heap (see the first-fit sketch after this list).
– Last fit, first fit, best fit, worst fit
– Choose your favorite!
– Goals: be quick, and use memory efficiently
– Behavior depends on workload: pattern of malloc/free requests
• This is an old problem in computer science, and it occurs
in many settings: variable partitioning.
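Below is a minimal first-fit sketch (illustrative only, with invented names like block_t and my_malloc; not the heap manager you will build): it keeps a list of variable-sized blocks carved from one contiguous region and returns the first free block large enough, splitting off any remainder.

/* Sketch: a toy first-fit allocator over one contiguous region. */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

typedef struct block {
    size_t size;            /* payload bytes available in this block */
    int free;               /* 1 if available, 0 if allocated */
    struct block *next;     /* next block in address order */
} block_t;

#define HEAP_SIZE (64 * 1024)
static uint8_t heap[HEAP_SIZE];     /* stands in for memory from sbrk() */
static block_t *head = NULL;

static void heap_init(void) {
    head = (block_t *) heap;
    head->size = HEAP_SIZE - sizeof(block_t);
    head->free = 1;
    head->next = NULL;
}

/* First fit: walk the list and return the first free block big enough,
   splitting off the remainder as a new free block when it is worth it. */
void *my_malloc(size_t n) {
    if (head == NULL) heap_init();
    n = (n + 7) & ~(size_t)7;                    /* align payload to 8 bytes */
    for (block_t *b = head; b != NULL; b = b->next) {
        if (b->free && b->size >= n) {
            if (b->size >= n + sizeof(block_t) + 8) {        /* split */
                block_t *rest = (block_t *) ((uint8_t *)(b + 1) + n);
                rest->size = b->size - n - sizeof(block_t);
                rest->free = 1;
                rest->next = b->next;
                b->size = n;
                b->next = rest;
            }
            b->free = 0;
            return (void *)(b + 1);              /* payload follows the header */
        }
    }
    return NULL;                                 /* no suitable free block */
}

void my_free(void *p) {
    if (p == NULL) return;
    block_t *b = ((block_t *) p) - 1;
    b->free = 1;                                 /* (coalescing omitted) */
}

int main(void) {                                 /* tiny demo */
    char *a = my_malloc(100);
    char *b = my_malloc(200);
    my_free(a);
    char *c = my_malloc(50);                     /* first fit: reuses a's freed block */
    printf("%p %p %p\n", (void *)a, (void *)b, (void *)c);
    return 0;
}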
Variable Partitioning
Variable partitioning is the strategy of parking differently sized cars
along a street with no marked parking space dividers.
[Figure: three cars parked along the street; the wasted space between them is external fragmentation.]
Fixed Partitioning
[Figure: fixed-size parking spaces; the wasted space inside each partly filled space is internal fragmentation.]
Time sharing vs. space sharing
[Figure: resources divided along a space axis vs. a time axis.]
Two common modes of resource allocation. What kinds of resources do these work for?
Operating Systems: The Classical View
Programs run as independent processes. Each process has a private virtual address space and one or more threads.
Threads enter the kernel for OS services through protected system calls; the kernel reaches back into processes with upcalls (e.g., signals).
The protected OS kernel mediates access to shared resources. The kernel code and data are protected from untrusted processes.
[Figure: two user virtual address space layouts side by side, the “Classic Linux Address Space” and Windows IA-32. From address 0x0 upward: Text (code), Static data, Dynamic data (heap/BSS), Stack, and a Reserved (kernel) region at the top near 0x7fffffff.]
http://duartes.org/gustavo/blog/category/linux
Processes: A Closer Look
[Figure: a process = a virtual address space + a thread with its stack + a process descriptor (PCB).]
The address space is a private name space for a set of memory segments used by the process. The kernel must initialize the process memory for the program to run.
Each process has a thread bound to the VAS. The thread has a stack addressable through the VAS. The kernel can suspend/restart the thread wherever and whenever it wants.
The OS maintains some state for each process (user ID, process ID, parent PID, sibling links, children, resources) in the kernel’s internal data structures: a file descriptor table, links to maintain the process tree, and a place to store the exit status.
A process can have multiple threads

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

volatile int counter = 0;
int loops;

void *worker(void *arg) {
    int i;
    for (i = 0; i < loops; i++) {
        counter++;
    }
    pthread_exit(NULL);
}

int main(int argc, char *argv[]) {
    pthread_t p1, p2;
    if (argc != 2) {
        fprintf(stderr, "usage: threads <loops>\n");
        exit(1);
    }
    loops = atoi(argv[1]);
    printf("Initial value : %d\n", counter);
    pthread_create(&p1, NULL, worker, NULL);
    pthread_create(&p2, NULL, worker, NULL);
    pthread_join(p1, NULL);
    pthread_join(p2, NULL);
    printf("Final value : %d\n", counter);
    return 0;
}
Much more on this later!
Key Concepts for Classical OS
• kernel
  The software component that controls the hardware directly, and implements the core privileged OS functions. Modern hardware has features that allow the OS kernel to protect itself from untrusted user code.
• thread
  An executing instruction path and its CPU register state.
• virtual address space
  An execution context for thread(s) defining a name space for executing instructions to address data and code.
• process
  An execution of a program, consisting of a virtual address space, one or more threads, and some OS kernel state.
The theater analogy
Running a program is like performing a play: the program is the script, the virtual memory (address space) is the stage, and the threads perform it.
[lpcox]
The sheep analogy
[Figure: threads in an address space (code and data) drive CPU cores (Core #1, Core #2), the “electric sheep”.]
The machine has a bank of CPU cores for threads to run on. The OS allocates cores to threads.
Cores are hardware. They go where the driver tells them. Switch drivers any time.
Threads drive cores
What was the point of that whole thing with the electric sheep actors?
• A process is a running program.
• A running program (a process) has at least one thread (“main”), but
it may (optionally) create other threads.
• The threads execute the program (“perform the script”).
• The threads execute on the “stage” of the process virtual memory,
with access to a private instance of the program’s code and data.
• A thread can access any virtual memory in its process, but is
contained by the “fence” of the process virtual address space.
• Threads run on cores: a thread’s core executes instructions for it.
• Sometimes threads idle to wait for a free core, or for some event.
Sometimes cores idle to wait for a ready thread to run.
• The operating system kernel shares/multiplexes the computer’s
memory and cores among the virtual memories and threads.
Processes and threads
Each process has a virtual address space (VAS): a private name space for the virtual memory it uses. The VAS is both a “sandbox” and a “lockbox”: it limits what the process can see/do, and protects its data from others.
Each process has a main thread bound to the VAS, with stacks (user and kernel), and optionally other threads. If we say a process does something, we really mean its thread does it. The kernel can suspend/restart the thread wherever and whenever it wants.
From now on, we suppose that a process could have multiple threads. We presume that they can all make system calls and block independently.
A thread running in a process VAS
[Figure: a CPU core with its register context (R0…Rn, PC, SP) runs a thread in an address space (virtual or physical “memory”, e.g., a virtual memory for a process) holding your program, your data, a code library, the common runtime, the heap, and the thread’s stack at high addresses.]
Thread context
• Each thread has a context (exactly one).
  – Context == values in the thread’s registers
  – Including a (protected) identifier naming its VAS.
  – And a pointer to the thread’s stack in VAS/memory.
• Each core has a context (at least one).
  – Context == a register set (R0…Rn, PC, SP) that can hold values.
  – The register set is baked into the hardware.
• A core can change “drivers”: context switch.
  – Save running thread’s register values into memory.
  – Load new thread’s register values from memory.
  – (Think of driver settings for the seat, mirrors, audio…)
  – Enables time slicing or time sharing of machine.
Programs gone wild
int
main()
{
while(1);
}
Can you hear the fans blow?
How does the OS regain control of
the core from this program?
How to “make” the process save its
context and give some other process
a chance to run?
How to “make” processes share
machine resources fairly?
Timer interrupts, faults, etc.
• When processor core is running a user program, the
user program/thread controls (“drives”) the core.
• The hardware has a timer device that interrupts the
core after a given interval of time.
• Interrupt transfers control back to the OS kernel, which
may switch the core to another thread, or resume.
• Other events also return control to the kernel.
– Wild pointers
– Divide by zero
– Other program actions
– Page faults
Entry to the kernel
Every entry to the kernel is the result of a trap, fault, or interrupt. The core switches to kernel mode and transfers control to a handler routine.
[Figure: syscall trap/return and fault/return arrows from user code, and interrupt/return arrows for I/O completions and timer ticks, all entering the OS kernel code and data for system calls (files, process fork/exit/wait, pipes, binder IPC, low-level thread support, etc.) and virtual memory management (page faults, etc.).]
The handler accesses the core register context to read the details of the exception (trap, fault, or interrupt). It may call other kernel routines.
CPU mode: User and Kernel
CPU mode (a field in some status register) indicates whether a machine CPU (core) is running in a user program or in the protected kernel (protected mode).
Some instructions or register accesses are legal only when the CPU (core) is executing in kernel mode.
CPU mode transitions to kernel mode only on machine exception events (trap, fault, interrupt), which transfer control to a trusted handler routine registered with the machine at kernel boot time.
So only the kernel program chooses what code ever runs in kernel mode (or so we hope and intend).
A kernel handler can read the user register values at the time of the event, and modify them arbitrarily before (optionally) returning to user mode.
[Figure: a CPU core’s register context (R0…Rn, PC) with a U/K mode bit.]
Exceptions: trap, fault, interrupt

                        synchronous:                          asynchronous:
                        caused by an instruction              caused by some other event

intentional             trap: system call                     “software interrupt”
(happens every time)    open, close, read, write, fork,       software requests an interrupt
                        exec, exit, wait, kill, etc.          to be delivered at a later time

unintentional           fault                                 interrupt
(contributing factors)  invalid or protected address or       caused by an external event:
                        opcode, page fault, overflow, etc.    I/O op completed, clock tick,
                                                              power fail, etc.
Kernel Stacks and Trap/Fault Handling
Threads execute user code on a user stack in the user virtual memory in the process virtual address space.
Each thread has a second kernel stack in kernel space (VM accessible only in kernel mode).
System calls and faults run in kernel mode on a kernel stack. Kernel code running in P’s process context has access to P’s virtual memory.
The syscall handler makes an indirect call through the system call dispatch table to the handler registered for the specific system call.
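A toy sketch of that dispatch step (invented names such as do_syscall and dispatch_table; real kernels validate arguments and copy data across the user/kernel boundary):

/* Sketch only: a toy syscall dispatch table, not actual kernel code. */
#include <stdio.h>

#define NSYSCALLS 3

typedef long (*syscall_handler_t)(long a0, long a1, long a2);

static long sys_read(long fd, long buf, long len)   { /* ... */ return 0; }
static long sys_write(long fd, long buf, long len)  { /* ... */ return len; }
static long sys_exit(long status, long a1, long a2) { /* ... */ return 0; }

/* The table is filled in at boot; the index is the syscall number. */
static syscall_handler_t dispatch_table[NSYSCALLS] = {
    sys_read,    /* 0 */
    sys_write,   /* 1 */
    sys_exit,    /* 2 */
};

/* Called by the trap handler with the number and arguments the user
   program left in registers. */
long do_syscall(long num, long a0, long a1, long a2) {
    if (num < 0 || num >= NSYSCALLS)
        return -1;                     /* bad syscall number: fail, don't crash */
    return dispatch_table[num](a0, a1, a2);
}

int main(void) {                       /* tiny usage demo */
    printf("%ld\n", do_syscall(1, 1, 0, 5));   /* "write" 5 bytes -> 5 */
    return 0;
}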
Virtual resource sharing
Understand that the OS kernel implements resource allocation (memory, CPU, …) by manipulating name spaces and contexts visible to user code.
The kernel retains control of user contexts and address spaces via the machine’s limited direct execution model, based on protected mode and exceptions.
[Figure: resources shared in space and in time.]
“Limited direct execution”
Any kind of machine exception transfers control to a registered (trusted) kernel handler running in a protected CPU mode.
[Figure: a timeline alternating between user mode and kernel mode. The kernel boots, then starts a user context (u-start); syscall traps, faults, and clock interrupts drop into the kernel “top half” and “bottom half” (interrupt handlers), which return to user mode (u-return, interrupt return).]
Kernel handler manipulates CPU register context to return to selected user context.
Example: Syscall traps
• Programs in C, C++, etc. invoke system calls by linking to
a standard library written in assembly.
– The library defines a stub or wrapper routine for each syscall.
– Stub executes a special trap instruction (e.g., chmk or callsys or
syscall instruction) to change mode to kernel.
– Syscall arguments/results are passed in registers (or user stack).
– OS defines Application Binary Interface (ABI).
read() in Unix libc.a Alpha library (executes in user mode):
#define SYSCALL_READ 27      # op ID for a read system call

move arg0…argn, a0…an        # syscall args in registers A0..AN
move SYSCALL_READ, v0        # syscall dispatch index in V0
callsys                      # kernel trap
move r1, _errno              # errno = return status
return

Alpha CPU ISA (defunct)
Linux x64 syscall conventions
MacOS x86-64 syscall example

section .data
hello_world db "Hello World!", 0x0a

section .text
global start
start:
    mov rax, 0x2000004    ; System call write = 4
    mov rdi, 1            ; Write to standard out = 1
    mov rsi, hello_world  ; The address of hello_world string
    mov rdx, 14           ; The size to write
    syscall               ; Invoke the kernel
    mov rax, 0x2000001    ; System call number for exit = 1
    mov rdi, 0            ; Exit success = 0
    syscall               ; Invoke the kernel

Illustration only: this program writes “Hello World!” to standard output.
http://thexploit.com/secdev/mac-os-x-64-bit-assembly-system-calls/
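For comparison, a sketch of making the same two system calls from C on Linux through the generic libc syscall() stub (SYS_write and SYS_exit are the standard Linux x86-64 numbers 1 and 60; the message is just an example):

/* Sketch: invoking Linux system calls through the generic syscall() wrapper. */
#include <unistd.h>
#include <sys/syscall.h>

int main(void) {
    const char msg[] = "Hello World!\n";
    /* SYS_write: write(fd=1, buf, len) */
    syscall(SYS_write, 1, msg, sizeof msg - 1);
    /* SYS_exit: terminate with status 0 */
    syscall(SYS_exit, 0);
    return 0;   /* not reached */
}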
A thread running in a process VAS
[Figure repeated from earlier: a CPU core with registers (R0…Rn, PC, SP) running a thread in an address space holding the program, data, code library, common runtime, heap, and stack.]
Messing with the context

#include <ucontext.h>
int count = 0;
ucontext_t context;

int main()
{
    int i = 0;
    getcontext(&context);
    count += 1;
    i += 1;
    sleep(2);
    printf("…", count, i);
    setcontext(&context);
}
ucontext
Standard C library routines to:
– Save current register context to a block of memory (getcontext from core)
– Load/restore current register context from a block of memory (setcontext)
– Also: makecontext, swapcontext
Details of the saved context (ucontext_t structure) are machine-dependent.
Messing with the context (2)

#include <ucontext.h>
int count = 0;
ucontext_t context;

int main()
{
    int i = 0;
    getcontext(&context);    /* Save core context to memory */
    count += 1;
    i += 1;
    sleep(1);
    printf("…", count, i);
    setcontext(&context);    /* Load core context from memory */
}

Loading the saved context transfers control to this block of code (the statements after getcontext). (Why?) What about the stack?
Messing with the context (3)

#include <ucontext.h>
int count = 0;
ucontext_t context;

int main()
{
    int i = 0;
    getcontext(&context);
    count += 1;
    i += 1;
    sleep(1);
    printf("…", count, i);
    setcontext(&context);
}

chase$ cc -o context0 context0.c
<warnings: ucontext deprecated on MacOS>
chase$ ./context0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
…
Reading behind the C

count += 1;
i += 1;

Disassembled code (on MacOS):
chase$ man otool
chase$ otool -vt context0
…

On this machine, with this cc, count += 1 becomes:

movl 0x0000017a(%rip),%ecx
addl $0x00000001,%ecx
movl %ecx,0x0000016e(%rip)

Static global _count is addressed relative to the location of the code itself, as given by the PC register. [%rip is instruction pointer register]

And i += 1 becomes:

movl 0xfc(%rbp),%ecx
addl $0x00000001,%ecx
movl %ecx,0xfc(%rbp)

Local variable i is addressed as an offset from the stack frame. [%rbp is stack frame base pointer]

If %rip and %rbp are set “right”, then these references “work”.
Messing with the context (4)

#include <ucontext.h>
int count = 0;
ucontext_t context;

int main()
{
    int i = 0;
    getcontext(&context);
    count += 1;
    i += 1;
    sleep(1);
    printf("…", count, i);
    setcontext(&context);
}

chase$ cc -O2 -o context0 context0.c
<warnings: ucontext deprecated on MacOS>
chase$ ./context0
1 1
2 1
3 1
4 1
5 1
6 1
7 1
…

What happened?
The point of ucontext
• The system can use ucontext routines to:
– “Freeze” at a point in time of the execution
– Restart execution from a frozen moment in time
– Execution continues where it left off…if the memory state is right.
• The system can implement multiple independent threads
of execution within the same address space.
– Create a context for a new thread with makecontext.
– Modify saved contexts at will.
– Context switch with swapcontext: transfer a core from one thread to another (“change drivers”); a sketch follows this list.
– Much more to this picture: need per-thread stacks, kernel
support, suspend/sleep, controlled ordering, etc.
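To make the swapcontext bullet concrete, here is a minimal sketch (Linux; the names task_a/task_b and the stack sizes are assumptions of this example, and real thread libraries add kernel support, scheduling, and synchronization): two contexts, each with its own stack, pass a single core back and forth.

/* Sketch: two coroutine-style "threads" sharing one core via swapcontext. */
#include <stdio.h>
#include <ucontext.h>

static ucontext_t main_ctx, a_ctx, b_ctx;
static char a_stack[64 * 1024], b_stack[64 * 1024];

static void task_a(void) {
    for (int i = 0; i < 3; i++) {
        printf("A %d\n", i);
        swapcontext(&a_ctx, &b_ctx);    /* save A's registers, load B's */
    }
    /* when task_a returns, control goes to uc_link (main_ctx) */
}

static void task_b(void) {
    for (int i = 0; i < 3; i++) {
        printf("B %d\n", i);
        swapcontext(&b_ctx, &a_ctx);    /* save B's registers, load A's */
    }
}

int main(void) {
    getcontext(&a_ctx);                 /* initialize, then customize */
    a_ctx.uc_stack.ss_sp = a_stack;     /* each context needs its own stack */
    a_ctx.uc_stack.ss_size = sizeof a_stack;
    a_ctx.uc_link = &main_ctx;          /* where to go when task_a returns */
    makecontext(&a_ctx, task_a, 0);

    getcontext(&b_ctx);
    b_ctx.uc_stack.ss_sp = b_stack;
    b_ctx.uc_stack.ss_size = sizeof b_stack;
    b_ctx.uc_link = &main_ctx;
    makecontext(&b_ctx, task_b, 0);

    swapcontext(&main_ctx, &a_ctx);     /* start running task_a */
    printf("back in main\n");
    return 0;
}

Each swapcontext call saves the caller’s registers into one ucontext_t and loads the other, which is exactly the “change drivers” step; when task_a returns, uc_link sends control back to main.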
Two threads: closer look
[Figure: a running thread drives the CPU core (registers R0…Rn, PC, SP), while a second thread is “on deck” and ready to run. Both threads live in the same address space with the program, code library, common runtime, and data, but each thread has its own stack at high addresses.]
Thread context switch
[Figure: to switch out the running thread and switch in another: 1. save the outgoing thread’s registers (R0…Rn, PC, SP) to memory; 2. load the incoming thread’s registers from memory. Both threads share the address space (program, code library, common runtime, data), each with its own stack.]
A metaphor: context/switching
1. Page links and the back button navigate a “stack” of pages in each tab.
2. Each tab has its own stack. One tab is active at any given time. You create/destroy tabs as needed. You switch between tabs at your whim.
3. Similarly, each thread has a separate stack. The OS switches between threads at its whim. One thread is active per CPU core at any given time.
Messing with the context (5)

#include <ucontext.h>
int count = 0;
ucontext_t context;

int main()
{
    int i = 0;
    getcontext(&context);
    count += 1;
    i += 1;
    sleep(1);
    printf("…", count, i);
    setcontext(&context);
}

What does this do?
Thread/process states and transitions
[State diagram: ready —dispatch→ running; running —sleep→ blocked (on wait, STOP, read, write, listen, receive, etc.); blocked —wakeup→ ready; running —yield→ ready. In the car metaphor: running is “driving a car”, ready is “requesting a car”, blocked is “waiting for someplace to go”.]
The scheduler governs these transitions.
Sleep and wakeup are internal primitives. Wakeup adds a thread to the scheduler’s ready pool: a set of threads in the ready state.
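The same states and transitions, written down as a tiny sketch in C (names invented for illustration; a real scheduler maintains ready queues and wait channels):

/* Sketch: thread states and the transitions named on the state diagram. */
#include <stdio.h>

typedef enum { READY, RUNNING, BLOCKED } thread_state_t;

typedef struct {
    int id;
    thread_state_t state;
} thread_t;

void dispatch(thread_t *t) { t->state = RUNNING; }  /* ready   -> running */
void yield(thread_t *t)    { t->state = READY;   }  /* running -> ready   */
void sleep_on(thread_t *t) { t->state = BLOCKED; }  /* running -> blocked (waits for an event) */
void wakeup(thread_t *t)   { t->state = READY;   }  /* blocked -> ready (back into the ready pool) */

int main(void) {
    thread_t t = { 1, READY };
    dispatch(&t);   /* scheduler gives it a core */
    sleep_on(&t);   /* it blocks, e.g., in read() */
    wakeup(&t);     /* the I/O completes */
    dispatch(&t);   /* runs again */
    yield(&t);      /* gives up the core voluntarily */
    printf("final state: %d\n", t.state);
    return 0;
}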
BLOCK MAPS AND PAGE TABLES
Blocks are contiguous
The storage in a heap block is contiguous in the Virtual Address Space. The term block always refers to a contiguous sequence of bytes suitable for base+offset addressing.
C and other PL environments require this. E.g., the C compiler determines the offsets for named fields in a struct and “bakes” them into the code.
This requirement complicates the heap manager because the heap blocks may be different sizes.
Block maps
Large data objects may be mapped so they don’t have to be stored contiguously in machine memory (e.g., files, segments).
Idea: use a level of indirection through a map to assemble a storage object from “scraps” of storage in different locations.
The “scraps” can be fixed-size slots: that makes allocation easy because the slots are interchangeable (fixed partitioning).
Example: page tables that implement a VAS.
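A minimal sketch of that indirection (invented names, one-level map; page tables and inode maps add levels, permissions, and other metadata):

/* Sketch: a single-level block map translating logical block numbers
   in an object to fixed-size slots scattered through a storage pool. */
#include <stdio.h>
#include <string.h>

#define SLOT_SIZE 64        /* fixed-size "scraps" of storage */
#define NSLOTS    16        /* slots in the pool */
#define NBLOCKS   8         /* logical blocks in one object */

static char pool[NSLOTS][SLOT_SIZE];   /* the storage pool */
static int  map[NBLOCKS];              /* logical block -> slot index (-1 if unmapped) */

/* Return a pointer to the storage for logical byte offset 'off' in the object. */
char *translate(size_t off) {
    size_t block  = off / SLOT_SIZE;   /* which logical block */
    size_t offset = off % SLOT_SIZE;   /* offset within the block */
    int slot = map[block];
    if (slot < 0) return NULL;         /* hole: no storage mapped here */
    return &pool[slot][offset];
}

int main(void) {
    memset(map, -1, sizeof map);       /* start with nothing mapped */
    map[0] = 5;                        /* logical block 0 lives in slot 5 */
    map[1] = 2;                        /* logical block 1 lives in slot 2 */

    strcpy(translate(0), "stored via the map");
    printf("%s\n", translate(0));
    printf("block 3 mapped? %s\n", translate(3 * SLOT_SIZE) ? "yes" : "no");
    return 0;
}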
x64, x86-64, AMD64: VM Layout
[Figure: the x86-64 virtual memory layout, implemented by indirection through a VM page map. Source: System V Application Binary Interface, AMD64 Architecture Processor Supplement, 2005.]
Fixed Partitioning
[Figure repeated from earlier: wasted space inside fixed-size partitions is internal fragmentation.]
Names and maps
• Block maps and other indexed maps are common structure
to implement “machine” name spaces:
– sequences of logical blocks, e.g., virtual address spaces, files
– process IDs, etc.
– For sparse block spaces we may use a tree hierarchy of block
maps (e.g., inode maps or 2-level page tables, later).
– Storage system software is full of these maps.
• Symbolic name spaces use different kinds of maps.
– They are sparse and require matching → more expensive.
– Property list, key/value hash table
– Trees of maps create nested namespaces, e.g., the file tree.
I hope we get to here
EXTRA SLIDES
The Kernel
• Today, all “real” operating systems have protected kernels.
The kernel resides in a well-known file: the “machine” automatically
loads it into memory (boots) on power-on/reset.
Our “kernel” is called the executive in some systems (e.g., Windows).
• The kernel is (mostly) a library of service procedures shared by
all user programs, but the kernel is protected:
User code cannot access internal kernel data structures directly.
User code can invoke the kernel only at well-defined entry points
(system calls).
• Kernel code is “just like” user code, but the kernel is privileged:
The kernel has direct access to all hardware functions, and defines the
handler entry points for interrupts and exceptions.
Protecting Entry to the Kernel
Protected events and kernel mode are the architectural
foundations of kernel-based OS (Unix, Windows, etc).
– The machine defines a small set of exceptional event types.
– The machine defines what conditions raise each event.
– The kernel installs handlers for each event at boot time.
e.g., a table in kernel memory read by the machine
The machine transitions to kernel mode only on
an exceptional event.
The kernel defines the event handlers.
Therefore the kernel chooses what code will
execute in kernel mode, and when.
[Figure: control passes between user and kernel modes only via trap/return, interrupt, or fault.]
The Role of Events
• A CPU event (an interrupt or exception, i.e., a trap or fault) is an
“unnatural” change in control flow.
• Like a procedure call, an event changes the PC register.
• Also changes mode or context (current stack), or both.
• Events do not change the current space!
• On boot, the kernel defines a handler routine for each event type.
• The machine defines the event types.
• Event handlers execute in kernel mode.
[Figure: an event redirects control flow into an event handler (e.g., ISR: Interrupt Service Routine), as in exception.cc.]
• Every kernel entry results from an event.
• Enter at the handler for the event.
In some sense, the whole kernel is a “big event handler.”
Examples
• Illegal operation
– Reserved opcode, divide-by-zero, illegal access
– That’s a fault! Kernel generates a signal, e.g., to kill process
or invoke PL exception handlers.
• Page fault
– Fetch and install page, maybe block process
– Nothing illegal about it: “transparent” to faulting process
• I/O completion, arriving input, clock ticks.
– These external events are interrupts.
– Include power fail etc.
– Kernel services interrupt in handler.
– May wakeup blocked processes, but no blocking.
Faults
• Faults are similar to system calls in some respects:
– Faults occur as a result of a process executing an instruction.
• Fault handlers execute on the process kernel stack; the fault
handler may block (sleep) in the kernel.
– The completed fault handler may return to the faulted context.
• But faults are different from syscall traps in other respects:
– Syscalls are deliberate, but faults are “accidents”.
• divide-by-zero, dereference invalid pointer, memory page fault
– Not every execution of the faulting instruction results in a fault.
• may depend on memory state or register contents
Note: Something Wild
• The “Something Wild” example that follows was an
earlier version of “Messing with the context”.
• It was not discussed in class.
• “Messing with the context” simplifies the example, but
keeps all the essential info.
• “Something Wild” brings it just a little closer to coroutines and a context switch from one thread to another.
Something wild (1)

#include <ucontext.h>
int count = 0;
int set = 0;
ucontext_t contexts[2];

void proc() {
    int i = 0;
    if (!set) {
        getcontext(&contexts[count]);
    }
    printf(…, count, i);
    count += 1;
    i += 1;
    if (set) {
        setcontext(&contexts[count&0x1]);
    }
}

int main() {
    set = 0;
    proc();
    proc();
    set = 1;
    proc();
}
Something wild (2)

#include <ucontext.h>
ucontext_t contexts[2];

void proc()
{
    int i = 0;
    getcontext(&contexts[count]);
    printf("…", count, i);
    count += 1;
    i += 1;
}

int main() {
    set = 0;
    proc();
    proc();
    …
}
Something wild (3)

#include <ucontext.h>
ucontext_t contexts[2];

void proc() {
    int i = 0;
    printf("…", count, i);
    count += 1;
    i += 1;
    sleep(1);
    setcontext(&contexts[count&0x1]);
}

int main() {
    …
    set = 1;
    proc();
}
Something wild (4)

void proc() {
    int i = 0;
    printf("…", count, i);
    count += 1;
    i += 1;
    sleep(1);
    setcontext(…);
}

We have a pair of register contexts that were saved at this point in the code.
If we load either of the saved contexts, it will transfer control to this block of code. (Why?) What about the stack?
setcontext switches to the other saved register context: alternate “even” and “odd” contexts. Lather, rinse, repeat.
What will it print? The count is a global variable…but what about i?
Something wild (5)

void proc() {
    int i = 0;
    printf("%4d %4d\n", count, i);
    count += 1;
    i += 1;
    sleep(1);
    setcontext(…);
}

What does this do?