The Kernel and the Unix Process API Jeff Chase Duke University

advertisement
The Kernel and the Unix Process API
Jeff Chase
Duke University
The story so far: process and kernel
• A (classical) OS lets us run programs as processes. A
process is a running program instance (with a thread).
– Program code runs with the core in untrusted user mode.
• Processes are protected/isolated.
– Virtual address space is a “fenced pasture”
– Sandbox: can’t get out. Lockbox: nobody else can get in.
• The OS kernel controls everything.
– Kernel code runs with the core in trusted kernel mode.
• OS offers system call APIs.
– Create processes
– Control processes
– Monitor process execution
Operating Systems: The Classical View
Programs
run as
independent
processes.
data
Each process
has a private
virtual address
space and one
or more
threads.
data
Protected
system calls
...and upcalls
(e.g., signals)
Protected OS
kernel
mediates
access to
shared
resources.
The kernel code and data are protected from untrusted code.
Threads
enter the
kernel for
OS
services.
“Limited direct execution”
Any kind of machine exception transfers control to a registered
(trusted) kernel handler running in a protected CPU mode.
syscall trap
u-start
fault
u-return
u-start
fault
u-return
kernel “top half”
kernel “bottom half” (interrupt handlers)
clock
interrupt
user
mode
kernel
mode
interrupt
return
boot
Kernel handler manipulates
CPU register context to return
to selected user context.
Exceptions: trap, fault, interrupt
synchronous
caused by an
instruction
asynchronous
caused by
some other
event
intentional
unintentional
happens every time
contributing factors
trap: system call
fault
open, close, read,
write, fork, exec, exit,
wait, kill, etc.
invalid or protected
address or opcode, page
fault, overflow, etc.
“software interrupt”
software requests an
interrupt to be delivered
at a later time
interrupt
caused by an external
event: I/O op completed,
clock tick, power fail, etc.
Example: Syscall traps
• Programs in C, C++, etc. invoke system calls by linking to
a standard library written in assembly.
– The library defines a stub or wrapper routine for each syscall.
– Stub executes a special trap instruction (e.g., chmk or callsys or
syscall instruction) to change mode to kernel.
– Syscall arguments/results are passed in registers (or user stack).
– OS defines Application Binary Interface (ABI).
read() in Unix libc.a Alpha library (executes in user mode):
#define SYSCALL_READ 27
move arg0…argn, a0…an
move SYSCALL_READ, v0
callsys
move r1, _errno
return
Alpha CPU ISA (defunct)
# op ID for a read system call
# syscall args in registers A0..AN
# syscall dispatch index in V0
# kernel trap
# errno = return status
MacOS x86-64 syscall example
section .data
hello_world db
section .text
global start
start:
mov rax, 0x2000004
mov rdi, 1
mov rsi, hello_world
mov rdx, 14
syscall
mov rax, 0x2000001
mov rdi, 0
syscall
"Hello World!", 0x0a
Illustration only: this program writes
“Hello World!” to standard output.
; System call write = 4
; Write to standard out = 1
; The address of hello_world string
; The size to write
; Invoke the kernel
; System call number for exit = 1
; Exit success = 0
; Invoke the kernel
http://thexploit.com/secdev/mac-os-x-64-bit-assembly-system-calls/
Thread states and transitions
We will presume that these transitions occur only in kernel mode. This is
true in classical Unix and in systems with pure kernel-based threads.
Before a thread can sleep, it
must first enter the kernel via
trap (syscall) or fault.
Before a thread can yield, it must enter
the kernel, or the core must take an
interrupt to return control to the kernel.
running
yield
preempt
sleep
blocked
STOP
wait
wakeup
dispatch
ready
In the running
state, kernel code
decides
if/when/how to
enter user mode,
and sets up a
suitable context.
Kernel Stacks and Trap/Fault Handling
Threads
execute user
code on a user
stack in the
user virtual
memory in the
process virtual
address space.
Each thread has a
second kernel
stack in kernel
space (VM
accessible only in
kernel mode).
data
stack
stack
stack
syscall
dispatch
table
stack
System calls
and faults run
in kernel
mode on a
kernel stack.
Kernel code
running in P’s
process context
has access to
P’s virtual
memory.
The syscall handler makes an indirect call through the system call
dispatch table to the handler registered for the specific system call.
Let’s get down to it
UNIX PROCESS API
CACM
1974
Thompson and Ritchie
Turing Award ‘83
Unix: A lasting achievement?
“Perhaps the most important achievement of Unix is to
demonstrate that a powerful operating system for
interactive use need not be expensive…it can run on
hardware costing as little as $40,000.”
DEC PDP-11/24
The UNIX Time-Sharing System*
D. M. Ritchie and K. Thompson
1974
http://histoire.info.online.fr/pdp11.html
Unix fork/exit syscalls
int pid = fork();
Create a new process that is a clone of
its parent, return new process ID (pid)
to parent, return 0 to child.
fork
parent
child
exit(status);
Exit with status, destroying the process.
Note: this is not the only way for a
process to exit!
time
data
exit
data
p
exit
pid: 5587
pid: 5588
fork
The fork syscall
returns twice:
int pid;
int status = 0;
if (pid = fork()) {
/* parent */
…..
} else {
/* child */
…..
exit(status);
}
It returns a zero in
the context of the
new child process.
It returns the new
child process ID
(pid) in the context
of the parent.
fork
• Child can’t be an exact copy
– Is distinguished by one variable (the return value of fork)
if (fork () == 0) {
/* child */
execute new program
} else {
/* parent */
carry on
}
exit syscall
fork (original concept)
fork in action today
void
dofork()
{
int rc = fork();
if (rc < 0) {
perror("fork failed: ");
exit(1);
} else if (rc == 0) {
child();
} else {
parent(rc);
}
}
Fork is conceptually difficult but
syntactically clean and simple.
I don’t have to say anything about what
the new child process “looks like”: it is an
exact clone of the parent!
The child has a new thread executing at
the same point in the same program.
The child is a new instance of the
running program: it has a “copy” of the
entire address space. The “only” change
is the process ID and return code rc!
The parent thread continues on its way.
The child thread continues on its way.
A simple program: forkdeep
int count = 0;
int level = 0;
void child() {
level++;
output pids
if (level < count)
dofork();
if (level == count)
sleep(3);
}
void parent(int childpid) {
output pids
wait for child to finish
How?
}
main(int argc, char *argv[])
{
count = atoi(argv[1]);
dofork();
output pid
}
chase$ ./forkdeep 4
30866-> 30867
30867
30867-> 30868
30868
30868-> 30869
30869
30869-> 30870
30870
30870
30869
30868
30867
30866
chase$
Wait
Note: in modern systems the
wait syscall has many variants
and options.
Unix process API
int pid;
int status = 0;
if (pid = fork()) {
/* parent */
…..
pid = wait(&status);
} else {
/* child */
…..
exit(status);
}
Parent uses wait to
sleep until the child
exits; wait returns
child pid and status.
Wait variants allow
wait on a specific
child, or notification of
stops and other
“signals”.
But how do I run a new program?
But how do I run a new program?
• The child, or any process really, can replace its
program in midstream.
• exec* system call: “forget everything in my address
space and reinitialize my entire address space with stuff
from a named program file.”
• The exec system call never returns: the new program
executes in the calling process until it dies (exits).
– The code from the parent program runs in the child process and
controls its future. The parent program selects the program that
the child process will run (via exec), and sets up its connections
to the outside world. The child program doesn’t even know!
exec (original concept)
A simple program: forkexec
…
main(int argc, char *argv[]) {
int status;
int rc = fork();
if (rc < 0) {
perror("fork failed: ");
exit(1);
} else if (rc == 0) {
argv++;
execve(argv[0], argv, 0);
} else {
waitpid(rc, &status, 0);
printf("child %d exited with status %d\n", rc,
WEXITSTATUS(status));
}
}
A simple program: prog0
int
main()
{
}
chase$ cc –o forkexec forkexec.c
chase$ cc –o prog0 prog0.c
chase$ ./forkexec prog0
child 19175 exited with status 0
chase$
Simple I/O: args and printf
#include <stdio.h>
int
main(int argc, char* argv[])
{
int i;
printf("arguments: %d\n", argc);
for (i=0; i<argc; i++) {
printf("%d: %s\n", i, argv[i]);
}
}
chase$ cc –o prog1 prog1.c
chase$ ./forkexec prog1
arguments: 1
0: prog1
child 19178 exited with status 0
chase$ ./forkexec prog1 one 2 3
arguments: 4
0: prog1
1: one
2: 2
3: 3
child 19181 exited with status 0
Environment variables (rough)
#include <stdio.h>
#include <stdlib.h>
int
main(int argc, char* argv[], char* envp[])
{
int i;
int count = atoi(argv[1]);
for (i=0; i < count; i++) {
printf("env %d: %s\n", i, envp[i]);
}
}
Environment variables (rough)
chase$ cc –o env0 env0.c
chase$ ./env0
Segmentation fault: 11
chase$ ./env0 12
env 0: TERM_PROGRAM=Apple_Terminal
env 1: TERM=xterm-256color
env 2: SHELL=/bin/bash
env 3: TMPDIR=/var/folders/td/ng76cpqn4zl1wrs57hldf1vm0000gn/T/
env 4: Apple_PubSub_Socket_Render=/tmp/launch-OtU5Bb/Render
env 5: TERM_PROGRAM_VERSION=309
env 6: OLDPWD=/Users/chase/c210-stuff
env 7: TERM_SESSION_ID=FFCE3A14-1D4B-4B08…
env 8: USER=chase
env 9: COMMAND_MODE=unix2003
env 10: SSH_AUTH_SOCK=/tmp/launch-W03wn2/Listeners
env 11: __CF_USER_TEXT_ENCODING=0x1F5:0:0
chase$
Environment variables (safer)
#include <stdio.h>
#include <stdlib.h>
int
main(int argc, char* argv[], char* envp[])
{
int i;
int count;
if (argc < 2) {
fprintf(stderr,
"Usage: %s <count>\n", argv[0]);
exit(1);
}
count = atoi(argv[1]);
for (i=0; i < count; i++) {
if (envp == 0) {
printf("env %d: nothing!\n", i);
exit(1);
}
else if (envp[i] == 0) {
printf("env %d: null!\n", i);
exit(1);
} else
printf("env %d: %s\n", i, envp[i]);
}
}
Where do environment variables
come from?
chase$ cc –o env env.c
chase$ ./env
chase$ ./forkexec env
Usage: env <count>
child 19195 exited with status 1
chase$ ./forkexec env 1
env 0: null!
child 19263 exited with status 1
chase$
forkexec revisited
char *lala = "lalala\n";
char *nothing = 0;
…
main(int argc, char *argv[]) {
int status;
int rc = fork();
if (rc < 0) {
…
} else if (rc == 0) {
argv++;
execve(argv[0], argv, &lala);
} else {
…
}
chase$ cc –o fel forkexec-lala.c
chase$ ./fel env 1
env 0: lalala
child 19276 exited with status 0
chase$
forkexec revisited again
…
main(int argc, char *argv[],
char *envp[]) {
int status;
int rc = fork();
if (rc < 0) {
…
} else if (rc == 0) {
argv++;
execve(argv[0], argv, envp);
} else {
…
}
chase$ cc –o fe forkexec1.c
chase$ ./fe env 3
env 0: TERM_PROGRAM=Apple_Terminal
env 1: TERM=xterm-256color
env 2: SHELL=/bin/bash
child 19290 exited with status 0
chase$
How about this?
chase$ ./fe fe fe fe fe fe fe fe fe fe fe fe env 3
<???>
How about this?
chase$ ./fe fe fe fe fe fe fe fe fe fe fe fe env 3
env 0: TERM_PROGRAM=Apple_Terminal
env 1: TERM=xterm-256color
env 2: SHELL=/bin/bash
It is easy for children to inherit
child 19303 exited with status 0
environment variables from their
child 19302 exited with status 0
parents.
child 19301 exited with status 0
child 19300 exited with status 0
child 19299 exited with status 0
child 19298 exited with status 0
child 19297 exited with status 0
child 19296 exited with status 0
child 19295 exited with status 0
child 19294 exited with status 0
child 19293 exited with status 0
child 19292 exited with status 0
chase$
Exec* enables the parent to control
the environment variables and
arguments passed to the children.
The child process passes the
environment variables “to itself” but
the parent program controls it.
Unix fork/exec/exit/wait syscalls
fork child
fork parent
parent
program
initializes
child
context
exec
int pid = fork();
Create a new process that is a clone of
its parent.
exec*(“program” [argvp, envp]);
Overlay the calling process with a new
program, and transfer control to it,
passing arguments and environment.
exit(status);
Exit with status, destroying the process.
wait
exit
int pid = wait*(&status);
Wait for exit (or other status change) of a
child, and “reap” its exit status.
Recommended: use waitpid().
But how is the first process made?
Init and Descendents
Kernel “handcrafts”
initial process to run
“init” program.
Other processes descend from
init, including login program.
Login runs user shell in a
child process after user
authenticates.
User shell runs user
commands as child
processes.
The Shell
• Users may select from a range of command interpreter
(“shell”) programs available. (Or even write their own!)
– csh, sh, ksh, tcsh, bash: choose your flavor…
• Shells execute commands composed of program
filenames, args, and I/O redirection symbols.
– Fork, exec, wait, etc., etc.
– Can coordinate multiple child processes that run together as a
process group or job.
– Shells can run files of commands (scripts) for more complex
tasks, e.g., by redirecting I/O channels (descriptors).
– Shell behavior is guided by environment variables, e.g., $PATH
– Parent may control/monitor all aspects of child execution.
Unix process: parents rule
Created with fork
by parent program
running in parent
process.
Parent program
running in child
process, or exec’d
program chosen by
parent program.
Inherited from
parent process, or
modified by parent
program in child
(see next lecture)
Virtual address space
(Virtual Memory, VM)
Process
text
data
heap
Thread
Program
kernel state
Clone of parent VM.
Environment (argv[] and
envp[]) is configured by
parent program on exec.
Unix process view: data
A process has multiple channels
for data movement in and out of
the process (I/O).
I/O channels
(“file descriptors”)
stdin
stdout
tty
Process
stderr
The parent process
and parent program
set up and control
the channels for a
child (until exec).
pipe
Thread
socket
Files
Program
Standard I/O descriptors
I/O channels
(“file descriptors”)
stdin
stdout
tty
Open files or other I/O channels
are named within the process by
an integer file descriptor value.
stderr
Standard descriptors for
primary input (stdin=0),
primary output (stdout=1),
error/status (stderr=2).
These are inherited from
the parent process and/or
set by the parent program.
By default they are bound
to the controlling terminal.
count = read(0, buf, count);
if (count == -1) {
perror(“read failed”); /* writes to stderr
*/
exit(1);
}
count = write(1, buf, count);
if (count == -1) {
perror(“write failed”); /* writes to stderr
*/
Shell: foreground command
Shell waits in read call for
next command from tty, forks
child, and exec*s the named
command program in child.
dsh
“echo y”
fork
wait
If a child is running in the
foreground, the parent shell
blocks in a wait* system call
until the child changes state
(e.g., exits).
exec
argv[0]=“echo”
argv[1]=“y”
argc = 2
Child process continues on
to execute “echo” program
independently of parent.
Unix: signals
• A signal is a typed upcall event delivered to a process by kernel.
– Process P may use kill* system call to request signal delivery to Q.
– Process may register signal handlers for signal types. A signal handler
is a procedure that is invoked when a signal of a given type is delivered.
Runs in user mode, then returns to normal control flow.
– A signal delivered to a registered handler is said to be caught. If there is
no registered handler then the signal triggers a default action (e.g., “die”).
• A process lives until it exits voluntarily or receives a fatal signal.
data
Protected
system calls
data
...and upcalls (signals)
Kernel sends signals,
e.g., to notify
processes of faults.
Unix signals
• Unix signals are “musty”. We don’t teach/test the details.
– Sometimes you can’t ignore them!
– (See next lecture)
• A signal is an “upcall” event delivered to a process.
– Sent by kernel, or by another process via kill* syscall.
– An arriving signal transfers control to a handler routine in
receiving process, designated to handle that type of signal.
– The handler routine returns to normal control flow when done.
• Or: if no designated handler, take a standard action.
– e.g., DEFAULT, IGNORE, STOP, CONTINUE
– Default action is often for receiving process to exit.
– An exit from an unhandled signal is reported to parent via wait*.
Signals and dsh
• Your shell must set up itself and its children properly for
“stuff to work like it should”.
– We have provided you code to initialize shell, spawn job.
– You do not need to understand that code
– You should not need to mess with that code.
– Please don’t mess with that code.
– If you are messing with that code, please talk to us.
– What that code does: transfer control of terminal to foreground
child job, using process groups and signals, prepare for proper
handling of various events that might occur.
Process status change: STOP
dsh
“echo y”
A wait returns when the
status of the child
changes (e.g., STOP).
The awakened shell
blocks in read to accept
the next command.
What if a foreground job
exits before the parent
calls wait?
fork
exec
wait
“<ctrl-z>”
SIGSTOP
STOP
“fg”
kill SIGCONT
wait
exit
EXIT
Child blocks
when it receives
a STOP signal
(e.g., resulting
from ctrl-z on
keyboard).
Child wakes up
when it receives
a CONTINUE
signal (e.g., sent
by shell on fg
builtin
command).
I hope we get to here
EXTRA SLIDES
Mode Changes for Fork/Exit
• Syscall traps and “returns” are not always paired.
• Fork “returns” (to child) from a trap that “never
happened”
• Exit system call trap never returns
• System may switch processes between trap and return
parent
Fork
call
Fork
return
Wait
call
Wait
return
Exec enters the child by
doctoring up a saved user
context to “return” through.
child
Fork
entry to
user space
Exit
call
transition from user to kernel mode (callsys)
transition from kernel to user mode (retsys)
Using the shell
• Commands: ls, cat, and all that
• Current directory: cd and pwd
• Arguments: echo
• Signals: ctrl-c
• Job control, foreground, and background: &, ctrl-z, bg, fg
• Environment variables: printenv and setenv
• Most commands are programs: which, $PATH, and /bin
• Shells are commands: sh, csh, ksh, tcsh, bash
• Pipes and redirection: ls | grep a
• Files and I/O: open, read, write, lseek, close
• stdin, stdout, stderr
• Users and groups: whoami, sudo, groups
The Kernel
• Today, all “real” operating systems have protected kernels.
The kernel resides in a well-known file: the “machine” automatically
loads it into memory (boots) on power-on/reset.
Our “kernel” is called the executive in some systems (e.g., Windows).
• The kernel is (mostly) a library of service procedures shared by
all user programs, but the kernel is protected:
User code cannot access internal kernel data structures directly.
User code can invoke the kernel only at well-defined entry points
(system calls).
• Kernel code is “just like” user code, but the kernel is privileged:
The kernel has direct access to all hardware functions, and defines the
handler entry points for interrupts and exceptions.
Protecting Entry to the Kernel
Protected events and kernel mode are the architectural
foundations of kernel-based OS (Unix, Windows, etc).
– The machine defines a small set of exceptional event types.
– The machine defines what conditions raise each event.
– The kernel installs handlers for each event at boot time.
e.g., a table in kernel memory read by the machine
The machine transitions to kernel mode only on
an exceptional event.
The kernel defines the event handlers.
Therefore the kernel chooses what code will
execute in kernel mode, and when.
user
trap/return
interrupt or
fault
kernel
interrupt or
fault
The Role of Events
• A CPU event (an interrupt or exception, i.e., a trap or fault) is an
“unnatural” change in control flow.
• Like a procedure call, an event changes the PC.
• Also changes mode or context (current stack), or both.
• Events do not change the current space!
• On boot, the kernel defines a handler routine for each event type.
• The machine defines the event types.
• Event handlers execute in kernel mode.
control flow
exception.cc
• Every kernel entry results from an event.
• Enter at the handler for the event.
In some sense, the whole kernel
is a “big event handler.”
event handler
(e.g., ISR: Interrupt
Service Routine)
Examples
• Illegal operation
– Reserved opcode, divide-by-zero, illegal access
– That’s a fault! Kernel generates a signal, e.g., to kill process
or invoke PL exception handlers.
• Page fault
– Fetch and install page, maybe block process
– Nothing illegal about it: “transparent” to faulting process
• I/O completion, arriving input, clock ticks.
– These external events are interrupts.
– Include power fail etc.
– Kernel services interrupt in handler.
– May wakeup blocked processes, but no blocking.
Faults
• Faults are similar to system calls in some respects:
– Faults occur as a result of a process executing an instruction.
• Fault handlers execute on the process kernel stack; the fault
handler may block (sleep) in the kernel.
– The completed fault handler may return to the faulted context.
• But faults are different from syscall traps in other respects:
– Syscalls are deliberate, but faults are “accidents”.
• divide-by-zero, dereference invalid pointer, memory page fault
– Not every execution of the faulting instruction results in a fault.
• may depend on memory state or register contents
Note
• The following material on resource management was
not discussed in class.
• I’m leaving it in for completeness. We’ll come back to it
later in the semester.
• But this basic stuff should be easily understandable,
given the reading and what we have covered in class.
• So: please understand it.
Resource management
•
This is a fast first course in computer systems. We will skimp a little on
classical resource management. A couple of points to remember:
•
In general, resource management is a quantity allocation problem: e.g., how
many CPU cycles and how many pages of memory to give to a process?
•
Resources are allocated to some resource group or unit of accounting: per
process or per user account, or some other accounting domain.
•
Time matters: domains consume an allocation over time. The allocation at
any point in time may vary. The total consumption is determined by the
magnitude and its integral over time: it’s like power vs. energy.
•
The rate of resource flow over time matters for performance, especially
response time for apps that are interactive (like web services).
•
A key desired property: performance isolation. The mechanisms must
keep each process/account/domain within its allocation: attempts to overuse
resources by one user/process should not affect others. You already know
how this works. The policies choose how much to allocate and when.
The kernel
syscall trap/return
fault/return
system call layer: files, processes, IPC, thread syscalls
fault entry: VM page faults, signals, etc.
thread/CPU/core management: sleep and ready queues
memory management: block/page cache, VM maps
sleep queue
I/O completions
ready queue
interrupt/return
timer ticks
The kernel
syscall trap/return
fault/return
system call layer: files, processes, IPC, thread syscalls
fault entry: VM page faults, signals, etc.
thread/CPU/core management: sleep and ready queues
memory management: block/page cache
policy
sleep queue
I/O completions
ready queue
interrupt/return
policy
timer ticks
Separation of policy and mechanism
• Every OS platform has mechanisms that enable it to
mediate access to machine resources. E.g:
– Gain control of core by timer interrupts
– Fault on access to non-resident virtual memory
– I/O through system call traps
– Internal protected (kernel) code and data structures to track
resource usage and allocate resources
• But the mechanisms do not and must/should not
determine the policy.
• We might want to change the policy!
Time sharing vs. space sharing
space
Two common modes
of resource
allocation. What
kinds of resources do
these work for?
time 
Example: Processor Allocation
The key issue is: how should an OS allocate its CPU
resources among contending demands?
– We are concerned with resource allocation policy: how the OS
uses underlying mechanisms to meet design goals.
– Focus on OS kernel : user code can decide how to use the
processor time it is given.
– Which thread to run on a free core?
– For how long? When to take the core back and give it to some
other thread? (timeslice or quantum)
– What are the policy goals?
CPU Scheduling 101
The OS scheduler makes a sequence of “moves”.
– Next move: if a CPU core is idle, pick a ready thread t from the
ready pool and dispatch it (run it).
– Scheduler’s choice is “nondeterministic”
– Scheduler’s choice determines interleaving of execution
blocked
threads
Wakeup
ready pool
If timer expires, or
wait/yield/terminate
GetNextToRun
SWITCH()
Goals of policy
• Share resources fairly.
• Use machine resources efficiently.
• Be responsive to user interaction.
But what do these things mean?
How do we know if a policy is good or not?
What are the metrics?
What do we assume about the workload?
Download