Instructor: Junfeng Yang
Find two teammates before next Thursday
Post ads in CourseWorks
OS structure
Monolithic vs. microkernel
Modern OS: modules
Virtual machine
Intro to Linux
Interrupts in Linux
How to handle interrupts?
[Figure: interrupt delivery path. Labels: IRQs, PIC (mask points), INTR, CPU, intr #, memory bus, idtr, IDT (entries 0 to 255), ISR]
Interrupts in Linux (cont.)
Interrupt handlers
System calls in Linux
Intro to Process
What if a second interrupt occurs while an interrupt routine is executing?
Generally a good thing to permit that — is it possible?
And why is it a good thing?
You want to keep all I/O devices as busy as possible
In general, an I/O interrupt represents the end of an operation; another request should be issued as soon as possible
Most devices don’t interfere with each other’s data structures; there’s no reason to block out other devices
Hardware invokes handler with interrupt disabled
As soon as possible, unmask the global interrupt
Interrupts from the same IRQ line?
Want to process them serially
Thus, interrupts from the same IRQ are not enabled during interrupt handling
To preserve IRQ order on the same line, must disable incoming interrupts on that line
New interrupts can get lost if the controller buffer overflows
Interrupt preempts what CPU was doing, which may be important
Even if it is not important, it is undesirable to block the user program for long
So, handler must run for a very short time!
Do as little as possible in the interrupt handler
• Often just: queue a work item and set a flag
Defer non-critical actions till later
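The "queue a work item and set a flag" pattern above can be sketched as a user-space simulation in C. The names `top_half`, `bottom_half`, `work_queue` and `work_pending` are hypothetical, not kernel APIs, and a real handler would also have to protect the queue against concurrent interrupts.

```c
#include <stddef.h>

#define QUEUE_CAP 16

/* Hypothetical queue shared by the handler and the deferred code. */
static int work_queue[QUEUE_CAP];
static size_t q_head, q_tail;      /* dequeue at head, enqueue at tail */
static volatile int work_pending;  /* flag set by the handler */
static long processed_sum;         /* stands in for the "real" work */

/* Top half: runs in interrupt context, so it only queues and flags. */
int top_half(int data)
{
    if (q_tail - q_head == QUEUE_CAP)
        return -1;                 /* queue full: the event is lost */
    work_queue[q_tail++ % QUEUE_CAP] = data;
    work_pending = 1;
    return 0;
}

/* Bottom half: runs later and does the non-critical work. */
int bottom_half(void)
{
    int handled = 0;

    if (!work_pending)
        return 0;
    work_pending = 0;
    while (q_head != q_tail) {
        processed_sum += work_queue[q_head++ % QUEUE_CAP];
        handled++;
    }
    return handled;
}
```

The handler's cost is a couple of stores regardless of how much work each item needs; the expensive part runs in `bottom_half()` at a time the kernel chooses.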
Interrupts (as opposed to exceptions) are not associated with particular instructions, nor the current process. It’s like an unexpected jump.
• Why? Context switch expensive
Implication
Interrupt handlers cannot call functions that may sleep (i.e. yield CPU to scheduler)!
Why not?
• Scheduler only schedules processes, so wouldn’t know to reschedule the interrupt handler
• The current process may be doing something dangerous, and cannot sleep
Top half (th) and bottom half (bh)
Top-half: do minimum work and return (ISR)
Bottom-half: deferred processing (softirqs, tasklets, workqueues)
[Figure: top half (ISR) on one side; bottom half (tasklet, softirq, workqueue) on the other]
Perform minimal, common functions: saving registers, unmasking other interrupts. Eventually, undoes that: restores registers, returns to previous context.
Most important: call proper interrupt handler provided in device drivers (C program)
Typically queue the request and set a flag for deferred processing
Three deferred work mechanisms: softirqs, tasklets, and work queues (tasklet built on top of softirq)
All of these use request queues
All can be interrupted
Types are statically allocated: at kernel compile time
Limited number:
Priority  Type
0         High-priority tasklets (generic)
1         Timer interrupts
2         Network transmission
3         Network reception
4         SCSI disks
5         Regular tasklets (generic)
What does a softirq handler do?
Parse packet header
Verify checksum
Deliver packet up to the network stack
return
Linux-2.6.11/net/core/dev.c, function net_rx_action
When to execute softirq?
Run at various points by the kernel, using current process’s context
Most important: after handling IRQs and after timer interrupts
Essentially polling
Problem: while processing one softirq, another is raised. Process it?
No: long delay for the new softirq
Always: starve user program during a long softirq burst
• Livelock!
100% CPU utilization, but no progress
Why?
Need user program to eventually process requests
• E.g. webserver
However, if too many interrupt requests, starve user program
Big deal for networking in 90s
Solution:
• Eliminating receive livelock in an interrupt-driven kernel, Jeffrey C. Mogul, K. K. Ramakrishnan
• Adopted into Linux
Goal: provide user program fair share of CPU time despite interrupt burst
Quota + dedicated context ksoftirqd
Process up to N softirqs for one softirq handler invocation
• Bound time spent in handler
Process the rest in ksoftirqd
• ksoftirqd subject to scheduling, as user process
• Provide fairness to user process
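The quota idea can be sketched as a small simulation in C. The quota value, the counters and the function names (`do_softirq_sim`, `ksoftirqd_sim`) are assumptions made for illustration; this is not the kernel's code.

```c
#define SOFTIRQ_QUOTA 10     /* assumed value of N */

static int pending;          /* softirqs raised but not yet handled */
static int deferred_wakeups; /* times the ksoftirqd stand-in was woken */

static void handle_one(void) { pending--; }

/* Called after an interrupt: handle at most SOFTIRQ_QUOTA items,
 * which bounds the time spent here no matter how long the burst is. */
int do_softirq_sim(void)
{
    int done = 0;

    while (pending > 0 && done < SOFTIRQ_QUOTA) {
        handle_one();
        done++;
    }
    if (pending > 0)
        deferred_wakeups++;  /* leave the rest to the scheduled thread */
    return done;
}

/* Stand-in for ksoftirqd: runs only when the scheduler picks it, so it
 * competes with user processes instead of starving them. A real thread
 * could also be preempted in the middle of this loop. */
int ksoftirqd_sim(void)
{
    int done = 0;

    while (pending > 0) {
        handle_one();
        done++;
    }
    return done;
}
```

With 25 softirqs pending, one invocation handles 10 and hands the remaining 15 to the scheduled thread.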
Problem: softirq is static
To add a new type of Softirq, need to convince Linus!
Solution: tasklets
Built on top of softirq
New types are created and destroyed dynamically
Simplified for multicore processing: at any time, only one tasklet among all of the same type can run
Problem with softirq and tasklets: they have no process contexts either, thus cannot sleep
Softirqs and tasklets run in an interrupt context; work queues have a process context
The idea:
You throw work (fn, args) to a workqueue
The workqueue adds it to an internal FIFO queue
A dedicated workqueue process loops forever, dequeuing (fn, args), and running fn(args)
Because they have a process context, they can sleep
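A minimal single-threaded sketch of the (fn, args) queue in C. The names `queue_work_sim` and `worker_drain` are hypothetical and only mirror the idea; they are not the kernel's workqueue API, and the sketch has no locking.

```c
#include <stddef.h>

#define WQ_CAP 8

typedef void (*work_fn)(void *);

struct work {
    work_fn fn;
    void *arg;
};

static struct work wq[WQ_CAP];
static size_t wq_head, wq_tail;

/* Throw (fn, arg) onto the queue; returns -1 if the queue is full. */
int queue_work_sim(work_fn fn, void *arg)
{
    if (wq_tail - wq_head == WQ_CAP)
        return -1;
    wq[wq_tail % WQ_CAP].fn = fn;
    wq[wq_tail % WQ_CAP].arg = arg;
    wq_tail++;
    return 0;
}

/* Body of the worker: dequeue in FIFO order and run fn(arg).
 * A real worker would loop forever and sleep while the queue is empty;
 * this version returns so it can be called from a test. */
int worker_drain(void)
{
    int ran = 0;

    while (wq_head != wq_tail) {
        struct work w = wq[wq_head++ % WQ_CAP];
        w.fn(w.arg);   /* fn may block: the worker has a process context */
        ran++;
    }
    return ran;
}

/* Example work function: appends its argument to a log. */
static int wq_log[WQ_CAP];
static int wq_log_len;

static void log_value(void *arg)
{
    wq_log[wq_log_len++] = *(int *)arg;
}
```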
Linux has a pseudo-file system, /proc, for monitoring (and sometimes changing) kernel behavior
Run cat /proc/interrupts to see what’s going on
CPU0
0: 162 IO-APIC-edge timer
1: 0 IO-APIC-edge i8042
4: 10 IO-APIC-edge
7: 0 IO-APIC-edge parport0
8: 1232299 IO-APIC-edge rtc
9: 0 IO-APIC-fasteoi acpi
12: 1 IO-APIC-edge i8042
16: 19256781 IO-APIC-fasteoi uhci_hcd:usb1, …
17: 79 IO-APIC-fasteoi uhci_hcd:usb2, uhci_hcd:usb4
…
# Columns: IRQ, count, interrupt controller, devices
Interrupts in Linux (cont.)
System calls in Linux
Intro to Process
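[Figure: system call path. User mode: printf(“hello world!\n”) calls into libc, which sets %eax = sys_write and executes int 0x80. Kernel mode: IDT entry 0x80 leads to system_call(), which looks up fn = syscalls[%eax] in the syscalls table and calls sys_write(…), which does the real work]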
Library calls are much faster than system calls
If you can do it in user space, you should
Library call or system call: strlen? write?
To learn what a library call or system call does:
Documents are called “manpages,” divided into sections
Library calls (section 3) e.g. man 3 strlen
System calls (section 2) e.g. man 2 write
Generating the assembly code for trapping into the kernel is complex so Linux provides a set of macros to do this for you!
Macros with name _syscallN(), where N is the number of system call parameters
_syscallN(return_type, name, arg1type, arg1name, …) in linux-2.6.11/include/asm-i386/unistd.h
The macro expands to a wrapper function
Example:
long open(const char *filename, int flags, int mode);
_syscall3(long, open, const char *, filename, int, flags, int, mode)
NOTE: _syscallN is obsolete after 2.6.18; use syscall(…) instead, which can take a varying number of arguments
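To show what "expands to a wrapper function" means, here is a sketch of a macro modeled on _syscall3. `MY_SYSCALL3` and the generated `my_write` are hypothetical names; instead of inline assembly the wrapper forwards to the C library's `syscall()`, so this is an illustration of the idea, not the kernel header's definition.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE          /* for the declaration of syscall() */
#endif
#include <unistd.h>
#include <sys/syscall.h>

/* Hypothetical macro: expands to a wrapper function my_<name>() that
 * forwards its three arguments, plus the syscall number SYS_<name>,
 * to syscall(). */
#define MY_SYSCALL3(type, name, t1, a1, t2, a2, t3, a3)   \
    type my_##name(t1 a1, t2 a2, t3 a3)                   \
    {                                                     \
        return (type)syscall(SYS_##name, a1, a2, a3);     \
    }

/* Expands to:
 *   long my_write(int fd, const void *buf, unsigned long count) { ... } */
MY_SYSCALL3(long, write, int, fd, const void *, buf, unsigned long, count)
```

Calling `my_write(1, "hi\n", 3)` then behaves like `write(1, "hi\n", 3)`; with an invalid descriptor such as -1 it returns -1.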
Library calls return -1 on error and place a specific error code in the global variable errno
System calls return specific negative values to indicate an error
Most system calls return -errno
The library wrapper code is responsible for conforming the return values to the errno convention
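The conversion done by the wrapper can be sketched as follows. The function name `syscall_ret_to_libc` is hypothetical; the sketch assumes that raw return values in [-4095, -1] encode -errno, which is the range glibc treats as errors on Linux.

```c
#include <errno.h>

/* Convert a raw kernel return value into the C library convention:
 * on error, store the code in errno and return -1; otherwise pass
 * the result through unchanged. */
long syscall_ret_to_libc(long raw)
{
    if (raw < 0 && raw >= -4095) {
        errno = (int)-raw;
        return -1;
    }
    return raw;
}
```

For example, a raw return of -EBADF becomes a return value of -1 with errno set to EBADF, while a raw return of 5 (say, bytes written) is returned as is.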
.section .text
system_call:
    // copy parameters from registers onto stack …
    call *sys_call_table(, %eax, 4)
    jmp ret_from_sys_call
ret_from_sys_call:
    // perform rescheduling and signal-handling …
    iret    // return to caller (in user-mode)
// File arch/i386/kernel/entry.S
Why jump table? Can’t we use if-then-else?
There are approximately 300 system-calls
Any specific system-call is selected by its ID-number (it’s placed into register %eax)
It would be inefficient to use if-else tests or even a switch-statement to transfer to the service-routine’s entry-point
Instead an array of function-pointers is directly accessed (using the ID-number)
This array is named ‘sys_call_table[]’
Defined in file arch/i386/kernel/entry.S
.section .data
sys_call_table:
    .long sys_restart_syscall
    .long sys_exit
    .long sys_fork
    .long sys_read
    .long sys_write
    …
NOTE: syscall numbers cannot be reused (why?); deprecated syscalls are implemented by a special “not implemented” syscall (sys_ni_syscall)
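The same jump-table idea written in C, as a toy. The routines, their numbering and the name `toy_dispatch` are made up for illustration; only the technique (index an array of function pointers by ID number, and keep a "not implemented" routine in retired slots) comes from the slides.

```c
#include <errno.h>
#include <stddef.h>

typedef long (*syscall_fn)(long, long, long);

/* Toy service routines. */
static long toy_ni_syscall(long a, long b, long c)
{
    (void)a; (void)b; (void)c;
    return -ENOSYS;          /* "not implemented" */
}

static long toy_add(long a, long b, long c)
{
    return a + b + c;
}

static long toy_max(long a, long b, long c)
{
    long m = a > b ? a : b;
    return m > c ? m : c;
}

/* The table: one function pointer per ID number. */
static const syscall_fn toy_call_table[] = {
    toy_ni_syscall,  /* 0: retired slot; the number is never reused */
    toy_add,         /* 1 */
    toy_max,         /* 2 */
};

#define TOY_NR_CALLS (sizeof toy_call_table / sizeof toy_call_table[0])

/* Dispatch is one bounds check and one indexed load, independent of
 * how many calls exist. */
long toy_dispatch(unsigned long nr, long a, long b, long c)
{
    if (nr >= TOY_NR_CALLS)
        return -ENOSYS;
    return toy_call_table[nr](a, b, c);
}
```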
Usually a library function “foo()” will do some work and then call a system call (“sys_foo()”)
In Linux, all system calls begin with “sys_”
Often “sys_foo()” just does some simple error checking and then calls a worker function named “do_foo()”
Linux has a powerful mechanism for tracing system call execution for a compiled application
Output is printed for each system call as it is executed, including parameters and return codes
The ptrace() system call is used
Also used by debuggers (breakpoint, singlestep, etc)
Use the “strace” command (man strace for info)
You can trace library calls using the “ltrace” command
The first parameter is always the syscall #
eax on Intel
Linux allows up to six additional parameters
ebx, ecx, edx, esi, edi, ebp on Intel
System calls that require more parameters package the remaining params in a struct and pass a pointer to that struct as the sixth parameter
Problem: must validate pointers
Could be invalid, e.g. NULL: would crash the OS
Or worse, could point to OS or device memory: a security hole
Too expensive to do a thorough check
Need to check that the pointer is within all valid memory regions of the calling process
Solution: No comprehensive check
Linux does a simple check for address pointers and only determines if pointer variables are within the largest possible range of user memory (more details when talking about process)
Even if a pointer value passes this check, it is still quite possible that the specific value is invalid
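A sketch of such a range-only check in C. The function name `user_range_ok` is hypothetical, and the limit is an assumption: the 3 GB boundary of the classic 32-bit layout.

```c
#include <stdint.h>

/* Assumed top of user memory (3 GB on classic 32-bit Linux). */
#define USER_LIMIT UINT32_C(0xC0000000)

/* Returns 1 if [addr, addr + size) lies entirely below USER_LIMIT,
 * else 0. Written so that addr + size cannot overflow. */
int user_range_ok(uint32_t addr, uint32_t size)
{
    return size <= USER_LIMIT && addr <= USER_LIMIT - size;
}
```

Note that `user_range_ok(0, 4)` returns 1: a NULL pointer passes the range check even though nothing is mapped there, which is exactly why passing this check does not make the pointer valid.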
Dereferencing an invalid pointer in kernel code would normally be interpreted as a kernel bug and generate an Oops message on the console and kill the offending process
Linux does something very sophisticated to avoid this situation
Kernel code must access user-pointers using a small set of “paranoid” routines (e.g. copy_from_user)
Thus, kernel knows what addresses in its code can throw invalid memory access exceptions (page fault)
When a page fault occurs, the kernel’s page fault handler checks the faulting EIP (recall: saved by hw)
If EIP matches one of the paranoid routines, kernel will not oops; instead, will call “fixup” code
Many violations of this rule in Linux. Once built a checker and found tons of security holes
Function                                      Action
get_user(), __get_user()                      reads integer (1,2,4 bytes)
put_user(), __put_user()                      writes integer (1,2,4 bytes)
copy_from_user(), __copy_from_user()          copy a block from user space
copy_to_user(), __copy_to_user()              copy a block to user space
strncpy_from_user(), __strncpy_from_user()    copies null-terminated string from user space
strnlen_user(), __strnlen_user()              returns length of null-terminated string in user space
clear_user(), __clear_user()                  fills memory area with zeros
Exception table
Each entry: faulting instruction address, fixup code
On page fault, kernel scans exception table to find the fixup code
Typically the fixup code terminates the system call with an EINVAL error code (means: invalid arguments)
Some ELF tricks help to generate exception table and implement fixup code; see ULK Chapter 10 for gruesome details
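The lookup can be sketched in C as a search over (faulting address, fixup address) pairs. The addresses are made up, `find_fixup` is a hypothetical name, and the sketch assumes the table is sorted by instruction address so that a binary search can replace a linear scan.

```c
#include <stddef.h>
#include <stdint.h>

struct ex_entry {
    uintptr_t insn;   /* address of an instruction allowed to fault */
    uintptr_t fixup;  /* address of the code to resume at */
};

/* Toy table with made-up addresses, sorted by insn. */
static const struct ex_entry ex_table[] = {
    { 0x1000, 0x9000 },
    { 0x1040, 0x9010 },
    { 0x20a0, 0x9020 },
};

/* Returns the fixup address for fault_ip, or 0 if the faulting
 * instruction is not listed (the case where the kernel would oops). */
uintptr_t find_fixup(uintptr_t fault_ip)
{
    size_t lo = 0;
    size_t hi = sizeof ex_table / sizeof ex_table[0];

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (ex_table[mid].insn == fault_ip)
            return ex_table[mid].fixup;
        if (ex_table[mid].insn < fault_ip)
            lo = mid + 1;
        else
            hi = mid;
    }
    return 0;
}
```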
int 0x80 not used any more (I lied …)
Intel has a hardware optimization (sysenter) that provides an optimized system call invocation
Read the gory details in ULK Chapter 10
Interrupts in Linux (cont.)
System calls in Linux
Intro to Process
What are processes?
Why need them?
“Program in execution” “virtual CPU”
Process is an execution stream in the context of a particular process state
Execution stream: a stream of instructions
A running piece of code: a sequential sequence of instructions
Process state: determines the effects of the instructions.
Stuff that the running code can affect or be affected by
Registers
• General purpose, floating point, EIP …
Memory: everything a process can address
• Code, data, stack, heap
I/O
• File descriptor table
More …
[Figure: CPU registers (SP, IP) referring to memory regions: stack, heap, data, code]
Process != program
Program: static code and static data
Process: dynamic instantiation of code and data
Process <-> program: no 1:1 mapping
Process > program: code + data + other things
[Figure: a program (code for main() and f(int x)) next to a process (the same code plus heap, stack for f(), regs, IP)]
Process <-> program: no 1:1 mapping
Program > process: one program can invoke multiple processes
• E.g. shell can run commands in different processes
Process > program: can have multiple processes of the same program
• E.g. Multiple users run multiple /usr/bin/tcsh
More details when discussing memory management
AS = All memory a process can address + addresses
Virtual address space:
Really large memory to use
Linear array of bytes: [0, N), N roughly 2^32, 2^64
Process and virtual address space: 1 : 1 mapping
Key : an AS is a protection domain
One process can’t address another process’s address space (without permission)
• E.g. the value stored at 0x800abcd in p1 is different from the value at 0x800abcd in p2
Thus one process can’t read/write another’s memory
More details when discussing threads
Process != Threads
Threads: many streams of executions in one process
Threads share address space
[Figure: a process (code for main() and f(int x), heap, one stack for f(), one set of regs, one IP) next to threads (the same code and heap shared, with a separate stack for f(), regs and IP per thread)]
Divide and conquer
Decompose a large problem into smaller ones: well-contained smaller problems are easier to think about
Sequential
Easier to think about
Increase performance.
System has many concurrent jobs going on
Most OSes support processes
Uniprogramming : only one process at a time
Good: simple
Bad: low utilization, low interactivity
Multiprogramming : multiple at a time
When one proc blocks (e.g. I/O), switch to another
NOTE : different from multiprocessing (systems with multiple processors)
Good: increase utilization, interactivity
Bad: complex
OS support for multiprogramming
Policy: scheduling, what proc to run? (next week)
Mechanism:
• dispatching, how to run/block process?
• how to protect from one another?
Separation of policy and mechanism
Recurring theme in OS
Policy : decision making with some performance metric and workload
• Scheduling (next week)
Mechanism : low-level code to implement decisions
• Dispatching (today)
Process state
New: being created
Running: instructions are running on CPU
Waiting: waiting for some event (e.g. IO)
Ready: waiting to be assigned a CPU
Terminated: finished
OS stores processes on system-wide lists
OS dispatching loop:
while(1) {
    proc = choose(ready procs);
    load proc state;
    run proc for a while;
    save proc state;
}
Q1: how to gain control?
Q2: what state must be saved?
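The dispatching loop can be made concrete with a small simulation in C. Everything here is an assumption for illustration: the `struct proc` fields, the round-robin `choose()` policy, and the idea that "running" just subtracts a quantum from the remaining work.

```c
#define NPROC 3

enum pstate { READY, RUNNING, TERMINATED };

struct proc {
    enum pstate state;
    int remaining;   /* time units of work left */
};

static struct proc procs[NPROC];
static int run_order[32];   /* which process ran in each time slice */
static int run_len;

/* Policy stand-in: next READY process after `last`, round-robin. */
static int choose(int last)
{
    for (int i = 1; i <= NPROC; i++) {
        int p = (last + i) % NPROC;
        if (procs[p].state == READY)
            return p;
    }
    return -1;
}

/* Mechanism: run each chosen process for one quantum (must be > 0)
 * until every process has terminated. Returns the number of slices. */
int dispatch_all(int quantum)
{
    int last = NPROC - 1, slices = 0, p;

    while ((p = choose(last)) >= 0) {
        procs[p].state = RUNNING;            /* load proc state */
        procs[p].remaining -= quantum;       /* run proc for a while */
        if (run_len < 32)
            run_order[run_len++] = p;
        /* save proc state: back to READY unless it has finished */
        procs[p].state = procs[p].remaining > 0 ? READY : TERMINATED;
        last = p;
        slices++;
    }
    return slices;
}
```

Swapping in a different `choose()` changes the policy without touching the loop, which is the separation of policy and mechanism in miniature.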
Cooperative vs. preemptive
Cooperative: process voluntarily yields control to OS
When? System call
Why bad? OS trusts process!
• Malicious process? Bugs?
Preemptive
When? Interrupt, especially timer (studied before)
OS trusts no one!
Registers
Memory?
I/O?
Context switch
How?
Why hard? Need code to save registers, but code needs registers
Why expensive? A lot of registers to load and store
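A user-space analogue of a context switch, sketched with the `ucontext` functions (`getcontext`, `makecontext`, `swapcontext`) as provided by glibc. This is not how the kernel switches processes (that code is assembly), but `swapcontext` performs the same two steps: save the current registers, then load another set. The names `task_body`, `run_context_demo` and `ctx_trace` are hypothetical.

```c
#include <ucontext.h>

static ucontext_t main_ctx, task_ctx;
static char task_stack[64 * 1024];   /* separate stack for the task */
static int ctx_trace[4];             /* order in which steps ran */
static int ctx_trace_len;

static void task_body(void)
{
    ctx_trace[ctx_trace_len++] = 1;
    swapcontext(&task_ctx, &main_ctx);   /* save task regs, resume main */
    ctx_trace[ctx_trace_len++] = 3;
}   /* returning resumes uc_link, i.e. main_ctx */

int run_context_demo(void)
{
    ctx_trace_len = 0;
    if (getcontext(&task_ctx) != 0)
        return -1;
    task_ctx.uc_stack.ss_sp = task_stack;
    task_ctx.uc_stack.ss_size = sizeof task_stack;
    task_ctx.uc_link = &main_ctx;
    makecontext(&task_ctx, task_body, 0);

    swapcontext(&main_ctx, &task_ctx);   /* switch to the task */
    ctx_trace[ctx_trace_len++] = 2;
    swapcontext(&main_ctx, &task_ctx);   /* resume task where it stopped */
    ctx_trace[ctx_trace_len++] = 4;
    return ctx_trace_len;
}
```

Each `swapcontext` call stores and reloads the full register set, which gives a feel for why switches are not free.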
Interleaved assembly entry points:
ret_from_exception() ret_from_intr() ret_from_sys_call() ret_from_fork()
Things that happen:
Run scheduler if necessary
Return to user mode if no nested handlers
• Restore context, user-stack, switch mode
• Re-enable interrupts if necessary
Deliver pending signals
(Some DOS emulation stuff – VM86 Mode)
Which has a higher priority, a disk interrupt or a network interrupt?
Different CPU architectures make different decisions
By not assuming or enforcing any priority,
Linux becomes more portable
When an interrupt occurs, what stack is used?
Exceptions: The kernel stack of the current process, whatever it is, is used (There’s always some process running: the “idle” process, if nothing else)
Interrupts: hard IRQ stack (1 per processor)
SoftIRQs: soft IRQ stack (1 per processor)
These stacks are configured in the IDT and TSS at boot time by the kernel
[Figure: system call invocation. In user-mode (restricted privileges), the app making a system call calls xyz(), a wrapper routine in the std C library, which executes int 0x80. In kernel-mode (unrestricted privileges), the system call handler system_call: calls the system call service routine sys_xyz() and returns to user-mode with iret]
[Figure: sys_call_table in .section .text, indexed from 0: sys_restart_syscall, sys_exit, sys_fork, sys_read, sys_write, sys_open, sys_close, …etc…]
Recall that an address space is the collection of valid addresses that a process can reference at any given time (0 … 4GB on a 32 bit processor)
The kernel is a logically separate address space (separate code/data) from all user processes
Trapping to the kernel logically involves changing the address space
(like a process context switch)
Modern OSes use a trick to optimize system calls: every process gets at most 3GB of virtual addresses (0..3GB) and a copy of the kernel address space is mapped into *every* process address space (3..4GB)
The kernel code is mapped but not accessible in user-mode so processes can only “see” the kernel code when they trap into the kernel and the mode bit is changed
Pros: save TLB flushes, make copying in/out of user space easy
Cons: steals processor address space, limited kernel mapping
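[Figure: 4 GB virtual address space. Kernel space (privilege-level 0) occupies 3 GB to 4 GB and holds the kernel-mode stack; user space (privilege-level 3) occupies 0 GB to 3 GB and holds the user-mode stack-area, shared runtime-libraries, and the task’s code and data]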
Generating the assembly code for trapping into the kernel is complex so Linux provides a set of macros to do this for you!
There are seven macros with names _syscallN() where N is a digit between 0 and 6 and N corresponds to the number of parameters
The macros are a bit strange and accept data types as parameters
For each macro there are 2 + 2*N parameters; the first two correspond to the return type of syscall (usually long) and the syscall name; the remaining 2*N parameters are the type and name of each syscall parameter
Example:
long open(const char *filename, int flags, int mode);
_syscall3(long, open, const char *, filename, int, flags, int, mode)
Types of interrupts
Device IRQs APICs Interrupt
ISR
SoftIRQs, tasklets, workqueues, ksoftirqd
Interrupt context vs. process context
Who can block
Nested interrupt
System calls often block in the kernel (e.g. waiting for IO completion)
When a syscall blocks, the scheduler is called and selects another process to run
Linux distinguishes “slow” and “fast” syscalls:
Slow: may block indefinitely (e.g. network read)
Fast: should eventually return (e.g. disk read)
Slow syscalls can be “interrupted” by the delivery of a signal (e.g. Control-C)
Interrupts can be interrupted
By different interrupts; handlers need not be reentrant
No notion of priority in Linux
Small portions execute with interrupts disabled
Interrupts remain pending until acked by CPU
Exceptions can be interrupted
By interrupts (devices needing service)
Exceptions can nest two levels deep
Exceptions indicate coding error
Exception code (kernel code) shouldn’t have bugs
Page fault is possible (trying to touch user data)
if NIC has incoming packets
Device already copied packet to an OS buffer
ISR allocates a descriptor (sk_buff) for this buffer
ISR queues this descriptor and raises a softirq (== set a flag, explained next)
return (will enable interrupt)
$ cat /proc/pci
PCI devices found:
Bus 0, device 0, function 0:
Host bridge: PCI device 8086:2550 (Intel Corp.) (rev 3).
Prefetchable 32 bit memory at 0xe8000000 [0xebffffff].
Bus 0, device 29, function 1:
USB Controller: Intel Corp. 82801DB USB (Hub #2) (rev 2).
IRQ 19.
I/O at 0xd400 [0xd41f].
Bus 0, device 31, function 1:
IDE interface: Intel Corp. 82801DB ICH4 IDE (rev 2).
IRQ 16.
I/O at 0xf000 [0xf00f].
Non-prefetchable 32 bit memory at 0x80000000 [0x800003ff].
Bus 3, device 1, function 0:
Ethernet controller: Broadcom NetXtreme BCM5703X Gigabit Eth (rev 2).
IRQ 48.
Master Capable. Latency=64. Min Gnt=64.
Non-prefetchable 64 bit memory at 0xf7000000 [0xf700ffff].