Linux Kernel Design: 2.6/3.x Overview

Operating System Design
LINUX KERNEL DESIGN
(2.6/3.X)
Dr. C.C. Lee
Ref: Linux Kernel Development by R. Love
Ref: Operating System Concepts by Silberschatz…
Introduction
 Monolithic kernel with dynamically loadable kernel modules
 SMP support (run queue per CPU, load balancing)
 Preemptive, schedulable kernel with thread support
 CPU affinity (soft & hard)
 Kernel memory is not pageable
 Source in GNU C (not ANSI C) with extensions, e.g. inline functions for efficiency
 Kernel source tree: architecture-independent and architecture-dependent parts
 Portable to different architectures
CPU Affinity
 CPU affinity: less migration overhead, since a task's data stays in the CPU's cache
 Soft affinity means that processes do not frequently migrate between processors.
 Hard affinity means that processes run only on the processors you specify. Reasons to use it:
Reason 1: You have a hunch – computations
Reason 2: Testing complex applications – linear scalability?
Reason 3: Running time-sensitive, deterministic processes
sched_setaffinity(…) sets the CPU affinity mask
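Hard affinity can be requested from user space with sched_setaffinity(). The following is a minimal sketch, assuming Linux with glibc; the helper name pin_to_nth_allowed_cpu is hypothetical and not part of the slides. It reads the current mask first so that it only asks for a CPU the thread is already allowed to use.

```c
#define _GNU_SOURCE             /* needed for cpu_set_t and the CPU_* macros */
#include <assert.h>
#include <sched.h>

/* Hypothetical helper: pin the calling thread to the n-th CPU (0-based) of
 * its current affinity mask. Returns the CPU number, or -1 on failure. */
int pin_to_nth_allowed_cpu(int n)
{
    cpu_set_t allowed, target;

    assert(n >= 0);
    CPU_ZERO(&allowed);
    /* pid 0 means "the calling thread" */
    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0)
        return -1;

    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
        if (!CPU_ISSET(cpu, &allowed))
            continue;           /* not allowed to run here */
        if (n > 0) {
            n--;                /* skip allowed CPUs until the n-th one */
            continue;
        }
        CPU_ZERO(&target);
        CPU_SET(cpu, &target);  /* mask with exactly one CPU */
        if (sched_setaffinity(0, sizeof(target), &target) != 0)
            return -1;
        return cpu;
    }
    return -1;                  /* fewer than n + 1 CPUs allowed */
}
```

Calling pin_to_nth_allowed_cpu(0) leaves the thread with a single-CPU mask, which is the "processors you specify" case described above.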
Process (Task) Basics
 Process States

TASK_RUNNING (running or ready to run)
TASK_INTERRUPTIBLE (sleeping or blocked; may be woken by a signal)
TASK_UNINTERRUPTIBLE (sleeping or blocked; only the awaited event can wake the task)
TASK_STOPPED (stopped by the SIGSTOP, SIGTTIN, or SIGTTOU signals)
TASK_ZOMBIE (terminated; waiting for the parent task to issue wait)
Process (Task) Basics (Cont.)
 Context
Process context – user code, or kernel code running on behalf of a process (system calls)
Interrupt context – kernel interrupt handling
 Task (Process) Creation
fork() (may be implemented with COW, i.e. Copy On Write)
vfork(): same as fork(), but the page tables are shared rather than copied, and the parent waits for the child
The clone() system call is used to implement fork() and vfork()
Threads are created the same way as normal tasks, except that clone() is passed flags specifying which resources are shared
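The effect of fork() with copy-on-write can be observed from user space. The sketch below assumes a POSIX system; the function name fork_cow_demo and the exit code 7 are illustrative choices. The child writes to a variable it inherited, and the parent's copy stays unchanged because the write lands on the child's private copy of the page.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical demo: returns the parent's view of `value` after the child
 * has modified its own copy; stores the child's exit code in *child_status.
 * Returns -1 if fork() or waitpid() fails. */
int fork_cow_demo(int *child_status)
{
    int value = 1;
    pid_t pid;
    int status = 0;

    assert(child_status != NULL);
    pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: this write affects only the child's copy of the page. */
        value = 42;
        _exit(value == 42 ? 7 : 1);
    }

    /* Parent: wait for the child, then report its exit code. */
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    *child_status = WIFEXITED(status) ? WEXITSTATUS(status) : -1;

    return value;   /* still 1 in the parent */
}
```

A thread created through clone() with a shared address space would behave differently here: the write would be visible to both sides.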
 Task (Process) Termination

Memory, files, timers, and semaphores are released; the parent is notified
Process (Task) Scheduling
 Preemptive
 Scheduler Classes (priority for classes)
 Real-time: FIFO and RR (timeslice), fixed priority
 Normal (SCHED_NORMAL)
 SMP (Run queue/structure per CPU, why?)
 Processor Affinity (Soft & Hard)
 Load balancing
Process (Task) Scheduling Cont.
 Two process-scheduling classes:
 Normal
Time-sharing (dynamic priority)
(Nice value: -20 to 19; the default 0 corresponds to priority 120)
 Real-time
FIFO/RR algorithms (soft real-time)
Absolute (static) priorities: 0-99
FIFO: runs until it exits, yields, or blocks
RR: runs for a time slice
Preemption by a higher-priority task is possible
 Normal processes: to be studied here
Early Kernel 2.6 - O(1) Scheduler
 O(1) Scheduler (early kernel 2.6)
An improved scheduler whose operations are O(1), using bitmap operations to find the highest-priority non-empty queue
 Active and expired arrays (run queues per CPU)
 Scalable
 Heuristics to classify CPU-bound vs. I/O-bound tasks and interactivity
O(1) Scheduler Priority Array
[Figure: O(1) scheduler priority arrays. Source: Operating System Concepts, Silberschatz, Galvin and Gagne ©2005]
O(1) Scheduler Summary
 Implements a priority-based array of task entries that
enables the highest-priority task to be found quickly
(by using a priority bitmap with a fast instruction).
 Recalculates the timeslice and priority of an expired
task before it places it on the expired queue. When all
the tasks expire, the scheduler simply needs to swap the
active and expired queue pointers and schedule the next
task. Long scans of runqueues are thus eliminated.
 Scheduling takes the same amount of processing irrespective of the number of tasks in the system: the cost no longer depends on n but is a fixed constant, i.e. O(1).
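The bitmap lookup and the active/expired pointer swap summarized above can be sketched in user-space C. This is an illustration, not the kernel's code: the 140 priority levels are an assumption taken from the 2.6 kernel (100 real-time plus 40 normal), the per-priority task lists are reduced to counters, and __builtin_ctzll is a GCC/Clang builtin standing in for the "fast instruction".

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_PRIO 140                          /* assumed: 100 RT + 40 normal levels */
#define BITMAP_WORDS ((MAX_PRIO + 63) / 64)

struct prio_array {
    uint64_t bitmap[BITMAP_WORDS];            /* bit p set => queue p is non-empty */
    int nr_queued[MAX_PRIO];                  /* stand-in for per-priority task lists */
};

struct runqueue {
    struct prio_array arrays[2];
    struct prio_array *active, *expired;
};

void rq_init(struct runqueue *rq)
{
    memset(rq, 0, sizeof(*rq));
    rq->active = &rq->arrays[0];
    rq->expired = &rq->arrays[1];
}

void prio_enqueue(struct prio_array *a, int prio)
{
    assert(prio >= 0 && prio < MAX_PRIO);
    a->nr_queued[prio]++;
    a->bitmap[prio / 64] |= UINT64_C(1) << (prio % 64);
}

void prio_dequeue(struct prio_array *a, int prio)
{
    assert(prio >= 0 && prio < MAX_PRIO && a->nr_queued[prio] > 0);
    if (--a->nr_queued[prio] == 0)            /* queue drained: clear its bit */
        a->bitmap[prio / 64] &= ~(UINT64_C(1) << (prio % 64));
}

/* Lowest set bit = highest priority (lower number means higher priority).
 * The loop runs over a fixed number of words, independent of the task count.
 * Returns -1 if the array is empty. */
int prio_find_first(const struct prio_array *a)
{
    for (int w = 0; w < BITMAP_WORDS; w++)
        if (a->bitmap[w])
            return w * 64 + __builtin_ctzll(a->bitmap[w]);
    return -1;
}

/* When every task in the active array has expired, swap the two pointers. */
void rq_swap_arrays(struct runqueue *rq)
{
    struct prio_array *tmp = rq->active;
    rq->active = rq->expired;
    rq->expired = tmp;
}
```

With tasks queued at priorities 120, 50, and 139, prio_find_first() returns 50 after inspecting at most three bitmap words, however many tasks are queued.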
O(1) Scheduler Problems
 The O(1) scheduler performed well and scaled effortlessly to large systems with many tens or hundreds of processors, but it fell short in two respects:
 Slow response for latency-sensitive applications, i.e. interactive processes on typical desktop systems
 Fair (equal) CPU allocation was not achieved
Current: Completely Fair Scheduler (CFS)
 Since Kernel 2.6.23
 CFS aims at:
 Giving each task a fair share (portion) of the processor time (hence Completely Fair)
 Improving on the interactive performance of the O(1) scheduler on desktops, whereas the O(1) scheduler is ideal for large server workloads
 Introducing a simple, efficient algorithmic approach (a red-black tree) with O(log N) operations, whereas the O(1) scheduler relies on heuristics and its code is large and lacks algorithmic substance
Completely Fair Scheduler (CFS)
CFS – Processor Time Allocation
 CFS selects next the process that has run the least. Rather than assigning each process a fixed time slice, CFS calculates how long a process should run as a function of the total number of runnable processes and its niceness (default minimum granularity: 1 ms)
 Nice values are used to weight the portion of the processor a process receives (by geometric rather than additive differences). Each process runs for a “timeslice” proportional to its weight divided by the total weight of all runnable processes. Assume TARGETED_LATENCY = 20 ms:
Two threads with nice values 0 and 5 (or, equally, 10 and 15):
CFS assigns relative weights of approximately 3 : 1 (under its particular weighting algorithm)
The nice-0 (10) thread receives about 15 ms and the nice-5 (15) thread about 5 ms
Here, the CPU portion is determined only by the relative difference between the nice values.
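The 15 ms / 5 ms split can be reproduced with a few lines of arithmetic. The sketch below is an approximation under stated assumptions: it models the weight as 1024 at nice 0, scaled by a factor of 1.25 per nice level, whereas the kernel uses a precomputed integer table whose entries are close to these values. The function names are hypothetical.

```c
#include <assert.h>

#define TARGETED_LATENCY_MS 20.0    /* value assumed on the slide */

/* Approximate weight (assumption): 1024 at nice 0, a factor of 1.25 per
 * nice level. This is the "geometric difference" between nice values. */
double nice_to_weight(int nice)
{
    double w = 1024.0;

    assert(nice >= -20 && nice <= 19);
    for (int i = 0; i < nice; i++)
        w /= 1.25;                  /* positive nice: lighter weight */
    for (int i = 0; i > nice; i--)
        w *= 1.25;                  /* negative nice: heavier weight */
    return w;
}

/* Slice of one task = latency * its weight / total weight of runnable tasks. */
double slice_ms(int nice, const int *all_nice, int n)
{
    double total = 0.0;

    for (int i = 0; i < n; i++)
        total += nice_to_weight(all_nice[i]);
    return TARGETED_LATENCY_MS * nice_to_weight(nice) / total;
}
```

For nice values {0, 5} this yields roughly 15.06 ms and 4.94 ms, and the pair {10, 15} yields the same split, which illustrates that only the relative difference matters.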
CFS – The Virtual Runtime (vruntime)
 The virtual runtime (vruntime) is the actual runtime (the amount of time spent running) weighted by the task's niceness:
nice = 0, factor = 1: vruntime equals the real run time spent by the task
nice < 0, factor < 1: vruntime grows more slowly than the real run time used
nice > 0, factor > 1: vruntime grows faster than the real run time used
(The virtual runtime is measured in nanoseconds)
 Every time a thread runs for t ns, vruntime += t (weighted by the task's niceness, i.e. priority)
 The vruntime accounts for how long a process has run. CFS then picks the process with the smallest vruntime.
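A small simulation shows how weighted vruntime accounting plus "pick the smallest vruntime" produces proportional CPU shares. It is a sketch under assumptions: the weights 1024 (nice 0) and 335 (nice 5) are taken from the kernel's weight table rather than from the slides, the fixed tick length is illustrative, and a linear scan stands in for the kernel's red-black tree.

```c
#include <assert.h>
#include <stdint.h>

#define NICE_0_WEIGHT 1024u         /* weight of a nice-0 task (factor = 1) */

struct task {
    uint64_t weight;                /* load weight derived from the nice value */
    uint64_t vruntime;              /* weighted runtime, in ns */
    uint64_t runtime;               /* real CPU time received, in ns */
};

/* Charge delta_ns of real execution time to a task.
 * factor = NICE_0_WEIGHT / weight, so heavier tasks accrue vruntime slower. */
void account(struct task *t, uint64_t delta_ns)
{
    assert(t->weight > 0);
    t->runtime  += delta_ns;
    t->vruntime += delta_ns * NICE_0_WEIGHT / t->weight;
}

/* Pick the task with the smallest vruntime (linear scan for brevity). */
struct task *pick_next(struct task *tasks, int n)
{
    struct task *best = &tasks[0];

    assert(n > 0);
    for (int i = 1; i < n; i++)
        if (tasks[i].vruntime < best->vruntime)
            best = &tasks[i];
    return best;
}

/* Run the simulated CPU for `ticks` ticks of tick_ns each. */
void run_ticks(struct task *tasks, int n, int ticks, uint64_t tick_ns)
{
    for (int i = 0; i < ticks; i++)
        account(pick_next(tasks, n), tick_ns);
}
```

Running two tasks with weights 1024 and 335 for a few thousand 1 ms ticks gives the first task roughly three times the real runtime of the second, while their vruntime values stay close together.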
CFS – Process Selection
 CFS selects the process with the minimum virtual runtime, i.e. vruntime
 CFS uses a red-black tree (rbtree, a type of self-balancing binary search tree) to manage the list of runnable processes and to find the process with the smallest vruntime efficiently
 The process with the smallest vruntime is the leftmost node in the rbtree.
CFS – Process Just Created or Awakened
 A new process is created
The new process is assigned the current minimum virtual runtime (adjusted) and inserted into the rbtree
 A process is awakened from blocking
vruntime = max(old vruntime, current min_vruntime - adjusted TARGETED_LATENCY)
This prevents a process that has blocked for a long time from monopolizing the CPU
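The wake-up rule above is a single max() operation. The sketch below uses hypothetical function names and passes the adjusted latency in as a plain parameter; the signed-difference comparison is an assumption added so that the result stays sensible when the subtraction wraps around.

```c
#include <assert.h>
#include <stdint.h>

/* Later of two vruntime values. Comparing through a signed difference keeps
 * the result sensible even if one value has wrapped around. */
uint64_t max_vruntime(uint64_t a, uint64_t b)
{
    return ((int64_t)(b - a) > 0) ? b : a;
}

/* vruntime of a task waking from a block:
 * max(old vruntime, min_vruntime - adjusted latency). */
uint64_t place_on_wakeup(uint64_t old_vruntime, uint64_t min_vruntime,
                         uint64_t adjusted_latency)
{
    return max_vruntime(old_vruntime, min_vruntime - adjusted_latency);
}
```

A task that slept for a long time is pulled forward to just behind min_vruntime, so it gets a modest head start but cannot claim the CPU for its whole backlog. A task that blocked only briefly keeps its own, larger vruntime.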
CFS – Group Scheduling
 In plain CFS, if there are 25 runnable processes, CFS allocates 4% to each (assuming equal weights). If 20 belong to user A and 5 belong to user B, user A receives 80% of the CPU and user B is at an inherent disadvantage.
 Group scheduling first tries to be fair to the groups and then to the individuals within each group, i.e. 50% to user A and 50% to user B.
Thus A’s 50% is divided fairly among A’s 20 tasks (2.5% each), and B’s 50% is divided fairly among B’s 5 tasks (10% each).
CFS – Run Queue (Red-Black Tree)
 Tasks are maintained in a time-ordered (i.e.
vruntime) red-black tree for each CPU
 Red-Black Tree: Self-balancing binary search tree
Balance is preserved by painting each node with one of two colors so that certain properties are satisfied. When the tree is modified, it is rearranged and repainted to restore the coloring properties.
The balancing guarantees that no leaf is more than twice as deep as any other, so the tree operations (search, insertion, deletion, recoloring) can be performed in O(log N) time
 CFS will switch to the leftmost task in the tree, that
is, the one with the lowest virtual runtime (most
need for CPU) to maintain fairness.
CFS – Red-Black Tree
(www.ibm.com/developerworks/linux/library/l-completely-fair-scheduler/)
Interrupt Handling
 Interrupts (Hardware)
Asynchronous
Device -> Interrupt Controller -> CPU -> Interrupt Handlers
Each interrupt line has a unique value: the IRQ (Interrupt ReQuest) number
On a PC, IRQ 0 is the timer interrupt and IRQ 1 is the keyboard interrupt
 Exceptions (Software Interrupts)
Synchronous
Fault (segmentation fault, page fault, …)
Trap (system call)
Programming exception
Top Halves and Bottom Halves
 Top Half
Interrupts disabled (the current line, or all local interrupts)
Runs immediately
ACKs and resets the hardware, copies data from the hardware buffer
 Bottom Half
Interrupts enabled
Runs later (deferred)
Detailed processing work
 Example: Network Card
Top half: alerts the kernel (to optimize network throughput), copies packets to memory, ACKs the network hardware, and readies the network card for more packets
The rest is left to the bottom half
Top-Half
 Writing an Interrupt Handler (for a vectored interrupt table)
 Registering an Interrupt Handler
int request_irq(unsigned int irq, irq_handler_t handler, unsigned long irqflags, const char *devname, void *dev_id)
 When the kernel receives an interrupt
It looks up the IRQ number in the interrupt table and invokes, in sequence, each handler registered on that line; each handler checks whether its own device raised the interrupt
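Dispatch on a shared interrupt line can be modeled in user space. This is a toy model, not kernel code: every name prefixed with model_ is hypothetical, and only the IRQ_NONE / IRQ_HANDLED return convention mirrors the kernel's. Each registered handler is called in turn and uses its dev_id to decide whether its own device raised the interrupt.

```c
#include <assert.h>
#include <stddef.h>

/* Return values follow the kernel's convention for interrupt handlers. */
typedef enum { IRQ_NONE = 0, IRQ_HANDLED = 1 } model_irqreturn_t;
typedef model_irqreturn_t (*model_handler_t)(int irq, void *dev_id);

#define MAX_ACTIONS 4

struct model_irq_line {                 /* one entry of the interrupt table */
    model_handler_t handler[MAX_ACTIONS];
    void *dev_id[MAX_ACTIONS];
    int nr;
};

/* Toy counterpart of request_irq(): register a handler on a shared line. */
int model_request_irq(struct model_irq_line *line, model_handler_t h, void *dev_id)
{
    assert(h != NULL);
    if (line->nr == MAX_ACTIONS)
        return -1;
    line->handler[line->nr] = h;
    line->dev_id[line->nr] = dev_id;
    line->nr++;
    return 0;
}

/* Invoke every handler on the line; return how many claimed the interrupt. */
int model_dispatch(struct model_irq_line *line, int irq)
{
    int handled = 0;

    for (int i = 0; i < line->nr; i++)
        if (line->handler[i](irq, line->dev_id[i]) == IRQ_HANDLED)
            handled++;
    return handled;
}

struct model_device { int pending; int serviced; };

/* A handler first checks whether its own device raised the interrupt. */
model_irqreturn_t model_handler(int irq, void *dev_id)
{
    struct model_device *dev = dev_id;

    (void)irq;
    if (!dev->pending)
        return IRQ_NONE;                /* not ours: let the next handler look */
    dev->pending = 0;                   /* "ACK" the device */
    dev->serviced++;
    return IRQ_HANDLED;
}
```

The dev_id argument is what lets several devices share one line: the same handler function can be registered twice with different device pointers.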
Bottom Halves and Deferring Work
 Softirqs – routines that run in interrupt context (cannot block)
For time-critical, highly concurrent work.
Softirq handlers run right after the top half that raised the softirq.
 Tasklets – special softirqs, intended for work with less demanding timing, concurrency, and locking requirements
They have a simpler interface and implementation.
 Work Queues – a different form of deferring work
Work queues are run by kernel threads in process context and are thus schedulable. Therefore, if the deferred work needs to sleep (to allocate a lot of memory, obtain semaphores, …), work queues should be used. Otherwise, softirqs or tasklets are used.
Bottom Halves - ksoftirqd
 When the system is overwhelmed with softirq activity, low-priority user processes cannot run and may starve. Thus a per-CPU kernel thread, ksoftirqd (running at the lowest priority, i.e. nice value 19), is awakened.
 Because ksoftirqd handles the softirqs at this low priority when the system is busy, user processes are relieved from starvation.
Which Bottom Half to Use
Bottom Half    Context      Inherent Serialization
Softirq        Interrupt    None
Tasklet        Interrupt    Against the same tasklet
Work Queues    Process      None
 If the deferred work needs to run in process context: work queue
 Highest overhead: work queue (kernel thread, context switch)
 Ease of use: work queue
 Fastest, highly threaded, timing-critical use: softirq
 Same as softirq, but with a simple interface and ease of use: tasklet
 Normal driver writers have two choices:
Does the work need a schedulable entity (e.g. to sleep while waiting for events)? If so, a work queue is the only choice. Otherwise, tasklets are preferred, unless scalability is a concern, in which case softirqs (highly threaded) are used.
Kernel Synchronization
 The kernel has concurrency (threads) and needs synchronization
 Code safe from concurrent access – terminology:
Interrupt safe (from interrupt handlers)
SMP safe
Preempt safe (kernel preemption)
 Mechanisms: spin lock, R/W spin lock, semaphore, R/W semaphore, sequential lock, completion variable
Spin Locks
 Spin locks: lightweight
For short durations, to save the context-switch overhead
 Spin Locks and Top Halves
The kernel must disable local interrupts before obtaining a spin lock that an interrupt handler (IH) also uses. Otherwise the IH may interrupt the kernel code and attempt to acquire the same lock while the kernel holds it; the IH would then spin forever, because the lock holder cannot run until the IH returns.
 Spin Locks and Bottom Halves
The kernel must disable bottom halves before obtaining a spin lock that a bottom half also uses. Otherwise the bottom half may preempt the kernel code and attempt to acquire the same lock while the kernel holds it, and spin in the same way.
Reader-Writer Spin Locks
 Shared/Exclusive Locks
 Reader and Writer Paths
Reader path:
read_lock(&my_rwlock)
/* critical region */
read_unlock(…)
Writer path:
write_lock(…)
/* critical region */
write_unlock(…)
 For reader-writer spin locks, Linux 2.6 favors readers over writers (writers may starve)
Semaphores
 Semaphores are for long waits
 Semaphores are for process context (the caller can sleep)
 A spin lock cannot be held while acquiring a semaphore (the acquisition may sleep)
 Kernel code holding a semaphore can be interrupted or preempted
 Using semaphores: down(), up()
Reader-Writer Semaphore
 Reader-writer flavor of semaphores
 Reader-writer semaphores are mutexes
 Reader-writer semaphore locks use uninterruptible sleep
 As with semaphores, the following are provided:
down_read_trylock(), down_write_trylock()
down_read(), down_write(), up_read(), up_write()
Completion variables
 One task signals another task that an event has occurred
One task waits on the completion variable while the other task performs some work. When that work completes, the worker uses the completion variable to wake up the waiting task
init_completion(struct completion *) or DECLARE_COMPLETION(mr_comp)
wait_for_completion(struct completion *)
complete(struct completion *)
Sequential Locks
 A simple mechanism for reading and writing shared data by maintaining a sequence counter
Write: lock obtained -> sequence number incremented; unlock -> sequence number incremented again
Read: the sequence number is read before and after the read
The sequence number must be even before the read and unchanged at the end; otherwise the read is retried
 A writer always succeeds (if there are no other writers); readers never block
 Favors writers over readers
 Readers do not affect the writer’s locking
 Seq locks provide a very lightweight and scalable lock for use with many readers and few writers
Sequential Locks (Cont.)
 Example:
seqlock_t mr_seq_lock; seqlock_t *s1 = &mr_seq_lock;
WRITE:
write_seqlock(s1);      /* { spin_lock(&s1->lock); ++s1->sequence; smp_wmb(); } */
/* write data */
write_sequnlock(s1);    /* { smp_wmb(); s1->sequence++; spin_unlock(&s1->lock); } */
READ:
do {
    seq = read_seqbegin(s1);        /* { ret = s1->sequence; smp_rmb(); return ret; } */
    /* read data */
} while (read_seqretry(s1, seq));   /* { smp_rmb(); return (seq & 1) | (s1->sequence ^ seq); } */
 Pending writers continually cause the read loop to repeat until the writers are done.
Ordering and Barriers
 Both the compiler and the CPU can reorder reads/writes:
Compiler: for optimization; CPU: for performance, i.e. pipelining
 Memory barriers instruct the CPU not to reorder reads/writes
The barrier() call instructs the compiler not to reorder reads/writes
 Memory Barrier and Compiler Barrier Methods
barrier()                  // compiler barrier - load/store
smp_rmb(), wmb(), mb()     // memory barriers
Intel x86 processors never reorder writes
Memory Management
 Main Memory: three (3) parts
Kernel memory (never paged out)
Kernel memory for the memory map (never paged out)
Pageable page frames (user pages, paging cache, etc.)
 Memory Map: mem_map
An array with one page descriptor for each page frame in the system; each descriptor points to the address space the frame belongs to (if not free) or is linked into the list of free frames
Memory Management (Cont.)
 Physical Memory
 For the kernel (never paged out)
 For the memory map table (never paged out)
For page frame to virtual page mapping
For maintaining the free page list
 For pageable page frames
User pages and paging caches
 Arbitrary-size, contiguous kernel memory
kmalloc(…)
Memory Allocation Mechanisms
 Page allocator: buddy algorithm (chunks of 2**i pages are split or combined); e.g. a request for a 65-page chunk is served with a 128-page chunk
 Slab allocator: carves chunks (obtained from the buddy algorithm) into slabs, each one or more physically contiguous pages
A cache (one for each kind of kernel data structure) consists of one or more slabs and is populated with kernel objects (TCBs, semaphores)
Example: to allocate a new task_struct, the kernel looks in the object cache and tries, in order: a partially full slab, an empty slab, then a new slab
 kmalloc(): similar to user-space malloc(). It returns a pointer to a region of physically contiguous memory that is at least the requested ‘size’ bytes in length.
 vmalloc(): allocates a chunk of physical memory that may not be contiguous and fixes up the page tables to map the memory into a contiguous chunk of the logical address space.
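The buddy allocator's rounding of a 65-page request up to a 128-page chunk comes down to finding the smallest power of two that fits. The helpers below are a sketch with hypothetical names; the upper bound on the request size is an arbitrary guard for the illustration, not a kernel limit.

```c
#include <assert.h>

/* Smallest order i such that a chunk of 2**i pages holds npages pages. */
unsigned buddy_order(unsigned long npages)
{
    unsigned order = 0;

    /* The upper bound is an arbitrary guard that keeps the shift in range. */
    assert(npages > 0 && npages <= (1UL << 30));
    while ((1UL << order) < npages)
        order++;
    return order;
}

/* Number of pages actually handed out for a request of npages pages. */
unsigned long buddy_chunk_pages(unsigned long npages)
{
    return 1UL << buddy_order(npages);
}
```

For a 65-page request the order is 7, so 128 pages are handed out and 63 of them are internal fragmentation. This waste is one reason the slab allocator exists for small kernel objects.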
Virtual Memory
 Virtual Address Space
Homogeneous, contiguous, page-aligned areas (text, mapped files)
Page size: 4 KB (Pentium), 8 KB (Alpha); Linux also supports 4 MB pages
 Memory Descriptor
A process address space is represented by an mm_struct (pointed to by the mm field of task_struct)
struct mm_struct {
    struct vm_area_struct *mmap;    // list of memory areas – text, data,…
    pgd_t *pgd;                     // page global directory
    atomic_t mm_users;              // addr. space users – 2 for 2 threads
    atomic_t mm_count;              // primary reference count
    struct list_head mmlist;        // list of all mm_struct
    …                               // lock, semaphore…
    ….                              // start/end addr. of code, data, heap, stack
}
Virtual Memory - Paging
 Four-level paging (for 64-bit architectures)
Global, upper, and middle directories, and the page table
The Pentium uses two-level paging (the global directory points to the page table)
 Demand paging (no pre-paging)
Only the user structure (PCB) and the page tables need to be in memory
Page daemon (process 2): awakened periodically or on demand to check ‘free’
Page Replacement
 Modified Version of LRU Scheme
One particular failure of the LRU strategy (besides
its cost of implementation) is that many files are
accessed once and then never again. Putting
them at the top of the LRU list is thus not optimal.
In general, the kernel has no way of knowing that
a file is going to be accessed only once.
However, it does know how many times it has
been accessed in the past. This leads to a
modified version of LRU i.e. Two-List Strategy as
follows:
Page Replacement (Cont.)
 Two-list strategy (modified version of LRU)
Active list (hot) and inactive list (reclaim candidates)
When first allocated, a page is placed on the inactive list
If it is referenced while on that list, it is moved to the active list
Both lists are maintained in a pseudo-LRU manner: items are added to the tail and removed from the head, as in a queue.
 List balancing: if the active list becomes too large, items are moved from the active list back to the inactive list for potential eviction. The scan starts from the head item:
The reference bit is checked. If it is set, it is reset, the item is moved back onto the list, and the next page is checked. Otherwise the item is moved to the inactive list (this resembles a Clock algorithm)
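The promotion and demotion rules of the two-list strategy can be captured in a small state model. This is a deliberately simplified sketch: all names are hypothetical, the pages are scanned in index order instead of being kept on real queues, and only the state transitions described above are modeled.

```c
#include <assert.h>

#define NPAGES 8

enum page_list { PAGE_FREE, PAGE_INACTIVE, PAGE_ACTIVE };

struct page_model {
    enum page_list list;    /* which list the page is on */
    int referenced;         /* reference bit, used while on the active list */
};

/* A page is allocated on first use and promoted if it is used again. */
void page_touch(struct page_model *pages, int i)
{
    assert(i >= 0 && i < NPAGES);
    switch (pages[i].list) {
    case PAGE_FREE:         /* first allocation: start on the inactive list */
        pages[i].list = PAGE_INACTIVE;
        pages[i].referenced = 0;
        break;
    case PAGE_INACTIVE:     /* referenced while inactive: promote */
        pages[i].list = PAGE_ACTIVE;
        pages[i].referenced = 0;
        break;
    case PAGE_ACTIVE:       /* already hot: just set the reference bit */
        pages[i].referenced = 1;
        break;
    }
}

/* Balancing pass over the active pages (Clock-like second chance). */
void lists_balance(struct page_model *pages)
{
    for (int i = 0; i < NPAGES; i++) {
        if (pages[i].list != PAGE_ACTIVE)
            continue;
        if (pages[i].referenced)
            pages[i].referenced = 0;        /* second chance: stay active */
        else
            pages[i].list = PAGE_INACTIVE;  /* demote: reclaim candidate */
    }
}

/* Reclaim one inactive page; returns its index, or -1 if there is none. */
int reclaim_one(struct page_model *pages)
{
    for (int i = 0; i < NPAGES; i++) {
        if (pages[i].list == PAGE_INACTIVE) {
            pages[i].list = PAGE_FREE;
            return i;
        }
    }
    return -1;
}
```

A page touched exactly once never leaves the inactive list and is reclaimed first, while a page touched twice is protected until a balancing pass finds its reference bit clear.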
Page Replacement (Cont.)
 A Global Policy
All reclaimable pages are contained in just
two lists and pages belonging to any
process may be reclaimed, rather than just
those belonging to a faulting process
 The two-list strategy enables simpler,
pseudo-LRU semantics to perform well
Solves the only-used-once failure in a
classical LRU scheme
The Filesystem
 To the user, Linux’s file system appears as a hierarchical directory
tree obeying UNIX semantics
 Internally, the kernel hides implementation details and manages
the multiple different file systems via an abstraction layer, that is,
the virtual file system (VFS)
 The Linux VFS is designed around object-oriented principles:
write() -> sys_write() (VFS) -> filesystem’s write method -> physical media
 VFS Objects
Primary: superblock, inode (cached), dentry (cached), and file objects
An operations object is contained within each primary object: super_operations, inode_operations, dentry_operations, file_operations
Other VFS objects: file_system_type, vfsmount, and three per-process structures: files_struct, fs_struct, and namespace
File System and Device Drivers
[Figure: layered diagram. User mode: user applications, libraries. Kernel mode: Virtual File System, file subsystem, buffer/page cache, character device driver, block device driver, hardware control.]