W4118 Operating Systems Instructor: Junfeng Yang

W4118 Operating Systems
Instructor: Junfeng Yang
Logistics

Homework 2 grading: by demos



Each TA grades 1/3 of all groups
TA will set up a time with you using doodle
During grading:
• You apply patch in TA’s VM (same as the class VM)
• You compile, run testcase, etc
• TA may ask question to anyone in your group
– Wrong answer  all group members lose points
– So make sure everyone understands everything of the
assignment

Homework 3 out


Implementing semaphore
Implementing barrier (NOTE: different from
the memory barriers we talk about today)
2
Barriers
Use of a barrier. (a) Processes approaching a barrier. (b) All
processes but one blocked at the barrier. (c) When the last
process arrives at the barrier, all of them are let through.
Example from Tanenbaum, Modern Operating Systems 3 e, (c) 2008 Prentice-Hall, Inc. All rights reserved. 0-13-6006639
3
Last lecture

Synchronization


Locks: to avoid spin-wait or context switch
overhead, use queues to control who gets lock next
Semaphore
• Purposes: mutual exclusion and scheduling order
• Solving producer-consumer problem with semaphores

Monitor
• Concurrency meets OOP
• Condition variables to control order
• Solving producer-consumer problem with monitors
4
Today: synchronization in Linux


Low-level atomic operations:

Memory barrier

Atomic operations

Interrupt/softirq disabling/enabling

Spin locks
• avoids compile-time, or runtime instruction re-ordering
• memory bus lock, read-modify-write ops
• Local, global
• general, read/write
High-level synchronization primitives:



Semaphores
• general, read/write
Mutex
Completion
5
Goals



Different flavors of synchronization
primivites and when to use them, in the
context of Linux kernel
How synchronization primitives are
implemented for real
“Portable” tricks: useful in other context as
well (when you write a high performance
server)

Optimize for common case
6
Synchronization is complex and subtle


Already learned this from the implementation
examples we’ve seen before
Kernel synchronization is even more complex
and subtle


Higher requirements: performance, protection …
Code heavily optimized, “fast path” often in
assembly, fit within one cache line
7
Choosing Synch Primitives

Avoid synch if possible! (clever instruction ordering)


Use atomics or rw spinlocks if possible



Low overhead: fast path is a few instructions
In interrupt context, use spinlock_irq* if possible
Use semaphores if you need to sleep



Example: inserting in linked list (needs memory barrier
still)
Can’t sleep in interrupt context
Don’t sleep holding a spinlock! Why?
Complicated matrix of choices for protecting data
structures accessed by deferred functions
8
Architectural Dependence




The implementation of the synchronization
primitives is extremely architecture dependent.
This is because only the hardware can guarantee
atomicity of an operation.
Each architecture must provide a mechanism for
doing an operation that can examine and modify a
storage location atomically.
Some architectures do not guarantee atomicity,
but inform whether the operation attempted was
atomic.
9
Memory Barriers: Motivation

The compiler can:



Reorder code as long as it correctly maintains data flow
dependencies within a function and with called functions
Reorder the execution of code to optimize performance
The processor can:



Reorder instruction execution as long as it correctly
maintains register flow dependencies
Reorder memory modification as long as it correctly
maintains data flow dependencies
Reorder the execution of instructions (for performance
optimization)
10
Memory Barriers: Definition


Memory Barriers are used to prevent a processor and/or the
compiler from reordering instruction execution and memory
modification.
Memory Barriers are instructions to hardware and/or compiler
to complete all pending accesses before issuing any more



read memory barrier – acts on read requests
write memory barrier – acts on write requests
Intel –


certain instructions act as barriers: lock, iret, control regs
rmb – asm volatile("lock;addl $0,0(%%esp)":::"memory")
• add 0 to top of stack with lock prefix

wmb – Intel never re-orders writes, just for compiler
11
Barrier Operations








barrier – prevent only compiler reordering
mb – prevents load and store reordering
rmb – prevents load reordering
wmb – prevents store reordering
smp_mb – prevent load and store reordering only in
SMP kernel
smp_rmb – prevent load reordering only in SMP
kernels
smp_wmb – prevent store reordering only in SMP
kernels
set_mb – performs assignment and prevents load and
store reordering
12
Atomic Operations

Many instructions not atomic in hardware (smp)

Read-modify-write instructions: inc, test-and-set, swap
unaligned memory access

even i++ is not necessarily atomic!





Compiler may not generate atomic code
If the data that must be protected is a single word,
atomic operations can be used. These functions
examine and modify the word atomically.
The atomic data type is atomic_t.
Intel implementation


lock prefix byte 0xf0 – locks memory bus
Speed: memory speed, (roughly same as
load/store)
13
Atomic Operations
ATOMIC_INIT – initialize an atomic_t variable
atomic_read – examine value atomically
atomic_set – change value atomically
atomic_inc – increment value atomically
atomic_dec – decrement value atomically
atomic_add - add to value atomically
atomic_sub – subtract from value atomically
atomic_inc_and_test – increment value and test for zero
atomic_dec_and_test – decrement value and test for zero
atomic_sub_and_test – subtract from value and test for zero
atomic_set_mask – mask bits atomically
atomic_clear_mask – clear bits atomically
14
Serializing with Interrupts



Basic primitive in original UNIX
Doesn’t protect against other CPUs
Intel: “interrupts enabled bit”


cli to clear (disable), sti to set (enable)
Enabling is often wrong; need to restore


local_irq_save()
local_irq_restore()
• Why?
15
Interrupt Operations

Services used to serialize with interrupts are:

Dealing with the full interrupt state of the system
is officially discouraged. Locks should be used.
local_irq_disable - disables interrupts on the current
CPU
local_irq_enable - enable interrupts on the current CPU
local_save_flags - return the interrupt state of the
processor
local_restore_flags - restore the interrupt state of the
processor
16
Disabling Deferred Functions



Disabling interrupts disables deferred functions
Possible to disable deferred functions but not all
interrupts
Operations (macros):


local_bh_disable()
local_bh_enable()
17
Spin Locks




A spin lock is a data structure (spinlock_t ) that is
used to synchronize access to critical sections.
Only one thread can be holding a spin lock at any
moment. All other threads trying to get the lock
will “spin” (loop while checking the lock status).
Spin locks should not be held for long periods
because waiting tasks on other CPUs are spinning,
and thus wasting CPU execution time.
Why use spin locks?
18
__raw_spin_lock_string
1: lock; decb %0
jns
3f
2: rep; nop
cmpb $0, %0
jle 2b
jmp 1b
3:
#
#
#
#
#
#
#
atomically decrement
if clear sign bit jump forward to 3
wait
spin – compare to 0
go back to 2 if <= 0 (locked)
unlocked; go back to 1 to try again
we have acquired the lock …
From linux/include/asm-i386/spinlock.h
spin_unlock merely writes 1 into the lock field.
19
Spin Lock Operations

Functions used to work with spin locks:
spin_lock_init – initialize a spin lock before using
it for the first time
spin_lock – acquire a spin lock, spin waiting if it is
not available
spin_unlock – release a spin lock
spin_unlock_wait – spin waiting for spin lock to
become available, but don't acquire it
spin_trylock – acquire a spin lock if it is currently
free, otherwise return error
spin_is_locked – return spin lock state
20
Spin Locks & Interrupts

The spin lock services also provide
interfaces that serialize with interrupts
(on the current processor):
spin_lock_irq - acquire spin lock and disable
interrupts
spin_unlock_irq - release spin lock and reenable
spin_lock_irqsave - acquire spin lock, save
interrupt state, and disable
spin_unlock_irqrestore - release spin lock and
restore interrupt state

Files:


include/linux/spinlock.h
kernel/spinlock.c
21
RW Spin Locks





A read/write spin lock is a data structure that
allows multiple tasks to hold it in "read" state or
one task to hold it in "write" state (but not both
conditions at the same time).
This is convenient when multiple tasks wish to
examine a data structure, but don't want to see it
in an inconsistent state.
A lock may not be held in read state when
requesting it for write state.
The data type for a read/write spin lock is
rwlock_t. (include/asm-i386/spinlock.h)
Writers can starve waiting behind readers.
22
RW Spin Lock Operations

Several functions are used to work with
read/write spin locks:
rwlock_init – initialize a read/write lock before using it
for the first time
read_lock – get a read/write lock for read
write_lock – get a read/write lock for write
read_unlock – release a read/write lock that was held
for read
write_unlock – release a read/write lock that was held
for write
read_trylock, write_trylock – acquire a read/write lock
if it is currently free, otherwise return error
23
RW Spin Locks & Interrupts

The read/write lock services also provide
interfaces that serialize with interrupts
(on the current processor):
read_lock_irq - acquire lock for read and disable
interrupts
read_unlock_irq - release read lock and reenable
read_lock_irqsave - acquire lock for read, save
interrupt state, and disable
read_unlock_irqrestore - release read lock and
restore interrupt state

Corresponding functions for write exist as
well (e.g., write_lock_irqsave).
24
Semaphores



A semaphore is a data structure that is used to
synchronize access to critical sections or other
resources.
A semaphore allows a fixed number of tasks
(generally one for critical sections) to "hold" the
semaphore at one time. Any more tasks
requesting to hold the semaphore are blocked (put
to sleep).
A semaphore can be used for serialization only in
code that is allowed to block.
25
Semaphore Operations

Operations for manipulating semaphores:
up – release the semaphore
down – get the semaphore (can block)
down_interruptible – get the semaphore, can be
woken up if interrupt arrives
down_trylock – try to get the semaphore without
blocking, otherwise return an error

Files:



include/asm-i386/semaphore.h
arch/i386/kernel/semaphore.c
Goal: optimize for uncontended case
26
Semaphore Structure

Struct semaphore

• > 0: free;
• = 0: in use, no waiters;
• < 0: in use, waiters

wait: wait queue
sleepers:

wait: wait queue


count (atomic_t):
• 0 (none),
• 1 (some), occasionally 2
implementation requires lower-level
synch

struct semaphore
atomic_t count
int sleepers
wait_queue_head_t wait
lock
prev
next
atomic updates, spinlock, interrupt
disabling
27
Semaphores

optimized assembly code for uncontended case (down())


up() is easy


atomically decrement; continue
contended down() is really complex!



atomically increment; wake_up() if necessary
uncontended down() is easy


C code for slower “contended” case (__down())
basically increment sleepers and sleep
loop because of potentially concurrent ups/downs
still in down() path when lock is acquired
28
A hypothetical down()
spin_lock_irqsave(&sem->wait.lock, flags);
add_wait_queue_exclusive_locked(&sem->wait, &wait);
for (;;;) {
my_count = sem->count--;
If (my_count >= 0) {
break;
}
spin_unlock_irqrestore(&sem->wait.lock, flags);
tsk->state = TASK_UNINTERRUPTIBLE;
schedule();
spin_lock_irqsave(&sem->wait.lock, flags);
}
remove_wait_queue_locked(&sem->wait, &wait);
wake_up_locked(&sem->wait);
spin_unlock_irqrestore(&sem->wait.lock, flags);
Note: not real code!
29
The Real down(), down_failed()
inline down:
movl $sem, %ecx # why does this work?
lock; decl (%ecx)# atomically decr sem count
jns 1f
# if not negative jump to 1
lea %ecx, %eax
# move into eax
call __down_failed
#
1:
# we have the semaphore
down_failed:
pushl %edx
pushl %ecx
call __down
popl %ecx
popl %edx
ret
# push edx onto stack (C)
# push ecx onto stack
# call C function
# pop ecx
# pop edx
from include/asm-i386/semaphore.h and arch/i386/kernel/semaphore.c)
30
__down()
tsk->state = TASK_UNINTERRUPTIBLE;
spin_lock_irqsave(&sem->wait.lock, flags);
add_wait_queue_exclusive_locked(&sem->wait, &wait);
sem->sleepers++;
for (;;) {
int sleepers = sem->sleepers;
/*
* Add "everybody else" into it. They aren't playing,
* because we own the spinlock in the wait_queue head
*/
if (!atomic_add_negative(sleepers - 1, &sem->count)) {
sem->sleepers = 0;
break;
}
sem->sleepers = 1;
/* us - see -1 above */
spin_unlock_irqrestore(&sem->wait.lock, flags);
schedule();
spin_lock_irqsave(&sem->wait.lock, flags);
tsk->state = TASK_UNINTERRUPTIBLE;
}
remove_wait_queue_locked(&sem->wait, &wait);
wake_up_locked(&sem->wait);
spin_unlock_irqrestore(&sem->wait.lock, flags);
tsk->state = TASK_RUNNING;
From
linux/
lib/
semaphoresleepers.c
31
RW Semaphores






A rw_semaphore is a semaphore that allows either one
writer or any number of readers (but not both at the
same time) to hold it.
Any writer requesting to hold the rw_semaphore is
blocked when there are readers holding it.
A rw_semaphore can be used for serialization only in
code that is allowed to block. Both types of
semaphores are the only synchronization objects that
should be held when blocking.
Writers will not starve: once a writer arrives, readers
queue behind it
Increases concurrency; introduced in 2.4
Files:

include/linux/rwsem.h

include/asm-i386/rwsem.h
32

lib/rwsem.c
RW Semaphore Operations

Operations for manipulating semaphores:
up_read – release a rw_semaphore held for read.
up_write – release a rw_semaphore held for
write.
down_read – get a rw_semaphore for read (can
block, if a writer is holding it)
down_write – get a rw_semaphore for write (can
block, if one or more readers are holding it)
33
More RW Semaphore Ops

Operations for manipulating semaphores:
down_read_trylock – try to get a rw_semaphore
for read without blocking, otherwise return an
error
down_write_trylock – try to get a rw_semaphore
for write without blocking, otherwise return an
error
downgrade_write – atomically release a
rw_semaphore for write and acquire it for
read (can't block)
34
Mutexes


A mutex is a data structure that is also used to
synchronize access to critical sections or other
resources, introduced in 2.6.16.
Why? (linux kernel mailing list, Documentation/mutexdesign.txt)



90% of semaphores in kernel are used as mutexes, 9% of
semaphores should be spin_locks (Andrew Morton).
Slow paths are more critical for highly contended
semaphores on SMP
Mutexes are simpler (lighter weight)
• tighter code
• slightly faster, better scalability
• no fastpath tradeoffs
• debug support – strict checking of adhering to semantics
35
Mutexes (Continued)

Mutex -- why not?



not the same as semaphores
cannot be used from interrupt context
cannot be unlocked from a different context
than that in which they were locked
36
Mutex Operations

Operations for manipulating mutexes:
mutex_unlock – release the mutex
mutex_lock – get the mutex (can block)
mutex_lock_interruptible – get the mutex, but
allow interrupts
mutex_trylock – try to get the mutex without
blocking, otherwise return an error
mutex_is_locked – determine if mutex is locked
37
Completions

Slightly higher-level, FIFO semaphores


Up/down may execute concurrently


Solves a subtle synch problem on SMP
This is a good thing (when possible)
Operations: complete(), wait_for_complete()



Spinlock and wait_queue
Spinlock serializes ops
Wait_queue enforces FIFO
38
Backup Slides
The Big Reader Lock


Reader optimized RW spinlock
RW spinlock suffers cache contention


Per-CPU, cache-aligned lock arrays


One for reader portion, another for writer portion
To read: set bit in reader array, spin on writer


On lock and unlock because of write to rwlock_t
Acquire when writer lock free; very fast!
To write: set bit and scan ALL reader bits

Acquire when reader bits all free; very slow!
40
Kernel Synchronization



A piece of code is considered reentrant if two
tasks can be running in the code at the same time
without behaving incorrectly.
In the uniprocessor (UP) kernel, when a task is
running anywhere in the kernel, no other task can
be executing anywhere in the kernel.
A piece of code can allow another task to run
somewhere in the kernel by blocking or sleeping
(saving the task state and running other tasks) in
the kernel, waiting for some event to occur.
41
MP Kernel Synchronization




On an MP system, multiple tasks can be running in
the kernel at the same time.
Thus, all code must assume that another task can
attempt to run in the code.
Any code segment that cannot tolerate multiple
tasks executing in it is called a critical section.
A critical section must block all tasks but one that
attempt to execute it.
42
The Big Kernel Lock (BKL)

For serialization that is not performance
sensitive, the big kernel lock (BKL) was used





This mechanism is historical and should generally be
avoided.
The function lock_kernel gets the big kernel lock.
The function unlock_kernel releases the big kernel
lock.
The function kernel_locked returns whether the
kernel lock is currently held by the current task.
The big kernel lock itself is a simple lock called
kernel_flag.
43
When Synch Is Not Necessary

simplifying assumptions (2.4 uniprocessor)


kernel is non pre-emptive
process holds cpu while in kernel
• unless it blocks, relinquishes cpu, or interrupt


locking only needed against interrupts
smp kernels require locking

smp locks compile out for uniprocessor kernels
• no overhead!

2.5 introduces limited kernel pre-emption
 to reduce scheduler, interrupt latency


make Linux more responsive for real-time apps
introduces preemption points in kernel code
44
What about ‘cli’?




Disabling interrupts will stop time-sharing
among tasks on a uniprocessor system
But it would be unfair in to allow this in a
multi-user system (monopolize the CPU)
So cli is a privileged instruction: it cannot
normally be executed by user-mode tasks
It won’t work for a multiprocessor system
45
Special x86 instructions


Need to use x86 assembly-language to
implement atomic sync operations
Several instruction-choices are possible, but
‘btr’ and ‘bts’ are simplest to use:



‘btr’ means ‘bit-test-and-reset’
‘bts’ means ‘bit-test-and’set’
Syntax and semantics:


asm(“ btr $0, lock “);
asm(“ bts $0, lock “);
// acquire the lock
// release the lock
46
The x86 ‘lock’ prefix

In order for the ‘btr’ instruction to perform
an ‘atomic’ update (when multiple CPUs are
using the same bus to access memory
simultaneously), it is necessary to insert an
x86 ‘lock’ prefix, like this:
asm(“ spin:

lock
btr $0, mutex “);
This instruction ‘locks’ the shared system-bus
during this instruction execution -- so another
CPU cannot intervene
47
Reentrancy


More than one process (or processor) can be
safely executing reentrant concurrently
It needs to obey two cardinal rules:


It contains no ‘self-modifying’ instructions
Access to shared variables is ‘exclusive’
48
Program A
int flag1 = 0, flag2 = 0;
void p1 (void
flag1 = 1;
if (!flag2)
}
void p2 (void
flag2 = 1;
if (!flag1)
}
*ignored) {
{ /* critical section */ }
*ignored) {
{ /* critical section */ }
Can both critical sections run?
49
Program B
int data = 0, ready = 0;
void p1 (void *ignored) {
data = 2000;
ready = 1;
}
void p2 (void *ignored) {
while (!ready)
;
use (data);
}
Can use be called with value 0?
50
Program C
int a = 0, b = 0;
void p1 (void *ignored) { a = 1; }
void p2 (void *ignored) {
if (a == 1)
b = 1;
}
void p3 (void *ignored) {
if (b == 1)
use (a);
}
Can use be called with value 0?
51
Correct Answers




Program A: I don’t know
Program B: I don’t know
Program C: I don’t know
Why?




It depends on your hardware
If it provides sequential consistency, then answers all No
But not all hardware provides sequential consistency
[BTW, examples and some other slide content from
excellent Tech Report by Adve & Gharachorloo]
52
Sequential Consistency (SC)


Sequential consistency: The result of execution is as if
all operations were executed in some sequential order,
and the operations of each processor occurred in the
order specified by the program. [Lamport]
Boils down to two requirements:



Maintaining program order on individual processors
Ensuring write atomicity
Why doesn’t all hardware support sequential
consistency?
53
SC Limits HW Optimizations

Write buffers


Overlapping write operations can be reordered



Concurrent writes to different memory modules
Coalescing writes to same cache line
Non-blocking reads


E.g., read flag n before flag (2 − n) written through in
Program A
E.g., speculatively prefetch data in Program B
Cache coherence


Write completion only after invalidation/update (Program
B)
Can’t have overlapping updates (Program C)
54
SC Thwarts Compiler Opts


Code motion
Caching values in registers




E.g., ready flag in Program B
Common subexpression elimination
Loop blocking
Software pipelining
55