W4118 Operating Systems Instructor: Junfeng Yang

File System Consistency

Journaling




What is the consistent update problem?
What is journaling?
What are different journaling modes?
eXplode: an automatic tool for detecting
serious storage system errors
1
The Consistent Update Problem

Atomically update the file system from one
consistent state to another. An update may
modify several sectors, but the disk only
guarantees atomic writes of one sector
at a time
2
Example: ext2 File Creation
[Figure: memory (empty) and disk. The disk holds inode bitmap 01000, block bitmap 01000, an inode with direct: 1 …, and a directory data block with entries “.” 1 and “..” root]
3
Read to In-memory Cache
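[Figure: the inode bitmap (01000) and the directory data block (“.” 1, “..” root) are now in the in-memory cache. The disk still holds inode bitmap 01000, block bitmap 01000, the inode, and the data blocks]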
4
Modify blocks
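[Figure: in memory, the inode bitmap is now 01010 and the directory data block holds “.” 1, “..” root, “f” 3. These are “dirty” blocks that must be written to disk. The disk still holds inode bitmap 01000, block bitmap 01000, the inode, and the data blocks]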
5
Crash?

Disk: atomically write one sector



Atomic: if crash, a sector is either completely
written, or none of this sector is written
An FS operation may modify multiple sectors
Crash → FS partially updated
6
Possible Crash Scenarios: Any Subset

File creation dirties three blocks




inode bitmap (B)
inode (I)
parent directory data block (D)
Possible crash scenarios








None
B
I
D
B+I
B+D
I+D
B+I+D
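The eight scenarios above are just the subsets of {B, I, D}. A minimal sketch that enumerates them with a bitmask; the names `subset_name`, `crash_state_count`, and `print_crash_states` are made up for illustration and are not part of any file system code:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical encoding: bit 0 = inode bitmap (B), bit 1 = inode (I),
 * bit 2 = parent directory data block (D). */
enum { NUM_DIRTY = 3 };

/* Format the subset encoded in mask as "None", "B", "B+I", ... */
void subset_name(unsigned mask, char *out, size_t cap)
{
    static const char *names[NUM_DIRTY] = { "B", "I", "D" };
    assert(cap >= 8);               /* longest name is "B+I+D" */
    out[0] = '\0';
    for (int i = 0; i < NUM_DIRTY; i++) {
        if (mask & (1u << i)) {
            if (out[0] != '\0')
                strcat(out, "+");
            strcat(out, names[i]);
        }
    }
    if (out[0] == '\0')
        strcpy(out, "None");
}

/* Every subset of the dirty blocks may have reached disk before the crash. */
unsigned crash_state_count(void)
{
    return 1u << NUM_DIRTY;
}

/* Print every crash scenario, one per line. */
void print_crash_states(void)
{
    char buf[16];
    for (unsigned mask = 0; mask < crash_state_count(); mask++) {
        subset_name(mask, buf, sizeof buf);
        printf("%s\n", buf);
    }
}
```

With n dirty blocks the same loop visits 2^n scenarios, which is why the count grows quickly for larger operations.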
7
Some Solutions

Using a repair utility fsck


Upon reboot, scan entire disk to make FS consistent
Disadvantages
• Slow to scan large disk
• Cannot correctly fix all crashed disks (e.g., B+D)
• Not well-defined consistency

Journaling


Write-ahead logging from database community
Persistently write intent to log (or journal), then update
file system
• Crash before intent is written == no-op
• Crash after intent is written == redo op

Advantages
• no need to scan entire disk
• Well-defined consistency
8
Ext3 Journaling

Three steps




Commit dirty blocks to journal as one transaction
Write dirty blocks to real file system
Reclaim the journal space for the transaction
Physical journaling: write real physical
contents of the update

vs. logical journaling: “create file f in dir d”
• More complex
• May be faster and save disk space
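The three steps can be sketched on a toy in-memory "disk". This is an illustration under simplifying assumptions (one transaction at a time, integer-valued blocks), not ext3's actual journal code; `toy_disk`, `journal_commit`, `journal_checkpoint`, `journal_reclaim`, and `journal_recover` are hypothetical names:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define FS_BLOCKS 8
#define MAX_TXN   4

struct journal_entry { int blockno; int value; };

struct toy_disk {
    int fs[FS_BLOCKS];                  /* "real" file system blocks */
    struct journal_entry log[MAX_TXN];  /* journaled block contents */
    int log_len;                        /* descriptor: number of entries */
    bool committed;                     /* is the commit block on disk? */
};

/* Step 1: write the dirty blocks to the journal, then the commit block. */
void journal_commit(struct toy_disk *d, const struct journal_entry *txn, int n)
{
    assert(n >= 0 && n <= MAX_TXN);
    memcpy(d->log, txn, (size_t)n * sizeof *txn);
    d->log_len = n;
    d->committed = true;    /* must reach disk only after the entries above */
}

/* Step 2: copy the journaled blocks to their home locations. */
void journal_checkpoint(struct toy_disk *d)
{
    for (int i = 0; i < d->log_len; i++)
        d->fs[d->log[i].blockno] = d->log[i].value;
}

/* Step 3: reclaim the journal space for the transaction. */
void journal_reclaim(struct toy_disk *d)
{
    d->log_len = 0;
    d->committed = false;
}

/* After a crash: redo a committed transaction, discard an uncommitted one. */
void journal_recover(struct toy_disk *d)
{
    if (d->committed)
        journal_checkpoint(d);
    journal_reclaim(d);
}
```

Because the journal stores physical block contents, redoing a transaction twice gives the same result as redoing it once, so a crash in the middle of step 2 is harmless.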
9
Ext3 Journaling Example
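[Figure: memory holds the “dirty” blocks that must be written to disk (inode bitmap 01010; directory block with “.” 1, “..” root, “f” 3). The journal holds one transaction: Desc, 01010, Commit. The on-disk bitmaps are still 01000 01000]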
10
Ext3 Journaling Example
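[Figure: memory holds the “dirty” blocks (inode bitmap 01010; directory block with “.” 1, “..” root, “f” 3). The journal holds Desc, 01010, Commit. The on-disk bitmaps are now 01010 01000: the dirty blocks are being written to the real file system]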
11
Ext3 Journaling Example
[Figure: memory holds the “dirty” blocks (inode bitmap 01010; directory block with “.” 1, “..” root, “f” 3). The journal holds Desc, 01010, Commit. The on-disk bitmaps are 01010 01000]
12
Write Orders

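Journal writes < FS writes
• Otherwise: crash → FS partially updated, with no journal record to patch it
FS writes < Journal clear
• Otherwise: crash → FS partially updated, and the journal record needed to patch it is already gone
When to write commit block?
[Figure: journal holding Desc, 01010, X, Commit]
Journal writes and commit?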
13
Journaling Modes

Cost: one write = two disk writes, two seeks

Data journaling: journal all writes, including file data
• Problem: expensive to journal data
Metadata journaling: journal only metadata
• Used by most FS (IBM JFS, SGI XFS, NTFS)
• Problem: file may contain garbage data
Ordered mode: write file data to real FS first, then journal metadata
• Default mode for ext3
14
EXPLODE: a Lightweight, General System
for Finding Serious Storage System Errors
Junfeng Yang
Stanford University
Joint work with Can Sar, Paul Twohey, Ben Pfaff,
Dawson Engler and Madan Musuvathi
15
Why check storage systems?

Storage system errors: some of the most serious
• machine crash
• data loss
• data corruption
Code complicated, hard to get right
• Conflicting goals: speed, reliability (recover from any failures and crashes)
Typical ways to find these errors: ineffective
• Manual inspection: strenuous, erratic
• Randomized testing (e.g. unplug the power cord): blindly throwing darts
• Error report from mad users
16
Goal: build tools to automatically find
storage system errors
Sub-goal: comprehensive, lightweight, general
17
EXPLODE [OSDI06]

Comprehensive: adapt ideas from model checking
General, real: check live systems
• Can run (on Linux, BSD), can check, even w/o source code
Fast, easy
• Check a new storage system: 200 lines of C++ code
• Port to a new OS: 1 kernel module + optional modification
Effective
• 17 storage systems: 10 Linux FS, Linux NFS, Soft-RAID, 3 version control, Berkeley DB, VMware
• Found serious data-loss bugs in all
Subsumes FiSC [OSDI04, best paper]
18
Outline

Overview

Checking process

Implementation


Example check: crashes during recovery are
recoverable
Results
19
Long-lived bug fixed in 2 days in the
IBM Journaling file system (JFS)

Serious
• Loss of an entire FS!
• Fixed in 2 days with our complete trace
Hard to find
• 3 years old, ever since the first version
Dave Kleikamp (IBM JFS): “I really appreciate your work finding and recreating this bug. I'm sure this has bitten us before, but it's usually hard to go back and find out what causes the file system to get messed up so bad”
20
Events to trigger the JFS bug
creat(“/f”);   (5-char system call, not a typo)
flush “/f”
crash!
fsck.jfs   (file system recovery utility, run after reboot)
Orphan file removed. Legal behavior for file systems
[Figure: the buffer cache (in mem) holds / and f; the disk holds /]
21
Events to trigger the JFS bug
creat(“/f”);
bug under low mem (design flaw)
flush “/”
crash!
fsck.jfs   (file system recovery utility, run after reboot)
Disk: / now holds a dangling pointer!
“fix” by zeroing, entire FS gone!
[Figure: the buffer cache (in mem) holds / and f; only / has been flushed to disk]
22
Overview
User-written checker: can be either very sophisticated or very simple
Toy checker: crash after creat("/f") should not lose any old file that is already persistent on disk

void mutate() {
    creat("/old");
    sync();
    creat("/f");
    check_crash_now();
}

void check() {
    int fd = open("/old", O_RDONLY);
    if (fd < 0)
        error("lost old!");
    close(fd);
}

[Figure: the user-written checker (user code) runs on the EXPLODE runtime (our code), above JFS and EKM in the Linux kernel, above the hardware. The runtime generates “crash-disks”, runs fsck.jfs on each, and then calls check() on the recovered file system]
EKM = EXPLODE kernel module
23
Outline

Overview

Checking process

Implementation


Example check: crashes during recovery are
recoverable
Results
24
One core idea from model checking:
explore all choices


Bugs are often triggered by corner cases
How to find? Drive execution down to these
tricky corner cases
Principle
When execution reaches a point in program that can do
one of N different actions, fork execution and in first
child do first action, in second do second, etc.
Result: rare events appear as often as common ones
25
Crashes (Overview slide revisit)
[Figure: overview diagram revisited. The user-written checker runs on the EXPLODE runtime, above JFS and EKM in the Linux kernel, above the hardware. The runtime generates “crash-disks”, runs fsck.jfs on each, and then calls check() on the recovered file system]
26
External choices

Fork and do every possible operation
Explore generated states as well
Users write code to check FS valid. EXPLODE “amplifies”
[Figure: tree of file system states growing from /root: /root with a; /root with a, b; /root with a, c; …]
27
Internal choices

Fork and explore all internal choices
struct block* read_block (int i) {
    struct block *b;
    if ((b = cache_lookup(i)))
        return b;
    return disk_read (i);
}
[Figure: tree of file system states (/root with a; /root with a, b) branching at the internal choice]
28
Users expose choices using choose(N)


To explore an N-choice point, users instrument code using choose(N) (also used in other model checkers)
choose(N): N-way fork, return K in K’th kid
struct block* read_block (int i) {
    struct block *b;
    if ((b = cache_lookup(i)))
        return b;
    return disk_read (i);
}

struct block* cache_lookup (int i) {
    if (choose(2) == 0)
        return NULL;    // forced cache miss
    // normal lookup
    …
}
Optional. Instrumented only 7 places in Linux
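One way to get the effect of an N-way fork without forking is to re-run a deterministic body once per sequence of choices. The sketch below does that; it is an assumption-laden illustration, not EXPLODE's implementation, and `explore_all`, `next_path`, `two_lookups`, and `outcome_count` are hypothetical names:

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_DEPTH 32

static int choice[MAX_DEPTH];   /* value returned at each choice point */
static int arity[MAX_DEPTH];    /* N passed to choose() at that point */
static int depth;               /* choice points recorded so far */
static int pos;                 /* next choice point in the current run */

/* Return a value in [0, n); over all runs every value gets returned. */
int choose(int n)
{
    assert(n > 0 && pos < MAX_DEPTH);
    if (pos == depth) {         /* new choice point: start with 0 */
        choice[depth] = 0;
        arity[depth] = n;
        depth++;
    }
    return choice[pos++];
}

/* Advance to the next unexplored choice sequence; false when none is left. */
static bool next_path(void)
{
    while (depth > 0) {
        if (++choice[depth - 1] < arity[depth - 1])
            return true;
        depth--;                /* this point is exhausted, backtrack */
    }
    return false;
}

/* Run body once per distinct sequence of choose() results. */
int explore_all(void (*body)(void))
{
    int runs = 0;
    depth = 0;
    do {
        pos = 0;
        body();
        runs++;
    } while (next_path());
    return runs;
}

/* Example body: two cache lookups, each of which may be forced to miss. */
int outcome_count[4];

void two_lookups(void)
{
    int first = choose(2);
    int second = choose(2);
    outcome_count[first * 2 + second]++;
}
```

Calling `explore_all(two_lookups)` runs the body four times, once for each hit/miss combination, so the rare "both miss" path is exercised as often as the common one.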
29
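Crash × External × Internal
[Figure: tree of file system states growing from /root (/root with a; /root with a, b; /root with a, c; …), explored across crash, external, and internal choices]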
30
Speed: skip same states
Abstract and hash a state, discard if seen.
[Figure: tree of file system states growing from /root (a; a, b; a, c; a, b, c; …); a state reached a second time is discarded]
31
Outline

Overview

Checking process

Implementation


FiSC, File System Checker, [OSDI04], best paper
EXPLODE, storage system checker, [OSDI06]

Example check: crashes during recovery are recoverable

Results
32
Checking process
S0 = checkpoint()
enqueue(S0)
while(queue not empty){
    S = dequeue()
    for each action in S {
        restore(S)
        do action
        S’ = checkpoint()
        if(S’ is new)
            enqueue(S’)
    }
}
How to checkpoint and restore a live OS?
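The loop above can be made concrete on a toy state space. In this sketch the "storage system" is just a bitmask of which of three files exist, so checkpoint and restore are trivial; that simplification, and the names `live_state`, `do_action`, and `explore_states`, are assumptions for illustration:

```c
#include <assert.h>
#include <stdbool.h>

#define NFILES   3
#define NSTATES  (1u << NFILES)
#define NACTIONS (2 * NFILES)   /* creat(i) and unlink(i) for each file */

/* Toy "storage system": bit i set means file i exists. */
static unsigned live_state;

static unsigned checkpoint(void)    { return live_state; }
static void     restore(unsigned s) { live_state = s; }

/* Actions 0..NFILES-1 create a file, the rest remove one. */
static void do_action(int a)
{
    if (a < NFILES)
        live_state |= 1u << a;
    else
        live_state &= ~(1u << (a - NFILES));
}

/* Breadth-first exploration; returns the number of distinct states visited. */
unsigned explore_states(void)
{
    unsigned queue[NSTATES];
    bool seen[NSTATES] = { false };
    unsigned head = 0, tail = 0, visited = 0;

    live_state = 0;
    unsigned s0 = checkpoint();
    seen[s0] = true;
    queue[tail++] = s0;

    while (head < tail) {
        unsigned s = queue[head++];
        visited++;
        for (int a = 0; a < NACTIONS; a++) {
            restore(s);
            do_action(a);
            unsigned next = checkpoint();
            if (!seen[next]) {          /* discard states seen before */
                seen[next] = true;
                assert(tail < NSTATES);
                queue[tail++] = next;
            }
        }
    }
    return visited;
}
```

The `seen` array plays the role of the state hash table: each of the 8 possible states is enqueued once, no matter how many action sequences lead to it.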
33
FiSC: jam OS into tool
[Figure: software stack with the user-written checker, JFS, User Mode Linux, FiSC, the Linux kernel, and the hardware; the legend distinguishes user code from our code]
Pros
• Comprehensive, effective
• No model, check code
• Checkpoint and restore: easy
Cons
• Intrusive. Build fake environment. Hard to check anything new. Months for new OS, 1 week for new FS
• Many tricks, so complicated that we won best paper OSDI 04
34
EXPLODE: jam tool into OS
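[Figure: two software stacks side by side. FiSC: user-written checker, JFS, User Mode Linux, FiSC, Linux kernel, hardware. EXPLODE: user-written checker, EXPLODE runtime, JFS, EKM, Linux kernel, hardware. The legend distinguishes user code from our code]
EKM = EXPLODE kernel module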
35
EKM lines of code
OS             Lines of code
Linux 2.6      1,915
FreeBSD 6.0    1,210
EXPLODE kernel modules (EKM) are small and easy to write
36
How to checkpoint and restore
a live OS kernel?

Hard to checkpoint live kernel memory
Virtual machine? No
• VMware: no source
• Xen: not portable
• heavyweight
There’s a better solution for storage systems
[Figure: checker on the EXPLODE runtime, above JFS and EKM in the Linux kernel, above the hardware]
37
Checkpoint: save actions instead of bits
state = list of actions
checkpoint S = save (creat, cache miss)
restore = re-initialize, creat, cache miss
re-initialize = unmount, mkfs
• unmount: utility that clears in-mem state of a storage system
• mkfs: utility to create an empty FS
Redo-to-restore-state idea used in model checking to trade time for space (stateless search).
We use it only to reduce intrusiveness
[Figure: path from S0 through creat … to S]
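A minimal sketch of "save actions instead of bits": a checkpoint is the list of actions applied since re-initialization, and restore replays that list on a fresh system. The toy file system (two bits) and the names `trace`, `apply`, `reinitialize`, and `fs_snapshot` are hypothetical:

```c
#include <assert.h>

#define MAX_TRACE 16

enum action { ACT_CREAT_A, ACT_CREAT_B, ACT_UNLINK_A };

struct trace { enum action acts[MAX_TRACE]; int len; };

static unsigned fs_state;       /* toy FS: bit 0 = "a" exists, bit 1 = "b" */
static struct trace current;    /* actions since the last re-initialization */

/* Stands in for unmount + mkfs: back to an empty file system. */
void reinitialize(void)
{
    fs_state = 0;
    current.len = 0;
}

/* Perform one action and record it in the current trace. */
void apply(enum action a)
{
    assert(current.len < MAX_TRACE);
    switch (a) {
    case ACT_CREAT_A:  fs_state |= 1u;  break;
    case ACT_CREAT_B:  fs_state |= 2u;  break;
    case ACT_UNLINK_A: fs_state &= ~1u; break;
    }
    current.acts[current.len++] = a;
}

/* A checkpoint is only the action list, never a copy of the state bits. */
struct trace checkpoint(void)
{
    return current;
}

/* Restore by starting from scratch and redoing the saved actions. */
void restore(const struct trace *t)
{
    struct trace saved = *t;    /* copy first: t may point at current */
    reinitialize();
    for (int i = 0; i < saved.len; i++)
        apply(saved.acts[i]);
}

unsigned fs_snapshot(void)
{
    return fs_state;
}
```

Replay costs time proportional to the trace length, but it needs no access to kernel memory, which is the point made on the slide. It only works if replaying the same actions is deterministic.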
38
Deterministic replay

Storage system: isolated subsystem
Non-deterministic kernel scheduling decision
• Opportunistic fix: priorities
Non-deterministic interrupt
• Fix: use RAM disks, no interrupt for checked system
Non-deterministic kernel choose() calls by other code
• Fix: filter by thread IDs. No choose() in interrupt
Worked well in practice
• Mostly deterministic
• Worst case: auto-detect & ignore non-repeatable errors
39
Outline

Overview

Checking process

Implementation


Example check: crashes during recovery are
recoverable
Results
40
Why check crashes during recovery?

Crashes are highly correlated
• Often caused by kernel bugs, hardware errors
• Reboot, hit same bug/error
What to check?
• fsck once == fsck & crash, re-run fsck
• fsck(crash-disk) to completion, “/a” recovered
• fsck(crash-disk) and crash, fsck, “/a” gone: Bug!
Powerful heuristic, found interesting bugs (wait until results)
41
How to check crashes during recovery?
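[Figure: the EXPLODE runtime runs fsck.jfs on the “crash-disk” to completion. It also crashes fsck.jfs partway to produce “crash-crash-disks”, runs fsck.jfs on each of them, and asks whether every result is the same as that of the uninterrupted run]
Problem: N blocks → 2^N crash-crash-disks.
Too many! Can prune many crash-crash-disks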
42
Simplified example

3-block disk, B1, B2, B3
each block is either 0 or 1
crash-disk = 000 (B1 to B3)
fsck(000)
    Read(B1) = 0
    Write(B2, 1)      buffer cache: B2=1
    Write(B3, 1)      buffer cache: B2=1, B3=1
    Read(B3) = 1
    Write(B1, 1)      buffer cache: B2=1, B3=1, B1=1
fsck(000) = 111
43
Naïve strategy: 7 crash-crash-disks
crash-disk = 000
fsck(000) = 111
    Read(B1) = 0
    Write(B2, 1)
    Write(B3, 1)
    Read(B3) = 1
    Write(B1, 1)
buffer cache: B2=1, B3=1, B1=1
Apply every non-empty subset of the buffered writes to the crash-disk, then re-run fsck:
    fsck(010) == 111?      (000 + {B2=1})
    fsck(001) == 111?
    fsck(011) == 111?
    fsck(100) == 111?
    fsck(110) == 111?
    fsck(101) == 111?
    fsck(111) == 111?
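The seven disks come from applying each non-empty subset of the three buffered writes to the crash-disk. A sketch of that enumeration, assuming each write targets a different block and encoding a disk as the bits "B1 B2 B3"; `crash_crash_disks` and `set_block` are hypothetical names:

```c
#include <assert.h>
#include <stddef.h>

#define NBLOCKS 3

struct write { int block; int value; };   /* block is 1-based, as on the slide */

/* Encode a disk as bits "B1 B2 B3", so B2=1 alone is binary 010. */
static unsigned set_block(unsigned disk, int block, int value)
{
    unsigned bit = 1u << (NBLOCKS - block);
    return value ? (disk | bit) : (disk & ~bit);
}

/* Fill out[] with the disks produced by applying every non-empty subset of
 * the n buffered writes to crash_disk; returns how many were produced. */
size_t crash_crash_disks(unsigned crash_disk, const struct write *w, int n,
                         unsigned *out, size_t cap)
{
    size_t count = 0;
    assert(n >= 0 && n < 16);
    for (unsigned subset = 1; subset < (1u << n); subset++) {
        unsigned disk = crash_disk;
        for (int i = 0; i < n; i++)
            if (subset & (1u << i))
                disk = set_block(disk, w[i].block, w[i].value);
        assert(count < cap);
        out[count++] = disk;
    }
    return count;
}
```

For the slide's cache contents (B2=1, B3=1, B1=1) on crash-disk 000 this yields the disks 001 through 111, each of which would then be handed to fsck.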
44
Optimization: exploiting determinism
crash-disk = 000
fsck(000) = 111
    Read(B1) = 0
    Write(B2, 1)
    Write(B3, 1)
    Read(B3) = 1
    Write(B1, 1)
For all practical purposes, fsck is deterministic
• read same blocks → write same blocks
fsck(010) == 111?      (000 + {B2=1})
• fsck(000) doesn’t read B2
• So, fsck(010) = 111
45
What blocks does fsck(000) actually read?
crash-disk = 000
fsck(000) = 111
Read(B1) = 0
Write(B2, 1)
Write(B3, 1)
Read(B3) = 1
Write(B1, 1)
Read of B3 will get what we just
wrote. Can’t depend on B3
fsck(000) reads/depends only on B1.
It doesn’t matter what we write to
the other blocks.
fsck(0**) = 111
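The rule "a read only counts if the block was not written earlier in the same run" can be computed directly from the trace. A sketch, assuming fsck is deterministic as stated above; `depends_on` and `same_fsck_input` are hypothetical names:

```c
#include <assert.h>
#include <stdbool.h>

#define NBLOCKS 3

enum op_kind { OP_READ, OP_WRITE };
struct op { enum op_kind kind; int block; };   /* block is 1-based */

/* Bitmask (bit b-1 for block b) of blocks that fsck read before writing:
 * only these blocks of the input disk can influence its result. */
unsigned depends_on(const struct op *trace, int n)
{
    unsigned written = 0, deps = 0;
    for (int i = 0; i < n; i++) {
        assert(trace[i].block >= 1 && trace[i].block <= NBLOCKS);
        unsigned bit = 1u << (trace[i].block - 1);
        if (trace[i].kind == OP_WRITE)
            written |= bit;
        else if (!(written & bit))
            deps |= bit;        /* read of original disk contents */
    }
    return deps;
}

/* True if disks a and b agree on every block in deps, in which case a
 * deterministic fsck produces the same result on both. */
bool same_fsck_input(const int a[NBLOCKS], const int b[NBLOCKS], unsigned deps)
{
    for (int i = 0; i < NBLOCKS; i++)
        if ((deps & (1u << i)) && a[i] != b[i])
            return false;
    return true;
}
```

On the slide's trace, `depends_on` reports only B1, so any crash-crash-disk that agrees with 000 on B1 (the pattern 0**) needs no separate fsck run.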
46
Prune crash-crash-disks matching 0**
crash-disk = 000
fsck(000) = 111
    Read(B1) = 0
    Write(B2, 1)
    Write(B3, 1)
    Read(B3) = 1
    Write(B1, 1)
buffer cache: B2=1, B3=1, B1=1
Pruned (match 0**):
    fsck(010) == 111?
    fsck(001) == 111?
    fsck(011) == 111?
Still to check:
    fsck(100) == 111?
    fsck(110) == 111?
    fsck(101) == 111?
    fsck(111) == 111?
Can further optimize using this and other ideas
47
Outline

Overview

Checking process

Implementation


Example check: crashes during recovery are
recoverable
Results
48
Bugs caused by crashes during recovery


Found data-loss bugs in all three FS that use logging (ext3, JFS, ReiserFS), total 5
Strict order under normal operation:
• First, write operation to log, commit
• Second, apply operation to actual file system
Strict (reverse) order during recovery:
• First, replay log to patch actual file system
• Second, clear log
No order → corrupted FS and no log to patch it!
49
Bug in fsck.ext3
recover_ext3_journal(…) {
    // …
    retval = -journal_recover(journal);
    // …
    // clear the journal
    e2fsck_journal_release(…)
    // …
}
journal_recover(…) {
    // replay the journal
    // …
    // sync modifications to disk
    fsync_no_super (…)
}
// Error! Empty macro, doesn’t sync data!
#define fsync_no_super(dev) do {} while (0)

Code directly adapted from the kernel
But, fsync_no_super defined as NOP: “hard to implement”
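Why the empty sync matters can be shown on a toy model with a buffer cache. This is a simplified illustration of the ordering problem, not the e2fsprogs code; the struct and function names are hypothetical, and the model assumes the journal clear reaches disk immediately:

```c
#include <assert.h>
#include <stdbool.h>

struct toy_disk {
    int  fs_on_disk;      /* persistent file system contents */
    int  fs_cached;       /* replayed contents still in the buffer cache */
    bool cache_dirty;
    bool journal_valid;   /* is a committed journal still on disk? */
    int  journal_value;   /* contents the journal would restore */
};

static void replay_journal(struct toy_disk *d)
{
    d->fs_cached = d->journal_value;    /* writes land in the cache only */
    d->cache_dirty = true;
}

static void sync_fs(struct toy_disk *d)
{
    if (d->cache_dirty) {
        d->fs_on_disk = d->fs_cached;
        d->cache_dirty = false;
    }
}

static void clear_journal(struct toy_disk *d)
{
    d->journal_valid = false;
}

/* Recovery; do_sync == false models a sync that is an empty macro. */
void recover(struct toy_disk *d, bool do_sync)
{
    assert(d->journal_valid);
    replay_journal(d);
    if (do_sync)
        sync_fs(d);       /* must complete before the journal is cleared */
    clear_journal(d);
}

/* A crash throws away whatever was only in memory. */
void crash(struct toy_disk *d)
{
    d->cache_dirty = false;
    d->fs_cached = d->fs_on_disk;
}

/* After a crash, is the journaled update still recoverable from disk? */
bool update_survives(const struct toy_disk *d)
{
    return d->fs_on_disk == d->journal_value || d->journal_valid;
}
```

With `do_sync` true, a crash right after recovery leaves the update on disk. With `do_sync` false, the journal is already cleared while the replayed blocks exist only in memory, so the same crash loses the update for good.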
50
FiSC Results (can reproduce in EXPLODE)
Error Type     VFS    ext2   ext3   JFS    ReiserFS   total
Data loss      N/A    N/A    1      8      1          10
False clean    N/A    N/A    1      1                 2
Security              2      2      1                 3+2
Crashes        1                    10     1          12
Other          1             1      1                 3
Total          2      2      5      21     2          32
32 in total, 21 fixed, 9 of the remaining 11 confirmed
51
EXPLODE checkers lines of code and
errors found
Storage System Checked       Checker     Bugs
10 file systems              5,477       18
Storage applications
  CVS                        68          1
  Subversion                 69          1
  “EXPENSIVE”                124         3
  Berkeley DB                202         6
Transparent subsystems
  RAID                       FS + 137    2
  NFS                        FS          4
  VMware GSX/Linux           FS          1
Total                        6,008       36
6 bugs per 1,000 lines of checker code
52
Related work

FS Testing

Static (compile-time) analysis

Software model checking
53
Conclusion

EXPLODE
• Comprehensive: adapt ideas from model checking
• General, real: check live systems in situ, w/o source code
• Fast, easy: simple C++ checking interface
Results
• Checked 17 widely-used, well-tested, real-world storage systems: 10 Linux FS, Linux NFS, Soft-RAID, 3 version control, Berkeley DB, VMware
• Found serious data-loss bugs in all, over 70 bugs in total
• Many bug reports led to immediate kernel patches
54