W4118 Operating Systems Instructor: Junfeng Yang File System Consistency Journaling What is the consistent update problem? What is journaling? What are different journaling modes? eXplode: an automatic tool for detecting serious storage system errors 1 The Consistent Update Problem Atomically update file system from one consistent state to another, which may require modifying several sectors, despite that the disk only provides atomic write of one sector at a time 2 Example: ext2 File Creation Memory Disk 01000 01000 inode block inode bitmap bitmap direct: 1 … blocks . 1 .. root 3 Read to In-memory Cache 01000 . 1 .. root 01000 01000 inode block bitmap bitmap inode data blocks 4 Modify blocks 01010 “Dirty” blocks, must write to disk . 1 .. root f 3 01000 01000 inode block bitmap bitmap inode data blocks 5 Crash? Disk: atomically write one sector Atomic: if crash, a sector is either completely written, or none of this sector is written An FS operation may modify multiple sectors Crash FS partially updated 6 Possible Crash Scenarios: Any Subset File creation dirties three blocks inode bitmap (B) inode (I) parent directory data block (D) Possible crash scenarios None B I D B+I B+D I+D B+I+D 7 Some Solutions Using a repair utility fsck Upon reboot, scan entire disk to make FS consistent Disadvantages • Slow to scan large disk • Cannot correctly fix all crashed disks (e.g., B+D) • Not well-defined consistency Journaling Write-ahead logging from database community Persistently write intent to log (or journal), then update file system • Crash before intent is written == no-op • Crash after intent is written == redo op Advantages • no need to scan entire disk • Well-defined consistency 8 Ext3 Journaling Three steps Commit dirty blocks to journal as one transaction Write dirty blocks to real file system Reclaim the journal space for the transaction Physical journaling: write real physical contents of the update v.s. logical journaling: “create file f in dir d” • More complex • May be faster and save disk space 9 Ext3 Journaling Example 01010 “Dirty” blocks, must write to disk . 1 .. root f 3 01000 01000 journal Desc 01010 Commit 10 Ext3 Journaling Example 01010 “Dirty” blocks, must write to disk . 1 .. root f 3 01010 01000 journal Desc 01010 Commit 11 Ext3 Journaling Example 01010 “Dirty” blocks, must write to disk . 1 .. root f 3 01010 01000 journal Desc 01010 Commit 12 Write Orders Journal writes < FS writes FS writes < Journal clear Otherwise: Otherwise: When to write commit block? journal Desc 01010 X Commit Journal writes and commit ? 13 Journaling Modes Cost: one write = two disk writes, two seeks Data journaling: journal all writes, including file data Metadata journaling: journal only metadata Problem: expensive to journal data Used by most FS (IBM JFS, SGI XFS, NTFS) Problem: file may contain garbage data Ordered mode: write file data to real FS first, then journal metadata Default mode for ext3 14 EXPLODE: a Lightweight, General System for Finding Serious Storage System Errors Junfeng Yang Stanford University Joint work with Can Sar, Paul Twohey, Ben Pfaff, Dawson Engler and Madan Musuvathi 15 Why check storage systems? Storage system errors: some of the most serious machine crash data loss data corruption Code complicated, hard to get right Conflicting goals: speed, reliability (recover from any failures and crashes) Typical ways to find these errors: ineffective Manual inspection: strenuous, erratic Randomized testing (e.g. unplug the power cord): blindly throwing darts Error report from mad users 16 Goal: build tools to automatically find storage system errors Sub-goal: comprehensive, lightweight, general 17 EXPLODE [OSDI06] Comprehensive: adapt ideas from model checking General, real: check live systems Fast, easy Check a new storage system: 200 lines of C++ code Port to a new OS: 1 kernel module + optional modification Effective Can run (on Linux, BSD), can check, even w/o source code 17 storage systems: 10 Linux FS, Linux NFS, Soft-RAID, 3 version control, Berkeley DB, VMware Found serious data-loss in all Subsumes FiSC [OSDI04, best paper] 18 Outline Overview Checking process Implementation Example check: crashes during recovery are recoverable Results 19 Long-lived bug fixed in 2 days in the IBM Journaling file system (JFS) Serious Loss of an entire FS! Fixed in 2 days with our complete trace Hard to find 3 years old, ever since the first version Dave Kleikamp (IBM JFS): “I really appreciate your work finding and recreating this bug. I'm sure this has bitten us before, but it's usually hard to go back and find out what causes the file system to get messed up so bad” 20 Events to trigger the JFS bug creat(“/f”); flush “/f” 5-char system call, not a typo crash! Buffer Cache (in mem) / f fsck.jfs File system recovery utility, run after reboot Disk / Orphan file removed. Legal behavior for file systems 21 Events to trigger the JFS bug creat(“/f”); Buffer bug under low Cache mem (design flaw) (in mem) / f flush “/” crash! fsck.jfs File system recovery utility, run after reboot Disk dangling / pointer! “fix” by zeroing, entire FS gone! 22 Overview User-written Checker creat(“/f”) void mutate() { void check() { creat(“/old”); int fdcreat(“/f”) = open(“/old”, Toy checker: crash after sync(); not lose any old file that O_RDONLY ); should is User-written checker: creat(“/f”); (fd < 0) already persistent on ifdisk check_crash_now(); error(“lost old!”); can be either very sophisticated } close(fd); or very simple } EXPLODE Runtime “crash-disk” fsck.jfs check( / fsck.jfs check( / f fsck.jfs check( /root ) f JFS /root old / f EKM Linux Kernel Hardware User code /root Our code old /root ) ) old f EKM = EXPLODE kernel module 23 Outline Overview Checking process Implementation Example check: crashes during recovery are recoverable Results 24 One core idea from model checking: explore all choices Bugs are often triggered by corner cases How to find? Drive execution down to these tricky corner cases Principle When execution reaches a point in program that can do one of N different actions, fork execution and in first child do first action, in second do second, etc. Result: rare events appear as often as common ones 25 Crashes (Overview slide revisit) User-written Checker “crash-disk” fsck.jfs check( / fsck.jfs check( / f fsck.jfs check( /root ) EXPLODE f Runtime JFS /root old / f EKM /root old /root ) ) old f Linux Kernel Hardware 26 External choices Fork and do every possible operation /root a b /root /root a c a /root … Explore generated states as well … Users write code to check FS valid. EXPLODE “amplifies” /root a 27 Internal choices Fork and explore all internal choices /root a b /root a /root a /root struct block* read_block (int i) { struct block *b; if ((b = cache_lookup(i))) return b; return disk_read (i); } a b 28 Users expose choices using choose(N) To explore N-choice point, users instrument code using choose(N) (also used in other model checkers) choose(N): N-way fork, return K in K’th kid struct block* read_block (int i) { struct block *b; if ((b = cache_lookup(i))) return b; return disk_read (i); } cache_lookup (int i) { if(choose(2) == 0) return NULL; // normal lookup … } Optional. Instrumented only 7 places in Linux 29 Crash X External X Internal /root a b /root /root a c a … /root … /root a 30 Speed: skip same states /root /root a bc a b /root /root a c a /root … /root … Abstract and hash a state, discard if seen. a 31 Outline Overview Checking process Implementation FiSC, File System Checker, [OSDI04], best paper EXPLODE, storage system checker, [OSDI06] Example check: crashes during recovery are recoverable Results 32 Checking process S0 … S0 = checkpoint() enqueue(S0) while(queue not empty){ S = dequeue() for each action in S { restore(S) do action S’ = checkpoint() if(S’ is new) enqueue(S’) } } How to checkpoint and restore a live OS? 33 FiSC: jam OS into tool User-written Checker Pros JFS Cons User Mode Linux FiSC Linux Kernel Hardware User code Comprehensive, effective No model, check code Checkpoint and restore: easy Intrusive. Build fake environment. Hard to check anything new. Months for new OS, 1 week for new FS Many tricks, so complicated that we won best paper OSDI 04 Our code 34 EXPLODE: jam tool into OS User-written Checker User-written Checker EXPLODE JFS Runtime User Mode Linux FiSC Linux Kernel Hardware User code Our code JFS EKM Linux Kernel Hardware EKM = EXPLODE kernel module 35 EKM lines of code OS Lines of code Linux 2.6 1,915 FreeBSD 6.0 1,210 EXPLODE kernel modules (EKM) are small and easy to write 36 How to checkpoint and restore a live OS kernel? Checker EXPLODE Runtime Hard to checkpoint live kernel memory Virtual machine? No JFS EKM Linux Kernel Hardware VMware: no source Xen: not portable heavyweight There’s a better solution for storage systems 37 Checkpoint: save actions instead of bits state = list of actions checkpoint S = save (creat, cache miss) restore = re-initialize, creat, cache miss re-initialize = unmount, mkfs S0 creat Utility that clears Utility to in-mem state of a Redo-to-restore-state idea used create an in storage system to tradeempty model checking time FS for space S (stateless search). … We use it only to reduce intrusiveness 38 Deterministic replay Storage system: isolated subsystem Non-deterministic kernel scheduling decision Non-deterministic interrupt Fix: use RAM disks, no interrupt for checked system Non-deterministic kernel choose() calls by other code Opportunistic fix: priorities Fix: filter by thread IDs. No choose() in interrupt Worked well in practice Mostly deterministic Worst case: auto-detect & ignore non-repeatable errors 39 Outline Overview Checking process Implementation Example check: crashes during recovery are recoverable Results 40 Why check crashes during recovery? Crashes are highly correlated Often caused by kernel bugs, hardware errors Reboot, hit same bug/error What to check? fsck once == fsck & crash, re-run fsck fsck(crash-disk) to completion, “/a” recovered fsck(crash-disk) and crash, fsck, “/a” gone Bug! Powerful heuristic, found interesting bugs (wait until results) 41 How to check crashes during recovery? “crash-disk” fsck.jfs EXPLODE Runtime fsck.jfs same as ? fsck.jfs same as ? fsck.jfs same as ? “crash-crash-disk” Problem: N blocks 2^N crash-crash-disks. Too many! Can prune many crash-crash-disks 42 Simplified example fsck(000) 3-block disk, B1, B2, B3 each block is either 0 or 1 crash-disk = 000 (B1 to B3) Read(B1) = 0 Write(B2, 1) buffer cache: B2=1 Write(B3, 1) buffer cache: B2=1, B3=1 Read(B3) = 1 Write(B1, 1) buffer cache: B2=1, B3=1, B1=1 fsck(000) = 111 43 Naïve strategy: 7 crash-crash-disks crash-disk = 000 fsck(000) = 111 buffer cache: B2=1, B3=1, B1=1 000 + {B2=1} fsck(010) == 111? Read(B1) = 0 fsck(001) == 111? Write(B2, 1) fsck(011) == 111? Write(B3, 1) fsck(100) == 111? Read(B3) = 1 fsck(110) == 111? Write(B1, 1) fsck(101) == 111? fsck(111) == 111? 44 Optimization: exploiting determinism crash-disk = 000 fsck(000) = 111 Read(B1) = 0 Read(B3) = 1 Write(B1, 1) read same blocks write same blocks 000 + {B2=1} Write(B2, 1) Write(B3, 1) For all practical purposes, fsck is deterministic fsck(010) == 111? fsck(000) doesn’t read B2 So, fsck(010) = 111 45 What blocks does fsck(000) actually read? crash-disk = 000 fsck(000) = 111 Read(B1) = 0 Write(B2, 1) Write(B3, 1) Read(B3) = 1 Write(B1, 1) Read of B3 will get what we just wrote. Can’t depend on B3 fsck(000) reads/depends only on B1. It doesn’t matter what we write to the other blocks. fsck(0**) = 111 46 Prune crash-crash-disks matching 0** crash-disk = 000 fsck(000) = 111 buffer cache: B2=1, B3=1, B1=1 fsck(010) == 111? Read(B1) = 0 fsck(001) == 111? Write(B2, 1) fsck(011) == 111? Write(B3, 1) fsck(100) == 111? Read(B3) = 1 fsck(110) == 111? Write(B1, 1) fsck(101) == 111? Can further optimize using this and other ideas fsck(111) == 111? 47 Outline Overview Checking process Implementation Example check: crashes during recovery are recoverable Results 48 Bugs caused by crashes during recovery Found data-loss bugs in all three FS that use logging (ext3, JFS, ReiserFS), total 5 Strict order under normal operation: First, write operation to log, commit Second, apply operation to actual file system Strict (reverse) order during recovery: First, replay log to patch actual file system Second, clear log No order corrupted FS and no log to patch it! 49 Bug in fsck.ext3 recover_ext3_journal(…) { // … retval = -journal_recover(journal); // … // clear the journal e2fsck_journal_release(…) // … } journal_recover(…) { // replay the journal //… // sync modifications to disk fsync_no_super (…) } // Error! Empty macro, doesn’t sync data! #define fsync_no_super(dev) do {} while (0) Code directly adapted from the kernel But, fsync_no_super defined as NOP: “hard to implement” 50 FiSC Results (can reproduce in EXPLODE) Error Type VFS ext2 ext3 JFS Data loss N/A N/A 1 8 False clean N/A N/A 1 1 2 2 2 1 3+2 Security Crashes 1 Other 1 Total 2 10 2 1 1 5 21 ReiserFS total 1 1 10 12 3 2 32 32 in total, 21 fixed, 9 of the remaining 11 confirmed 51 EXPLODE checkers lines of code and errors found Storage System Checked Checker Bugs 10 file systems 5,477 18 CVS 68 1 Subversion 69 1 “EXPENSIVE” 124 3 Berkeley DB 202 6 RAID FS + 137 2 NFS FS 4 VMware GSX/Linux FS 1 6,008 36 Storage applications Transparent subsystems Total 6 bugs per 1,000 lines of checker code 52 Related work FS Testing Static (compile-time) analysis Software model checking 53 Conclusion EXPLODE Comprehensive: adapt ideas from model checking General, real: check live systems in situ, w/o source code Fast, easy: simple C++ checking interface Results Checked 17 widely-used, well-tested, real-world storage systems: 10 Linux FS, Linux NFS, Soft-RAID, 3 version control, Berkeley DB, VMware Found serious data-loss bugs in all, over 70 bugs in total Many bug reports led to immediate kernel patches 54