Formal Modeling and Analysis of a Flash Filesystem in Alloy
Eunsuk Kang
TDS Seminar, Mar. 14, 2008

What is flash memory?
Non-volatile, high-performance storage
Applications: MP3 players, laptop drives, digital cameras, etc.

NASA Mars Exploration Rover Spirit
On-board flash memory to store scientific data

Flash anomaly on Spirit
System failure 18 days after landing (2004)
Loss of communication with Earth, stuck in “reboot” loop
Cause: Flaw in the flash filesystem
Cost: 10 days of lost scientific activity

Testing for the unanticipated?
Out of free space, but still attempted to service file operations
“There was a belief among the FSW development team that the system would not exhibit the behavior that is the root cause of the anomaly…” [Reeves, 2004]
Testing is essential, but is it enough?

Answer: Formal methods?
Allows exhaustive analysis
BUT: Verifying a poorly designed piece of code in an after-the-fact, ad hoc manner is impractical
Apply formal methods early, get the design right

Grand Challenge in Verification
Long term: “Build a verifying compiler” - Tony Hoare
Short term: “Build a verified flash filesystem” - Joshi & Holzmann (Jet Propulsion Laboratory)
In this talk: “Build a verified design for a flash filesystem”

Outline

What is POSIX?
IEEE standard for filesystem operations
Adopted by UNIX, Mac OS X, etc.
Reference model for the flash filesystem
Function signatures & behaviors
e.g.
write(fildes, *buf, nbyte, offset)
“The write() function shall attempt to write nbyte bytes from the buffer pointed to by buf to the file associated with the open file descriptor, fildes.”

POSIX filesystem in Alloy
Alloy: First-order relational logic + transitive closure

    sig Data {}   // data element
    sig FID {}    // file identifier
    sig File { contents : seq Data }
    // abstract filesystem; “lone” means one or zero
    sig AbsFsys { fileMap : FID -> lone File }

Abstract read operation

    fun readAbs [fsys: AbsFsys, fid: FID, offset, size: Int] : seq Data {
      let file = fsys.fileMap[fid] |
        (file.contents).subseq[offset, offset + size - 1]
    }

    // simulation
    run {
      some fsys : AbsFsys, fid : FID, output : seq Data |
        output = readAbs[fsys, fid, 1, 3]
    } for 3

Abstract write operation

    pred writeAbs[fsys, fsys' : AbsFsys, fid : FID, buffer : seq Data, offset, size : Int] {
      let file = fsys.fileMap[fid],
          file' = fsys'.fileMap[fid],
          buffer' = buffer.subseq[0, size - 1] {
        (#buffer' = 0) => file' = file    // case 1
        (#buffer' != 0) =>                // case 2 & 3
          file'.contents = (zeros[offset] ++ file.contents) ++ shift[buffer', offset]
        writePromote[fsys, fsys', file, file', fid]
      }
    }

    // promotion
    pred writePromote[fsys, fsys' : AbsFsys, file, file' : File, fid : FID] {
      fsys'.fileMap = fsys.fileMap ++ (fid -> file')
    }

Alloy is pure logic
No built-in syntax/semantics for state machines
Transition as an explicit constraint between two states

Abstract write operation: Case 1
Input buffer is empty; no
changes to the file

Abstract write operation: Case 2
Offset is within the file
Shift buffer by offset & override existing data

Abstract write operation: Case 3
Offset is after the end of the file
Fill in the gap with zeros

Promotion
A style of modeling changes in system state
Ensure all other files remain unchanged

What makes flash special?
Two types: NOR and NAND
Program (i.e. write) at the page level, erase at the block level
Must erase before programming
Block can be erased only a limited number of times (need wear-leveling)

Modeling memory hierarchy

    sig Page { data : seq Data } { #data = PAGE_SIZE }
    sig Block { pages : seq Page } { #pages = BLOCK_SIZE }
    sig LUN { blocks : seq Block } { #blocks = LUN_SIZE }
    sig Device { LUNs : seq LUN … } { #LUNs = DEVICE_SIZE }

    // simulation with constraints
    run {
      some Device
      DEVICE_SIZE = 1
      LUN_SIZE = 2
      BLOCK_SIZE = 2
      PAGE_SIZE = 4
    } for 4

Addressing mode
Row & column addresses:

    sig RowAddr {   // used to access a page
      lunIndex : Int,
      blockIndex : Int,
      pageIndex : Int
    }

A column address is an Int, and identifies a data element in a page
Example:

    rowAddr.lunIndex = 0
    rowAddr.blockIndex = 1
    rowAddr.pageIndex = 1
    columnAddr = 1

Page status & data structures
Each page is associated with its current status

    abstract sig PageStatus {}
    one sig Free, Allocated, Valid, Invalid extends PageStatus {}

Auxiliary data structures*

    sig Device {
      LUNs : seq LUN,
      pageStatusMap : RowAddr -> one PageStatus,
      eraseCountMap : RowAddr -> one Int,   // wear-leveling
      reserveBlock : RowAddr                // garbage collection
    } { #LUNs = DEVICE_SIZE }

(* disclaimers)

Flash API functions

    // reads data from page, starting at “colAddr”
    fun read[d : Device, colAddr : Int, rowAddr : RowAddr] : seq Data { … }

    // program data into page & set page status to “Allocated”
    pred program[d, d' : Device, colAddr : Int, rowAddr : RowAddr, data : seq Data] { … }

    // erase data in block & increase its erase count, and set status of every page in block to “Free”
    pred erase[d, d' : Device, rowAddr : RowAddr] { … }

Abstract vs. concrete filesystem

Concrete filesystem in Alloy

    sig Inode { blockList : seq VBlock }
    sig VBlock {}   // virtual block
    sig ConcFsys {
      inodeMap : FID -> lone Inode,
      blockMap : VBlock one -> one RowAddr
    }

Concrete read operation (snippet)

    pred readConc[fsys : ConcFsys, d : Device, fid : FID, offset, size : Int, buffer : seq Data] {
      …
      all i : blocksToRead.inds {
        let vblock = blocksToRead[i],
            rowAddr = fsys.blockMap[vblock],
            from = PAGE_SIZE*i,
            to = from + PAGE_SIZE - 1 |
          buffer.subseq[from, to] = read[d, 0, rowAddr]
      }
      }
      …
    }

State of a flash filesystem
State is represented by a pair (ConcFsys, Device)

Read operation animated
Initially, buffer is empty
Three calls to flash read in total

Concrete read operation: Step 1
Extract blocks to read from
inode using offset & size

Concrete read operation: Step 2
Consider each index i in blocksToRead

Concrete read operation: Step 3
Retrieve the address of page for ith virtual block

Concrete read operation: Step 4
Calculate indices for current buffer slot

Concrete read operation: Step 5
Execute the flash API function, read

Wear-leveling

Wear-leveling example
Client sends a write request to overwrite data in VBlk1 with 0110
Simple approach: Erase Block2 & program Page5

Non-wear-leveling approach: Step 1
Client sends a write request to overwrite data in
VBlk1 with 0110
Step 1: Erase Block2

Non-wear-leveling approach: Step 2
Step 2: Program 0110 into Page5 - Done.

Why wear-level?
What’s wrong with the simple approach?
1. Frequent requests on VBlk1: Block2 wears out quickly
2. H/W failure: Original data in Page5 is lost

Wear-leveling approach
Client sends a write request to overwrite data in VBlk1 with 0110
Wear-leveling approach: Search for a free page & program

Wear-leveling approach: Step 1
Program 0110 into a free page, Page3

Wear-leveling approach: Step 2
Invalidate Page5 & validate Page3

Wear-leveling approach: Step 3
Update blockMap

Erase-unit reclamation (garbage collection)

Erase-unit reclamation example
Client sends a write request to append 0101 at the end of the inode
Problem: Flash is out of free pages (besides reserved ones)

Erase-unit reclamation: Step 1
Pick a dirty block with the lowest erase count

Erase-unit reclamation: Step 2
Relocate valid data to reserveBlock

Erase-unit reclamation: Step 3
Invalidate/validate pages & update blockMap

Erase-unit reclamation: Step 4
Erase Block2 & set it as the new reserveBlock

Erase-unit reclamation complete
Page0 in Block0 is now free and available for use

Concrete write operation
Transition between two pairs (fsys, d) and (fsys', d')

    pred writeConc[fsys, fsys' : ConcFsys, d, d' : Device, fid : FID, buffer : seq Data, offset, size : Int] { … }

PAGE_SIZE = 4
Flash API program is a single-step transition between two device states

Write operation: Phase 1
Partition input buffer into N fragments & program them
1. Introduce an intermediate device, interDev
2. Create a sequence of states between d and interDev using seq Device
3. Constrain the sequence

    pred stateSeqConds[init, final : Device, stateSeq : seq Device, length : Int] {
      stateSeq.first = init
      stateSeq.last = final
      #stateSeq = length + 1
    }

4. Program fragments one by one

Write operation: Phase 1.1
Introduce & constrain intermediate device states

    pred writeConc[fsys, fsys' : ConcFsys, d, d' : Device, fid : FID, buffer : seq Data, offset, size : Int] {
      …
      some stateSeq : seq Device, interDev : Device {
        stateSeqConds[d, interDev, stateSeq, numBlocksToProgram]
        all i : stateSeq.butlast.inds {
          let from = PAGE_SIZE * i,
              to = from + PAGE_SIZE - 1,
              dataFragment = buffer.subseq[from, to],
              vblock = inode.blockList[startBlkIndex + i],
              rowAddr = fsys.blockMap[vblock],
              preState = stateSeq[i],
              postState = stateSeq[i + 1] |
            programPage[preState, postState, rowAddr, dataFragment]
        }
        …

Write operation: Phase 1.2
For each sequence index i, extract a data fragment from buffer

Write operation: Phase 1.3
Retrieve the address of page for ith virtual block (could be empty)

Write operation: Phase 1.4
Retrieve the current pair of pre- and post-states

Write operation: Phase 1.5
Program data fragment into page at rowAddr

Write operation: Phase 2
Invalidate obsolete pages & validate all allocated pages by updating interDev.pageStatusMap

    pred writeConc[fsys, fsys' : ConcFsys, d, d' : Device, fid : FID, buffer : seq Data, offset, size : Int] {
      …
      some stateSeq : seq Device, interDev : Device {
        …
        updatePageStatus[interDev, d']
        updateFilesystemInfo[fsys, fsys']
      }
      …
    }

Write operation: Phase 3
Update
filesystem information (blockMap & inode.blockList)

Fault Tolerance
What happens when H/W loses power in the middle of a write operation?
On recovery, the filesystem must be in a state as if:
1. the operation has never begun, or
2. the operation has successfully completed
Power loss may occur either in Phase 1 or Phase 2

Phase 1 crash
At the time of failure, one or more pages have been programmed & their status set to Allocated.
Recovery: Invalidate every allocated page

Recovery from Phase 1 crash
After recovery, the filesystem is in the original state (but has extra invalid pages)

Phase 2 crash
At the time of failure, either:
1. some/all obsolete pages have been invalidated, or
2. all obsolete pages have been invalidated, and some allocated pages have been validated
Recovery: Complete the rest of Phase 2 & Phase 3

Recovery from Phase 2 crash
After recovery, the inode contains the new data as expected by the caller of writeConc

Refinement: Trace inclusion
Does the concrete filesystem conform to the abstract filesystem?
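
The conformance question can be pictured on a toy model before looking at the Alloy formulation. The Python sketch below is not the talk's model: it assumes a hypothetical page size of 2, data elements as small integers with 0 as the zero element, a concrete state made of an inode (list of page addresses) plus a flash store (address to page), and page-aligned offsets and sizes. The names `write_abs`, `write_conc`, `alpha` and `check_write_refinement` are illustrative only. It enumerates small states and inputs and checks that abstracting the result of a concrete write agrees with the abstract write.

```python
from itertools import product

PAGE_SIZE = 2  # hypothetical toy value; the talk's analysis used 4


def write_abs(contents, buf, offset, size):
    """Abstract write on a list of ints, following the three cases of writeAbs."""
    data = buf[:size]
    if not data:                                  # case 1: empty input, file unchanged
        return list(contents)
    # case 3: pad with zeros when offset lies past the end of the file
    out = list(contents) + [0] * max(0, offset - len(contents))
    out[offset:offset + len(data)] = data         # case 2: override from offset
    return out


def alpha(inode, flash):
    """Toy abstraction function: concatenate the pages an inode points to."""
    return [x for addr in inode for x in flash[addr]]


def write_conc(inode, flash, buf, offset, size):
    """Toy out-of-place write; assumes offset and data length are page-aligned."""
    data = buf[:size]
    inode, flash = list(inode), dict(flash)
    if not data:
        return inode, flash
    first = offset // PAGE_SIZE
    while len(inode) < first:                     # zero-filled pages for the gap
        addr = len(flash)
        flash[addr] = (0,) * PAGE_SIZE
        inode.append(addr)
    for i in range(len(data) // PAGE_SIZE):
        addr = len(flash)                         # fresh page; old pages stay behind as garbage
        flash[addr] = tuple(data[i * PAGE_SIZE:(i + 1) * PAGE_SIZE])
        if first + i < len(inode):
            inode[first + i] = addr               # remap an existing virtual block
        else:
            inode.append(addr)                    # extend the file
    return inode, flash


def check_write_refinement():
    """Exhaustively compare concrete and abstract writes at a small scope."""
    pages = list(product((0, 1), repeat=PAGE_SIZE))
    checked = 0
    for n in range(3):                            # files of 0, 1 or 2 pages
        for file_pages in product(pages, repeat=n):
            flash = dict(enumerate(file_pages))
            inode = list(range(n))
            for m in range(3):                    # buffers of 0, 1 or 2 pages
                for buf in product((1, 2), repeat=m * PAGE_SIZE):
                    for offset, size in product((0, 2, 4, 6), (0, 2, 4)):
                        inode2, flash2 = write_conc(inode, flash, list(buf), offset, size)
                        expected = write_abs(alpha(inode, flash), list(buf), offset, size)
                        assert alpha(inode2, flash2) == expected
                        checked += 1
    return checked
```

Calling `check_write_refinement()` raises an `AssertionError` on the first mismatch and otherwise returns the number of cases it compared. Like the bounded analysis in the talk, a clean run says nothing about states or inputs outside the enumerated scope.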
Abstraction function

    pred alpha[asys : AbsFsys, csys : ConcFsys, d : Device] {
      all fid : FID |
        let file = asys.fileMap[fid],
            inode = csys.inodeMap[fid],
            vblocks = inode.blockList {
          #file.contents = #vblocks * PAGE_SIZE
          (all i : vblocks.inds |
            let vblock = vblocks[i],
                from = i * PAGE_SIZE,
                to = from + PAGE_SIZE - 1,
                absDataFrag = file.contents.subseq[from, to],
                concDataFrag = findPageData[vblock, csys, d] |
              absDataFrag = concDataFrag)
        }
    }

Write refinement

    assert WriteRefinement {
      all csys, csys' : ConcFsys, asys, asys' : AbsFsys, d, d' : Device,
          fid : FID, buffer : seq Data, offset, size : Int |
        concInvariant[csys, d] and
        writeConc[csys, csys', d, d', fid, buffer, offset, size] and
        alpha[asys, csys, d] and
        alpha[asys', csys', d'] =>
          writeAbs[asys, asys', fid, buffer, offset, size]
    }

State invariant
concInvariant[csys, d], the first premise of WriteRefinement, constrains the concrete pre-state
e.g.
All pages within an inode have a valid status

    …
    all inode : FID.(csys.inodeMap) |
      all rowAddr : csys.blockMap[inode.blockList.elems] |
        d.pageStatusMap[rowAddr] = Valid
    …

Analysis results
WriteRefinement:
  A scope of 5 for each domain
  6 pages, each with 4 data elements
  Incremental modeling & analysis
  Found over 20 bugs during development
  Final version returned no counterexample, approximately 8 hours to check
ReadRefinement:
  Final version returned no counterexample, approximately 45 minutes to check

Discussion & future work

Discussion
On analysis:
  Our filesystem model is small, but the analysis still found bugs
  Many bugs occur in “boundary” cases, involving a small number of components
  Scientific argument for confidence?
On the Alloy language:
  Explicitly modeling state transitions: need better syntax/semantics?

Future work
On filesystem:
  Extended functionality (directories, etc.)
  Revisiting assumptions about flash H/W
  A wider variety of fault tolerance mechanisms
  Concurrency
On Alloy:
  Syntax/semantics for imperative statements
  Scalability
  Proof