Formal Modeling and Analysis of a Flash Filesystem in Alloy

Eunsuk Kang
TDS Seminar, Mar. 14, 2008
What is flash memory?
Non-volatile, high-performance storage
Applications: MP3 players, laptop
drives, digital cameras, etc.
NASA Mars Exploration Rover Spirit
On-board flash memory to store scientific data
Flash anomaly on Spirit
System failure 18 days after landing (2004)
Loss of communication with Earth, stuck
in “reboot” loop
Cause: Flaw in the flash filesystem
Cost: 10 days of lost scientific activity
Testing for the unanticipated?
Out of free space, but still attempted to
service file operations
“There was a belief among the FSW
development team that the system
would not exhibit the behavior that is the
root cause of the anomaly…” [Reeves,
2004]
Testing is essential, but is it enough?
Answer: Formal methods?
Allows exhaustive analysis
BUT: Verifying a poorly designed piece
of code in an after-the-fact, ad hoc
manner is impractical
Apply formal methods early, get the
design right
Grand Challenge in Verification
Long term
“Build a verifying compiler” – Tony Hoare
Short term
“Build a verified flash filesystem” – Joshi &
Holzmann (Jet Propulsion Laboratory)
In this talk
“Build a verified design for a flash filesystem”
Outline
What is POSIX?
• IEEE standard for filesystem operations
• Adopted by UNIX, Mac OS X, etc.
• Reference model for the flash filesystem
• Function signatures & behaviors
e.g. write(fildes, *buf, nbyte, offset)
“The write() function shall attempt to write nbyte
bytes from the buffer pointed to by buf to the file
associated with the open file descriptor, fildes.”
POSIX filesystem in Alloy
Alloy
First-order relational logic + transitive closure
sig Data {}    // data element
sig FID {}     // file identifier
sig File {
  contents : seq Data
}
sig AbsFsys {  // abstract filesystem
  fileMap : FID -> lone File    // “lone” means one or zero
}
Abstract read operation
fun readAbs [fsys: AbsFsys, fid: FID, offset, size: Int] : seq Data {
  let file = fsys.fileMap[fid] |
    (file.contents).subseq[offset, offset + size - 1]
}
// simulation
run {
  some fsys : AbsFsys, fid : FID, output : seq Data |
    output = readAbs[fsys, fid, 1, 3]
} for 3
Abstract write operation
pred writeAbs[fsys, fsys' : AbsFsys, fid : FID, buffer : seq Data,
              offset, size : Int] {
  let file = fsys.fileMap[fid], file' = fsys'.fileMap[fid],
      buffer' = buffer.subseq[0, size - 1] {
    (#buffer' = 0) => file' = file                    // case 1
    (#buffer' != 0) =>                                // case 2 & 3
      file'.contents = (zeros[offset] ++ file.contents) ++ shift[buffer', offset]
    writePromote[fsys, fsys', file, file', fid]
  }
}
// promotion
pred writePromote[fsys, fsys' : AbsFsys, file, file' : File, fid : FID] {
  fsys'.fileMap = fsys.fileMap ++ (fid -> file')
}
Alloy is pure logic
• No built-in syntax/semantics for state machines
• Transition as an explicit constraint between two states
Abstract write operation: Case 1
• Input buffer is empty; no changes to the file
Abstract write operation: Case 2
• Offset is within the file
• Shift buffer by offset & override existing data
Abstract write operation: Case 3
• Offset is after the end of the file
• Fill in the gap with zeros
Promotion
• A style of modeling changes in system state
• Ensure all other files remain unchanged
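The three write cases and the promotion step can be mimicked in a few lines of Python. This is only an illustrative sketch, not part of the Alloy model: the names read_abs, write_abs, subseq and ZERO are made up here, sequences are Python lists, and the filesystem is a plain dict from file identifier to contents. Relational override (++) is imitated with dict updates keyed by index.

```python
ZERO = 0  # hypothetical stand-in for the model's "zero" data element


def subseq(s, lo, hi):
    """Inclusive slice, in the spirit of Alloy's s.subseq[lo, hi]."""
    return s[lo:hi + 1] if hi >= lo else []


def read_abs(file_map, fid, offset, size):
    """Return `size` elements of the file starting at `offset`."""
    return subseq(file_map[fid], offset, offset + size - 1)


def write_abs(file_map, fid, buffer, offset, size):
    """Return a new file map; the input map is left untouched."""
    old = file_map[fid]
    data = subseq(buffer, 0, size - 1)
    if not data:
        new = list(old)                                  # case 1: nothing to write
    else:
        rel = {i: ZERO for i in range(offset)}           # zeros[offset]
        rel.update(enumerate(old))                       # ++ file.contents
        rel.update({offset + i: x for i, x in enumerate(data)})  # ++ shifted buffer
        new = [rel[i] for i in range(len(rel))]          # cases 2 & 3
    promoted = dict(file_map)                            # promotion: other files unchanged
    promoted[fid] = new
    return promoted


fs = {"f": [1, 2, 3, 4], "g": [5]}
print(write_abs(fs, "f", [9, 9], 1, 2)["f"])   # offset inside the file
print(write_abs(fs, "f", [7], 6, 1)["f"])      # offset past the end, gap zero-filled
```

Building the result as an index-to-value mapping keeps the sketch close to the relational reading of the Alloy expression, where later overrides win on shared indices.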
What makes flash special?
Two types: NOR and NAND
Program (i.e. write) at the page level,
erase at the block level
Must erase before programming
Block can be erased only a limited
number of times (need wear-leveling)
Modeling memory hierarchy
sig Page { data : seq Data } { #data = PAGE_SIZE }
sig Block { pages : seq Page } { #pages = BLOCK_SIZE }
sig LUN { blocks : seq Block } { #blocks = LUN_SIZE }
sig Device {
LUNs : seq LUN
…
} { #LUNs = DEVICE_SIZE }
// simulation with constraints
run {
some Device
DEVICE_SIZE = 1
LUN_SIZE = 2
BLOCK_SIZE = 2
PAGE_SIZE = 4
} for 4
Addressing mode
Row & column addresses:
sig RowAddr {    // used to access a page
  lunIndex : Int,
  blockIndex : Int,
  pageIndex : Int
}
A column address is an Int, and
identifies a data element in a page
Example:
rowAddr.lunIndex = 0
rowAddr.blockIndex = 1
rowAddr.pageIndex = 1
columnAddr = 1
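As a rough illustration of how a row address and a column address select one data element, here is a small Python sketch. It is not taken from the model: make_device, page_at and element_at are hypothetical helper names, the device is a nested list, and the sizes are the ones used in the simulation command (DEVICE_SIZE = 1, LUN_SIZE = 2, BLOCK_SIZE = 2, PAGE_SIZE = 4).

```python
from dataclasses import dataclass

# Sizes taken from the simulation constraints in the slides.
PAGE_SIZE, BLOCK_SIZE, LUN_SIZE, DEVICE_SIZE = 4, 2, 2, 1


@dataclass(frozen=True)
class RowAddr:
    lun_index: int
    block_index: int
    page_index: int


def make_device(fill=0):
    """device[lun][block][page] is a list of PAGE_SIZE data elements."""
    return [[[[fill] * PAGE_SIZE for _ in range(BLOCK_SIZE)]
             for _ in range(LUN_SIZE)]
            for _ in range(DEVICE_SIZE)]


def page_at(device, row):
    """A row address identifies one page."""
    return device[row.lun_index][row.block_index][row.page_index]


def element_at(device, row, col):
    """A column address identifies one data element within that page."""
    return page_at(device, row)[col]


dev = make_device()
row = RowAddr(lun_index=0, block_index=1, page_index=1)
dev[0][1][1] = [5, 6, 7, 8]
print(element_at(dev, row, 1))  # element at column 1 of that page
```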
Page status & data structures
• Each page is associated with its current status
abstract sig PageStatus {}
one sig Free, Allocated, Valid, Invalid extends PageStatus {}
• Auxiliary data structures*
sig Device {
  LUNs : seq LUN,
  pageStatusMap : RowAddr -> one PageStatus,
  eraseCountMap : RowAddr -> one Int,    // wear-leveling
  reserveBlock : RowAddr                 // garbage collection
} { #LUNs = DEVICE_SIZE }
(* disclaimers)
Flash API functions
// reads data from page, starting at “colAddr”
fun read[d : Device, colAddr : Int, rowAddr : RowAddr] : seq Data { … }
// program data into page & set page status to “Allocated”
pred program[d, d' : Device, colAddr : Int, rowAddr : RowAddr,
             data : seq Data] { … }
// erase data in block & increase its erase count, and set status of
// every page in block to “Free”
pred erase[d, d' : Device, rowAddr : RowAddr] { … }
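The bodies of the three API functions are elided in the slides. The Python sketch below shows one plausible reading of their comments, under several assumptions that are not from the source: a single LUN, row addresses written as (block, page) tuples, erase counts kept per block, erased elements shown as None, and program refusing a page that is not Free (following the "must erase before programming" rule). Each call returns a new device, mirroring the d / d' pair.

```python
import copy

PAGE_SIZE = 4          # data elements per page
PAGES_PER_BLOCK = 2    # pages per block


def new_device(num_blocks):
    """Hypothetical constructor: every page erased and Free."""
    rows = [(b, p) for b in range(num_blocks) for p in range(PAGES_PER_BLOCK)]
    return {
        "pages": {row: [None] * PAGE_SIZE for row in rows},
        "status": {row: "Free" for row in rows},
        "erase_count": {b: 0 for b in range(num_blocks)},
    }


def read(d, col_addr, row_addr):
    """Read data from a page, starting at col_addr."""
    return d["pages"][row_addr][col_addr:]


def program(d, col_addr, row_addr, data):
    """Program data into a page and mark it Allocated (returns d')."""
    if d["status"][row_addr] != "Free":
        raise ValueError("page must be erased before programming")
    if col_addr + len(data) > PAGE_SIZE:
        raise ValueError("data does not fit in the page")
    d2 = copy.deepcopy(d)
    d2["pages"][row_addr][col_addr:col_addr + len(data)] = data
    d2["status"][row_addr] = "Allocated"
    return d2


def erase(d, row_addr):
    """Erase the whole block containing row_addr (returns d')."""
    block = row_addr[0]
    d2 = copy.deepcopy(d)
    for p in range(PAGES_PER_BLOCK):
        d2["pages"][(block, p)] = [None] * PAGE_SIZE
        d2["status"][(block, p)] = "Free"
    d2["erase_count"][block] += 1
    return d2
```

Note the asymmetry the slides point out: program touches a single page, while erase resets every page of the block and is the only operation that bumps the erase count.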
Abstract vs. concrete filesystem
Concrete filesystem in Alloy
sig Inode { blockList : seq VBlock }
sig VBlock {}    // virtual block
sig ConcFsys {
  inodeMap : FID -> lone Inode,
  blockMap : VBlock one -> one RowAddr
}
Concrete read operation (snippet)
pred readConc[fsys : ConcFsys, d : Device,
              fid : FID, offset, size : Int, buffer : seq Data] {
  …
  all i : blocksToRead.inds {
    let vblock = blocksToRead[i],
        rowAddr = fsys.blockMap[vblock],
        from = PAGE_SIZE*i,
        to = from + PAGE_SIZE - 1 |
      buffer.subseq[from, to] = read[d, 0, rowAddr] }
  }
  …
}
State of a flash filesystem
• State is represented by a pair (ConcFsys, Device)
Read operation animated
Initially, buffer is empty
Three calls to flash read in total
Concrete read operation: Step 1
• Extract blocks to read from inode using offset & size
Concrete read operation: Step 2
• Consider each index i in blocksToRead
Concrete read operation: Step 3
• Retrieve the address of the page for the i-th virtual block
Concrete read operation: Step 4
• Calculate indices for current buffer slot
Concrete read operation: Step 5
• Execute the flash API function, read
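The five steps of the concrete read can be walked through imperatively. The Python sketch below is an illustration only: the way blocks_to_read is derived from offset and size is elided in the slides, so the page-aligned computation here is an assumption, as are the names read_conc and flash_read and the use of page labels as row addresses. Like the predicate, it fills the buffer with whole pages.

```python
PAGE_SIZE = 4


def flash_read(pages, col_addr, row_addr):
    """Stand-in for the flash API read: data of a page from col_addr on."""
    return pages[row_addr][col_addr:]


def read_conc(inode_blocks, block_map, pages, offset, size):
    if size <= 0:
        return []
    # Step 1 (assumed): virtual blocks covering [offset, offset + size)
    first = offset // PAGE_SIZE
    last = (offset + size - 1) // PAGE_SIZE
    blocks_to_read = inode_blocks[first:last + 1]
    buffer = [None] * (PAGE_SIZE * len(blocks_to_read))
    for i, vblock in enumerate(blocks_to_read):            # step 2
        row_addr = block_map[vblock]                        # step 3
        lo = PAGE_SIZE * i                                  # step 4
        hi = lo + PAGE_SIZE - 1
        buffer[lo:hi + 1] = flash_read(pages, 0, row_addr)  # step 5
    return buffer


pages = {"Page5": [1, 1, 0, 0], "Page2": [0, 1, 0, 1], "Page0": [1, 0, 0, 1]}
block_map = {"VBlk0": "Page5", "VBlk1": "Page2", "VBlk2": "Page0"}
inode = ["VBlk0", "VBlk1", "VBlk2"]
print(read_conc(inode, block_map, pages, 0, 12))  # three flash reads in total
```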
Wear-leveling
Wear-leveling example
Client sends a write request to overwrite data in VBlk1
with 0110
Simple approach: Erase Block2 & program Page5
Non-wear-leveling approach: Step 1
Step 1: Erase Block2
Non-wear-leveling approach: Step 2
Step 2: Program 0110 into Page5 - Done.
Why wear-level?
What’s wrong with a simple approach?
1. Frequent requests on VBlk1: Block2 wears out quickly
2. H/W failure: Original data in Page5 is lost
Wear-leveling approach
Client sends a write request to overwrite data in VBlk1 with
0110
Wear-leveling approach: Search for a free page & program
Wear-leveling approach: Step 1
Step 1: Program 0110 into a free page, Page3
Wear-leveling approach: Step 2
Step 2: Invalidate Page5 & validate Page3
Wear-leveling approach: Step 3
Step 3: Update blockMap
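The three wear-leveling steps can be sketched in Python as follows. This is an illustration under assumptions not stated in the slides: pages are identified by labels such as "Page3", the state is a dict with pages, status and block_map entries, and the free page is simply the first Free page in sorted order (the slides do not say how it is chosen). The function name overwrite_with_wear_leveling is hypothetical.

```python
def overwrite_with_wear_leveling(state, vblock, data):
    """Return a new state in which vblock holds `data` on a fresh page."""
    pages = dict(state["pages"])
    status = dict(state["status"])
    block_map = dict(state["block_map"])

    old_row = block_map[vblock]
    free_rows = sorted(r for r, s in status.items() if s == "Free")
    if not free_rows:
        raise RuntimeError("no free page: erase-unit reclamation is needed")
    new_row = free_rows[0]

    # Step 1: program the data into a free page
    pages[new_row] = list(data)
    status[new_row] = "Allocated"
    # Step 2: invalidate the old page, validate the new one
    status[old_row] = "Invalid"
    status[new_row] = "Valid"
    # Step 3: update blockMap
    block_map[vblock] = new_row
    return {"pages": pages, "status": status, "block_map": block_map}


before = {
    "pages": {"Page3": None, "Page5": [1, 0, 0, 1]},
    "status": {"Page3": "Free", "Page5": "Valid"},
    "block_map": {"VBlk1": "Page5"},
}
after = overwrite_with_wear_leveling(before, "VBlk1", [0, 1, 1, 0])
print(after["block_map"], after["status"])
```

Because no block is erased, the old contents of Page5 are still physically present after the write, which is what makes recovery from a mid-write failure possible.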
Erase unit reclamation
(garbage collection)
Erase-unit reclamation example
Client sends a write request to append 0101 at the end of
the inode
Problem: Flash is out of free pages (besides reserved ones)
Erase-unit reclamation: Step 1
Step 1: Pick a dirty block with the least erase count
Erase-unit reclamation: Step 2
Step 2: Relocate valid data to reserveBlock
Erase-unit reclamation: Step 3
Step 3: Invalidate/validate pages & update blockMap
Erase-unit reclamation: Step 4
Step 4: Erase Block2 & set it as the new reserveBlock
Erase-unit reclamation complete
Complete: Page0 in Block0 is now free and available for use
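The four reclamation steps can be sketched in Python. The data layout and names here are hypothetical, not the model's: each block is a list of page records with data and status, block_map sends a virtual block to a (block, page index) pair, a "dirty" block is assumed to be one with at least one Invalid page, and ties on erase count are broken by block name.

```python
import copy


def reclaim(state):
    """Return a new state after reclaiming one dirty block."""
    s = copy.deepcopy(state)
    reserve = s["reserve"]
    # Step 1: pick a dirty block with the least erase count
    dirty = [b for b, pages in s["blocks"].items()
             if b != reserve and any(p["status"] == "Invalid" for p in pages)]
    if not dirty:
        raise RuntimeError("no dirty block to reclaim")
    victim = min(dirty, key=lambda b: (s["erase_count"][b], b))

    owner = {addr: v for v, addr in s["block_map"].items()}
    slot = 0
    for idx, page in enumerate(s["blocks"][victim]):
        if page["status"] != "Valid":
            continue
        # Step 2: relocate valid data to the reserve block
        target = s["blocks"][reserve][slot]
        target["data"] = page["data"]
        # Step 3: invalidate/validate pages and update blockMap
        target["status"] = "Valid"
        page["status"] = "Invalid"
        vblock = owner.get((victim, idx))
        if vblock is not None:
            s["block_map"][vblock] = (reserve, slot)
        slot += 1

    # Step 4: erase the victim and make it the new reserve block
    for page in s["blocks"][victim]:
        page["data"], page["status"] = None, "Free"
    s["erase_count"][victim] += 1
    s["reserve"] = victim
    return s
```

Any reserve-block pages left over after relocation stay Free, which is where the space for the pending write comes from.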
Concrete write operation
• Transition between two pairs (fsys, d) and (fsys', d')
pred writeConc[fsys, fsys' : ConcFsys, d, d' : Device,
               fid : FID, buffer : seq Data, offset, size : Int] { … }
PAGE_SIZE = 4
Flash API program is a
single-step transition
between two device states
Write operation: Phase 1
• Partition input buffer into N fragments & program them
1. Introduce an intermediate device, interDev
2. Create a sequence of states between d and interDev using
seq Device
3. Constrain the sequence
pred stateSeqConds[init, final : Device, stateSeq : seq Device, length : Int] {
stateSeq.first = init
stateSeq.last = final
#stateSeq = length + 1
}
4. Program fragments one by one
Write operation: Phase 1.1
• Introduce & constrain intermediate device states
pred writeConc[fsys, fsys' : ConcFsys, d, d' : Device,
               fid : FID, buffer : seq Data, offset, size : Int] {
  …
  some stateSeq : seq Device, interDev : Device {
    stateSeqConds[d, interDev, stateSeq, numBlocksToProgram]
    all i : stateSeq.butlast.inds {
      let from = PAGE_SIZE * i, to = from + PAGE_SIZE - 1,
          dataFragment = buffer.subseq[from, to],
          vblock = inode.blockList[startBlkIndex + i],
          rowAddr = fsys.blockMap[vblock],
          preState = stateSeq[i], postState = stateSeq[i + 1] |
        programPage[preState, postState, rowAddr, dataFragment]
    }
    …
Write operation: Phase 1.2
• For each sequence index i, extract a data fragment from buffer
Write operation: Phase 1.3
• Retrieve the address of the page for the i-th virtual block (could be empty)
Write operation: Phase 1.4
• Retrieve the current pair of pre- and post-states
Write operation: Phase 1.5
• Program data fragment into page at rowAddr
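Phase 1 threads a chain of single-step program transitions through a sequence of device states. The Python sketch below imitates that idea with immutable snapshots; it is illustrative only. The names program_page, state_seq_conds and program_fragments are stand-ins, a device is reduced to a dict from page label to (data, status), and the target rows are passed in directly instead of being looked up through the inode and blockMap.

```python
PAGE_SIZE = 4


def program_page(dev, row_addr, fragment):
    """Single-step transition: returns the post-state, leaves dev untouched."""
    post = dict(dev)
    post[row_addr] = (tuple(fragment), "Allocated")
    return post


def state_seq_conds(init, final, state_seq, length):
    """Same three conditions as the stateSeqConds predicate."""
    return (state_seq[0] == init
            and state_seq[-1] == final
            and len(state_seq) == length + 1)


def program_fragments(d, rows, buffer):
    """Program one PAGE_SIZE fragment per row; return all device states."""
    state_seq = [d]
    for i, row_addr in enumerate(rows):
        lo = PAGE_SIZE * i
        fragment = buffer[lo:lo + PAGE_SIZE]
        pre_state = state_seq[i]
        state_seq.append(program_page(pre_state, row_addr, fragment))
    return state_seq


d = {"Page0": ((), "Free"), "Page1": ((), "Free")}
seq = program_fragments(d, ["Page0", "Page1"], [0, 1, 0, 1, 1, 1, 0, 0])
inter_dev = seq[-1]
print(state_seq_conds(d, inter_dev, seq, 2))
```

Where the Alloy predicate merely constrains some sequence to exist, the sketch constructs it; the resulting list satisfies the same first/last/length conditions.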
Write operation: Phase 2
• Invalidate obsolete pages & validate all allocated pages by updating interDev.pageStatusMap
pred writeConc[fsys, fsys' : ConcFsys, d, d' : Device,
               fid : FID, buffer : seq Data, offset, size : Int] {
  …
  some stateSeq : seq Device, interDev : Device {
    …
    updatePageStatus[interDev, d']
    updateFilesystemInfo[fsys, fsys']
  }
  …
}
Write operation: Phase 3
• Update filesystem information (blockMap & inode.blockList)
Fault Tolerance
What happens when H/W loses power in the middle of a write operation?
• On recovery, the filesystem must be in a state as if:
1. the operation had never begun, or
2. the operation had successfully completed
• Power loss may occur in either Phase 1 or Phase 2
Phase 1 crash
At the time of failure, one or more pages programmed &
status set to Allocated.
Recovery: Invalidate every allocated page
Recovery from Phase 1 crash
After recovery, the filesystem is in the original state (but
has extra invalid pages)
Phase 2 crash
At the time of failure:
1. some/all obsolete pages have been invalidated
2. all obsolete pages have been invalidated, and some
allocated pages have been validated
Recovery: Complete the rest of Phase 2 & Phase 3
Recovery from Phase 2
After recovery, the inode contains the new data as
expected by the caller of writeConc
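The two recovery rules can be illustrated with a small Python sketch. The slides do not say how the recovery procedure learns which phase was interrupted or which pages belong to the interrupted write, so the pending argument below (a list of (vblock, obsolete page, new page) triples) is purely a hypothetical device for the example, as are the function names.

```python
def recover_phase1(status):
    """Phase 1 crash: invalidate every Allocated page; blockMap is untouched."""
    return {row: ("Invalid" if s == "Allocated" else s)
            for row, s in status.items()}


def recover_phase2(status, block_map, pending):
    """Phase 2 crash: finish Phase 2, then Phase 3.

    `pending` is a hypothetical record of the interrupted write:
    (vblock, obsolete_row or None, new_row) triples.
    """
    status = dict(status)
    block_map = dict(block_map)
    for _, obsolete_row, _ in pending:          # finish invalidation
        if obsolete_row is not None:
            status[obsolete_row] = "Invalid"
    for vblock, _, new_row in pending:          # validate, then update blockMap
        status[new_row] = "Valid"
        block_map[vblock] = new_row
    return status, block_map


# Crash during Phase 1: Page3 was programmed but nothing else happened.
print(recover_phase1({"Page3": "Allocated", "Page5": "Valid"}))

# Crash during Phase 2: Page5 already invalidated, Page3 not yet validated.
print(recover_phase2({"Page3": "Allocated", "Page5": "Invalid"},
                     {"VBlk1": "Page5"},
                     [("VBlk1", "Page5", "Page3")]))
```

Both functions are idempotent in this sketch, so a second power loss during recovery would not change the outcome of rerunning them.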
Refinement: Trace inclusion
Does the concrete filesystem conform
to the abstract filesystem?
Abstraction function
pred alpha[asys : AbsFsys, csys : ConcFsys, d : Device] {
  all fid : FID |
    let file = asys.fileMap[fid], inode = csys.inodeMap[fid],
        vblocks = inode.blockList {
      #file.contents = #vblocks * PAGE_SIZE
      (all i : vblocks.inds |
        let vblock = vblocks[i],
            from = i * PAGE_SIZE, to = from + PAGE_SIZE - 1,
            absDataFrag = file.contents.subseq[from, to],
            concDataFrag = findPageData[vblock, csys, d] |
          absDataFrag = concDataFrag)
    }
}
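Read as a check, alpha says that a file's abstract contents, cut into PAGE_SIZE fragments, must equal the pages its inode points to. A minimal Python sketch of that check follows. The representation is assumed for illustration: plain dicts for fileMap, inodeMap and blockMap, page labels for row addresses, and a direct page lookup standing in for findPageData. A file identifier missing from a map is treated as empty, which mirrors how the relational join behaves in the predicate.

```python
PAGE_SIZE = 4


def alpha(abs_files, inode_map, block_map, pages):
    """True if the abstract state matches the concrete state and device."""
    for fid in abs_files.keys() | inode_map.keys():
        contents = abs_files.get(fid, [])
        vblocks = inode_map.get(fid, [])
        if len(contents) != len(vblocks) * PAGE_SIZE:
            return False
        for i, vblock in enumerate(vblocks):
            lo = i * PAGE_SIZE
            abs_frag = contents[lo:lo + PAGE_SIZE]
            conc_frag = pages[block_map[vblock]]   # stands in for findPageData
            if abs_frag != conc_frag:
                return False
    return True


pages = {"Page0": [1, 0, 0, 1], "Page2": [0, 1, 1, 0]}
block_map = {"VBlk0": "Page2", "VBlk1": "Page0"}
inode_map = {"f": ["VBlk0", "VBlk1"]}
print(alpha({"f": [0, 1, 1, 0, 1, 0, 0, 1]}, inode_map, block_map, pages))
```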
Write refinement
assert WriteRefinement {
  all csys, csys' : ConcFsys, asys, asys' : AbsFsys, d, d' : Device,
      fid : FID, buffer : seq Data, offset, size : Int |
    concInvariant[csys, d] and
    writeConc[csys, csys', d, d', fid, buffer, offset, size] and
    alpha[asys, csys, d] and
    alpha[asys', csys', d']
    =>
    writeAbs[asys, asys', fid, buffer, offset, size]
}
State invariant
e.g. All pages within an inode have a valid status
…
all inode : FID.(csys.inodeMap) |
all rowAddr : csys.blockMap[inode.blockList.elems] |
d.pageStatusMap[rowAddr] = Valid
…
Analysis results
WriteRefinement:
• A scope of 5 for each domain
• 6 pages, each with 4 data elements
• Incremental modeling & analysis
• Found over 20 bugs during development
• Final version returned no counterexample; approximately 8 hours to check
ReadRefinement:
• Final version returned no counterexample; approximately 45 minutes to check
Discussion & future work
Discussion
On analysis:
• Our filesystem is small, but the analysis still found bugs
• Many bugs occur in “boundary” cases, involving a small number of components
• Scientific argument for confidence?
On the Alloy language:
• Explicitly modeling state transitions: need better syntax/semantics?
Future work
On filesystem:
• Extended functionality (directories, etc.)
• Revisiting assumptions about flash H/W
• A wider variety of fault tolerance mechanisms
• Concurrency
On Alloy:
• Syntax/semantics for imperative statements
• Scalability
• Proof