A Software Layer for Disk Fault Injection Jake Adriaens Dan Gibson

CS 736 Spring 2005
Instructor: Remzi Arpaci-Dusseau
Outline
1. Introduction, Motivation, & Challenges
2. Related Work
3. Implementation Details & IDE Driver
4. Fault Model
5. Methods & Evaluation
6. Summary
Overview - 1
• Software system for modeling IDE disk faults in an x86/Linux-based computer
• Modification to IDE driver for read/write event interception
Overview - 2
• Disk faults described at a high level
• Faults passed to kernel-level module
• On read/write event:
  – IDE driver calls kernel module to perform request modification
  – Before write event, module may modify data to be written
  – After read event, module may modify data read from disk
Motivation – Why purposely cause disk failures?
• Commodity HW (and SW!) fails, usually at unexpected times
  – Causing failures at expected times can help improve fault tolerance measures
• Can be used to determine fault tolerance of systems
  – Various flavors of RAID need fault injection
Motivation
• Faults can happen at the worst time
  – In the middle of a PowerPoint presentation…
Challenges
• Drivers are typically written with reliability in mind
  – May have error detection / correction measures
    • Should these be removed? Fooled? Applauded?
• Low-level drivers critically affect performance and stability of the system
  – Disk faults need not be “stable,” but shouldn’t have unusual “side effects”
Challenges
• Failure models difficult to justify
  – Disk manufacturers don’t offer details on how/why their disks fail
    • Failstop model is widely used: models complete, detected disk failure
    • Other models must be chosen generally to account for many different disks, controllers, etc.
Related Work
• Software fault injection
  – Huang et al. (and many others) use software fault injection for modifying cached web pages (ACM/ProcWWW)
  – Jarboui et al. inject software faults into the Linux kernel and observe system behavior
  – Nagaraja et al. inject faults into cluster-based systems
Related Work
• Disk Faults, Modeling, Detection
  – Kaaniche et al. inject disk faults to study RAID behavior
  – Kari et al. present fault detection and diagnosis techniques (separate studies)
  – Various other RAID and/or FS papers use some form of fault injection to model failures
Related Work
• Hardware Fault Injection
Implementation
• Core components
  – User-level parser
  – In-kernel injection module
  – In-driver upcalls
  – System calls
• Added ~20 lines to IDE driver code
• Kernel module is demand-loaded, ~250 lines in size
• 2 system calls, inject_fault and getdrivesize, ~120 lines
Implementation – User-level Console
• Used for fault definition
  – Console interface for fault definition
  – Processes batch files
  – Checks faults for validity
    • Sector ranges, probability, etc. (more later)
  – Passes faults to kernel module
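The validity check performed by the console might look roughly like the sketch below. The slides do not show the batch-file syntax or the real data structure, so the `FaultDef` fields, the `validate` function, and the `drive_sectors` argument are all assumptions made for illustration; only the fault type names come from the slides.

```python
from dataclasses import dataclass

# Fault type names are taken from the slides; everything else is hypothetical.
FAULT_TYPES = {"sectorfail", "sectorro", "sectorwrong",
               "transaddr", "transdata", "failstop"}


@dataclass
class FaultDef:
    kind: str            # one of FAULT_TYPES
    first_sector: int    # first affected logical sector (inclusive)
    last_sector: int     # last affected logical sector (inclusive)
    probability: float   # chance that an access in range excites the fault


def validate(fault, drive_sectors):
    """Return a list of problems; an empty list means the fault looks valid."""
    problems = []
    if fault.kind not in FAULT_TYPES:
        problems.append(f"unknown fault type: {fault.kind}")
    if not 0 <= fault.first_sector <= fault.last_sector:
        problems.append("sector range is empty or negative")
    if fault.last_sector >= drive_sectors:
        problems.append("sector range extends past end of drive")
    if not 0.0 <= fault.probability <= 1.0:
        problems.append("probability must be between 0 and 1")
    return problems
```

Only definitions that come back with an empty problem list would be passed on to the kernel module.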
Implementation – IDE Driver Modification
• Added “upcalls” to injection module
  – Pass I/O requests to module for modification
  – Provide callback service on I/O completion
• Added special-purpose code for certain fault models
  – Failstop model requires in-driver actions
Implementation – Kernel Module
• Receives fault lists from user-level console
• Called by IDE driver to perform insertion when:
  – LBA sector (SCSI-like) becomes known: sector may be modified
  – Write is initiated: data to be written may be modified
  – Read completes: data may be modified before returning control to I/O initiator
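The three upcall points can be pictured with a small user-space simulation. This is a sketch under stated assumptions, not the driver code: `InjectionModule`, `SimulatedDriver`, and the hook names are hypothetical, and a dictionary stands in for the disk.

```python
class InjectionModule:
    """Stand-in for the kernel module; holds callables that may alter a request."""

    def __init__(self):
        # Identity hooks: with no faults installed, requests pass through untouched.
        self.on_sector_known = lambda sector: sector
        self.on_write = lambda sector, data: data
        self.on_read_complete = lambda sector, data: data


class SimulatedDriver:
    """Toy driver: a dict of sector -> bytes, with upcalls at the three points."""

    def __init__(self, module):
        self.module = module
        self.media = {}

    def write(self, sector, data):
        sector = self.module.on_sector_known(sector)   # sector may be modified
        data = self.module.on_write(sector, data)      # data may be modified
        self.media[sector] = data

    def read(self, sector):
        sector = self.module.on_sector_known(sector)   # sector may be modified
        data = self.media.get(sector, bytes(512))      # unwritten sectors read as zeros
        return self.module.on_read_complete(sector, data)  # modified before return
```

Installing a fault then amounts to replacing one of the hooks, for example `module.on_read_complete = lambda sector, data: data[::-1]` to corrupt data on every read.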
Implementation – System Calls
• Added two system calls
  – inject_faults()
    • Used to pass fault definitions to kernel module from user space
  – getsectors()
    • Used to determine raw sector ranges of IDE devices by name (there are other ways to do this)
Implementation
[Figure: request flow diagram. Labels: Faults Defined, Disk Request, Control Returns, I/O Initiated, I/O Returns, Faults Injected, Upcall, Modified Request, Bus Traffic]
IDE Driver (2.4.26 Linux Kernel)
• Important structures
  – struct request
    • Information about an IDE request
      – READ / WRITE
      – Number of sectors
      – Etc.
  – struct ide_drive_s (_t)
    • Information about a drive
      – Drive name (e.g. “hdc”)
      – Sizing/addressing information
      – Etc.
IDE Driver (2.4.26 Linux Kernel)
• Functions
  – ide_do_rw_disk (3 versions)
    • Common choke-point for reads & writes
    • Many other similar functions, only this one in use
    • Two versions, swapped by preprocessor directives (one for DMA, one for PIO)
Failure Model
• Models selected to represent “generic IDE” disk
  – No modeling of specific failures (e.g. Western Digital’s “classic” servo malfunction)
  – Models based on ranges of affected logical sectors (à la SCSI)
Failure Model – Fault Types
• sectorfail
  – Models inability of a given sector (block) or sector range to store data reliably
  – Excited on read of sector:
    • Data read is permuted in some way:
      – Randomized
      – Set to specific value
      – Added to offset
      – Shifted by one or more bytes
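The four data permutations listed for sectorfail can be sketched as plain functions over a byte buffer. The slides do not define them precisely, so the details here are assumptions: "added to offset" is read as a per-byte addition modulo 256, and "shifted" is read as a rotation of the buffer. The function names are hypothetical.

```python
import random


def randomize(data, rng):
    """Replace every byte with a pseudo-random one drawn from `rng`."""
    return bytes(rng.randrange(256) for _ in data)


def set_to_value(data, value):
    """Overwrite every byte with one fixed value."""
    return bytes([value]) * len(data)


def add_offset(data, offset):
    """Add an offset to each byte, wrapping modulo 256 (assumed semantics)."""
    return bytes((b + offset) % 256 for b in data)


def shift_bytes(data, count):
    """Rotate the buffer left by `count` bytes (one reading of 'shifted')."""
    if not data:
        return data
    count %= len(data)
    return data[count:] + data[:count]
```

Each function leaves the buffer length unchanged, so a corrupted sector still looks like a full sector to the reader.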
Failure Model – Fault Types
• sectorro
  – Writes to block have no effect on stored value
  – Excited on writes to sector:
    • Write requests ignored
• sectorwrong
  – Traffic to a given block is directed to a different block
  – Excited on reads & writes
    • Address permuted, similarly to data
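The behavior of sectorro and sectorwrong can be illustrated on a toy disk. This is a sketch only: `FaultyDisk` and its attributes are hypothetical, and sectorwrong is modeled as a fixed redirect table rather than whatever permutation the real module applies.

```python
class FaultyDisk:
    """Toy disk: sector number -> stored bytes, with two kinds of installed fault."""

    def __init__(self):
        self.media = {}
        self.read_only = set()   # sectors with a sectorro fault
        self.redirect = {}       # sectorwrong: requested sector -> sector used

    def _resolve(self, sector):
        # sectorwrong is excited on both reads and writes.
        return self.redirect.get(sector, sector)

    def write(self, sector, data):
        if sector in self.read_only:
            return               # sectorro: the write is ignored, no error raised
        self.media[self._resolve(sector)] = data

    def read(self, sector):
        return self.media.get(self._resolve(sector), b"")
```

With `disk.redirect[2] = 3` installed, a read of sector 2 returns whatever sector 3 holds, and a write to sector 2 lands in sector 3.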
Failure Model – Fault Types
• transaddr
  – Sector number wrong for first fault excitation, but right for all others
  – Excited on reads & writes
    • Sector permuted as in sectorwrong
• transdata
  – Data is wrong for first fault excitation
    • Data permuted as in sectorfail
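Both transient faults share one idea: permute on the first excitation, then behave correctly. A minimal sketch, assuming a hypothetical `TransientFault` wrapper around any permutation function:

```python
class TransientFault:
    """Applies `permute` the first time it is excited, then becomes a no-op."""

    def __init__(self, permute):
        self.permute = permute
        self.fired = False

    def apply(self, value):
        if self.fired:
            return value          # every later excitation sees the right value
        self.fired = True
        return self.permute(value)
```

A transaddr-style fault would wrap a sector permutation, e.g. `TransientFault(lambda sector: sector + 1)`, and a transdata-style fault would wrap a data permutation, e.g. `TransientFault(lambda data: bytes(len(data)))`.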
Failure Model – Fault Types
• failstop
  – Drive is totally unresponsive: performs no reads or writes
  – Differs from traditional Failstop in that our failstop is invisible
    • Drive does not report any errors, simply fails to perform reads or writes to any sector
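The "invisible" variant can be contrasted with a detected failure using a toy disk that drops all I/O without signalling anything. The class name and the choice to leave the caller's read buffer untouched are assumptions; the slides only say that no reads or writes are performed and no error is reported.

```python
class SilentFailstopDisk:
    """Toy disk whose failstop fault drops all I/O without raising an error."""

    def __init__(self, sector_size=512):
        self.media = {}
        self.sector_size = sector_size
        self.failed = False

    def write(self, sector, data):
        if self.failed:
            return                      # write silently dropped
        self.media[sector] = data

    def read(self, sector, buffer):
        """Fill `buffer` from disk; after failstop the buffer is left untouched."""
        if self.failed:
            return                      # no data transferred, no error reported
        stored = self.media.get(sector, bytes(self.sector_size))
        buffer[:len(stored)] = stored
```

Because nothing is reported, the caller proceeds with whatever stale contents its buffer already held, which is consistent with the severe consequences described in the verification slides.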
Verification of Faults (?)
• Faults excited and observed by microbenchmarks tailored to individual fault types
• Techniques similar to latent fault detection (Kari et al., and other studies)
• Verification of faults is fault-specific
Verification - sectorfail
• Corrupts data when read from disk
  1. Write known data to disk; observe location using printk statement
  2. Inject sectorfail fault at location of file on disk
  3. Unmount/remount FS (flush cache)
  4. Attempt to read faulty file (with cat)
Verification - sectorro
• Ignores writes to a given location
  1. Write known data to disk
  2. Inject sectorro fault
  3. Flush file cache
  4. Write different data to same location
  5. Flush file cache
  6. Read data from disk; expect the data written in (1)
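The six steps can be replayed against a toy model to show why the cache flushes matter. This is a sketch under stated assumptions: `CachedDisk` is a hypothetical write-through cache over a dictionary, not the Linux file cache, and `verify_sectorro` is an illustrative name.

```python
class CachedDisk:
    """Toy file cache over a toy disk that can ignore writes (sectorro)."""

    def __init__(self):
        self.media = {}          # what is actually stored on "disk"
        self.cache = {}          # stand-in for the OS file cache
        self.read_only = set()   # sectors with a sectorro fault installed

    def write(self, sector, data):
        self.cache[sector] = data            # cache always sees the new data
        if sector not in self.read_only:
            self.media[sector] = data        # faulty sectors drop the write

    def read(self, sector):
        if sector not in self.cache:
            self.cache[sector] = self.media[sector]
        return self.cache[sector]

    def flush(self):
        self.cache.clear()                   # force later reads to hit the disk


def verify_sectorro(disk, sector=42):
    disk.write(sector, b"known")             # 1. write known data
    disk.read_only.add(sector)               # 2. inject sectorro
    disk.flush()                             # 3. flush file cache
    disk.write(sector, b"different")         # 4. write different data
    disk.flush()                             # 5. flush file cache
    return disk.read(sector) == b"known"     # 6. expect the data from step 1
```

Without the flush in step 5, the read would be served from the cache and return the new data, hiding the fault.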
Verification - sectorwrong
• Changes address (sector) to another sector number
  1. Write known data to disk
  2. Flush file cache
  3. Inject sectorwrong fault, redirecting to a known location
  4. Read from file; observe data from other sector
Verification - transdata
• Data modified after read, but only the first time
  1. Verify sectorfail functionality
  2. Flush file cache
  3. Re-read, expect correct data
Verification - transaddr
• Sector number modified before reads & writes
  1. Verify sectorwrong functionality
  2. Flush file cache
  3. Repeat read, expect correct data
Verification - failstop
• Easy!
  1. Install failstop fault
  2. Attempt to access any portion of affected drive
  3. Expect bad things
    – Usually causes kernel panic
Evaluation
• Execution time overhead of injection SW
  – Overhead << standard dev. of runtime for unaffected regions of disk space
  – Overhead << standard dev. of runtime for affected regions
  – Averaged over 250 accesses
Region              Avg. (ms)   Std.Dev.
Unaffected region   3.025       0.075
No injection        3.020       0.076
Affected Region
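The "overhead << standard deviation" claim can be checked with simple arithmetic on the two average/standard-deviation pairs that survive in the table above. The variable names are illustrative, and the sketch does not depend on which region each pair belongs to.

```python
# Values recovered from the slide's table (milliseconds).
avg_a, std_a = 3.025, 0.075
avg_b, std_b = 3.020, 0.076

difference = abs(avg_a - avg_b)              # gap between the two averages
smaller_std = min(std_a, std_b)
ratio = difference / smaller_std             # gap as a fraction of the std. dev.

print(f"difference = {difference:.3f} ms")
print(f"ratio to std. dev. = {ratio:.2f}")
```

The gap between the averages is about 0.005 ms, well under a tenth of either standard deviation, so any overhead is lost in run-to-run noise.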
Summary
• Present five new failure models for disk accesses, and the ability to inject them
• Verified fault manifestation
  – Did not verify potential side effects ?
• Fault injection has no noticeable effect on access times
  – Small SW overhead much smaller than access time to physical device