Reliability analysis of ZFS - University of Wisconsin–Madison

University of Wisconsin - Madison
Reliability Analysis of ZFS
CS 736 Project




Summary
• To perform a reliability analysis of ZFS
• Test existing reliability claims
• Layered driver interface: simulating transient block corruptions at various levels in the ZFS on-disk hierarchy
• Results
  - Classes of faults handled by ZFS
  - Measure of the robustness of ZFS
  - Lessons on building a reliable, robust file system
Coming Up
Outline of the talk
• ZFS Organization
  - ZFS on-disk format
  - ZFS features and specs regarding reliability
• Experimental Setup and Experiments
• Results and Conclusions
• Future Work
ZFS Organization
Pooled Storage Model
- A disk is a ZFS pool comprising many file systems.
[Figure labels: ZFS, ZFS, ZFS, ZFS Pool]
ZFS Organization
• Object-based, transactional file system
  - Every structure is an object.
  - An operation on one or more objects is a transaction.
  - Transactions are grouped into transaction groups.
• All data and metadata blocks are checksummed.
  - No silent corruptions.
• Modifications are always copy-on-write (COW).
  - Always consistent on disk.
• All metadata is compressed; data compression is optional.
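The copy-on-write rule can be illustrated with a toy model. This is a minimal sketch, not ZFS code: the `CowStore` class, its dictionary "disk" and the `commit` method are hypothetical names chosen for illustration. It shows only the core idea that an update writes a new copy elsewhere and then switches a single pointer, so the previous version stays intact.

```python
class CowStore:
    """Toy copy-on-write store: a committed block is never overwritten in place."""

    def __init__(self):
        self.blocks = {}     # address -> bytes, stand-in for the disk
        self.next_addr = 0   # next free address
        self.root = None     # address of the current root block

    def commit(self, data):
        """Write data to a fresh address, then switch the root pointer.

        Returns the previous root address (None on the first commit).
        """
        new_addr = self.next_addr
        self.next_addr += 1
        self.blocks[new_addr] = data   # 1. write the new copy elsewhere
        old_root = self.root
        self.root = new_addr           # 2. single pointer switch
        return old_root


store = CowStore()
store.commit(b"version 1")
old = store.commit(b"version 2")
# The old version is still intact after the update.
print(store.blocks[old], store.blocks[store.root])
```

If a crash happened between step 1 and step 2 in this model, the root would still point at the old, complete version.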
ZFS Structures
• The entire file system is represented as
  - Objects: dnode_phys_t
  - Object sets: dnode_phys_t[]
• P/L analogy: each object is a template, and the bonus buffer describes its specific attributes.
ZFS Structures
Blocks and block pointers
• Data is transferred to disk in units of blocks.
• Block pointers (blkptr_t) are used to locate, verify and describe blocks.
  - A block pointer contains checksum and compression information.
  - The physical size of a block can differ from its logical size.
  - Gang blocks
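The role of a block pointer can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the real blkptr_t layout: the `BlockPointer` class and the `write_block`/`read_block` helpers are hypothetical, and SHA-256 and zlib are stand-ins chosen only because they ship with Python. The point is that the checksum lives in the pointer rather than in the block it describes, and that the physical size can differ from the logical size.

```python
import hashlib
import zlib
from dataclasses import dataclass


class ChecksumError(Exception):
    """Raised when stored bytes do not match the checksum in the pointer."""


@dataclass
class BlockPointer:
    offset: int        # where the block starts on the simulated disk
    psize: int         # physical size (as stored, possibly compressed)
    lsize: int         # logical size (uncompressed)
    compressed: bool
    checksum: bytes    # checksum of the stored bytes, kept in the pointer


def write_block(disk, data, compress=True):
    """Append a block to the disk and return a pointer describing it."""
    stored = zlib.compress(data) if compress else data
    offset = len(disk)
    disk.extend(stored)
    return BlockPointer(offset, len(stored), len(data), compress,
                        hashlib.sha256(stored).digest())


def read_block(disk, bp):
    """Locate the block, verify it against the pointer, then decompress it."""
    stored = bytes(disk[bp.offset:bp.offset + bp.psize])
    if hashlib.sha256(stored).digest() != bp.checksum:
        raise ChecksumError("block does not match checksum in its pointer")
    return zlib.decompress(stored) if bp.compressed else stored


disk = bytearray()
bp = write_block(disk, b"a" * 4096)
print(bp.psize < bp.lsize)          # compressed block is physically smaller
print(read_block(disk, bp) == b"a" * 4096)
```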
ZFS Structures
Block pointers
• Data Virtual Address (DVA): a combination of fields in blkptr_t used to locate a block on disk.
• Wideness: a blkptr_t can store up to three DVAs, each pointing to a copy of the same data. These copies are called “ditto blocks”.
  - Three copies for pool-wide metadata
  - Two copies for file-system-wide metadata
  - One copy for data (configurable)
[Figure: blkptr_t layout. Three DVAs (vdev1/asize/offset1, vdev2/asize/offset2, vdev3/asize/offset3), followed by the fields lvl, type, cksum, comp, psize, lsize]
ZFS Structures
[Figure: Wideness]
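Wideness can be sketched as a read path that falls back to another copy when one copy fails verification. This is an illustrative model only: `write_with_ditto` and `read_with_ditto` are hypothetical helpers, an (offset, size) pair stands in for a DVA, and SHA-256 stands in for whatever checksum the pool uses.

```python
import hashlib


def write_with_ditto(disk, data, copies):
    """Append `copies` identical copies of data; return their locations and checksum."""
    locations = []
    for _ in range(copies):
        locations.append((len(disk), len(data)))  # (offset, size), a stand-in for a DVA
        disk.extend(data)
    return locations, hashlib.sha256(data).digest()


def read_with_ditto(disk, locations, checksum):
    """Return the first copy whose checksum matches the one held by the parent."""
    for offset, size in locations:
        data = bytes(disk[offset:offset + size])
        if hashlib.sha256(data).digest() == checksum:
            return data
    raise IOError("all copies failed checksum verification")


disk = bytearray()
locs, cksum = write_with_ditto(disk, b"pool metadata", copies=3)
disk[locs[0][0]] ^= 0xFF                     # corrupt the first copy
print(read_with_ditto(disk, locs, cksum))    # served from a surviving copy
```

Note that this sketch only reads around the bad copy; it does not repair it.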
ZFS Structures
Attributes on disk
• ZAP (ZFS Attribute Processor)
• ZAP objects handle arbitrary (name, object) associations within an object set (objset).
  - Most commonly used to implement directories
  - Also used extensively throughout the DSL
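A directory built from (name, object) associations can be modelled as a plain mapping. The sketch below is an assumption-heavy illustration, not the ZAP on-disk format: the `resolve` function and the `objset` dictionary are hypothetical, with a Python dict standing in for a ZAP object and integers standing in for object numbers.

```python
def resolve(objset, root_obj, path):
    """Walk a path through directory objects, each a name -> object-number map."""
    obj = root_obj
    for name in path.strip("/").split("/"):
        if not name:           # handles "/" and repeated slashes
            continue
        directory = objset[obj]
        if not isinstance(directory, dict):
            raise NotADirectoryError(name)
        obj = directory[name]  # KeyError if the name is absent
    return obj


# Hypothetical object set: object number -> directory map or file contents.
objset = {
    1: {"docs": 2},            # root directory
    2: {"notes.txt": 3},       # /docs
    3: b"hello",               # /docs/notes.txt
}
print(resolve(objset, 1, "/docs/notes.txt"))   # prints 3
```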
Putting it all together
Objects
• Everything in ZFS is an object.
• A dnode describes and organizes a collection of blocks making up an object.
Putting it all together
Object Sets
• Related objects are grouped to form objsets.
• File systems, volumes, clones and snapshots are objsets.
[Figure labels: Object set, Objects]
Putting it all together
Datasets
• A dataset encapsulates an objset and provides
  - Space usage
  - Snapshot information
[Figure labels: DataSet, Object set, Objects, Space map, Snapshot Information]
Putting it all together
Dataset directories
• Group datasets
• Properties such as quotas and compression
• Dataset relationships
[Figure labels: DataSet Directory, DataSet, Object set, Objects, Space map, Snapshot Information, Properties, Child Map]
A road less travelled
From vdev label to data
To sum up
Moving forward
• Layers of indirection
• End-to-end checksums, stored separately from the data
• Wideness (ditto blocks): 3, 2 or 1 copies
• Compression
• Copy-on-write
• Scrub facility
Experimental Setup
Corruption Framework
• Corrupter (driver)
  - Modifies physical disk blocks
• Analyzer (app)
  - Understands on-disk ZFS structures
• Consumer (app)
  - Monitors ZFS responses and error codes
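The corrupter's job, modifying a chosen physical block, can be approximated in user space. This sketch is not the layered driver used in the project: it assumes a plain disk image file, a 512-byte block size, and hypothetical helper names (`corrupt_block`, `read_block`). It inverts the bits of one block and returns the original bytes so the change can be undone.

```python
import os
import tempfile

BLOCK_SIZE = 512  # assumed block size for this illustration


def corrupt_block(image_path, block_no, block_size=BLOCK_SIZE):
    """Invert every bit of one block in a disk image; return the original bytes."""
    with open(image_path, "r+b") as image:
        image.seek(block_no * block_size)
        original = image.read(block_size)
        image.seek(block_no * block_size)
        image.write(bytes(b ^ 0xFF for b in original))
    return original


def read_block(image_path, block_no, block_size=BLOCK_SIZE):
    """Read one block back from the image."""
    with open(image_path, "rb") as image:
        image.seek(block_no * block_size)
        return image.read(block_size)


# Demo on a throwaway 4-block image filled with zeros.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(bytes(4 * BLOCK_SIZE))
    demo_path = tmp.name

saved = corrupt_block(demo_path, 2)
print(read_block(demo_path, 2) != saved)   # the block content changed
os.remove(demo_path)
```

In the real framework the consumer would then exercise the file system and record the responses and error codes it returns.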
Experimental Setup - Simplification
• Setup on a Solaris 10 VM
• Only one physical vdev (disk)
• No striping, mirror, RAID…
• Initial target: pointer corruption
  - Reduced sample space
  - Interesting cases
• Disable compression as much as possible
Initial Finding
• All metadata is compressed.
  - Metadata compression cannot be disabled.
• Pointer corruption is not feasible.
• Corruptions are performed on compressed objects instead.
  - Representative of the effects of disk faults on ZFS
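One plausible reading of this finding is that a byte changed inside a compressed block does not map to a single field after decompression. The sketch below illustrates that effect under stated assumptions: zlib stands in for the pool's compression algorithm, and the 16-byte `pointer` record with its values is made up for the example.

```python
import struct
import zlib

# Hypothetical 16-byte "pointer": two little-endian 64-bit fields.
pointer = struct.pack("<QQ", 4096, 128)          # (offset, size), made-up values
stored = bytearray(zlib.compress(pointer * 32))  # what would sit on disk


def try_read(raw):
    """Return the decompressed bytes, or None if decompression fails."""
    try:
        return zlib.decompress(bytes(raw))
    except zlib.error:
        return None


stored[len(stored) // 2] ^= 0xFF                 # flip one byte mid-stream
result = try_read(stored)
# Either the stream no longer decompresses, or the output differs from the
# original; the change cannot be aimed at one pointer field.
print(result is None or result != pointer * 32)
```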
Corruption Experiments
• TYPE: type-aware object corruptions
• TARGET (targeted on-disk objects)
  - Vdev labels [@Pool]
  - Uberblocks [@Pool]
  - Object sets
    - Meta Object Set [@Pool]
      - objset_phys_t (describing object set)
      - Object array
    - Myfs Object Set [@FS]
      - objset_phys_t
      - Indirect blkptr objects
      - Object array
  - ZIL [@FS]
  - File Data [@FS]
  - Directory Data [@FS]
Results
Object                Detection      Recovery          Correction
vdev label            YES/Checksum   YES/Replica       NO/COW
uberblock             YES/Checksum   YES/Replica       NO/COW
MOS Object            YES/Checksum   YES/Replica       NO/COW
MOS Object Set        YES/Checksum   YES/Replica       NO/COW
FS Object             YES/Checksum   YES/Replica       NO/COW
FS Indirect Objects   YES/Checksum   YES/Replica       NO/COW
FS Object Set         YES/Checksum   YES/Replica       NO/COW
ZIL                   YES/Checksum   NO                NO
Directory Data        YES/Checksum   NO/Configurable   NO/Configurable
File Data             YES/Checksum   NO/Configurable   NO/Configurable
Summary (using IRON Taxonomy)
• Detection
  - Checksums in parent blkptrs
• Recovery
  - Replication in parent blkptrs (ditto blocks)
Conclusion
• Integration of file system and volume manager
  - Saves an additional translation
• Use of one generic block pointer for checksums and replication
  - Merkle tree provides robustness
• Use of replication and compression in a commodity file system is viable.
• COW can be used effectively.
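The Merkle tree idea, where each parent holds a checksum over its children so that one trusted root covers everything below it, can be sketched generically. This is a textbook binary Merkle tree using SHA-256, offered as an illustration only; it does not mirror the shape of the ZFS block tree, and `merkle_root` is a hypothetical helper.

```python
import hashlib


def _h(data):
    return hashlib.sha256(data).digest()


def merkle_root(leaves):
    """Root hash of a binary Merkle tree over a non-empty list of byte strings."""
    if not leaves:
        raise ValueError("need at least one leaf")
    level = [_h(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:                 # odd count: pair the last node with itself
            level.append(level[-1])
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


blocks = [b"block 0", b"block 1", b"block 2"]
trusted_root = merkle_root(blocks)

blocks[1] = b"corrupted"
print(merkle_root(blocks) == trusted_root)   # the root no longer matches
```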
Observations/Questions
• No correction of ditto blocks: relies on COW
  - Consecutive (n = wideness) failures without a transaction group commit?
  - Snapshot corruption?
• Explicit scrubbing corrects ditto blocks in place
  - Potential for corruption?
• Space/performance hit due to redundancy and compression
  - 2% hit in terms of space/IO? (Banham & Nash)
  - No page cache; uses ARC
Future Work
• Snapshot corruptions
• Multiple-device configurations
  - Striping
  - Mirror
  - RAID-Z