FFS, LFS, and RAID Andy Wang COP 5611 Advanced Operating Systems

advertisement
FFS, LFS, and RAID
Andy Wang
COP 5611
Advanced Operating Systems
UNIX Fast File System
Designed to improve performance of UNIX
file I/O
Two major areas of performance
improvement
Bigger block sizes
Better on-disk layout for files
Block Size Improvement
4x block size quadrupled amount of data
gotten per disk fetch
But could lead to fragmentation problems
So fragments introduced
Small files stored in fragments
Fragments addressable
But not independently fetchable
Disk Layout Improvements
Aimed toward avoiding disk seeks
Bad if finding related files takes many
seeks
Very bad if find all the blocks of a single
file requires seeks
Spatial locality: keep related things close
together on disk
Cylinder Groups
A cylinder group: a set of consecutive disk
cylinders in the FFS
Files in the same directory stored in the
same cylinder group
Within a cylinder group, tries to keep
things contiguous
But must not let a cylinder group fill up
Locations for New Directories
Put new directory in relatively empty
cylinder group
What is “empty”?
Many free i_nodes
Few directories already there
The Importance of Free Space
FFS must not run too close to capacity
No room for new files
Layout policies ineffective when too few
free blocks
Typically, FFS needs 10% of the total
blocks free to perform well
Performance of FFS
4x to 15x the bandwidth of old UNIX file
system
Depending on size of disk blocks
Performance on original file system
Limited by CPU speed
Due to memory-to-memory buffer copies
FFS Not the Ultimate Solution
Based on technology of the early 80s
And file usage patterns of those times
In modern systems, FFS achieves only
~5% of raw disk bandwidth
The Log-Structured File System
Large caches can catch almost all reads
But most writes have to go to disk
So FS performance can be limited by
writes
So, produce a FS that writes quickly
Like an append-only log
Basic LFS Architecture
Buffer writes, send them sequentially to
disk
Data blocks
Attributes
Directories
And almost everything else
Converts small sync writes to large async
writes
A Simple Log Disk Structure
File
A
File
Z
File
M
File
A
File
F
File
A
File
L
File
L
Block Block Block Block Block Block Block Block
7
1
202
3
1
7
26
25
Head of Log
Key Issues in Log-Based Architecture
1. Retrieving information from the log
No matter how well you cache, sooner
or later you have to read
2. Managing free space on the disk
You need contiguous space to write in the long run, how do you
get more?
Finding Data in the Log
File
A
File
Z
File
M
File
A
File
F
File
A
File
L
File
L
Block Block Block Block Block Block Block Block
7
1
202
3
1
7
26
25
Give me block 25 of file L
Or,
Give me block 1 of file F
Retrieving Information From the Log
Must avoid sequential scans of disk to
read files
Solution: store index structures in log
Index is essentially the most recent
version of the i_node
Finding Data in the Log
Foo
Block1
(old)
Foo
Foo
Block3 Block2
Foo
Block 1
How do you find all blocks of file Foo?
Finding Data in the Log with an
I_node
Foo
Block1
(old)
Foo
Foo
Block3 Block2
Foo
Block 1
How Do You Find a File’s I_node?
You could search sequentially
LFS optimizes by writing i_node maps to
the log
The i_node map points to the most recent
version of each i_node
A file system’s i_nodes cover multiple
blocks of i_node map
How Do You Find the Inode?
The Inode Map
How Do You Find Inode Maps?
Use a fixed region on disk that always
points to the most recent i_node map
blocks
But cache i_node maps in main memory
Small enough that few disk accesses
required to find i_node maps
Finding I_node Maps
An old i_node map
New i_node maps
Reclaiming Space in the Log
Eventually, the log reaches the end of the
disk partition
So LFS must reuse disk space like
superseded data blocks
Space can be reclaimed in background or
when needed
Goal is to maintain large free extents on
disk
Example of Need for Reuse
Head of log
New data to be logged
Major Alternatives for Reusing Log
Threading
+ Fast
- Fragmentation
- Slower reads
Head of log
New data to be logged
Major Alternatives for Reusing Log
Copying
+Simple
+Avoids fragmentation
-Expensive
New data to be logged
LFS Space Reclamation Strategy
Combination of copying and threading
Copy to free large fixed-size segments
Thread free segments together
Try to collect long-lived data permanently
into segments
A Threaded, Segmented Log
Head of log
Cleaning a Segment
1. Read several segments into memory
2. Identify the live blocks
3. Write live data back (hopefully) into a
smaller number of segments
Identifying Live Blocks
Hard to track down live blocks of all files
Instead, each segment maintains a
segment summary block
Identifying what is in each block
Crosscheck blocks with owning i_node’s
block pointers
Written at end of log write, for low
overhead
Segment Cleaning Policies
What are some important questions?
When do you clean segments?
How many segments to clean?
Which segments to clean?
How to group blocks in their new segments?
When to Clean
Periodically
Continuously
During off-hours
When disk is nearly full
On-demand
LFS uses a threshold system
How Many Segments to Clean
The more cleaned at once, the better the
reorganization of the disk
But the higher the cost of cleaning
LFS cleans a few tens at a time
Till disk drops below threshold value
Empirically, LFS not very sensitive to this
factor
Which Segments to Clean?
Cleaning segments with lots of dead data
gives great benefit
Some segments are hot, some segments
are cold
But “cold” free space is more valuable
than “hot” free space
Since cold blocks tend to stay cold
Cost-Benefit Analysis
u = utilization
A = age
Benefit to cost = u*A/(u + 1)
Clean cold segments with some space,
hot segments with a lot of space
What to Put Where?
Given a set of live blocks and some
cleaned segments, which goes where?
Order blocks by age
Write them to segments oldest first
Goal is very cold, highly utilized segments
Goal of LFS Cleaning
number of segments
empty
number of segments
100% full
empty
100% full
Performance of LFS
On modified Andrew benchmark, 20%
faster than FFS
LFS can create and delete 8 times as
many files per second as FFS
LFS can read 1 ½ times as many small
files
LFS slower than FFS at sequential reads
of randomly written files
Logical Locality vs. Temporal Locality
Logical locality (spatial locality): Normal
file systems keep a file’s data blocks close
together
Temporal locality: LFS keeps data written
at the same time close together
When temporal locality = logical locality
Systems perform the same
Major Innovations of LFS
Abstraction: everything is a log
Temporal locality
Use of caching to shape disk access
patterns
Cache most reads
Optimized writes
Separating full and empty segments
Where Did LFS Look For
Performance Improvements?
Minimized disk access
Only write when segments filled up
Increased size of data transfers
Write whole segments at a time
Improving locality
Assuming temporal locality, a file’s blocks are
all adjacent on disk
And temporally related files are nearby
Parallel Disk Access and RAID
One disk can only deliver data at its
maximum rate
So to get more data faster, get it from
multiple disks simultaneously
Saving on rotational latency and seek time
Utilizing Disk Access Parallelism
Some parallelism available just from
having several disks
But not much
Instead of satisfying each access from one
disk, use multiple disks for each access
Store part of each data block on several
disks
Disk Parallelism Example
open(foo)
read(bar)
write(zoo)
File
System
Data Striping
Transparently distributing data over disks
Benefits –
Increases disk parallelism
Faster response for big requests
Major parameters
Number of disks
Size of data interleaf
Fine vs. Coarse grained Data
Interleaving
 Fine grained data interleaving
+ High data rate for all requests
But only one request per disk array
Lots of time spent positioning
 Coarse grained data interleaving
+ Large requests access many disks
+ Many small requests handled at once
Small I/O requests access few disks
Reliability of Disk Arrays
Without disk arrays, failure of one disk
among N loses 1/Nth of the data
With disk arrays (fine grained across all N
disks), failure of one disk loses all data
N disks 1/Nth as reliable as one disk
Adding Reliability to Disk Arrays
Buy more reliable disks
Build redundancy into the disk array
Multiple levels of disk array redundancy
possible
Most organizations can prevent any data loss
from single disk failure
Basic Reliability Mechanisms
Duplicate data
Parity for error detection
Error Correcting Code for detection and
correction
Parity Methods
Can use parity to detect multiple errors
But typically used to detect single error
If hardware errors are self-identifying,
parity can also correct errors
When data is written, parity must be
written, too
Error-Correcting Code
Based on Hamming codes, mostly
Not only detect error, but identify which bit
is wrong
RAID Architectures
Redundant Arrays of Independent Disks
Basic architectures for organizing disks
into arrays
Assuming independent control of each
disk
Standard classification scheme divides
architectures into levels
Non-Redundant Disk Arrays
(RAID Level 0)
No redundancy at all
So, what we just talked about
Any failure causes data loss
Non-Redundant Disk Array Diagram
(RAID Level 0)
open(foo)
read(bar)
write(zoo)
File
System
Mirrored Disks (RAID Level 1)
Each disk has second disk that mirrors its
contents
Writes go to both disks
No data striping
+ Reliability is doubled
+ Read access faster
- Write access slower
- Expensive and inefficient
Mirrored Disk Diagram (RAID Level 1)
open(foo)
read(bar)
write(zoo)
File
System
Memory-Style ECC (RAID Level 2)
Some disks in array are used to hold ECC
E.g., 4 data disks require 3 ECC disks
+ More efficient than mirroring
+ Can correct, not just detect, errors
- Still fairly inefficient
Memory-Style ECC Diagram
(RAID Level 2)
open(foo)
read(bar)
write(zoo)
File
System
Bit-Interleaved Parity (RAID Level 3)
Each disk stores one bit of each data
block
One disk in array stores parity for other
disks
+ More efficient that Levels 1 and 2
- Parity disk doesn’t add bandwidth
Bit-Interleaved RAID Diagram (Level 3)
open(foo)
read(bar)
write(zoo)
File
System
Block-Interleaved Parity (RAID Level 4)
Like bit-interleaved, but data is interleaved
in blocks of arbitrary size
Size is called striping unit
Small read requests use 1 disk
+ More efficient data access than level 3
+ Satisfies many small requests at once
- Parity disk can be a bottleneck
- Small writes require 4 I/Os
Block-Interleaved Parity Diagram
(RAID Level 4)
open(foo)
read(bar)
write(zoo)
File
System
Block-Interleaved Distributed-Parity
(RAID Level 5)
Spread the parity out over all disks
+ No parity disk bottleneck
+ All disks contribute read bandwidth
– Requires 4 I/Os for small writes
Block-Interleaved Distributed-Parity
Diagram (RAID Level 5)
open(foo)
read(bar)
write(zoo)
File
System
Other RAID Configurations
RAID 6
Can survive two disk failures
RAID 10 (RAID 1+0)
Data striped across mirrored pairs
RAID 01 (RAID 0+1)
Mirroring two RAID 0 arrays
RAID 15, RAID 51
Where Did RAID Look For
Performance Improvements?
Parallel use of disks
Improve overall delivered bandwidth by getting
data from multiple disks
Biggest problem is small write
performance
But we know how to deal with small writes
...
Bonus (1 critique)
File system content analysis
Need root access to computers running
workloads that generate lots of files (e.g.,
simulation, web development)
Used by government
Used by industry
Used by academic
Run content analysis based on (1) a static
snapshot (e.g., recursive ls) (2) a one-week
dynamic trace (e.g., strace file system system
calls)
Bonus (1 critique)
Contact Bobby Roy for the trace scripts
Report
Description of the workload environment (e.g.,
computer from a company, running simulations)
Number of files and bytes that can be (1)
reinstalled (e.g., software packages), (2)
redownloaded (e.g., multimedia files), (3)
regenerated (e.g., .o files)
Number of files/bytes that contain usergenerated content
Download