Conquest: Preparing for
Life After Disks
CS239 Seminar
October 24, 2002
An-I Andy Wang
University of California, Los Angeles
Conquest Overview

File systems are optimized for disks
  Performance problem
  Complexity
Now we have tons of inexpensive RAM
What can we do with that RAM?
Conquest Approach

Combine disk and persistent RAM (e.g., battery-backed RAM) in a novel way
Simplification
  More than 20% fewer semicolons than ext2, reiserfs, and SGI XFS
Performance (under popular benchmarks)
  24% to 1900% faster than LRU disk caching
Outline of the Talk





Motivation
Conquest design (high level)
Conquest components
Performance evaluation
Conclusion
Motivation – Conquest Design – Conquest Components – Performance Evaluation – Conclusion
Motivation

Most file systems are built for disks

Problems with the disk assumption:


Performance
Complexity
Hardware Evolution
[Figure: accesses per second (log scale, 1 KHz to 1 GHz) versus year (1990 to 2000). CPU and memory improve at 50% per year; disk improves at 15% per year. Gap annotations: 10^5 and 10^6; 1 sec : 6 days at 1990 and 1 sec : 3 months at 2000.]
Inside Pandora’s Box


Disk arm
Disk platters
Access time = seek time (disk arm)
+ rotational delay (disk platter)
+ transfer time
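As a worked example of the formula above, the following sketch adds the three terms for a single request. The drive parameters (8 ms seek, 7,200 RPM, 40 MB/s) are illustrative assumptions rather than figures from the talk, `access_time_ms` is a hypothetical helper, and the rotational delay is taken as half a revolution on average.

```python
def access_time_ms(seek_ms, rpm, request_bytes, bandwidth_mb_per_s):
    """Return seek time + average rotational delay + transfer time, in ms."""
    # On average the platter turns half a revolution before the data arrives.
    rotational_delay_ms = 0.5 * 60_000.0 / rpm
    # Transfer time for the request at the sustained media bandwidth.
    transfer_ms = request_bytes / (bandwidth_mb_per_s * 1_000_000.0) * 1000.0
    return seek_ms + rotational_delay_ms + transfer_ms


# Hypothetical drive: 8 ms seek, 7,200 RPM, 40 MB/s, one 4 KB request.
total = access_time_ms(8.0, 7200, 4096, 40.0)
print(f"{total:.2f} ms")  # prints "12.27 ms"
```

With these assumed numbers the mechanical terms (seek and rotation) account for almost all of the access time; the transfer itself takes about 0.1 ms.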
Disk Optimization Methods

Disk arm scheduling
Group information on disk
Disk readahead
Buffered writes
Disk caching
Data mirroring
Hardware parallelism
Complexity Bytes
predictive readahead
synchronization
cache replacement
elevator algorithm
data consistency
asynchronous write
data clustering
Storage Media Alternatives
[Figure: $/MB (log scale) versus accesses/sec (log scale). Labeled media: tape, disk, battery-backed DRAM, (write once) flash memory, persistent RAM, magnetic RAM?]
[Caceres et al., 1993; Hillyer et al., 1996; Qualstar 1998; Tanisys 1999; Micron Semiconductor Products 2000; Quantum 2000]
Price Trend of Persistent RAM
[Figure: price in $/MB (log scale, 10^-2 to 10^2) versus year (1995 to 2005) for persistent RAM, 1" HDD, 2.5" HDD, 3.5" HDD, and paper/film. Annotations: booming of digital photography; 4 to 10 GB of persistent RAM.]
[Grochowski 2000]
Old Order; New World

Disk will stay around
  Cost, capacity, power, heat
RAM as a viable storage alternative
  PDAs, digital cameras, MP3 players
More architectural changes due to RAM
  A big assumption change from disk
  Rethink data structures, interfaces, applications
Getting a Fresh Start
What does it take to design and build a system
that assumes ample persistent RAM as the
primary storage medium?
Conquest Design

Design and build a disk/persistent-RAM hybrid file system
Deliver all file system services from memory, with the exception of high-capacity storage
Two separate data paths to memory and disk
Benefits:
  Simplicity
  Performance
Simplicity

Remove disk-related complexities for most files
Make things simpler for disk as well
Less complexity
  Fewer bugs
  Easier maintenance
  Shorter data paths
Performance

Overall
  All management performed in memory
Memory data path
  No disk-related overhead
Disk data path
  Faster speed due to simpler access models
Conquest Components






Media management
Metadata representation
Directory service
Allocation service
Persistence support
Resiliency support
User Access Patterns

Small files
  Take little space (10%)
  Represent most accesses (90%)
Large files
  Take most space
  Mostly sequential accesses
  Not characteristic of database applications
[Irlam 1993; Douceur et al., 1999; Roselli et al., 2000]
Files Stored in Persistent RAM

Small files (< 1 MB)
  No seek time or rotational delays
  Fast byte-level accesses
  Contiguous allocation
Metadata
  Fast synchronous update
  No dual representations
Executables and shared libraries
  In-place execution
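The placement rule on this slide can be sketched as a small decision function. The name `storage_medium` and its flags are hypothetical, and treating a file of exactly 1 MB as in-core is an assumption: the slide says "< 1 MB", while the microbenchmark slides describe 1-MB files as Conquest in-core files and 1.01-MB files as on-disk files.

```python
ONE_MB = 1 << 20  # size threshold from the slide; inclusive here by assumption


def storage_medium(size_bytes, is_metadata=False, is_executable=False):
    """Return where a file would live under the slide's placement rule."""
    # Metadata, executables, and shared libraries stay in persistent RAM.
    if is_metadata or is_executable:
        return "persistent RAM"
    # Small files stay in persistent RAM; only large files go to disk.
    return "persistent RAM" if size_bytes <= ONE_MB else "disk"


print(storage_medium(4096))             # a small file
print(storage_medium(100 * ONE_MB))     # a large file
```

The point of such a static rule is that the placement decision needs no cache replacement policy: a file's medium follows from its type and size.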
Memory Data Path of Conquest
[Figure: two data paths side by side. Conventional file systems: storage requests -> IO buffer management (IO buffer) -> persistence support -> disk management -> disk. Conquest memory data path: storage requests -> persistence support -> battery-backed RAM (small file and metadata storage).]
Large-File-Only Disk Storage

Allocate in big chunks
Lower access overhead
Reduced management overhead
No fragmentation management
No tricks for small files
Storing data in metadata
No elaborate data structures
Wrapping a balanced tree onto disk cylinders
[Devlinux.com 2000]
Sequential-Access Large Files

Sequential disk accesses
Near-raw bandwidth
Well-defined readahead semantics
Read-mostly
Little synchronization overhead (between memory and disk)
Disk Data Path of Conquest
[Figure: two data paths side by side. Conventional file systems: storage requests -> IO buffer management (IO buffer) -> persistence support -> disk management -> disk. Conquest disk data path: storage requests -> IO buffer management (IO buffer) -> disk management -> disk (large-file-only file system), alongside battery-backed RAM for small file and metadata storage.]
Random-Access Large Files

Random access?




Common definition: nonsequential access
A typical movie has 150 scene changes
MP3 stores the title at the end of the files
Near sequential access?

Simplifies large-file metadata representation significantly
Logical File Representation
[Figure: a file comprises name(s), an i-node (file attributes), and data.]
Physical File Representation
[Figure: a file comprises name(s), an i-node (file attributes, data locations), and data blocks.]
Ext2 Data Representation
[Figure: an i-node holding 10 data block locations followed by three index block locations.]
Disadvantages of the Ext2 Design
Designed for disk storage
Optimization for small files makes things complex
Random-access data structure for large files that are accessed mostly sequentially
Data access time dependent on the byte position in a file
Maximum file size is limited
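The position dependence and the size limit can be illustrated with a short sketch. It assumes the layout drawn on the Ext2 Data Representation slide (10 direct data block locations, then index blocks treated here as single, double, and triple indirect). The 4 KB block size, the 1,024 pointers per index block, and the helper name `indirection_levels` are illustrative assumptions.

```python
def indirection_levels(offset, block_size=4096, direct=10, ptrs_per_block=1024):
    """Return how many index blocks must be followed to reach a byte offset."""
    block = offset // block_size
    if block < direct:
        return 0  # reached straight from the i-node
    block -= direct
    span = ptrs_per_block  # blocks reachable through a single index block
    for level in (1, 2, 3):
        if block < span:
            return level
        block -= span
        span *= ptrs_per_block  # each extra level multiplies the reach
    raise ValueError("offset exceeds the maximum representable file size")


print(indirection_levels(0))            # prints 0
print(indirection_levels(10 * 4096))    # prints 1
```

Under these assumptions a byte near the end of a large file costs up to three extra lookups, and any offset past the triple-indirect range cannot be represented at all.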
Conquest Representation

Persistent RAM
  Hash(file name) = location of data
  Offset(location of data)
Disk storage
  Per-file, doubly linked list of disk block segments (stored in persistent RAM)
Advantages of the Conquest Design
Direct data access for in-core files
Worst case: sequential memory search for random disk locations
Maximum file size limited by physical storage
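The per-file segment list from the Conquest Representation slide, and the worst case noted above, might look roughly like the following sketch. The class names, the block-granular addressing, and the example block numbers are assumptions for illustration; the slides state only that each large file has a doubly linked list of disk block segments kept in persistent RAM.

```python
class Segment:
    """One contiguous run of disk blocks belonging to a file."""

    def __init__(self, start_block, num_blocks):
        self.start_block = start_block
        self.num_blocks = num_blocks
        self.prev = None
        self.next = None


class LargeFileMap:
    """Doubly linked list of segments; the list itself lives in memory."""

    def __init__(self):
        self.head = None
        self.tail = None

    def append(self, start_block, num_blocks):
        seg = Segment(start_block, num_blocks)
        if self.tail is None:
            self.head = self.tail = seg
        else:
            seg.prev = self.tail
            self.tail.next = seg
            self.tail = seg
        return seg

    def locate(self, file_block):
        """Map a block index within the file to a disk block number."""
        remaining = file_block
        seg = self.head
        while seg is not None:  # worst case: walk every segment in memory
            if remaining < seg.num_blocks:
                return seg.start_block + remaining
            remaining -= seg.num_blocks
            seg = seg.next
        raise IndexError("block index beyond end of file")


fmap = LargeFileMap()
fmap.append(1000, 256)
fmap.append(5000, 256)
print(fmap.locate(300))  # prints 5044
```

The search is linear in the number of segments, but it runs entirely in memory, and there is no fixed cap on the number of segments, so file size is bounded only by the available storage.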
Directory Service

Requirements



Fast sequential traversal (e.g., ls)
Fast random lookup (e.g., locate file x)
Hard links (apply multiple names to data)
First Design

A doubly hashed table for each directory


Conserves space
Problems:



Dynamic resizing of directories
Need to handle the current file position
Important for rm -fr
Second Design

A variant of extensible hash table for each directory
An old data structure fits nicely
[Figure: example hash table for a directory. Entries: 0100 | dir1, 1001 | file1, 1001 | file2, 1110 | file2_hardlink; other slots are empty. Additional labels: 0011 file_1, 0100 file_2.]
[Fagin et al., 1979]
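For readers unfamiliar with the data structure, here is a minimal sketch of textbook extendible hashing in the style of [Fagin et al., 1979], used as a per-directory name table. It is not the Conquest variant: the class names, the CRC-32 hash, the bucket capacity of 2, and the metadata IDs are all assumptions, and it does not handle the case where more names than fit in a bucket share an identical hash value.

```python
import zlib


class Bucket:
    def __init__(self, depth):
        self.depth = depth   # number of low hash bits shared by its entries
        self.entries = {}    # file name -> metadata ID


class HashedDirectory:
    def __init__(self, bucket_capacity=2):
        self.capacity = bucket_capacity
        self.global_depth = 0
        self.table = [Bucket(0)]  # 2**global_depth slots

    @staticmethod
    def _hash(name):
        return zlib.crc32(name.encode("utf-8"))

    def _bucket(self, name):
        mask = (1 << self.global_depth) - 1
        return self.table[self._hash(name) & mask]

    def lookup(self, name):
        """Random lookup: one hash, one table slot, one bucket probe."""
        return self._bucket(name).entries.get(name)

    def insert(self, name, metadata_id):
        while True:
            bucket = self._bucket(name)
            if name in bucket.entries or len(bucket.entries) < self.capacity:
                bucket.entries[name] = metadata_id
                return
            self._split(bucket)  # grow incrementally, then retry

    def _split(self, bucket):
        if bucket.depth == self.global_depth:
            self.table = self.table + self.table  # double the slot table
            self.global_depth += 1
        new_bit = 1 << bucket.depth
        bucket.depth += 1
        sibling = Bucket(bucket.depth)
        # Move entries whose newly significant hash bit is set.
        for name in list(bucket.entries):
            if self._hash(name) & new_bit:
                sibling.entries[name] = bucket.entries.pop(name)
        # Repoint the slots that now belong to the sibling.
        for slot, b in enumerate(self.table):
            if b is bucket and slot & new_bit:
                self.table[slot] = sibling

    def names(self):
        """Sequential traversal: visit each distinct bucket once."""
        seen, result = set(), []
        for bucket in self.table:
            if id(bucket) not in seen:
                seen.add(id(bucket))
                result.extend(bucket.entries)
        return result


d = HashedDirectory()
d.insert("file2", 0x2000)
d.insert("file2_hardlink", d.lookup("file2"))  # hard link: same metadata ID
```

A hard link is just a second name mapped to the same metadata ID, and the table grows one bucket at a time instead of being rebuilt, which speaks to the resizing problem of the first design.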
Additional Engineering Details

Popular hash functions randomize lower bits
Dynamic file positioning
Need to handle collisions
Memory overhead and complexity tradeoffs
Metadata Allocation

Requirements
  Keep track of usage status of metadata entries
  Avoid duplicate allocation with unique IDs
  Fast retrieval of metadata with a given ID
ID: 1 | free
ID: 2 | in use
ID: 3 | free
ID: 4 | free
ID: 5 | in use
ID: 6 | free
Existing Memory Allocation

Services
  Keep track of unallocated memory
  No duplicate allocation of physical addresses
Hmm…
ADDR 0xe000000 | free
ADDR 0xe000038 | in use
ADDR 0xe000070 | free
ADDR 0xe0000A8 | free
ADDR 0xe0000E0 | free
ADDR 0xe000118 | in use
Conquest Metadata Management

Metadata = memory allocated by memory manager
Metadata ID = physical address of metadata
[Figure: the metadata ID table (ID: 1 to ID: 6) overlaid on the memory allocator's address table (ADDR 0xe000000 to ADDR 0xe000118). The allocator's addresses provide unique IDs and fast retrieval; its bookkeeping provides usage status.]
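The idea that a metadata ID can simply be the address handed out by the memory allocator can be sketched as follows. This is a toy model, not Conquest code: the class name is hypothetical, a Python dict stands in for memory at a given address, and the base address and 0x38-byte spacing are borrowed from the example addresses on the slides.

```python
class MetadataAllocator:
    """Toy fixed-size allocator whose returned address doubles as the ID."""

    def __init__(self, base=0xE000000, entry_size=0x38, count=6):
        # Free list of entry addresses, lowest address handed out first.
        self.free = [base + i * entry_size for i in reversed(range(count))]
        self.entries = {}  # address -> metadata; stands in for real memory

    def alloc(self, metadata):
        addr = self.free.pop()       # an address is never handed out twice
        self.entries[addr] = metadata
        return addr                  # the address is the metadata ID

    def get(self, metadata_id):
        return self.entries[metadata_id]  # retrieval is a direct access

    def release(self, metadata_id):
        del self.entries[metadata_id]
        self.free.append(metadata_id)


alloc = MetadataAllocator()
first = alloc.alloc({"name": "file1"})
print(hex(first))  # prints 0xe000000
```

All three requirements from the Metadata Allocation slide fall out of the allocator's existing bookkeeping: usage status is the free list, uniqueness follows from addresses being distinct, and lookup by ID needs no translation table.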
Persistence Support

Restore file system states after a reboot



Data
Metadata
Memory manager

Keep track of metadata allocation
Linux Memory Manager (1)

Page allocator maintains individual pages
Page allocator
Linux Memory Manager (2)

Zone allocator allocates memory in power-of-two sizes
Zone allocator
Page allocator
Linux Memory Manager (3)

Slab allocator groups allocations by sizes to
reduce internal memory fragmentation
Slab allocator
Zone allocator
Page allocator
Linux Memory Manager (4)

Difficult to restore the persistent states


Three layers of pointer-rich mappings
Mixing of persistent and temporary allocations
Slab allocator
Zone allocator
Page allocator
Conquest Persistence

Create memory zones with own instantiations
of memory managers
Slab allocator
Zone allocator
Page allocator
Conquest Persistence

Encapsulate all pointers within each zone
Pointers can survive reboots
No serialization and deserialization
Swapping and paging
  Disabled for Conquest memory zones
  Enabled for non-Conquest zones
Resiliency Support

Instantaneous metadata commit
No fsck (ad hoc metadata consistency check)
Built-in checkpointing
Pointer-switch commit semantics
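The slides do not spell out the commit mechanism, so the following is only a generic sketch of what pointer-switch commit semantics usually means: an update is prepared on a private copy, and a single reference assignment makes it visible. The class name and the dictionary-based metadata are assumptions for illustration.

```python
import copy


class CheckpointedStore:
    """Holds one committed version; updates are built aside, then swapped in."""

    def __init__(self, state):
        self._current = state

    def read(self):
        return self._current

    def update(self, mutate):
        draft = copy.deepcopy(self._current)  # work on a private copy
        mutate(draft)                         # may fail; committed state untouched
        self._current = draft                 # the pointer switch is the commit


store = CheckpointedStore({"size": 0})
store.update(lambda meta: meta.update(size=42))
print(store.read()["size"])  # prints 42
```

Because the committed version is never modified in place, an update that fails partway leaves the previous version intact, which is why no after-the-fact consistency scan is needed in this model.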
Implementation Status





Kernel module under Linux 2.4.2
Fully functional and POSIX compliant
Modified memory manager to support Conquest persistence
Need to overcome BIOS limitations for distribution
Looking for licensing opportunities
Performance Evaluation

Architectural simplification
  Feature count
Performance improvement
  Memory-only workload
  Memory and disk workload
Conventional Data Path
[Figure: conventional file system data path: storage requests -> IO buffer management (IO buffer) -> persistence support -> disk management -> disk.]
Buffer allocation management
Buffer garbage collection
Data caching
Metadata caching
Predictive readahead
Write behind
Cache replacement
Metadata allocation
Metadata placement
Metadata translation
Disk layout
Fragmentation management
Memory Path of Conquest
[Figure: Conquest memory data path: storage requests -> persistence support -> battery-backed RAM (small file and metadata storage).]
Memory manager encapsulation
Buffer allocation management
Buffer garbage collection
Data caching
Metadata caching
Predictive readahead
Write behind
Cache replacement
Metadata allocation
Metadata placement
Metadata translation
Disk layout
Fragmentation management
Disk Path of Conquest
[Figure: Conquest disk data path: storage requests -> IO buffer management (IO buffer) -> disk management -> disk (large-file-only file system), alongside battery-backed RAM for small file and metadata storage.]
Buffer allocation management
Buffer garbage collection
Data caching
Metadata caching
Predictive readahead
Write behind
Cache replacement
Metadata allocation
Metadata placement
Metadata translation
Disk layout
Fragmentation management
PostMark Benchmark (1)



ISP workload (emails, web-based transactions)
Conquest is comparable to ramfs
At least 24% faster than the LRU disk cache
[Figure: PostMark transactions per second (0 to 9000) versus number of files (5000 to 30000) for SGI XFS, reiserfs, ext2fs, ramfs, and Conquest; 40 to 250 MB working set with 2 GB physical RAM.]
[Katcher 1997; Sweeney et al., 1996; Card et al., 1999; Namesys 2002]
PostMark Benchmark (2)

When both memory and disk components are exercised, Conquest can be several times faster than ext2fs, reiserfs, and SGI XFS
[Figure: PostMark transactions per second (0 to 5000) versus percentage of large files (0.0 to 10.0) for SGI XFS, reiserfs, ext2fs, and Conquest; 10,000 files, 80 MB to 3.5 GB working set with 2 GB physical RAM. The plot marks where the working set is <= RAM and where it is > RAM.]
PostMark Benchmark (3)

When working set > RAM, Conquest is 1.4 to 2 times faster than ext2fs, reiserfs, and SGI XFS
[Figure: PostMark transactions per second (0 to 120) versus percentage of large files (6.0 to 10.0) for SGI XFS, reiserfs, ext2fs, and Conquest; 10,000 files, 80 MB to 3.5 GB working set with 2 GB physical RAM.]
Sprite LFS Microbenchmarks (1)

Small-file benchmark
  Operates on 10,000 1-KB files in three phases
[Figure: operations per second (0 to 180000) for the create, read, and delete phases; SGI XFS, reiserfs, ext2fs, ramfs, and Conquest.]
Sprite LFS Microbenchmarks (2)

Modified large-file microbenchmark: 10 1-MB files (Conquest in-core files)
[Figure: bandwidth in MB/sec (0 to 700) for seq write, seq read, rand write, rand read, and a second seq read; SGI XFS, reiserfs, ext2fs, ramfs, and Conquest.]
Sprite LFS Microbenchmarks (3)

Modified large-file microbenchmark: 10 1.01-MB files (Conquest on-disk files)
[Figure: bandwidth in MB/sec (0 to 700) for seq write, seq read, rand write, rand read, and a second seq read; SGI XFS, reiserfs, ext2fs, ramfs, and Conquest.]
Sprite LFS Microbenchmarks (4)

Large-file microbenchmark: 40 100-MB files (Conquest on-disk files)
[Figure: bandwidth in MB/sec (0 to 30) for seq write, seq read, rand write, rand read, and a second seq read; SGI XFS, reiserfs, ext2fs, and Conquest.]
History’s Mystery
Puzzling Microbenchmark Numbers…
Geoffrey Kuenning:
“If Conquest is slower than ext2,
I will toss you off of the balcony…”
With me hanging off a balcony…

Original large-file microbenchmark: 1-MB file (Conquest in-core file)
[Figure: bandwidth in MB/sec (0 to 700) for seq write, seq read, rand write, rand read, and a second seq read; SGI XFS, reiserfs, ext2fs, ramfs, and Conquest.]
Odd Microbenchmark Numbers

Why are random reads slower than sequential reads?
[Figure: large-file microbenchmark bandwidth in MB/sec (0 to 700) for seq write, seq read, rand write, rand read, and a second seq read; SGI XFS, reiserfs, ext2fs, ramfs, and Conquest.]
Odd Microbenchmark Numbers

Why are RAM-based file systems slower than disk-based file systems?
[Figure: large-file microbenchmark bandwidth in MB/sec (0 to 700) for seq write, seq read, rand write, rand read, and a second seq read; SGI XFS, reiserfs, ext2fs, ramfs, and Conquest.]
A Series of Hypotheses

Warm-up effect?
  Maybe
  Why do RAM-based systems warm up slower?
Bad initial states?
  No
Pentium III streaming IO option?
  No
Effects of Cache Footprint Sizes
[Figure: two panels, large cache footprint and small cache footprint. Each panel shows the cache footprint relative to the file while writing a file sequentially and then reading the same file sequentially, with the labels read, flush, and file end.]
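One way to see why the footprint matters is a toy LRU model of the write-then-read pattern named on this slide. It is a sketch under simplifying assumptions: the function name is hypothetical, the cache is fully associative with strict LRU replacement, and sizes are counted in abstract blocks, none of which is claimed to match the actual L2 cache in the experiments.

```python
from collections import OrderedDict


def read_hits_after_write(file_blocks, cache_blocks):
    """Write a file sequentially, read it back sequentially, count read hits."""
    cache = OrderedDict()  # keys ordered from least to most recently used

    def touch(block):
        hit = block in cache
        if hit:
            cache.move_to_end(block)
        else:
            cache[block] = True
            if len(cache) > cache_blocks:
                cache.popitem(last=False)  # evict the least recently used
        return hit

    for block in range(file_blocks):       # sequential write
        touch(block)
    return sum(touch(block) for block in range(file_blocks))  # sequential read


print(read_hits_after_write(8, 16))    # prints 8
print(read_hits_after_write(32, 16))   # prints 0
```

When the footprint fits, every read hits. When it exceeds the cache, the write leaves only the end of the file cached, and reading from the start evicts those blocks before they are reached, so nothing is reused.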
Sprite LFS Microbenchmarks
Modified large-file microbenchmark: 10 1-MB files (Conquest in-core files)
Random accesses are faster than sequential accesses due to cache reuse
[Figure: bandwidth in MB/sec (0 to 700) for seq write, seq read, rand write, rand read, and a second seq read; SGI XFS, reiserfs, ext2fs, ramfs, and Conquest.]
Lessons Learned

Faster than LRU caching, unexpected
  Heavyweight disk handling
  Severe penalty for accessing memory content
Matching user access patterns to storage media offers considerable simplification and better performance
  Not an automatic result
  Need careful design
More Lessons Learned

Effects of L2 caching become highly visible in memory workloads (modern workloads)
Cannot blindly apply existing disk-based microbenchmarks to measure memory performance of file systems
Need to consider states of L2 cache and memory behaviors at each stage of microbenchmarking
Additional Lessons Learned

Don’t discuss your performance numbers
next to a balcony…unless…
Related Work (1)

Disk caching
  Assumption of scarce memory
  Complex mechanisms to maintain consistency
    Especially with the presence of metadata
RAM drives and RAM file systems
  Not meant to be persistent
  Use disk-related mechanisms
  Limitations on storage capacity
[McKusick et al., 1990; Ganger et al., 2000; Roselli et al., 2000; Seltzer et al., 2000]
Related Work (2)

Disk emulators
  RAM storage accessed through SCSI interface
Ad hoc approaches
  Manual transferring of files to and from ramfs
    Capacity limitation
  Background daemon to stage RAM files to a disk
    Semantic and name space problems
[Riedel 1998; ZDNet 1999]
Going Beyond Conquest (1)

Matching usage patterns with heterogeneous machines in the distributed domain
  Specialized tasks for machines within a cluster
  Preferably self-organizing and self-evolving
State-rich computing
  Caching of runtime data structures
  Similar to /tmp
Going Beyond Conquest (2)

Separate storage of metadata from data
  Association of metadata with data of different fidelity
  Opportunity for hierarchical replication across devices with different calibers
Benchmarking memory performance of file systems
  Developing new memory benchmarks
Contributions

Demonstrated the feasibility of disk-memory hybrid file systems
Showed performance does not preclude simplicity
Pinpointed cache-related problems with modern benchmarks
Opened doors to many exciting areas of research
Conclusion

Conquest demonstrates how rethinking changes in underlying assumptions can lead to significant architectural and performance improvements
Radical changes in hardware, applications, and user expectations in the past decade should lead us to rethink other aspects of OS as well.
Questions . . .
Conquest: http://lasr.cs.ucla.edu/conquest
Andy Wang: awang@cs.ucla.edu