AutoRAID

THE HP AUTORAID
HIERARCHICAL STORAGE
SYSTEM
J. Wilkes, R. Golding, C. Staelin, T. Sullivan
HP Laboratories, Palo Alto, CA
INTRODUCTION

Must protect data against disk failures: they
happen too often and are too hard to repair

possible solutions:

for small numbers of disks: mirroring

for larger numbers of disks: RAID
RAID

Typical RAID Organizations

Level 3: bit- or byte-level interleaving with a
dedicated parity disk

Level 5: block-level interleaving with parity
blocks distributed across all disks
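
The parity mechanism behind these organizations can be illustrated with a short Python sketch (an illustration, not taken from the paper): every stripe stores the XOR of its data blocks as parity, and in level 5 the parity block's position rotates from stripe to stripe so that parity traffic is spread over all disks.

def xor_blocks(blocks):
    # XOR equal-length byte blocks together to form the parity block
    out = bytearray(len(blocks[0]))
    for b in blocks:
        for i, byte in enumerate(b):
            out[i] ^= byte
    return bytes(out)

def raid5_stripe(stripe_no, data_blocks, n_disks):
    # Place n_disks - 1 data blocks plus one parity block on n_disks disks;
    # the parity position rotates with the stripe number (distributed parity).
    assert len(data_blocks) == n_disks - 1
    parity_disk = stripe_no % n_disks
    parity = xor_blocks(data_blocks)
    it = iter(data_blocks)
    return [parity if d == parity_disk else next(it) for d in range(n_disks)]

Because a lost block is the XOR of the surviving blocks in its stripe, any single-disk failure can be tolerated.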
LIMITATIONS OF RAID (I)

Each RAID level performs well for a narrow
range of workloads

Too many parameters to configure: data- and
parity-layout, stripe depth, stripe width, cache
sizes, write-back policies, ...
LIMITATIONS OF RAID (II)

Changing from one layout to another or
adding capacity requires unloading and
reloading the data

Spare disks remain unused until a failure
occurs
A BETTER SOLUTION


A managed storage hierarchy:

mirror active data

store less active data in RAID 5
This requires locality of reference:

active subset must be rather stable:
found to be true in several studies
IMPLEMENTATION LEVEL

Storage hierarchy could be implemented

Manually: can use the most knowledge but
cannot adapt quickly

In the file system: offers the best balance of
knowledge and implementation freedom,
but is specific to a particular file system

Through a smart array controller: easiest
to deploy (HP AutoRAID)
MAJOR FEATURES (I)

Mapping of host block addresses to physical
disk locations

Mirroring of write-active data

Adaptation to changes in the amount of data
stored:


Starts using RAID 5 when the array becomes full
Adaptation to workload changes: newly
active data are promoted to mirrored storage

Hot-pluggable disks, fans, power supplies
and controllers
MAJOR FEATURES (II)

On-line storage capacity expansion: the
system then switches to mirroring

Can mix and match disk capacities

Controlled fail-over: can have
dual controllers (primary/standby)

Active hot spares: used for more mirroring

Simple administration and setup: appears
to host as one or more logical units

Log-structured RAID 5 writes
RELATED WORK (I)

Storage Technology Corporation Iceberg:

also uses redirection but based on RAID 6

handles variable size records

emphasis on very high reliability
RELATED WORK (II)

Floating parity scheme from IBM Almaden:
Relocates parity blocks and uses
distributed sparing

Work at U.C. Berkeley on log-structured
file systems and cleaning policies
RELATED WORK (III)

Whole literature on hierarchical storage
systems

Schemes compressing inactive data

Use of non-volatile memory (NVRAM) for
optimizing writes

Allows reliable delayed writes
OVERVIEW
[Block diagram of the AutoRAID controller: the host computer connects over a
20 MB/s bus to a controller containing a processor with RAM and control logic,
parity logic, a DRAM read cache, an NVRAM write cache, and SCSI controllers
reaching the disks over 2x10 MB/s buses]
PHYSICAL DATA LAYOUT




Data space on disks is broken up into large
Physical EXTents (PEXes):

Typical size is 1 MB
PEXes can be combined to form Physical
Extent Groups (PEGs) containing at least
three PEXes on three different disks
PEGs can be assigned to the mirrored
storage class or to the RAID 5 storage class
Segments are the units of contiguous
space on a disk (128 KB in the prototype)
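
A minimal sketch of these layout objects, with assumed field names (the controller's real data structures are firmware, not Python):

from dataclasses import dataclass, field

PEX_SIZE     = 1 * 1024 * 1024   # 1 MB Physical EXTent
SEGMENT_SIZE = 128 * 1024        # 128 KB contiguous segment (prototype value)

@dataclass
class PEX:              # a 1 MB extent on a single disk
    disk: int
    offset: int         # byte offset of the extent on that disk

@dataclass
class PEG:              # a group of PEXes assigned to one storage class
    storage_class: str  # "mirrored" or "raid5"
    pexes: list = field(default_factory=list)

def peg_is_valid(peg: PEG) -> bool:
    # a PEG must contain at least three PEXes on three different disks
    return len(peg.pexes) >= 3 and len({p.disk for p in peg.pexes}) >= 3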
LOGICAL DATA LAYOUT

Logical allocation and migration unit is the
Relocation Block (RB)

Size in prototype was 64 KB:


Smaller RBs require more mapping
information, but larger RBs increase
migration costs after small updates
Each PEG holds a fixed number of RBs
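
A rough way to see the trade-off, assuming (purely for illustration) 8 bytes of mapping information per RB: halving the RB size doubles the number of map entries, while a larger RB means even a small update can later force a whole RB to migrate.

ENTRY_BYTES = 8                              # assumed size of one map entry
for rb_kib in (16, 64, 256):
    entries = (1024 * 1024) // rb_kib        # RBs (= map entries) per GiB
    print(f"{rb_kib:>3} KB RBs: {entries:>6} entries/GiB, "
          f"{entries * ENTRY_BYTES // 1024} KB of map per GiB")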
MAPPING STRUCTURES


Map addresses from virtual volumes to
PEGs, PEXes and physical disk addresses
Optimized for quickly finding the physical
address of an RB given its logical address
(see the sketch below):

Each logical unit has a virtual device
table listing all RBs in the logical unit
and pointing to their PEG

Each PEG has a PEG Table listing all
RBs in the PEG and the PEXes used to
store them
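
A toy version of this two-level lookup, assuming dictionary-backed tables and the PEX objects from the layout sketch (names are illustrative, not the controller's):

RB_SIZE = 64 * 1024

def resolve(virtual_dev_tables, peg_tables, lun, host_addr):
    # virtual device table: RB index within the logical unit -> PEG id
    rb_index = host_addr // RB_SIZE
    peg_id = virtual_dev_tables[lun][rb_index]
    # PEG table: RB index -> (PEX holding the RB, offset of the RB in that PEX)
    pex, rb_offset = peg_tables[peg_id][rb_index]
    return pex.disk, pex.offset + rb_offset + host_addr % RB_SIZE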
NORMAL OPERATIONS (I)

Requests are sent to the controller in SCSI
Command Descriptor Blocks (CDB):


Up to 32 CDBs can be simultaneously active
and 2,048 more can be queued
Long requests are broken into 64 KB
segments
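
The 64 KB segmentation of long requests can be pictured like this (a sketch, not the firmware):

SEGMENT = 64 * 1024

def split_request(offset, length):
    # break a long host request into pieces of at most 64 KB
    end = offset + length
    while offset < end:
        piece = min(SEGMENT, end - offset)
        yield offset, piece
        offset += piece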
NORMAL OPERATIONS (II)


Read requests:

First check whether the data are already in
the read cache or the non-volatile write cache

Otherwise allocate space in cache and
issue one or more requests to back-end
storage classes
Write requests return as soon as data are
modified in non-volatile write cache:

Cache has a delayed write policy
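
The read and write paths just described, reduced to pseudocode-style Python; the cache and back-end objects are assumed helpers, not the product's API:

def handle_read(addr, length, read_cache, nvram_cache, backend):
    data = nvram_cache.get(addr) or read_cache.get(addr)  # check both caches
    if data is None:
        data = backend.read(addr, length)  # one or more back-end reads
        read_cache.put(addr, data)         # allocate cache space and fill it
    return data

def handle_write(addr, data, nvram_cache):
    nvram_cache.put(addr, data)  # delayed write: flushed to the disks later
    return "complete"            # the host sees completion immediately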
NORMAL OPERATIONS (III)


Flushing data from the cache can involve:

A back-end write to a mirrored storage
class

Promotion from RAID 5 to mirrored
storage before the write
Mirrored reads and writes are
straightforward
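
The flush decision can be sketched under the assumption, stated above, that dirty data always ends up in the mirrored class, promoting the RB first if it currently lives in RAID 5 (the storage-class objects are placeholders):

def flush_dirty_rb(rb, mirrored, raid5):
    if raid5.holds(rb):
        raid5.remove(rb)  # promotion: the RB moves up to mirrored storage
    mirrored.write(rb)    # back-end mirrored write (two physical copies)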
NORMAL OPERATIONS (IV)


RAID 5 reads are straightforward
RAID 5 writes can be done:

On a per-RB basis: requires two reads
and two writes (sketched below)

In batched writes: more complex but
cheaper
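
The per-RB case is the classic read-modify-write: read the old data and the old parity, compute the new parity, then write both back, i.e. two reads and two writes. A self-contained sketch (the disk objects are placeholders):

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def raid5_small_write(disks, data_disk, parity_disk, offset, new_data):
    old_data   = disks[data_disk].read(offset)    # read 1
    old_parity = disks[parity_disk].read(offset)  # read 2
    new_parity = xor(xor(old_parity, old_data), new_data)
    disks[data_disk].write(offset, new_data)      # write 1
    disks[parity_disk].write(offset, new_parity)  # write 2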
BACKGROUND
OPERATIONS

Triggered when the array has been idle for
some time

Include

Compaction of empty RB slots,

Migration between storage classes (using
an approximate LRU algorithm) and

Load balancing between disks
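
One of these idle-time tasks, demotion by approximate LRU, might look like the following; the storage-class objects, the idleness test and the free-space target are assumptions made for the sketch:

def demote_when_idle(mirrored, raid5, target_free_bytes, array_is_idle):
    # run in the background; stop as soon as foreground I/O arrives
    while mirrored.free_space() < target_free_bytes and array_is_idle():
        victim = mirrored.least_recently_written()  # approximate LRU choice
        mirrored.remove(victim)
        raid5.append(victim)  # RAID 5 writes go out batched, log-structured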
MONITORING

System also includes:


An I/O logging tool and
A management tool for analyzing the
array performance
PERFORMANCE RESULTS (I)

HP AutoRAID configuration with:



16 MB of controller data cache
Twelve 2.0 GB Seagate Barracuda disks (7200 RPM)
Compared with:
 Data General RAID array with
64 MB front-end cache
 Eleven individual disk drives implementing
disk striping but without any redundancy
PERFORMANCE RESULTS (II)

Results of OLTP database workload:
 AutoRAID was better than the RAID array
and comparable to the set of
non-redundant drives
 But the whole database was stored in
mirrored storage!

Microbenchmarks:
 AutoRAID is always better than the RAID
array but has lower I/O rates than the set
of non-redundant drives
SIMULATION RESULTS (I)


Increasing the disk speed improves the
throughput:

Especially if density remains constant

Transfer rates matter more than rotational
latency
64 KB seems to be a good size for the
Relocation Blocks:

Around the size of a disk track
SIMULATION RESULTS (II)

The best heuristic for selecting the mirrored
copy to read is shortest queue

Allowing write cache overwrites has a HUGE
impact on performance

RBs demoted to RAID 5 should use existing
holes when the system is not too heavily loaded
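
The shortest-queue heuristic from the first bullet amounts to one comparison per read (per-disk queue lengths are assumed to be observable by the controller):

def pick_mirror_copy(copies, queue_len):
    # read from the copy whose disk currently has the fewest pending requests
    return min(copies, key=lambda copy: queue_len[copy.disk])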
SUMMARY (I)

System is very easy to set up



Dynamic adaptation is a big win but it will
not work for all workloads
Software is what makes AutoRAID, not the
hardware
Being auto-adaptive makes AutoRAID
hard to benchmark
SUMMARY (II)

Future work includes:

System tuning, especially:



Idle period detection
Front-end cache management
algorithms
Developing better techniques for
synthesizing traces