Failing in Place for Low-Serviceability Infrastructure Using High-Parity GPU-Based RAID

Matthew L. Curry, University of Alabama at Birmingham and Sandia National Laboratories, [email protected]
H. Lee Ward, Sandia National Laboratories, [email protected]
Anthony Skjellum, University of Alabama at Birmingham, [email protected]
Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United
States Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.
Introduction
• RAID increases the speed and reliability of disk arrays
• Mainstream technology limits fault tolerance to two arbitrary failed disks per volume at a time (RAID 6)
• Always-on, busy systems that have no service periods are at increased risk of failure
  – Hot spares can decrease rebuild time, but leave installations vulnerable for days at a time
• High-parity GPU-based RAID can decrease the long-term risk of data loss
Motivation: Inaccurate Disk Failure Statistics
• Mean Time To Data Loss estimates for RAID often assume that manufacturer statistics are accurate
  – Disks are 2-10x more likely to fail than manufacturers estimate*
  – Estimates assume low replacement latency
*Bianca Schroeder and Garth A. Gibson. “Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?” In Proceedings of the 5th USENIX Conference on File and Storage Technologies, Berkeley, CA, USA, 2007. USENIX Association.
Errors
[Figure: probability of success (no error) as a function of the amount of data read (0–100), for unrecoverable bit error rates (BER) of 10^-14, 10^-15, and 10^-16.]
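A simple model consistent with these curves (an assumption on our part; the slide does not say how the plot was generated): if a read touches C terabytes, i.e. 8×10^12·C bits, and each bit independently suffers an unrecoverable error with probability BER, then

\[
P(\text{no error}) = (1 - \mathrm{BER})^{8 \times 10^{12} C} \approx e^{-8 \times 10^{12} C \,\mathrm{BER}}
\]

For example, at BER = 10^-14 a 100 TB read succeeds with probability about e^-8 ≈ 0.03%, while at BER = 10^-16 the same read succeeds with probability about 92%.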
Result: Hot Spares Are Inadequate
[Figure: probability of data loss within ten years versus data capacity (0–300 TB, using 2 TB disks), on a log scale from 1 to 10^-4, for RAID 5+0 (4 sets), RAID 6+0 (4 sets), and RAID 1+0 (3-way replication). Assumes MTTR = one week and MTTF = 100,000 hours.]
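For reference, a common textbook approximation (not necessarily the model behind this figure) for a single RAID 6 set of N disks with independent failures, mean time to failure MTTF, and mean time to repair MTTR is

\[
\mathrm{MTTDL} \approx \frac{\mathrm{MTTF}^{3}}{N (N-1) (N-2)\, \mathrm{MTTR}^{2}},
\qquad
P(\text{loss within } t) \approx 1 - e^{-t/\mathrm{MTTDL}}
\]

Striping several such sets (RAID 6+0) adds their loss rates, and the one-week MTTR enters the estimate squared, which is why shrinking the window of vulnerability matters so much.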
Solution: High-Parity RAID
• Use Reed-Solomon coding to provide much higher fault tolerance
  – Configure the array to tolerate the failure of any m disks, where m can vary widely
  – Analogues: RAID 5 has m=1, RAID 6 has m=2, RAID 6+0 also has m=2
  – Spare disks, instead of remaining idle, participate actively in the array, eliminating the window of vulnerability
• This is very computationally expensive (see the sketch below)
  – k+m RAID, where k is the number of data blocks per stripe and m is the number of parity blocks per stripe, requires O(m) operations per byte written
  – The operations are table look-ups, so there is no vector-based parallelism on x86/x86-64
  – Arrays with failed disks operate in degraded mode, which requires a comparable decoding computation to reconstruct the missing data
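As a concrete illustration of the O(m) cost, here is a minimal CPU-style sketch of the encoding inner loop (our own illustration, not the authors' implementation; the GF(2^8) log/antilog tables are assumed to be filled elsewhere). Every byte written costs one Galois-field multiply, i.e. table look-ups plus an XOR, for each of the m parity blocks.

```
/* Minimal sketch of k+m Reed-Solomon encoding (illustration only, not the
 * authors' code). gflog/gfexp are assumed to be GF(2^8) log/antilog tables;
 * building them from a generator polynomial is omitted here. */
#include <string.h>

static unsigned char gflog[256];
static unsigned char gfexp[512];          /* doubled so log sums need no modulo */

static unsigned char gf_mul(unsigned char a, unsigned char b)
{
    if (a == 0 || b == 0)
        return 0;
    return gfexp[gflog[a] + gflog[b]];    /* two table look-ups per multiply */
}

/* Encode one stripe: k data blocks of len bytes produce m parity blocks.
 * coeff[i*k + j] is the coding-matrix entry for parity i and data block j.
 * Every data byte is touched once per parity block: O(m) work per byte. */
void rs_encode(unsigned char **data, unsigned char **parity,
               const unsigned char *coeff, int k, int m, int len)
{
    for (int i = 0; i < m; i++) {
        memset(parity[i], 0, len);
        for (int j = 0; j < k; j++)
            for (int b = 0; b < len; b++)
                parity[i][b] ^= gf_mul(coeff[i * k + j], data[j][b]);
    }
}
```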
Use GPUs to Provide Fast RAID Computations
• Reed-Solomon coding accesses a look-up table
• The NVIDIA CUDA architecture supplies several banks of shared memory, accessible in parallel (see the kernel sketch below)
• 1.82 look-ups per cycle per SM on average
• A GeForce GTX 285 has 30 SMs, so it can satisfy roughly 55 look-ups per cycle per device
  – Compare to one per core in x86 processors
[Diagram: compute cores attached to shared memory banks within an SM.]
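A minimal CUDA sketch of the idea (our illustration, not the authors' kernels): stage the GF(2^8) tables in shared memory so the banks can serve many look-ups per cycle, then let each thread multiply one byte of a buffer by a coding coefficient.

```
// Minimal CUDA sketch (illustration only, not the authors' kernels):
// multiply a buffer by one coding coefficient in GF(2^8), with the
// log/antilog tables staged in shared memory so that the banks can serve
// many look-ups in parallel. The host is assumed to fill d_gflog and
// d_gfexp with cudaMemcpyToSymbol() before launching.
__constant__ unsigned char d_gflog[256];
__constant__ unsigned char d_gfexp[512];     // doubled: no modulo after adding logs

__global__ void gf_mul_const(const unsigned char *in, unsigned char *out,
                             int n, unsigned char c)
{
    __shared__ unsigned char s_log[256];
    __shared__ unsigned char s_exp[512];

    // Each block copies the tables into its own shared memory once.
    for (int i = threadIdx.x; i < 256; i += blockDim.x)
        s_log[i] = d_gflog[i];
    for (int i = threadIdx.x; i < 512; i += blockDim.x)
        s_exp[i] = d_gfexp[i];
    __syncthreads();

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        unsigned char a = in[idx];
        // GF(2^8) multiply via two shared-memory look-ups.
        out[idx] = (a == 0 || c == 0) ? 0 : s_exp[s_log[a] + s_log[c]];
    }
}
```

A full encoder would accumulate m such products per stripe (one per parity block), XOR-ing them into the parity buffers.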
Performance: GeForce GTX 285 vs. Intel Extreme Edition 975
[Figure: coding throughput (MB/s, 0–5000) versus number of data buffers (2–30), for GPU and CPU at m=2, m=3, and m=4.]
• The GPU is capable of providing high bandwidth, at 6x-10x CPU speeds
• A new memory layout and matrix generation algorithm for Reed-Solomon yields equivalent encoding and decoding performance
A User Space RAID Architecture and Data Flow
• GPU computing facilities are inaccessible from kernel space
• Provide the I/O stack components in user space, accessible via iSCSI
  – Accessible via the loopback interface for direct-attached storage, as in this study
• Alternative: micro-driver
Performance Results
[Figure: streaming throughput (MB/s, 0–600) by RAID type, comparing Linux MD RAID 0 with GPU RAID using 2, 3, and 4 parity disks (GPU +2, +3, +4).]
• 10 GB streaming read/write operations to a 16-disk array
• Volumes mounted over the loopback interface
  – The stgt version used is the bottleneck; later versions have higher performance
• Prototype for general-purpose RAID
  – Arbitrary parity
  – High-parity configurations for initial hardware deployments
• Guards against batch-correlated failures
• Flexibility not available with hardware RAID
  – Lower-bandwidth, higher-reliability distributed data stores
Future Work
• UAB
  – Data-intensive “active storage” computation applications
  – Computation and compression within the storage stack
  – NSF-funded AS/RAID testbed (0.5 petabyte)
  – Multi-level RAID architecture, failover, trade studies
  – Actually use the storage (e.g., ~200 TB of 500 TB)
• Sandia
  – Active/active failover configurations
  – Multi-level RAID architectures
  – Encryption within the storage stack
• Commercialization
Conclusions
• RAID reliability is increasingly governed by the unrecoverable bit error rate (BER)
• The window of vulnerability during rebuild is dangerous and needs to be eliminated if the system stays busy 24x7
  – Hot spares are inadequate because of the elongated rebuild times that leave the array exposed