Disk Arrays
COEN 180 Large Storage Systems

A disk array is a collection of disks used to store a large amount of data.
Performance advantage: each drive can satisfy only so many I/Os per second, so data spread across more drives is more accessible.
JBOD: Just a Bunch Of Disks.

Large Storage Systems
The principal difficulty is reliability: data needs to be stored redundantly.
- Mirroring / replication: simple; good performance; but expensive (double, triple, ... storage costs).
- Erasure-correcting codes: complex; moderate performance; but save storage.

Mirrored Disks
Used by Tandem (1970-1997, bought by Compaq). The NonStop architecture used redundancy (CPU, storage) for fail-over; data is replicated on both drives.
Performance:
- Writes: as fast as in the single-disk model.
- Reads: slightly faster, since we can serve a read from the drive with the best expected service time.

Disk Performance Modeling Basics
- Service time: the time to satisfy a request if the system is otherwise idle.
- Response time: the time to satisfy a request at a given system load. Response time = service time + waiting time.
- Utilization: the fraction of time the system is busy.

M/M/1 queue (single server). Assume Poisson arrivals with rate λ and exponential service time S.
Utilization: U = λS (Little's law).
Determine the response time R from R = S + U·R, hence R = S/(1 − U) = S/(1 − λS).
[Figure: response time R versus arrival rate λ for S = 1; R grows without bound as U approaches 1.]

We still need to determine the service time of a disk request:
service time = seek time + latency + transfer time.
Industrial (but wrong) determination: seek time = the time to travel one third of the disk. Why one third?
Assume the head sits on a random track and the target is another random track, and model the set of tracks as the interval [0, 1]. Given x in [0, 1], calculate D(x) = the average distance of a random point in [0, 1] from x.
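The "one third" can first be checked by simulation before deriving it analytically; a quick sketch (variable names are illustrative):

```python
import random

# Monte Carlo estimate of the average distance between two uniform
# random points in [0, 1]; the estimate comes out close to 1/3.
random.seed(42)
n = 100_000
avg = sum(abs(random.random() - random.random()) for _ in range(n)) / n
print(avg)
```

With 100,000 samples the estimate is within a few thousandths of 1/3, matching the integral computed next.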
D(x) = ∫₀¹ |y − x| dy
     = ∫₀ˣ (x − y) dy + ∫ₓ¹ (y − x) dy
     = x²/2 + (1 − x)²/2
     = x² − x + 1/2.
[Figure: D(x) on [0, 1]; minimum 1/4 at x = 1/2, maximum 1/2 at the edges.]

Now calculate the average distance from a random point to a random point in [0, 1]:
∫₀¹ D(x) dx = [x³/3 − x²/2 + x/2] from 0 to 1 = 1/3.
This is where the "one third" comes from.

Is the average seek time equal to the seek time for the average distance?
No: seek time does not depend linearly on seek distance. A seek consists of
- acceleration,
- cruising (if the seek distance is long),
- braking,
- exact positioning.
Practical measurements suggest that seek time grows roughly as the square root of the seek distance.
[Figure: seek time versus seek distance, roughly proportional to the square root of the distance.]

Rule of thumb: keep the utilization of disks between 50% and 80%.

Disk Arrays: Dealing with Reliability
RAID: Redundant Array of Inexpensive (Independent) Disks.
RAID levels:
- RAID Level 0: JBOD (striping).
- RAID Level 1: mirroring.
- RAID Level 2: encodes symbols (bytes) with a Hamming code and stores each bit of a symbol on a different disk. Not used in practice.
- RAID Level 3: encodes symbols (bytes) with the simple parity code. Breaks a file up into n stripes, calculates a parity stripe, and stores all n + 1 stripes on n + 1 disks. [Figure: Data, Data, Data, Parity.]
- RAID Level 4: maintains n data drives; files are stored completely on one drive, or perhaps in stripes if files become very large. An additional drive stores the byte-wise parity of the data drives. Drawback: uneven load on the parity drive compared with the data drives.
- RAID Level 5: no dedicated parity disk. Data is stored in blocks; blocks in parallel positions on the disks form a reliability stripe, and one block in each reliability stripe is the parity of the others.
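The rotation of the parity block through the disks can be sketched as follows (one possible layout, chosen for illustration; real RAID 5 implementations use several variants, e.g. left-symmetric):

```python
def parity_disk(stripe, n_disks):
    """Disk holding the parity block of a reliability stripe, rotated so
    that each disk carries an equal share of the parity blocks.
    This particular rotation is illustrative, not a specific product's."""
    return (n_disks - 1 - stripe) % n_disks

# With 5 disks, stripes 0..4 place their parity on disks 4, 3, 2, 1, 0,
# so no single disk carries all the parity traffic.
print([parity_disk(s, 5) for s in range(5)])
```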
Since the parity blocks rotate through the array, there is no single performance bottleneck.
- RAID Level 6: like RAID Level 5, but every stripe has two parity blocks. Lower write performance, but 2-failure resilience.
- RAID Level 7: a proprietary name for RAID Level 3 with lots of caching. (Marketing bogus.)

Disk Array Operations
Reads: served directly from the data blocks in RAID Levels 3-6.
Writes:
- Large writes (to all blocks in a single reliability stripe): calculate the parity from the data and write everything.
- Small writes need to maintain the parity:
  Option 1: write the data, then read all other blocks in the stripe and recalculate the parity.
  Option 2: read the old data, then overwrite it. Calculate the difference (XOR) between the old and the new data. Then read the old parity, XOR it with the result of the previous operation, and overwrite the parity block with it.

Reconstruction (RAID Levels 4-5): systematically reconstruct only the lost data. Read all surviving blocks in a reliability stripe and calculate their parity; this is the lost data block. Write the data block in place of the parity. Data that is being read is reconstructed out of order.

Performance Analysis
Assume that read and write service times are the same: seek + latency + transfer.
A small write involves a read-modify-write operation: seek + latency + transfer to read, then two (average) latencies for the platter to rotate back around, then the transfer of the write. This is about twice as long as a plain read/write service time.

Level 4 RAID: offered read load r, offered write load w, n disks (n − 1 data disks plus one parity disk).
- Utilization at a data disk: r·S/(n − 1) + w·2S/(n − 1).
- Utilization at the parity disk: w·2S.
The utilizations are equal only if r = 2(n − 2)·w.

For a numerical example, consider a Level 4 RAID with offered load λ, where all writes are small writes.
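The small-write cost that drives this analysis (Option 2 above) can be sketched with byte-wise XOR (hypothetical helper functions, assuming equal-sized blocks):

```python
def xor_blocks(a, b):
    """Byte-wise XOR of two equal-sized blocks."""
    return bytes(x ^ y for x, y in zip(a, b))

def small_write(old_data, new_data, old_parity):
    """RAID 4/5 small write, Option 2: the new parity is the old parity
    XORed with the difference between the old and the new data."""
    delta = xor_blocks(old_data, new_data)   # read old data, compare with new
    return xor_blocks(old_parity, delta)     # read old parity, patch it

# Sanity check on a 2+1 stripe: the parity stays the XOR of the data blocks.
d0, d1 = b"\x0f\xf0", b"\x33\xcc"
parity = xor_blocks(d0, d1)
new_d0 = b"\xaa\x55"
new_parity = small_write(d0, new_d0, parity)
print(new_parity == xor_blocks(new_d0, d1))
```

Note the I/O count this implies: a small write touches two disks with a read-modify-write each, which is what makes small writes expensive.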
Assume a fraction ρ of the requests are reads. With n data disks:
- utilization at a data disk: ρ·λS/n + (1 − ρ)·λ·2S/n;
- utilization at the parity disk: (1 − ρ)·λ·2S.
[Figure: utilization versus offered load (IO/sec, 100-500); parameters: 4+1 layout, 70% reads, service time 10 msec. The parity disk saturates long before the data disks.]

RAID Level 5: offered load λ, read ratio ρ, n disks.
- Read load per disk: ρ·λS/n.
- Write load per disk: (1 − ρ)·λ·4S/n, since every write leads to two read-modify-write operations (one for the data block, one for the parity block).
[Figure: Level 4 RAID versus Level 5 RAID, utilization versus offered load, same parameters (4+1 layout, 70% reads, service time 10 msec). In Level 4 the parity drive saturates first; Level 5 spreads the load and behaves almost like a JBOD without a parity disk.]

Performance: small writes are expensive.
Parity logging (Daniel Stodolsky, Garth Gibson, Mark Holland):
- Write operation: read the old data, write the new data, and send the XOR of old and new data to a parity log file.
- Whenever the parity log file becomes too big, process it by updating the parity information.

Reliability
Reliability is accurately given by the probability of data loss at every moment in time. [Figure: survival probability over time.]
It is often given as the Mean Time To Data Loss (MTTDL). Warning: MTTDL numbers can be deceiving. [Figure: two survival curves; the red curve is more reliable during the design life, but has the lower MTTDL.]

Markov models
Use a Markov model to describe the system in its various states. States describe the system; the model assumes constant transition rates. Transitions correspond to component failures and component repairs.
- One-component system: initial state --λ--> failure state (absorbing). MTTDL = MTTF = 1/λ.
- Two-component system without repair: initial state (2 components working) --2λ--> one component working, one failed --λ--> failure state (absorbing).
- Two-component system with repair: as above, plus a repair transition --μ--> from the degraded state back to the initial state.

How to calculate the MTTF:
1. Start with the original Markov model.
2. Remove the failure state.
3. Replace the transition(s) to the failure state with failure transitions back to the initial state.
This models a meta-system where we replace a failed system immediately with a new one. Now calculate the steady-state solution of the Markov model, which has typically become ergodic, and use it to calculate the average rate at which a failure transition is taken. The reciprocal of this rate is the MTTF.

One-component system: the system is in the initial state all the time, and the failure transition is taken at rate λ. "Loss rate" L = λ, so MTTDL = 1/L = 1/λ.

Two-component system without repair: let x be the probability of being in state 2 (both components working) and y the probability of being in state 1 (one working, one failed).
- Inflow into state 2 = outflow from state 2: 2λx = λy.
- The probabilities sum to one: x + y = 1.
- Solution: x = 1/3, y = 2/3.
- Loss rate: L = λ·(2/3), so MTTF = 1/L = 1.5·(1/λ): 1.5 times better than a single component.

Two-component system with repair (repair rate μ):
- 2λx = (λ + μ)y, x + y = 1.
- Solution: x = (λ + μ)/(3λ + μ), y = 2λ/(3λ + μ).
- Loss rate: L = λy = 2λ²/(3λ + μ).
- MTTF = 1/L = (3λ + μ)/(2λ²).

RAID Level 4/5 reliability: initial state (n disks) --nλ--> n − 1 disks --(n−1)λ--> failure state (absorbing), with a repair transition back from the degraded state.

RAID Level 6 reliability: initial state (n disks) --nλ--> n − 1 disks --(n−1)λ--> n − 2 disks --(n−2)λ--> failure state (absorbing), with repair transitions back from the degraded states.

Sparing
Create more resilience by adding a hot spare. Failover to the hot spare reconstructs the contents of the lost disk on the spare disk and replaces it. Distributed sparing (Menon et al.): distribute the spare space throughout the disk array.
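The closed-form MTTF of the two-component system with repair can be checked numerically against the steady-state recipe above (function name and rates are illustrative):

```python
def mirrored_pair_mttf(lam, mu):
    """MTTF of a two-component system with repair, from the steady state
    of the meta-model: solve 2*lam*x = (lam + mu)*y with x + y = 1,
    then MTTF = 1 / (lam * y) = (3*lam + mu) / (2*lam**2)."""
    y = 2 * lam / (3 * lam + mu)        # probability of the degraded state
    loss_rate = lam * y                 # rate of taking the failure transition
    return 1.0 / loss_rate

# With no repair (mu = 0) this degenerates to 1.5/lam, the two-component
# result without repair derived above.
print(mirrored_pair_mttf(0.001, 0.0))
```

Even a modest repair rate dominates the result: for μ much larger than λ, the MTTF is approximately μ/(2λ²).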