1 Disk Subsystem Design


1.1 Introduction

This paper is my way of putting a number of disparate thoughts down on paper so that I can organise them. I also expect other system designers have had similar thoughts.

The origin of this paper is when one of my colleagues said: "5 disks in an FC60 will flood the enclosure's bus." I disagreed & spent a few days trying to show that this was an ultra-conservative way of designing disk subsystems (e.g. an FC60 could only have 30 disks, a VA7410 would be flooding with 30 disks, etc.).

1.2 Assumptions

Assume… you know what they say, it makes an ass out of u & me. That said, I'll stick my neck out. I know very little about fibre technology & disk algorithms, or about what happens when multiple disks try to transmit together & flood the bus. So the following assumptions have been used to paper over the cracks!

- I've assumed the disks will be transferring data randomly.

- I've assumed that fibre is a broadcast medium & there is very little/no coordination between the disks & host (or controller).

- RAID 0 has been used & mirroring has been (on purpose) ignored, so an IO write is the same as a read & all the disks are evenly loaded.

1.3 Formulae

First things first: formulation of the problem & formulae. In general, units are converted to MB & seconds, so various conversion factors appear in the calculations. Below are the fundamental symbols & their meanings.

General

BW_x – Bandwidth of x; the rate at which data is transferred [MB/s].

TP_x – Throughput of x; the number of IOs per second [IO/s].

Disk

The disks used will all be the same, with the following attribute symbols.

Xfr – Disk burst [max] transfer rate [MB/s].

B – Block size used [kB].

St – Service time of disk [ms].

μ – Disk utilisation, or design point, as a fraction.

p – Ratio (probability) of transfer time : total IO time.

q – Ratio (probability) of "white space" time : total IO time.

Disk Subsystem (JBOD)

The disk subsystem is made up of a number of the above disks. Usually a disk subsystem is broken further into groups of disks on channels or loops, which are housed in one or more disk enclosures. Data is evenly striped (say RAID 1+0) across all the disks and enclosures such that the activity of one disk is representative of all the disks.

Lxfr – Loop speed [Gbit/s].

N_d – Number of disks on the loop/subsystem.

N_f – Lxfr/Xfr; the minimum number of disks that could cause flooding of the loop.

P – Probability of flooding the loop.

Q – Probability of not flooding the loop.

1.3.1 Calculating p, q, disk bandwidth & throughput

p & q are used to calculate the chances of flooding the loop.

p = B/{μ.(St.Xfr.1024/1000 + B)}

q = 1 - p = {μ.St.Xfr.1024 + 1000.B.(μ - 1)}/{1000.μ.(St.Xfr.1024/1000 + B)}

The throughput & bandwidth formulae:

TP_dsk = μ.1024.Xfr/(St.Xfr.1024/1000 + B) .. in .. [IO/s]

BW_dsk = μ.Xfr.B/(St.Xfr.1024/1000 + B) .. in .. [MB/s]

BW_dsk% = μ.B/(St.Xfr.1024/1000 + B) .. in .. [%]
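These are simple enough to sanity-check numerically. Below is a minimal Python sketch of the single-disk formulae; the function and argument names are my own, not from the paper:

```python
# Single-disk formulae from section 1.3.1.
# Units as in the paper: Xfr [MB/s], B [kB], St [ms], mu as a fraction.

def p_transfer(Xfr, B, St, mu):
    """p - ratio (probability) of transfer time : total IO time."""
    return B / (mu * (St * Xfr * 1024 / 1000 + B))

def tp_disk(Xfr, B, St, mu):
    """TP_dsk - single-disk throughput [IO/s]."""
    return mu * 1024 * Xfr / (St * Xfr * 1024 / 1000 + B)

def bw_disk(Xfr, B, St, mu):
    """BW_dsk - single-disk bandwidth [MB/s]."""
    return mu * Xfr * B / (St * Xfr * 1024 / 1000 + B)

# The 15,000 rpm disk used in section 1.4: Xfr = 50 MB/s, St = 5.6 ms, B = 4 kB
print(round(bw_disk(50, 4, 5.6, 1.0), 2))   # -> 0.69 MB/s, as quoted in 1.4.1
```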

1.3.2 Calculating loop flood probability, P

Using the basic disk formulae above, the chance of multiple disk transfers occurring together and flooding the bus can be evaluated by expanding the binomial expression (p + q)^N_d. This does not need to be fully expanded: as the minimum number of disks that could flood the bus is usually [much] less than the total number of disks, it is easier to add together the terms that do not cause flooding [Q] & calculate the probability of flooding [P] from that.

Q = ∑ Q_n for n = 0, 1, 2, … N_f, where Q_0 = q^N_d & Q_(n+1) = Q_n.(N_d - n).p/{q.(n + 1)}

P = 1 - Q
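A short sketch of this recurrence in Python (my naming; for a non-integer N_f the sum runs to ⌊N_f⌋, the largest number of simultaneous transfers that does not flood the loop):

```python
def flood_probability(p, Nd, Nf):
    """P - probability of flooding the loop (section 1.3.2).

    Sums the binomial terms Q_n for n = 0 .. floor(Nf) simultaneous
    transfers via Q_(n+1) = Q_n.(Nd - n).p/{q.(n + 1)}, then P = 1 - Q.
    """
    q = 1.0 - p
    Qn = q ** Nd                 # Q_0 = q^Nd
    Q = Qn
    for n in range(int(Nf)):     # generates Q_1 .. Q_floor(Nf)
        Qn = Qn * (Nd - n) * p / (q * (n + 1))
        Q += Qn
    return 1.0 - Q
```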

1.3.3 Disk subsystem, bandwidth and throughput

The point of calculating the flooding probability [P] is to use it to calculate the effective disk subsystem bandwidth & throughput. Multiple disks on a single bus are generally efficient because disks tend to spend a large amount of time looking for data (seek time) & very little time transferring it. This results in the disks interleaving their data transfers at bandwidths less than the bus's. However, there is a chance that more disks than the bus can handle will transfer in sync. This bus flooding will be very inefficient; its effects on the throughput, and hence bandwidth, calculations are based on the following assumptions:

- Performance is linear & the gross throughput & bandwidth are the sum of those of all the constituent disks.

- When a channel/loop has flooded, it will perform IOs at the flooded level.

Thus if a disk subsystem has a 25% chance of flooding, the IO rate of the whole system will be the IO rate of 75% of all the disks plus 25% of the IO rate of the number of disks that can cause flooding. Thus…

ε_ss = [Q.N_d + P.N_f]/N_d = [P.(N_f - N_d) + N_d]/N_d .. efficiency

TP_ss = μ.1024.Xfr.p.[P.(N_f - N_d) + N_d]/B .. in .. [IO/s]

BW_ss = μ.Xfr.p.[P.(N_f - N_d) + N_d] .. in .. [MB/s]

BW_ss% = μ.p.[P.(N_f - N_d) + N_d]/N_f .. in .. [%]
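Combining the pieces gives a small subsystem calculator. This sketch reuses p_transfer() and flood_probability() from the snippets above, and converts the loop speed with the paper's 2 Gbit/s ≈ 238.4 MB/s convention:

```python
def subsystem(Xfr, B, St, mu, Nd, Lxfr_gbit):
    """Return (BW_ss [MB/s], P, efficiency) for Nd disks on one loop."""
    Lxfr = Lxfr_gbit * 1e9 / 8 / 2**20   # Gbit/s -> MB/s (2 Gbit/s ~ 238.4 MB/s)
    Nf = Lxfr / Xfr                      # minimum disk count that could flood
    p = p_transfer(Xfr, B, St, mu)
    P = flood_probability(p, Nd, Nf)
    eff = (P * (Nf - Nd) + Nd) / Nd      # epsilon_ss
    BWss = mu * Xfr * p * (P * (Nf - Nd) + Nd)
    return BWss, P, eff
```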

1.3.4 Comments

Writing the subsystem bandwidth results in terms of p, P, μ, Xfr, N_d & N_f, rather than B, St, p, q, P, Q, μ, Xfr, N_d & N_f, was to try & reduce the number of parameters they depend on.

The other thing to mention is that the system designer really only has a few things that can be altered: the number of disks, the disk utilisation (or design point), the block size & the type of disk are usually the only variables available. As such, BW_ss has these in the form of p, N_d & N_f.

Max bandwidth

The maximum bandwidth is when d(BW_ss)/dp = 0. This at first seems to be a highly complex calculation. However, it is possible to derive the following:

P = [N_d - (N_f + 1).(N_d - N_f).P_(N_f+1)]/(N_d - N_f)

where P_(N_f+1) = N_d!.(1 - p)^(N_d - N_f - 1).p^(N_f + 1)/{(N_d - N_f - 1)!.(N_f + 1)!}

Looking at the above formula, it can be expressed solely in p, N_d & N_f. Another, more intuitive way of looking at it is that when the probability of flooding is a certain proportion of the probability of exactly N_f + 1 synchronised disk transfers (which is what P_(N_f+1) is), the bandwidth will be at a [local] maximum.

Back substitution of this gives

BW_ss max = μ.Xfr.p.(N_f + 1).(N_d - N_f).P_(N_f+1)

BW_ss% max = μ.p.(N_f + 1).(N_d - N_f).P_(N_f+1)/N_f

Unfortunately, the above two formulae do not yield easily to pen and paper analysis!
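They do yield to brute force, though. A numeric search sketch, reusing the functions above (the 500-disk search cap is an arbitrary choice of mine):

```python
def best_disk_count(Xfr, B, St, mu, Lxfr_gbit, max_disks=500):
    """Scan Nd = 1..max_disks for the disk count that maximises BW_ss."""
    bw, Nd = max((subsystem(Xfr, B, St, mu, Nd, Lxfr_gbit)[0], Nd)
                 for Nd in range(1, max_disks + 1))
    return Nd, bw
```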

1.4 Calculations on JBOD, 15,000 rpm Seagate on 2 Gbit Fibre

1.4.1 Question 1

OK, I know from manufacturers' specs that, say, 4 disks will flood my bus. But I get FAR lower bandwidth usage per disk. Can I simply use multiple disks up to the bandwidth of the bus?

Calculation

The above is an optimistic methodology for designing the disk subsystem. Let's look at some numbers:

Xfr = 50 MB/s, Lxfr = 2 Gbit/s (≈ 238.4 MB/s), St = 5.6 ms, B = 4 kB, μ = 100%, so N_f = 238.4/50 ≈ 4.77 and BW_dsk = 0.69 MB/s; filling the bus with 0.69 MB/s disks gives N_d = 346.

Using the formulae for this, P = 0.518 and the disk subsystem bandwidth is 116.5 MB/s, or 48.8% of the available bus bandwidth.

Using iterative applications of the calculations, it can be shown that the best bandwidth from the system is obtained when the number of disks is 266, and the result is just over 128 MB/s (with P = 30.5% & BW_ss% = 53.8%).
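These figures can be reproduced with the sketch functions from section 1.3 (agreeing to within rounding):

```python
bw, P, eff = subsystem(50, 4, 5.6, 1.0, 346, 2)
print(bw, P)                                # ~116 MB/s, P ~ 0.52
print(best_disk_count(50, 4, 5.6, 1.0, 2))  # -> (266, ~128 MB/s)
```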

Answer

The direct answer is: No. Firstly, if you do this the bandwidth is less than the maximum potentially available. Secondly, the same performance could be achieved with fewer (about 40% fewer) disks. That said… in practice it may only be possible to string together 30 or so disks, and at this number of disks P [the probability of bus flooding] is virtually zero!

1.4.2 Question 2

If question 1 is an optimistic methodology, how about using the maximum number of disks that will not cause bus flooding?

Answer

For the same disks as in question 1, you need only 4 disks to achieve this. This will use only a fraction of the available bus bandwidth. Flooding & efficiency will be 0% & 100% respectively, but the bandwidth will be 0.68 MB/s per disk for a 4 kB block size [B = 4] and up to 23.7 MB/s per disk for a 256 kB block size; this is 0.3-9% of the bus bandwidth per disk.
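A quick check with the single-disk sketch from section 1.3.1:

```python
print(bw_disk(50, 4, 5.6, 1.0))     # ~0.69 MB/s per disk at B = 4 kB
print(bw_disk(50, 256, 5.6, 1.0))   # ~23.6 MB/s per disk at B = 256 kB
```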

1.4.3 Question 3

What block size should I use to get the best bandwidth on the system?

Calculation

Repeated usage of the formulae gives the five graphs described below. Each graph shows the effect of increasing the block size for a fixed number of disks; there are five graphs as each one shows the optimum number of disks for the 4, 8, 16, 32 & 64 kB block sizes. The disk utilisation [μ] is set to 100%. These results are summarised in the following table.

Block [kB]   Num disks, N_d   Bandwidth, BW_ss [MB/s]   Flooding prob, P [%]   Efficiency, ε_ss [%]
    64             22                 141.7                    37.4                  70.6
    32             38                 135.1                    33.3                  70.8
    16             70                 131.3                    31.1                  71.0
     8            135                 129.3                    30.5                  70.6
     4            266                 128.2                    30.5                  70.1

[Figure: "Loci of optimised design points" - BW_ss%, P & ε_ss (left axis, 0-100%) and N_d (right axis, 0-280) plotted against block size [kB], for Xfr = 50 MB/s, Lxfr = 2 Gbit/s, St = 5.6 ms, μ = 100%.]
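The table rows can be regenerated by sweeping block sizes with the sketch functions above; the printed values should match the table to within rounding:

```python
for B in (64, 32, 16, 8, 4):
    Nd, bw = best_disk_count(50, B, 5.6, 1.0, 2)
    _, P, eff = subsystem(50, B, 5.6, 1.0, Nd, 2)
    print(f"B={B:2d} kB  Nd={Nd:3d}  BWss={bw:5.1f} MB/s  P={P:5.1%}  eff={eff:5.1%}")
```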

From the graphs it can be seen that:

- Larger block sizes [B] are optimal with smaller total numbers of disks [N_d], and it is roughly an inverse relationship.

- The maximum bandwidth [BW_ss] increases slowly with block size [B].

- As the number of disks increases, the peak is sharper & more localised.

- The probability of flooding & the disk subsystem efficiency move in opposite directions. Interestingly, at each configuration's local peak they are similar (P ≈ 33%, ε_ss ≈ 70%).

Answer

As the above graphs show for a given number of disks there is a block size that will produce the maximum bandwidth. At these points it seems that actual bandwidth, P

&

ε ss

values are fairly invariant! That said, there are two possibilities

Over shooting the peak. Assume there are 70 disks and the user wants to use 32 kB as the block size. This would overshoot the peak and the bandwidth would be about 80 MB/s [35% of 2 Gbit/s]. The way to solve this is to halve the disks to one set of 35, which would get just under 135 MB/s [55% of 2 Gbit/s]. If the storage space is crucial (as opposed to performance), then keep all 70 disks as two sets of 35 on two loops.
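The overshoot is easy to see with the sketch functions:

```python
print(subsystem(50, 32, 5.6, 1.0, 70, 2)[0])  # ~75 MB/s: past the peak (the text quotes ~80)
print(subsystem(50, 32, 5.6, 1.0, 35, 2)[0])  # ~134 MB/s per loop of 35 disks
```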

Under shooting the peak. This is not soluble in the same way as the above: with 70 disks and 8 kB blocks the bandwidth is 91 MB/s [43%]. More disks would help, or using a bigger block size.

Over shooting the peak, in practice. This case is the most interesting, as in reality the disks will not be running at 100% continuously but will ramp up to it. What would be seen in reality is the disks ramping their usage up to a peak; then, as demand increases further, the requests are throttled. Depending on the control algorithms used in the disks & the intelligence in the software, one of two things will happen:

- utilisation continues ramping up & the bandwidth is throttled; or

- utilisation sits at 50% & the bandwidth is capped.

The second is preferable as the system will maintain a high bandwidth, BUT it will look like more bandwidth is available!

1.5 Calculations on Intelligent Disk Subsystem on 2 Gbit Fibre

Hmmm, this is still to be done. Basically I'll need to re-think the assumptions, as the system is intelligent & can re-order IO to make best use of resources. You can also say that bandwidth in equals bandwidth out, but the block size will probably get bigger to improve efficiency.
