HPE 3PAR RAID 6
Securing your data investment
Technical white paper
Contents
Executive summary
RAID
How does RAID work?
  RAID 1
  RAID 5
  RAID 6
  Capacity efficiency
Drive failures and rebuild
  The second failure
  Uncorrectable Read Error
  HPE 3PAR and wide striping
  Are SSDs different?
  Relative data safety of RAID 1 vs. RAID 5 vs. RAID 6
Performance
  RAID overhead
  Performance during rebuild
  Anatomy of a rebuild
  Sparing rates
RAID conversion
  Investigation
  RAID conversion process
Summary
  HPE recommendation (3.3.1 and later)
Glossary
Executive summary
Data continues to grow at alarming rates and the need to store more data has never been greater. This data expansion has led to a continued
growth in disk drive capacities with only modest growth in drive reliability and performance. The gap between capacity and reliability leads to
increased risk to your data, and the time required to recover from a drive failure continues to grow. Compounding drive recovery is the increased
risk of a second failure while still recovering from the initial failure. This second failure may come from a second drive fault, but a more common
source is an Uncorrectable Read Error (URE).
The primary purpose of storage arrays today is to provide robust data services with a high level of performance. HPE 3PAR StoreServ arrays
maintain data protection using RAID technologies. RAID protection comes in three levels (1, 5, and 6) and several set sizes. Each RAID level and
set size offers a different level of performance and data robustness.
If you are concerned about the risks to your data, you need to understand RAID levels, failures, second failures, URE, and the data robustness
options available in HPE 3PAR arrays.
RAID
The idea of RAID started in the late 1980s at the University of California, Berkeley, as a way to address I/O performance. The first RISC
processors were being developed and, with them, came the expectation of large processor speed increases. Moore's Law, an observation made in 1965,
was in full focus leading to the belief that processor performance would increase annually. This led to the need for more I/O performance to keep
pace (the industry has been working on the processor-I/O balance ever since).
In the late 1980s, the first official SCSI spec was being defined and the state-of-the-art hard disks could be represented by an IBM drive
(3380 series) the size of a refrigerator with 14-inch platters holding 7.5 GB of data (the literature of the day described capacity in billions of
characters of information that could be stored). The idea behind the initial RAID paper published in 1988 was to address larger disks with
long service times dominated by seek time (time to move the read/write head) and rotational delays (time for the platter to spin). These
concepts are not discussed much anymore, but they are still the main drivers in service time of spinning media drives today.
The definition of RAID changed over time to replace the word “inexpensive” (for the letter I) with the word “independent” as disks then, as they
are now, were expensive. A more important change was the recognition of the concept of RAID to address reliability and performance. Today,
RAID is synonymous with data protection in storage arrays.
Today’s RAID comes in many forms called levels or modes. The levels have been defined and refined over time and now there are three
RAID levels in common use—RAID 1/RAID 10, RAID 5/RAID 50, and RAID 6/RAID 60. RAID 1 is commonly referred to as mirroring where
two copies of data are kept in sync. RAID 5 and RAID 6 use the concept of parity to provide data redundancy using less space than RAID 1.
RAID 5 uses single parity while RAID 6 uses double parity. Adding a zero (0) to the RAID level adds striping. RAID 10, for example, is
striped mirroring.
How does RAID work?
RAID 1
When we consider how RAID works, the focus is on how the architecture provides data protection. RAID 1 is often called disk mirroring and uses
pairs of disks or in the case of HPE 3PAR, pairs of chunklets to store data. Each block written from the host is written to each disk or chunklet in
the mirror pair. At any point in time, each RAID 1 disk or chunklet will hold the same data as its mirrored partner. The space efficiency of RAID 1
is 50% as writing one unit of data from the server will require a unit of space on two separate disks. RAID 1 protects data in the event of one disk
failure. (HPE 3PAR also supports RAID 1 triple mirroring. This will be discussed later.)
Figure 1. RAID 1
RAID 5
RAID 5 uses a parity algorithm to maintain data protection, which increases space efficiency relative to RAID 1. Data written to the storage with
RAID 5 will include a minimum of two data blocks and one parity block. A unit of data from data disk 1 is combined with a unit of data from data
disk 2 to compute parity information, which is written to the parity disk.
The minimum configuration for RAID 5 is three disks of equal size, but implementations of RAID 5 using eight disks (seven data disks and one
parity disk or 7+1) or more are common. This combination of data disks and parity disk defines the RAID set size. A RAID 5 configuration with
three data disks and one parity disk, commonly noted as (3+1) will have a set size of four disks.
The space efficiency of RAID 5 is dependent on the set size. In a 4-disk RAID 5 set, the space efficiency is 3/4 or 75% (three data disks and one
parity disk) while the efficiency of an 8-disk RAID 5 set is 87.5% (seven data disks, one parity disk). RAID 5 with a single parity disk can preserve
data when one disk fails.
Figure 2. RAID 5
RAID 6
RAID 6 builds on the parity concept of RAID 5 but adds a second parity disk. Data written to disk configured in RAID 6 will include a minimum of
two data blocks and two parity blocks. A unit of data from data block one is combined with a unit of data from data block two to compute parity
information, which is then written to both parity blocks.
The minimum configuration for RAID 6 is four disks of equal size, but implementations of RAID 6 using eight (six data disks, two parity disks) or
more are common. The space efficiency of RAID 6 depends on the set size. In a 4-disk RAID 6 set, the space efficiency is 50% (two data blocks
and two parity blocks). Space efficiency of a 16-disk RAID 6 set is 87.5% (14 data blocks and two parity blocks). RAID 6 with double parity disks
can preserve data in the event of two disk failures.
Figure 3. RAID 6
Capacity efficiency
It is often assumed that less space will be available in a RAID 6 configuration compared to a RAID 5 configuration because of the double parity.
This is not always the case and depends on the set size (number of storage units in a RAID set). Table 1 compares space efficiency of several
RAID 5 and RAID 6 set sizes.
Table 1. RAID 5 and RAID 6 set sizes and density

RAID 5 set size   RAID 6 equivalent set size   Space efficiency
3 (2+1)           6 (4+2)                      2/3 = 4/6 = 66.6%
4 (3+1)           8 (6+2)                      3/4 = 6/8 = 75.0%
5 (4+1)           10 (8+2)                     4/5 = 8/10 = 80.0%
6 (5+1)           12 (10+2)                    5/6 = 10/12 = 83.3%
7 (6+1)           N/A                          6/7 = 85.7%
8 (7+1)           16 (14+2)                    7/8 = 14/16 = 87.5%
9 (8+1)           N/A                          8/9 = 88.9%
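The efficiencies in table 1 follow from simple arithmetic: data blocks divided by total blocks in the set. The short Python sketch below (an illustration, not an HPE tool) reproduces the 75% row for both RAID levels.

def raid5_efficiency(set_size):
    """RAID 5 carries one parity block per set, so (n - 1)/n of the set holds data."""
    return (set_size - 1) / set_size

def raid6_efficiency(set_size):
    """RAID 6 carries two parity blocks per set, so (n - 2)/n of the set holds data."""
    return (set_size - 2) / set_size

print(f"RAID 5 (3+1): {raid5_efficiency(4):.1%}")   # 75.0%
print(f"RAID 6 (6+2): {raid6_efficiency(8):.1%}")   # 75.0%, the same efficiency with double parity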
Drive failures and rebuild
Each of the RAID levels discussed provides data protection in the event of a drive failure, but they do not provide the same level of protection.
Understanding the relative data safety requires a closer look at each RAID level and the rebuild process.
Data is protected from a failure in each of the RAID levels by use of mirroring (RAID 1) or parity (RAID 5 and 6). When a drive fails in a RAID set
providing single drive resiliency (for example, RAID 1 mirroring or RAID 5), a window opens where the data in the RAID group is no longer
protected from another failure (a second failure) until the rebuild of the failed drive is completed. The rebuild process uses array resources to
restore the same data protection that existed prior to the drive failure, but there is an additional risk during the rebuild process.
The time required for the rebuild process is influenced by many factors including the amount of data written to the failed drive, the number of
drives in the RAID group, and speed of the drives in the RAID group. The rebuild process must read the remaining good data from the RAID
group, rebuild the failed data if parity is used, and write the recovered data to a new location. A proxy for the minimum rebuild time in traditional
arrays using a spare drive is the time required to write the content to a new drive.
Note
The rebuild time proxy does not apply to an HPE 3PAR StoreServ array, which implements distributed sparing and many-to-many rebuilds.
HPE 3PAR will read and write the same amount of data, but the reads and writes will be spread among many drives, which is faster.
If a 1 TB drive fails, the theoretical minimum rebuild time would be 1 TB divided by the write rate of that drive. If the drive is capable of
100 MB/sec then the minimum theoretical rebuild time would be around three hours (1 TB/100 MB/sec = 10485 seconds ≈ 2.9 hours). This
minimum time is often much longer in practice. The array usually has other priorities such as serving host I/O. The time required to read the
remaining good data in the RAID set may also add to the rebuild time when the RAID set is large such as in a RAID 5 (7+1) configuration.
Rebuild times are often much longer in practice and growing longer as drive sizes increase.
The industry trend has always been toward larger capacity faster drives. It is easier to increase the capacity of a drive (to a point) than to
increase its performance, which is often limited by physics (rotational speed in spinning media) or bus speeds. This leads to a pace of capacity
gains growing faster than gains in drive performance. Take an industry-leading 10K RPM drive as an example. Several years ago, a 300 GB
enterprise drive was capable of around 150 MB/second of sustained throughput. Today’s 1.8 TB enterprise drive is capable of around
175 MB/sec of sustained throughput. This sixfold increase in capacity is matched with only about a 17% increase in performance. The theoretical
rebuild time has increased from about 34 minutes for the 300 GB drive to roughly 170 minutes for the 1.8 TB drive, about five times longer.
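The rebuild-time arithmetic above is easy to script. The Python sketch below computes only the theoretical floor (capacity divided by sustained write rate), and depending on whether binary or decimal units are used it lands within a few percent of the figures quoted in the text.

def min_rebuild_minutes(capacity_gb, write_mb_per_s):
    """Theoretical floor: every byte on the drive rewritten at the sustained write rate."""
    return capacity_gb * 1024 / write_mb_per_s / 60    # binary GB to MB, then seconds to minutes

print(round(min_rebuild_minutes(1024, 100)))   # ~175 minutes (about 2.9 hours) for a 1 TB drive
print(round(min_rebuild_minutes(300, 150)))    # ~34 minutes for the 300 GB drive
print(round(min_rebuild_minutes(1843, 175)))   # ~180 minutes for the 1.8 TB drive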
The HPE 3PAR StoreServ array provides significant benefit by using spare chunklets and wide striping. HPE 3PAR implements distributed
sparing compared to others who use a spare drive. Wide striping of spare chunklets means HPE 3PAR StoreServ array rebuild times are faster
because many drives handle the rebuild load (many-to-many rebuild). The industry trend to larger drive sizes means the trend to increased
rebuild times applies to HPE 3PAR StoreServ arrays as well.
The second failure
We have just discussed the importance of rebuild time to data protection and how a second failure during a rebuild can lead to data loss when
using a RAID level providing single drive resilience (RAID 1 or RAID 5). What is the probability of experiencing an unfortunate second failure
during the rebuild time?
The probability of a second failure is dependent on the time required to rebuild the failed drive. When drives were small (for example, 100 GB),
the rebuild time was expressed in minutes limiting the risk. As drives have grown larger, the drive speed also increased for a while keeping the
rebuild times short. As drive capacities continue to follow Kryder’s law (doubling every two years), drive speeds cannot keep up, which leads to
longer rebuild times and increased risk.
The probability of a second drive failure is often calculated using measures such as Mean Time Between Failure (MTBF) and Annualized Failure
Rate (AFR). These values are published by the drive vendor and adjusted by the anticipated rebuild time. Typical AFR values published by
vendors are 1% or less for enterprise drives making the likelihood of a second drive failure during a days-long rebuild relatively small (on the
order of five in a million—0.0005%—or less).
Uncorrectable Read Error
This “second failure” is often thought of as a second drive failure while a rebuild is addressing the first failure. This will indeed cause data loss in a
single parity configuration; however, a more common second failure is a URE. URE measures the error rate as a function of reading data on the
drive and is expressed as the number of bits that can be read before a failure is likely.
MTBF and AFR metrics reflect failure rates as a function of the time the drive is powered on. Today's URE standard is one bit error in 10^14 bits read
for consumer drives (think of your laptop) and one bit in 10^16 or more bits read for enterprise drives (all HPE 3PAR drives are enterprise drives).
Drive vendors work hard to increase the reliability of enterprise drives, but just as there are limits in increasing drive performance, increasing
drive reliability also has limitations. Capacity, especially of SSDs, is growing faster than drive reliability.
Calculating URE
URE is expressed as a rate per bits read. As drive capacity increases, the data that must be read to reconstruct a failed drive also increases. In a
RAID 5 set of 4 x 100 GB drives where one drive fails, the remaining three drives must be read 100% successfully to complete the rebuild
process. A drive size of 100 GB means the remaining three drives, a total of 300 GB, must be read to rebuild the failed drive. If we increase the
drive size to 2 TB in the same RAID 5 (3+1) configuration, the rebuild process must then read 6 TB. All 6 TB must be read without error to
successfully complete the rebuild.
The probability of reading 6 TB of data to rebuild a 2 TB failed drive represents an increased risk over the drive failure probability using MTBF
and AFR as discussed earlier. The prevalent URE rating for enterprise drives today is approximately one bit error in 10^16 bits read, which represents
one error in about 1.1 PB read. Using the complement of a formula borrowed from the RAID: High-Performance, Reliable Secondary Storage work,
you can see that the probability of successfully rebuilding a 2 TB drive in a RAID 5 configuration with a set size of 4 is 99.47%. Taking the
complement (100% - 99.47%) shows a 0.53% chance of failure leading to data loss.
P_success = (1 - 1/URE)^(bits to read)
P_success = (1 - 1/10^16)^(6 TB expressed in bits) ≈ 99.47%
P_failure = 1 - P_success = 100% - 99.47% = 0.53%
Here’s the reference to the research work done for the formula:
RAID: High-Performance, Reliable Secondary Storage (technical report UCB/CSD-93-778), Chen, P., Lee, E., Patterson, D., Gibson, G., Katz, R.,
ACM Digital Library, 1993
portal.acm.org/citation.cfm?id=893811
This work is also referenced in a paper:
Triple-Parity RAID and Beyond, Sun Microsystems, Adam Leventhal, 2009
queue.acm.org/detail.cfm?id=1670144
This is a substantial increase over the more common drive failure view using AFR or MTBF. Recall that using the manufacturer's AFR data, the
likelihood of a second drive failure during a days-long rebuild is only 0.0005%. If we increase the drive size or the set size, the risk of data loss due to
URE increases because more data must be read to successfully complete the rebuild. Replacing the 2 TB drives, in the previous example, with
4 TB drives in the same RAID 5 configuration with a set size of four increases the risk from 0.53% to 1.05%.
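The 0.53% and 1.05% figures can be reproduced directly from the formula. The Python sketch below assumes 1 TB = 2^40 bytes, which matches the quoted results, and uses log1p to keep the calculation numerically stable.

import math

def rebuild_failure_probability(drive_tb, set_size, ure_bits=1e16):
    """Chance that reading the surviving drives of the set hits at least one URE."""
    bits_to_read = (set_size - 1) * drive_tb * 2**40 * 8      # remaining members of the set, in bits
    # math.log1p keeps (1 - 1/URE)**bits accurate when 1/URE is extremely small
    p_success = math.exp(bits_to_read * math.log1p(-1.0 / ure_bits))
    return 1.0 - p_success

print(f"RAID 5 (3+1), 2 TB drives: {rebuild_failure_probability(2, 4):.2%}")   # ~0.53%
print(f"RAID 5 (3+1), 4 TB drives: {rebuild_failure_probability(4, 4):.2%}")   # ~1.05%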
HPE 3PAR and URE
If a URE occurs during a rebuild on an HPE 3PAR array and the block cannot be rebuilt, the block will be marked as “suspect.” If the next
operation to this block is a host read, a read error is returned. If the next operation to this block is a host write overwriting the “suspect” block, the
new data will be written and the block now contains good data so it is no longer marked “suspect.”
HPE 3PAR uses a technique known as disk scrubbing to address soft errors and UREs before they become a problem. Disk scrubbing is a
background process, which reads each drive. If a read error is found on a drive outside the rebuild window, it can be corrected to maintain the
health of the array. Disk scrubbing becomes more challenging as drive capacities increase, leading to longer times to read the entire array capacity.
HPE 3PAR and wide striping
We have described the risk of data loss being dependent on the amount of data that must be read to complete the rebuild. The amount of data,
then, is a function of the drive size and the number of drives in the RAID set. For example, a RAID set of 8 x 1 TB drives must successfully read
(8-1) x 1 TB or 7 TB of data to successfully rebuild a failed drive.
HPE 3PAR StoreServ arrays use wide striping for all RAID types to spread data across many drives. In a StoreServ array with 64 drives, if a single
volume were large enough it would have portions of the volume on each of the 64 drives. The 64 drives, however, do not define the RAID set. It
is not correct that a drive failure in an HPE 3PAR array with 64 drives requires reading the remaining 63 drives to rebuild the failed drive. The
RAID set size is defined in the common provisioning group (CPG) and it is implemented using chunklets. A RAID 5 CPG with a set size of 8 (7+1)
will require reading (8-1) x drive size to rebuild a failed drive. The set size is a tunable attribute of the CPG.
Are SSDs different?
The industry is in a transition from mostly spinning media to mostly flash media with perhaps an exception for NL (archive) drives. A fair
question is, “does the previous history of spinning media apply to SSDs?” Stated another way, are SSDs different when it comes to failure rates?
Flash technology in SSDs is indeed very different from spinning media. Significant reductions in power, cooling, and increases in density and
performance can be realized moving from spinning media to flash media. But how does this impact resiliency? Are failure rates and UREs
different for SSDs? Is the increased performance enough to change the parameters around rebuilds?
Flash technology is new compared to spinning media, which has been around for more than 50 years, so the field history is limited. There are
two SSD studies involving prominent technology companies that offer some insight, but the answers are not clear.
In 2015, a study by Carnegie Mellon University (CMU) analyzed the reliability of SSDs used at Facebook. This study, which is titled “A Large-Scale
Study of Flash Memory Failures in the Field” was also published in the Proceedings of the 2015 ACM SIGMETRICS International Conference on
Measurement and Modeling of Computer Systems. The CMU study does not compare reliability to spinning media and unfortunately does not
provide any helpful data about failure rates and URE of SSDs. The study does conclude with five “observations” including how SSDs go through
several failure periods (early, usable life, wear out, and more) with different failure rates.
Google™ and the University of Toronto published a second study at the File and Storage Technologies Conference proceedings (FAST 16) in
February of 2016. This study, Flash Reliability in Production: The Expected and the Unexpected, covers six years of production use of SSDs at
Google, so some of the drives in the study have up to six years of field use. The study concludes that, on the measure of replacement rates, SSDs are an
improvement over spinning media. Another conclusion, based on the measure of UREs, is that SSDs suffer a greater rate of UREs
than spinning media.
The small amount of available data so far suggests failure rates of SSDs are lower than spinning media drives. If this holds up over time, we can
conclude SSDs have reduced risk of data loss because there are fewer rebuilds to be performed. This same small amount of data also suggests
the risk of encountering a URE during the rebuild is still present with SSDs and, in fact, may present an elevated risk with SSDs compared to
spinning media.
What can we do with this data? One option is to consider alternatives to the single failure RAID modes provided by RAID 1 and RAID 5. RAID 6
and RAID 1 with a set size of 3 (triple mirroring) provide resiliency during two overlapping failures. The remainder of this discussion considers
the tradeoffs between RAID modes providing single failure resilience (RAID 1 and RAID 5) and double failure resilience (RAID 1 triple mirror
and RAID 6).
Relative data safety of RAID 1 vs. RAID 5 vs. RAID 6
The tradeoff between usable capacity and RAID set size is well understood. A RAID 5 set size of 4 (3+1) provides 75% usable capacity, which
is the same as a RAID 6 set size of 8 (6+2). Meanwhile, the relative data safety of different RAID levels and set sizes is not well understood.
The following table offers a view of data safety for a few common RAID and set size configurations. This data is provided for comparison only.
The assumptions for the calculations, especially the 2-drive resilience configurations are very detailed and beyond the scope of this discussion.
Table 2. Data robustness and efficiency of RAID modes compared to RAID 5 (8+1)

RAID level   Set size   Relative data loss robustness   Capacity efficiency
RAID 1       1+1        8                               50%
RAID 1       1+1+1      166,000                         33%
RAID 5       3+1        3                               75%
RAID 5       8+1        1                               89%
RAID 6       4+2        25,000                          67%
RAID 6       6+2        12,000                          75%
Table 2 illustrates the relative robustness and capacity efficiency for several RAID levels and set sizes. The data loss robustness column shows a
comparison relative to RAID 5 with a set size of 9 (8+1). For example, RAID 1 with a set size of 2 (1+1) is eight times better than RAID 5 with
a set size of 9 (8+1).
The data also shows that for single drive resiliency RAID modes (RAID 1 1+1 and RAID 5), relative data robustness improves as the set size is
reduced. The largest set size in the table, RAID 5 (8+1) with a set size of 9 has the lowest relative data robustness (larger means better data
robustness and protection). The same is true of double drive resiliency RAID modes (RAID 1 1+1+1 and RAID 6). The smaller the set size, the
greater the relative data robustness.
Performance
RAID overhead
The main goal of RAID arrays is to protect user data. The risks to data safety for several RAID levels have been discussed and now we turn
attention to performance. Each RAID mode and geometry has different levels of data safety and performance.
RAID 1
As previously discussed, RAID 1 mirrors data. When configured with a set size of 2 (1+1), there are two copies of the data. Other set sizes are
also possible and will increase the number of copies. A set size of 3 (1+1+1) will create three copies. This is sometimes called triple mirroring and
is used by the HPE 3PAR InForm OS to store some critical data.
Reads to a RAID 1 volume require one I/O to the backend if the data is not in cache. Writes to a RAID 1 volume require the same number of I/Os
as the set size. A set size of 2 (1+1) will require two backend writes for each host write. These backend writes will increase the workload on the
array, but usually, do not lead to an increase in host latency since the host write is cached.
RAID 5
RAID 5 uses parity information to protect data. Each set size will include n data blocks and 1 parity block where “n” is the set size minus 1. A set
size of 4, for example, will include three data blocks and one parity block. Parity for RAID 5 (and RAID 6) is created using XOR calculations. In
HPE 3PAR, this function is provided in the ASIC (hardware) so it is very fast.
Reads to a RAID 5 volume will require one I/O to the backend if the data is not in cache. Random writes to a RAID 5 volume will require
four backend I/Os. The process begins by reading the old data block and parity block (two back-end reads) and calculating the new parity
information by XORing the new data, the old data, and the old parity. The new data and new parity are then written to the backend creating
two backend writes. There are some optimizations that can help some workloads, but in general, each host write to a RAID 5 volume will
result in four backend I/Os. Sequential write workloads to RAID 5 can result in full stripe writes where one I/O occurs on the backend for
every block in the RAID 5 set.
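The read-modify-write sequence works because the new parity can be derived from the old parity, the old data block, and the new data block alone (new parity = old parity XOR old data XOR new data). A minimal illustration using plain Python integers in place of disk blocks (not HPE 3PAR ASIC code):

# Three data blocks of a RAID 5 (3+1) stripe, represented as small integers.
old_blocks = [0b10110010, 0b01101100, 0b11100001]
old_parity = old_blocks[0] ^ old_blocks[1] ^ old_blocks[2]

new_block_1 = 0b00011111                                   # host overwrites data block 1
new_parity = old_parity ^ old_blocks[1] ^ new_block_1      # read old data and old parity, XOR in new data

# The shortcut matches recomputing parity across the whole stripe.
assert new_parity == old_blocks[0] ^ new_block_1 ^ old_blocks[2]
print(bin(new_parity))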
RAID 6
RAID 6 uses parity information to protect data. Each set size will include n data blocks and two parity blocks, where “n” is the set size minus two.
A set size of 8, for example, will include six data blocks and two parity blocks.
Reads to a RAID 6 volume will require one I/O to the backend if the data is not in cache. Random writes to a RAID 6 volume will require
six backend I/Os. The process begins by reading the old data and both parity blocks (three backend reads). Parity is calculated next creating
two new parity blocks by XORing the new data, the old data, and one of the parity blocks to compute the first parity and repeating the
process with the second old parity block to create the second new parity block. The new data and both new parity blocks are then written to
the backend creating three backend writes. There are some optimizations that can help some workloads, but in general, each host write to a
RAID 6 volume will result in six backend I/Os. Sequential write workloads to RAID 6 can result in full stripe writes where one I/O occurs on the
backend for every block in the RAID 6 set.
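These per-write backend costs (two backend I/Os for RAID 1, four for RAID 5, six for RAID 6 on random writes) can be turned into a rough backend load estimate. The sketch below ignores cache hits and full stripe writes, so treat it as a ceiling on backend work rather than a prediction; the host rates in the example are illustrative only.

WRITE_PENALTY = {"RAID 1": 2, "RAID 5": 4, "RAID 6": 6}   # backend I/Os per random host write

def backend_iops(host_read_iops, host_write_iops, raid):
    """Backend IOPS estimate: reads cost one backend I/O, writes cost the RAID write penalty."""
    return host_read_iops + host_write_iops * WRITE_PENALTY[raid]

# Illustrative host workload: 16,000 random reads and 14,000 random writes per second.
for raid in WRITE_PENALTY:
    print(f"{raid}: {backend_iops(16000, 14000, raid):,.0f} backend IOPS")
# RAID 6 adds 50% more backend write I/Os than RAID 5 (six versus four per host write).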
Performance during rebuild
When a drive fails, the array has a lot of work to do to restore the previous level of protection. Normal read and write operations to the data
may be impacted. During a disk rebuild, array performance may be impacted by the additional work required to restore the configured level
of data protection.
The impact of a drive failure depends on many factors including array configuration, workload, the amount of data written to the failed drive, and
more. The most significant of these factors is the workload level prior to the failure. A drive failure will cause the array to perform more work to
rebuild the data. This additional workload added to the array will have minimal impact on a lightly loaded array but may have a measurable
impact on a heavily loaded array.
The following two figures illustrate the impact of a drive failure on host performance (figure 4) and backend performance (figure 5) for one
set of factors. This set of factors includes an HPE 3PAR 8440 4-node array with 64 SSDs. The volumes were configured in RAID 1 and were
75% full of written data. The workload was random with 60% reads, 40% writes using 8 KB block size. In this case, the array was moderately
loaded (~250K IOPS) when a single drive connected to node 3 was failed.
Figure 4 shows the workload before, during, and after the drive failure. The host IOPS continue through the rebuild with a very slight impact (less
than 1%). Host service times were less than 300 microseconds before the rebuild and increase about 4% during the rebuild. The impact of the
drive failure to the host is minimal.
Figure 4. Host IOPS and service times before, during, and after a rebuild
Figure 5 shows the activity of the backend drives through the same drive failure. The drive IOPS increase about 10% and service time increase
from an average of 120 to 140 microseconds. The increased backend activity reflects the work required to rebuild the failed drive using spare
chunklets. The increased workload is shared by all SSDs in the array. In this case, the work of restoring redundancy by rebuilding the failed drive
has minimal impact on the host.
Figure 5. Backend IOPS and service times before, during, and after a rebuild
There are other potential impacts from rebuilding a failed drive. When space is constrained, it is possible that some data may be moved between
controllers, which could result in changing the workload balance between nodes on the HPE 3PAR array. In one example, a balanced system
handling 20,000 IOPS by each of two controllers before a drive failure became imbalanced after the rebuild. Following the rebuild, one controller
was handling 17,500 IOPS (44%) and the other controller was handling 22,500 IOPS (56%). This imbalance was caused by space constraints
causing some chunklets to be relocated to a different controller and not the controller owning the failed drive.
HPE 3PAR storage arrays are highly available and have many features to protect user data from failures. However, when a failure does occur it is
possible to introduce a change in performance. It will not be observed in all cases, but following recovery from a failure you should not be
surprised if performance changes. There is no guarantee that performance will return to the same level it was before a failure following recovery
from the failure.
Anatomy of a rebuild
The failure of a component in an array causes additional tasks to be performed. It is important to understand array behavior during a failure.
When a drive fails, the HPE 3PAR InForm OS will identify the source of the failure and recover the drive if possible. It is much easier to recover a
drive with a transient error than introduce the risk of a rebuild. It may take two minutes or more before the rebuild process begins.
The process rebuilds chunklets by reading the remaining good data, reconstructing the missing data, and writing the newly constructed data to a
new chunklet.
New data is written to new chunklets throughout the system. The rebuild process chooses a target spare chunklet using several criteria. These
criteria are prioritized to maintain the same level of performance and availability as the source chunklet, if possible. The first choice is a spare
chunklet on the same node as the failed drive. When spare chunklets on the same node are not available, free chunklets with the same
characteristics are considered. During the sparing process, if the number of free chunklets exceeds a threshold in the HPE 3PAR StoreServ OS,
consideration will be given to spare chunklets on another node. This helps keep the array balanced.
When space is constrained, the rebuild process may need to choose a target chunklet that does not preserve array balance or chunklet
availability (for example, high availability cage [HA cage]). In addition to rebuilding a chunklet on another node or node pair, a chunklet on a
different tier of storage may be required. A nearline (NL) chunklet may be rebuilt to NL, fast class (FC or SAS), or SSDs. FC chunklets may be
rebuilt on FC drives or SSDs. SSD chunklets will only be rebuilt to other SSDs. The conditions leading to spare rebuilds on different tiers of
storage are rare and only occur when space is constrained.
During the rebuild process, a host workload is usually running. In many cases, the host workload will access data from the failed drive requiring
special handling of the I/O. When the I/O is a read, the array will reconstruct the missing data just for that read and return it to the host.
When the I/O is a write, the data will initially be stored in cache just like any other write I/O. When the data is flushed from cache it is written to a
log chunklet. This special chunklet will hold data until the original failed chunklet is relocated. Once the failed chunklet is relocated, the data in the
log chunklet will be moved to the new location.
Sparing rates
We have seen how the risk of data loss is dependent on many factors including the rebuild window. The rebuild window is the time that begins
with a drive failure event and ends when data protection is restored following the rebuild. The length of this window on an HPE 3PAR StoreServ
array depends on the amount of written data on the drive, the array model, and configuration. It also depends on how busy the array is in
handling host I/O and data services.
The HPE 3PAR sparing algorithms were designed to have minimal impact on host workloads, but an array that is very busy when a failure occurs
may see an impact in the rebuild process. The rebuild process will take longer on a heavily loaded array. The rebuild time will also take longer for
higher capacity drives, although only written data is reconstructed. Slower drives, such as NL drives, have slower rebuild rates because of higher
drive latencies.
Table 3 lists examples of sparing rates under single drive failures. The HPE 3PAR rebuild process only rebuilds written data so the rebuild time is
not a function of drive size. The workload column in table 3 describes workload characteristics such as the read percentage and host block size.
The workload intensity is described in terms of the average physical drive (PD) service time (SVT). For example, a workload intensity description
of 130 microsecond PD SVT means the host workload is adjusted until the average service time of the drives is 130 microseconds.
Table 3. Examples of raw capacity sparing rates under single drive failure

Configuration                  RAID           Workload                                  Rebuild rate
7450 2N, 3.2.1 MU2,            RAID 5         Idle                                      170 GB/hr
16 x 1.92 TB SSD                              70% read, 8 KB + 32 KB, 5 ms PD SVT       140 GB/hr
                                              70% read, 8 KB + 32 KB, 20 ms PD SVT      100 GB/hr
8440 4N, 3.2.2 MU4,            RAID 1 (1+1)   Idle                                      2000 GB/hr
64 x 400 GB SSD                               60% read, 8 KB, 130 microsecond PD SVT    2000 GB/hr
                               RAID 5 (5+1)   Idle                                      1850 GB/hr
                                              60% read, 8 KB, 130 microsecond PD SVT    1400 GB/hr
                               RAID 6 (6+2)   Idle                                      700 GB/hr
                                              60% read, 8 KB, 130 microsecond PD SVT    350 GB/hr
You can see from the examples in table 3 that the rebuild process is faster for RAID 1 and slower for RAID 6. Rebuild rates are also higher for the
most recent version of HPE 3PAR InForm OS running on later generation hardware. Use these examples as guidelines. Actual rebuild rates are
dependent on many factors.
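A sparing rate from table 3 translates into an estimated rebuild window once the amount of written data on the failed drive is known. The sketch below is simple division using two of the example rates above; actual windows depend on the many factors already mentioned.

def rebuild_window_hours(written_gb, sparing_rate_gb_per_hr):
    """Estimated time to restore redundancy: written data divided by the sparing rate."""
    return written_gb / sparing_rate_gb_per_hr

written_gb = 400 * 0.75    # a 400 GB SSD that is 75% written
print(f"RAID 5 (5+1), loaded array: {rebuild_window_hours(written_gb, 1400):.2f} hours")
print(f"RAID 6 (6+2), loaded array: {rebuild_window_hours(written_gb, 350):.2f} hours")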
RAID conversion
There are many tradeoffs in selecting a RAID mode including resiliency, performance, and rebuild risks. These variables change over time leading
to the possibility of changing RAID modes. HPE 3PAR provides the ability to change RAID modes online using Dynamic Optimization or DO.
Investigation
Converting from a RAID mode with single drive protection (RAID 1 or RAID 5) to a RAID mode providing double drive protection (RAID 6 or
RAID 1 triple mirroring) requires analysis of performance and capacity before conversion. Adding double drive resilience has a performance
impact for some workloads and understanding this impact before starting a conversion will prevent surprises. Available capacity must also be
evaluated, as some set sizes will require additional space.
Sizing RAID 1 (1+1+1) conversion
RAID 1 (1+1+1) triple mirroring provides excellent performance, fast rebuilds, and protection for up to two drive failures. RAID 1 performance
following a conversion from RAID 5 will be the same or better than RAID 5 performance. Read performance will be about the same while write
performance will be improved. RAID 5 requires two disk reads, parity generation, and two disk writes for each host write. RAID 1 triple mirror
simply requires three disk writes, therefore, overall performance will depend on the workload, but it is expected to be the same or an
improvement over RAID 5.
The primary concern when considering a conversion from RAID 5 to RAID 1 triple mirroring is capacity. The capacity efficiency of RAID 5 varies
with the set size (refer to table 2). A set size of 8 (7+1), for example, will be 87% efficient, which means 87% of the usable space will hold user
data and the remaining 13% will hold parity information. RAID 1 triple mirroring will only be 33% efficient, which means 33% of the space will hold
user data and 67% will provide double drive protection in the form of copies of the user data.
Consider an example with 1 TB of user space written in RAID 5 (7+1). The space on disk (assuming no compaction) will be approximately
1.18 TB (1 TB/87%). 1 TB of user space written in RAID 1 triple mirror will be 3 TB (1 TB/33%) or an increase of about 1.82 TB. In this case, the
cost of double drive protection and RAID 1 triple mirror performance is about 1.82 TB of additional capacity.
Sizing RAID 6 conversion
RAID 6 offers a balance of performance and capacity efficiency in providing double drive failure protection. Users converting from RAID 5 to
RAID 6 should expect lower performance accompanied by the same or lower storage efficiency. HA cage availability may also be affected.
RAID 6 performance
RAID 6 requires additional backend resources to maintain the additional parity information. Reads to RAID 6 will perform the same as RAID 5,
but writes are different. Host random writes to RAID 5 require two backend reads and two backend writes. Host random writes to RAID 6 require
50% more backend IOPS and this additional resource requirement must be considered before starting a conversion to RAID 6.
A good measure of array congestion from write operations is delayed acknowledgments (delack). When the array is very busy with write
requests, HPE 3PAR may delay acknowledgment (delack) of a host request in an effort to moderate the host workload. Delacks have the
effect of slowing the host workload and may allow the backlog of dirty cache pages waiting to be flushed to the disks to complete and normal
performance to resume. Delacks can be monitored with a system reporter command (for example, srstatcmp -btsecs -1h -hires). The delack
counters are cumulative, so the value reported in the prior interval must be subtracted from the value reported in the current interval to
determine the delacks that occurred in that interval. If 5% or more of the samples show new delacks, this is an indication the array might be
too busy to consider a RAID 6 conversion, and further performance analysis is required.
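The differencing and the 5% rule of thumb can be scripted. The sketch below assumes the cumulative delack counter values have already been extracted from the srstatcmp output into a list; the sample values are made up for illustration.

def delack_check(cumulative_counts, threshold=0.05):
    """Return True if 5% or more of the intervals recorded new delacks."""
    deltas = [later - earlier for earlier, later in zip(cumulative_counts, cumulative_counts[1:])]
    busy_intervals = sum(1 for d in deltas if d > 0)
    return busy_intervals / len(deltas) >= threshold

samples = [120, 120, 120, 145, 145, 190, 190, 190, 190, 240, 240]   # cumulative counter readings
print(delack_check(samples))   # True: 3 of 10 intervals saw new delacks, well above 5%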
Characterization of the host workload comes next. Gather workload data including IOPS, throughput, block size, read/write ratio, and service
times. These values will help understand the workload and estimate the additional resource required by a conversion to RAID 6. This data is
available with a system reporter command (for example, srstatport -btsecs -7d -port_type host -hires).
An example of some data provided by System Reporter is presented in figure 6.
Array1 cli% srstatport -btsecs -7d -hires -port_type host
                                          -----------IO/s-----------  ------------KBytes/s--------------  ----Svct ms----  --IOSz KBytes--
Time                          Secs            Rd       Wr       Tot          Rd        Wr          Tot     Rd   Wr    Tot    Rd    Wr  Tot
2016-11-03 03:05:00 PDT 1478167500 19839.6 16747.7 36587.3 2304312.2 373525.5 2677837.7 4.66 0.70 2.85 116.1 22.3 73.2
2016-11-03 03:10:00 PDT 1478167800 18104.9 14237.5 32342.4 2292495.9 367744.7 2660240.6 4.62 0.75 2.92 126.6 25.8 82.3
2016-11-03 03:15:00 PDT 1478168100 19863.9 15599.5 35463.4 2245195.5 363948.1 2609143.6 4.09 0.67 2.58 113.0 23.3 73.6
2016-11-03 03:20:00 PDT 1478168400 19329.8 16173.6 35503.3 2128402.0 383208.2 2511610.2 3.73 0.66 2.33 110.1 23.7 70.7
2016-11-03 03:25:00 PDT 1478168700 19538.3 13609.2 33147.6 2149153.0 346517.5 2495670.5 3.47 0.70 2.33 110.0 25.5 75.3
2016-11-03 03:30:00 PDT 1478169000 19569.6 14185.0 33754.6 2084315.7 444565.2 2528880.9 3.61 0.78 2.42 106.5 31.3 74.9
2016-11-03 03:35:00 PDT 1478169300 19140.3 13678.4 32818.7 2131287.6 414473.2 2545760.8 3.65 0.77 2.45 111.4 30.3 77.6
2016-11-03 03:40:00 PDT 1478169600 17879.8 12746.8 30626.5 2026038.9 369926.2 2395965.1 3.66 0.70 2.43 113.3 29.0 78.2
2016-11-03 03:45:00 PDT 1478169900 20696.9 15152.5 35849.4 2248155.4 360746.8 2608902.1 3.60 0.65 2.35 108.6 23.8 72.8
2016-11-03 03:50:00 PDT 1478170200 16556.7 11309.2 27865.9 2021479.5 302920.5 2324400.0 3.81 0.71 2.55 122.1 26.8 83.4
2016-11-03 03:55:00 PDT 1478170500 15521.8 12401.0 27922.8 2078035.5 380780.6 2458816.1 4.17 0.73 2.64 133.9 30.7 88.1
2016-11-03 04:00:00 PDT 1478170800 15693.6 12174.3 27867.8 2031430.3 347725.5 2379155.8 4.22 0.71 2.69 129.4 28.6 85.4
2016-11-03 04:05:00 PDT 1478171100 16064.3 11527.2 27591.4 2163348.7 281132.0 2444480.7 4.17 0.65 2.70 134.7 24.4 88.6
Figure 6. Example System Reporter data
Figure 7 shows what this data looks like when graphing IOPS.
Figure 7. Example host performance data
The example in figure 7 shows a host workload requesting up to 58K IOPS. The read-to-write ratio can be calculated from the full data set behind
figures 6 and 7. Divide the average read operations by the average write operations to calculate the number of reads per write. In the example in
figures 6 and 7, the average read rate is 16,175 IOPS and the average write rate is 13,880 IOPS. Dividing the read rate by the write rate, we get
1.16 reads per write, or a read:write ratio of 1.16:1.
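The same averaging can be scripted once the srstatport samples are collected. The sketch below uses only the handful of samples shown in figure 6, so its ratio (about 1.28:1) differs from the full seven-day average of 1.16:1 quoted above.

# Read and write IOPS columns from the first six srstatport samples in figure 6.
read_iops  = [19839.6, 18104.9, 19863.9, 19329.8, 19538.3, 19569.6]
write_iops = [16747.7, 14237.5, 15599.5, 16173.6, 13609.2, 14185.0]

avg_read  = sum(read_iops) / len(read_iops)
avg_write = sum(write_iops) / len(write_iops)
print(f"average read IOPS : {avg_read:,.0f}")
print(f"average write IOPS: {avg_write:,.0f}")
print(f"read:write ratio  : {avg_read / avg_write:.2f}:1")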
This measurement data can be compared to sizing estimates of the array to estimate the impact of a conversion to RAID 6. IOPS estimates of
this array are calculated for the current configuration (RAID 5) and the target configuration (RAID 6). These estimates are added to the graph in
figure 7 to create the graph in figure 8.
Note
Consult your local HPE representative for help in estimating IOPS for your array.
Figure 8. Example host performance with RAID 5 and RAID 6 estimates
The line marked RAID 5 in figure 8 represents the sizer estimate of the maximum IOPS for this array with this workload. The RAID 6 line in
figure 8 represents the sizer estimate of the reduced maximum IOPS this array can achieve with this workload if the configuration were changed
to use RAID 6.
When the data is presented this way, you can easily see how much performance is being used in the RAID 5 configuration and an estimate of the
performance following a conversion to RAID 6. In this case, you can see a few data points that cross the RAID 6 estimated performance line. This
does not mean RAID 6 is not an option, but during times when I/O requests exceed the capability of the array, host service times will increase. A
read spike that causes the total IOPS to exceed a threshold may result in a greater impact to the workload than a write spike since writes will
always be cached.
When comparing measurement data and sizing estimates like that found in figure 8, it is often helpful to dig deeper into the measurement data.
A closer examination of the workload running during the times where the total IOPS exceeds the RAID 6 estimated performance line will help
understand which applications may see increased service times. Calculating maximums and 95th and 98th percentiles from the measurement data
can shed some light on the relationship between the measurements and the performance estimate.
Special consideration should also be given to the measurement window and the differences between the sizer assumptions and the array
configuration. A key question when analyzing measurement data is, “what was measured?” The measurement data may not be helpful if it is not
representative of the workload that is expected following a RAID 6 conversion. All performance estimators make some assumptions about
configurations such as the number of host connections, backend data paths, number of disks, and more. Make sure these assumptions agree with
the target configuration or make the necessary adjustments in the performance estimate.
RAID 6 HA cage
High availability cage (HA cage) is a common provisioning group (CPG) setting that specifies data is written to physical drives in a manner
that provides for an entire cage failure without data unavailability or loss. This option is allowed in all RAID modes but it requires sufficient
drive cages to support the configured set size. The set size defines the number of drives that will be written as an availability group or RAID
set. RAID 1 (1+1), for example, will mirror data on two drives to provide single drive fault tolerance. Adding HA cage to this configuration
requires at least two cages behind each node pair. Similarly, RAID 5 (7+1) uses data written to seven drives plus one more drive for parity
making a total of eight. Adding HA cage to this configuration requires at least eight drive cages behind each node pair.
The HA setting for the current environment must be considered before converting to RAID 6. If HA cage is being used before the conversion and
the desire is to maintain HA cage, there may be space efficiency (set size) tradeoffs to consider. RAID 5 (7+1) using HA cage requires eight drive
cages behind each node pair. Conversion to RAID 6 must match the RAID 5 set size of 8 to maintain HA cage. In this case, choosing a RAID 6 set
size of 8 (6+2) will enable HA cage to be maintained, but the capacity efficiency will be reduced from 87% (RAID 5 7+1) to 75% (RAID 6 6+2).
Alternatively, additional cages may be added to each node pair if possible. If additional capacity is not available, more drives may be required
or the choice of HA cage and RAID 6 set size can be reconsidered. The same capacity efficiency can be maintained using a RAID 6 set size of
16 (14+2), but with this set size HA cage cannot be maintained with only eight cages.
RAID 6 capacity
RAID 6 provides double drive failure protection using two parity drives. This additional parity sometimes requires additional storage capacity, but
not always. The key is the set size of the source RAID 5 CPG and the target RAID 6 CPG. If a RAID 6 set size is double the RAID 5 set size, the
same capacity efficiency can be maintained. A RAID 5 set size of 4 (3+1) has a capacity efficiency of 75%. Choosing a RAID 6 set size of
8 (6+2) will maintain the same 75% capacity efficiency.
If doubling the RAID 5 set size is not possible, capacity efficiency will suffer. If sufficient free space is not available, additional drives may be
needed. A RAID 5 volume using a set size of 8 (7+1) has a capacity efficiency of 87%. Converting this volume to RAID 6 and maintaining the
same set size of 8 (6+2) will reduce capacity efficiency to 75%, which means the same user data requires roughly 17% more raw capacity. If this
additional capacity is not available, the conversion will not be successful.
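The extra raw space required follows from the two efficiencies. The quick check below is plain arithmetic and ignores compaction, thin provisioning, and other space accounting on the array.

def raw_space_tb(user_tb, efficiency):
    """Raw capacity needed to hold the user data at a given RAID efficiency."""
    return user_tb / efficiency

user_tb = 10.0
raid5_raw = raw_space_tb(user_tb, 7 / 8)    # RAID 5 (7+1), 87.5% efficient
raid6_raw = raw_space_tb(user_tb, 6 / 8)    # RAID 6 (6+2), 75% efficient
print(f"RAID 5 (7+1): {raid5_raw:.2f} TB raw")
print(f"RAID 6 (6+2): {raid6_raw:.2f} TB raw")
print(f"extra raw capacity needed: {raid6_raw - raid5_raw:.2f} TB ({raid6_raw / raid5_raw - 1:.0%} more)")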
RAID conversion process
Careful consideration may lead to the decision to change RAID modes or set sizes. HPE 3PAR Dynamic Optimization allows changing of RAID
levels and set sizes non-disruptively. Changing the RAID level or set size of a volume uses the tunevv CLI command.
Converting an existing volume to have different RAID or set size properties begins with the CPG. These properties are defined in the CPG
definition. The first step is to find an existing CPG with the desired properties or create a new CPG with these properties. The following command
creates a new CPG specifying RAID 6 and a set size (ssz) of 8 (6+2).
createcpg -ssz 8 -ha mag -t r6 -p -devtype SSD MY_CPG
After the CPG is identified, the tunevv command may be run to change the attributes of the VV. The tunevv command will run as a background
process and it can be run non-disruptively. The host will be unaware of the tune process as the VV changes from one set of parameters to
another. The following is an example tunevv command.
tunevv usr_cpg MY_CPG MY_Vol
The tune process attempts to minimize resource utilization to prevent any impact to host I/Os. tunevv will limit outstanding I/Os to any one
drive to approximately 8% of that drive's maximum IOPS. This leaves 92% of the drive's IOPS capability to serve host IOPS and other data
services. Because of this, it can take a long time to convert a large volume. Currently, there is no way to change the speed of a tune process
even on an idle array.
It is still possible, even with this consideration for host performance, that a tune command may at times impact host service times. If you
determine a tune operation is having an undesirable impact on other array operations, the tune may be safely stopped (canceltask) and restarted
at a later time.
While tunevv is running it will periodically free up available space. The tune process may also overlap with some other data services that move
data such as Adaptive Optimization. Because of this overlap, some long-running tune operations may report errors on some regions. This is
normal and the skipped regions can be addressed by running the tune operation a second time.
Summary
RAID provides data protection from single and double drive failures. When a drive fails, a process within the storage array begins to restore the
array to the original level of protection. Restoring protection often results in a drive rebuild process, which leads to the risk of a second failure
during the rebuild. This second failure can come from a second drive failure or from an Uncorrectable Read Error (URE).
Drive capacities are growing faster than the drive manufacturers can grow reliability and performance. This creates a trend of longer rebuild
times and, therefore, more exposure to the risk of a second failure. All this makes it desirable to consider a RAID level and set size providing more
data robustness. When a change to a different RAID level or set size is desired, HPE 3PAR can perform the conversion while continuing host I/O
using Dynamic Optimization.
The trend of larger capacity drives leading to longer rebuild times and the associated risks is leading HPE 3PAR to recommend and default
to a more robust RAID configuration. The current HPE 3PAR recommendation and default for different drive types and capacities are listed
in table 4.
HPE recommendation (3.3.1 and later)
Table 4. HPE 3PAR recommended and default RAID levels

Type     Recommendation   Default   RAID 5
NL       RAID 6           RAID 6    Supported: setsys AllowR5OnNLDrives yes
FC/SAS   RAID 6           RAID 6    Supported: setsys AllowR5OnFCDrives yes
SSD      RAID 6           RAID 6    Supported
Glossary
• RAID: Redundant Array of Independent Disks; previously, Redundant Array of Inexpensive Disks
• RAID set: A set of blocks that together provide data protection; RAID 5 (3+1) uses four blocks (3 data, 1 parity) on different drives to provide
data protection
• RAID set size: The number of blocks in a RAID set used to provide data protection; RAID 5 with a set size of 4 will have three data blocks and
one parity block (3+1); RAID 6 with a set size of 8 will have six data blocks and two parity blocks (6+2)
• RAID conversion: Change the RAID level of a storage object
• Parity: Data stored in the RAID set that can be used for regenerating the data in the RAID set; parity data is calculated using the XOR operator
• Kryder's law: Mark Kryder, CTO at Seagate Corp., observed in 2005 that the rate of disk drive density increase was exceeding the more famous
Moore's Law; the result is a "law" attributed to Kryder, generally regarded as drive density doubling every two years
• Uncorrectable Read Error (URE): Sometimes called Unrecoverable Read Error; reliability metric reported by drive vendors; the rate is reported
as the number of bits that can be read before an uncorrectable read error may be expected; common values today (2016) for URE in
enterprise drives are one bit error for every 10^16 bits read
• Annual failure rate (AFR): Reliability metric reported by drive vendors; the rate is the estimated probability that a drive will fail during one full
year of use
• Mean time between failures (MTBF): Reliability metric reported by drive vendors; the rate is expressed as the number of hours before a failure is likely
• Capacity efficiency: The ratio of user-written space to total raw storage space for a given RAID mode; RAID 6 (4+2), for example, requires
six blocks (4+2) of raw storage to store four blocks of user-written data; this example results in a capacity efficiency of 67% (4/6)
• Chunklet: A block of contiguous storage space on a drive; chunklets on HPE 3PAR StoreServ 8000 Storage and HPE 3PAR StoreServ
20000 Storage arrays are 1 GB
• Data chunklet: A block of contiguous storage space holding user data
• Parity chunklet: A block of contiguous storage space holding parity data
© Copyright 2017 Hewlett Packard Enterprise Development LP. The information contained herein is subject to change without notice.
The only warranties for Hewlett Packard Enterprise products and services are set forth in the express warranty statements accompanying
such products and services. Nothing herein should be construed as constituting an additional warranty. Hewlett Packard Enterprise shall
not be liable for technical or editorial errors or omissions contained herein.
Google is a registered trademark of Google Inc. All other third-party trademark(s) is/are property of their respective owner(s).
a00000244ENW, January 2017