HPE 3PAR RAID 6
Securing your data investment
Technical white paper

Contents
Executive summary
RAID
How does RAID work?
RAID 1
RAID 5
RAID 6
Capacity efficiency
Drive failures and rebuild
The second failure
Uncorrectable Read Error
HPE 3PAR and wide striping
Are SSDs different?
Relative data safety of RAID 1 vs. RAID 5 vs. RAID 6
Performance
RAID overhead
Performance during rebuild
Anatomy of a rebuild
Sparing rates
RAID conversion
Investigation
RAID conversion process
Summary
HPE recommendation (3.3.1 and later)
Glossary

Executive summary

Data continues to grow at alarming rates and the need to store more data has never been greater. This data expansion has led to a continued growth in disk drive capacities with only modest growth in drive reliability and performance. The gap between capacity and reliability leads to increased risk to your data, and the time required to recover from a drive failure continues to grow. Compounding drive recovery is the increased risk of a second failure while still recovering from the initial failure. This second failure may come from a second drive fault, but a more common source is an Uncorrectable Read Error (URE). The primary purpose of storage arrays today is to provide robust data services with a high level of performance. HPE 3PAR StoreServ arrays maintain data protection using RAID technologies.
RAID protection comes in three levels (1, 5, and 6) and several set sizes. Each RAID level and set size offers a different level of performance and data robustness. If you are concerned about the risks to your data, you need to understand RAID levels, failures, second failures, URE, and the data robustness options available in HPE 3PAR arrays.

RAID

The idea of RAID started in the late 1980s at the University of California, Berkeley, as a way to address I/O performance. The first RISC processors were being developed, and with them came the expectation of great processor speed increases. Moore's Law, an observation made in 1965, was in full focus, leading to the belief that processor performance would increase annually. This led to the need for more I/O performance to keep pace (the industry has been working on the processor-I/O balance ever since). In the late 1980s, the first official SCSI spec was being defined, and the state-of-the-art hard disk could be represented by an IBM drive (3300 series) the size of a refrigerator with 14-inch platters holding 7.5 GB of data (the literature of the day described capacity in billions of characters of information that could be stored). The idea behind the initial RAID paper published in 1988 was to address larger disks with long service times dominated by seek time (time to move the read/write head) and rotational delays (time for the platter to spin). These concepts are not discussed much anymore, but they are still the main drivers of service time for spinning media drives today. The definition of RAID changed over time to replace the word "inexpensive" (for the letter I) with the word "independent," as disks then, as now, were expensive. A more important change was the recognition that RAID could address both reliability and performance. Today, RAID is synonymous with data protection in storage arrays.

Today's RAID comes in many forms called levels or modes. The levels have been defined and refined over time and there are now three RAID levels in common use—RAID 1/RAID 10, RAID 5/RAID 50, and RAID 6/RAID 60. RAID 1 is commonly referred to as mirroring, where two copies of data are kept in sync. RAID 5 and RAID 6 use the concept of parity to provide data redundancy using less space than RAID 1. RAID 5 uses single parity while RAID 6 uses double parity. Adding a zero (0) to the RAID level adds striping. RAID 10, for example, is striped mirroring.

How does RAID work?

RAID 1

When we consider how RAID works, the focus is on how the architecture provides data protection. RAID 1 is often called disk mirroring and uses pairs of disks or, in the case of HPE 3PAR, pairs of chunklets to store data. Each block written from the host is written to each disk or chunklet in the mirror pair. At any point in time, each RAID 1 disk or chunklet will hold the same data as its mirrored partner. The space efficiency of RAID 1 is 50% because writing one unit of data from the server requires a unit of space on two separate disks. RAID 1 protects data in the event of one disk failure. (HPE 3PAR also supports RAID 1 triple mirroring. This will be discussed later.)

Figure 1. RAID 1

RAID 5

RAID 5 uses a parity algorithm to maintain data protection, which increases space efficiency relative to RAID 1. Data written to the storage with RAID 5 will include a minimum of two data blocks and one parity block.
A unit of data from data disk 1 is combined with a unit of data from data disk 2 to compute parity information, which is written to the parity disk. The minimum configuration for RAID 5 is three disks of equal size, but implementations of RAID 5 using eight disks (seven data disks and one parity disk, or 7+1) or more are common. This combination of data disks and parity disk defines the RAID set size. A RAID 5 configuration with three data disks and one parity disk, commonly noted as 3+1, will have a set size of four disks. The space efficiency of RAID 5 depends on the set size. In a 4-disk RAID 5 set, the space efficiency is 3/4 or 75% (three data disks and one parity disk), while the efficiency of an 8-disk RAID 5 set is 87.5% (seven data disks, one parity disk). RAID 5 with a single parity disk can preserve data when one disk fails.

Figure 2. RAID 5

RAID 6

RAID 6 builds on the parity concept of RAID 5 but adds a second parity disk. Data written to disks configured in RAID 6 will include a minimum of two data blocks and two parity blocks. A unit of data from data block one is combined with a unit of data from data block two to compute two independent parity values, one written to each of the two parity blocks. The minimum configuration for RAID 6 is four disks of equal size, but implementations of RAID 6 using eight disks (six data disks, two parity disks) or more are common. The space efficiency of RAID 6 depends on the set size. In a 4-disk RAID 6 set, the space efficiency is 50% (two data blocks and two parity blocks). Space efficiency of a 16-disk RAID 6 set is 87.5% (14 data blocks and two parity blocks). RAID 6 with double parity can preserve data in the event of two disk failures.

Figure 3. RAID 6

Capacity efficiency

It is often assumed that less space will be available in a RAID 6 configuration compared to a RAID 5 configuration because of the double parity. This is not always the case and depends on the set size (number of storage units in a RAID set). Table 1 compares space efficiency of several RAID 5 and RAID 6 set sizes.

Table 1. RAID 5 and RAID 6 set sizes and density

RAID 5 set size   RAID 6 equivalent set size   Space efficiency
3 (2+1)           6 (4+2)                      2/3 = 4/6 = 66.6%
4 (3+1)           8 (6+2)                      3/4 = 6/8 = 75.0%
5 (4+1)           10 (8+2)                     4/5 = 8/10 = 80.0%
6 (5+1)           12 (10+2)                    5/6 = 10/12 = 83.3%
7 (6+1)           N/A                          6/7 = 85.7%
8 (7+1)           16 (14+2)                    7/8 = 14/16 = 87.5%
9 (8+1)           N/A                          8/9 = 88.9%
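The efficiencies in table 1 follow directly from the set size: data units divided by total units. A minimal sketch (Python, illustrative only; the function and loop below are not from HPE tooling) that reproduces the table:

def efficiency(set_size, parity_units):
    """Capacity efficiency of a parity RAID set: data units / total units."""
    return (set_size - parity_units) / set_size

# RAID 5 carries one parity unit per set, RAID 6 carries two.
for ssz in (3, 4, 5, 6, 7, 8, 9):
    print(f"RAID 5 ({ssz - 1}+1): {efficiency(ssz, 1):.1%}")
for ssz in (6, 8, 10, 12, 16):
    print(f"RAID 6 ({ssz - 2}+2): {efficiency(ssz, 2):.1%}")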
Drive failures and rebuild

Each of the RAID levels discussed provides data protection in the event of a drive failure, but they do not provide the same level of protection. Understanding the relative data safety requires a closer look at each RAID level and the rebuild process. Data is protected from a failure in each of the RAID levels by use of mirroring (RAID 1) or parity (RAID 5 and 6). When a drive fails in a RAID set providing single drive resiliency (for example, RAID 1 mirroring or RAID 5), a window opens where the data in the RAID group is no longer protected from another failure (a second failure) until the rebuild of the failed drive is completed. The rebuild process uses array resources to restore the same data protection that existed prior to the drive failure, but there is additional risk during the rebuild process. The time required for the rebuild process is influenced by many factors including the amount of data written to the failed drive, the number of drives in the RAID group, and the speed of the drives in the RAID group.

The rebuild process must read the remaining good data from the RAID group, rebuild the failed data if parity is used, and write the recovered data to a new location. A proxy for the minimum rebuild time in traditional arrays using a spare drive is the time required to write the content to a new drive.

Note
The rebuild time proxy does not apply to an HPE 3PAR StoreServ array, which implements distributed sparing and many-to-many rebuilds. HPE 3PAR will read and write the same amount of data, but the reads and writes will be spread among many drives, which is faster.

If a 1 TB drive fails, the theoretical minimum rebuild time would be 1 TB divided by the write rate of that drive. If the drive is capable of 100 MB/sec, then the minimum theoretical rebuild time would be around three hours (1 TB/100 MB/sec ≈ 10,485 seconds ≈ 2.9 hours). This minimum time is often much longer in practice. The array usually has other priorities such as serving host I/O. The time required to read the remaining good data in the RAID set may also add to the rebuild time when the RAID set is large, such as in a RAID 5 (7+1) configuration.

Rebuild times are often much longer in practice and are growing longer as drive sizes increase. The industry trend has always been toward larger capacity, faster drives. It is easier to increase the capacity of a drive (to a point) than to increase its performance, which is often limited by physics (rotational speed in spinning media) or bus speeds. This leads to capacity gains growing faster than gains in drive performance. Take an industry-leading 10K RPM drive as an example. Several years ago, a 300 GB enterprise drive was capable of around 150 MB/sec of sustained throughput. Today's 1.8 TB enterprise drive is capable of around 175 MB/sec of sustained throughput. This sixfold increase in capacity is matched with only about a 17% increase in throughput. The theoretical rebuild time has increased from about 34 minutes for the 300 GB drive to around 170 minutes for the 1.8 TB drive, or roughly five times as long.

The HPE 3PAR StoreServ array provides significant benefit by using spare chunklets and wide striping. HPE 3PAR implements distributed sparing rather than the dedicated spare drive used by many other arrays. Wide striping of spare chunklets means HPE 3PAR StoreServ array rebuild times are shorter because many drives share the rebuild load (many-to-many rebuild). The industry trend to larger drive sizes means the trend to increased rebuild times applies to HPE 3PAR StoreServ arrays as well.
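The rebuild-time arithmetic above is simple to reproduce. A minimal sketch (Python, illustrative only; it uses binary units to match the 10,485-second example above, and decimal units would shift the results by a few percent):

def min_rebuild_seconds(capacity_gb, write_mb_per_s):
    """Theoretical floor on rebuild time: drive capacity divided by sustained write rate."""
    return capacity_gb * 1024 / write_mb_per_s   # capacity expressed in binary MB

print(f"1 TB at 100 MB/s: {min_rebuild_seconds(1024, 100) / 3600:.1f} hours")    # ~2.9 hours
print(f"300 GB at 150 MB/s: {min_rebuild_seconds(300, 150) / 60:.0f} minutes")   # ~34 minutes

Actual rebuilds take longer, as discussed above, because the array is also serving host I/O and must read the surviving members of the RAID set.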
The second failure

We have just discussed the importance of rebuild time to data protection and how a second failure during a rebuild can lead to data loss when using a RAID level providing single drive resilience (RAID 1 or RAID 5). What is the probability of experiencing an unfortunate second failure during the rebuild window? The probability of a second failure is dependent on the time required to rebuild the failed drive. When drives were small (for example, 100 GB), the rebuild time was expressed in minutes, limiting the risk. As drives grew larger, drive speed also increased for a while, keeping rebuild times short. As drive capacities continue to follow Kryder's law (doubling every two years), drive speeds cannot keep up, which leads to longer rebuild times and increased risk.

The probability of a second drive failure is often calculated using measures such as Mean Time Between Failures (MTBF) and Annualized Failure Rate (AFR). These values are published by the drive vendor and adjusted by the anticipated rebuild time. Typical AFR values published by vendors are 1% or less for enterprise drives, making the likelihood of a second drive failure during a days-long rebuild relatively small (on the order of five in a million—0.0005%—or less).

Uncorrectable Read Error

This "second failure" is often thought of as a second drive failure while a rebuild is addressing the first failure. This will indeed cause data loss in a single parity configuration; however, a more common second failure is a URE. URE measures the error rate as a function of reading data on the drive and is expressed as the number of bits that can be read before an error is likely. MTBF and AFR metrics reflect failure rates as a function of the time the drive is powered on. Today's URE standard is one bit error in 10^14 bits read for consumer drives (think of your laptop) and one bit error in 10^16 or more bits read for enterprise drives (all HPE 3PAR drives are enterprise drives). Drive vendors work hard to increase the reliability of enterprise drives, but just as there are limits to increasing drive performance, increasing drive reliability also has limitations. Drive capacity, especially SSD capacity, is growing faster than drive reliability.

Calculating URE

URE is expressed as a rate per bits read. As drive capacity increases, the data that must be read to reconstruct a failed drive also increases. In a RAID 5 set of 4 x 100 GB drives where one drive fails, the remaining three drives must be read 100% successfully to complete the rebuild process. A drive size of 100 GB means the remaining three drives, a total of 300 GB, must be read to rebuild the failed drive. If we increase the drive size to 2 TB in the same RAID 5 (3+1) configuration, the rebuild process must then read 6 TB. All 6 TB must be read without error to successfully complete the rebuild. The probability of reading 6 TB of data to rebuild a 2 TB failed drive represents an increased risk over the drive failure probability using MTBF and AFR as discussed earlier. The prevalent URE for enterprise drives today is approximately one bit error in 10^16 bits read, which represents one error in about 1.1 PB read. Using the complement of a formula borrowed from the "RAID: High-Performance, Reliable Secondary Storage" work, you can see that the probability of successfully rebuilding a 2 TB drive in a RAID 5 configuration with a set size of 4 is 99.47%. Taking the complement (100% – 99.47%) shows a 0.53% chance of failure leading to data loss.

P(success) = (1 – 1/URE)^(bits read)
P(success) = (1 – 1/10^16)^(bits in 6 TB) = 99.47%
P(failure) = 1 – P(success) = 100% – 99.47% = 0.53%

Here is the reference to the research work behind the formula:
RAID: High-Performance, Reliable Secondary Storage (technical report CSD-93-778), Chen, P., Lee, E., Patterson, D., Gibson, G., Katz, R., ACM Digital Library, 1993, portal.acm.org/citation.cfm?id=893811
This work is also referenced in the paper: Triple-Parity RAID and Beyond, Adam Leventhal, Sun Microsystems, 2009, queue.acm.org/detail.cfm?id=1670144

This is a substantial increase over the more common drive failure view using AFR or MTBF. Recall that using the manufacturer's AFR data, the likelihood of a second drive failure during a days-long rebuild is only about 0.0005%. If we increase the drive size or the set size, the risk of data loss due to URE increases because more data must be read to successfully complete the rebuild. Replacing the 2 TB drives in the previous example with 4 TB drives in the same RAID 5 configuration with a set size of four increases the risk from 0.53% to 1.05%.
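These probabilities can be reproduced in a few lines. A minimal sketch (Python, illustrative only; it assumes the paper's convention of binary terabytes and the 1-in-10^16 enterprise URE rate, and it uses log1p/expm1 so the tiny per-bit error rate is not lost to floating-point rounding):

import math

def rebuild_failure_probability(drive_tb, set_size, ure_bits=1e16):
    """Probability of at least one unrecoverable read error while reading the
    surviving members of a single-parity RAID set during a rebuild."""
    data_bits = (set_size - 1) * drive_tb * 2**40 * 8         # surviving data, in bits
    log_p_success = data_bits * math.log1p(-1.0 / ure_bits)   # log of (1 - 1/URE)^bits
    return -math.expm1(log_p_success)                         # 1 - P(success)

print(f"2 TB drives, RAID 5 (3+1): {rebuild_failure_probability(2, 4):.2%}")   # ~0.53%
print(f"4 TB drives, RAID 5 (3+1): {rebuild_failure_probability(4, 4):.2%}")   # ~1.05%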
HPE 3PAR and URE

If a URE occurs during a rebuild on an HPE 3PAR array and the block cannot be rebuilt, the block will be marked as "suspect." If the next operation to this block is a host read, a read error is returned. If the next operation to this block is a host write overwriting the "suspect" block, the new data will be written and the block then contains good data, so it is no longer marked "suspect." HPE 3PAR uses a technique known as disk scrubbing to address soft errors and UREs before they become a problem. Disk scrubbing is a background process that reads each drive. If a read error is found on a drive outside the rebuild window, it can be corrected to maintain the health of the array. Disk scrubbing becomes more challenging as drive capacities increase, because it takes longer to read the entire array capacity.

HPE 3PAR and wide striping

We have described the risk of data loss as being dependent on the amount of data that must be read to complete the rebuild. The amount of data, then, is a function of the drive size and the number of drives in the RAID set. For example, a RAID set of 8 x 1 TB drives must successfully read (8-1) x 1 TB, or 7 TB of data, to successfully rebuild a failed drive. HPE 3PAR StoreServ arrays use wide striping for all RAID types to spread data across many drives. In a StoreServ array with 64 drives, if a single volume were large enough it would have portions of the volume on each of the 64 drives. The 64 drives, however, do not define the RAID set. It is not correct that a drive failure in an HPE 3PAR array with 64 drives requires reading the remaining 63 drives to rebuild the failed drive. The RAID set size is defined in the common provisioning group (CPG) and is implemented using chunklets. A RAID 5 CPG with a set size of 8 (7+1) will require reading (8-1) x drive size to rebuild a failed drive. The set size is a tunable attribute of the CPG.

Are SSDs different?

The industry is in a transition from mostly spinning media to mostly flash media, with perhaps an exception for NL (archive) drives. A fair question is, "does the previous history of spinning media apply to SSDs?" Stated another way, are SSDs different when it comes to failure rates? Flash technology in SSDs is indeed very different from spinning media. Significant reductions in power and cooling, and increases in density and performance, can be realized moving from spinning media to flash media. But how does this impact resiliency? Are failure rates and UREs different for SSDs? Is the increased performance enough to change the parameters around rebuilds?

Flash technology is new compared to spinning media, which has been around for more than 50 years, so the field history is limited. There are two SSD studies involving prominent technology companies that offer some insight, but the answers are not clear. In 2015, a study by Carnegie Mellon University (CMU) analyzed the reliability of SSDs used at Facebook. This study, titled "A Large-Scale Study of Flash Memory Failures in the Field," was published in the Proceedings of the 2015 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems. The CMU study does not compare reliability to spinning media and unfortunately does not provide any helpful data about failure rates and URE of SSDs.
The study does conclude with five "observations," including how SSDs go through several failure periods (early, usable life, wear out, and more) with different failure rates. Google™ and the University of Toronto published a second study in the proceedings of the File and Storage Technologies conference (FAST '16) in February 2016. This study, Flash Reliability in Production: The Expected and the Unexpected, covers six years of production use of SSDs at Google, so all the drives in the study are at least six years old. On the measure of replacement rates, the study concludes that SSDs are an improvement over spinning media. Another conclusion in the study, using the measure of URE in SSDs, is that SSDs suffer a greater rate of UREs than spinning media.

The small amount of available data so far suggests failure rates of SSDs are lower than those of spinning media drives. If this holds up over time, we can conclude SSDs have reduced risk of data loss because there are fewer rebuilds to be performed. This same small amount of data also suggests the risk of encountering a URE during the rebuild is still present with SSDs and, in fact, may be elevated compared to spinning media. What can we do with this data? One option is to consider alternatives to the single failure RAID modes provided by RAID 1 and RAID 5. RAID 6 and RAID 1 with a set size of 3 (triple mirroring) provide resiliency during two overlapping failures. The remainder of this discussion considers the tradeoffs between RAID modes providing single failure resilience (RAID 1 and RAID 5) and double failure resilience (RAID 1 triple mirror and RAID 6).

Relative data safety of RAID 1 vs. RAID 5 vs. RAID 6

The tradeoff between usable capacity and RAID set size is well understood. A RAID 5 set size of 4 (3+1) provides 75% usable capacity, which is the same as a RAID 6 set size of 8 (6+2). Meanwhile, the relative data safety of different RAID levels and set sizes is not as well understood. The following table offers a view of data safety for a few common RAID and set size configurations. This data is provided for comparison only. The assumptions for the calculations, especially for the 2-drive resilience configurations, are very detailed and beyond the scope of this discussion.

Table 2. Data robustness and efficiency of RAID modes compared to RAID 5 (8+1)

RAID level and set size   Relative data loss robustness   Capacity efficiency
RAID 1 (1+1)              8                               50%
RAID 1 (1+1+1)            166,000                         33%
RAID 5 (3+1)              3                               75%
RAID 5 (8+1)              1                               89%
RAID 6 (4+2)              25,000                          67%
RAID 6 (6+2)              12,000                          75%

Table 2 illustrates the relative robustness and capacity efficiency for several RAID levels and set sizes. The data loss robustness is shown relative to RAID 5 with a set size of 9 (8+1). For example, RAID 1 with a set size of 2 (1+1) is eight times better than RAID 5 with a set size of 9 (8+1). The data also shows that for single drive resiliency RAID modes (RAID 1 1+1 and RAID 5), relative data robustness improves as the set size is reduced. The largest set size in the table, RAID 5 (8+1) with a set size of 9, has the lowest relative data robustness (larger means better data robustness and protection). The same is true of double drive resiliency RAID modes (RAID 1 1+1+1 and RAID 6): the smaller the set size, the greater the relative data robustness.

Performance

RAID overhead

The main goal of RAID arrays is to protect user data. The risks to data safety for several RAID levels have been discussed, and now we turn attention to performance.
Each RAID mode and geometry has a different level of data safety and performance.

RAID 1

As previously discussed, RAID 1 mirrors data. When configured with a set size of 2 (1+1), there are two copies of the data. Other set sizes are also possible and will increase the number of copies. A set size of 3 (1+1+1) will create three copies. This is sometimes called triple mirroring and is used by the HPE 3PAR InForm OS to store some critical data. Reads to a RAID 1 volume require one I/O to the backend if the data is not in cache. Writes to a RAID 1 volume require the same number of I/Os as the set size. A set size of 2 (1+1) will require two backend writes for each host write. These backend writes will increase the workload on the array but usually do not lead to an increase in host latency since the host write is cached.

RAID 5

RAID 5 uses parity information to protect data. Each set will include n data blocks and one parity block, where "n" is the set size minus one. A set size of 4, for example, will include three data blocks and one parity block. Parity for RAID 5 (and RAID 6) is created using XOR calculations. In HPE 3PAR, this function is provided in the ASIC (hardware) so it is very fast. Reads to a RAID 5 volume will require one I/O to the backend if the data is not in cache. Random writes to a RAID 5 volume will require four backend I/Os. The process begins by reading the old data block and parity block (two backend reads) and calculating the new parity information by XORing the new data, the old data, and the old parity. The new data and new parity are then written to the backend, creating two backend writes. There are some optimizations that can help some workloads, but in general, each host write to a RAID 5 volume will result in four backend I/Os. Sequential write workloads to RAID 5 can result in full stripe writes where one I/O occurs on the backend for every block in the RAID 5 set.

RAID 6

RAID 6 uses parity information to protect data. Each set will include n data blocks and two parity blocks, where "n" is the set size minus two. A set size of 8, for example, will include six data blocks and two parity blocks. Reads to a RAID 6 volume will require one I/O to the backend if the data is not in cache. Random writes to a RAID 6 volume will require six backend I/Os. The process begins by reading the old data and both parity blocks (three backend reads). Parity is calculated next, creating two new parity blocks by XORing the new data, the old data, and one of the parity blocks to compute the first parity, and repeating the process with the second old parity block to create the second new parity block. The new data and both new parity blocks are then written to the backend, creating three backend writes. There are some optimizations that can help some workloads, but in general, each host write to a RAID 6 volume will result in six backend I/Os. Sequential write workloads to RAID 6 can result in full stripe writes where one I/O occurs on the backend for every block in the RAID 6 set.
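These per-write backend costs translate directly into backend IOPS. A minimal sketch (Python, illustrative only; it ignores caching effects, full stripe writes, and the other optimizations mentioned above) that estimates the backend load for a host workload:

# Backend I/Os per host random write, per the discussion above:
# RAID 1 (1+1) = 2 writes, RAID 5 = 2 reads + 2 writes, RAID 6 = 3 reads + 3 writes.
WRITE_PENALTY = {"RAID 1 (1+1)": 2, "RAID 5": 4, "RAID 6": 6}

def backend_iops(host_iops, read_fraction, raid_mode):
    """Rough backend IOPS estimate: each read costs one backend I/O (cache miss),
    each write costs the RAID mode's write penalty."""
    reads = host_iops * read_fraction
    writes = host_iops * (1 - read_fraction)
    return reads + writes * WRITE_PENALTY[raid_mode]

# Example: 30,000 host IOPS at a 60% read ratio.
for mode in WRITE_PENALTY:
    print(f"{mode}: ~{backend_iops(30_000, 0.6, mode):,.0f} backend IOPS")

Comparing the RAID 5 and RAID 6 estimates for the same workload shows the roughly 50% increase in backend write I/Os that the RAID 6 conversion sizing discussion later in this paper accounts for.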
Performance during rebuild

When a drive fails, the array has a lot of work to do to restore the previous level of protection, and normal read and write operations may be impacted by the additional work required to restore the configured level of data protection. The impact of a drive failure depends on many factors including array configuration, workload, the amount of data written to the failed drive, and more. The most significant of these factors is the workload level prior to the failure. A drive failure will cause the array to perform more work to rebuild the data. This additional workload will have minimal impact on a lightly loaded array but may have a measurable impact on a heavily loaded array.

The following two figures illustrate the impact of a drive failure on host performance (figure 4) and backend performance (figure 5) for one set of factors. This set of factors includes an HPE 3PAR 8440 4-node array with 64 SSDs. The volumes were configured in RAID 1 and were 75% full of written data. The workload was random with 60% reads and 40% writes using an 8 KB block size. In this case, the array was moderately loaded (~250K IOPS) when a single drive connected to node 3 was failed. Figure 4 shows the workload before, during, and after the drive failure. The host IOPS continue through the rebuild with a very slight impact (less than 1%). Host service times were less than 300 microseconds before the rebuild and increased about 4% during the rebuild. The impact of the drive failure on the host is minimal.

Figure 4. Host IOPS and service times before, during, and after a rebuild

Figure 5 shows the activity of the backend drives through the same drive failure. The drive IOPS increase about 10%, and service times increase from an average of 120 to 140 microseconds. The increased backend activity reflects the work required to rebuild the failed drive using spare chunklets. The increased workload is shared by all SSDs in the array. In this case, the work of restoring redundancy by rebuilding the failed drive has minimal impact on the host.

Figure 5. Backend IOPS and service times before, during, and after a rebuild

There are other potential impacts from rebuilding a failed drive. When space is constrained, it is possible that some data may be moved between controllers, which could change the workload balance between nodes on the HPE 3PAR array. In one example, a balanced system handling 20,000 IOPS on each of two controllers before a drive failure became imbalanced after the rebuild. Following the rebuild, one controller was handling 17,500 IOPS (44%) and the other controller was handling 22,500 IOPS (56%). This imbalance was caused by space constraints forcing some chunklets to be relocated to a controller other than the one owning the failed drive. HPE 3PAR storage arrays are highly available and have many features to protect user data from failures. However, recovery from a failure can introduce a change in performance. This will not be observed in all cases, but do not be surprised if performance differs after recovery from a failure; there is no guarantee that performance will return to its pre-failure level.

Anatomy of a rebuild

The failure of a component in an array causes additional tasks to be performed.
It is important to understand array behavior during a failure. When a drive fails, the HPE 3PAR InForm OS will identify the source of the failure and recover the drive if possible. It is much easier to recover a drive with a transient error than to incur the risk of a rebuild. It may take two minutes or more before the rebuild process begins. The process rebuilds chunklets by reading the remaining good data, reconstructing the missing data, and writing the newly constructed data to a new chunklet. New data is written to new chunklets throughout the system.

The rebuild process chooses a target spare chunklet using several criteria. These criteria are prioritized to maintain the same level of performance and availability as the source chunklet, if possible. The first choice is a spare chunklet on the same node as the failed drive. When spare chunklets on the same node are not available, free chunklets with the same characteristics are considered. During the sparing process, if the number of free chunklets exceeds a threshold in the HPE 3PAR StoreServ OS, consideration will be given to spare chunklets on another node. This helps keep the array balanced. When space is constrained, the rebuild process may need to choose a target chunklet that does not preserve array balance or chunklet availability (for example, high availability cage [HA cage]). In addition to rebuilding a chunklet on another node or node pair, a chunklet on a different tier of storage may be required. A nearline (NL) chunklet may be rebuilt to NL, fast class (FC or SAS), or SSD. FC chunklets may be rebuilt on FC drives or SSDs. SSD chunklets will only be rebuilt to other SSDs. The conditions leading to spare rebuilds on different tiers of storage are rare and only occur when space is constrained.

During the rebuild process, a host workload is usually running. In many cases, the host workload will access data from the failed drive, requiring special handling of the I/O. When the I/O is a read, the array will reconstruct the missing data just for that read and return it to the host. When the I/O is a write, the data will initially be stored in cache just like any other write I/O. When the data is flushed from cache, it is written to a log chunklet. This special chunklet will hold data until the original failed chunklet is relocated. Once the failed chunklet is relocated, the data in the log chunklet will be moved to the new location.

Sparing rates

We have seen how the risk of data loss depends on many factors, including the rebuild window. The rebuild window is the time that begins with a drive failure event and ends when data protection is restored following the rebuild. The length of this window on an HPE 3PAR StoreServ array depends on the amount of written data on the drive, the array model, and the configuration. It also depends on how busy the array is handling host I/O and data services. The HPE 3PAR sparing algorithms were designed to have minimal impact on host workloads, but an array that is very busy when a failure occurs may see an impact on the rebuild process: the rebuild will take longer on a heavily loaded array. The rebuild will also take longer for higher-capacity drives, although only written data is reconstructed. Slower drives, such as NL drives, have slower rebuild rates because of higher drive latencies.
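One way to reason about the rebuild window is to divide the written data on the failed drive by a sparing rate such as those in table 3 that follows. A minimal sketch (Python, illustrative only; the drive size and written fraction are hypothetical, and the rates are the RAID 6 examples from table 3):

def rebuild_window_hours(written_gb, sparing_rate_gb_per_hr):
    """Approximate rebuild window: written data on the failed drive / sparing rate."""
    return written_gb / sparing_rate_gb_per_hr

# Hypothetical 400 GB SSD that is 75% written, using the idle and busy RAID 6 rates from table 3.
written_gb = 400 * 0.75
for label, rate in (("idle", 700), ("busy", 350)):
    print(f"RAID 6, {label} array: ~{rebuild_window_hours(written_gb, rate):.1f} hours")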
Table 3 lists examples of sparing rates under single drive failures. The HPE 3PAR rebuild process only rebuilds written data, so the rebuild time is a function of the written data rather than the raw drive size. The workload column in table 3 describes workload characteristics such as the read percentage and host block size. The workload intensity is described in terms of the average physical drive (PD) service time (SVT). For example, a workload intensity of 130 microsecond PD SVT means the host workload was adjusted until the average service time of the drives was 130 microseconds.

Table 3. Examples of raw capacity sparing rates under single drive failure

Configuration                          RAID           Workload                                   Rebuild rate
7450 2N, 3.2.1 MU2, 16 x 1.92 TB SSD   RAID 5         Idle                                       170 GB/hr
                                                      70% read, 8 KB + 32 KB, 5 ms PD SVT        140 GB/hr
                                                      70% read, 8 KB + 32 KB, 20 ms PD SVT       100 GB/hr
8440 4N, 3.2.2 MU4, 64 x 400 GB SSD    RAID 1 (1+1)   Idle                                       2000 GB/hr
                                                      60% read, 8 KB, 130 microsecond PD SVT     2000 GB/hr
                                       RAID 5 (5+1)   Idle                                       1850 GB/hr
                                                      60% read, 8 KB, 130 microsecond PD SVT     1400 GB/hr
                                       RAID 6 (6+2)   Idle                                       700 GB/hr
                                                      60% read, 8 KB, 130 microsecond PD SVT     350 GB/hr

You can see from the examples in table 3 that the rebuild process is faster for RAID 1 and slower for RAID 6. Rebuild rates are also higher for the most recent version of the HPE 3PAR InForm OS running on later generation hardware. Use these examples as guidelines; actual rebuild rates depend on many factors.

RAID conversion

There are many tradeoffs in selecting a RAID mode, including resiliency, performance, and rebuild risks. These variables change over time, leading to the possibility of changing RAID modes. HPE 3PAR provides the ability to change RAID modes online using Dynamic Optimization (DO).

Investigation

Converting from a RAID mode with single drive protection (RAID 1 or RAID 5) to a RAID mode providing double drive protection (RAID 6 or RAID 1 triple mirroring) requires analysis of performance and capacity before conversion. Adding double drive resilience has a performance impact for some workloads, and understanding this impact before starting a conversion will prevent surprises. Available capacity must also be evaluated, as some set sizes will require additional space.

Sizing RAID 1 (1+1+1) conversion

RAID 1 (1+1+1) triple mirroring provides excellent performance, fast rebuilds, and protection for up to two drive failures. RAID 1 performance following a conversion from RAID 5 will be the same as or better than RAID 5 performance. Read performance will be about the same while write performance will be improved. RAID 5 requires two disk reads, parity generation, and two disk writes for each host write. RAID 1 triple mirror simply requires three disk writes; therefore, overall performance will depend on the workload, but it is expected to be the same as or an improvement over RAID 5. The primary concern when considering a conversion from RAID 5 to RAID 1 triple mirroring is capacity. The capacity efficiency of RAID 5 varies with the set size (refer to table 1). A set size of 8 (7+1), for example, will be 87% efficient, which means 87% of the usable space will hold user data and the remaining 13% will hold parity information. RAID 1 triple mirroring will only be 33% efficient, which means 33% of the space will hold user data and 67% will provide double drive protection in the form of copies of the user data. Consider an example with 1 TB of user space written in RAID 5 (7+1). The space on disk (assuming no compaction) will be approximately 1.15 TB (1 TB/87%).
One terabyte of user space written in RAID 1 triple mirror will consume 3 TB (1 TB/33%), an increase of about 1.85 TB. In this case, the cost of double drive protection and RAID 1 triple mirror performance is about 1.85 TB of additional capacity.

Sizing RAID 6 conversion

RAID 6 offers a balance of performance and capacity efficiency in providing double drive failure protection. Users converting from RAID 5 to RAID 6 should expect lower performance accompanied by the same or lower storage efficiency. The HA cage setting may also be affected.

RAID 6 performance

RAID 6 requires additional backend resources to maintain the additional parity information. Reads to RAID 6 will perform the same as RAID 5, but writes are different. Host random writes to RAID 5 require two backend reads and two backend writes. Host random writes to RAID 6 require 50% more backend IOPS, and this additional resource requirement must be considered before starting a conversion to RAID 6.

A good measure of array congestion from write operations is delayed acknowledgments (delacks). When the array is very busy with write requests, HPE 3PAR may delay acknowledgment (delack) of a host request in an effort to moderate the host workload. Delacks have the effect of slowing the host workload, which may allow the backlog of dirty cache pages waiting to be flushed to the disks to clear and normal performance to resume. Delacks can be monitored with a System Reporter command (for example, srstatcmp -btsecs -1h -hires). The delack counters are cumulative, so the value reported in the prior interval must be subtracted from the value in the current interval to determine the delacks that occurred. If 5% or more of the samples have non-zero delack counts, this is an indication the array might be too busy to consider RAID 6 conversion, and further performance analysis is required.

Characterization of the host workload comes next. Gather workload data including IOPS, throughput, block size, read/write ratio, and service times. These values will help you understand the workload and estimate the additional resources required by a conversion to RAID 6. This data is available with a System Reporter command (for example, srstatport -btsecs -7d -port_type host -hires).
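A small amount of scripting can turn the exported samples into the summary numbers used below (averages, reads per write, and high percentiles). A minimal sketch (Python, illustrative only; it assumes the read and write IOPS columns have already been extracted from the System Reporter output into a list of pairs):

import statistics

def summarize(samples):
    """samples: list of (read_iops, write_iops) pairs taken from srstatport output."""
    reads = [r for r, _ in samples]
    writes = [w for _, w in samples]
    totals = sorted(r + w for r, w in samples)

    def percentile(p):
        return totals[min(len(totals) - 1, int(p / 100 * len(totals)))]

    return {
        "avg_read": statistics.mean(reads),
        "avg_write": statistics.mean(writes),
        "reads_per_write": statistics.mean(reads) / statistics.mean(writes),
        "p95_total": percentile(95),
        "p98_total": percentile(98),
        "max_total": totals[-1],
    }

# The first three samples from figure 6; a real analysis would use the full measurement window.
print(summarize([(19839.6, 16747.7), (18104.9, 14237.5), (19863.9, 15599.5)]))

The 95th and 98th percentile totals are useful later, when the measured workload is compared against the RAID 6 performance estimate.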
An example of some data provided by System Reporter is presented in figure 6.

Array1 cli% srstatport -btsecs -7d -hires -port_type host
                                       ---------IO/s---------  -----------KBytes/s-----------  ---Svct ms---  --IOSz KBytes--
Time                        Secs            Rd      Wr     Tot        Rd       Wr       Tot      Rd   Wr  Tot     Rd   Wr  Tot
2016-11-03 03:05:00 PDT 1478167500     19839.6 16747.7 36587.3 2304312.2 373525.5 2677837.7    4.66 0.70 2.85  116.1 22.3 73.2
2016-11-03 03:10:00 PDT 1478167800     18104.9 14237.5 32342.4 2292495.9 367744.7 2660240.6    4.62 0.75 2.92  126.6 25.8 82.3
2016-11-03 03:15:00 PDT 1478168100     19863.9 15599.5 35463.4 2245195.5 363948.1 2609143.6    4.09 0.67 2.58  113.0 23.3 73.6
2016-11-03 03:20:00 PDT 1478168400     19329.8 16173.6 35503.3 2128402.0 383208.2 2511610.2    3.73 0.66 2.33  110.1 23.7 70.7
2016-11-03 03:25:00 PDT 1478168700     19538.3 13609.2 33147.6 2149153.0 346517.5 2495670.5    3.47 0.70 2.33  110.0 25.5 75.3
2016-11-03 03:30:00 PDT 1478169000     19569.6 14185.0 33754.6 2084315.7 444565.2 2528880.9    3.61 0.78 2.42  106.5 31.3 74.9
2016-11-03 03:35:00 PDT 1478169300     19140.3 13678.4 32818.7 2131287.6 414473.2 2545760.8    3.65 0.77 2.45  111.4 30.3 77.6
2016-11-03 03:40:00 PDT 1478169600     17879.8 12746.8 30626.5 2026038.9 369926.2 2395965.1    3.66 0.70 2.43  113.3 29.0 78.2
2016-11-03 03:45:00 PDT 1478169900     20696.9 15152.5 35849.4 2248155.4 360746.8 2608902.1    3.60 0.65 2.35  108.6 23.8 72.8
2016-11-03 03:50:00 PDT 1478170200     16556.7 11309.2 27865.9 2021479.5 302920.5 2324400.0    3.81 0.71 2.55  122.1 26.8 83.4
2016-11-03 03:55:00 PDT 1478170500     15521.8 12401.0 27922.8 2078035.5 380780.6 2458816.1    4.17 0.73 2.64  133.9 30.7 88.1
2016-11-03 04:00:00 PDT 1478170800     15693.6 12174.3 27867.8 2031430.3 347725.5 2379155.8    4.22 0.71 2.69  129.4 28.6 85.4
2016-11-03 04:05:00 PDT 1478171100     16064.3 11527.2 27591.4 2163348.7 281132.0 2444480.7    4.17 0.65 2.70  134.7 24.4 88.6

Figure 6. Example System Reporter data

Figure 7 shows what this data looks like when graphing IOPS.

Figure 7. Example host performance data

The example in figure 7 shows a host workload requesting up to 58K IOPS. The read-to-write ratio can be calculated from the full data set, shown in part in figure 6. Divide the average read operations by the average write operations to calculate the number of reads per write. In the example in figures 6 and 7, the average read rate is 16,175 and the average write rate is 13,880. Dividing the read rate by the write rate gives 1.16 reads per write, or a read:write ratio of 1.16:1.

This measurement data can be compared to sizing estimates of the array to estimate the impact of a conversion to RAID 6. IOPS estimates of this array are calculated for the current configuration (RAID 5) and the target configuration (RAID 6). These estimates are added to the graph in figure 7 to create the graph in figure 8.

Note
Consult your local HPE representative for help in estimating IOPS for your array.

Figure 8. Example host performance with RAID 5 and RAID 6 estimates

The line marked RAID 5 in figure 8 represents the sizer estimate of the maximum IOPS for this array with this workload. The RAID 6 line in figure 8 represents the sizer estimate of the reduced maximum IOPS this array can achieve with this workload if the configuration were changed to use RAID 6. When the data is presented this way, you can easily see how much performance is being used in the RAID 5 configuration and an estimate of the performance following a conversion to RAID 6. In this case, you can see a few data points that cross the RAID 6 estimated performance line.
This does not mean RAID 6 is not an option, but during times when I/O requests exceed the capability of the array, host service times will increase. A read spike that causes the total IOPS to exceed a threshold may result in a greater impact to the workload than a write spike, since writes will always be cached. When comparing measurement data and sizing estimates like those found in figure 8, it is often helpful to dig deeper into the measurement data. A closer examination of the workload running during the times when the total IOPS exceeds the RAID 6 estimated performance line will help you understand which applications may see increased service times. Calculating maximums and 95th and 98th percentiles from the measurement data can shed some light on the relationship between the measurement data and the performance estimate. Special consideration should also be given to the measurement window and the differences between the sizer assumptions and the array configuration. A key question when analyzing measurement data is, "what was measured?" The measurement data may not be helpful if it is not representative of the workload that is expected following a RAID 6 conversion. All performance estimators make some assumptions about configurations such as the number of host connections, backend data paths, number of disks, and more. Make sure these assumptions agree with the target configuration or make the necessary adjustments in the performance estimate.

RAID 6 HA cage

High availability cage (HA cage) is a common provisioning group (CPG) setting that specifies data is written to physical drives in a manner that allows an entire cage to fail without data unavailability or loss. This option is allowed in all RAID modes, but it requires sufficient drive cages to support the configured set size. The set size defines the number of drives that will be written as an availability group or RAID set. RAID 1 (1+1), for example, will mirror data on two drives to provide single drive fault tolerance. Adding HA cage to this configuration requires at least two cages behind each node pair. Similarly, RAID 5 (7+1) writes data to seven drives plus one more drive for parity, a total of eight. Adding HA cage to this configuration requires at least eight drive cages behind each node pair.

The HA setting for the current environment must be considered before converting to RAID 6. If HA cage is in use before the conversion and the desire is to maintain HA cage, there may be space efficiency (set size) tradeoffs to consider. RAID 5 (7+1) using HA cage requires eight drive cages behind each node pair. A conversion to RAID 6 must keep the set size within those eight cages to maintain HA cage. In this case, choosing a RAID 6 set size of 8 (6+2) will enable HA cage to be maintained, but the capacity efficiency will be reduced from 87% (RAID 5 7+1) to 75% (RAID 6 6+2). Alternatively, additional cages may be added to each node pair if possible. If additional capacity is not available, more drives may be required, or the choice of HA cage and RAID 6 set size can be reconsidered. The same capacity efficiency can be maintained using a RAID 6 set size of 16 (14+2), but with this set size HA cage cannot be maintained with only eight cages.
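The cage arithmetic is simple to check. A minimal sketch (Python, illustrative only) that, for a candidate RAID 6 set size, reports whether HA cage is possible with a given number of cages behind a node pair and what the capacity efficiency would be:

def raid6_option(set_size, cages_behind_node_pair):
    """HA cage needs at least as many cages as drives in the RAID set;
    RAID 6 efficiency is (set_size - 2) / set_size."""
    ha_cage_ok = cages_behind_node_pair >= set_size
    efficiency = (set_size - 2) / set_size
    return ha_cage_ok, efficiency

for ssz in (6, 8, 16):
    ok, eff = raid6_option(ssz, cages_behind_node_pair=8)
    print(f"RAID 6 set size {ssz}: HA cage {'possible' if ok else 'not possible'}, "
          f"efficiency {eff:.0%}")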
RAID 6 capacity

RAID 6 provides double drive failure protection using two parity drives. This additional parity sometimes requires additional storage capacity, but not always. The key is the set size of the source RAID 5 CPG and the target RAID 6 CPG. If the RAID 6 set size is double the RAID 5 set size, the same capacity efficiency can be maintained. A RAID 5 set size of 4 (3+1) has a capacity efficiency of 75%. Choosing a RAID 6 set size of 8 (6+2) will maintain the same 75% capacity efficiency. If doubling the RAID 5 set size is not possible, capacity efficiency will suffer. If sufficient free space is not available, additional drives may be needed. A RAID 5 volume using a set size of 8 (7+1) has a capacity efficiency of 87%. Converting this volume to RAID 6 while maintaining the same set size of 8 (6+2) will reduce capacity efficiency to 75%, which translates to roughly 17% more raw capacity for the same written data. If that additional capacity is not available, the conversion will not be successful.

RAID conversion process

Careful consideration may lead to the decision to change RAID modes or set sizes. HPE 3PAR Dynamic Optimization allows changing of RAID levels and set sizes non-disruptively. Changing the RAID level or set size of a volume uses the tunevv CLI command. Converting an existing volume to have different RAID or set size properties begins with the CPG, because these properties are defined in the CPG definition. The first step is to find an existing CPG with the desired properties or create a new CPG with those properties. The following command creates a new CPG specifying RAID 6 and a set size (ssz) of 8 (6+2).

createcpg -ssz 8 -ha mag -t r6 -p -devtype SSD MY_CPG

After the CPG is identified, the tunevv command may be run to change the attributes of the VV. The tunevv command runs as a background process and can be run non-disruptively. The host will be unaware of the tune process as the VV changes from one set of parameters to another. The following is an example tunevv command.

tunevv usr_cpg MY_CPG MY_Vol

The tune process attempts to minimize resource utilization to prevent any impact to host I/Os. tunevv will limit outstanding I/Os to any one drive to approximately 8% of that drive's maximum IOPS. This leaves 92% of the drive's IOPS capability to serve host I/Os and other data services. Because of this, it can take a long time to convert a large volume. Currently, there is no way to change the speed of a tune process, even on an idle array. Even with this consideration for host performance, it is still possible that a tune command may at times impact host service times. If you determine a tune operation is having an undesirable impact on other array operations, the tune may be safely stopped (canceltask) and restarted at a later time. While tunevv is running, it will periodically free up available space. The tune process may also overlap with other data services that move data, such as Adaptive Optimization. Because of this overlap, some long-running tune operations may report errors on some regions. This is normal, and the skipped regions can be addressed by running the tune operation a second time.

Summary

RAID provides data protection from single and double drive failures. When a drive fails, a process within the storage array begins to restore the array to the original level of protection. Restoring protection often results in a drive rebuild process, which brings the risk of a second failure during the rebuild. This second failure can come from a second drive failure or from an Uncorrectable Read Error (URE). Drive capacities are growing faster than the drive manufacturers can grow reliability and performance. This creates a trend of longer rebuild times and, therefore, more exposure to the risk of a second failure.
All this makes it desirable to consider a RAID level and set size providing more data robustness. When a change to a different RAID level or set size is desired, HPE 3PAR can perform the conversion while continuing host I/O using Dynamic Optimization. The trend of larger capacity drives leading to longer rebuild times, and the associated risks, is leading HPE 3PAR to recommend and default to a more robust RAID configuration. The current HPE 3PAR recommendations and defaults for different drive types are listed in table 4.

HPE recommendation (3.3.1 and later)

Table 4. HPE 3PAR recommended and default RAID levels

Type     Recommendation   Default   RAID 5
NL       RAID 6           RAID 6    Supported: setsys AllowR5OnNLDrives yes
FC/SAS   RAID 6           RAID 6    Supported: setsys AllowR5OnFCDrives yes
SSD      RAID 6           RAID 6    Supported

Glossary

• RAID: Redundant Array of Independent Disks; previously, Redundant Array of Inexpensive Disks
• RAID set: A set of blocks that together provide data protection; RAID 5 (3+1) uses four blocks (3 data, 1 parity) on different drives to provide data protection
• RAID set size: The number of blocks in a RAID set used to provide data protection; RAID 5 with a set size of 4 will have three data blocks and one parity block (3+1); RAID 6 with a set size of 8 will have six data blocks and two parity blocks (6+2)
• RAID conversion: Changing the RAID level of a storage object
• Parity: Data stored in the RAID set that can be used for regenerating the data in the RAID set; parity data is calculated using the XOR operator
• Kryder's law: Mark Kryder, CTO at Seagate Corp, observed in 2005 that the rate of disk drive density increase was exceeding the more famous Moore's Law; the result is a "law" attributed to Kryder generally stated as drive density doubling every two years
• Uncorrectable Read Error (URE): Sometimes called Unrecoverable Read Error; reliability metric reported by drive vendors; the rate is reported as the number of bits that can be read before an uncorrectable read error may be expected; common values today (2016) for URE in enterprise drives are one bit error for every 10^16 bits read
• Annual failure rate (AFR): Reliability metric reported by drive vendors; the rate is the estimated probability that a drive will fail during one full year of use
• Mean time between failures (MTBF): Reliability metric reported by drive vendors; the rate is expressed as the number of hours before a failure is likely
• Capacity efficiency: The ratio of user-written space to total raw storage space for a given RAID mode; RAID 6 (4+2), for example, requires six blocks (4+2) of raw storage to store four blocks of user-written data, resulting in a capacity efficiency of 67% (4/6)
• Chunklet: A block of contiguous storage space on a drive; chunklets on HPE 3PAR StoreServ 8000 Storage and HPE 3PAR StoreServ 20000 Storage arrays are 1 GB
• Data chunklet: A block of contiguous storage space holding user data
• Parity chunklet: A block of contiguous storage space holding parity data

© Copyright 2017 Hewlett Packard Enterprise Development LP. The information contained herein is subject to change without notice. The only warranties for Hewlett Packard Enterprise products and services are set forth in the express warranty statements accompanying such products and services. Nothing herein should be construed as constituting an additional warranty. Hewlett Packard Enterprise shall not be liable for technical or editorial errors or omissions contained herein.
Google is a registered trademark of Google Inc. All other third-party trademark(s) is/are property of their respective owner(s). a00000244ENW, January 2017