Achieving Five-9's Availability: The Dot Hill Systems Commitment to

Dot Hill Systems White Paper Achieving Five-9’s Availability: The Dot Hill Systems Commitment to Reliability Dot Hill Systems | Introduction 1 Dot Hill Systems White Paper INTRODUCTION Businesses make great efforts to maximize the availability of their on-line data, as well as to minimize or even eliminate the loss of any valuable data. RAID (Redundant Array of Independent Disks) hardware configurations supplemented with intelligent software features all help to achieve these goals. These measures are more effective and less expensive to maintain when they are built into a reliable Storage Area Network (SAN) or a virtualized storage array foundation. RAID availability is measured by percentage of uptime. When storage area network (SAN) data are unavailable, so are the applications that must access this data. When system outages occur, most online work stops. This is why any downtime of mission-critical applications is expensive. While figures vary by the type and size of the organization, some industries, such as energy and telecommunications, report losses of revenue from $33,000 to $47,000 (USD) for every minute of downtime, according to a study conducted by Meta Group (subsequently acquired by Gartner, a leading information technology research and advisory company). Measures of RAID Availability:    99.99% or Four-9’s translates to about 53 minutes of downtime per year. 99.999% or Five-9’s of availability translates to just over 5 minutes of downtime annually. 99.9999% or Six-9’s of availability translates to just over 31 seconds of downtime annually. These are statistical averages, but you can see that the additional 48 minutes of average annual average uptime/productivity (at e.g. $30K-$47K/Min) by moving from Four-9’s to Five-9’s availability is important to businesses (with projected $1.44 - 2.26M in productivity savings) that rely on a SAN infrastructure for such mission-critical applications. Dot Hill Systems understands the importance of high availability for its customers. Throughout its 30+ year history, Dot Hill has built a reputation for exceptional quality in both hardware and software. This commitment has resulted in industry-leading high availability, which is embodied in the very name of Dot Hill’s product family: AssuredSAN™. This paper’s remaining content is divided into two sections followed by a brief conclusion. The section on Designing for High Availability describes the measures Dot Hill takes during the engineering design phase to ensure high reliability and availability. The measures taken to validate system reliability are then covered in the section on Proof of Success with Rigorous Analysis and Field Data. Dot Hill Systems | Introduction 2 Dot Hill Systems White Paper DESIGNING FOR HIGH AVAILABILITY High availability is achieved through a combination of three design elements: 1. Fault Avoidance - High reliability (measured by the Mean Time Between Failures or MTBF) of the system and its several subsystems. 2. Fault Tolerance - Redundant subsystems to eliminate as many single points of failure as possible. 3. Serviceability - Rapid diagnosis and repair of any failure (measured by Mean Time to Repair or MTTR) by using internal diagnostics code and adequately positioned Field Replaceable Units (FRUs) for all critical subsystems. The following equation for availability demonstrates the vital role of serviceability in the system’s design. Maximum availability can be achieved only by minimizing the time it takes to diagnose and perform a repair, which is reduced significantly by using FRUs. MTBF Availability = ——————— MTBF + MTTR To achieve maximum availability, Designs must consider fault avoidance, fault tolerance, and serviceability. This also improves its manufacturability. The specific aspects of each of these areas are described in turn here. DESIGN FOR RELIABILITY, AVAILABILITY, & SERVICEABILITY (RAS) Designing hardware for high Reliability, Availability, and Serviceability involves both the system and its several subsystems. To achieve high availability at the system level, Dot Hill integrates reliability into the design process in several ways. The first and most obvious is the use of storage device (disk drive) redundancy within RAID configurations (RAID 1, 3, 5, 6, 10 and 60). Next, we use dual power supplies containing fans for the entire system in our standard 2U12 or 2U24 chassis RAIDs and JBODs. (JBOD stands for Just a Bunch Of Disks, or a group of disks not connected through an intelligent controller.) In our high density RAID and JBOD systems (2U48 and 4U56 chassis RAIDs), there are separate dual Fan Control Modules (FCM) as field replaceable units (FRUs) to prevent over-heating (and avoid life acceleration component failures) and also allow rapid field replacements - if ever needed. Even higher availability is achieved by using redundant RAID and JBOD controllers. By eliminating as many single points of failure in these critical subsystems as possible, the system itself continues to operate normally during a failure of any single redundant FRU. While such FRU failures do factor into the subsystem’s MTBF (its FRU rated reliability), it does not diminish the availability of the overall system itself since an outage will only occur when there is an exposed single point of failure. Dot Hill’s AssuredSAN architecture features full redundancy by design for every subsystem requiring a significant number of active components. The mechanical chassis/midplane itself cannot be redundant, Dot Hill Systems | Designing for High Availability 3 Dot Hill Systems White Paper of course, since there is a single midplane that performs the simple function of connecting the redundant controllers to the redundant disk drives. The midplane has minimal active components, however, and Dot Hill selects these components for the highest possible reliability. The result is an extraordinarily high MTBF for the chassis and its midplane, and therefore, virtually no impact is projected on overall system availability. To enhance system serviceability for the shortest possible Mean Time To Repair (MTTR), Dot Hill utilizes the two complementary design techniques. The first is the use of a modular chassis with Field Replaceable Units (FRUs). The ability to swap out a confirmed failed FRU subsystem quickly and easily minimizes the time it takes to repair an installed system and restore it to full operating performance. By utilizing such a modular FRU design, which provides convenient access to all subsystems, Dot Hill’s AssuredSAN products can be maintained seamlessly with minimal to no disruption in service during most repairs. Standard RAID chassis FRUs are illustrated in 1) below. Two other views 2), 3) of high density RAID chassis with their drawer assemblies are shown next. 1) Standard 2U12 RAID/Chassis Assembly: 2) Ultra48 high density 2U48 RAID Chassis/Drawer Assembly: Dot Hill Systems | Design for Reliability, Availability, & Serviceability (RAS) 4 Dot Hill Systems White Paper 3) Ultra56 high density 4U56 Chassis/Drawer Assembly: Dot Hill’s standard chassis mechanical design enables the power supply with internal fan, JBOD I/O module and RAID controller, and disk drives all to be serviced quickly as hot-swappable Field Replaceable Units (FRUs). Being able to replace redundant FRUs while the system is fully operational further enhances its availability. Note how the power supply and I/O module are accessible from the rear of the chassis, while the disk drives are accessed from the front. Also note that redundant FRUs are not shown here to identify the individual subsystems more clearly. In addition, the higher part count HD chassis design is still able to maintain high availability by separating the larger wattage power supply from the fans themselves by having redundant FRUs for both the PSUs and Fan Control Modules or FCMs. The second serviceability technique is immediate notification of any failure by messaging to our customers. As expected, the longer it takes to detect a failure, the longer it will take to repair it. Time is of the essence for another reason, however: the failure of a redundant subsystem creates, in effect, a temporary single point of failure that increases the risk of a system-level outage. For this reason, the firmware in all Dot Hill systems is designed to quickly detect, isolate and confirm any failure, initiate a failover to a redundant subsystem, and provide immediate notification. The actual “messaging” of the notification can also be configured to match operational procedures to ensure that on-duty (or on-call) staff is properly and quickly notified via “phone home” features. Dot Hill Systems | Design for Reliability, Availability, & Serviceability (RAS) 5 Dot Hill Systems White Paper DESIGN FOR RELIABILITY (DFR) At the FRU or subsystem level, Dot Hill utilizes five separate designs for reliability (DFR) techniques to increase the reliability of the overall system by addressing each subsystem and FRU, while at the same time also maximizing the inclusion of leading-edge SAN features. The first DFR technique is to use only high quality parts. Higher quality parts cost more, of course, but their superior performance and longer service lives normally contribute to a lower total cost of ownership in the long-run. Despite the higher per-part cost, minimizing the part count, while concurrently enhancing feature functionality, helps to improve the overall price/performance of a highly-reliable design. For these reasons, Dot Hill utilizes only the highest quality parts available from reputable suppliers. The second DFR guideline is reducing the electronic component part count. Because any individual part can fail, the fewer there are, the higher the inherent reliability of the subsystem. Dot Hill engineers must support new features and still minimize the electronic parts required on all printed circuit boards and other subsystem FRUs. The third DFR technique involves the de-rating of selected electronic parts. Operating any part or component at or near its rated capacities inevitably shortens its useful life. For critical parts, Dot Hill selects only those that will be able to operate well below their maximum allowable specifications for voltage, power and/or current. This can substantially increase their useful life, and therefore, the MTBF of the subsystem. The fourth DFR technique has been touched on earlier but, deserves more detail. RAID reliability is also enhanced by using redundant FRU design techniques. Redundant FRU failover options are highly valued for mission critical subsystems such as RAID Controllers, JBOD Input/Output Modules (IOMs), Fan Control Modules, Power Supply Units (PSUs), and Drives. The fifth DFR technique is unique to Dot Hill Systems: designing for software reliability. In modern designs, software reliability is just as important as hardware reliability, and in some ways even more important. The reason is: software bugs (including those in firmware) that cause downtime normally take significantly longer to resolve than the more obvious hardware failures. Bugs are often dependent upon system state (the set of circumstances leading up to the failure), making them difficult to reproduce and isolate quickly, and any patch or update must be fully tested before it can be released. Both add considerably to the MTTR for software failures, thereby adversely impacting on system serviceability and availability. To maximize software reliability, Dot Hill monitors its improvement via Software Reliability Growth and Maturity Test techniques. Two primary metrics are utilized to assess software readiness. The first is the Mean Time to Discovery (MTTD) of bugs to assess the maturity of all new product RAID systems software and firmware during development. The second is the Reliability Growth Factor (RGF). The RGF tracks the growth of the MTTD metric over time to indicate when adequate growth has been attained as an input to our product release decisions. It is important to note that the MTTD and RGF metrics are not Dot Hill Systems | Design for Reliability, Availability, & Serviceability (RAS) 6 Dot Hill Systems White Paper industry metrics, but are instead software maturity metrics created by Dot Hill as part of the company’s commitment to RAID systems quality and reliability. After extensive testing, all RAID software designs must have a sufficiently high and stable MTTD and RGF before being released. In addition, the new product system design is released to manufacturing and enters the production phase only after passing three comprehensive tests. The Engineering Verification Test (EVT) and the Design Verification Test (DVT) ensure that the system and/or subsystem(s) fully satisfy all design specifications, including those for high reliability of both the hardware and software. These tests also confirm that marginal variations in parts from component suppliers will not compromise system reliability over the product’s useful life of a minimum of 10 years of design life. The Reliability Demonstration Test (RDT) is a separate and rigorous evaluation of the final production hardware that verifies its calculated reliability, availability and serviceability (covered below). Where some vendors use only a few samples in a fairly short demonstration test, Dot Hill’s RDT uses 24 fully-configured HD chassis to 30 fully-configured Standard chassis RAID + JBOD systems in an 8 – 12 week extended reliability test performed at our CM’s location with final hardware and feature complete code that must end with a full MTBF demonstration at 80% confidence to pass. DESIGN FOR MANUFACTURABILITY (DFM) To ensure high Reliability, Availability, and Serviceability, Dot Hill’s Quality Assurance (QA) Group establishes a comprehensive set of assurance controls in parallel with the engineering team’s design efforts. These controls ensure that the design facilitates manufacturing best practices for maintaining high quality and reliability with high yields and minimal defects that support rapid fault isolation and repair. Dot Hill’s DFM process also evaluates all components and suppliers for quality assurance using a strict set of qualifications to ensure compliance with minimal failure rates. All hardware is then manufactured under similar controls for the production process itself, which must comply with all Dot Hill component and assembly specifications. These specifications are all outlined in detail within the Supplier Quality Management Handbook (SQMH). Additional best practices during manufacturing include burn-in of components, subsystems, and fully configured RAID systems to minimize early infant mortality escape failures. Ongoing Reliability Testing (ORT) is performed on regular mfg. samples. For the ORT process, random samples are regularly selected for a 4-week extended test to ensure that the manufacturing process continues to yield the desired levels of quality and reliability by not compromising the RAID design’s inherent capabilities. PROOF OF SUCCESS WITH RIGOROUS ANALYSIS AND FIELD DATA The reliability of any system can be assessed using a “bottom-up” analysis of its numerous electronic component parts. While these analyses can be remarkably accurate, especially when conducted in accordance with proven methodologies, it is also prudent to validate these calculated assessments with actual data from production units in use at customer sites. Dot Hill does both. Dot Hill Systems | Design for Manufacturability (DFM) 7 Dot Hill Systems White Paper RELIABILITY, AVAILABILITY & SERVICEABILITY (RAS) ANALYSIS Dot Hill’s RAS methodology models the design’s Reliability, Availability and Serviceability at both the system and subsystem levels, and also provides estimates of RAID data protection levels. Componentlevel Telcordia MTBF predictions performed at 25°C and 40°C ambient environments are included for all FRUs as a foundation for the analysis, and all analyses are performed in accordance with Telcordia’s SRNWT-000332 Reliability Prediction Procedure for Electronic Equipment. This Belcore/Telcordia prediction methodology assumes a series model for the hardware MTBF based on any random hardware failure occurrence (including those in redundant subsystems), regardless of whether or not it causes a system-level outage. The predicted FRU level failure rate data is based on a combination of supplier life test data, Dot Hill field returns data, and Telcordia’s component industry data. Per the RAS analysis methodology, all FRUs are assembled via Reliability Block Diagram (RBD) methods (shown in the figure below), and Monte Carlo simulations are applied to account for random FRU failures, different data protection levels (Vdisks/LUNs for RAID5 N+1 and RAID6 N+2) and different MTTRs; for example: four hours, twelve hours, twenty-six hours for service restore times for any FRU or software failure. Recovery Time Objectives of 26 vs. 27 hours for MTTRs were determined to be the breakpoint for this configuration when comparing RAID5 vs. RAID6. The analysis also assumes an average 24 second failover time to the redundant RAID or JBOD IOM per Dot Hill internal test data under expected IO conditions. Dot Hill Systems | Reliability, Availability & Serviceability (RAS) Analysis 8 Dot Hill Systems White Paper Depicted in this reliability block diagram (RBD) above is the AssuredSAN 6000 Series system model. This includes a 4U56 chassis RAID+JBOD configuration Here the RAID chassis contains dual power supplies, dual fan control modules, and dual SAS RAID controllers. While the JBOD contains dual power supplies, dual fan control modules, and dual JBOD IOMs. Each RAID and JBOD chassis each contains 56 x 4TB SAS disk drives (RAID5 and RAID6), resulting in two different RAS analysis configurations for comparison. These statistically analyzed availability rates are based on breakpoint/worse case Recovery Time Objectives (RTO) for RAID servicing times. A summary of the RAS analysis results are shown in the following table: RAID5 Dual Controllers RAID6 Dual Controllers 27 Hour Recovery Time Objective (RTO) 99.9989% 99.999% 26 Hour Recovery Time Objective (RTO) 99.999% 99.9991% Dot Hill Systems | Reliability, Availability & Serviceability (RAS) Analysis 9 Dot Hill Systems White Paper Based on two selected Recovery Time Objectives (RTO) for RAID servicing this translates into the following average expected minutes of downtime per year, MTBO, and AOER%: RAID5 Dual Controllers RAID6 Dual Controllers 27 Hours RTO Down Time Per Year (Minutes/Seconds) 5.47 5.15 26 Hours RTO Down Time Per Year (Minutes/Seconds) 5.15 4.44 RAID5 Dual Controllers RAID6 Dual Controllers 27 Hours RTO Mean Time Between Outage (Hours) 2,422,215 2,609,305 26 Hours RTO Mean Time Between Outage (Hours) 2,457,736 2,916,754 RAID5 Dual Controllers RAID6 Dual Controllers 27 Hours RTO Annual Outage Event Rate 0.3616% 0.3357% 26 Hours RTO Annual Outage Event Rate 0.3564% 0.3003% PROOF OF SUCCESS WITH ACTUAL FIELD DATA Dot Hill works primarily with major vendors in the storage industry as an original equipment manufacturer (OEM). Under this business model it is the major vendors, and not Dot Hill, who perform the field maintenance of the installed base of SAN systems. These vendors all utilize the results of Dot Hill’s RAS Analysis to help maintain an inventory of spare FRUs, and all also track actual failures in the field. While this data are regularly shared with Dot Hill for field quality monitoring purposes, most vendors consider the data to be confidential. One vendor that takes particular pride in providing high availability for its customers, however, has agreed to permit Dot Hill to disclose aggregate results from its “Uptime Meter” used to track availability continuously and in real-time. The vendor utilizes Dot Hill’s AssuredSAN 4000 platform rebranded as a private label product. The Uptime Meter has confirmed Dot Hill’s claim of five-9’s availability over time from actual performance across its entire installed base of systems. Dot Hill Systems | 10 Dot Hill Systems White Paper CONCLUSION As demonstrated by the many measures outlined in this white paper, Dot Hill is committed to delivering the highest possible reliability and availability across our entire product line. Dot Hill also takes great pride in the fact that this commitment to quality and reliability has enabled us to deliver “carrier-class” five-9’s of availability in products with “enterprise-class” price/performance. No other company delivers such high availability at such a competitive price. But don’t just take our word for it. Compare for yourself Dot Hill’s high availability with any other product in its class. Evaluate the vendor’s commitment to quality and reliability by its design, manufacturing and testing processes. Look at the bottom line results. Talk to colleagues about their experiences. We are confident that if you do, you too will come to realize what many others already know: Dot Hill Systems delivers industryleading high reliability and availability. To learn more about how your organization can benefit from high-availability SAN solutions from Dot Hill, visit us on the Web at www.dothill.com or call us at 303-845-3200. © 2015 Copyright Dot Hill Systems Corporation. All rights reserved. Dot Hill Systems Corp., Dot Hill, the Dot Hill logo, and AssuredSAN are trademarks or registered trademarks of Dot Hill Systems. All other trademarks are the property of their respective companies in the United States and/or other countries. 7.10.15 Dot Hill Systems | Conclusion 11

Achieving Five-9's Availability: The Dot Hill Systems Commitment to

Related documents

Products

Support

Achieving Five-9's Availability: The Dot Hill Systems Commitment to

Related documents

Add this document to collection(s)

Add this document to saved

Suggest us how to improve StudyLib