Achieving Five-9's Availability: The Dot Hill Systems Commitment to

Dot Hill Systems White Paper
Achieving Five-9’s Availability:
The Dot Hill Systems
Commitment to Reliability
Dot Hill Systems | Introduction
1
Dot Hill Systems White Paper
INTRODUCTION
Businesses make great efforts to maximize the availability of their on-line data, as well as to minimize or
even eliminate the loss of any valuable data. RAID (Redundant Array of Independent Disks) hardware
configurations supplemented with intelligent software features all help to achieve these goals.
These measures are more effective and less expensive to maintain when they are built into a reliable
Storage Area Network (SAN) or a virtualized storage array foundation.
RAID availability is measured by percentage of uptime. When storage area network (SAN) data are
unavailable, so are the applications that must access this data. When system outages occur, most online work stops.
This is why any downtime of mission-critical applications is expensive. While figures vary by the type and
size of the organization, some industries, such as energy and telecommunications, report losses of
revenue from $33,000 to $47,000 (USD) for every minute of downtime, according to a study conducted
by Meta Group (subsequently acquired by Gartner, a leading information technology research and
advisory company).
Measures of RAID Availability:



99.99% or Four-9’s translates to about 53 minutes of downtime per year.
99.999% or Five-9’s of availability translates to just over 5 minutes of downtime annually.
99.9999% or Six-9’s of availability translates to just over 31 seconds of downtime annually.
These are statistical averages, but you can see that the additional 48 minutes of average annual average
uptime/productivity (at e.g. $30K-$47K/Min) by moving from Four-9’s to Five-9’s availability is important
to businesses (with projected $1.44 - 2.26M in productivity savings) that rely on a SAN infrastructure for
such mission-critical applications.
Dot Hill Systems understands the importance of high availability for its customers. Throughout its 30+
year history, Dot Hill has built a reputation for exceptional quality in both hardware and software. This
commitment has resulted in industry-leading high availability, which is embodied in the very name of
Dot Hill’s product family: AssuredSAN™.
This paper’s remaining content is divided into two sections followed by a brief conclusion. The section
on Designing for High Availability describes the measures Dot Hill takes during the engineering design
phase to ensure high reliability and availability. The measures taken to validate system reliability are
then covered in the section on Proof of Success with Rigorous Analysis and Field Data.
Dot Hill Systems | Introduction
2
Dot Hill Systems White Paper
DESIGNING FOR HIGH AVAILABILITY
High availability is achieved through a combination of three design elements:
1. Fault Avoidance - High reliability (measured by the Mean Time Between Failures or MTBF) of the
system and its several subsystems.
2. Fault Tolerance - Redundant subsystems to eliminate as many single points of failure as possible.
3. Serviceability - Rapid diagnosis and repair of any failure (measured by Mean Time to Repair or
MTTR) by using internal diagnostics code and adequately positioned Field Replaceable Units
(FRUs) for all critical subsystems.
The following equation for availability demonstrates the vital role of serviceability in the system’s
design. Maximum availability can be achieved only by minimizing the time it takes to diagnose and
perform a repair, which is reduced significantly by using FRUs.
MTBF
Availability = ———————
MTBF + MTTR
To achieve maximum availability, Designs must consider fault avoidance, fault tolerance, and
serviceability. This also improves its manufacturability. The specific aspects of each of these areas are
described in turn here.
DESIGN FOR RELIABILITY, AVAILABILITY, & SERVICEABILITY (RAS)
Designing hardware for high Reliability, Availability, and Serviceability involves both the system and its
several subsystems. To achieve high availability at the system level, Dot Hill integrates reliability into the
design process in several ways. The first and most obvious is the use of storage device (disk drive)
redundancy within RAID configurations (RAID 1, 3, 5, 6, 10 and 60). Next, we use dual power supplies
containing fans for the entire system in our standard 2U12 or 2U24 chassis RAIDs and JBODs. (JBOD
stands for Just a Bunch Of Disks, or a group of disks not connected through an intelligent controller.) In
our high density RAID and JBOD systems (2U48 and 4U56 chassis RAIDs), there are separate dual Fan
Control Modules (FCM) as field replaceable units (FRUs) to prevent over-heating (and avoid life
acceleration component failures) and also allow rapid field replacements - if ever needed. Even higher
availability is achieved by using redundant RAID and JBOD controllers. By eliminating as many single
points of failure in these critical subsystems as possible, the system itself continues to operate normally
during a failure of any single redundant FRU. While such FRU failures do factor into the subsystem’s
MTBF (its FRU rated reliability), it does not diminish the availability of the overall system itself since an
outage will only occur when there is an exposed single point of failure.
Dot Hill’s AssuredSAN architecture features full redundancy by design for every subsystem requiring a
significant number of active components. The mechanical chassis/midplane itself cannot be redundant,
Dot Hill Systems | Designing for High Availability
3
Dot Hill Systems White Paper
of course, since there is a single midplane that performs the simple function of connecting the
redundant controllers to the redundant disk drives. The midplane has minimal active components,
however, and Dot Hill selects these components for the highest possible reliability. The result is an
extraordinarily high MTBF for the chassis and its midplane, and therefore, virtually no impact is
projected on overall system availability.
To enhance system serviceability for the shortest possible Mean Time To Repair (MTTR), Dot Hill utilizes
the two complementary design techniques. The first is the use of a modular chassis with Field
Replaceable Units (FRUs). The ability to swap out a confirmed failed FRU subsystem quickly and easily
minimizes the time it takes to repair an installed system and restore it to full operating performance.
By utilizing such a modular FRU design, which provides convenient access to all subsystems, Dot Hill’s
AssuredSAN products can be maintained seamlessly with minimal to no disruption in service during
most repairs. Standard RAID chassis FRUs are illustrated in 1) below. Two other views 2), 3) of high
density RAID chassis with their drawer assemblies are shown next.
1) Standard 2U12 RAID/Chassis Assembly:
2) Ultra48 high density 2U48 RAID Chassis/Drawer Assembly:
Dot Hill Systems | Design for Reliability, Availability, & Serviceability (RAS)
4
Dot Hill Systems White Paper
3) Ultra56 high density 4U56 Chassis/Drawer Assembly:
Dot Hill’s standard chassis mechanical design enables the power supply with internal fan, JBOD
I/O module and RAID controller, and disk drives all to be serviced quickly as hot-swappable Field
Replaceable Units (FRUs). Being able to replace redundant FRUs while the system is fully
operational further enhances its availability. Note how the power supply and I/O module are
accessible from the rear of the chassis, while the disk drives are accessed from the front. Also
note that redundant FRUs are not shown here to identify the individual subsystems more clearly.
In addition, the higher part count HD chassis design is still able to maintain high availability by
separating the larger wattage power supply from the fans themselves by having redundant FRUs
for both the PSUs and Fan Control Modules or FCMs.
The second serviceability technique is immediate notification of any failure by messaging to our
customers. As expected, the longer it takes to detect a failure, the longer it will take to repair it. Time is
of the essence for another reason, however: the failure of a redundant subsystem creates, in effect, a
temporary single point of failure that increases the risk of a system-level outage. For this reason, the
firmware in all Dot Hill systems is designed to quickly detect, isolate and confirm any failure, initiate a
failover to a redundant subsystem, and provide immediate notification. The actual “messaging” of the
notification can also be configured to match operational procedures to ensure that on-duty (or on-call)
staff is properly and quickly notified via “phone home” features.
Dot Hill Systems | Design for Reliability, Availability, & Serviceability (RAS)
5
Dot Hill Systems White Paper
DESIGN FOR RELIABILITY (DFR)
At the FRU or subsystem level, Dot Hill utilizes five separate designs for reliability (DFR) techniques to
increase the reliability of the overall system by addressing each subsystem and FRU, while at the same
time also maximizing the inclusion of leading-edge SAN features.
The first DFR technique is to use only high quality parts. Higher quality parts cost more, of course, but
their superior performance and longer service lives normally contribute to a lower total cost of
ownership in the long-run. Despite the higher per-part cost, minimizing the part count, while
concurrently enhancing feature functionality, helps to improve the overall price/performance of a
highly-reliable design. For these reasons, Dot Hill utilizes only the highest quality parts available from
reputable suppliers.
The second DFR guideline is reducing the electronic component part count. Because any individual part
can fail, the fewer there are, the higher the inherent reliability of the subsystem. Dot Hill engineers must
support new features and still minimize the electronic parts required on all printed circuit boards and
other subsystem FRUs.
The third DFR technique involves the de-rating of selected electronic parts. Operating any part or
component at or near its rated capacities inevitably shortens its useful life. For critical parts, Dot Hill
selects only those that will be able to operate well below their maximum allowable specifications for
voltage, power and/or current. This can substantially increase their useful life, and therefore, the MTBF
of the subsystem.
The fourth DFR technique has been touched on earlier but, deserves more detail. RAID reliability is also
enhanced by using redundant FRU design techniques. Redundant FRU failover options are highly valued
for mission critical subsystems such as RAID Controllers, JBOD Input/Output Modules (IOMs), Fan
Control Modules, Power Supply Units (PSUs), and Drives.
The fifth DFR technique is unique to Dot Hill Systems: designing for software reliability. In modern
designs, software reliability is just as important as hardware reliability, and in some ways even more
important. The reason is: software bugs (including those in firmware) that cause downtime normally
take significantly longer to resolve than the more obvious hardware failures. Bugs are often dependent
upon system state (the set of circumstances leading up to the failure), making them difficult to
reproduce and isolate quickly, and any patch or update must be fully tested before it can be released.
Both add considerably to the MTTR for software failures, thereby adversely impacting on system
serviceability and availability.
To maximize software reliability, Dot Hill monitors its improvement via Software Reliability Growth and
Maturity Test techniques. Two primary metrics are utilized to assess software readiness. The first is the
Mean Time to Discovery (MTTD) of bugs to assess the maturity of all new product RAID systems
software and firmware during development. The second is the Reliability Growth Factor (RGF). The RGF
tracks the growth of the MTTD metric over time to indicate when adequate growth has been attained as
an input to our product release decisions. It is important to note that the MTTD and RGF metrics are not
Dot Hill Systems | Design for Reliability, Availability, & Serviceability (RAS)
6
Dot Hill Systems White Paper
industry metrics, but are instead software maturity metrics created by Dot Hill as part of the company’s
commitment to RAID systems quality and reliability. After extensive testing, all RAID software designs
must have a sufficiently high and stable MTTD and RGF before being released.
In addition, the new product system design is released to manufacturing and enters the production
phase only after passing three comprehensive tests. The Engineering Verification Test (EVT) and the
Design Verification Test (DVT) ensure that the system and/or subsystem(s) fully satisfy all design
specifications, including those for high reliability of both the hardware and software. These tests also
confirm that marginal variations in parts from component suppliers will not compromise system
reliability over the product’s useful life of a minimum of 10 years of design life. The Reliability
Demonstration Test (RDT) is a separate and rigorous evaluation of the final production hardware that
verifies its calculated reliability, availability and serviceability (covered below). Where some vendors use
only a few samples in a fairly short demonstration test, Dot Hill’s RDT uses 24 fully-configured HD chassis
to 30 fully-configured Standard chassis RAID + JBOD systems in an 8 – 12 week extended reliability test
performed at our CM’s location with final hardware and feature complete code that must end with a full
MTBF demonstration at 80% confidence to pass.
DESIGN FOR MANUFACTURABILITY (DFM)
To ensure high Reliability, Availability, and Serviceability, Dot Hill’s Quality Assurance (QA) Group
establishes a comprehensive set of assurance controls in parallel with the engineering team’s design
efforts. These controls ensure that the design facilitates manufacturing best practices for maintaining
high quality and reliability with high yields and minimal defects that support rapid fault isolation and
repair. Dot Hill’s DFM process also evaluates all components and suppliers for quality assurance using a
strict set of qualifications to ensure compliance with minimal failure rates.
All hardware is then manufactured under similar controls for the production process itself, which must
comply with all Dot Hill component and assembly specifications. These specifications are all outlined in
detail within the Supplier Quality Management Handbook (SQMH). Additional best practices during
manufacturing include burn-in of components, subsystems, and fully configured RAID systems to
minimize early infant mortality escape failures. Ongoing Reliability Testing (ORT) is performed on regular
mfg. samples. For the ORT process, random samples are regularly selected for a 4-week extended test to
ensure that the manufacturing process continues to yield the desired levels of quality and reliability by
not compromising the RAID design’s inherent capabilities.
PROOF OF SUCCESS WITH RIGOROUS ANALYSIS AND FIELD DATA
The reliability of any system can be assessed using a “bottom-up” analysis of its numerous electronic
component parts. While these analyses can be remarkably accurate, especially when conducted in
accordance with proven methodologies, it is also prudent to validate these calculated assessments with
actual data from production units in use at customer sites. Dot Hill does both.
Dot Hill Systems | Design for Manufacturability (DFM)
7
Dot Hill Systems White Paper
RELIABILITY, AVAILABILITY & SERVICEABILITY (RAS) ANALYSIS
Dot Hill’s RAS methodology models the design’s Reliability, Availability and Serviceability at both the
system and subsystem levels, and also provides estimates of RAID data protection levels. Componentlevel Telcordia MTBF predictions performed at 25°C and 40°C ambient environments are included for all
FRUs as a foundation for the analysis, and all analyses are performed in accordance with Telcordia’s SRNWT-000332 Reliability Prediction Procedure for Electronic Equipment. This Belcore/Telcordia
prediction methodology assumes a series model for the hardware MTBF based on any random hardware
failure occurrence (including those in redundant subsystems), regardless of whether or not it causes a
system-level outage. The predicted FRU level failure rate data is based on a combination of supplier life
test data, Dot Hill field returns data, and Telcordia’s component industry data.
Per the RAS analysis methodology, all FRUs are assembled via Reliability Block Diagram (RBD) methods
(shown in the figure below), and Monte Carlo simulations are applied to account for random FRU
failures, different data protection levels (Vdisks/LUNs for RAID5 N+1 and RAID6 N+2) and different
MTTRs; for example: four hours, twelve hours, twenty-six hours for service restore times for any FRU or
software failure. Recovery Time Objectives of 26 vs. 27 hours for MTTRs were determined to be the
breakpoint for this configuration when comparing RAID5 vs. RAID6. The analysis also assumes an
average 24 second failover time to the redundant RAID or JBOD IOM per Dot Hill internal test data under
expected IO conditions.
Dot Hill Systems | Reliability, Availability & Serviceability (RAS) Analysis
8
Dot Hill Systems White Paper
Depicted in this reliability block diagram (RBD) above is the AssuredSAN 6000 Series system
model. This includes a 4U56 chassis RAID+JBOD configuration Here the RAID chassis contains
dual power supplies, dual fan control modules, and dual SAS RAID controllers. While the JBOD
contains dual power supplies, dual fan control modules, and dual JBOD IOMs. Each RAID and
JBOD chassis each contains 56 x 4TB SAS disk drives (RAID5 and RAID6), resulting in two different
RAS analysis configurations for comparison. These statistically analyzed availability rates are
based on breakpoint/worse case Recovery Time Objectives (RTO) for RAID servicing times.
A summary of the RAS analysis results are shown in the following table:
RAID5 Dual
Controllers
RAID6 Dual
Controllers
27 Hour Recovery Time Objective (RTO)
99.9989%
99.999%
26 Hour Recovery Time Objective (RTO)
99.999%
99.9991%
Dot Hill Systems | Reliability, Availability & Serviceability (RAS) Analysis
9
Dot Hill Systems White Paper
Based on two selected Recovery Time Objectives (RTO) for RAID servicing this translates into the
following average expected minutes of downtime per year, MTBO, and AOER%:
RAID5 Dual
Controllers
RAID6 Dual
Controllers
27 Hours RTO Down Time Per Year
(Minutes/Seconds)
5.47
5.15
26 Hours RTO Down Time Per Year
(Minutes/Seconds)
5.15
4.44
RAID5 Dual
Controllers
RAID6 Dual
Controllers
27 Hours RTO Mean Time Between
Outage (Hours)
2,422,215
2,609,305
26 Hours RTO Mean Time Between
Outage (Hours)
2,457,736
2,916,754
RAID5 Dual
Controllers
RAID6 Dual
Controllers
27 Hours RTO Annual Outage Event Rate
0.3616%
0.3357%
26 Hours RTO Annual Outage Event Rate
0.3564%
0.3003%
PROOF OF SUCCESS WITH ACTUAL FIELD DATA
Dot Hill works primarily with major vendors in the storage industry as an original equipment
manufacturer (OEM). Under this business model it is the major vendors, and not Dot Hill, who perform
the field maintenance of the installed base of SAN systems. These vendors all utilize the results of Dot
Hill’s RAS Analysis to help maintain an inventory of spare FRUs, and all also track actual failures in the
field. While this data are regularly shared with Dot Hill for field quality monitoring purposes, most
vendors consider the data to be confidential.
One vendor that takes particular pride in providing high availability for its customers, however, has
agreed to permit Dot Hill to disclose aggregate results from its “Uptime Meter” used to track availability
continuously and in real-time. The vendor utilizes Dot Hill’s AssuredSAN 4000 platform rebranded as a
private label product. The Uptime Meter has confirmed Dot Hill’s claim of five-9’s availability over time
from actual performance across its entire installed base of systems.
Dot Hill Systems |
10
Dot Hill Systems White Paper
CONCLUSION
As demonstrated by the many measures outlined in this white paper, Dot Hill is committed to delivering
the highest possible reliability and availability across our entire product line. Dot Hill also takes great
pride in the fact that this commitment to quality and reliability has enabled us to deliver “carrier-class”
five-9’s of availability in products with “enterprise-class” price/performance.
No other company delivers such high availability at such a competitive price. But don’t just take our
word for it. Compare for yourself Dot Hill’s high availability with any other product in its class. Evaluate
the vendor’s commitment to quality and reliability by its design, manufacturing and testing processes.
Look at the bottom line results. Talk to colleagues about their experiences. We are confident that if you
do, you too will come to realize what many others already know: Dot Hill Systems delivers industryleading high reliability and availability.
To learn more about how your organization can benefit from high-availability SAN solutions from Dot
Hill, visit us on the Web at www.dothill.com or call us at 303-845-3200.
© 2015 Copyright Dot Hill Systems Corporation. All rights reserved. Dot Hill Systems Corp., Dot Hill, the Dot Hill
logo, and AssuredSAN are trademarks or registered trademarks of Dot Hill Systems. All other trademarks are the
property of their respective companies in the United States and/or other countries. 7.10.15
Dot Hill Systems | Conclusion
11