Data Center Investment vs. System Reliability in Power Distribution Systems Montri Wiboonrat Department of Computer Engineering, Faculty of Engineering, King Mongkut's Institute of Technology Ladkrabang, Thailand (montri.wi@kmitl.ac.th) annual downtime is $USD 260,000. In cloud computing service provider or internet service provider simply cannot afford to be offline that implies the ecosystem reliability and risk management are crucial [7]. Given the huge growth in data storage requirements with cloud access, IoT and big data, eliminating downtime is essential. Even a second outage can have a significant impact on an online business. To understand of downtime incidences, data center operators must be identifying and addressing to root causes of the outage. The risk of many incidences leads from root causes of downtime, which can be eliminating by monitoring and observing the best practices in system design and system redundancy from the outset of data center project. The scope of this research is concentrated on system reliability of power distribution system (PDS) in data center. The research question is based on the optimal system reliability output and system investment which, derived from 3 system topologies of data center PDS and 3 system investment of redundant component (Tier 2/Class F2), concurrent maintenance (Tier 3/Class F3), and fault tolerance (Tier 4/Class F4) respectively. This research purpose optimal method analysis (OMA) for balancing between data center investment versus system reliability in power distribution system. The concept of OMA is to collect currently available data into systems analysis. The OMA contains: 1. Methods of estimating the intrinsic reliability of power distribution systems of data center 2. Analyzes mean time between failures (MTBF) and failure rate (λ) of PDS of data center in level of components and systems 3. Suggests for considering probability of PDS failure in each Tier classification as costs of downtime 4. Optimizes Tier investment to eliminate PDS of data center as protecting costs 5. Guidelines to help designers and operators monitor system reliability through simulation and field appraisal programs Availability is “a measure of the degree to what an item is in an operable state when called upon to perform”. Availability is a measure of the percent (%) of time the equipment is in an operable state while reliability is a measure of how long the item performs its intended function [14]. Abstract – Power failures are the major causes of data center downtime. Reliability analysis of power distribution systems (PDS) of data center evokes representing a system as collection of subsystems to characterize the reliability of each part of the whole system. Therefore, understanding of reliability at the component level is fundamental to develop accurate estimate of system reliability and prevent system failures. Moreover, realization in the component level help to estimate overall investment of each systems topology. To improve system reliability, it can perform through redundant topology, however redundant components and systems is increasing investment and complexity of data center system. Balancing between cost of downtime and cost of reliability can be differed from business to business. This research purpose optimal method analysis (OMA) to quantify the proper investment against the cost of data center downtime. Keywords - Data center, power distribution systems, system reliability, optimal method analysis I. INTRODUCTION Financial impact of unplanned downtime was annihilated revenue and profitability that tied directly to system reliability of data centers. Data center downtime can have tremendous reputational implications as consequence of businesses that depend on platform operations. Technological investment has created interdependencies across critical infrastructure subject to improve system reliability. The design of reliable power distribution systems is important because of the high cost associate with power outages. According to a study from Ponemon Institute [9] shown in 2011 the average costs of data center downtime was approximately $USD 5,600 per minute. The research from Uptime [11] reports data center downtimes often happening and increasing. Since January 2016, Uptime Institute found power failures accounted for 36 percent of the highest cause of data center downtime. For enterprises with online business models that depend on the data center connectivity to support IT and networking services to clients such as e-commerce companies, internet data centers (IDC), or credit card companies, downtime can be extraordinarily costly, with the highest cost of an incidence more than $USD 11,000 per minute [10]. According to Ponemen Institute [8] research report, the data center downtime cost has been estimated around $7,793 USD per minute. The recent US-research firm Information Technology Intelligence Consulting (ITIC), in 2018, ITIC calculate that the average cost of this hour of 146 978-1-7281-0457-7/2019/$31.00 c 2019 IEEE redundancy in case of a unforeseen power outage. Difference from data center to data centers, not all data center’s redundancy power systems are created equal. Some offer N+1, 2N, and 2(N +1) redundancy topology as shown in Fig. 1. II. BACKGROUD A. Costs of Downtime Loss of revenue and sued by customers are the first things that need to concern, and they are critical. The results can be devastating, and that in itself ought to be sufficient to get businesses thinking seriously about their disaster recovery plan. However, there are many other significant areas, which can be affected by a data center downtime. In the categories of direct costs, these are the actual expenses incurred when a business experiences downtime, and can include: 1) The cost of getting everything back up and running. Depending on the type of business, and how much of it relies on IT, the cost of restarting systems and processes can be significant. 2) Project delays – often projects are linked. If a data center outage means one project is delayed, it can affect others that are tied to it. 3) Equipment costs – especially if there have been damage to the infrastructure. 4) Third party costs – if contractors or consultants are needed to resolve the outage. In the categories of indirect costs, these are costs that cannot be worked-out on a calculator, but still have an impact on a business’s bottom line: 1) Loss of business opportunities – a business cannot be working to attract new customers when everything’s offline. 2) Productivity loss – in the same way, not much get done when there is an outage. 3) Damage to reputation – if a business cannot fulfil customer requirements, it makes them lost reputation and trust. Fig. 1. Redundant topology N, N+1, 2N, and 2(N+1). The N setup is configured to have just the number components it needs to function; meaning that whenever one component fails the entire system fails. The N+1 configuration indicates that there is one extra component on-site regardless of the size of N. This results in one spare component that is to be used whenever a component fails as displayed in Fig. 1. The 2N configuration the entire system is duplicated resulting in very high levels of redundancy while requiring high investments and operating costs. The most redundant setup in use is the 2(N+1) configuration as it holds one spare component for every component even though the entire system is completely duplicated. To meet the standard requirements, such as Uptime [2], BICSI[3], TIA 942[4], and ISO 22237[5], for data center operations of IT equipment and to provide the appropriate level redundancy, extensive power distribution equipment is needed. As the equipment is costly and depreciates technically and economically rather rapidly capital expenditure is substantial. B. Cost of System Reliability System reliability is the probability the systems will function without failure over a given period of time when used under specified operating conditions. Probability axioms can be applied to dependent or independent reliability interoperability components. The system reliability of an event that is dependent on several independent events. The probability that a system failure is due to a certain cause. Given the reliability of components, mean time between failures (MTBF), failure rate (λ), and failure rate x average downtime per failure (hours per failure) or called restorability (λr) this can predict the reliability of the system. Moreover, understand the concept of topology of power distribution systems, this can predict the reliability of the system as reference in IEEE 493 [1]. A redundancy is engineering technique for duplication of critical components or systems with the intention of increasing reliability system. Normally uses in the case of backup or fail-safe. For financial businesses and state corporations have their data centers set up at either Tier 3 or Tier 4 because they offer a sufficient amount of III. SYSTEM ANALYSIS The PDS level of a whole data center, reliability system of electrical components has be combined with reliability of power distribution topology. By standards, Uptime/BICSI-002/TIA-942/ISO22237, can be explained in topology diagram (N, (N+1), 2N, 2(N+1)) as a sequence of series and parallel component’s connectivity as an ecosystem in PDS, as illustrated in Fig 2. The analysis of PDS is starting from incoming utility (1), short length of 147 cable (2) till cable termination (14) by applying failure rate, forced hours downtime and availability power distribution to calculation system availability of each topology diagram (Topology N and Topology 2N), as presented in Fig 3, and Fig 4, respectively. topology will be 227.6074008 x 11,000 = $USD 2,503,681.4088 per year. Fig. 2. Power distribution system as parallel systems of components. Fig. 3. Power distribution system for 480V with Topology N [1]. From the research analysis revealed that, MTBF is directly impact to failure rate (λ) and repair rate (μ or r) of system availability. An integral equation for the instantaneous time–dependent availability (A(t)) is usually performed on super computer as modern system reliability prediction. In case when the component repair and failure of cumulative distribution function (CDF) are given by e-μt and e-λt, respectively, then A(t) will be: TABLE I FAILURE RATE, FORCED HOURS DOWNTIME AND AVAILIBILITY POWER DISTRIBUTION OF FIG. 3. [1] (1) ; (2) (3) From equation (3) uses to apply for calculating Ai comparison between single path PDS of 480V with N (N=1) and dual path PDS of 480V with 2N (N=1); t = 525,600 minutes. The result shown availability Ai system comparison between N of PDC (Fig. 3) and 2N of PDC (Fig. 4) topology output level of system availability as 0.999511730 (Table I) vice versa, it will have downtime 0.00048827 or 256.634712 minutes per year and 0.999944773 (Table II) will have downtime 0.000055227 or 29.0273112 minutes per year. The analysis of costs of downtime differs between N and 2N based on $USD 11,000 per minute [10] will be (256.634712 - 29.0273112 = 227.6074008 minutes) at point of use. Therefore, the different costs of PDS 480V downtime between N and 2N A. Topology vs. System Reliability PDS of data center can be defining as a sense of reliability block diagram (RBD) [12]. A power distribution typically are series and parallel systems, used redundantly as one way to increase reliability because redundant parts 148 TABLE III ANNUAL PLANNED MAINTENACE [1] are available to keep the system fully operations [13]. Table III reveals the annual planned maintenance hours of each operational level that deploys Tier 2 permitted with 50-99 hours downtime, topology shown in Fig. 5; Tier 3 with 0-49 hours downtime, topology shown in Fig. 6; and Tier 4 with none downtime, topology shown in Fig. 7. TABLE II FAILURE RATE, FORCED HOURS DOWNTIME AND AVAILIBILITY POWER DISTRIBUTION OF FIG. 4. [1] Fig. 5. Tier 2: redundant capacity data center topology. UPS N Fig. 6. Teir 3: concurrent maintenance data center topology. This topology defines Tier [2] or availability Class [3] standards and level of permitted downtime. The availability Class model can be used to guide design and oerational decision for each critical services. The OMA can be evaluated in a mission crical data center, there are 4 factors that can be quantified: Fig. 4. Power distribution system for 480V with Topology 2N [1]. 149 Tier 1: $11,500/kW of redundant UPS capacity for IT Tier 2: $12,500/kW of redundant UPS capacity for IT Tier 3: $23,000/kW of redundant UPS capacity for IT Tier 4: $25,000/kW of redundant UPS capacity for IT 2) The “data center room or white space” buildout in all cases is $500 per sqm. of white space and this cost must be added to the “total kW” shown above. 3) The “facility room or gray space” buildout in all cases is $700 per sqm. of gray space. From case study of data center white space 3,000 sqm., needs buildout cost of white space at $500 per sqm., and gray space at $700 per sqm. This case study will has 1,000 racks (a rack per 3 sqm.) and each rack 5 kW then total of power requirement will be 5,000 kW. The total investment of each Tier data center shows in Table IV. • Operational requirements • Impact of downtime • Operational availability (Tier/Class) • Level of investment TABLE IV DATA CENTER INVESTMETN OF TIER CLASSIFICATION Topology Tier 1 White Buildout Space White S pace (sqm.) $500/sqm. 3,000 1,500,000 Buildout Gray Space $700/sqm. 2,100,000 Total kW (No. Rack at Tier $ per kW 5kW) 5,000 57,500,000 Total Investment $US 61,100,000 Tier 2 3,000 1,500,000 3,150,000 5,000 62,500,000 67,150,000 Tier 3 3,000 1,500,000 4,200,000 5,000 115,000,000 120,700,000 Tier 4 3,000 1,500,000 6,300,000 5,000 125,000,000 132,800,000 Fig. 7. Tier 4: fault tolerance data center topology. Data from Table V presents MTTR of each topology, which can be critical point of system recovery. According to Ponemen Institute [8] research report, the data center downtime cost has been estimated around $7,793 USD per minute and the duration of downtime per 5 year and MTTR from Table V. MTTR means downtime per each incurred that can be calculated in value of money such as: Tier 2: [(4.48 x 60) x $7,793] = $2,094,758.4 US Tier 3: [(1.64 x 60) x $7,793] = $766,831.2 US Tier 4: [(1.74 x 60) x $7,793] = $813,589.2 US B. Topology vs. Investment The OMA provides the minimum required level of reliability and redundany without over construction associated with facility’s overheard and investment of data center topology while deliberates the revenues lost per hour against with operational requirements of each Tier/Class opeational availability. The OMA model derived from cost of downtime and cost of reliability incurred that are proportional to the failure rate (λ) and repair rate (μ). The total effect on variable investment (I), if the value of downtime is a constant on a per minutely basis, then I will be expressed by equation: (4) where I is the investment of data center topology λ is the failure per year or failure rate xi is extra expenses incurred per failure dp is cost of downtime incurred per failure (minute) xp is variable expenses saved per minute μ is the repair or replacement time after a failure s is the plan recovery time after a failure In Tier 2: redundant capacity data centers require a ratio of 1 to 1.5 for facility supporting area or called gray space and data center area or called white space. In Tier 3: concurrent maintenance data center can consume 2-3 times more gray spacef than the white space. A Tier 4: fault tolerance data center can require up to 3-4 times more gray spacef than the white space. Below are the Uptime Institute’s cost estimates for building a data center broken down by 3 components [6]: 1) The kilowatt (kW) Cost Component by desired level of functionality: TABLE V MTBF, MTTR, AI, AND PROBABILITY OF FAILURE OF POWER DISTRIBUTION SYSTEM OF FIG. 5. 6. 7. From equation (3) and Table V the availability of each Topology can be calculated in a year (minutes) as: Tier 2 uptime: [(0.9999340 x 60) x 8,760]=525,565.31 downtime: 34.69 Tier 3 uptime: [(0.9999913 x 60) x 8,760]=525,595.43 150 downtime: 4.57272 Tier 4 uptime: [(0.9999914 x 60) x 8,760]=525,595.48 downtime: 4.52016 According to the concept of OMA, how to balancing between data center investment versus system reliability? The results show the intrinsic reliability of power distribution systems of data center between topology 2N and topology 2(N+1) seen to be not differed but topology N+1 and 2N much more differed. Analysis systems reliability 1: from equation (3), result of Ai from topology 2N is the best, as illustrated in Table V. Analysis systems reliability 2: from real practical maintenance, MTTR will be the most critical part on systems reliability, therefore, this case must be consider the shortage MTTR. Topology 2N is the best, as seen Table V. However, investment in topology is as an insurance costs or protecting costs. Analysis investment 1: from Table IV, investment for topology 2(N+1) and 2N is double of topology N+1. Analysis investment 2: downtime cost of topology N+1 is 3 threefold of topology 2N and 2(N+1). From equation (4) as concept of OMA, topology 2N seem to be the best solution of investment and systems reliability of power distribution system of data center. [3] BICSI-002, Data Center Design and Implementation Best Practices, BICSI 002-2014, December 9, 2014. [4] TIA-942-B, Telecommunications Infrastructure Standards for Data Centers, Jul 2017. [5] ISO 22237, Information Technology—Data Centre Facilities and Infrastructures, 2018. [6] 365 Data Centers, “Data Center Colocation Build vs. Buy,” 364 Data Centers, 2016. [7] CTEC, “Data Center: Jobs and Opportunities in Communities Nationwide” U.S Chamber of Commerce Technology Engagement Center, 2017. [8] Ponemon Institute, “Cost of Data Center Outage,” Ponemon Institute, January 2016. [9] Ponemon Institute, “Cost to Support Computer Capacity,” Ponemon Institute, August 2016. [10] Vertiv, “Understanding the Cost of Data Center Downtime: An Analysis of the Financial Impact of Infrastructure Vulnerability,” VertiveCo.com, 2016. [11] Uptime, “Uptime Institute Data Shows Outages are Common, Costly, and Preventable,” Uptime Institute Research, July 2018. [12] M. Wiboonrat, “Data Center Design of Optimal Reliable Systems,” IEEE Int. Conf. on Quality and Reliability, 2011, pp. 350-354. [13] D.J Klinger, Y. Nakada, and M.A. Menendez, AT&T Reliability Manual, Van Nosstrand-Reinhold, 1990. [14] Department of Defense, Definitions of Terms for Reliability and Maintainability, MIL-STD-721C, U.S. Government Printing Office, Washington, D.C., 1981. V. CONCLUSION The concept of optimal method analysis (OMA) uses to examine the best option of data center investment and system reliability of power distribution systems (PDS). This research use case is Data Center Tier Classification (Tier 2/Class F2: topology N+1, Tier 3/Class F3: topology 2N, Tier 4/Class F4: topology 2(N+1)) that describes topology of PDS of data center each Tier. The analysis of systems reliability has been conducted to quantify the failure rate (λ), repair rate (μ), and systems availability (Ai) of each Tier while assesses and compares data center investment in each Tier as well. The research result reveals proper investment against the cost of downtime of topology 2N or Tier 3/Class F3 is the best option to consider subject to the best systems reliability of less mean time to repair (MTTR), the highest systems availability (Ai), and the lowest of downtime costs per year. However, the investment may be not the lowest but when compares with topology 2(N+1) or Tier 4/Class F4, topology 2N is still better in term of investment and systems reliability that received. REFERENCES [1] IEEE Std. 493, Recommended practice for the design of reliability industrial and commercial power systems, Revision of IEEE Std. 493-1997, Gold Book, February 7, 2007. [2] Uptime Institute, Data Center Site Infrastructure Tier Standard: Topology. Uptime Institute Professional Services, LLC. Uptime Institute, 2014. 151