Data Center Investment vs. System Reliability in Power Distribution Systems
Montri Wiboonrat
Department of Computer Engineering, Faculty of Engineering,
King Mongkut's Institute of Technology Ladkrabang, Thailand
(montri.wi@kmitl.ac.th)
annual downtime is $USD 260,000. A cloud computing
service provider or internet service provider simply cannot
afford to be offline, which implies that ecosystem
reliability and risk management are crucial [7]. Given the
huge growth in data storage requirements driven by cloud
access, IoT, and big data, eliminating downtime is
essential. Even a one-second outage can have a significant
impact on an online business.
To understand downtime incidents, data center operators
must identify and address the root causes of each outage.
Many incidents stem from root causes that can be
eliminated by monitoring and by observing best practices
in system design and system redundancy from the outset
of the data center project.
The scope of this research is the system reliability of
the power distribution system (PDS) in a data center. The
research question concerns the optimal balance between
system reliability output and system investment, derived
from 3 system topologies of data center PDS and 3 levels
of system investment: redundant components (Tier 2/Class
F2), concurrent maintenance (Tier 3/Class F3), and fault
tolerance (Tier 4/Class F4), respectively. This research
proposes optimal method analysis (OMA) for balancing
data center investment against system reliability in the
power distribution system. The concept of OMA is to
bring currently available data into a systems analysis. The
OMA contains:
1. Methods of estimating the intrinsic reliability of
the power distribution systems of a data center
2. Analysis of mean time between failures (MTBF) and
failure rate (λ) of the data center PDS at the level of
components and systems
3. Treatment of the probability of PDS failure in each
Tier classification as a cost of downtime
4. Optimization of Tier investment, treated as a
protection cost, to eliminate data center PDS failures
5. Guidelines to help designers and operators monitor
system reliability through simulation and field
appraisal programs
Availability is “a measure of the degree to which an item
is in an operable state when called upon to perform”.
Availability measures the percentage (%) of time the
equipment is in an operable state, while reliability
measures how long the item performs its intended
function [14].
Abstract – Power failures are the major cause of data
center downtime. Reliability analysis of the power
distribution systems (PDS) of a data center involves
representing a system as a collection of subsystems in
order to characterize the reliability of each part of the
whole system. Therefore, understanding reliability at the
component level is fundamental to developing an accurate
estimate of system reliability and preventing system
failures. Moreover, analysis at the component level helps
to estimate the overall investment of each system topology.
System reliability can be improved through redundant
topology; however, redundant components and systems
increase the investment and complexity of the data center
system. The balance between cost of downtime and cost of
reliability differs from business to business. This research
proposes optimal method analysis (OMA) to quantify the
proper investment against the cost of data center downtime.
Keywords - Data center, power distribution systems,
system reliability, optimal method analysis
I. INTRODUCTION
The financial impact of unplanned downtime is lost
revenue and profitability, which ties it directly to the
system reliability of data centers. Data center downtime
can also have tremendous reputational implications for
businesses that depend on platform operations.
Technological investment has created interdependencies
across critical infrastructure in order to improve system
reliability. The design of reliable power distribution
systems is important because of the high cost associated
with power outages. According to a study from Ponemon
Institute [9], in 2011 the average cost of data center
downtime was approximately $USD 5,600 per minute.
Research from Uptime [11] reports that data center
downtime occurs often and is increasing. Since January
2016, Uptime Institute has found that power failures were
the leading cause of data center downtime, accounting for
36 percent of outages. For enterprises with online business
models that depend on data center connectivity to support
IT and networking services to clients, such as e-commerce
companies, internet data centers (IDC), or credit card
companies, downtime can be extraordinarily costly, with
the highest cost of an incident exceeding $USD 11,000 per
minute [10]. According to a Ponemon Institute research
report [8], the cost of data center downtime has been
estimated at around $7,793 USD per minute. In 2018, the
US research firm Information Technology Intelligence
Consulting (ITIC) calculated that the average cost of this
hour of
978-1-7281-0457-7/2019/$31.00 © 2019 IEEE
redundancy in case of an unforeseen power outage.
Redundant power systems differ from data center to data
center and are not all created equal. Some offer N+1, 2N,
or 2(N+1) redundancy topology, as shown in Fig. 1.
II. BACKGROUND
A. Costs of Downtime
Loss of revenue and lawsuits from customers are the
first concerns, and they are critical. The results can be
devastating, and that in itself ought to be sufficient to get
businesses thinking seriously about their disaster recovery
plan. However, many other significant areas can be
affected by data center downtime.
Direct costs are the actual expenses incurred when a
business experiences downtime, and can include:
1) The cost of getting everything back up and
running. Depending on the type of business, and how much
of it relies on IT, the cost of restarting systems and
processes can be significant.
2) Project delays – projects are often linked. If a data
center outage delays one project, it can affect others that
are tied to it.
3) Equipment costs – especially if there has been
damage to the infrastructure.
4) Third party costs – if contractors or consultants are
needed to resolve the outage.
Indirect costs are costs that cannot be worked out on a
calculator, but still have an impact on a business’s bottom
line:
1) Loss of business opportunities – a business cannot
work to attract new customers when everything is
offline.
2) Productivity loss – in the same way, not much gets
done when there is an outage.
3) Damage to reputation – if a business cannot fulfil
customer requirements, it loses reputation and trust.
Fig. 1. Redundant topology N, N+1, 2N, and 2(N+1).
The N setup is configured to have just the number of
components it needs to function, meaning that whenever
one component fails the entire system fails.
The N+1 configuration indicates that there is one extra
component on-site regardless of the size of N. This results
in one spare component that is used whenever a
component fails, as displayed in Fig. 1.
In the 2N configuration the entire system is duplicated,
resulting in very high levels of redundancy while requiring
high investment and operating costs.
The most redundant setup in use is the 2(N+1)
configuration: the entire system is completely duplicated,
and each of the two systems also holds one spare
component.
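The four configurations can be compared numerically with a small sketch. It rests on assumptions that are not in the paper: all components are identical and fail independently, the load needs n working components, and each duplicated system in 2N and 2(N+1) is an isolated path. The component availability of 0.99 and the function names are hypothetical.

```python
from math import comb


def k_of_n_availability(k: int, n: int, a: float) -> float:
    """Probability that at least k of n independent, identical components are up."""
    return sum(comb(n, i) * a**i * (1 - a) ** (n - i) for i in range(k, n + 1))


def topology_availability(topology: str, n: int, a: float) -> float:
    """Availability of a redundancy topology for a load that needs n components."""
    if topology == "N":
        return k_of_n_availability(n, n, a)
    if topology == "N+1":
        return k_of_n_availability(n, n + 1, a)
    if topology == "2N":
        single_path = k_of_n_availability(n, n, a)
        return 1 - (1 - single_path) ** 2  # either duplicated path suffices
    if topology == "2(N+1)":
        single_path = k_of_n_availability(n, n + 1, a)
        return 1 - (1 - single_path) ** 2
    raise ValueError(f"unknown topology: {topology}")


component_availability = 0.99  # hypothetical value, not taken from the paper
for name in ("N", "N+1", "2N", "2(N+1)"):
    print(name, topology_availability(name, n=2, a=component_availability))
```

Under these simplifying assumptions N+1 and 2N coincide when n = 1, and their ranking for larger n depends on the model; common-cause and shared-path failures, which the sketch ignores, are not captured.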
To meet standard requirements for data center
operations of IT equipment, such as Uptime [2], BICSI [3],
TIA-942 [4], and ISO 22237 [5], and to provide the
appropriate level of redundancy, extensive power
distribution equipment is needed. As the equipment is
costly and depreciates rather rapidly, both technically and
economically, capital expenditure is substantial.
B. Cost of System Reliability
System reliability is the probability that the system will
function without failure over a given period of time when
used under specified operating conditions. Probability
axioms can be applied to dependent or independent
components, for example to the reliability of a system
that depends on several independent events, or to the
probability that a system failure is due to a certain cause.
Given the reliability of the components, expressed as mean
time between failures (MTBF), failure rate (λ), and the
product of failure rate and average downtime per failure
(hours per failure), called restorability (λr), the reliability
of the system can be predicted. Together with the topology
of the power distribution system, these values predict the
reliability of the system as described in IEEE 493 [1].
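A minimal sketch of how component values of λ and hours per failure (r) can be combined follows. It uses the common series and two-branch parallel approximations for independent repairable components; it is assumed, not confirmed by the text, that this matches the procedure in IEEE 493 [1]. The component data and function names are hypothetical.

```python
HOURS_PER_YEAR = 8760.0


def series(components):
    """Combine (failures_per_year, hours_per_failure) pairs connected in series."""
    lam = sum(rate for rate, _ in components)
    lam_r = sum(rate * hours for rate, hours in components)  # forced hours down per year
    return lam, lam_r / lam


def parallel(branch_a, branch_b):
    """Approximate two independent repairable branches in parallel."""
    (lam_a, r_a), (lam_b, r_b) = branch_a, branch_b
    lam = lam_a * lam_b * (r_a + r_b) / HOURS_PER_YEAR
    r = r_a * r_b / (r_a + r_b)
    return lam, r


def availability(lam, r):
    """Inherent availability MTBF / (MTBF + MTTR), with MTBF in hours."""
    mtbf = HOURS_PER_YEAR / lam
    return mtbf / (mtbf + r)


# Hypothetical single path: utility feed, transformer, cable termination.
path = series([(0.5, 2.0), (0.01, 10.0), (0.002, 20.0)])
dual = parallel(path, path)
print("single path:", path, availability(*path))
print("dual path:  ", dual, availability(*dual))
```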
Redundancy is an engineering technique that duplicates
critical components or systems with the intention of
increasing system reliability. It is normally used for
backup or fail-safe purposes. Financial businesses and state
corporations set up their data centers at either Tier 3
or Tier 4 because these offer a sufficient amount of
III. SYSTEM ANALYSIS
At the PDS level of a whole data center, the reliability
of the electrical components has to be combined with the
reliability of the power distribution topology. Under the
Uptime/BICSI-002/TIA-942/ISO 22237 standards, a
topology (N, N+1, 2N, 2(N+1)) can be drawn as a sequence
of series and parallel component connections that form an
ecosystem in the PDS, as illustrated in Fig. 2. The analysis
of the PDS runs from the incoming utility (1) and a short
length of cable (2) to the cable termination (14), applying
the failure rate, forced hours of downtime, and availability
of each power distribution component to calculate the
system availability of each topology diagram (Topology N
and Topology 2N), as presented in Fig. 3 and Fig. 4,
respectively.
topology will be 227.6074008 x 11,000 = $USD
2,503,681.4088 per year.
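The downtime cost comparison between topology N and topology 2N can be checked with a few lines of arithmetic. The availabilities (0.999511730 and 0.999944773) and the $USD 11,000 per minute figure come from the text; the function and variable names are illustrative only.

```python
MINUTES_PER_YEAR = 525_600
COST_PER_MINUTE_USD = 11_000  # highest per-minute cost cited from [10]


def annual_downtime_minutes(availability: float) -> float:
    """Expected minutes of downtime per year for a given availability."""
    return (1 - availability) * MINUTES_PER_YEAR


downtime_n = annual_downtime_minutes(0.999511730)   # topology N (Table I)
downtime_2n = annual_downtime_minutes(0.999944773)  # topology 2N (Table II)
saved_minutes = downtime_n - downtime_2n
saved_cost_usd = saved_minutes * COST_PER_MINUTE_USD

print(f"N:  {downtime_n:.6f} min/year")
print(f"2N: {downtime_2n:.7f} min/year")
print(f"difference: {saved_minutes:.7f} min/year = ${saved_cost_usd:,.2f} per year")
```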
Fig. 2. Power distribution system as parallel systems of components.
Fig. 3. Power distribution system for 480V with Topology N [1].
The research analysis reveals that MTBF directly
determines the failure rate (λ), which together with the
repair rate (μ or r) determines system availability. In
modern system reliability prediction, the integral equation
for the instantaneous time-dependent availability A(t) is
usually evaluated on a supercomputer. When the
cumulative distribution functions (CDF) of component
repair and failure are characterized by e^(-μt) and e^(-λt),
respectively, A(t) will be:
TABLE I
FAILURE RATE, FORCED HOURS DOWNTIME AND AVAILABILITY
POWER DISTRIBUTION OF FIG. 3. [1]
(1)
;
(2)
(3)
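For reference, the standard textbook result for a single repairable component with exponential failure and repair times is sketched below. It is assumed, not confirmed by the text, that equations (1) to (3) take this form; the identification MTBF = 1/λ and MTTR = 1/μ is part of that assumption.

```latex
% Time-dependent availability, assuming the component is up at t = 0
A(t) = \frac{\mu}{\lambda + \mu} + \frac{\lambda}{\lambda + \mu}\, e^{-(\lambda + \mu)t}

% Steady-state limit
\lim_{t \to \infty} A(t) = \frac{\mu}{\lambda + \mu}

% Inherent availability, with MTBF = 1/\lambda and MTTR = 1/\mu
A_i = \frac{\mathrm{MTBF}}{\mathrm{MTBF} + \mathrm{MTTR}}
```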
Equation (3) is applied to calculate Ai for a single-path
480 V PDS with topology N (N=1) and a dual-path 480 V
PDS with topology 2N (N=1), with t = 525,600 minutes
(one year). For topology N (Fig. 3), system availability is
0.999511730 (Table I), which corresponds to an
unavailability of 0.00048827, or 256.634712 minutes of
downtime per year. For topology 2N (Fig. 4), system
availability is 0.999944773 (Table II), which corresponds
to an unavailability of 0.000055227, or 29.0273112 minutes
of downtime per year. The difference in downtime at the
point of use between N and 2N is 256.634712 - 29.0273112
= 227.6074008 minutes per year. Therefore, based on $USD
11,000 per minute [10], the difference in the cost of 480 V
PDS downtime between N and 2N
A. Topology vs. System Reliability
The PDS of a data center can be defined in terms of a
reliability block diagram (RBD) [12]. A power distribution
system is typically a combination of series and parallel
systems, with redundancy used as one way to increase
reliability because redundant parts are available to keep
the system fully operational [13]. Table III shows the
annual planned maintenance hours of each operational
level: Tier 2 is permitted 50-99 hours of downtime
(topology shown in Fig. 5), Tier 3 is permitted 0-49 hours
of downtime (topology shown in Fig. 6), and Tier 4 is
permitted no downtime (topology shown in Fig. 7).
TABLE III
ANNUAL PLANNED MAINTENANCE [1]
TABLE II
FAILURE RATE, FORCED HOURS DOWNTIME AND AVAILABILITY
POWER DISTRIBUTION OF FIG. 4. [1]
Fig. 5. Tier 2: redundant capacity data center topology.
Fig. 6. Tier 3: concurrent maintenance data center topology.
This topology defines the Tier [2] or availability Class [3]
standard and the level of permitted downtime. The
availability Class model can be used to guide design and
operational decisions for each critical service. When the
OMA is evaluated for a mission-critical data center, there
are 4 factors that can be quantified:
Fig. 4. Power distribution system for 480V with Topology 2N [1].
Tier 1: $11,500/kW of redundant UPS capacity for IT
Tier 2: $12,500/kW of redundant UPS capacity for IT
Tier 3: $23,000/kW of redundant UPS capacity for IT
Tier 4: $25,000/kW of redundant UPS capacity for IT
2) The “data center room or white space” buildout in
all cases is $500 per sqm. of white space and this cost must
be added to the “total kW” shown above.
3) The “facility room or gray space” buildout in all
cases is $700 per sqm. of gray space.
The case study is a data center with 3,000 sqm. of white
space, with a buildout cost of $500 per sqm. for white space
and $700 per sqm. for gray space. This case study has 1,000
racks (one rack per 3 sqm.) at 5 kW per rack, so the total
power requirement is 5,000 kW. The total investment for
each Tier of data center is shown in Table IV.
• Operational requirements
• Impact of downtime
• Operational availability (Tier/Class)
• Level of investment
TABLE IV
DATA CENTER INVESTMENT OF TIER CLASSIFICATION
Topology | White Space (sqm.) | Buildout White Space ($500/sqm.) | Buildout Gray Space ($700/sqm.) | Total kW (No. Rack at 5kW) | Tier $ per kW | Total Investment ($US)
Tier 1 | 3,000 | 1,500,000 | 2,100,000 | 5,000 | 57,500,000 | 61,100,000
Tier 2 | 3,000 | 1,500,000 | 3,150,000 | 5,000 | 62,500,000 | 67,150,000
Tier 3 | 3,000 | 1,500,000 | 4,200,000 | 5,000 | 115,000,000 | 120,700,000
Tier 4 | 3,000 | 1,500,000 | 6,300,000 | 5,000 | 125,000,000 | 132,800,000
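The Table IV totals can be reproduced from the unit costs in the text. In the sketch below, the gray-space multipliers (1, 1.5, 2, and 3 times the white space) are inferred from the gray-space buildout figures in Table IV rather than stated explicitly for this case study; the names are illustrative.

```python
WHITE_SPACE_SQM = 3_000
WHITE_COST_PER_SQM = 500
GRAY_COST_PER_SQM = 700
SQM_PER_RACK = 3
KW_PER_RACK = 5

# usd_per_kw comes from the Tier cost list; gray_ratio is inferred from Table IV.
TIERS = {
    "Tier 1": {"usd_per_kw": 11_500, "gray_ratio": 1.0},
    "Tier 2": {"usd_per_kw": 12_500, "gray_ratio": 1.5},
    "Tier 3": {"usd_per_kw": 23_000, "gray_ratio": 2.0},
    "Tier 4": {"usd_per_kw": 25_000, "gray_ratio": 3.0},
}


def tier_investment(usd_per_kw: float, gray_ratio: float) -> float:
    """Total investment: white space buildout + gray space buildout + kW cost."""
    racks = WHITE_SPACE_SQM // SQM_PER_RACK
    total_kw = racks * KW_PER_RACK
    white = WHITE_SPACE_SQM * WHITE_COST_PER_SQM
    gray = WHITE_SPACE_SQM * gray_ratio * GRAY_COST_PER_SQM
    power = total_kw * usd_per_kw
    return white + gray + power


totals = {name: tier_investment(**params) for name, params in TIERS.items()}
for name, total in totals.items():
    print(f"{name}: ${total:,.0f}")
```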
Fig. 7. Tier 4: fault tolerance data center topology.
Table V presents the MTTR of each topology, which
can be the critical point of system recovery. According
to a Ponemon Institute research report [8], the cost of data
center downtime has been estimated at around $7,793 USD
per minute; the duration of downtime per 5 years and the
MTTR are taken from Table V. MTTR is the downtime (in
hours) of each incident, which can be expressed in
monetary terms as:
Tier 2: [(4.48 x 60) x $7,793] = $2,094,758.4 US
Tier 3: [(1.64 x 60) x $7,793] = $766,831.2 US
Tier 4: [(1.74 x 60) x $7,793] = $813,589.2 US
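The per-incident costs above follow directly from the MTTR values (in hours) and the $7,793 per minute estimate, as this short check shows. The dictionary and function names are illustrative.

```python
COST_PER_MINUTE_USD = 7_793  # Ponemon Institute estimate cited as [8]
MTTR_HOURS = {"Tier 2": 4.48, "Tier 3": 1.64, "Tier 4": 1.74}  # from Table V


def cost_per_incident(mttr_hours: float) -> float:
    """Cost of one outage lasting mttr_hours, priced per minute."""
    return mttr_hours * 60 * COST_PER_MINUTE_USD


incident_cost = {tier: cost_per_incident(h) for tier, h in MTTR_HOURS.items()}
for tier, cost in incident_cost.items():
    print(f"{tier}: ${cost:,.1f}")
```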
B. Topology vs. Investment
The OMA provides the minimum required level of
reliability and redundancy without the over-construction
associated with facility overhead and investment in data
center topology, while weighing the revenue lost per hour
against the operational requirements of each Tier/Class of
operational availability. The OMA model is derived from
the cost of downtime and the cost of reliability incurred,
which are proportional to the failure rate (λ) and repair
rate (μ). If the value of downtime is constant on a
per-minute basis, the total effect on the variable
investment (I) is expressed by the equation:
(4)
where
I is the investment of data center topology
λ is the failure per year or failure rate
xi is extra expenses incurred per failure
dp is cost of downtime incurred per failure (minute)
xp is variable expenses saved per minute
μ is the repair or replacement time after a failure
s is the planned recovery time after a failure
Tier 2 redundant capacity data centers require a ratio
of 1 to 1.5 between the data center area, called white
space, and the facility supporting area, called gray space.
A Tier 3 concurrent maintenance data center can consume
2-3 times more gray space than white space. A Tier 4 fault
tolerance data center can require up to 3-4 times more gray
space than white space. Below are the Uptime Institute’s
cost estimates for building a data center, broken down into
3 components [6]:
1) The kilowatt (kW) Cost Component by desired
level of functionality:
TABLE V
MTBF, MTTR, AI, AND PROBABILITY OF FAILURE OF POWER
DISTRIBUTION SYSTEM OF FIGS. 5, 6, AND 7
From equation (3) and Table V, the availability of each
topology can be expressed in minutes per year as:
Tier 2 uptime: [(0.9999340 x 60) x 8,760] = 525,565.31
downtime: 34.69
Tier 3 uptime: [(0.9999913 x 60) x 8,760] = 525,595.43
downtime: 4.57272
Tier 4 uptime: [(0.9999914 x 60) x 8,760] = 525,595.48
downtime: 4.52016
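The uptime and downtime figures above can be reproduced as follows, using the Ai values quoted from Table V. The names are illustrative.

```python
MINUTES_PER_HOUR = 60
HOURS_PER_YEAR = 8_760
AVAILABILITY = {"Tier 2": 0.9999340, "Tier 3": 0.9999913, "Tier 4": 0.9999914}


def annual_minutes(ai: float) -> tuple[float, float]:
    """Return (uptime, downtime) in minutes per year for availability ai."""
    uptime = ai * MINUTES_PER_HOUR * HOURS_PER_YEAR
    downtime = MINUTES_PER_HOUR * HOURS_PER_YEAR - uptime
    return uptime, downtime


minutes = {tier: annual_minutes(ai) for tier, ai in AVAILABILITY.items()}
for tier, (up, down) in minutes.items():
    print(f"{tier}: uptime {up:,.2f} min, downtime {down:.5f} min")
```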
According to the concept of OMA, how should data
center investment be balanced against system reliability?
The results show that the intrinsic reliability of the data
center power distribution system barely differs between
topology 2N and topology 2(N+1), but differs much more
between topology N+1 and topology 2N.
Analysis systems reliability 1: from equation (3), the
Ai of topology 2N is effectively the best: it is within
0.0000001 of topology 2(N+1) and well above topology
N+1, as illustrated in Table V.
Analysis systems reliability 2: in practical
maintenance, MTTR is the most critical factor in systems
reliability; therefore, the shortest MTTR must be
considered. Topology 2N is the best, as seen in Table V.
However, investment in topology acts as an insurance
cost or protection cost.
Analysis investment 1: from Table IV, the investment for
topology 2(N+1) and 2N is roughly double that of topology
N+1.
Analysis investment 2: the downtime cost of topology N+1
is nearly threefold that of topology 2N and 2(N+1).
From equation (4) and the concept of OMA, topology 2N
appears to be the best solution for balancing investment
and systems reliability in the power distribution system of
a data center.
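The two investment comparisons can be quantified from figures already given: the Table IV totals and the per-incident downtime costs derived from MTTR. The sketch below only computes ratios relative to topology N+1; it does not implement equation (4), and the names are illustrative.

```python
# Total investment per topology from Table IV (USD).
INVESTMENT_USD = {"N+1": 67_150_000, "2N": 120_700_000, "2(N+1)": 132_800_000}
# Cost of one incident from the MTTR calculation (USD).
INCIDENT_COST_USD = {"N+1": 2_094_758.4, "2N": 766_831.2, "2(N+1)": 813_589.2}

# Investment of each topology as a multiple of the N+1 investment.
investment_ratio = {
    name: value / INVESTMENT_USD["N+1"] for name, value in INVESTMENT_USD.items()
}
# How many times more an N+1 incident costs than an incident in each topology.
incident_ratio = {
    name: INCIDENT_COST_USD["N+1"] / cost for name, cost in INCIDENT_COST_USD.items()
}

for name in ("2N", "2(N+1)"):
    print(
        f"{name}: investment x{investment_ratio[name]:.2f} of N+1, "
        f"N+1 incident cost x{incident_ratio[name]:.2f} higher"
    )
```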
[3] BICSI-002, Data Center Design and Implementation Best
Practices, BICSI 002-2014, December 9, 2014.
[4] TIA-942-B, Telecommunications Infrastructure Standards
for Data Centers, Jul 2017.
[5] ISO 22237, Information Technology—Data Centre Facilities
and Infrastructures, 2018.
[6] 365 Data Centers, “Data Center Colocation Build vs. Buy,”
365 Data Centers, 2016.
[7] CTEC, “Data Center: Jobs and Opportunities in
Communities Nationwide,” U.S. Chamber of Commerce
Technology Engagement Center, 2017.
[8] Ponemon Institute, “Cost of Data Center Outage,” Ponemon
Institute, January 2016.
[9] Ponemon Institute, “Cost to Support Computer Capacity,”
Ponemon Institute, August 2016.
[10] Vertiv, “Understanding the Cost of Data Center Downtime:
An Analysis of the Financial Impact of Infrastructure
Vulnerability,” VertivCo.com, 2016.
[11] Uptime, “Uptime Institute Data Shows Outages are
Common, Costly, and Preventable,” Uptime Institute
Research, July 2018.
[12] M. Wiboonrat, “Data Center Design of Optimal Reliable
Systems,” IEEE Int. Conf. on Quality and Reliability, 2011,
pp. 350-354.
[13] D. J. Klinger, Y. Nakada, and M. A. Menendez, AT&T
Reliability Manual, Van Nostrand Reinhold, 1990.
[14] Department of Defense, Definitions of Terms for Reliability
and Maintainability, MIL-STD-721C, U.S. Government
Printing Office, Washington, D.C., 1981.
V. CONCLUSION
The concept of optimal method analysis (OMA) is used
to examine the best option for data center investment and
system reliability of power distribution systems (PDS).
The research use case is the Data Center Tier
Classification (Tier 2/Class F2: topology N+1, Tier 3/Class
F3: topology 2N, Tier 4/Class F4: topology 2(N+1)), which
describes the PDS topology of the data center in each Tier.
The analysis of systems reliability quantifies the failure
rate (λ), repair rate (μ), and systems availability (Ai) of
each Tier, and also assesses and compares the data center
investment in each Tier. The research result reveals that,
weighing proper investment against the cost of downtime,
topology 2N or Tier 3/Class F3 is the best option to
consider: it offers the shortest mean time to repair
(MTTR), effectively the highest systems availability (Ai),
and the lowest downtime cost. Its investment is not the
lowest, but compared with topology 2(N+1) or Tier 4/Class
F4, topology 2N is still better in terms of the systems
reliability received for the investment.
REFERENCES
[1] IEEE Std. 493, IEEE Recommended Practice for the Design
of Reliable Industrial and Commercial Power Systems,
Revision of IEEE Std. 493-1997, Gold Book, February 7,
2007.
[2] Uptime Institute, Data Center Site Infrastructure Tier
Standard: Topology. Uptime Institute Professional Services,
LLC. Uptime Institute, 2014.