focus on your server operating system

advertisement
WANT TO SLASH DOWNTIME?
FOCUS ON YOUR SERVER OPERATING SYSTEM
Ask any IT executive their biggest day-to-day fear, and they’ll likely tell you
something about reducing or even eliminating downtime. Not only does downtime
have a significant impact on an organization’s operating efficiency, brand
reputation and regulatory requirements, but its economic costs can also be huge.
CONTENTS:
• Why the Operating
System Matters
in Reducing
Downtime
Planned service
interruptions
Unplanned service
interruptions
• Considering Your
Options: SUSE
Linux Enterprise
Data from research organization Ponemon Institute estimates that a single
minute of downtime costs about $5,000 per minute.1 Of course, for uptimedependent applications like e-commerce and other transaction-intensive
requirements, the cost can be even greater. Amazon.com in 2013 experienced
a downtime event of nearly an hour that, based on Amazon’s revenues at the
time, may have cost the company as much as $4.9 million in deferred revenue.
But when IT organizations and their business stakeholders plan strategies
to significantly pare downtime, it makes sense to talk about two key issues.
The first is understanding the differences and causes of planned downtime—
the process of doing scheduled and anticipated maintenance that is known
in advance—and those affecting unplanned downtime. While unplanned
downtime has the biggest potential for catastrophic impact, and is far more
likely to garner press coverage and affect IT leaders’ job security, dealing with
planned downtime is every bit as important a concern because it happens far
more frequently than unplanned downtime.
The second key issue is to consider the causes of downtime and the best
solutions in improving system availability. While achieving the ideal goal of zero
downtime may be a physical impossibility for organizations, getting as close
to that goal as possible is a very high priority for IT departments charged with
keeping systems up and running and ensuring the availability of critical data,
applications and services.
In each case, however, it’s becoming increasingly important for IT leaders
to pay more attention to the importance of server operating systems as an
1
“Calculating the Cost of Data Center Outages,” Ponemon Institute, May 2011
essential step to reducing downtime. Naturally, the operating system needs to be highly secure, deliver high
performance, scale easily as workloads increase and ensure high availability. At the same time, operating
system choices must align with the new realities of IT infrastructure and architecture, such as virtualization,
cloud computing, support for mobility and real-time problem identification and remediation.
When an unplanned outage occurs, IT organizations often run through a checklist of actions to identify,
isolate and fix the problem, starting with simple causes like a loss of power and escalating up to more
complex issues. However, in considering ways to reduce both unplanned and planned service interruptions,
many organizations often overlook the role of the operating system as a way to improve availability. A
recent study conducted by operating system provider SUSE with registered visitors to SearchDataCenter.
com indicated that technology failure was the No. 1 cause of unplanned downtime, and that upgrading or
changing operating system functionality was one of the most significant actions their organizations plan to
take to reduce downtime.
Why the Operating System Matters in Reducing Downtime
One of the most important industry trends in IT infrastructure—virtualization—is putting huge pressure on
organizations’ host servers. The proliferation of virtual machines (VMs) for a wide variety of workloads,
from email and enterprise resource planning to Web servers and analytics, means that VM workloads are
consolidating on a host system. Without properly safeguarding the host’s operating system through a variety
of tools and technical advancements, those workloads will be at heightened risk.
Planned service interruptions
For instance, reducing the time frames for performing routine and scheduled maintenance, upgrades and
refreshes during planned downtimes can go a long way toward improving uptime. Two-thirds of respondents
in the SUSE survey said their organizations conduct planned downtime sessions at least once a quarter.
Reducing either the frequency of those planned interruptions or their duration can go a long way toward
reducing downtime.
Specifically, it’s important to have operating systems that provide native tools that can address some of
the most common sources of planned downtime, such as doing security patches without having to actually
shut down the system. IT organizations should consider upgrading or even changing their operating system
in order to keep the server updated with the most recent kernel patches without affecting mission-critical
workloads or waiting for the next planned service window.
Additionally, some operating systems offer easy-to-deploy tools that address the common flaw of human
error through snapshots that enable one-click system resets to any well-known state that’s stored. IT
organizations plan to implement a number of other steps to trim planned downtime windows, according to
the research survey:
2
© TechTarget 2014
Steps to
Cut Planned
Downtime
60%
51%
50%
40%
40%
36%
30%
21%
20%
10%
8%
0%
t
nd
es
ing
le
ing
s a cks
.t b ices
ch ools
yc
h
t
t
c
c
a
o
g
a
p t
ct
at
ife
sh llb
h m pra
Sl
ep
ter
ap ro
c
t
O
t
v
n
i
e
S
L
B
Pa
er
ng
o
L
These and other operating system features provide important benefits to organizations looking to reduce
downtime stemming from planned interruptions:
Snapshots and rollbacks reduce manual error by taking system snapshots, including kernel files, and
allowing administrators to roll back to any good state that’s stored.
Live kernel patching enables the deployment of critical security patches before the next service window—
without rebooting.
Automated patching reduces human errors by automating the often time-consuming and manual-intensive
patch management process.
Patch pre-loading allows patches to be pre-loaded to the server prior to application, which reducing
downtime time frames.
Extending the operating system lifecycle allows organizations to remain on their current OS versions longer
in order to reduce the risks of migrations and the costs of a systems upgrade.
3
Unplanned service interruptions
Eliminating unplanned outages is certainly far more difficult than dealing with planned outages, but the
potential impact of unplanned outages demands that IT organizations go to great length to reduce their
incidence and mitigate their impact.
Having a modernized, feature-rich operating system optimized for reducing unplanned downtime can go
a long way toward helping organizations accomplish their goal of achieving that elusive “five-nines” of
availability (99.999% uptime, or fewer than five minutes of downtime per year).
Steps to Cut
Unplanned
Downtime
60%
51%
50%
40%
30%
20%
35%
32%
20%
22%
10%
0%
e
ks
rag cy
e
ac
n
v a
b
l
e
L nd
ol
u
t/r
d
o
e
h
r
s
ap
Sn
a
gr
Up
w
ne re
o
t wa
atehard
r
ig
OS
de
M
e
radort
g
Up upp
s
e/
vic
r
se
There are a number of important capabilities IT organizations should be looking for in their operating system
in order to mitigate the frequency and impact of unplanned downtime:
Failover clustering helps organizations meet terms of their service-level agreements by establishing server
clusters among physical nodes or VM.
Geo-clustering is becoming an increasingly important OS function because of the growing incidence of
organizations’ data centers in disparate locations. It bridges clusters across any distance in order to ensure
business continuity.
4
© TechTarget 2014
IPv4 and IPv6 load balancing automatically distributes stateless server workloads to achieve
high availability.
Reliability, availability and serviceability (RAS) enables tight coupling of server operating systems built
on different hardware platforms in order to improve workload uptime.
Cluster test drive simulates and validates cluster setup before a failover actually occurs.
Cluster rolling update ensures cluster uptime by sequentially updating nodes.
Considering Your Options: SUSE Linux Enterprise
For more than 20 years, SUSE has offered enterprises robust, reliable and scalable operating system
architectures that help reduce downtime. Many of the industry’s top server hardware brands—including
the very popular IBM System z server—support Linux as their primary operating system for mission-critical
workloads, and SUSE Linux Enterprise Server offers IT departments mainframe- and Unix-class reliability and
performance on a wide range of hardware infrastructure, including x86-based server clusters.
SUSE has made it a priority to help organizations find new ways to reduce both planned and unplanned
downtime through a combination of new OS-based tools and enhanced technology and functionality to
address the root causes of downtime and limit the duration of service interruptions. For instance, SUSE
Linux Enterprise High Availability Extension allows organizations to create clusters of physical servers or
VMs in order to improve flexibility, while also offering tools for easy cluster configuration, management and
failure scenario simulation. Finally, SUSE’s geo-clustering capability enables service failover at any distance
among the organization’s different data centers or even at cloud service provider locations.
SUSE also offers SUSE Linux Enterprise Live Patching, which delivers kernel updates without rebooting. The
technology protects key workloads against the latest threats, without requiring the organization to take
down servers and incur downtime.
Conclusion
Reducing or even eliminating downtime is at or near the very top of every IT executive’s must-do list.
The financial, competitive, legal, brand and operational implications of downtime—both planned and
unplanned—are huge. Those potential issues mandate that IT departments exhaust every possible means
to reduce interruptions.
The downtime challenge should be addressed on many levels, but one area in which organizations can
and should do more is in modernizing or even changing their operating systems for maximum resiliency
and flexibility.
5
By addressing downtime challenges at the operating system level, organizations can make great strides to
reduce the duration and frequency of planned outages through capabilities such as live kernel patching,
snapshots/rollbacks, automated patching, patch pre-loading and extending the effective lifecycle of the
operating system. Additionally, today’s modern operating systems versions offer important functionality
that head off sources of potential unplanned downtime and remediate those outages more quickly through
capabilities such as geoclustering, failover clustering, load balancing, RAS, cluster test drive and cluster
rolling update.
SUSE Linux Enterprise Server has become an important part of IT organizations’ effort to reduce downtime,
whether their key workloads run on mainframes or x86-based commodity servers. SUSE Linux Enterprise
Server offers a broad array of tools and technology-based functionality specifically designed to help IT
organizations make faster progress toward that elusive goal of zero downtime.
For more information about SUSE operating system tools and technologies, go to
www.suse.com/zerodowntime.
29%
SUSE and the SUSE logo are registered trademarks of SUSE LLC in the United States and other countries.
6
© TechTarget 2014
Download