WANT TO SLASH DOWNTIME? FOCUS ON YOUR SERVER OPERATING SYSTEM

Ask any IT executive to name their biggest day-to-day fear, and the answer will likely involve downtime and how to reduce or even eliminate it. Not only does downtime have a significant impact on an organization’s operating efficiency, brand reputation and regulatory compliance, but its economic costs can also be huge.

CONTENTS:
• Why the Operating System Matters in Reducing Downtime
    Planned service interruptions
    Unplanned service interruptions
• Considering Your Options: SUSE Linux Enterprise

Data from research organization Ponemon Institute estimates that a single minute of downtime costs about $5,000.[1] Of course, for uptime-dependent applications like e-commerce and other transaction-intensive workloads, the cost can be even greater. Amazon.com in 2013 experienced a downtime event of nearly an hour that, based on Amazon’s revenues at the time, may have cost the company as much as $4.9 million in deferred revenue.

But when IT organizations and their business stakeholders plan strategies to significantly pare downtime, it makes sense to talk about two key issues. The first is understanding the differences between planned downtime (scheduled, anticipated maintenance that is known in advance) and unplanned downtime, along with the causes of each. While unplanned downtime has the biggest potential for catastrophic impact, and is far more likely to garner press coverage and affect IT leaders’ job security, dealing with planned downtime is every bit as important a concern because it happens far more frequently than unplanned downtime. The second key issue is to consider the causes of downtime and the best solutions for improving system availability.
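To put the per-minute cost estimate cited earlier into concrete terms, the short Python sketch below multiplies it by a few outage durations. The $5,000 rate is the Ponemon figure from the text; the function name and the sample durations are illustrative assumptions, and real costs vary widely by organization and workload.

```python
# Rough outage-cost estimate. The per-minute rate is the Ponemon estimate
# quoted in the text; the sample durations are hypothetical.
COST_PER_MINUTE_USD = 5_000


def outage_cost(minutes: float, cost_per_minute: float = COST_PER_MINUTE_USD) -> float:
    """Return the estimated cost in dollars of an outage lasting `minutes`."""
    if minutes < 0:
        raise ValueError("minutes must be non-negative")
    return minutes * cost_per_minute


# Print the estimate for a few example outage lengths.
for minutes in (5, 30, 60):
    print(f"{minutes:>3}-minute outage: ${outage_cost(minutes):,.0f}")
```

At this rate, an hour-long outage comes to $300,000, which helps explain why transaction-heavy businesses can see far larger losses.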
While achieving the ideal goal of zero downtime may be a physical impossibility for organizations, getting as close to that goal as possible is a very high priority for IT departments charged with keeping systems up and running and ensuring the availability of critical data, applications and services. In each case, however, it’s becoming increasingly important for IT leaders to pay more attention to server operating systems as an essential step in reducing downtime. Naturally, the operating system needs to be highly secure, deliver high performance, scale easily as workloads increase and ensure high availability. At the same time, operating system choices must align with the new realities of IT infrastructure and architecture, such as virtualization, cloud computing, support for mobility and real-time problem identification and remediation.

[1] “Calculating the Cost of Data Center Outages,” Ponemon Institute, May 2011

When an unplanned outage occurs, IT organizations often run through a checklist of actions to identify, isolate and fix the problem, starting with simple causes like a loss of power and escalating to more complex issues. However, in considering ways to reduce both unplanned and planned service interruptions, many organizations overlook the role of the operating system as a way to improve availability. A recent study conducted by operating system provider SUSE with registered visitors to SearchDataCenter.com indicated that technology failure was the No. 1 cause of unplanned downtime, and that upgrading or changing operating system functionality was one of the most significant actions their organizations plan to take to reduce downtime.

Why the Operating System Matters in Reducing Downtime

One of the most important industry trends in IT infrastructure, virtualization, is putting huge pressure on organizations’ host servers.
The proliferation of virtual machines (VMs) for a wide variety of workloads, from email and enterprise resource planning to Web servers and analytics, means that VM workloads are consolidating on a host system. Without properly safeguarding the host’s operating system through a variety of tools and technical advancements, those workloads will be at heightened risk.

Planned service interruptions

For instance, reducing the time frames for performing routine and scheduled maintenance, upgrades and refreshes during planned downtimes can go a long way toward improving uptime. Two-thirds of respondents in the SUSE survey said their organizations conduct planned downtime sessions at least once a quarter. Reducing either the frequency of those planned interruptions or their duration directly reduces total downtime.

Specifically, it’s important to have operating systems that provide native tools that can address some of the most common sources of planned downtime, such as applying security patches without having to shut down the system. IT organizations should consider upgrading or even changing their operating system in order to keep the server updated with the most recent kernel patches without affecting mission-critical workloads or waiting for the next planned service window. Additionally, some operating systems offer easy-to-deploy tools that address the common problem of human error through snapshots that enable one-click system resets to any known good state that’s stored.
IT organizations plan to implement a number of other steps to trim planned downtime windows, according to the research survey:

[Chart: Steps to Cut Planned Downtime; percentage scale from 0% to 60%]

© TechTarget 2014

These and other operating system features provide important benefits to organizations looking to reduce downtime stemming from planned interruptions:

• Snapshots and rollbacks reduce manual error by taking system snapshots, including kernel files, and allowing administrators to roll back to any good state that’s stored.
• Live kernel patching enables the deployment of critical security patches before the next service window, without rebooting.
• Automated patching reduces human errors by automating the often time-consuming and manual-intensive patch management process.
• Patch pre-loading allows patches to be pre-loaded to the server before they are applied, which shortens downtime windows.
• Extending the operating system lifecycle allows organizations to remain on their current OS versions longer in order to reduce the risks of migrations and the costs of a systems upgrade.

Unplanned service interruptions

Eliminating unplanned outages is certainly far more difficult than dealing with planned outages, but the potential impact of unplanned outages demands that IT organizations go to great lengths to reduce their incidence and mitigate their impact. Having a modernized, feature-rich operating system optimized for reducing unplanned downtime can go a long way toward helping organizations accomplish their goal of achieving that elusive “five-nines” of availability (99.999% uptime, or just over five minutes of downtime per year).
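The “five-nines” target can be checked with simple arithmetic. The Python sketch below converts an availability percentage into a yearly downtime budget; the function name and the 365-day year are assumptions made for illustration, not part of any SUSE tool.

```python
# Convert an availability target into an annual downtime budget.
# Assumes a 365-day year (525,600 minutes); leap years are ignored.
MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_budget_minutes(availability_percent: float) -> float:
    """Return the minutes of downtime per year allowed at the given availability."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    return (100 - availability_percent) / 100 * MINUTES_PER_YEAR


# Compare a few common availability targets.
for label, percent in (("three nines", 99.9), ("four nines", 99.99), ("five nines", 99.999)):
    budget = downtime_budget_minutes(percent)
    print(f"{label} ({percent}%): {budget:.2f} minutes per year")
```

Under these assumptions, five nines works out to roughly 5.26 minutes of downtime per year, so a single reboot for patching could consume most of the annual budget.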
[Chart: Steps to Cut Unplanned Downtime; percentage scale from 0% to 60%]

There are a number of important capabilities IT organizations should be looking for in their operating system in order to mitigate the frequency and impact of unplanned downtime:

• Failover clustering helps organizations meet the terms of their service-level agreements by establishing server clusters among physical nodes or VMs.
• Geo-clustering is becoming an increasingly important OS function because more organizations operate data centers in disparate locations. It bridges clusters across any distance in order to ensure business continuity.
• IPv4 and IPv6 load balancing automatically distributes stateless server workloads to achieve high availability.
• Reliability, availability and serviceability (RAS) enables tight coupling of server operating systems built on different hardware platforms in order to improve workload uptime.
• Cluster test drive simulates and validates cluster setup before a failover actually occurs.
• Cluster rolling update ensures cluster uptime by sequentially updating nodes.

Considering Your Options: SUSE Linux Enterprise

For more than 20 years, SUSE has offered enterprises robust, reliable and scalable operating system architectures that help reduce downtime. Many of the industry’s top server hardware brands, including the very popular IBM System z server, support Linux as their primary operating system for mission-critical workloads, and SUSE Linux Enterprise Server offers IT departments mainframe- and Unix-class reliability and performance on a wide range of hardware infrastructure, including x86-based server clusters.
SUSE has made it a priority to help organizations find new ways to reduce both planned and unplanned downtime through a combination of new OS-based tools and enhanced technology and functionality to address the root causes of downtime and limit the duration of service interruptions.

For instance, SUSE Linux Enterprise High Availability Extension allows organizations to create clusters of physical servers or VMs in order to improve flexibility, while also offering tools for easy cluster configuration, management and failure scenario simulation. In addition, SUSE’s geo-clustering capability enables service failover at any distance among the organization’s different data centers or even at cloud service provider locations.

SUSE also offers SUSE Linux Enterprise Live Patching, which delivers kernel updates without rebooting. The technology protects key workloads against the latest threats, without requiring the organization to take down servers and incur downtime.

Conclusion

Reducing or even eliminating downtime is at or near the very top of every IT executive’s must-do list. The financial, competitive, legal, brand and operational implications of downtime, both planned and unplanned, are huge. Those potential issues mandate that IT departments exhaust every possible means to reduce interruptions. The downtime challenge should be addressed on many levels, but one area in which organizations can and should do more is in modernizing or even changing their operating systems for maximum resiliency and flexibility.

By addressing downtime challenges at the operating system level, organizations can make great strides to reduce the duration and frequency of planned outages through capabilities such as live kernel patching, snapshots/rollbacks, automated patching, patch pre-loading and extending the effective lifecycle of the operating system.
Additionally, today’s operating system versions offer important functionality that heads off sources of potential unplanned downtime and remediates those outages more quickly through capabilities such as geo-clustering, failover clustering, load balancing, RAS, cluster test drive and cluster rolling update.

SUSE Linux Enterprise Server has become an important part of IT organizations’ efforts to reduce downtime, whether their key workloads run on mainframes or x86-based commodity servers. SUSE Linux Enterprise Server offers a broad array of tools and technology-based functionality specifically designed to help IT organizations make faster progress toward that elusive goal of zero downtime.

For more information about SUSE operating system tools and technologies, go to www.suse.com/zerodowntime.

SUSE and the SUSE logo are registered trademarks of SUSE LLC in the United States and other countries.