Redbooks Paper

Steve Finnes

Bob Gintowt

Mike Snyder

Nick Harris

IT Resiliency

The “business” side of the corporate organization likely has a basic level of understanding and appreciation regarding the criticality of the IT organization and mission. However, no one, including the IT organization, may realize exactly how robust (or fragile) the underlying IT infrastructure is in terms of resiliency (that is, the ability to not fail or to quickly recover in the event of a failure). This infrastructure includes, among other things, hardware, applications, and human resources.

Often, the realization that there are resiliency exposures comes at a time of crisis, when a core business application and the associated processes are at a standstill. An outage is most likely caused not by a disaster, but by some unexpected, non-disaster situation such as the failure of a critical component somewhere in the IT infrastructure. A disaster-recovery solution can be defined as one that necessitates a remote recovery resource. Most businesses have some level of disaster recovery, the most common being nightly tape backups and off-site archiving. Having an acceptable disaster-recovery solution does not necessarily translate into an acceptable solution for typical, non-disaster outages. In fact, in the event of an actual disaster, customers may be more tolerant of recovery delays and problems. The real challenge is how to maintain 24x7 operations during a non-disaster outage.

This IBM® Redpaper discusses the concept and structure of an IT Resilient environment.

© Copyright IBM Corp. 2005. All rights reserved.

IT resiliency profile

To begin a process of understanding the business continuity robustness of an IT infrastructure, consider the concept of an IT resiliency profile. The idea is to establish a metric: a measure of potential IT resiliency performance across a set of relevant variables that can affect actual IT capability to either withstand or recover from an outage event.

As an example, consider service level agreements (SLAs) as applied to a particular mission-critical application. A resiliency assessment would determine how valid this SLA is in the context of a particular type of outage or a set of outage types. A Midwestern beverage distribution company had an SLA on the core business application environment of no more than two hours of down time per week. This particular company experienced a complete disruption of the normal business process when a “short” in a power cable caused the server hardware to fail, bringing down the mission-critical application. The outage caused a production down time of 36 hours. Although there was an aggressive SLA in place, the true impact of this event from a recoverability perspective had not been understood; therefore, the SLA was not achievable.
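
The arithmetic behind this example can be sketched in a few lines of Python. The function names are illustrative assumptions, not part of any IBM tool; only the two figures (two hours per week allowed, 36 hours lost) come from the example above.

```python
HOURS_PER_WEEK = 7 * 24  # 168 hours


def implied_availability(allowed_downtime_hours: float,
                         period_hours: float = HOURS_PER_WEEK) -> float:
    """Return the availability fraction implied by a downtime allowance."""
    if not 0 <= allowed_downtime_hours <= period_hours:
        raise ValueError("downtime allowance must fit within the period")
    return (period_hours - allowed_downtime_hours) / period_hours


def periods_of_budget_consumed(outage_hours: float,
                               allowed_downtime_hours: float) -> float:
    """Return how many periods' worth of downtime budget one outage uses."""
    if allowed_downtime_hours <= 0:
        raise ValueError("downtime allowance must be positive")
    return outage_hours / allowed_downtime_hours


# Figures taken from the beverage distributor example.
sla_downtime_hours = 2.0   # allowed down time per week
outage_hours = 36.0        # actual production down time

availability = implied_availability(sla_downtime_hours)
weeks_used = periods_of_budget_consumed(outage_hours, sla_downtime_hours)

print(f"SLA implies {availability:.2%} availability per week")
print(f"The outage consumed {weeks_used:.0f} weeks of downtime budget")
```

Seen this way, the single power-cable failure used up the equivalent of 18 weeks of the allowed down time, which is the sense in which the SLA was not achievable.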

IT resiliency elements

There are six major elements (not independent) for establishing the assessment points used to develop an IT resiliency profile. These assessments focus on the availability of critical business applications, because it is primarily the application and associated business processes that are visible to customers (both internal and external). In the end, the business solutions are affected by these critical applications.

• Availability coverage: Produce a complete and current list of the applications that are critical to the business from a continuity perspective.

Mission-critical applications are identified in the context of the current business and regulatory environment or, perhaps, as a result of interdependency with other processes.

As the business has evolved, so has its exploitation of the IT infrastructure, and there might be a need to reestablish an understanding of which applications are mission critical.

• Interdependencies: Document the complete topology of the IT capabilities (software and hardware) that support applications for your core and business-critical solutions.

A business process projects onto an application environment, and this application environment in turn has various interdependencies. The mission-critical application environment is interconnected with other applications and with an underlying physical infrastructure. It is also interconnected with various human processes. One application might be a hub connecting multiple other application environments. Perhaps one physical IT location is dependent on a set of data and procedures from another remote IT organization. During a recovery process, an issue might be as basic as having the right staff available at a remote location to coordinate recovery activities. The notion of mission critical goes beyond the view of a single application and extends into a matrix of interconnected processes, people, and applications.
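
One way to document such a topology is as a dependency map that can be walked to list everything a given application relies on, directly or indirectly. The following is a minimal sketch; every component name in it is hypothetical and chosen only to mirror the kinds of dependencies described above (a hub application, a remote site, staff).

```python
from typing import Dict, List, Set

# Hypothetical topology: each entry lists what that item depends on directly.
DEPENDS_ON: Dict[str, List[str]] = {
    "order_entry": ["hub_app", "db_server"],
    "hub_app": ["db_server", "remote_site_data"],
    "db_server": ["disk_subsystem", "power_feed"],
    "remote_site_data": ["wan_link", "remote_staff"],
}


def all_dependencies(graph: Dict[str, List[str]], root: str) -> Set[str]:
    """Return every direct and indirect dependency of root."""
    seen: Set[str] = set()
    stack = list(graph.get(root, []))
    while stack:
        node = stack.pop()
        if node in seen:
            continue  # already visited; also guards against cycles
        seen.add(node)
        stack.extend(graph.get(node, []))
    return seen


for name in sorted(all_dependencies(DEPENDS_ON, "order_entry")):
    print(name)
```

Even in this small example, the order entry application turns out to depend on staff and a network link at a remote site, neither of which appears in its own list of direct dependencies.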

• Single points of failure: Identify all potential single points of failure along with documented action plans for handling failure conditions.

Many, if not all, resiliency issues can be reduced to single-point-of-failure constructs. For this assessment, which focuses on IT, single points of failure must be identified and connected to the availability of the mission-critical applications. For example, a hub application that interacts with several other critical applications is a potential single point of failure. Another, more basic example might be the way that the disk subsystem is protected. The data on a particular mission-critical application server might be protected by RAID-5 (parity protection) as opposed to RAID-1 (mirroring).

Other examples might include a backup power generator, redundant Ethernet connections, and so on. Developing an inventory of the single points of failure can be onerous, but it is one of the most essential elements in the development of the overall resiliency profile.
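
A single-point-of-failure inventory can be sketched as follows, assuming a simplified model in which each application needs a set of functions and each function is supplied by one or more components. The model and all names are illustrative assumptions; a function with exactly one supplier is reported together with the applications it would take down.

```python
from typing import Dict, List

# Hypothetical inventory: components able to provide each required function.
PROVIDERS: Dict[str, List[str]] = {
    "power": ["utility_feed", "backup_generator"],  # redundant
    "network": ["ethernet_a"],                      # no second connection
    "storage": ["raid_array_1"],
    "integration_hub": ["hub_app"],
}

# Hypothetical mapping of applications to the functions they need.
REQUIRES: Dict[str, List[str]] = {
    "order_entry": ["power", "network", "storage", "integration_hub"],
    "reporting": ["power", "storage"],
}


def single_points_of_failure(providers: Dict[str, List[str]],
                             requires: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Map each sole-provider component to the applications that depend on it."""
    exposure: Dict[str, List[str]] = {}
    for app, functions in requires.items():
        for function in functions:
            candidates = providers.get(function, [])
            if len(candidates) == 1:
                exposure.setdefault(candidates[0], []).append(app)
    return exposure


for component, apps in sorted(single_points_of_failure(PROVIDERS, REQUIRES).items()):
    print(f"{component}: affects {', '.join(apps)}")
```

Power does not appear in the result because a backup generator is listed alongside the utility feed, whereas the lone Ethernet connection, the storage array, and the hub application are each flagged.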

• Ownership and process: Document the lines of responsibility for IT infrastructure availability and all associated processes.

Many IT organizations are operating with principles that were established in the past decade. A classic profile is that of a company that has grown from $400,000,000 US to $2,000,000,000 US without its core IT procedures, staffing, and skills evolving commensurately. This situation becomes evident as the IT organization appears to move from one crisis to another, seeming to never attain stability. IT organizational structures might not reflect the demands of the current business environment and are not in a position to deal successfully with resiliency issues. In the time of a crisis, chaos ensues.

The basic assumption must be that failures will happen. The challenge is: What course of action and what procedures will activate when a failure does happen?

• Skills and staffing: Describe the required staffing and skills needed to fully support your SLA.

An extension of the ownership and process discussion is a discussion about skills and staffing. Staffing and skills are critical. There is an ideal staffing and skill structure that can enable a smoothly functioning IT environment. It is not possible to execute to a defined process structure if the staffing or skills, or both, are not adequate. An understaffed organization becomes reactive, and how well it can react is a function of organizational bandwidth and skill. No one likes to face the fact that they are organizationally limited and therefore exposed in an IT resiliency sense. These limitations can lead to a lack of proactive planning and an IT organization that is unable to respond effectively in moments of crisis.

• Achievability: Determine the probability that your IT infrastructure (processes, deployed configurations, staff, and so on) can achieve the stated SLA in terms of availability.

Given the current IT resiliency profile, what is the current IT infrastructure resiliency capability (application design, processes, deployed configurations, staff, and so on)?

Many organizations do not realize that the current SLAs are not achievable until some type of unplanned event occurs. Reliable hardware and highly available data do not necessarily translate into highly available applications. The fundamental question is: What is the achievability of the current mission-critical application availability objectives? At this stage, the answer is a simple “yes” or “no,” because a comprehensive resiliency baseline will be developed later.
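
The point that reliable hardware does not by itself yield a highly available application can be illustrated with a simple series model. The sketch below assumes that every listed element is required and that the elements fail independently, so that the availabilities multiply; both the model and the figures are illustrative assumptions, not measurements from the paper.

```python
import math
from typing import Dict


def composite_availability(components: Dict[str, float]) -> float:
    """Multiply availabilities, assuming all elements are required and independent."""
    for name, value in components.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"availability for {name} must be between 0 and 1")
    return math.prod(components.values())


def is_achievable(components: Dict[str, float], target: float) -> bool:
    """Answer the yes/no question: does the composite meet the target?"""
    return composite_availability(components) >= target


# Hypothetical availability figures for one application environment.
stack = {
    "server_hardware": 0.999,
    "storage": 0.9995,
    "application_software": 0.99,
    "operational_procedures": 0.98,
}

# Target implied by an SLA of two hours of down time per week.
sla_target = 166 / 168

print(f"Composite availability: {composite_availability(stack):.4f}")
print(f"SLA target:             {sla_target:.4f}")
print("Achievable:", "yes" if is_achievable(stack, sla_target) else "no")
```

Under these assumptions the composite figure is lower than that of the weakest single element, and the answer to the achievability question is "no" even though the hardware and storage figures look strong.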

A plan of action

To build an IT resiliency assessment plan and strategy, a framework is needed that matches the needs of the business with the company’s IT capabilities.

There are five essential steps that will lead to a resilient IT strategy and plan: a business impact analysis, setting objectives, establishing a baseline, performing a gap analysis, and finally, establishing the IT resiliency requirements.

1. Business impact analysis

The first step is to create a business impact analysis report. Abstracted from IT, this is an assessment of the impact to the business when certain mission-critical application environments are unavailable. The analysis helps determine a level of prioritization and a sense of urgency based on business outcomes.

The process used to develop this analysis has three basic steps:

a. Define and calculate the total cost of an outage.

b. Provide requirements for application availability, including time to recover (recovery time objective, or RTO) and the acceptable age of the recovered data (recovery point objective, or RPO).

c. Prioritize application availability and security objectives.

A business impact analysis of the resiliency of mission-critical applications will result in an assessment of risk. It should also provide an ROI calculation that justifies the deployment of a particular resiliency solution approach. It is sometimes possible to justify a solution investment solely on the minimization of planned outage scenarios while coincidentally providing unplanned outage resiliency. Viewing outages in the context of normal business processes provides an understanding of impact but, in a larger sense, it reveals how robust or fragile business processes are based on IT resiliency. Failure scenarios applied to the application environments should reveal potential impacts to the business, the generation of revenue, customer satisfaction, and outage/recovery cost considerations (for example, fines) relevant to the given business environment.
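
The paper does not prescribe a formula for the total cost of an outage or for the ROI calculation. As a rough sketch only, the cost might be modeled as an hourly loss multiplied by the outage duration, plus one-time recovery costs and fines, and the ROI as avoided loss relative to solution cost. The cost categories, the class and function names, and all monetary figures below are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class OutageCostModel:
    """Hypothetical cost inputs for one application environment."""
    lost_revenue_per_hour: float
    idle_labor_per_hour: float
    recovery_cost: float       # one-time cost of restoring service
    fines: float = 0.0         # for example, regulatory or contractual penalties

    def total_cost(self, outage_hours: float) -> float:
        """Return hourly losses over the outage plus one-time costs."""
        if outage_hours < 0:
            raise ValueError("outage_hours must not be negative")
        hourly = self.lost_revenue_per_hour + self.idle_labor_per_hour
        return outage_hours * hourly + self.recovery_cost + self.fines


def simple_roi(avoided_loss: float, solution_cost: float) -> float:
    """Return (avoided loss - solution cost) / solution cost."""
    if solution_cost <= 0:
        raise ValueError("solution_cost must be positive")
    return (avoided_loss - solution_cost) / solution_cost


model = OutageCostModel(
    lost_revenue_per_hour=10_000.0,
    idle_labor_per_hour=2_000.0,
    recovery_cost=25_000.0,
    fines=5_000.0,
)

cost_of_36_hours = model.total_cost(36.0)
print(f"Cost of a 36-hour outage: {cost_of_36_hours:,.0f}")
print(f"ROI if a 200,000 solution avoids it: {simple_roi(cost_of_36_hours, 200_000.0):.2f}")
```

A real analysis would also need to weigh the probability of each failure scenario and less tangible effects such as customer satisfaction, which this sketch leaves out.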

2. IT resiliency objectives

Based on the output from the business impact analysis, establish specific objectives for the resiliency of the IT infrastructure for each of the mission-critical applications.

Objectives should be stated in terms of the expected business consequences. Revised objectives for realistic SLAs should be created. For each application environment, establish an RPO and an RTO and establish how much of an outage is tolerable for tape backups and other maintenance procedures. At this stage, it is important to discuss disaster-recovery objectives and high-availability objectives. A disaster-recovery solution generally involves a resilient or backup resource at a remote location. A high-availability solution is characterized by the capability to support application resiliency, autonomics (clustering), and the ability to do regular and sustained role swapping or switching of the primary and backup resource. In a true high-availability topology, the backup environment becomes the primary and remains so for an extended period of time. Regardless, the characteristic of the solution topology will be defined by the IT resiliency objectives.

3. Establish a baseline

Given the outputs from the business impact analysis and the IT resiliency objectives, we next establish the current IT resiliency baseline. The resiliency baseline is an independent assessment of current capabilities using the deployed technology, processes, and practices. This exercise establishes the present resiliency capabilities and expresses them using the six IT resiliency variables discussed earlier.

4. IT resiliency gap analysis

The next step is to determine how the baseline stacks up against the IT resiliency objectives and to identify where gaps exist in the current resiliency capabilities. The baseline analysis has, for example, identified single points of failure and documented an understanding of the types of risks that the business is facing based on the current IT resiliency profile. Examples include lack of mission-critical application availability during planned and unplanned outage events, specific single points of failure, documented recovery procedures and processes, documented change control process, organizational mission statements, and identified owners of key plans and processes related to IT continuity. The gap analysis helps to determine what is required to achieve the realistic SLA for mission-critical applications in the context of a worst-case risk scenario. At this point, we should know which of the business-level IT objectives are likely to be met and which are not. It is now the job of IT to put this new set of objectives into IT requirements.
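
For the measurable part of the gap analysis, the comparison of baseline against objectives can be sketched as below. The sketch covers only RTO and RPO, expresses both in minutes, and treats an RPO of "last committed transaction" as zero minutes of data loss; these simplifications, the names, and the sample figures are all assumptions for illustration.

```python
from typing import Dict, List, NamedTuple


class Capability(NamedTuple):
    rto_minutes: int   # time to recover
    rpo_minutes: int   # acceptable age of the recovered data


def find_gaps(objectives: Dict[str, Capability],
              baseline: Dict[str, Capability]) -> List[str]:
    """List every objective that the baseline capability does not meet."""
    gaps: List[str] = []
    for app, target in objectives.items():
        current = baseline.get(app)
        if current is None:
            gaps.append(f"{app}: no baseline assessment")
            continue
        if current.rto_minutes > target.rto_minutes:
            gaps.append(f"{app}: RTO {current.rto_minutes} min exceeds "
                        f"objective of {target.rto_minutes} min")
        if current.rpo_minutes > target.rpo_minutes:
            gaps.append(f"{app}: RPO {current.rpo_minutes} min exceeds "
                        f"objective of {target.rpo_minutes} min")
    return gaps


# Hypothetical objectives and baseline for two applications.
objectives = {"XYZ": Capability(30, 0), "ABC": Capability(120, 0)}
baseline = {"XYZ": Capability(240, 1440), "ABC": Capability(90, 0)}

for gap in find_gaps(objectives, baseline):
    print(gap)
```

In this example, application XYZ, recoverable only from a nightly backup, misses both objectives, while ABC already meets its targets. Gaps in process, ownership, and staffing still need to be assessed qualitatively.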

5. Establish IT resiliency requirements

It is the job of the IT professional to develop the set of specific requirements and actions that translate the business-level IT objectives into IT resiliency requirements. We now become very specific for each application environment.

For example, for application xyz, the impact to availability to end users can be no more than 20 minutes per year for performing planned outage events such as tape backups or maintenance. For unplanned outage events, the recovery point objective (RPO) for the data must be the last committed transaction, and the recovery time objective (RTO) must be less than half an hour. The application environment will be deployed so that it will completely switch to a backup resource once a week and will remain in this state until the next switching exercise a week later. This is done to ensure complete technical and organizational capability to execute such procedures in case of an actual outage event.

These requirements might be tabulated for clarity.

Table 1 Sample IT resiliency requirements

Application | Planned outage | Unplanned outage RTO/RPO | Disaster recovery RTO/RPO
XYZ         | 20 mins/year   | 30 mins/last transaction | 24 hours/last batch of work
ABC         | 1 week/year    | 2 hours/last transaction | 24 hours/start of last shift
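
Requirements stated this precisely can be checked against each other. As a worked example, the sketch below combines the planned outage allowance for application XYZ with its weekly switching exercise to find how much user-visible interruption each switch can afford. It assumes that the interruption caused by a switch counts against the planned outage allowance, which the paper does not state, and the function name is hypothetical.

```python
def max_seconds_per_swap(planned_budget_minutes: float,
                         swaps_per_year: int = 52,
                         other_planned_minutes: float = 0.0) -> float:
    """Return the longest interruption each swap may cause within the yearly budget."""
    if swaps_per_year <= 0:
        raise ValueError("swaps_per_year must be positive")
    remaining_minutes = planned_budget_minutes - other_planned_minutes
    if remaining_minutes < 0:
        raise ValueError("other planned outages already exceed the budget")
    return remaining_minutes * 60.0 / swaps_per_year


# Application XYZ: 20 minutes per year, one switch per week.
xyz_limit = max_seconds_per_swap(20.0)
print(f"XYZ: each weekly switch may interrupt users for at most {xyz_limit:.1f} s")
```

Under that assumption, each weekly switch must complete with well under half a minute of impact to end users, which gives a concrete figure to design and test against.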

Now, plans can be formulated to bring the current environment in line with the stated objectives.

Summary and conclusions

As businesses have become increasingly dependent on the availability of their critical business applications, the risk of not having these applications available 24x365 increases the need for a realistic understanding of the IT infrastructure from a resiliency perspective.

The era of highly available data as the primary focus is long gone. It is now the era of the highly available applications and their associated infrastructure (including data). Many IT organizations are not fully cognizant of the strengths and weaknesses of their current solution deployment environment in the context of a resiliency discussion.

The deployment of a resiliency strategy must be an integral part of the overall corporate strategy with full executive support and sponsorship. Resiliency solutions that are deployed must be exercised and evaluated continuously for validity as the business environment changes. All critical points in the resiliency profile that are addressed with a new strategy and plan must become part of an ongoing process, a culture of resiliency-aware people and processes.

Additional information

This section lists some sources of information that might be interesting as the subject of IT resiliency is investigated further.

Sources from IBM

IBM Client Technology Center (CTC) for Lab Services is a services group that assists clients worldwide in developing and deploying server and storage solutions that meet their current business requirements and strategic on demand initiatives by providing leading-edge consulting and support services, proof of concepts, benchmarking, and collateral, while enabling our key channel partners to do the same.

Find out more about the CTC at: http://www.ibm.com/servers/eserver/services/about.html

CTC for Lab Services teams provide services in two areas: for new products or technologies, or both, that are emerging from IBM product development labs so as to help enable early client success and satisfaction, and for niche and mature/end-of-life markets that exist when there are important products, technologies, or both being used by a limited number of clients that are deemed important and need services for support. CTC for Lab Services teams are singularly focused on the client’s adoption of storage software solutions, have access to deep development technical skills to deliver the services, and team with, complement, and enable other service-provider channels, such as IBM Global Services and Business Partners (BPs), sharing intellectual capital, transferring skills, mentoring, and “filling in the gaps” with needed services offerings when the client’s needs demand them.

The CTC for Lab Services group is a worldwide organization that includes these teams:

• iSeries™ CTC Lab Services, located in the U.S., EMEA, and AP

• pSeries® UNIX Software Services, located in the U.S., EMEA, and AP

• xSeries® Lab Services, located in the U.S.

• zSeries® CTC Lab Services, located in the U.S.

• Storage CTC Lab Services, located in the U.S., EMEA, and AP

IBM Global Services

Are you prepared to ask, or answer, the question:

How long could we stay in business if our critical business processes went down?

As your technology environment changes to accommodate the on demand marketplace, you need to assess your risks and quantify the financial—and intangible—impacts to your business. How much loss is acceptable, for how long, and how much will it cost? IBM can help put priority on your recovery strategies and recommend cost-effective safeguards to mitigate risk and prevent loss. Our experts can help you identify vulnerabilities and develop an effective, corporate-wide continuity program that maintains, or quickly restores, your business operations: http://www.ibm.com/services/us/index.wss/home

IBM Business Continuity and Recovery Services can help you achieve continuous availability of, and access to, your enterprise information and critical business processes. Our consultants will work with you to determine the risks, vulnerabilities, and financial impact that a disruption can have on your business and plan to effectively manage emergency situations.

We can also operate and manage your continuity program for you, so you can focus on your core business.

http://www.ibm.com/services/us/index.wss/of/bcrs/a1000387

IBM Redbooks™

IBM Redbooks are developed and published by the IBM International Technical Support Organization (ITSO). The ITSO develops and delivers skills, technical know-how, and materials to technical professionals of IBM, Business Partners, and customers, and to the marketplace generally.

http://www.redbooks.ibm.com

ITSO teams with IBM Divisions and Business Partners in the process of developing IBM Redbooks, Redpapers, Technotes, Training, and other materials. The ITSO is part of the IBM Global Technical Support organization within IBM Global Sales and Distribution.

ITSO value-add information products address product, platform, and solution perspectives. They explore integration, implementation, and operation of realistic customer scenarios.

Here is a list of current IBM Redbooks about high availability and IBM Eserver® systems.

General IBM Redbooks about high availability:

• Achieving the Highest Levels of Parallel Sysplex Availability in a DB2 Environment, REDP-3960

• Architecting High Availability e-business on IBM Eserver zSeries, SG24-6850

• Case Study for Content Manager OnDemand Backup, Recovery, and High Availability #1: Global Voice and Data Communications Company, TIPS0517

• Cluster Systems Management Cookbook for pSeries, SG24-6859

• Content Manager OnDemand Backup, Recovery, and High Availability Case Study #2: International Financial Services Company, TIPS0518

• Content Manager Backup/Recovery and High Availability: Strategies, Options, and Procedures, SG24-7063

• High Availability and Scalability with Domino Clustering and Partitioning on AIX, SG24-5163

• High Availability Without Clustering, SG24-6216

• Highly Available WebSphere Business Integration Solutions, SG24-6328

• Linux on IBM zSeries and S/390: High Availability for z/VM and Linux, REDP-0220

• Useful Disaster Recovery Definitions and Concepts, TIPS0047

IBM Eserver iSeries Redbooks:

• Clustering and IASPs for Higher Availability on the IBM Eserver iSeries Server, SG24-5194

• High Availability on the AS/400 System: A System Manager’s Guide, REDP-0111

White paper: i5/OS™ High Availability Clusters: Data Resilience Solutions

It is a given that business continuity is an extremely broad topic. The focus of this paper is the technologies that provide improved data resilience for end users. The paper is organized such that the reader could study the content cover-to-cover (recommended) or use specific sections for reference as needed.

http://www.ibm.com/servers/eserver/iseries/ha/pdf/DataResilienceTechnologies.pdf

Sources outside IBM

The following sources are found outside IBM. They cover information that might be interesting to CEO/CIO-level executives.

• CIO.com

This is a useful site with interesting articles, found by searching on business resiliency, resilience, and availability: http://www.cio.com/

• ContinuityCentral.com

This site offers extensive information about the general business continuity topic: http://www.continuitycentral.com/

Notices

This information was developed for products and services offered in the U.S.A.

IBM may not offer the products, services, or features discussed in this document in other countries. Consult your local IBM representative for information on the products and services currently available in your area.

Any reference to an IBM product, program, or service is not intended to state or imply that only that IBM product, program, or service may be used. Any functionally equivalent product, program, or service that does not infringe any IBM intellectual property right may be used instead. However, it is the user's responsibility to evaluate and verify the operation of any non-IBM product, program, or service.

IBM may have patents or pending patent applications covering subject matter described in this document. The furnishing of this document does not give you any license to these patents. You can send license inquiries, in writing, to:

IBM Director of Licensing, IBM Corporation, North Castle Drive Armonk, NY 10504-1785 U.S.A.

The following paragraph does not apply to the United Kingdom or any other country where such provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of express or implied warranties in certain transactions, therefore, this statement may not apply to you.

This information could include technical inaccuracies or typographical errors. Changes are periodically made to the information herein; these changes will be incorporated in new editions of the publication. IBM may make improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time without notice.

Any references in this information to non-IBM Web sites are provided for convenience only and do not in any manner serve as an endorsement of those Web sites. The materials at those Web sites are not part of the materials for this IBM product and use of those Web sites is at your own risk.

IBM may use or distribute any of the information you supply in any way it believes appropriate without incurring any obligation to you.

Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not tested those products and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products.

This information contains examples of data and reports used in daily business operations. To illustrate them as completely as possible, the examples include the names of individuals, companies, brands, and products.

All of these names are fictitious and any similarity to the names and addresses used by an actual business enterprise is entirely coincidental.

COPYRIGHT LICENSE:

This information contains sample application programs in source language, which illustrates programming techniques on various operating platforms. You may copy, modify, and distribute these sample programs in any form without payment to IBM, for the purposes of developing, using, marketing or distributing application programs conforming to the application programming interface for the operating platform for which the sample programs are written. These examples have not been thoroughly tested under all conditions. IBM, therefore, cannot guarantee or imply reliability, serviceability, or function of these programs. You may copy, modify, and distribute these sample programs in any form without payment to IBM for the purposes of developing, using, marketing, or distributing application programs conforming to IBM's application programming interfaces.

This document created or updated on June 1, 2005.

Send us your comments in one of the following ways:

• Use the online Contact us review redbook form found at: ibm.com/redbooks

• Send your comments in an email to: redbook@us.ibm.com

• Mail your comments to:

IBM Corporation, International Technical Support Organization

Dept. JLU Building 107-2

3605 Highway 52N

Rochester, Minnesota 55901-7829 U.S.A.

Trademarks

The following terms are trademarks of the International Business Machines Corporation in the United States, other countries, or both:

Eserver®, i5/OS™, IBM®, iSeries™, pSeries®, Redbooks™, Redbooks (logo), xSeries®, zSeries®

UNIX is a registered trademark of The Open Group in the United States and other countries.

Other company, product, and service names may be trademarks or service marks of others.
