Front cover

Data Resilience Solutions for IBM i5/OS High Availability Clusters

Understand the scope of business continuity problems and solutions

Learn about data resilience technologies, their features and limitations

Determine the right technologies for your availability needs

Steve Finnes
Bob Gintowt
Mike Snyder

ibm.com/redbooks

Redpaper

International Technical Support Organization

Data Resilience Solutions for IBM i5/OS High Availability Clusters

February 2005

Note: Before using this information and the product it supports, read the information in "Notices".

First Edition (February 2005)

This edition applies to Version 5 Release 3 Modification 0 of IBM i5/OS.

This document was created or updated on February 7, 2005.

© Copyright International Business Machines Corporation 2005. All rights reserved.
Note to U.S. Government Users Restricted Rights -- Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

Contents

Notices
  Trademarks
Preface
  The team that wrote this Redpaper
  Become a published author
  Comments welcome
Chapter 1. What is business continuity
Chapter 2. Major business continuity problem sets
  2.1 Problem categories
  2.2 Selection criteria
Chapter 3. Overview of business continuity technologies
  3.1 Backup window reduction
  3.2 Workload balancing
  3.3 Application resilience
  3.4 Data resilience
Chapter 4. Applicability of a solution to a problem set
Chapter 5. Detailed attributes of data resilience technologies
  5.1 Logical replication
  5.2 Switchable device
  5.3 Cross-site mirroring
  5.4 IBM TotalStorage Enterprise Storage Server PPRC used with the iSeries Copy Services for ESS toolkit
Chapter 6. Comparison characteristics
Chapter 7. General guidelines
Appendix A. Decision factors
  Primary decision factors
  Supporting decision factors
Appendix B. Cautions, caveats, and other tips
  Basic single system availability
  Backup window reduction
  Multisystem HA solution
  Planned maintenance
  Recovery for disaster outages
Glossary

Notices

This information was developed for products and services offered in the U.S.A.

IBM may not offer the products, services, or features discussed in this document in other countries. Consult your local IBM representative for information on the products and services currently available in your area. Any reference to an IBM product, program, or service is not intended to state or imply that only that IBM product, program, or service may be used. Any functionally equivalent product, program, or service that does not infringe any IBM intellectual property right may be used instead. However, it is the user's responsibility to evaluate and verify the operation of any non-IBM product, program, or service.

IBM may have patents or pending patent applications covering subject matter described in this document. The furnishing of this document does not give you any license to these patents. You can send license inquiries, in writing, to: IBM Director of Licensing, IBM Corporation, North Castle Drive, Armonk, NY 10504-1785 U.S.A.

The following paragraph does not apply to the United Kingdom or any other country where such provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of express or implied warranties in certain transactions; therefore, this statement may not apply to you.
This information could include technical inaccuracies or typographical errors. Changes are periodically made to the information herein; these changes will be incorporated in new editions of the publication. IBM may make improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time without notice.

Any references in this information to non-IBM Web sites are provided for convenience only and do not in any manner serve as an endorsement of those Web sites. The materials at those Web sites are not part of the materials for this IBM product and use of those Web sites is at your own risk.

IBM may use or distribute any of the information you supply in any way it believes appropriate without incurring any obligation to you.

Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not tested those products and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products.

This information contains examples of data and reports used in daily business operations. To illustrate them as completely as possible, the examples include the names of individuals, companies, brands, and products. All of these names are fictitious and any similarity to the names and addresses used by an actual business enterprise is entirely coincidental.

COPYRIGHT LICENSE: This information contains sample application programs in source language, which illustrate programming techniques on various operating platforms. You may copy, modify, and distribute these sample programs in any form without payment to IBM, for the purposes of developing, using, marketing, or distributing application programs conforming to the application programming interface for the operating platform for which the sample programs are written. These examples have not been thoroughly tested under all conditions. IBM, therefore, cannot guarantee or imply reliability, serviceability, or function of these programs.

Trademarks

The following terms are trademarks of the International Business Machines Corporation in the United States, other countries, or both:

Eserver®
ibm.com®
iSeries™
i5/OS™
Enterprise Storage Server®
FlashCopy®
IBM®
Redbooks (logo)™
Redbooks™
TotalStorage®

The following terms are trademarks of other companies:

Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both.

Linux is a trademark of Linus Torvalds in the United States, other countries, or both.

Other company, product, and service names may be trademarks or service marks of others.

Preface

Choosing the correct set of data resilience technologies in the context of your overall business continuity strategy can be complex and difficult, not least because business continuity is such a broad topic.
This IBM® Redpaper provides some insight into this broad topic and then describes the technologies that support improved data resilience for end users. It explains the capabilities, advantages, and limitations of the various technologies. It provides information and techniques that you can use to select the best set of technologies available on IBM Eserver i5, to use in conjunction with IBM i5/OS™ high availability clusters, to satisfy your specific business continuity goals.

This IBM Redpaper is organized so that you can study the content from cover to cover (recommended) or use specific sections for reference as needed. It begins with a discussion about business continuity requirements. This information helps you to determine and prioritize your business requirements in the context of the specific problem sets of interest to you. Next, this IBM Redpaper presents an overview of the technologies that are related to business continuity. This helps you to understand the technology categories and the choices within each category that are available to address the problem sets. Then the paper explains how the technologies apply to the various business continuity requirements and how they compare with one another. A detailed analysis is included to help position the various technologies against your specific business requirements. Finally, conclusions are drawn about the technologies that map solutions to the characteristics of end-user environments.

Although this paper does not describe the value proposition of high availability or contrast technologies for other aspects of business continuity, you can find a starter set of references for this type of material on the IBM Eserver iSeries™ High Availability Web site at:

http://www-1.ibm.com/servers/eserver/iseries/ha/

The team that wrote this Redpaper

This Redpaper was produced by a team of specialists from IBM Rochester, Minnesota.

Steve Finnes
Bob Gintowt
Mike Snyder
IBM Rochester

Thanks to the following people for their contributions to this project:

Lou Antoniolli
Sue Baker
Selwyn Dickey
Janice Dunstan
Eric Hess
Mike McDermott
Jeff Palm
Stu Preacher
Jim Ranweiler
Larry Youngren
IBM Rochester

Become a published author

Join us for a two- to six-week residency program! Help write an IBM Redbook or Redpaper dealing with specific products or solutions, while getting hands-on experience with leading-edge technologies. You'll team with IBM technical professionals, Business Partners, and/or customers. Your efforts will help increase product acceptance and customer satisfaction. As a bonus, you'll develop a network of contacts in IBM development labs, and increase your productivity and marketability.

Find out more about the residency program, browse the residency index, and apply online at:

ibm.com/redbooks/residencies.html

Comments welcome

Your comments are important to us! We want our papers to be as helpful as possible. Send us your comments about this Redpaper or other Redbooks™ in one of the following ways:

- Use the online Contact us review redbook form found at: ibm.com/redbooks
- Send your comments in an email to: redbook@us.ibm.com
- Mail your comments to: IBM Corporation, International Technical Support Organization, Dept. HYJ, Mail Station P099, 2455 South Road, Poughkeepsie, NY 12601-5400
Chapter 1. What is business continuity

Clients are continually faced with the complex task of determining which solutions or technologies to deploy to address the various business requirements that must be supported by IT. In the case of business continuity requirements, the task is equally daunting. Detailed business continuity requirements must be developed and documented, the solution types identified, and the solution choices evaluated. This is a challenging task, due in part to the complexity of the problem and in part to the confusion associated with conflicting requirements, misstated objectives, unrealistic expectations, and incomplete information.

Business continuity is the capability of a business to withstand outages and to operate important services normally and without interruption, in accordance with predefined service-level agreements. To achieve a given level of business continuity, a collection of services, software, hardware, and procedures must be selected, described in a documented plan, implemented, and practiced regularly. The business continuity solution must address the data, the operational environment, the applications, the application hosting environment, and the end-user interface. All must be available to deliver a good, complete business continuity solution.

Business continuity includes disaster recovery (DR) and high availability (HA). For example, one aspect of a business continuity plan may be the set of resources, plans, services, and procedures used to recover important applications and to resume normal operations for these applications at a remote site, in the event of a disaster that causes a complete outage at the production site. This disaster recovery plan includes a stated disaster recovery goal (for example, resume operations within eight hours) and addresses acceptable levels of degradation.

Another major aspect of business continuity goals for many customers is high availability. This can be defined as the ability to withstand all outages (planned, unplanned, and disasters) and to provide continuous processing for all important applications. The ultimate goal is for the outage time to be less than 0.001% of total service time.

The differences between high availability and disaster recovery typically include more demanding recovery time objectives (seconds to minutes) and more demanding recovery point objectives (zero end-user disruption). HA solutions are characterized by fully automated failover to a backup system, to approach the objective of a nondisruptive continuation of productive end-user activity. HA solutions must have the ability to provide an immediate recovery point. At the same time, they must provide a recovery time capability that is significantly better than the recovery time that you would experience in a non-HA solution topology. For example, recovering a single system from tape can take many hours; the HA solution should be able to do this recovery in minutes.
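To make a target like "less than 0.001% outage time" concrete, it helps to translate an availability percentage into an annual downtime budget. The short sketch below is a back-of-the-envelope calculation added for illustration, not part of the paper's tooling:

```python
# Translate an availability target into a yearly downtime budget.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

for availability in (99.9, 99.99, 99.999):
    downtime_minutes = MINUTES_PER_YEAR * (100.0 - availability) / 100.0
    print(f"{availability}% available -> {downtime_minutes:.1f} minutes of outage per year")

# 99.9%   -> about 525.6 minutes (roughly 8.8 hours) per year
# 99.99%  -> about 52.6 minutes per year
# 99.999% -> about 5.3 minutes per year (the "less than 0.001%" goal above)
```

Seen this way, the 0.001% goal leaves only a few minutes of outage per year, which is why fully automated failover, rather than recovery from tape, is required.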
Frequently, the scope of disaster recovery involves the entire system (the operating system, the hosted applications, and the associated data). High availability solutions are more granular and can be targeted at individual critical resources within a system, for example a specific application instance.

At the heart of the iSeries high availability solution is cluster technology. A cluster is a collection of interconnected complete systems used as a single, unified computing resource. The cluster provides a coordinated, distributed process across the systems to deliver the solution. This results in higher levels of availability, some horizontal growth, and simpler administration across the enterprise.

You should expect cluster architectures and implementations to deliver a complete availability solution. In the complete solution, you must address the operational environment, the application hosting environment, application resilience, and the end-user interfaces, in addition to providing data resilience mechanisms. IBM i5/OS cluster technology focuses on all aspects of the complete solution. The integrated cluster resource services enable you to define a cluster of systems and the set of resources that should be protected against outages. Cluster resource services detect outage conditions and coordinate automatic movement of critical resources to a backup system.

Chapter 2. Major business continuity problem sets

The starting point for the selection process is to fully identify the set of availability problems that you are attempting to address. For business continuity, these problems can be collected into five major categories. Each problem is placed into one of these categories based on a set of criteria, which is also explained in this chapter.

2.1 Problem categories

The problem sets are not mutually independent. Detailed analysis shows that elements of each problem set overlap, as illustrated in Figure 2-1 (a diagram of five overlapping sets: HA for planned outages, backup window reduction, disaster recovery, HA for unplanned outages, and workload balancing). Each problem category is explained in the following sections.

Figure 2-1 Overlapping problem sets

Backup window reduction

The requirement for this category is to reduce or eliminate the amount of non-production time required to do regular backups, such as nightly backups. Typically, the production window (a metaphor for time) has grown, and the backup window has also grown to accommodate the increase in the size and number of end-user objects that must be saved. Although a daily outage for performing backups may be available, the time needed for the backup exceeds the available window.

Availability for planned outage events

The requirement for this category is to ensure that production can continue, even though an outage is required for planned maintenance. Planned maintenance outages include hardware service, hardware upgrades, system software service, system software upgrades, certain system administrative tasks, application software service, and application software upgrades. Planned outages may also be the result of environmental factors such as building power upgrades and IT center relocation. Business operations demand that the applications and associated data continue to be available during any planned maintenance.

Recovery from disaster-related outage events

The requirement for this category is to ensure that an extended outage, due to some disaster, does not adversely affect the business. The business is recoverable, and the corporate environment may be resumed with minimal loss of data. A solution is required to protect against local system outages as well as site disasters, such as fire, flood, and tornado. Typically, applications and associated data are relocated to a backup system at a different physical location than the normal production system.
However, the business operations may tolerate IT services being unavailable for some amount of time while disaster recovery is performed, typically in the 12 to 48 hour range but sometimes less. In addition, disaster recovery may involve some amount of manual processing, since it is assumed to be rarely needed.

High availability for unplanned outage events

The requirement for this category is to ensure that critical system resources, such as applications, are continuously available during both unplanned outages and planned outages. Unplanned outages may include system hardware failures, disk and disk subsystem failures, power failures, operating system failures, application failures, and user errors. Applications and associated data must be available 24x7. Business operations cannot tolerate extended outages while a disaster recovery site is brought online or while a backup system is provisioned. Business operations dictate that failover processing to a backup server must be fully automated. Normal production is expected to continue with acceptable degradation.

Workload balancing

Multiple workloads (from multiple applications, multiple instances of the same application, or both) are hosted in a multiple server environment. Each application instance has a predetermined service-level objective, such as response time goals. If one or more of the application instances do not achieve their stated objective due to system overload, they can be relocated to another server that has available capacity.

2.2 Selection criteria

Keep in mind that generalizations, such as those illustrated in Figure 2-1, can distort the answer. This discussion is a guide to potentially rule out certain solution types. In practice, consider all of the major and supporting decision factors, such as:

- Uptime requirements
- Recovery time objective
- Recovery point objective
- Resilience requirements
- Concurrent access requirements
- Geographic dispersion requirements
- Tolerance for end-user disruption
- Outage type coverage
- Cost
- Service and support

You can find details about these and other decision factors in Chapter 6, "Comparison characteristics", and in Appendix A, "Decision factors".

Chapter 3. Overview of business continuity technologies

Several technology choices are available to address the problem sets discussed in Chapter 2, "Major business continuity problem sets". Because the specific requirements for each problem set are articulated across various environments, a single technology is not capable of addressing all needs of all customers. This chapter presents four categories of technology solutions:

- Backup window reduction
- Workload balancing
- Application resilience
- Data resilience

Backup window reduction and workload balancing tie primarily to the backup window reduction and workload balancing problem sets described in Chapter 2. Application and data resilience predominantly apply to the high availability (HA) and disaster recovery (DR) problem sets; both are needed to achieve high availability. In the broad context of business continuity, there are other categories that are not addressed in this paper, such as error-correcting hardware, communications error recovery, and enablers for basic application error handling and recovery.
As stated earlier, the focus of this paper is on multiple system data resilience topologies. However, to place the total solution into context, you must also consider the other aspects of business continuity. This chapter introduces some basic techniques for backup window reduction and workload balancing, the various types of application resilience, and data resilience. Each is covered briefly, as a reminder that to achieve a total solution you cannot address data resilience alone. Additional details regarding these other technologies, as well as the advantages and disadvantages of the various solutions, are not provided in this paper.

3.1 Backup window reduction

The obvious techniques for reducing or eliminating the backup window involve either decreasing the time to perform the backup or decreasing the amount of data backed up. Classic examples of these techniques are:

- Improved tape technologies: Faster and denser tape technologies can reduce the total backup time.

- Parallel saves: Using multiple tape devices concurrently can reduce backup time by eliminating or reducing serial processing on a single device (see the sizing sketch at the end of this section).

- Saving to non-removable media: Saving to media that is faster than removable media, for example directly to direct access storage device (DASD), can reduce the backup window. Data can be migrated to removable media at a later time.

- Data archiving: Data that is not needed for normal production can be archived and taken offline. It is brought online only when needed, perhaps for month-end or quarter-end processing. The daily backup window is reduced since the archived data is not included.

- Saving only changed objects: Daily backups exclude objects that have not changed during the course of the day. The backup window can be dramatically reduced if the percentage of unchanged objects is relatively high.

Other save window reduction techniques leverage a second copy of the data (real or virtual). These techniques include:

- Saving from a second system: Data resilience technologies, such as logical replication, that make available a second copy of the data can be used to shift the save window from the primary copy to the secondary copy. This technique can eliminate the backup window on the primary system. It does not affect production, since the backup processing is done on a second system.

- Save while active: In a single system environment, the data is backed up using save processing while applications may be in production. To ensure the integrity and usability of the data, a checkpoint is achieved that ensures point-in-time consistency. The object images at the checkpoint are saved, while change operations are allowed to continue on the object itself. The saved objects are consistent with respect to one another, so that you can restore the application environment to a known state. Save while active may also be deployed on a redundant copy achieved through logical replication. Employing such a technique can effectively eliminate the save window.

- IBM TotalStorage® Enterprise Storage Server (ESS) FlashCopy® used in conjunction with the iSeries Copy Services for ESS toolkit: This technology uses the ESS FlashCopy function on an independent auxiliary storage pool (IASP) basis. A point-in-time snapshot of the IASP is taken on a single ESS server. The copy of the IASP is done within the ESS (in one set of logical disk units), and the host is not aware of the copy. The toolkit enables bringing the copy online to the backup system for purposes of doing saves or other offline processing. The toolkit also manages bringing the second system back into the cluster in a nondisruptive fashion. The Copy Services toolkit supports multiple IASPs, from the same or multiple production systems, being attached at the same time.
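To see how the first two techniques combine, the following sketch estimates a backup window from data volume, the fraction of objects that changed, and the number of tape drives used in parallel. It is an illustrative back-of-the-envelope model, not from the original paper; the data sizes and drive throughput are assumed placeholders, not measured values:

```python
def backup_window_hours(total_gb, changed_fraction, drives, gb_per_hour_per_drive):
    """Rough backup-window estimate: save only changed data, spread
    evenly across parallel tape drives. Assumes perfect parallelism,
    which real saves only approximate."""
    data_to_save = total_gb * changed_fraction
    return data_to_save / (drives * gb_per_hour_per_drive)

# Hypothetical numbers: 500 GB total, 30% changed daily, 100 GB/hour per drive.
full_save = backup_window_hours(500, 1.0, 1, 100)    # 5.0 hours
incremental = backup_window_hours(500, 0.3, 2, 100)  # 0.75 hours
print(f"Full save, one drive: {full_save:.2f} h; changed-only, two drives: {incremental:.2f} h")
```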
3.2 Workload balancing

The most common technologies for workload balancing involve moving work to available resources. Contrast this with common performance management techniques, which involve moving resources to work that is not achieving its performance goals. Example workload balancing technologies (each with its own HA implications) are:

- Front-end routers: These routers handle all incoming requests and then use an algorithm to distribute work more evenly across available servers. Algorithms may be as simple as a round-robin distribution, or complex and based on actual measured performance (see the sketch after this list).

- Multiple application servers: An end user distributes work via some predefined configuration or policy across multiple application servers. Typically the association from requester to server is relatively static, but the requesters are distributed as evenly as possible across multiple servers.

- Distributed, multi-part applications: These applications work in response to end-user requests that actually flow across multiple servers. The way in which the work is distributed is transparent to the end user. Each part of the application performs a predefined task and then passes the work on to the next server in sequence. The most common example of this type of workload balancing is a three-tiered application with a back-end database server.

- Controlled application switchover: Work is initially distributed in some predetermined fashion across multiple servers. A server may host multiple applications, multiple instances of the same application, or both. If a given server becomes overloaded while other servers are running with excess capacity, the operations staff moves applications, or instances of applications with their associated data, from the overloaded server to the underutilized server. Workload movement can be manual or automated based on a predetermined policy.
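As a concrete illustration of the front-end router idea, here is a minimal sketch of the two distribution policies mentioned above, round-robin and load-based. The server names and load figures are hypothetical:

```python
import itertools

servers = ["iseries-a", "iseries-b", "iseries-c"]  # hypothetical server pool

# Policy 1: round-robin -- each new request goes to the next server in turn.
round_robin = itertools.cycle(servers)

def route_round_robin():
    return next(round_robin)

# Policy 2: least-loaded -- route to the server reporting the lowest load.
# In a real router these figures would come from live performance measurements.
current_load = {"iseries-a": 0.72, "iseries-b": 0.35, "iseries-c": 0.90}

def route_least_loaded():
    return min(current_load, key=current_load.get)

print([route_round_robin() for _ in range(4)])  # a, b, c, a
print(route_least_loaded())                     # iseries-b
```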
3.3 Application resilience

Application resilience can be classified according to the following categories:

- No application recovery: After an outage, end users must manually restart their applications. Based on the state of the data, they determine where to restart processing within the application.

- Automatic application restart and manual repositioning within applications: Applications that were active at the time of the outage are automatically restarted. However, the user must still determine where to resume within the application, based on the state of the data.

- Automatic application restart and semi-automatic recovery: In addition to the applications automatically restarting, the end users are returned to some predetermined "restart" point within the application. The restart point may be, for example, a primary menu within the application. This is normally consistent with the state of the resilient application data, but the user may have to advance within the application to actually match the state of the data.

- Automatic application restart and automatic recovery to the last transaction boundary: The user is repositioned within the application to the processing point that is consistent with the last committed transaction. The application data and the application restart point match exactly.

- Full application resilience with automatic restart and transparent failover: In addition to being repositioned to the last committed transaction, the end user continues to see exactly the same window with the same data as when the outage occurred. There is no data loss, signon is not required, and there is no perception of loss of server resources. The user perceives only a delay in response time.

Important: You can combine any of these application resilience mechanisms with the data resilience mechanisms described in the following section to provide a complete solution.

3.4 Data resilience

You can use a number of technologies to address the data resilience requirements described in 2.1, "Problem categories". This paper describes the four key multisystem data resilience technologies. Since they are the focus of this paper, they are only introduced here. You can find more details in Chapter 5, "Detailed attributes of data resilience technologies".

Logical replication

A second copy of the data is generated that is logically identical to the first. The replication is done on an object basis (file, member, data area, program, and so on) in near real time. Where possible, the replication is done at the lowest unit of change for the object, for example at the record level for a database file. Otherwise, the replication is done on the entire object when a change is detected by the replication software. Logical replication is normally accomplished using a Business Partner software product and is associated with a data cluster resource group (CRG). This technology is a data- and object-based replication solution.

Note: See the following HA Web site for a list of HA Business Partners:
http://www-1.ibm.com/servers/eserver/iseries/ha/

Switchable device

A single copy of the data is maintained in an IASP. However, the data in an IASP can be moved to a backup system in the event of an outage (scheduled or unscheduled) on the system that is currently hosting the data. The data is then available for production processing on the new primary system (previously the backup system). The switchable device is controlled by a device CRG. This technology involves no replication and no additional copies of the data.

Cross-site mirroring (XSM)

A second copy of the data in the IASP is generated that is logically identical to the first. Using the operating system geographic mirroring function, XSM mirrors data on disks at sites that can be separated by a significant distance. This technology extends the functionality of a device CRG beyond the limits of physical device connectivity. As data is written to the production copy of an IASP, geographic mirroring replicates the change to a second copy of the IASP through another system. The detailed comparison in Chapter 5, "Detailed attributes of data resilience technologies", through Chapter 7, "General guidelines", assumes an XSM environment where each copy of the IASP is also protected by normal switched device technology. XSM with geographic mirroring is an operating system storage management-based replication solution.
ESS PPRC used in conjunction with the iSeries Copy Services for ESS toolkit

A second copy of the data is generated that is physically identical to the first. The ESS peer-to-peer remote copy (PPRC) function is combined with the IASP as the basic storage unit. These functions are augmented with an iSeries Technology Center (iTC) toolkit, which provides a level of automation and operational protection. The toolkit enables bringing the IASP copy online to the backup system. It also manages bringing the second system back into the cluster in a nondisruptive fashion. In addition, the toolkit provides function to reverse the PPRC direction on switchover or failover.

PPRC is a real-time remote copy technique that synchronously mirrors a primary set of volumes (that are updated by applications) onto a secondary set of volumes. Typically, the secondary volumes are on a different ESS located at a remote location (the recovery site) some distance away from the application site. Mirroring is done at a logical volume level. This technology is a storage server-based replication solution.

Chapter 4. Applicability of a solution to a problem set

Using the business continuity problem decomposition and the categorization of data resilience technologies, you can determine possible matches of technologies to specific needs. Table 4-1 provides a starting point for this selection process. It is only one of several tools that you should use, and therefore not an exclusive means, to select the correct set of technologies for your requirements. See Chapter 5, "Detailed attributes of data resilience technologies", for additional tools. Use Table 4-1 to help you eliminate the technologies that do not fit. After this initial analysis, perform a detailed analysis of the complete requirement sets against the specific characteristics of each technology choice.

The data resilience technologies listed in Table 4-1 include those that are primarily targeted at improved data resilience. The table does not include other special technologies that target primarily save window reduction, workload balancing, or application resilience. An entry of N/A indicates that the technology for that column most likely is not applicable to the problem set for that row.

Table 4-1 Possible technology mapping

Backup window reduction*:
  Logical replication: applicable; Switched disk: N/A; XSM: N/A; ESS toolkit for PPRC: N/A
Planned maintenance:
  Logical replication: applicable; Switched disk: applicable; XSM: applicable; ESS toolkit for PPRC: applicable
Recovery for disaster outage:
  Logical replication: applicable; Switched disk: N/A; XSM: applicable; ESS toolkit for PPRC: applicable
HA for unplanned outage:
  Logical replication: applicable; Switched disk: applicable; XSM: applicable; ESS toolkit for PPRC: N/A
Workload balancing:
  Logical replication: applicable; Switched disk: applicable; XSM: applicable; ESS toolkit for PPRC: N/A

* Although the technologies marked N/A in this row cannot provide backup window reduction when used alone, they can be augmented by other technologies to achieve these results. For example, you can combine IBM TotalStorage Enterprise Storage Server® (ESS) FlashCopy with the peer-to-peer remote copy (PPRC) function and then use the additional data copy on another system to reduce backup time.
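The elimination process that Table 4-1 supports can be expressed as a simple filter over requirement flags. The sketch below is an illustrative encoding of the table above; the capability sets simply restate the table's rows, and the function names are this sketch's own:

```python
# Capabilities per technology, restating Table 4-1 (membership = applicable).
CAPABILITIES = {
    "Logical replication":  {"backup_window", "planned", "disaster", "unplanned_ha", "workload"},
    "Switched disk":        {"planned", "unplanned_ha", "workload"},
    "XSM":                  {"planned", "disaster", "unplanned_ha", "workload"},
    "ESS toolkit for PPRC": {"planned", "disaster"},
}

def candidate_technologies(required):
    """Return technologies whose capability set covers every required problem set."""
    return [tech for tech, caps in CAPABILITIES.items() if required <= caps]

# Example: a shop that needs disaster recovery and unplanned-outage HA.
print(candidate_technologies({"disaster", "unplanned_ha"}))
# -> ['Logical replication', 'XSM']
```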
Chapter 5. Detailed attributes of data resilience technologies

While the previous chapters introduce several availability technologies, this chapter explores each of the four key data resilience technologies in greater detail. It examines the characteristics, advantages, limitations, and other considerations for each technology.

5.1 Logical replication

Logical replication is the most seasoned and widely deployed multisystem data resilience topology for high availability (HA) in the iSeries space. It is typically deployed via an HA Business Partner (HABP) solution package. Replication is executed, via software methods, on objects. Changes to the objects (for example, file, member, data area, or program) are replicated to a backup copy. The replication is near real time. Typically, if the object, such as a file, is journaled, replication is handled at the record level. For objects such as user spaces that are not journaled, replication is typically handled at the object level. In this case, the entire object is replicated after each set of changes to the object is complete.

Most logical replication solutions allow for additional features beyond object replication. For example, you can achieve additional auditing capabilities, observe the replication status in real time, automatically add newly created objects to those being replicated, and replicate only a subset of objects in a given library or directory.

To build an efficient and reliable multisystem HA solution using logical replication, synchronous remote journaling as a transport mechanism is preferable. With remote journaling, IBM i5/OS continuously moves the newly arriving data in the journal receiver to the backup server journal receiver. At this point, a software solution is employed to "replay" these journal updates, placing them into the object on the backup server. After this environment is established, there are two separate yet identical objects, one on the primary server and one on the backup server. With this solution in place, you can rapidly activate your production environment on the backup server via a role-swap operation.

Figure 5-1 illustrates the basic mechanics in a logical replication environment.

Figure 5-1 Logical replication
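The "replay" step that a replication product performs on the backup server can be pictured as a loop that reads journal entries as they arrive and applies each change to the local copy of the target object. The sketch below is a highly simplified illustration of that idea; the entry format and functions are hypothetical, not the interface of any actual HABP product or of i5/OS remote journaling:

```python
# Simplified journal-replay loop on the backup server (illustrative only).
# Each entry describes one change captured in the journal on the primary.

def apply_entry(entry, objects):
    """Apply one journal entry to the backup copy of the object it names."""
    obj = objects[entry["object"]]
    if entry["kind"] == "record_update":
        # Journaled files: replay at the record level, in arrival order.
        obj["records"][entry["record_id"]] = entry["data"]
    elif entry["kind"] == "object_image":
        # Non-journaled objects: replace the whole object image at once.
        obj["image"] = entry["data"]

def replay(journal_stream, objects):
    for entry in journal_stream:     # entries arrive via remote journaling
        apply_entry(entry, objects)  # preserving order keeps the copies consistent

backup_objects = {"CUSTFILE": {"records": {}}, "USRSPACE": {"image": None}}
stream = [
    {"object": "CUSTFILE", "kind": "record_update", "record_id": 7, "data": "row-7-v2"},
    {"object": "USRSPACE", "kind": "object_image", "data": "snapshot-after-change"},
]
replay(stream, backup_objects)
```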
A key advantage of this solution category is that the backup database file is "live". That is, it can be accessed in real time for backup operations or for other read-only application types, such as building reports. In addition, this normally means that minimal recovery is needed when switching over to the backup copy.

The challenge with this solution category is the complexity that can be involved in setting up and maintaining the environment. One of the fundamental challenges is enforcing discipline over modification of the live copies of objects residing on the backup server. If such a discipline is not properly enforced, end users and programmers may make changes against the live copy so that it no longer matches the production copy. Should this happen, the primary and the backup versions of your files are no longer identical. Significant advances by iSeries HABPs, in the form of tools designed to simplify the management aspects and perform periodic data validation, can help detect such behavior.

Another challenge associated with this approach is that objects that are not journaled must go through a checkpoint, be saved, and then sent separately to the backup server. Therefore, the granularity of the real-time nature of the process may be limited to the granularity of the largest object being replicated for a given operation. For example, suppose a program updates a record residing within a journaled file and, as part of the same operation, also updates an object, such as a user space, that is not journaled. The backup copy becomes completely consistent only when the user space is entirely replicated to the backup system. Practically speaking, if the primary system fails and the user space object is not yet fully replicated, a manual recovery process is required to reconcile the state of the non-journaled user space to match the last valid operation whose data was completely replicated.

Another possible challenge associated with this approach lies in the latency of the replication process. This refers to the amount of lag time between the time at which changes are made on the source system and the time at which those changes become available on the backup system. Synchronous remote journaling can mitigate this to a large extent. Regardless of the transmission mechanism used, you must adequately project your transmission volume and size your communication lines and speeds properly to help ensure that your environment can manage replication volumes when they reach their peak. In a high volume environment, replay backlog and latency may be an issue on the target side even if your transmission facilities are properly sized.
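Sizing the communication line, as the previous paragraph recommends, is mostly arithmetic: the link must at least cover the peak rate of journal traffic, or a backlog (and therefore latency) builds. A rough sketch, with purely hypothetical workload numbers:

```python
# Rough link-sizing check for a replication transport (illustrative numbers).
peak_journal_mb_per_min = 120.0  # measured or projected peak journal generation
line_mbit_per_s = 10.0           # candidate line speed
protocol_efficiency = 0.7        # assumed usable fraction after protocol overhead

usable_mb_per_min = line_mbit_per_s / 8 * 60 * protocol_efficiency  # ~52.5 MB/min
if usable_mb_per_min < peak_journal_mb_per_min:
    shortfall = peak_journal_mb_per_min - usable_mb_per_min
    print(f"Undersized: backlog grows ~{shortfall:.0f} MB per minute at peak")
else:
    print("Line covers the projected peak")
```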
5.2 Switchable device

The iSeries implementation of the switchable device, independent auxiliary storage pools (IASPs), supports both directory objects (such as the integrated file system (IFS)) and library objects (such as database files). It is provided as part of i5/OS Option 41, High Availability Switchable Resources. The IT community often refers to this topology as switched disks. A key point about this solution is that it is inherently a logical solution, as opposed to a purely mechanical-switching solution. The architecture is deployed within the operating system as a special class of auxiliary storage pool (ASP) that is independent of a particular host system. The practical outcome of this architecture is that switching an IASP from one system to another involves less processing time than a full initial program load (IPL). Figure 5-2 illustrates the concept of a switchable device.

The benefit of using IASPs lies in their operational simplicity. The single copy of data is always current, meaning there is no other copy with which to synchronize. No in-flight data, such as data that is transmitted asynchronously, can be lost. And there is minimal performance overhead. Role swapping or switching is relatively straightforward, although you may have to account for the time required to vary on the IASP. Another key benefit of using IASPs is zero transmission latency.

The major effort associated with this solution involves setting up the direct access storage device (DASD) configuration, the data, and the application structure. Making an IASP switchable is relatively simple. The IASP is placed into a switchable tower that is attached to two servers (or partitions) via a high-speed link (HSL) loop.

Figure 5-2 Switchable device

Limitations are also associated with the IASP solution. First, there is only one logical copy of the data in the IASP. This can be a single point of failure, although the data may be protected using RAID 5 or mirroring. The data cannot be concurrently accessed from both hosts for such things as read access or, more importantly, for backup to tape operations. Certain object types, such as configuration objects, cannot be stored in an IASP. You need another mechanism, such as periodic save/restore or logical replication, to ensure that these objects are appropriately maintained. Another limitation involves hardware-associated restrictions. Examples include distance limits in the HSL loop technology and outages associated with certain hardware upgrades. The IASP cannot be brought online to a down-level system. Other considerations include database restrictions on cross-IASP relationships, such as JOINs and referential integrity rules. Therefore, up-front database design and analysis are essential.

5.3 Cross-site mirroring

This solution type involves the mirroring of IASP data via i5/OS storage management to a second, and perhaps remote, server over a communications fabric. Cross-site mirroring (XSM) is included in Option 41 of i5/OS Version 5 Release 3. It enables switching or automatic failover to a mirrored copy of the IASP (see Figure 5-3), in addition to locally switching the IASP between systems. It addresses the single point of failure issue of the basic switchable device structure. It also provides a means to develop a remote mirrored copy of your IASP data via a function called geographic mirroring.

Figure 5-3 Cross-site mirroring (the production copy of the IASP on the primary source system is mirrored to a mirror copy on the backup system)

The benefits of this solution are essentially the same as those of the basic switchable device solution, with the added advantage of providing disaster recovery to a second copy at increased distance. The biggest benefit continues to be operational simplicity. All of the data placed in the production copy of the IASP, including the journal receivers, is mirrored to a second IASP on a second, perhaps remote, system. The switching operations are essentially the same as those of the switchable device solution, with the added benefit that you can also switch to the mirror copy of the IASP, making this a straightforward HA solution to deploy and operate. As in the switchable device solution, objects not in the IASP must be handled via some other mechanism, and the IASP cannot be brought online to a down-level system. XSM also provides real-time replication support for hosted integrated environments such as Microsoft® Windows® and Linux®. This is not generally possible through journaling-based logical replication.

A potential limitation of an XSM solution is performance impact in certain workload environments. As with any solution that uses synchronous communications, you must consider the distance, bandwidth, and latency limitations associated with transmission times. When running input/output (I/O)-intensive batch jobs, some performance degradation on the primary system is possible. Also, be aware of the increased central processing unit (CPU) overhead required to support XSM.

In addition, another limitation of an XSM solution is that concurrent operations cannot access the mirror copy of the IASP. For example, if you want to back up to tape from the geographically mirrored copy, you must quiesce operations on the source system and detach the mirrored copy. Then you must vary on the detached copy of the IASP, perform the backup procedure, and then re-attach the IASP to the original production host. This mandates full data resynchronization between the production and mirrored copies. Depending on how long it takes to synchronize the primary and backup IASP copies, it may be impractical to detach and then reattach the mirrored copy for a backup-to-tape operation in certain environments. Your system is running exposed while doing the backups and while synchronization is occurring. Synchronization is also required for any persistent transmission interruption, such as the loss of all communication paths between the source and target systems for an extended period of time. To minimize the potential for this situation, we recommend that you use redundant transmission links. You should also use XSM in at least a three-system configuration, where the production copy of the IASP can be switched to another system at the same site that can maintain geographic mirroring.
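The detach-backup-reattach sequence described above is an ordered procedure, and the order matters because the production copy runs unprotected from the detach until resynchronization completes. The sketch below outlines the flow; every step is a hypothetical placeholder for the corresponding operator action or automation, not an actual i5/OS command or API:

```python
# Outline of a tape backup taken from a detached geographic-mirror copy.
# Every step is a hypothetical placeholder (here just a print) for the
# corresponding operator action -- not an actual i5/OS command.

def step(description):
    print(description)

def backup_from_mirror_copy(iasp):
    step(f"Quiesce production operations that use {iasp}")  # consistent point
    step(f"Detach the mirror copy of {iasp}")               # copy stops updating
    step(f"Resume production on {iasp} (now running unprotected)")
    step(f"Vary on the detached copy of {iasp} on the backup system")
    step(f"Save {iasp} contents to tape from the backup system")
    step(f"Vary off the detached copy of {iasp}")
    step(f"Reattach the mirror copy of {iasp}: full resynchronization starts")
    step("Protection is restored only when resynchronization completes")

backup_from_mirror_copy("IASP33")
```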
You can configure an XSM environment where both the production and mirrored copies of the IASP are non-switchable between servers at each site. Although this applies to some end-user environments, it is not the configuration assumed in this paper. When running with such a configuration, your system is still exposed to some single points of failure.

5.4 IBM TotalStorage Enterprise Storage Server PPRC used with the iSeries Copy Services for ESS toolkit

This solution type involves the replication of data at the storage controller level to a second storage server using IBM TotalStorage Enterprise Storage Server (ESS) copy services. An IASP is the basic unit of storage for the ESS peer-to-peer remote copy (PPRC) function. PPRC generates a second copy of the IASP on another ESS. The toolkit comes as part of the iSeries Copy Services for ESS services offering. It provides a set of functions to combine PPRC, IASPs, and i5/OS cluster services for coordinated switchover and failover processing through a cluster resource group (Figure 5-4).

Figure 5-4 ESS PPRC toolkit (the two ESSs and their attached systems form an iSeries cluster and device domain)

This solution provides the benefit of the remote copy function and coordinated switching operations, which gives you good data resiliency capability if the replication is done synchronously. The toolkit enables you to attach the second copy to a backup server without an IPL. No load source recovery is involved in the operations. You also have the ability to combine this solution with other ESS-based copy services functions, such as FlashCopy, for additional benefits such as save window reduction.

This solution also has limitations. The switchover processing, while mostly automated, requires some manual intervention to coordinate actions between i5/OS and the ESS. When you are done using the IASP on the backup system and switch back to the original primary via a scheduled switchover with PPRC, no IPL is necessary. However, if the switchover is unscheduled, an extra IPL on the failed system is required before it can accept the IASP again. Because of the required manual processing, this solution is not defined as an HA solution but principally as a disaster recovery solution. It can be used for certain kinds of planned outages.

If the two ESSs are connected synchronously, you must also be aware of the distance limitations associated with transmission times, as with any solution that uses synchronous communications. This approach requires tools and services to deploy. Prior to ESS Release 2.4.0, asynchronous PPRC was never recommended as part of this solution.
Previous implementations of asynchronous PPRC did not preserve write order. Therefore, a consistent view of the data, as well as internal object consistency, could not be guaranteed. However, the recently announced IBM TotalStorage Global Mirror for ESS provides an asynchronous PPRC solution that preserves the order of writes and supports significantly longer distances.

A variation of this solution is to use ESS PPRC without IASPs and the toolkit. This is not considered in this paper, because such a solution involves a long recovery time due to load source recovery processing and long IPL recovery steps. In addition, such a solution does not protect you from simultaneously updating both copies of the data (protection that the toolkit provides); simultaneous updates would leave you without identical copies of the data.

Chapter 6. Comparison characteristics

With this high-level overview of solution applicability to the major problem sets, it is important to explore the detailed characteristics and attributes of the solutions. For this chapter, we selected some key characteristics to consider. However, you may have other characteristics that are equally or more important to your environment. To compare the various availability techniques that use some form of data resiliency, we use the following characteristics in the technology comparison shown in Table 6-1:

- Primary use: This indicates whether the solution is primarily oriented for users with a high availability (HA) requirement or for users who only have a disaster recovery requirement. Disaster recovery (DR) involves recovering all important applications and resuming normal operations for these applications at a remote site in the event of a disaster that causes a complete outage at the production site. High availability enables your environment to withstand all outages (planned, unplanned, and disasters) and to provide continuous processing for all important applications for a very high percentage of time.

- Characteristic of replication mechanism: This provides a brief description of the major characteristics of the solution that generates a copy of the data onto auxiliary storage.

- Recovery time: This refers to the length of time that it takes to recover from an outage (scheduled, unscheduled, or disaster) and resume normal operations for an application or a set of applications.

- Recovery point: This indicates the point, from the perspective of both data and application processing, at which recovery processing returns control to end users.

- Ordering of changes: This indicates how changes are ordered on the backup system and how the order relates to the original sequence of changes on the primary system. Ordering of changes must be preserved to ensure data integrity and the internal consistency of objects. The ordering mechanism can also affect recovery time. For example, while there may not be strict ordering as the system expects it, power-loss integrity can sometimes be preserved by initial program load (IPL) processing to ensure object consistency. However, if interdependent changes are not done in the correct order, IPL recovery may in fact introduce data integrity issues.

- Concurrent access: This is the level of access allowed to a secondary copy of the data. This row also addresses the real-time currency of the replicated copy, which describes the nature of the lag time (or latency) between the secondary copy and the primary copy. A value of 100% current would mean that, from a user's perspective, the copy and the primary are updated simultaneously (zero latency).
- Geographic dispersion: This refers to distance limitations imposed between systems and data copies.

- Number of backup systems: The value for this row indicates the maximum number of backup systems that can be involved with the replication and failover process.

- Number of data copies allowed: The value in this row indicates the maximum number of secondary data copies that can be processed by the solution.

- Cost factors: This row specifies whether there are any specific requirements for the direct access storage device (DASD) used for the primary or secondary copies of the data. This row also describes any additional cost considerations relative to the data storage.

- End-user disruption: This row indicates what the user experiences during normal production and system recovery. For save window reduction, this includes any processing from the time at which the save is initiated until the copy is captured and the user can resume normal processing.

- Outage coverage: This may apply to planned outages, unplanned outages, disasters, or save window reduction.

- Cluster controlled resource: This row specifies whether the resource that is providing the resilience falls under the control of cluster services (for single point of management, automated failover, coordinated switchover, and so on).

- Risks: This row identifies potential risks or exposures that are involved with the solution.

When examining Table 6-1, consider the following notes:

- Since Table 6-1 only addresses data resilience techniques, not all of the decision factors mentioned in Appendix A, "Decision factors", are included in the comparison.

- In some cases, the distance limits are stated as "virtually unlimited". While this is technically true, the actual distance limits are gated by response time degradation tolerances, throughput impacts, characteristics of the communications fabrics, and other factors. For example, for the IBM TotalStorage Enterprise Storage Server (ESS), we recommend that you never use synchronous PPRC across distances greater than 100 km.

- Independent auxiliary storage pool (IASP) vary-on processing covers the time needed to bring the device online to the system and any recovery processing needed on the IASP contents.
Table 6-1 Technology comparison

Primary use
- Logical replication: HA (including DR)
- Switchable IASPs: HA (no DR)
- XSM with geographic mirroring: HA (including DR)
- ESS PPRC with IASP and iTC Toolkit: DR

Characteristic of replication mechanism
- Logical replication: object-based replication; changes at the record or object level, based on the data and audit journals; logical copy of object-level changes for selected objects
- Switchable IASPs: no replication; one copy of the data that is switchable between systems
- XSM with geographic mirroring: page-level replication controlled by the operating system, based on storage management writes; logical copy, since the physical DASD configurations can differ
- ESS PPRC with IASP and iTC Toolkit: sector-level replication of all pages written to disk; physical copy of an IASP based on disk I/O (cache based)

Recovery time considerations
- Logical replication: apply lag plus replication switchover overhead; journal settings; no IPL required; minutes
- Switchable IASPs: IASP vary-on; SMAPP or journal settings; no IPL required; minutes
- XSM with geographic mirroring: IASP vary-on; SMAPP or journal settings; no IPL required; minutes
- ESS PPRC with IASP and iTC Toolkit: quiesce time; manual steps plus vary-on; SMAPP or journal settings; an IPL is sometimes required before the backup can be used again; tens of minutes

Recovery point considerations
- Logical replication: transaction boundary with commitment control; mixed (audit and data journal); data or objects sent to the target are recovered; any changes not yet transmitted are lost (zero data loss with synchronous remote journaling)
- Switchable IASPs: transaction boundary with commitment control; the last data written to the IASP; objects not in the IASP are not covered
- XSM with geographic mirroring: transaction boundary with commitment control; the last data written to the IASP; objects not in the IASP are not covered
- ESS PPRC with IASP and iTC Toolkit: the quiesce point taken for breaking PPRC; transaction boundary with commitment control; the last data written to disk (some automation and protection from mistakes)

Ordering of changes
- Logical replication: based on journal receiver content and the HABP product's ability to synchronize changes from the data and audit journals
- Switchable IASPs: ordering preserved
- XSM with geographic mirroring: ordering at the system level; ordering preserved across the ASP group
- ESS PPRC with IASP and iTC Toolkit: ordering at the controller level; preserved at the LUN set level for synchronous PPRC; no ordering for asynchronous PPRC before version 2.4.0

Concurrent access
- Logical replication: typically read only, possibly shared data; always some lag time in data currency; remote journaling helps
- Switchable IASPs: no concurrent access, since there is no second copy of the data
- XSM with geographic mirroring: no (access requires resynchronization afterward); the second copy is current
- ESS PPRC with IASP and iTC Toolkit: no concurrent access; the copy is current with synchronous PPRC, incoherent with asynchronous PPRC

Geographic dispersion
- Logical replication: virtually unlimited
- Switchable IASPs: limited (250 meters)
- XSM with geographic mirroring: virtually unlimited
- ESS PPRC with IASP and iTC Toolkit: virtually unlimited

Number of backup systems
- Logical replication: 1 <= n < 127 (or the BP maximum)
- Switchable IASPs: n = 1 (with switchable towers)
- XSM with geographic mirroring: 1 <= n <= 3 (2 or 3 with switchable towers)
- ESS PPRC with IASP and iTC Toolkit: 1 <= n <= 2 (2 with cascading PPRC)

Number of data copies
- Logical replication: up to 127 (or the BP maximum)
- Switchable IASPs: none (a single switchable copy)
- XSM with geographic mirroring: 1
- ESS PPRC with IASP and iTC Toolkit: 2

Cost factors
- Logical replication: any DASD configuration; HABP software; bandwidth; duplicate disks; replication overhead
- Switchable IASPs: switchable tower (or IOP); i5/OS Option 41
- XSM with geographic mirroring: any (flexible) DASD configuration; i5/OS Option 41; bandwidth; duplicate disks; geographic mirroring overhead
- ESS PPRC with IASP and iTC Toolkit: external DASD (2 x Shark); bandwidth; i5/OS Option 41; PPRC; the toolkit; duplicate disks; PPRC and toolkit overhead

End-user disruption
- Logical replication: can automatically restart the application
- Switchable IASPs: can automatically restart the application
- XSM with geographic mirroring: can automatically restart the application
- ESS PPRC with IASP and iTC Toolkit: semi-automatic application restart

Outage coverage
- Logical replication: planned, unplanned, disaster, save window
- Switchable IASPs: planned, unplanned
- XSM with geographic mirroring: planned, unplanned, disaster
- ESS PPRC with IASP and iTC Toolkit: some planned outages, disaster

Cluster controlled resource
- Logical replication: yes
- Switchable IASPs: yes
- XSM with geographic mirroring: yes
- ESS PPRC with IASP and iTC Toolkit: yes, for the switchable devices

Risks
- Logical replication: loss of in-flight data; mismatch of data levels for various objects; monitoring the logical object replication environment
- Switchable IASPs: the disk subsystem is a single point of failure, so there is no protection against a catastrophic disk failure
- XSM with geographic mirroring: in the asynchronous case, loss of the copy in some double-failure situations (acceptable if you can quiesce and vary off the mirror copy); resynchronization may yield a lengthy unprotected condition (especially with only two systems)
- ESS PPRC with IASP and iTC Toolkit: an IPL on the backup systems in some situations; somewhat complex; never use asynchronous PPRC unless you are using the new Global Mirror option

Chapter 7. General guidelines

Choosing a multisystem data resilience mechanism can be a complex and confusing decision. Table 6-1 on page 25 helps you compare the resilience technologies in detail; this chapter provides general guidelines about when a particular mechanism may be best suited to a given environment. The technologies are not mutually exclusive: the solution that best fits a set of customer requirements may be achieved by deploying a combination of the available technologies.
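One way to read the guidelines that follow is as a set of requirement filters over the profiles summarized in Table 6-1. The sketch below is purely illustrative: the attribute names and the simplified profile values are our own condensation of the table, not a product interface.

```python
# Illustrative decision helper: keep only the technologies whose (simplified)
# profile, condensed from Table 6-1, satisfies every stated requirement.

profiles = {
    "Logical replication":            {"dr": True,  "concurrent_access": True,  "copies": 127},
    "Switchable IASPs":               {"dr": False, "concurrent_access": False, "copies": 1},
    "XSM geographic mirroring":       {"dr": True,  "concurrent_access": False, "copies": 2},
    "ESS PPRC with IASP and toolkit": {"dr": True,  "concurrent_access": False, "copies": 2},
}

def candidates(needs):
    """Return the technologies whose profile meets every requirement."""
    result = []
    for name, p in profiles.items():
        if needs.get("dr") and not p["dr"]:
            continue  # must cover disaster recovery
        if needs.get("concurrent_access") and not p["concurrent_access"]:
            continue  # must allow work against the secondary copy
        if p["copies"] < needs.get("min_copies", 1):
            continue  # must maintain enough copies of the data
        result.append(name)
    return result

print(candidates({"dr": True, "min_copies": 2}))
# ['Logical replication', 'XSM geographic mirroring', 'ESS PPRC with IASP and toolkit']
print(candidates({"dr": True, "concurrent_access": True}))
# ['Logical replication']
```

In practice the selection also weighs the factors in Appendix A; the guidelines below capture the most common cases.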
Consider logical replication when:
- You need two or more copies of the data.
- You want concurrent access to the second data copy.
- You need backup window reduction.
- You need to selectively replicate objects within a library or directory.
- Your IT staff can monitor the state of the replication environment.
- Geographic dispersion between copies is needed, especially at distances greater than the hardware solutions can achieve.
- You have already deployed a solution that uses logical object replication.
- You need a solution that has no special hardware configuration requirements.
- Failover and switchover times should not exceed tens of minutes.
- Transaction-level integrity is important for all journaled objects.

Consider switchable independent auxiliary storage pools (IASPs) when:
- Only one copy of the data, with hardware protection, satisfies your requirement, and you have considered or addressed avoiding unplanned outages due to disk subsystem failures.
- You need a simple, low-cost, low-maintenance solution.
- Disaster recovery (DR) is not needed; coverage for planned outages and certain types of unplanned outages is all that is required.
- The source and target systems are at the same site.
- You want consistent failover and switchover times, within minutes, that do not depend on transaction volumes.
- Transaction-level integrity is important for all objects.
- You need immediate availability of all object changes, with no loss of in-flight data.
- Objects not within an IASP either do not need to be replicated or are handled by some other mechanism.
- You need the highest-throughput environment.
- Your environment calls for multiple, independent databases that can be moved between systems.

Consider cross-site mirroring when:
- You want a system-generated second copy of the data (at an IASP level).
- You need two copies of the data but do not need concurrent access to the second copy.
- A relatively low-cost, low-maintenance solution is desired, but you also need disaster recovery.
- Geographic dispersion between copies is needed, and your distance requirement does not adversely affect your acceptable production performance goals.
- You want consistent failover and switchover times, within minutes, that do not depend on transaction volumes.
- Transaction-level integrity is important for all objects.
- You need immediate availability of all object changes, with no loss of in-flight data.
- Objects not within an IASP either do not need to be replicated or are handled by some other mechanism.
- The unavailability of the second copy during resynchronization fits within your service-level objectives.

Consider IBM TotalStorage Enterprise Storage Server (ESS) peer-to-peer remote copy (PPRC) with IASP and the Toolkit when:
- You desire a storage-based solution for DR, especially if multiple platforms are involved.
- You do not need complete high availability (HA), but seek to cover DR and some planned outages for critical application data.
- Recovery times of one hour or more are acceptable. (Actual recovery times can be less.)
- You want two copies of the data but do not need concurrent access to the second copy.
- Geographic dispersion between copies is needed, and your distance requirement does not adversely affect your acceptable production performance goals. Alternatively, consider PPRC Global Mirror (asynchronous PPRC).
- Transaction-level integrity is important for all objects.
- You need availability of all object changes, with no loss of in-flight data.

Use a combination solution when no single solution meets all of your business continuity requirements.

Appendix A. Decision factors

A user employs the decision factors in some form of decision tree to determine which data resilience or replication mechanism best suits the user's business continuity needs. The factors categorized as primary decision factors are those requirements that are most likely to be common across a wide user audience; they are also most likely to carry more weight in the decision process. An underlying assumption for all of them is that the mechanism does not compromise the integrity of the data. The factors categorized as supporting decision factors are normally, but not always, secondary requirements in determining an availability solution.

Primary decision factors

The primary decision factors are described in the sections that follow.

Up-time requirements

Up-time requirements refers to the total amount of time that the system is available for end-use applications. The value is stated as a percentage of total scheduled working hours. Typically, the cost per outage hour is used as a determining factor in setting up-time requirements.
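The downtime figures in the list that follows are simple arithmetic on the 8,760 hours in a 24x365 year. A minimal sketch of the conversion:

```python
# Illustrative: convert an availability percentage into downtime per year
# for a 24x365 shop (8,760 scheduled hours per year).

HOURS_PER_YEAR = 24 * 365  # 8,760

def downtime_hours_per_year(availability_pct):
    """Hours of downtime per year implied by an availability percentage."""
    return (1 - availability_pct / 100) * HOURS_PER_YEAR

for pct in (90, 95, 99, 99.9, 99.99, 99.999):
    hours = downtime_hours_per_year(pct)
    if hours >= 1:
        print(f"{pct}% available -> {hours:.1f} hours of downtime per year")
    else:
        print(f"{pct}% available -> {hours * 60:.0f} minutes of downtime per year")
# 99.99% works out to about 53 minutes per year, 99.999% to about 5 minutes.
```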
The values, with the corresponding downtime for a 24x365 shop, are:
- Less than 90% (downtime of 876 hours (36 days) or more per year)
- 90 to 95% (downtime of 438 to 876 hours per year)
- 95 to 99% (downtime of 88 hours (3.6 days) to 438 hours per year)
- 99.1 to 99.9% (downtime of 8.8 to 88 hours per year)
- 99.99% (downtime of about 50 minutes per year)
- 99.999% (downtime of about 5 minutes per year)

Recovery time objective

The recovery time objective (RTO) indicates the length of time it takes to recover from an outage (scheduled, unscheduled, or disaster) and to resume normal operations for an application or a set of applications. The RTO may be different for scheduled and unscheduled outages. The values are:
- More than 4 days is acceptable
- 1 to 4 days
- Less than 24 hours
- Less than 4 hours
- Less than 1 hour
- Approaching zero (near immediate)

Recovery point objective

The recovery point objective (RPO) indicates the point in time to which recovery processing returns the end users, from both the data and the application perspective. The values are:
- Last save (weekly, daily, and so on)
- Start of the last shift (8 hours)
- Last major break (4 hours)
- Last batch of work (1 hour to tens of minutes)
- Last transaction (seconds to minutes)
- In-flight changes may be lost (power-loss consistency)
- Now (near immediate)

Resilience requirements

The resilience requirements are the set of information and system capabilities that must be made resilient. These entities remain available (for example, via failover) even when the system currently hosting them experiences an outage. The values are:
1. Nothing needs to be made resilient
2. Application data
3. Application and system data
4. Application programs, plus item 3
5. Preserved application state, plus item 4
6. Preserved application environment, plus item 5
7. All communications, hardware devices, and user clients preserved, plus item 6

Concurrent access requirements

Concurrent access requirements indicate the level of access required to secondary copies of the data for other work activity offloaded from the primary copy, such as saves and batch reports. Consider at least the frequency, the duration, and the access method. The values for frequency and duration are:
- None
- Seldom, and during periods of non-production
- Infrequent, but during normal production, for short durations (seconds to minutes)
- Infrequent, but during normal production, for long durations
- Frequent, during production, for short durations
- Frequent, during production, for long durations
- Nearly all the time (near continuous)

The values for the access method are:
- None
- Read only
- Read with limited update
- Simultaneous concurrent update

Geographic dispersion requirements

Geographic dispersion requirements indicate the proximity required for secondary copies of the data, system services, and the application environment with respect to the location of the production version. The values are:
- Same room
- 100 meters or less (same building)
- 100 to 200 meters (across the street)
- Less than 2 km (across the campus)
- 2 km to 20 km (same city)
- 20 km to 40 km (same region)
- More than 40 km, up to 100 km
- Hundreds of kilometers

Tolerance for end-user disruption

This factor indicates the impacts, caused by the availability solution and the corresponding recovery processing, that are acceptable to application end users, including slowdowns, failover time, manual restart procedures, and so on. The values for solution impact are:
- Not an issue (availability is of primary importance; performance can be affected as long as the availability solution is delivered)
- Some performance degradation is acceptable
- Slight degradation in performance
- No perceived performance impact

The values for recovery processing impact are:
- Each user must manually restart applications and determine where to reposition
- Automatic application restart, but end users must determine where to reposition
- Automatic restart, with users automatically repositioned to the last main menu (application bookmarking)
- Automatic restart and automatic repositioning to the last transaction boundary (commitment control)
- Automatic restart and automatic repositioning to the last screen, with all non-committed data shown

Outage type coverage (planned, unplanned, disaster)

This factor indicates the types of outages for which recovery is to be provided. There are implied levels of granularity for each of these, so you should itemize all specific outages requiring protection. The values are:
- None
- Site disasters only (building outage, regional disasters, and so on)
- Scheduled outages only (software maintenance, release upgrades, and so on)
- Unscheduled outages only (hardware device failure, system hardware failure, power failure, software or application failures, human errors, and so on)
- All scheduled and unscheduled outages

Cost

Cost requirements indicate the allowed impact of solution cost on adoption and deployment. Solution cost is the total cost of ownership, which includes the initial cost to procure and deploy the solution, the ongoing costs to use it, and any cost/performance impacts. Cost is typically predicated on a thorough business impact analysis. The values are:
- Cost is not a factor
- Cost has a slight bearing on the decision
- Based on outage analysis, the solution cost must be contained within some budget
- Cost is a significant factor in the decision
- Unwilling or unable to spend anything on an availability solution

Service and support

A high availability (HA) solution consists of both function and service. Because customer environments vary, and because the solution types discussed in this paper vary in their degree of completeness, service is a primary factor in the selection of a solution. There are two types of service, explained in the following sections. The solution provider must be able to supply customer satisfaction data gathered with one of the standard methodologies, such as a net satisfaction index (NSI). The provider must also document a full suite of both partner and customer education resources on the deployment and use of the products being marketed in a given region.

Project management, planning, training, and testing

This is the deployment aspect of an HA service offering. It must be performed by specialists who are certified by the solution provider company, and the solution provider must be able to provide information about its certification or training process. A project plan is the basis for an availability solution deployment. The solution provider must demonstrate a complete project plan with deliverables, timetables, dependencies, and owners, as well as the metrics that define completion of the deployment. The number of certified specialists is also a critical factor: a typical HA solution project can take around 30 person-days spread over three months, so the solution provider must have sufficient resources available to implement a given HA project plan completely in the time budgeted for the project.
Technical support

This service is provided by certified support specialists. The solution provider must offer local-language customer support that is available 24x7 at a regional level. Due to the sensitive nature of a high availability solution, it is important that the solution provider demonstrate sufficient staffing to cover simultaneous critical situations, along with a documented procedure for escalation and closure.

Supporting decision factors

The following decision factors sometimes serve as the basis for additional requirements, and you can use them to help determine the correct data resilience solution. Keep in mind that, in some situations, one or more of these factors may actually be primary decision factors; determine the correct set for your specific environment.

- Downtime window availability (nights, weekends, holiday weekends, and so on, or never)
- Switch frequency: how often you plan to exercise outage protection for planned maintenance or to do a planned switchover, such as weekly or monthly
- Save processing objectives, procedures, and so on
- Restore processing objectives, procedures, and so on
- Transaction boundary requirements
- Workload balancing objectives (movement of applications or data to achieve the desired workload objectives)
- Usability of the solution and the associated skill requirements: the susceptibility to user errors or failures caused by complex manual procedures; the degree of automation required to achieve the desired solution; the amount of technical training and depth of knowledge required; and the complexity of deployment, that is, whether the solution can be easily installed, configured, and implemented
- Complementary services and support available
- The number of copies of the data that are needed (1, 2, more than 2)
- Reasons for having additional copies of the data (concurrent save, offloading non-production work to a backup system, business intelligence, and so on)
- Configuration flexibility: interconnection requirements, the number of systems involved, backup order, or a centralized enterprise solution
- Objects covered or supported
- Performance factors (CPU overhead, response time, throughput, batch processing time, and so on)
- Integration of the solution into existing or planned environments

Appendix B. Cautions, caveats, and other tips

The tips, cautions, and caveats presented in this appendix may help you determine your business continuity needs. Where appropriate, enlist the services of a qualified high availability (HA) specialist to guide you through all aspects of establishing and implementing an HA solution. In general, when you change your IT environment, you may need to re-evaluate and possibly adjust your business continuity solutions: adding or changing hardware or applications may affect what you have currently deployed.

Basic single system availability

For basic single system availability, consider the following points:

- Any good business continuity implementation is grounded in basic single-system availability characteristics.
- Use techniques such as journaling, commitment control, and predictive error analysis to provide a strong foundation for any of the problem categories described in Chapter 2, "Major business continuity problem sets" on page 3.
- Ensure that you employ the appropriate hardware-based protection mechanisms to avoid outages caused by single points of failure. For example, determine whether you should use a direct access storage device (DASD) protection mechanism, such as mirroring or RAID, or dual power sources.
- Journal objects for single-system integrity and recovery. Journaling is supported for database files as well as integrated file system (IFS) objects, data areas, and data queues. If you are using logical replication, journaling enables real-time replication of new and changed data at the record level. You need to strike the appropriate balance between increased journaling overhead and IPL recovery times: do not journal objects where data integrity and recovery are not an issue (such as temporary files) or where you do not need real-time, record-level replication.

Backup window reduction

For backup window reduction, note the following tips:

- If you use FlashCopy, ensure that all of your applications are at a quiesce point (for example, by ending the applications) before you invoke FlashCopy. This is the safest way to ensure the consistency of the saved data.
- If you are using logical object replication, you can perform backups on the backup system instead of the primary system. However, you must still ensure that the save is taken from a known recovery point on the target system. You can achieve a quiesce point either by quiescing your applications or, more typically, by suspending replication until all changes are applied on the backup system. A save can be initiated at that point on the backup system. Replication can then be restarted so that changes continue to be sent to the backup system; however, the changes can be held and not applied until the save is completed. Complete the save as quickly as possible and return to normal replication so that (see the sketch after this list):
  - Your replication processing does not fall too far behind.
  - You do not run an exposed system (if you are also doing HA).
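The save-on-backup procedure just described is essentially a fixed sequence of steps. The following sketch shows only that ordering; every function is a hypothetical stub standing in for operator actions or an HABP product's own interfaces, not a real API:

```python
# Illustrative only: the ordering of the backup-window-reduction procedure
# described above. All function bodies are stubs.

import time

def drain_and_hold_apply():  print("all received changes applied; apply now held")
def begin_save():            print("save started from the known recovery point")
def resume_transmission():   print("replication transmits again; apply still held")
def save_finished():         return True   # stub: poll the real save status here
def resume_apply():          print("apply resumed; backup system catches up")

def save_on_backup_system():
    drain_and_hold_apply()       # reach a known recovery point on the backup
    begin_save()                 # capture the quiesced copy on the backup system
    resume_transmission()        # changes queue on the backup but are not applied
    while not save_finished():   # keep the save window as short as possible
        time.sleep(60)
    resume_apply()               # minimize the time spent running exposed

save_on_backup_system()
```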
Multisystem HA solution

For a multisystem HA solution, consider these points:

- The best HA solutions are grounded in clustering technology, which enables better automation, a single point of control, system detection of failures, coordinated failover, and coordinated switchover.
- Any good HA solution is exercised regularly. If you are not doing regular switching from your primary server to your backup server, you may not have the desired level of availability in practice.

Planned maintenance

When performing planned maintenance, keep in mind the following points:

- In general, regardless of the data resilience mechanism chosen, it is always best to quiesce application end users prior to a planned outage. This ensures the highest predictability of data consistency.
- If you are using the IBM TotalStorage Enterprise Storage Server (ESS) toolkit with peer-to-peer remote copy (PPRC) for planned maintenance, additional considerations apply. You can switch over to the secondary copy of the independent auxiliary storage pool (IASP) on the second ESS box, perform the planned maintenance, and then switch back through the toolkit services. The switch processing is somewhat longer and more complicated than with the other mechanisms.
- Determine whether multiple copies of the data and multiple targets for application hosting are needed to provide resilience in planned outage scenarios. If the exposure to an outage of the production system while the backup system is offline is assessed as too risky, use multiple targets for the application and data. Alternatively, use a combination of solutions to avoid a loss of protection.

Recovery for disaster outages

Always exercise your disaster recovery plan. It is not enough simply to set up an environment for disaster recovery: the plan must be exercised to ensure that the automated and manual processing involved yields the desired results and that the operations staff are familiar with the plan.

Glossary

application resilience: The application itself can continue to provide end-user services even if the system that originally hosted the application fails.

asynchronous remote journaling: The journal entry is replicated to the target system after control is returned to the application that deposits the journal entry on the source system.

business continuity: The capability of a business to withstand outages and to operate important services normally and without interruption, in accordance with a predefined service-level agreement.

cluster: A collection of complete systems that work together to provide a single, unified computing capability. An iSeries cluster is made up of only iSeries servers.

cluster node: A system that is a member of a cluster.

cluster resource group (CRG): A collection of related cluster resources that defines the actions to be taken during a switchover or failover of the access point of resilient resources. The group describes a recovery domain and supplies the name of the CRG exit program that manages the movement of an access point.

cross-site mirroring (XSM): A function of i5/OS High Availability Switchable Resources (Option 41) that provides geographic mirroring and the services to switch over, or automatically fail over, to a secondary copy, potentially at another location, in the event of an outage at the primary location.

data resilience: The data remains accessible to the application even if the system that originally hosted the data fails.

disaster recovery: The set of resources, plans, services, and procedures used to recover mission-critical applications and to resume normal operations for those applications at a remote site.

ESS: Enterprise Storage Server; see IBM TotalStorage Enterprise Storage Server.

failover: A cluster event in which cluster-critical resources (for example, data and applications) switch from the primary server to a backup system due to the failure of the primary server or of the resource.

FlashCopy: A hardware-based copy function that provides a point-in-time volume copy within a single ESS.

geographic mirroring: A subfunction of XSM that generates a mirror image of an independent disk pool on a system that is (optionally) geographically distant from the originating site, for availability or protection purposes.

HABP: See High Availability Business Partner.

high availability: The ability to withstand all outages (planned, unplanned, and disasters) and to provide continuous processing for all mission-critical applications.
High Availability Business Partner (HABP): A partner that provides a set of software, services, and solutions that enable high availability for data and applications.

high-speed link (HSL) loop: The system-to-tower connectivity technology that is required to implement switchable independent disk pools that reside on an expansion unit (tower). The servers and towers in a cluster that uses resilient devices on an external tower must be on an HSL loop connected with HSL cables.

IASP: See independent auxiliary storage pool.

IBM TotalStorage Enterprise Storage Server (ESS): A storage server and attached disk storage devices. The storage server provides integrated caching and RAID support for the attached disk devices, which are attached through a Serial Storage Architecture (SSA) interface. The ESS can be configured in a variety of ways to provide scalability in capacity and performance.

independent auxiliary storage pool (IASP): Also known as an independent disk pool. One or more storage units that are defined from the disk units or disk-unit subsystems that make up addressable disk storage. An independent disk pool contains objects, the directories that contain the objects, and other object attributes, such as authorization and ownership attributes. An independent disk pool can be made available (varied on) and made unavailable (varied off) without restarting the system, and it can be either switchable among multiple systems in a clustering environment or privately connected to a single system.

journal: A system object that identifies the objects being journaled, the current journal receiver, and all the journal receivers on the system for the journal.

journaling: The process of recording, in a journal, the changes made to objects, such as physical file members or access paths, or the depositing of journal entries by system or user functions.

logical replication: The process of generating a second copy of data that is logically identical to the first. The replication is done on an object basis (file, member, data area, program, and so on) in near real time.

outage (planned, unplanned, disaster): An event that causes a disruptive loss of IT resources. An outage can be planned, unplanned, or the result of a disaster. Planned outages include scheduled maintenance of systems and software. Unplanned outages include unrecoverable failures of hardware or software components, as well as environmental disruptions such as intermittent power loss. Disasters typically result in the loss of an entire site due to natural events (such as a flood or hurricane) or human-caused events (for example, sabotage).

peer-to-peer remote copy (PPRC): A hardware-based remote copy option that provides a synchronous volume copy across storage subsystems for disaster recovery, device migration, and workload migration.

PPRC: See peer-to-peer remote copy.

recovery point objective (RPO): The point in time to which recovery processing returns the end users, from both the data and the application perspective.

recovery time objective (RTO): The length of time it takes to recover from an outage and resume normal operations for an application or a set of applications.

remote journal: Remote journal management allows you to establish journals and journal receivers on a remote system, or on independent disk pools, that are associated with specific journals and journal receivers on a local system.
Once established, the remote journaling function can replicate journal entries from the local system to the journals and journal receivers located on the remote system or independent disk pools. Delivery of the journal entries can be either synchronous or asynchronous; see synchronous remote journaling and asynchronous remote journaling.

resilience: The ability of an object to recover readily, as from failure, and to return to its original condition. See also application resilience and data resilience.

RPO: See recovery point objective.

RTO: See recovery time objective.

SAN: See storage area network.

SMAPP: See system-managed access-path protection.

storage area network (SAN): A managed, high-speed network that enables any-to-any interconnection of heterogeneous servers and storage systems.

switchable device: The physical resource that contains a resilient resource, such as an independent disk pool, that can be switched between systems in a cluster. The device can be an expansion unit that contains disk units in a multiple-system environment, or an input/output processor (IOP) that contains disk units in a logically partitioned (LPAR) environment.

switchover: A cluster event in which a cluster-critical resource (for example, data or an application) switches from the primary server to a backup system as a result of manual intervention from the cluster management interface. The resource becomes available for processing on the backup system.

synchronous remote journaling: The journal entry is replicated to the target system concurrently with the entry being written to the local receiver on the source system.

system: In the context of this paper, the set of hardware and software that delivers an operational environment for one or more applications. Each system has exactly one operating system; a system may therefore be a standalone server with a single operating system, or it may be an LPAR in an LPAR environment.

system-managed access-path protection (SMAPP): An i5/OS function that allows a user to specify a goal for the maximum amount of time the system should use to recover access paths after an abnormal system end. The system automatically protects access paths so that they can be recovered within the specified time.

XSM: See cross-site mirroring.

Back cover

Choosing the correct set of data resilience technologies in the context of your overall business continuity strategy can be complex and difficult, and business continuity is an extremely broad topic. This IBM Redpaper provides insight into the topic and describes the technologies that provide improved data resilience for end users. It explains the capabilities, advantages, and limitations of the various technologies, and it provides information and techniques that you can use to select the best set of technologies available on IBM eServer i5, in conjunction with i5/OS high availability clusters, to satisfy your specific business continuity goals. This IBM Redpaper is organized so that you can study the content from cover to cover (recommended) or use specific sections for reference as needed.
It begins with a discussion of business continuity requirements, which helps you determine and prioritize your business requirements in the context of the specific problem sets of interest to you. Next, it presents an overview of the technologies related to business continuity, which helps you understand the technology categories and the choices within each category that are available to address the problem sets. Then the paper explains how the technologies apply to the business continuity requirements and how they compare with one another; a detailed analysis helps position the various technologies against your business requirements. Finally, some conclusions are drawn that map solutions to the characteristics of end-user environments.

IBM Redbooks are developed by the IBM International Technical Support Organization. Experts from IBM, Customers, and Partners from around the world create timely technical information based on realistic scenarios, with specific recommendations to help you implement IT solutions more effectively in your environment. For more information: ibm.com/redbooks