Business Continuity Management for SAP System Landscapes

Best Practice for Solution Management

Version Date: May 2008

The newest version of this Best Practice can always be obtained through the SAP Solution Manager

Table of contents

1 Introduction
  1.1 Goal of Document
  1.2 What is Business Continuity Management?
    1.2.1 What are the failure scenarios covered by Business Continuity?
    1.2.2 System Recovery and Business Recovery – The two major steps of recovery
    1.2.3 What impact can a major business disruption or disaster have?
    1.2.4 What should business continuity plans protect?
    1.2.5 The Course of Disaster Recovery
    1.2.6 The Business Continuity Plan
  1.3 Stages of the Business Continuity Lifecycle
2 Stage 1: Initiation
  2.1 Scoping Study
  2.2 Develop Project Plan
    2.2.1 Project Organization and Control Structure
    2.2.2 Responsibilities
3 Stage 2: Requirement Analysis and Strategy Definition
  3.1 Documentation of System Landscape and Business Processes
    3.1.1 Determine Core Business Processes
    3.1.2 Documentation of Business Processes
    3.1.3 Description of System Landscape and Interfaces
    3.1.4 Description of Data Flow between system components
    3.1.5 Aggregated Data Flow Between System Components
  3.2 Business Impact Analysis
    3.2.1 General Threats and Vulnerabilities
    3.2.2 Costs of Outage and Process Prioritization
    3.2.3 Component Failure Impact Analysis (CFIA)
    3.2.4 CFIA Matrix
  3.3 Risk Assessment
    3.3.1 Support requirements for Business Processes
    3.3.2 Support Requirements for Components
  3.4 The Business Continuity Strategy
  3.5 Risk Mitigation Measures
    3.5.1 Elimination of Single Points of Failure
    3.5.2 Change Management
  3.6 Determine Recovery Options
    3.6.1 Basic recovery categories
    3.6.2 Impact of Technical Recovery and Logical Recovery on Recovery Time
    3.6.3 Recovery Options per System Component
    3.6.4 Recovery Options for Business Objects
    3.6.5 Recovery Options per Process
  3.7 Define Monitoring Objectives
  3.8 Identify Resources for Recovery Mechanisms
  3.9 Agree on Recommendations
4 Stage 3: Implementation and Testing
  4.1 Establish Organization
  4.2 Develop Implementation Plans
    4.2.1 Recovery Plans
  4.3 Crisis Management
  4.4 Implement and Document Risk Reduction Measures
  4.5 Implement and Document Standby Arrangements
  4.6 Develop Recovery Plans
    4.6.1 Example: Extraction of a Recovery Plan for the Incomplete Recovery of an SAP CRM System
  4.7 Develop Recovery Procedures
    4.7.1 General Procedures for Different Contingencies
    4.7.2 Detailed Procedures for Specific Tasks
  4.8 Recovery Testing
    4.8.1 Create Test Plan
    4.8.2 Initial Testing
5 Stage 4: Operational Management
  5.1 Create Awareness
  5.2 Establish Education, Trainings and Exercises
  5.3 Establish a Continuous Review and Change Control Process
  5.4 Establish Regular Testing
  5.5 Establish Monitoring and Resolution of Findings
6 Conclusion
7 Appendix
  7.1 Template for scoping presentation
  7.2 Contents of a Business Continuity Plan

1 Introduction

1.1 Goal of Document

This Best Practice provides the SAP view on establishing a business continuity concept for an SAP environment, in the style of the ITIL (the IT Infrastructure Library, see http://www.itil.co.uk) approach for “IT Service Continuity Management”. It outlines a general procedure on how to set up a business continuity concept for an SAP system landscape and SAP business processes, including the identification of different risks and failure situations that impact continuity of operations or data consistency. This document introduces a methodology to be used in a Business Continuity Management project, with the different project phases ranging from the analysis of the business requirements to the identification of adequate risk mitigation measures and recovery plans, including necessary documentation and operating procedures.
The continuity requirements are determined working top-down from the requirements of the core business processes, to the requirements of the underlying systems and technical components. Following this methodology, customers will gain a deep insight into their SAP environment supporting core business functions. Even though customers might already have technical solutions in place to safeguard the operation of the technical system landscape, a Business Continuity Management project will yield a sound definition of the requirements and possibly come up with gaps that were not yet addressed. The approach described in this document does not stop at a technical level but also addresses possible risks on the application layer, like continuity of business processes, or consistency of data objects. Resulting from a Business Continuity Management project, a customer will have documented possible workarounds to sustain core functionality and will have created recovery plans for technical failures and application-related logical errors. The concept described in this document is intended to be followed in the early stages of a project, but, if not available, a continuity concept can also be established later on, during productive operations. Being familiar with the ITIL documentation and its general approach to IT continuity management is not a prerequisite for working with this document, but might help in valuing the different phases we will depict for realizing a business continuity project for an SAP environment. The SAP Continuity Management Service can support customers with their concepts to safeguard the continuity of business operations. This service analyzes a customer’s continuity concept, mainly focusing on technical options to protect the involved systems and application data. It helps to identify gaps and discusses options to optimize protection against business disruptions due to technical or application failures. 
The service can also assist a customer in the planning phase for a business continuity concept or with reviewing different milestones during a business continuity project. For more information on the SAP Continuity Management Service see http://service.sap.com/continuity.

1.2 What is Business Continuity Management?

“Business Continuity Management (BCM) is concerned with managing risks to ensure that at all times an organization can continue operating to, at least, a pre-determined minimum level. The BCM process involves reducing the risk to an acceptable level and planning for the recovery of business processes should a risk materialize and a disruption to the business occur.” (Source: ITIL Service Delivery, chapter 7.1.6)

The main focus of this document will be the continuity of IT services supporting the business (in ITIL terminology called ITSCM – IT Service Continuity Management), while other services that the business depends on may also be incorporated in a BCM concept. In the remainder of this document, we will not distinguish further between the terms BCM and ITSCM.

A business disruption can be caused by an application failure, a system component failure or the loss of the entire premises where the business operates. Regarding the severity of disruption, we have to distinguish between incidents and major business disruptions. According to ITIL, an incident is “any event which is not part of the standard operation of a service and which causes, or may cause, an interruption to, or a reduction in, the quality of that service”. While minor incidents can be handled by the Service Desk (ITIL “Incident Management”), a major business disruption or disaster is an incident that needs to be reported to the business continuity crisis team, because it could seriously impact the availability of one or more business processes. A major business disruption that stops a critical business process from operating may require invoking a business continuity plan.
A business continuity plan (BC plan, also called disaster plan or recovery plan) elaborates on how to operate critical business processes at a pre-determined minimum acceptable level by using an alternative process and on how to recover the affected business process or the affected components back to normal operation. The decision whether an incident will be escalated to a disaster and whether a business continuity plan will be activated is up to the business continuity crisis team. This decision will be taken depending on the time and impact of an outage and may differ depending on the business process being affected.

The main goal of BCM is to establish procedures in an organization that allow handling such major business disruptions by describing alternative procedures and possible recovery methods as well as implementing risk reduction measures and recovery technologies. However, BCM does not end with the creation of disaster recovery plans. There is a whole lifecycle to BCM, which needs to be established with a business continuity project. Once business continuity procedures have been established, BCM needs to ensure that the continuity plans become part of change control management. The plans need to be updated whenever changes are applied to business processes, be it changes to IT or changes in business operation. BCM must also introduce education and awareness for business continuity throughout the organization and has to establish regular testing to ensure operability of the described procedures. Each stage of the business continuity lifecycle, which is further outlined in section 1.3, will be discussed in a separate chapter of this document.

1.2.1 What are the failure scenarios covered by Business Continuity?

In general, BCM needs to address any type of scenario that prevents a company from operating its critical business processes.
There are three main categories of failure scenarios:

Technical failure or disaster: This can range from crashes of individual hardware components to building fires or flooding of an entire computer center. Technical failures affect all business processes that are using the affected component(s).

Logical failure: Faulty software or incorrect use of software may corrupt data and provoke data inconsistencies that cause a disruption to business processes. If, for instance, a malicious program deployed in an ERP system corrupts master and transactional data, executing an order-to-cash process may become impossible. Misuse of inventory management software may also bring production to a standstill, because necessary goods were not reordered on time at the factory. A logical failure may also be the result of resolving a technical failure: Point-in-time recovery or data loss in one system of a federated system landscape will result in data inconsistencies between the systems that need to be addressed before resuming regular operations. As these examples demonstrate, logical errors or data inconsistencies can have two dimensions:
- inconsistent data within one system (for example, order data was accidentally deleted from the database)
- inconsistent data between two systems of a system landscape (for example, orders are not consistent between an ERP and a CRM system)

Logistical or operational failure (not in scope of this document): Apart from IT processes, business operation depends on many operational or logistical aspects. Required staff need to be available and facilities need to be accessible. Emergency plans for logistical aspects need to make sure that equipment, meeting places and workspaces can be made available in case of a disaster.

Since this document focuses on IT Service Continuity Management, logistical or operational aspects falling into the third category will not be covered in the remainder of this document.
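The two dimensions of logical inconsistency described above can be illustrated with a small sketch. It is not an SAP tool or API: the `Order` record, the item-to-order mapping and both function names are hypothetical and chosen only for illustration. The first check looks for inconsistency inside one system (items whose order header was deleted), the second compares the same orders between two systems.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Order:
    """Hypothetical, simplified sales order record (not an SAP data structure)."""
    order_id: str
    customer: str
    amount_cents: int


def orphaned_items(orders, item_to_order):
    """Dimension 1, inside one system: items whose order header is missing.

    item_to_order maps an item id to the id of the order it belongs to.
    """
    known_ids = {order.order_id for order in orders}
    return sorted(
        item_id
        for item_id, order_id in item_to_order.items()
        if order_id not in known_ids
    )


def compare_systems(erp_orders, crm_orders):
    """Dimension 2, between two systems: orders that are missing or differ."""
    erp = {order.order_id: order for order in erp_orders}
    crm = {order.order_id: order for order in crm_orders}
    shared = erp.keys() & crm.keys()
    return {
        "missing_in_erp": sorted(crm.keys() - erp.keys()),
        "missing_in_crm": sorted(erp.keys() - crm.keys()),
        "different": sorted(oid for oid in shared if erp[oid] != crm[oid]),
    }


if __name__ == "__main__":
    # Illustrative data only.
    erp = [Order("4711", "ACME", 1000), Order("4712", "Globex", 2500)]
    crm = [
        Order("4711", "ACME", 1000),
        Order("4712", "Globex", 2000),  # amount differs from ERP
        Order("4713", "Initech", 500),  # exists only in CRM
    ]
    items = {"4711-10": "4711", "4714-10": "4714"}  # order 4714 has no header

    print(orphaned_items(erp, items))
    print(compare_systems(erp, crm))
```

In a real landscape such comparisons would be run by application-specific consistency check reports; the sketch only shows why the two dimensions need different checks.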
1.2.2 System Recovery and Business Recovery – The two major steps of recovery

Usually, if a business process is unavailable, the availability of all involved systems needs to be checked first. If a system is unavailable, system recovery or technical recovery has to reestablish technical availability of the system as a first step. This can be done, for example, by exchanging a defective hardware component, by activating a standby system or by restoring a database from a backup. In most cases, resolving the technical error will immediately return the systems and processes to regular operation. However, if the method used to resolve a technical error results in data loss for a system component (for example when performing a point-in-time (incomplete) database recovery or when activating an asynchronous standby solution), system recovery leaves a state that requires further analysis of data consistency between the systems of a system landscape.

Business recovery or logical recovery is always required in case of logical errors or data inconsistencies appearing either inside one system or between systems of a system landscape. With inconsistent or outdated data, wrong business decisions might be made, or inconsistencies in the system may lead to unacceptable situations such as an ERP system sending invoices without the materials having been delivered to customers. As described above, logical errors or inconsistencies can be the remains of a technical recovery procedure but can also be the cause of a disaster in their own right. In the latter case, usually only a subset of business processes is affected because all technical components are available. If a business process is unavailable due to a logical error (data corruption), the logical error needs to be repaired. A BC plan should describe different ways to address data corruptions, for example by extracting the correct data from some specially provided analysis system.
Repairing logical errors usually requires in-depth application knowledge. Sometimes a technical measure is considered for resolving logical errors: recovering the affected system to the point in time before the error was introduced by a user error or faulty program (database restore followed by a point-in-time recovery). This procedure can indeed remove the logical error from the system but, as we have seen above, due to the data loss, it introduces a new kind of logical error affecting data consistency between the systems of a federated landscape. Resolution of such inconsistencies again requires business recovery, now between multiple systems.

Note: The main challenge of this document will be to distinguish both types of errors, technical and logical errors, since a business continuity plan needs to address both levels: system recovery and business recovery.

1.2.3 What impact can a major business disruption or disaster have?

A major disruption of a business process causes the process to be unavailable for a certain amount of time. To minimize the downtime, BC plans are developed. Various technical solutions are available to reduce the impact of a technical failure according to the required service level and budget, for example relocating systems to another facility due to a fire or flooding. In case of a logical error, such technical measures are mostly useless. The duration of the required recovery steps can be quite unpredictable. Preparing for this scenario, by having detailed documentation of business processes at hand and by providing a general approach to addressing such situations, will help to reduce the time for error resolution and reverting to normal operation of the business process.

1.2.4 What should business continuity plans protect?

The BC plan should protect a company against the unavailability of its core business processes for an unacceptable period of time due to the loss of key resources of the company.
Key resources can be personnel, computer system components and software, but also power supply, or other technical facilities such as parts of the premises of the company itself.

The business continuity project has to evaluate which business processes need protection against a business disruption. Depending on the importance of a process, it has to establish methods to recover the process in case of a contingency. Critical core processes requiring immediate recovery can be protected by high availability solutions for their critical system components and by alternative implementations of the respective business process, to ensure operation after a disruption at a minimum level acceptable to the business. Less critical processes that can be unavailable for several hours or days without a major impact on business can be sufficiently covered by recovery plans that use fewer resources than the recovery plans for processes with immediate severe impact.

Since the list of possible disasters or disturbances is practically endless, business continuity plans should not be based on special scenarios like a fire in building X. They are created on the assumption that some key resources are lost or unavailable, yielding useful plans that apply to several scenarios and not only to a single scenario. Instead of preparing for very specific error scenarios, it is more important to clearly understand and document all vital business functions in order to keep the business running regardless of the specific nature of a disaster.

1.2.5 The Course of Disaster Recovery

Using an example situation, this section describes the different phases that are passed through in the course of a disaster.

Phase 1: “Incident Management”
A business disruption is detected by end users who trigger an incident at the supporting organization. The incident is analyzed and assessed as to whether it can be resolved within a certain time span.
In this example, three independent end users report that the CRM system is unavailable. The application management organization checks through the monitoring cockpit of the CRM system and finds that the application servers are running but the database seems to be unavailable. Assume this makes the core business process sales order management unavailable for 500 users. Application management sends an email informing the users that the CRM system is currently unavailable, to stop end users from calling for help concerning the CRM system.

Phase 2: “Crisis team decides on invocation of business continuity plans”
Now application management will try to identify the problem with the database server. As this incident is classified as a major business disruption, the crisis team for the sales order process is informed. A deadline of 30 minutes is set, after which the business continuity plan is invoked if the business process, and with it the CRM system, is still unavailable.

Phase 3: “Invocation of the BC plan”
Application management was unable to restore the database server to normal operation within 30 minutes. However, the cause of the problem was identified: a malicious network driver that was installed recently corrupted all data in the database. Since the initial deadline of 30 minutes for error resolution was exceeded, the situation is escalated and the business continuity plan is activated.

Phase 4: “Alternative process implementation”
The now operative business recovery team instructs end users to use the ERP system instead to enter sales orders. As not all functionality that is usually available in CRM is available in ERP, the sales order volume entered is at 50% of the normal volume when using CRM.

Phase 5: “Recovery team executes recovery plans”
The recovery team identifies that the database server as a key resource is completely unavailable. No partial recovery is possible.
In this case, the recovery plan advocates a point-in-time recovery of the database as a system recovery step. The inconsistencies produced by this incomplete recovery of the database need to be dealt with in a subsequent business recovery step. After making the database available on system level, the inconsistencies between the CRM and ERP databases must be repaired. This step is executed by the application experts that are part of the recovery team. Since the affected objects and necessary activities are documented in the business continuity plan, the team can immediately start with this extensive work.

Phase 6: “Steps prior to normal operation”
When consistency between the systems is reestablished to a sufficient degree, the recovery team runs consistency check reports to verify that the CRM system is ready to revert to normal operation. Functional checks are run to ensure correct operation of the business processes.

Phase 7: “End users start using normal business processes”
Now that recovery has completed and the tests were successful, the end users are instructed to revert to normal operation using the CRM system. Data that was created using the alternative process needs to be fed back into the CRM system.

Phase 8: “Lessons-learned”
As a follow-up to the BC plan invocation, the situation leading to the error and the course of the overall recovery procedure are analyzed to identify possible deficiencies in error protection and recovery handling. The lessons learned from this case are incorporated into the BC plan to improve the business continuity concept further.

The following figure provides an overview of the different steps of a disaster recovery procedure to be established by a business continuity plan:

Figure 1: Steps of a disaster recovery procedure

1.2.6 The Business Continuity Plan

Basically, a business continuity plan must provide answers to the following ‘management’ questions:
1. Which risks am I facing?
2. Which precautions can be taken?
3. How do I proceed in case of a contingency?
4. Will the plan work?

These questions will be answered by the following elements of business continuity planning:
1. Which risks am I facing?
   Risk and Impact Analysis
2. Which precautions can be taken?
   Risk Mitigation and Recovery Options
3. How do I proceed in case of a contingency?
   Recovery Procedures & Priorities
4. Will the plan work?
   Continuous Testing & Change Management

1.3 Stages of the Business Continuity Lifecycle

In order to establish a business continuity plan, it is necessary to run a business continuity project (BC project). According to the ITIL standard, this project can be split into four main stages as outlined in the following chart.

Figure 2: Stages of a Business Continuity Project
Source: Office of Government Commerce (OGC) www.itil.co.uk

Each of the following chapters of this document describes one stage of a BC project. At the beginning of each chapter, a short summary table enumerates the personnel needed in the respective project stage and lists the main deliverables of this stage. In a project plan for a BC project, each of these stages breaks down into a number of phases and activities. The general course of a BC project is shown in Table 1. We will be following this structure throughout this document and describe each of those different phases in more detail.
Table 1: General course of a business continuity project

Task Name | Section | Duration
Stage 1 – Initiation | 2 |
  Scoping Study | 2.1 |
    Define and set strategy | |
  Develop Project Plan | 2.2 |
    Define project phases | |
    Define the project organization | 2.3 |
    Define project control structure | 2.3 |
    Identify initial costs | |
Stage 2 – Requirements Analysis and Strategy Definition | 3 |
  Requirement and Impact Analysis | |
  Documentation of System Landscape, Business Processes and Data Exchange | 3.1 |
    Identify critical core business processes to include in BC plan | 3.1.1 |
    Document core business processes | 3.1.2 |
    Document system landscape | 3.1.3 |
    Document interfaces | 3.1.3 |
    Document data flow for business processes | 3.1.4/5 |
  Business Impact Analysis | 3.2 |
    Identify general threats and vulnerabilities | 3.2.1 |
    Identify costs of outage and prioritize business processes | 3.2.2 |
    Conduct component failure impact analysis (CFIA) | 3.2.3 |
    Collect existing workarounds | 3.2.3 |
    Create CFIA matrix | 3.2.4 |
  Risk Assessment | 3.3 |
    Determine required service levels (support requirements) for business processes | 3.3.1 |
    Derive required service levels for components | 3.3.2 |
  Business Continuity Strategy | 3.4 |
  Risk Mitigation Measures | 3.5 |
    Elimination of critical single points of failure | 3.5.1 |
    Thorough change management | 3.5.2 |
  Determine Recovery Options | 3.6 |
    Guiding principles | 3.6.1/2 |
    Technical solutions / standby arrangements for system components | 3.6.3 |
    Procedure / correction tools for business objects | 3.6.4 |
    Possible new workarounds for business processes | 3.6.5 |
  Define Monitoring Objectives | 3.7 |
  Project management | |
    Identify required people and resources | 3.8 |
    Agree on recommendations | 3.9 |
Stage 3 – Implementation | 4 |
  Establish Organization (DR team) | 4.1 |
  Develop detailed implementation plans | 4.2 |
  Crisis management | 4.3 |
    Activation procedure for BC plan -- Damage assessment and decision making | |
    Roles and responsibilities for BC | |
  Implement and document risk mitigation measures | 4.4 |
  Implement and document stand-by arrangements | 4.5 |
    Technical measures and corresponding procedures | |
    Business process workarounds including prerequisites and procedures | |
  Create recovery plan(s) | 4.6/7 |
    Document recovery options and chosen solutions | |
    Document solutions and procedures | |
    Develop individual procedures for systems, processes and business objects | |
    Create master plan summarizing detailed procedures | |
  Recovery Testing | 4.8 |
    Create test plans | 4.8.1 |
    Perform initial testing | 4.8.2 |
Stage 4 – Operational Management | 5 |
  Create awareness | 5.1 |
  Establish education, training and exercises | 5.2 |
  Establish a continuous review and change control process | 5.3 |
    Ongoing risk evaluation and risk assessment | |
  Establish regular testing | 5.4 |
  Establish Monitoring and Resolution of Findings | 5.5 |
    Error prevention and error detection | |
    Clearing of inconsistencies | |

2 Stage 1: Initiation

Roles:
- Senior IT management
- BCM Project Manager
- Business Process Champions
- Recovery Expert from Application Management/Business Process Operations (ability to translate business recovery requirements (application) into technical requirements and specifications)

Output:
- Scoping study
- BC project plan
- Initial costs
- Project organization and control structure

To install a successful business continuity concept within an organization, it is essential to establish awareness and commitment from senior management. The concept has to be fully endorsed to obtain the acceptance and commitment of management and staff. Business Continuity depends on the commitment at all levels in the organization and on a definition of their responsibilities. Management (such as the business process champion or the program management office) needs to continually monitor and prioritize business continuity activities against operational activities. The overall aim is a state in which management considers business continuity in relation to, and even prior to, making key business decisions. This allows a balanced assessment of the risks to be considered in the decision making process.
An awareness of the need for business continuity planning may be generated from:
- The range of risks for the organization
- The potential business impact that could result from the realization of the risks
- The probability of each of the risks
- Personal responsibilities and liabilities
- External pressures

The best and most effective way to raise senior management awareness is to highlight potential risks and business impact facing an organization in terms of business failure to meet key performance indicators or corporate objectives.

As with most IT issues, business continuity crosses organizational boundaries and consumes management time and financial resources. Sponsorship at the highest level and integration into the IT structure is paramount to the success of a business continuity project. Without this level of sponsorship, risks to business continuity include:
- Misalignment with the business and IT strategies, thereby failing to address the true values and business risks as perceived by senior management
- Lack of momentum, profile or resources
- Lack of extensive co-operation and input required from management at all levels

For business continuity to be successful within an organization, a suitable organizational structure needs to be implemented. The roles should be integrated into the existing suite of IT management responsibilities, like the responsibilities and roles defined by SAP’s E2E Solution Operations standard.
The optimum management structure will:
- Allow responsibilities for ongoing business continuity to be clearly defined and allocated
- Integrate into existing organizational structures, hierarchies and responsibilities
- Allocate responsibilities to functions or individuals that have the necessary presence, credibility, skills, knowledge and expertise within the IT organization
- Ensure that the organizational structure that manages business continuity during day-to-day operation closely resembles the structure that will execute the recovery mechanisms in case of a disaster
- Ensure the business continuity strategy and requirements are integrated with the business and IT strategies

Tasks of the initiation stage of a business continuity project:
- Conducting a scoping study
- Establishing a business continuity project plan, project structure and procedures
- Identifying critical business processes
- Establishing the business continuity project team and business continuity responsibilities

2.1 Scoping Study

The scoping study is the initial task of the initiation stage; its purpose is to bring the risks and impacts to management's attention. The scoping report is used to raise awareness of the need for business continuity, to identify the business benefits, to generate management commitment and to act as the starting point for more detailed project plans (stage 1) and business impact plans (stage 2). A scoping study should describe the impact that some conceivable disaster cases would have on one (or more) of the most important business processes. It should provide an idea as to how a disaster plan could mitigate this impact and compare the resulting costs with and without a recovery plan. A template for the scoping study is outlined in appendix 7.1.

2.2 Develop Project Plan

After the initial awareness is raised and the permanent commitment of senior management is established, a project plan for the business continuity concept is created, including project structure and procedures.
Table 1, which depicts the general course of a business continuity project, can also be used as a template for a business continuity project plan. This template needs to be completed by filling in the estimated duration of the different project steps. Subsequently, the initial costs of the project need to be determined. To get a more exact estimation of the duration of individual steps in the project plan as well as of the business and IT areas to be involved in the project, it is helpful to already have an idea of the core business processes that are to be covered in the BC plan (also see section 3.1.1.1 which finally sets the scope of business processes to be included in the BC project).

2.2.1 Project Organization and Control Structure

After the BC project is approved, the project team is staffed and introduced. Initial briefing sessions for the project team, stakeholders and the business areas are always worthwhile to raise awareness, prompt support and manage expectations. Ideally, these sessions develop into a campaign with defined methods of communication. Regular feedback to participants demonstrates the progress achieved as a result of their actions and contributions. Table 2 presents the methods of communication which need to be established in the BC project.
Table 2: Methods of communication

Method of Communication | Goal
Regular briefings to staff on the emergency procedures and guidelines | Readiness of staff for business continuity action plans
Regular desktop walkthroughs of the recovery plans | Familiarity of staff with business continuity plans
Regular articles in newsletters, notice boards, corporate intranet to maintain the profile of Business Continuity | Awareness of business continuity plans
Inclusion of an overview of the organization’s business, business processes and Business Continuity mechanisms in the staff induction process | Knowledge of staff regarding business continuity plans
Regular progress reports to the Board and regular agenda items on other management and IT committees | Update of Board members regarding business continuity plans

The following figure outlines the typical organizational structure for large organizations that supports both ongoing management and invocation of business continuity procedures, during the project and after its completion.

Figure 3: Org chart of the business continuity project
Source: Office of Government Commerce (OGC) www.itil.co.uk

Management sponsorship at board level is often executed by a person whose responsibilities encompass most of the organization (for example IT). Day-to-day responsibility for business continuity is often assigned to a senior manager, who advises the board on a business continuity strategy and ensures that it is in line with business and IT strategies. Business continuity management, together with its team of business area managers (business process champions) and IT service continuity managers (application management), supervises change control, testing, auditing, awareness, and training. Steering committees at senior management level co-ordinate business continuity activities across the organization and support the business continuity manager.
The steering committee should meet regularly to confirm that the business continuity strategy is still valid, to discuss changes that could affect the strategy, and to review programs and procedures.

Management, such as application management within the IT organization, is typically given ownership of the deliverables that relate to their area of expertise or responsibility. Ownership involves responsibility not only for ensuring that deliverables are met, but also for ensuring that they remain up to date and fit for purpose, as application management also owns the change control management standard.

Invocation of continuity mechanisms and recovery options is usually undertaken by one or several business continuity teams, focused on specific areas of the IT organization (for example, external communications, local area networks, servers, and so on). During periods of operational stability, the service continuity teams play a vital role in the implementation, testing, maintenance and support of these continuity or recovery procedures and plans. IT may establish a working group that typically fills key roles in the IT recovery process as well as operational management roles to deal with continuity and availability management issues.

2.2.2 Responsibilities

Table 3 outlines the typical responsibilities for business continuity during times of normal operations, as well as crisis operations. These layers of responsibility also correlate with the typical management structure for business continuity (see org. chart in previous section).

Table 3: Responsibilities

Level | Task
Board | Initiate and Sponsor Business Continuity; Set Strategy and Framework; Allocate Management Resources; Handle external communication
Senior Management | Direct Business Continuity; Define Scope; Create and Maintain Awareness; Co-ordinate Cross-Organizational Responsibilities and Procedures
Management (Business process champion/Program Management Office) | Analyze Business Continuity; Define Requirements and Deliverables; Manage Contracts; Supervise Projects and Operations
Supervisor and Staff (Application Management/Business process champion) | Develop and operate Business Continuity; Develop Requirements; Negotiate Contracts; Operate Procedures

Responsibilities should be clearly defined, communicated to management, and documented in appropriate role and job descriptions. To ensure continual management of business continuity at an operational level, it is recommended to incorporate specific deliverables into individual staff objectives and responsibilities. Following a disruption to the normal operating environment, management responsibilities change in line with command, control and operational roles and responsibilities. These include the responsibility for taking action to invoke the continuity plan.

3 Stage 2: Requirement Analysis and Strategy Definition

Roles
- BCM Project Manager
- Recovery Expert from Application Management/Business Process Operations (ability to translate business recovery requirements (application) into technical requirements and specifications)
- Business Process Champions
- Key Users

Output
- Scope of BC plan
- Documentation of processes, system landscape, interfaces and data exchange
- Risk analysis
- Impact analysis / CFIA matrix
- Service level requirements for processes and components
- Recommended risk mitigation measures
- Recommended recovery options
- Monitoring objectives
- Resource requirements
- Agreed BC strategy

This stage provides the foundation to determine how well an organization will be able to handle a business process interruption or disaster.
As a result, risks to the business operation and their impact will be well understood, which forms the basis for planning countermeasures as well as emergency and recovery procedures. Stage 2 consists of two main tasks:
- Requirement and impact analysis, which identifies threats to continuity of services and business processes, assesses the severity of these risks and defines the requirements as service levels
- Design of the business continuity strategy, which identifies possibilities to reduce the above risks and determines the options to support a recovery

The following figure visualizes the main phases and the main tasks to be included in stage 2:

Figure 4: The main phases and tasks of Stage 2 - Requirements analysis and strategy definition

Top-down Analysis / Bottom-up Recovery

The intent of BCM is to secure availability of business processes. Business processes require different applications and systems. These systems require a technical infrastructure going down to the electrical power supply. Therefore, to analyze the requirements for business continuity, a top-down approach is used to determine which applications, systems, system components, hardware, infrastructure components and services are necessary to keep the critical business processes running. On the other hand, if a critical incident or disaster occurs, recovery of the process functionality starts bottom-up, beginning with the recovery of technical infrastructure and system components. Similarly, protection against failures (risk mitigation) starts with eliminating critical points of failure on infrastructure and hardware level up to system components.

Figure 5: BCM dependencies

Requirement and Impact Analysis Phase (Sections 3.1 to 3.3)

During the impact analysis phase, threats to the continuity of business operations are collected and evaluated.
In a first step, the continuity project needs to identify the important core business processes that are vital for the company’s business and that shall be included in the BC plan. When the core business processes are named, these processes and the system landscape supporting them need to be documented, including the interfaces and data objects being used by these processes. This documentation is not only required as a further input for the BC project, but will also provide helpful information during a potential later execution of a BC plan. Therefore, this documentation will constitute an important part of the BC plan. The next step will identify risks and their impact on business. An estimation of the costs caused by an outage will allow a prioritization of business processes. The Component Failure Impact Analysis will then have a closer look at the components that are needed to operate the business processes and will determine their criticality. Based on this information, risk assessment can be conducted and the required availability demands (service levels) can be determined for business processes and for the components of the system landscape. Workarounds that may be available to substitute a failing business process play an important role in assessing acceptable outage times. The determined service levels will provide the input for the next phase, the design of a continuity strategy. When this phase is finished, detailed documentation and recommendations for the business continuity approach will be available. 
The following aspects will be defined and documented for each critical business process:
- The staffing, skills, and services necessary to enable critical business processes to continue operating at an acceptable service level for a limited time
- The time within which the critical business processes should be recovered to fully operational level
- Different contingency cases, that is, which systems are most likely to fail and which business objects are highly critical, for example, because their failure would affect a large number of business processes

Design of the Business Continuity Strategy Phase (Sections 3.4 to 3.6)

Business continuity can be achieved by risk mitigation that tries to avoid a failure or by providing recovery options that allow a timely recovery from a failure. Based on the criticality of business processes or system components, this phase will identify adequate measures to reduce the risk of business interruption due to the occurrence of possible failures as identified above. Risk mitigation will always have to find a balance between the costs of the measures to be taken and the potential damage they can prevent. The measures being taken and the procedures to activate them will be part of the BC plan.

For risks or failure scenarios that will not (due to the costs being too high) or cannot (since no protection is available) be covered by risk mitigation measures, possible recovery options and recovery strategies should be identified and then worked out later in the BC plan. Recovery options are divided into:
- Measures to recover from technical failures
- Procedures and tools to recover from logical failures or data inconsistencies
- Workarounds that can reduce the criticality of a business process disruption by providing substitute operations with usually reduced volume and functionality
As a final step of stage 2, recommendations for risk reduction measures and recovery options need to be agreed with all involved parties, especially from a management and cost perspective, before the implementation stage and the creation of the BC plan can be started.

The next sections will describe these different phases in more detail and provide examples for their output using some hypothetical business processes and a corresponding system landscape.

3.1 Documentation of System Landscape and Business Processes

It is vital to have a good understanding and documentation of the core business processes and the system landscape. If, for example, a core business process is not operable due to corrupted business data, resolution of this problem will be considerably faster if the process, the data objects it relies on and the interfaces it uses are clearly documented. Conversely, if for example one system became unavailable, this documentation would readily show which business processes would be affected. Although an overview of the system landscape is often considered the first step of the analysis, we start with a description of the business processes, since this provides the input for documenting the important system components of a system landscape: no component used by the identified core processes is forgotten, and no component that these core processes do not use is included.

3.1.1 Determine Core Business Processes

Together with the application management, identify critical business processes that need to be covered in a BC plan. All processes should be prioritized. You may want to start with only a single core process as “proof-of-concept” and extend the concept later on or you may start with a list of critical processes right from the beginning. This list should be kept as small as possible to restrict the amount of work and the extent of the BC plan.
The process list is approved by the steering committee and defines the scope of the BC plan.

Example: The following gives an example list of business processes that we will be using during the course of this document:
- Opportunity/Sales order process
- Customer service process
- Marketing process
- Reporting process

3.1.2 Documentation of Business Processes

A good format to document a business process is the swimlane representation (see example below). It quickly shows which business objects are touched by the process and which business objects are exchanged across system boundaries. The description serves as a basis to analyze the effects of an incident or disaster case as well as of the scenario ‘incomplete recovery’. While for logical errors occurring inside a single system, all process steps accessing data objects may be of interest, for data inconsistencies that are, for example, caused by an incomplete recovery of one system, only process steps that cause data exchange between systems are relevant (since only data exchanged between systems may become inconsistent). The latter information will mainly flow into the documentation of the data flow described in sections 3.1.4 and 3.1.5.

As a result, this step should identify the stages in the process flow that are critical for data consistency within and between the systems, and should pinpoint potential problem areas in the case of data loss or inconsistencies. In this regard, the most important information to be collected with the help of the business process descriptions consists of the process steps during which data of business objects are saved and the communication steps during which a data exchange is triggered between the systems in the system landscape.

Example: Opportunity and sales order processing

Sales orders are maintained in SAP CRM and are uploaded to SAP ERP.
In the sales order process, a call center agent usually starts with an opportunity and calls the respective customer. The agent discusses a potential order with the customer in detail, based on the customer fact sheet. To discuss the order, the agent also accesses the product catalog and the product configuration. If the customer decides to order, the opportunity is copied into a sales order. The sales order is replicated to SAP ERP, where it is automatically processed.

The business process should be documented in an easily recognizable form for later reference, as in Figure 6. SAP recommends using the SAP Solution Manager to document your processes at a central point.

Figure 6: Example Sales Order Process

3.1.3 Description of System Landscape and Interfaces

Looking at the documentation of the core business processes created in the previous step, all system components being used by these processes can be identified. With this information, an overview of the system landscape and the connections (interfaces) between the systems can be created.

Example: Figure 7 shows an example production landscape of the SAP Business Suite implementation of SAP CRM using ERP, CRM and BI. In addition to the SAP applications, a typical infrastructure often includes several other SAP and non-SAP systems. The system landscape consists of several systems that exchange data with each other via different interfaces. Therefore, it is important to also document the type of interfaces. In an environment such as this, it is important that the business data that moves between the systems is consistent and up to date. SAP recommends using the SAP Solution Manager to document the system landscape and interfaces for the supported business processes of the scenario in question.
Figure 7: Example System Landscape and Interfaces

3.1.4 Description of Data Flow between system components

To enable the development of disaster recovery procedures that follow an error and a potential partial loss of business data, it is necessary to identify possible sources of data inconsistency. A prerequisite for this is a description of the business object data flow for each core business process. This should also include the flow of master data that supports the process between the system components. The leading system for each business object should be identified. Some objects, however, are maintained in more than one system, in which case no single leading system can be defined. It is worth noting that maintenance in two systems may cause inconsistencies if no mechanism like cross system locking is used, and could thus later disrupt business processes.

Having outlined the business object data flow for each core business process and the objects that might be affected by inconsistencies, a BC plan can work out and describe actions to address the resolution of possible inconsistencies.

Example: Business object flow between CRM and ERP for Opportunity/Sales order process

The business objects involved in opportunity and sales order processing are the business partners, products, product catalog, price conditions, business partner hierarchies and the sales orders themselves. The following graph shows the data flow of these objects between the SAP CRM and the SAP ERP system.
Figure 8: Sales Order Process Data Flow

Master data flow:
- Business partners are created and maintained in CRM and ERP and replicated between both
- Business partner hierarchies, products and pricing conditions are created and maintained in the ERP system and are replicated to CRM
- The product catalog is maintained only in the CRM system

Transactional data flow:
- Opportunities are maintained only in CRM
- Sales orders are created and maintained in CRM and are replicated to ERP

3.1.5 Aggregated Data Flow Between System Components

When the business object data flow for all business processes is aggregated for the entire system landscape, you obtain an overview of the overall data flow. This overview makes it easy to recognize, for example, the impact of an incomplete recovery that resets one system component to a previous state, such as one hour in the past.

Example: The following figure shows the aggregated business object data flow chart for the entire system landscape resulting from section 3.1.4. The data flow to and from SAP BW is incomplete since this was not part of the example above. If, for instance, a severe error of the database forced the IT department to perform a point-in-time recovery of the ERP system, the aggregated business object data flow chart would show for which business objects this would cause data inconsistencies between the ERP system and the rest of the system components in the landscape (because the other systems had not been set back in time). Looking at the relationship to CRM, data consistency for business partners, business partner hierarchies, pricing conditions, products and sales orders would need to be checked and reestablished.

Figure 9: Aggregated Data Flow

3.2 Business Impact Analysis

Your business processes work as long as all business objects are valid and consistent and all supporting system components (software and hardware) are available.
In the business impact analysis, you identify the risks that endanger business continuity and the impact they will have on your core business processes. In a first step, we look at threats that endanger business operations as a whole, for example, physical disasters. The next step will analyze the impact that a complete service disruption would have on your organization, quantifying and qualifying the losses this may cause. This analysis will also lead to a prioritization of the business processes.

However, in most cases, failures do not affect the complete business operation. Failures of single system components may only leave some business processes or even only parts of business processes inoperable. Workarounds may be available to replace such failing parts, so a business process can be continued, even with limited functionality and often with limited transaction volume. The same applies to logical failures like data inconsistencies which may only partly affect business processes. In a subsequent step, the Component Failure Impact Analysis (CFIA) will analyze the business processes and their reliance on technical components and data objects in detail.

3.2.1 General Threats and Vulnerabilities

To prevent a comprehensive disruption of business operations through a failure of IT services, you need to identify the general threats that your company or the IT campus is exposed to.
Just to name a few examples, these threats can:
- Be of a regional or local nature (rivers with a higher risk of flooding, earthquakes, nearby airport increasing the risk of plane crashes)
- Impose a higher risk of being subject to malicious attacks, for example due to a special company profile
- Lie in accidents resulting from the company’s business or nearby production facilities (for example, explosions or fires)
- Be a failure of service providers (for example, if IT is outsourced)

For a reasonable assessment of the risks, the likelihood of occurrence should be rated for each applicable threat. Risk mitigation in a later stage of the BC project will have to identify appropriate countermeasures and options to reduce these risks or the impact they would have.

An additional method to identify risks is the Service Outage Analysis, which analyzes incidents that led to service disruptions in the past and how these were handled (successfully or less successfully). This provides an insight into threats that may still be imminent.

3.2.2 Costs of Outage and Process Prioritization

The result of an interruption of the business processes has to be measured in terms of quantifiable and qualifiable losses. The business impact can be expressed as lost income, additional cost and damaged reputation. The impact should also be distinguished depending on the duration of the unavailability of a process because costs and impact will escalate over time. While some processes like a reporting process may have nearly no impact if they are inoperative for several days, other processes like order entry and order processing may have an immediate impact if they are inoperable.
So business impact analysis has to identify:
- The potential loss that may be caused to the organization because of an interruption of critical business processes
- The form that the loss may take: lower income, higher costs, damaged reputation, immediate and long-term loss of market share, loss of goodwill, loss of competitive advantage, and so on
- The degree to which the loss is likely to escalate if not addressed in a timely manner

Since it is often difficult to assess losses in the amount of money lost per day or week, it may be easier to assess the impact on a scale from 1 (very low impact) to 10 (crucial impact that might jeopardize your company as a whole). Make a list of your critical core business processes and evaluate the impact if a process is not operative for a specific period of time.

During this phase, you should also collect and document any existing service level agreements, like Recovery Time Objectives (RTO), for these processes. The RTO should be reviewed with the business owners at a later stage, once all surrounding facts are understood (if for example a good workaround is available for normal process operation, the RTO need not be very small, as we will see in section 3.3.1).

Using the information obtained in this step, the business processes can be prioritized according to their criticality and costs of outage. The criticality of a process will be an important criterion for the preventive measures to protect against contingencies that will be planned at a later phase.
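As a rough illustration of how such ratings could be turned into a ranking, the following Python sketch orders processes by their impact scores. The scores are the ones used in the Table 4 example; the comparison rule (an earlier impact outweighs a later one) and all names are assumptions made for this sketch and are not prescribed by this Best Practice. In practice, the final priority remains a business decision.

```python
# Impact ratings (1 = very low, 10 = crucial) after 4 hours, 1 day,
# 2 days and 1 week of outage. Values taken from the Table 4 example.
IMPACT = {
    "Marketing process": (2, 5, 6, 8),
    "Sales order process": (8, 10, 10, 10),
    "Reporting process": (1, 1, 3, 6),
    "Customer service process": (4, 8, 10, 10),
}


def prioritize(impact):
    """Return process names, most critical first.

    Assumption of this sketch: rating tuples are compared left to right,
    so a high impact after a short outage outweighs later ratings.
    """
    return sorted(impact, key=lambda name: impact[name], reverse=True)


ranking = prioritize(IMPACT)
for position, name in enumerate(ranking, start=1):
    print(position, name, IMPACT[name])
```

The resulting order is only a starting point for the discussion with the business owners, who may adjust it, for example, when a good workaround exists.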
Example: The following table provides a criticality rating for our example processes:

Table 4: Criticality of processes

Process | Impact after 4 hours | Impact after 1 day | Impact after 2 days | Impact after 1 week | Recovery Time Objective | Priority
Marketing process | 2 | 5 | 6 | 8 | | Critical
Sales order process | 8 | 10 | 10 | 10 | 2 hours | Highly critical
Reporting process | 1 | 1 | 3 | 6 | | Less critical
Customer Service Process | 4 | 8 | 10 | 10 | | Highly critical

After evaluating your core business processes as outlined in the above example, you can distinguish between processes that create an instant negative impact, like the sales order process above, and processes that only yield a negative impact after a considerable amount of time, like the reporting process.

3.2.3 Component Failure Impact Analysis (CFIA)

In addition to protection against the wide-spread failure scenarios considered above, continuity of business processes depends on the availability of all involved hardware and software components of the underlying IT system landscape and IT business applications. Therefore, the next analysis step has to pinpoint the most critical hardware and software components supporting the identified business processes. These components will be a focus of the business continuity concept to be established.

In order to get a general picture of the criticality of all components of the system landscape, the criticality needs to be collected process by process. For each process, describe the criticality of a failure of each component involved in the process, rating the impact and describing possible workarounds that are already in place or that are conceivable. Not all components may be equally critical for a process, because a process may still be operable without a component. For example, the ATP check during order entry might be left out if the SCM system is unavailable or, as a workaround, the ATP check could be done in the ERP system.
If a workaround is available, the description should also state what percentage of the original processing volume can be covered using this workaround, after what time this workaround is usually activated and for what period of time this workaround would be sufficient.

The information for this phase must be provided by the different business areas owning the business process. To collect this information, each involved business area can be provided with a template to fill in (see example below). The template lists all components that were collected during the previous analysis steps (section 3.1). Besides system components, this can include infrastructure components and external systems that a process relies on. To cover the aspect of data consistency, we also include the data objects in a separate section of this table.

Example: The following table provides an example of useful information that has to be provided by the business process owners for all processes in scope of the BC project.

The task of each component should be described briefly and the impact of a component failure should be considered. The impact would be less critical if a workaround is available, which should also be noted in the table. The criticality depends on the overall importance of the business process, the impact a failure has and the possibility of replacing normal procedures by a workaround. The likelihood of failure and possible countermeasures can only be filled in by the business process owners if they have specific experience with this process; in general these will be completed on component level in a later phase by the BC project team.
Table 5: CFIA for the sales order process

Process: Sales order process Components Task of Component Priority: Highly critical Impact of Failure, Workaround Criticality red = highly critical, yellow = critical, green = non-critical Likelihood of Failure Countermeasures Rare, Occasional, Frequent Application Systems CRM Highly critical RTO is 2h Workaround: Orders can be entered directly in ERP; Order entry volume will be 20% of normal order volume; Opportunity processing is not possible, will be critical after 4 hours, highly critical after 1 day Highly critical Rare Telephone Highly critical Rare Power Highly critical LAN Highly critical WAN Highly critical ERP Order entry in CRM can continue Order processing is not possible, will be critical after 4h Infrastructure Business objects Business partner Products Provide information for customer contact, provide information for delivery object must exist but some errors in customer data can be corrected Critical Order entry only Highly critical Frequent HA solution implemented possible with correct product information Pricing conditions Order could be created without final prices, corrected prices could be provided later on Critical Sales order consistency of old sales orders is irrelevant for creating a new one Non-critical

3.2.4 CFIA Matrix

The CFIA matrix provides a summary view of the criticality of all processes and components. The information collected in the previous step is consolidated into this matrix. The example below details a possible structure of a CFIA matrix.

The CFIA matrix allows you to easily identify which processes will be affected and to what extent they will be affected by the failure of a component or by an issue with data consistency of a data object. Based on the resulting criticality of a component, appropriate recovery options and protection measures can be determined for this component.
Example: Our example lists all processes as columns and all components as rows of the CFIA matrix. We ordered the business processes with decreasing priority from left to right, according to the criticality that was determined in section 3.2.2. Again, we include the main data objects in the matrix because they are important for data consistency considerations and logical recovery on object level.

Each field of the matrix depicts the criticality of a component for a specific business process. The criticality is noted according to the following schema:
- Green: Non-critical component (because it is used only by non-critical processes or because an efficient workaround is in place)
- Yellow: Critical component (strong business impact)
- Red: Highly critical component (process is interrupted)

A field is left blank if a failure of the component or an inconsistency of the object does not impact the process in any way. You could note the availability of a workaround, for example, by adding a ‘W’ to the field of the matrix. This would mean that the original criticality was reduced due to the availability of this workaround.

The criticality of the components being used by a business process cannot be higher than the criticality of the process itself. Therefore, the maximum criticality assigned in a column can be that of the business process. The overall criticality of a component is determined by the maximum criticality given for this component over all columns.

Looking at the rows of the matrix, you can identify the criticality of components and objects that are frequently used in business processes. The matrix also shows which processes are most vulnerable to component failures or object inconsistencies by considering the number of entries in the respective column.
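The rules above (row maximum, column cap, entries per column) can be sketched in a few lines of Python. This is a minimal illustration only: the matrix below is an abridged, illustrative subset loosely following the example processes of this document, and the names `LEVELS`, `MATRIX`, `overall_criticality`, `violations` and `entries_per_process` are assumptions of this sketch, not part of any SAP tool.

```python
# Ordering of the criticality colors: green < yellow < red.
LEVELS = {"G": 1, "Y": 2, "R": 3}

# Criticality of each process (column header of the matrix).
PROCESS_CRITICALITY = {
    "Sales order": "R",
    "Service": "R",
    "Marketing": "Y",
    "Reporting": "G",
}

# Rows are components or data objects. A process missing from a row
# corresponds to a blank field (no impact). Illustrative values only.
MATRIX = {
    "CRM": {"Sales order": "R", "Service": "R", "Marketing": "Y", "Reporting": "G"},
    "ERP": {"Sales order": "R", "Reporting": "G"},
    "WAN": {"Sales order": "R", "Service": "Y"},
    "Pricing conditions": {"Sales order": "Y", "Reporting": "G"},
    "Sales order (object)": {"Sales order": "G", "Reporting": "G"},
}


def overall_criticality(row):
    """Overall criticality of a component: the maximum over all columns."""
    return max(row.values(), key=LEVELS.__getitem__)


def violations(matrix, process_criticality):
    """Fields rated higher than the criticality of their process column."""
    return [
        (component, process)
        for component, row in matrix.items()
        for process, level in row.items()
        if LEVELS[level] > LEVELS[process_criticality[process]]
    ]


def entries_per_process(matrix):
    """Number of non-blank fields per column (vulnerability indicator)."""
    counts = {}
    for row in matrix.values():
        for process in row:
            counts[process] = counts.get(process, 0) + 1
    return counts


for component, row in MATRIX.items():
    print(component, overall_criticality(row), len(row))
```

Keeping blank fields out of the dictionaries makes the count of affected processes per component simply the length of the row.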
Table 6: CFIA matrix

Components | Overall criticality of component | # of processes | Sales order process | Service process | Marketing process | Reporting process
Criticality of process | | | Highly critical | Highly critical | Critical | Less critical
# of components | | | 6 | 5 | 4 | 4
# of data objects | | | 4 | 2 | 2 | 4
Application Systems
CRM | R | 4 | R | R | Y | G
ERP | R | 2 | R | | | G
Infrastructure
Telephone | R | 3 | R | Y | Y |
Power | R | 4 | R | R | Y | G
LAN | R | 4 | R | R | Y | G
WAN | R | 2 | R | Y | |
Business objects
Business partner | R | 4 | R | R | Y | G
Products | R | 4 | R | R | Y | G
Pricing conditions | Y | 2 | Y | | | G
Sales order | G | 2 | G | | | G

3.3 Risk Assessment

In this phase, required service levels are determined for business processes and underlying components. The key figure for business continuity is the required recovery time, given by the Recovery Time Objective (RTO). We first determine the requirements on business process level. The input comes from the previous phases of the analysis. However, if a business process can be replaced by some alternative process for some period of time, the required RTO may need to be adapted accordingly. From process level, we then come down to the requirements on component level.

3.3.1 Support requirements for Business Processes

This step identifies how long a critical process can be unavailable, from a business point of view, without severe negative impact. Since business might be able to operate a business process in an alternative way, such as a paper-based approach, this needs to be included in the considerations. These alternatives can increase the time available for recovery of the systems supporting the disrupted business process in contrast to the time identified in section 3.2.2. Business has to specify the resulting time requirement, after which the business process has to be fully recovered. The time for which a workaround is sufficient, until the original process must be completely recovered, is called the RTO for full recovery.
Since a workaround may not be available immediately after a failure, or should not be activated immediately due to some side-effects, the RTO for minimum recovery defines how long a process may be completely unavailable until the workaround must become operational.

In this step, roughly outline such alternative business processes. The details will be elaborated in the implementation stage (see chapter 4). For each alternative, describe the staffing, skills, services and procedures that are needed to operate the workaround. If no workarounds are available, business only has to provide the RTO after which operation of the business process has to be fully recovered.

Example: As an alternative process for the opportunity/sales order process, the ERP system might be used to enter sales orders in the business process, without using the CRM system if it is unavailable due to a system failure. The details of the implementation of the alternative process have to be elaborated in stage 3 (see section 4.5). In this example, the availability of an alternative processing for the sales order process relaxes the support requirement from 2 hours (as previously given in Table 4) to a full day. Since this workaround only compensates for a failure of the CRM system, we need to distinguish between components. The RTO may thus differ for a specific process depending on the component that is unavailable.
Table 7: Support requirements for business processes

Process | Criticality (from 3.2.2) | Workaround exists for failure of | Requirements for workaround | RTO for minimum recovery | RTO for full recovery
Sales Order Process | Highly critical | CRM | Staff needs access to ERP; training in ERP UI necessary | 2 hours | 1 day
Sales Order Process | Highly critical | n/a | n/a | | 4 hours
Customer Service Process | Highly critical | n/a | n/a | | 6 hours
Marketing Process | Critical | n/a | n/a | | 2 days
Reporting Process | Less critical | n/a | n/a | | 1 week

3.3.2 Support Requirements for Components

After the determination of required service levels and maximum acceptance times of reduced service levels for the critical business processes in case of emergency, the required service levels for the underlying components can be derived. This will be the input for the next steps, the definition of recovery options and risk reduction measures.

The criticality of components is determined by the CFIA matrix. The information that is still missing before going into the design phase for the BC strategy is the likelihood of a failure of a component and the times that can be allowed for a recovery of the component (the RTO for each component). The goal must be to provide information for the upcoming decision whether recovery mechanisms are sufficient for a component or whether preventive (risk reduction) measures are required due to the criticality of the supported business processes.

The RTO for a component is given by the minimum ‘RTO for full recovery’ of all business processes using this component. Please note that based on the availability of workarounds for some processes, the ‘RTO for full recovery’ might have been relaxed in the previous step (section 3.3.1). In addition to the RTO of the underlying business processes, the importance of a component is also determined by the number of processes relying on it.
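The derivation rule stated above (the component RTO is the minimum ‘RTO for full recovery’ of all processes using the component) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the process values are taken from the example in Table 7, using the relaxed value of 1 day for the sales order process because the sketch looks at a failure of CRM, and the mapping of CRM to all four example processes as well as the names `RTO_FULL_RECOVERY`, `USED_BY`, `component_rto` and `process_count` are hypothetical.

```python
from datetime import timedelta

# 'RTO for full recovery' per business process (example values from Table 7).
# Assumption: the sales order value is the one relaxed by the CRM workaround.
RTO_FULL_RECOVERY = {
    "Sales Order Process": timedelta(days=1),
    "Customer Service Process": timedelta(hours=6),
    "Marketing Process": timedelta(days=2),
    "Reporting Process": timedelta(weeks=1),
}

# Hypothetical usage map: which processes rely on which component.
USED_BY = {
    "CRM": [
        "Sales Order Process",
        "Customer Service Process",
        "Marketing Process",
        "Reporting Process",
    ],
}


def component_rto(component, used_by=USED_BY, rto=RTO_FULL_RECOVERY):
    """Return the minimum 'RTO for full recovery' of all processes using the component."""
    processes = used_by[component]
    if not processes:
        raise ValueError(f"no process uses component {component!r}")
    return min(rto[process] for process in processes)


def process_count(component, used_by=USED_BY):
    """Return the number of processes relying on the component."""
    return len(used_by[component])


print("CRM:", component_rto("CRM"), "- used by", process_count("CRM"), "processes")
```

With these inputs the customer service process determines the RTO of CRM. The process count is returned separately because, as the text explains, it may justify a manual adjustment of the derived value.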
A component that is used by a high number of business processes usually has a higher criticality than a component that is used by a single process. This aspect can require an adaptation of the values determined solely on the RTO of the involved business processes.

Besides RTO, the Recovery Point Objective (RPO) constitutes an important criterion for recovery of a component. RPO defines the acceptable data loss during recovery of a component, for example, when restoring from a backup or when switching to a standby site. In order to ensure data consistency in a federated system landscape, an RPO of 0 is required. If the RPO is more than 0, technical recovery of a component will leave the need to analyze and resolve remaining data inconsistencies before business operations can continue. This means that the recovery time can be considerably longer (also see 3.6.2). The RPO must be determined by the business process owners, taking into account the impact of possible data loss.

Since we again include business objects in the list of components (see table below), we need to define what we understand as “RTO and RPO of a business object”. If, for example, business objects of type Products became corrupted due to some software bug, the RTO for Products would define how long it may take until objects of type Products are available again for use by the core business processes. In addition, the RPO of a business object would define the amount of tolerable data loss in case of a contingency. The minimum RPO of all business objects maintained by a component should equal the overall RPO of that component.

Note: When determining RTO and RPO in this section, it is important to note that the costs to achieve them are not yet considered. So, when discussing the (technical) solutions to achieve these goals in sections 3.4 and 3.5, the results obtained in this phase may need to be revised later on.
If, for example, the costs of the solutions become prohibitive, the business may decide to review the service level agreements based on a cost/benefit analysis.

Example:
The following table lists the criticality and RTO/RPO for our example components. These result from the RTO requirements of the business processes. The RPO of an application component can be determined by assessing the RPO required for all business objects maintained by this component (this information is contained, for example, in the documentation created in section 3.1.4).

For the sales order process, the required RTO for CRM relaxes to 1 day due to the workaround described in 3.3.1. Since the workaround relies on the availability of the ERP system, the RTO for ERP remains at 2 hours.

Since Products are required for the most important Sales Order Process, their RTO is determined by the RTO of this business process (2 hours). This underlines the importance of the business object Product and shows that some advance consideration should be given to ways of re-establishing the consistency of Products if they are corrupted, for example, by an incomplete recovery. As can be seen from the data flow charts in section 3.1.4, Products are held in ERP and in CRM. So, in case of corruptions or inconsistencies in one system, it might be possible to recover Products from the other, still operative system (although some objects might not be recovered completely this way, because pending (queued) objects were not yet transferred between the systems). Therefore, in a business continuity strategy it is important to ensure that the replication of Products and other objects between systems is working properly, so that the current data is available in one of the systems for recovery in case of a contingency in the other system.

The columns ‘Current RTO’ and ‘Current RPO’ can be used to contrast the required RTO/RPO with the currently committed RTO/RPO from IT.
If the currently committed values are higher than the requirements, this indicates that new solutions will be needed to achieve the requirements or that the requirements identified so far need to be revised.

Table 8: Support requirements for components

Component | Criticality (from 3.2.4) | Likelihood of failure | Required RTO | Required RPO | Current RTO | Current RPO
Application Systems
ERP | Very high | Low | 2 hours | ~0 | |
CRM | Very high | Low | 6 hours * | ~0 | |
Infrastructure
Telephone | Very high | Low | 2 hours | n/a | |
Power | Very high | Medium | 2 hours | n/a | |
LAN | Very high | Low | | n/a | |
WAN | Very high | High | | n/a | |
Business objects
Business partner | Very high | | 6 hours | 1 hour | |
Products | Very high | | 2 hours | 1 hour | |
Pricing conditions | High | | 1 day | 1 hour | |
Sales order | Low ** | | 1 day | ~0 | |

* determined by service process since criticality for sales order process was relaxed to 1 day in table 7
** only resulting from this example, which does not regard the usually very critical order to cash process

3.4 The Business Continuity Strategy

Now that the risks and continuity requirements are known, appropriate measures can be determined. Two basic approaches need to be distinguished:

- Risk Mitigation
- Recovery

Risk Mitigation concentrates on prevention, trying to avoid a business disruption by eliminating or reducing the risks identified. Recovery, on the other hand, deals with reducing the impact a service disruption will have by providing solutions or strategies that allow resumption of operations as quickly as possible. The goal of recovery options is to reestablish first minimal (if applicable) and then full business operations within the limits given by the requirements. Recovery options also come into play if a risk mitigation measure failed.
As stated by ITIL, “an organization that identifies high impacts in the short term will want to concentrate efforts on preventative risk reduction methods, for example, through full resilience and fault tolerance, while an organization that has low short-term impacts would be better suited to comprehensive recovery options”. The hardest part of this task is to find the right balance between risk reduction and the different recovery options that are available. This is mostly determined by the costs involved with the different alternatives.

Risk Mitigation is usually achieved through availability management, that is, redundancy and high availability measures. The balance has to be found between the:

- Costs of the risk mitigation measures (hardware, cluster solutions, and so on), including maintenance and testing of the solution

and the

- Costs of unavailability (including various aspects as described in section 3.2.2) plus costs of recovery measures to re-establish operations

If the latter costs are lower, risk mitigation may not be appropriate and recovery options might be sufficient. However, as a general rule to avoid business disruption, risk mitigation should be preferred over the invocation of recovery mechanisms.

Recovery options are available on various levels, which differ primarily in the possible speed of recovery. Determining the appropriate solution also depends mainly on costs, by comparing the:

- Costs of the respective recovery solution (standby arrangement, replication, backup and restore, tape shipping, etc.)

and the

- Costs of unavailability through longer recovery time (including various aspects as described in section 3.2.2) plus costs of recovery to re-establish operations through simpler recovery measures

The following diagram, which applies similarly to costs of risk mitigation measures as well as costs of recovery measures, illustrates the dilemma of identifying adequate technical solutions.
Since technical measures to reduce risks or to reduce the duration of recovery become more expensive the higher the level of protection they achieve, it is necessary to balance the costs of these measures against the costs of disruption and recovery.

Figure 10: Business Continuity Cost Curves

Ideally, the chosen strategy meets the intersection of the two lines; the intersection marks the maximum allowable outage time.

Note: If the analysis should show that the ‘ideal’ solution that is chosen for business continuity cannot reach the previously determined service levels from section 3.3, these need to be revised and adapted in a new round with the business process owners.

In addition to risk mitigation and recovery options, monitoring constitutes an additional building block for business continuity. Monitoring has to ensure that any disturbances or anomalies will be detected as early as possible and countermeasures can be initiated before they escalate into major business disruptions.

3.5 Risk Mitigation Measures

Risk reduction measures are taken into account for contingencies with highest business impact. Their goal is to avoid the materialization of a risk and prevent a service disruption.
Measures to reduce the risk of a business disruption include for example:

- Elimination of Single Points of Failure (SPOFs) through:
  o redundancy of hardware components masking the failure of one component
  o high availability / cluster solutions allowing a failover of processes to a second server in case of a server failure, optimally masking any perception of a failure from the end user
- Resilient networks and systems
- Uninterruptible power supply
- Use of multiple service providers or establishing an alternate service provider for critical external services
- Fire detection and fire suppression installations
- Change control and change management

3.5.1 Elimination of Single Points of Failure

A single point of failure (SPOF) poses a risk to business continuity, since the failure of such a component immediately interrupts business operations. A SPOF typically is a technical component or service supporting a business process.

The elimination of SPOFs can be achieved through redundancy or high availability (HA) solutions. HA solutions mitigate the risk of failures since they aim at masking failures by (almost) seamlessly failing over business operation to alternate hardware.

Example measures to eliminate SPOFs in an SAP environment include:

- Redundant hardware components (server, storage system, and so on)
- RAID protection
- Redundant network components and network routes
- Redundant middleware components (multiple load balancers, multiple web servers, and so on)
- Multiple SAP application servers
- Failover cluster solutions for database system and SAP central instance / SAP central services
- Parallel database management systems

More information on HA solutions for SAP is available in the SAP Service Marketplace at http://service.sap.com/ha.

The adequacy of technical solutions to eliminate SPOFs must be determined based on costs of the solutions, likelihood of the failure and impact of the failure / required service levels.
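The weighing of solution costs against likelihood and impact of a failure can be illustrated with a small calculation. The following Python sketch is a simplification and not part of the methodology itself: it assumes that the impact can be expressed as an expected yearly loss (failure frequency multiplied by outage and recovery costs), and all names and figures are hypothetical.

```python
def expected_annual_loss(failures_per_year, outage_hours, cost_per_hour, recovery_cost):
    """Expected yearly cost of a risk: frequency x (outage cost + recovery cost)."""
    return failures_per_year * (outage_hours * cost_per_hour + recovery_cost)


def mitigation_is_justified(annual_mitigation_cost, loss_without, loss_with):
    """A measure pays off if it costs no more than the loss it is expected to avoid."""
    return annual_mitigation_cost <= loss_without - loss_with


# Hypothetical figures for a database server that is a single point of failure.
loss_without_cluster = expected_annual_loss(
    failures_per_year=0.5, outage_hours=8, cost_per_hour=20_000, recovery_cost=10_000
)
# Assumption: a failover cluster reduces the outage to 15 minutes per failure.
loss_with_cluster = expected_annual_loss(
    failures_per_year=0.5, outage_hours=0.25, cost_per_hour=20_000, recovery_cost=0
)

print(loss_without_cluster, loss_with_cluster)
print(mitigation_is_justified(60_000, loss_without_cluster, loss_with_cluster))
```

In practice the decision will rarely rest on a single figure, since indirect costs such as loss of reputation (see section 3.2.2) are hard to quantify.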
As discussed above, the costs for risk mitigation should (in general) not be significantly higher than the costs induced by the risk.

Example:
In the previous sections, the components supporting the most critical sales order process have been identified. These components are candidates for protection through HA solutions. Even though a restricted workaround is available in case of an outage of CRM, investing in an HA solution can be reasonable because a noticeable impact is already perceived after 4 hours due to the inability to process opportunities. Therefore, the CRM and ERP database and central instance / central services will be protected by a cluster solution and multiple application servers will be provided for CRM and ERP for redundancy. The network connections should be redundant, a stand-by power supply should be available and telephony needs to be secured. See Table 10 for an overview.

3.5.2 Change Management

Logical errors or data inconsistencies usually result from incorrect software or user errors. There is no technical solution available to protect from such kinds of errors. The only option is to prevent and detect such errors before they can affect the production landscape, through thorough change management and rigid testing in pre-production phases.

3.6 Determine Recovery Options

Recovery options are taken into account for contingencies with lower business impact or for cases when risk mitigation measures fail or are not possible. Their goal is to reduce the duration of a service disruption by reestablishing normal operation as quickly as possible and, if applicable, by providing and activating a workaround in the meantime.

In planning how to recover certain business processes or system components, it is important to determine the available recovery options on technology level (failures of system components) and on application level (logical errors or data inconsistencies).
3.6.1 Basic recovery categories

There are a number of basic options that can be considered for the recovery approach:

Do nothing

If the contingency does not have a business impact, it might be possible that doing nothing is an option and no recovery is needed.

Example:
(Technical failure): A test system fails that was scheduled to be reinstalled next week. Operation without the test system is acceptable as it will be available again in a week and no major tests have to be done during this week.
(Logical failure): A new program corrupted data. The analysis shows that only historical data was affected. Since all involved processing has already taken place, it is decided that a recovery of this corrupted data will not be needed.

Manual correction

If the contingency is of a small scale but has some impact on the business process, a manual correction could be that data is manually corrected or a small report corrects inconsistencies in business objects. This measure is contrasted to a major disruption of a process that needs a more sophisticated recovery procedure.

Example:
(Technical failure): A system becomes unavailable for a short period of time while a data upload via a file interface is in progress. Since the status of the upload is in an unknown state, manual intervention and restart of the upload is required. Application management identifies the affected objects and resends them to the system manually.
(Logical failure): End users created sales orders with incorrect tax classification. The impact is that certain sales orders cannot be processed. After discussion with the end users, the application management team identifies the sales orders and corrects the tax classification manually.
Gradual recovery

Gradual recovery or 'cold standby' is applicable if no immediate recovery of the business process is needed and the organization can operate for up to 72 hours (according to ITIL), or longer, without a reestablishment of the full business process on the respective system components. When considering cold standby, the necessary hardware is either already provided at a disaster recovery site or it must at least be ensured that the necessary hardware can be obtained in time to rebuild the systems.

Example:
(Technical failure): The storage hardware of the reporting system faces a problem. The system must be restored. As the reporting process is not critical, no spare hardware is available. Due to an agreement with the storage provider, replacement hardware will be delivered and installed within 2 days. Restore and database recovery of the system will be finished within another 2 days.

Intermediate Recovery

Intermediate recovery or 'warm standby' is necessary if you want to reestablish your business process within 24 to 72 hours (according to ITIL). This involves at least having spare hardware available at a remote site, either company-owned or provided as a recovery service. This may also include the creation of a daily mirror of the production data at the remote site. To become operational, this mirror only needs database forward recovery for the logs created since then.

Example:
(Technical failure): The marketing process of our examples above can be unavailable for more than 1 day without a severe impact on business but the service process should not be unavailable for more than 2 days. To ensure a recovery after 24 hours the systems and database files of the affected systems are mirrored to a remote site. To make the systems available at the remote site, manual activities have to be performed for database recovery and system restart.
(Logical failure): A database table with business partners is dropped.
The corresponding tablespace is restored to an alternate hardware; the lost data is exported from this ‘analysis system’ and then imported back into the production system. The whole procedure requires 24 hours until the object business partners is accessible again.

Immediate Recovery

Immediate recovery or 'hot standby' provides for ‘immediate’ restoration of services/processes and is usually provided as an extension to the intermediate recovery provided. The immediate recovery is supported by the recovery of critical business functions and support areas during the first 24 hours (according to ITIL) following a service disruption. However, recovery demands nowadays even lie in the range of a few hours or less. For components that require immediate recovery, a loss of service has an immediate impact on the organization's ability to make money, such as the sales order process in our previous examples.

Example:
(Technical failure): The sales order process in the examples of this chapter has a severe impact on business if the ERP system is unavailable for more than 4 hours. To ensure that system operation can be recovered in less than four hours, a standby database at a remote site is continuously recovered with logs from the production site. A switch to the standby database would be performed, for example, in case of a severe storage system failure when the implemented high-availability solution is not applicable. The standby database would also be activated if database block corruptions made the primary database unusable.

When you have decided which recovery option you want to use for which business process disruption scenario or component failure scenario, you need to detail the applicable recovery method for recovering a system component or for recovering business objects.
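As a rough aid for assigning a process or component to one of the standby categories above, the ITIL time ranges quoted in this section can be turned into a simple lookup. The Python sketch below is illustrative only: the function name `recovery_category` is hypothetical, and the handling of the exact boundaries (24 and 72 hours) is an assumption, since the ranges in the text overlap at their limits.

```python
from datetime import timedelta


def recovery_category(rto: timedelta) -> str:
    """Map a required RTO to a basic recovery category.

    Assumed boundaries: below 24 hours -> immediate, 24 to 72 hours ->
    intermediate, above 72 hours -> gradual.
    """
    if rto <= timedelta(0):
        raise ValueError("RTO must be positive")
    if rto < timedelta(hours=24):
        return "immediate"
    if rto <= timedelta(hours=72):
        return "intermediate"
    return "gradual"


# 'RTO for full recovery' values from the example in Table 7 (no workaround).
example_rtos = {
    "Sales Order Process": timedelta(hours=4),
    "Customer Service Process": timedelta(hours=6),
    "Marketing Process": timedelta(days=2),
    "Reporting Process": timedelta(weeks=1),
}

for process, rto in example_rtos.items():
    print(f"{process}: {recovery_category(rto)}")
```

The category is only a starting point; costs and the availability of workarounds can still lead to a different choice for an individual process.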
3.6.2 Impact of Technical Recovery and Logical Recovery on Recovery Time

When considering technical recovery options, it is important to recognize possible dependencies to logical recovery. If technical recovery of a system component, like a database restore or switch to a remote site, involves some amount of data loss, data consistency between the systems of a landscape is no longer guaranteed. This means that following the technical recovery, recovery on application level (‘logical recovery’) is required to detect and remove these inconsistencies. This has an important impact on overall recovery time.

Figure 11 shows different recovery scenarios and the corresponding parts contributing to the required recovery time.

Scenario 1: Technical Failure and Complete Recovery

Following a technical failure, some kind of technical recovery method is applied without any data loss in the affected system. This could be, for example, a complete recovery of the database following a database restore or a switchover to a synchronously mirrored disaster recovery site. When the technical recovery has finished, the system is fully operational again.

Scenario 2: Technical Failure and Incomplete Recovery

Following a technical failure, some kind of technical recovery method is applied that causes some amount of data loss in the affected system. This could be, for example, an incomplete recovery of the database following a database restore because some logfiles are unavailable. Some data loss may also be caused by a switchover to a disaster recovery site that is not synchronously replicated and cannot be subject to a complete database forward recovery. When the technical recovery has finished, business recovery has to follow to identify inconsistencies between this system and the other systems of the landscape. The overall recovery time until the system and dependent business processes are fully operational is given as the sum of the technical recovery and the logical recovery.
Scenario 3: Logical Failure is Corrected in the Affected System

A logical failure corrupts some business objects in system 1. The corrupted objects are identified and repaired directly in this system. The affected business process is operational after this logical recovery has finished.

Scenario 4: Logical Failure is Corrected Through Point-in-time Recovery of the Affected System

A logical failure corrupts some business objects in system 1. To avoid the effort of repairing the corrupt objects directly in the system, a database point-in-time recovery (incomplete recovery) is performed to the point before that error occurred. After this technical recovery method is finished, the affected system itself is in a consistent state. However, due to the data loss caused by this operation, data consistency in relation to the other systems of the landscape is no longer maintained. The overall recovery time is increased by the time it takes to repair these inconsistencies on application level. The affected business process is only fully operational after this logical recovery has finished. Additionally, many other business processes will be affected since their data also became inconsistent. Data consistency for many other systems not shown in this figure may be affected as well.

Figure 11: Recovery Options Following a Technical or Logical Failure

These scenarios show that the RTO of a business process can be determined by different phases of a recovery until the business process is fully operational. The actual recovery time for a process is given by the RTO of the involved components plus the recovery time to re-establish data consistency, if required.

3.6.3 Recovery Options per System Component

Recovery of a system component means bringing back a component into an operable state after any kind of failure.
In case of system component failures, you can replace the involved hardware, reinstall the respective software/system components and restore the application data to reestablish operation. For database components, you have to define backup and restore strategies that adhere to the service levels and continuity requirements.

As we saw before, completing the technical recovery does not necessarily mean that the business processes are already fully operational since data consistency may need to be checked and repaired (depending on the data currency (RPO) the chosen solution can ensure). If, for example, you perform an incomplete recovery of a database, you will have to deal with inconsistencies between dependent systems. For instance, recovering a CRM system with a backup that is one hour old creates inconsistencies with the ERP system. The ERP system will include data from CRM that was exchanged within the last hour.

Disaster recovery (DR) typically focuses on technical solutions that allow a resumption of normal operation of a system component after an outage of the primary data center. Various DR solutions are offered by the hardware vendors. A DR solution must allow moving or restoring the data to a DR site. DR solutions can mainly be distinguished according to the recovery level they can provide regarding recovery time (RTO) and recovery point (RPO).

The following table provides some typical RTO and RPO ranges for the main categories of recovery solutions.
Table 9: Technical Recovery Solutions

Recovery Solution | Recovery category | RTO (only technical recovery part) | RPO
Database Restore and Recovery | Gradual / Intermediate | 12 – 72 hours | 0 **
Tape shipping - Pickup Truck Access Method (PTAM) | Gradual / Intermediate | 48 – 168 hours | 24 – 168 hours
Standby database (asynchronous log shipping) | Intermediate / Immediate | 1 – 8 hours | 10 min – 24 hours
Remote point-in-time copies | Intermediate / Immediate | 4 – 24 hours | 4 – 24 hours
Asynchronous replication | Immediate | 30 min* – 8 hours | 0 – 5 minutes
Synchronous replication | Immediate | 5 min* – 8 hours | 0 ***

* in combination with HA solutions
** complete recovery
*** depending on continuation policy

Example:
The following table gives an example of possible risk mitigation and recovery options for the example components depicted for our business processes:

Table 10: Example Recovery Options for System Components of Example Processes

Component | Recovery Class | Risk Mitigation Method | Technical Recovery Method
Appl. Systems
ERP Database | Immediate | Cluster solution for DB processes | Synchronous replication to remote site, standby hardware readily available; daily full backup allowing restore within 12 hours (including log recovery for 1 day)
ERP Central instance | Immediate | Cluster solution for enqueue and message server | Replicated enqueue server
ERP Appserver | Immediate | Two application servers for redundancy | SLA to replace and reinstall appserver within 4 hours
CRM Database | Immediate | Cluster solution for DB processes | Synchronous replication to remote site, standby hardware readily available; daily full backup allowing restore within 12 hours (including log recovery for 1 day)
CRM Central instance | Immediate | Cluster solution for enqueue and message server |
CRM Appserver | Immediate | Two application servers for redundancy | SLA to replace and reinstall appserver within 4 hours
Infrastructure
Telephone | Immediate | SLA with service provider | Mobile phones for key users
Power | Immediate | Stand-by power components readily available |
LAN | Immediate | Redundant connections, redundant routers and switches |
WAN | Immediate | Alternative arrangement using satellite connection |

3.6.4 Recovery Options for Business Objects

Logical errors or data inconsistencies require recovery on application level by repairing business data on object level. This kind of recovery can be required as a consequence of a preceding technical recovery that resulted in some amount of data loss (RPO not 0) or as an error scenario in itself.

Business object unavailability can result from business data being deleted or corrupted in a way that it is useless for dependent business processes. Basically, we can distinguish three types of errors:

- Logical errors inside a single system
- Inconsistencies between systems
- Inconsistencies between system and the real world

These inconsistencies have to be repaired manually or by reports. The identification of inconsistencies and the methods to correct them are a very important task for the experts that create the detailed recovery plan in the implementation stage (stage 3).
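To make the second error type, inconsistencies between systems, more concrete, the following Python sketch compares the same kind of business object as held in two systems. It is a generic illustration, not an SAP tool or API: the function `find_inconsistencies`, the dictionary layout and the sample product data are all hypothetical, and a real comparison would have to read the objects from the systems involved and account for objects that are still queued for transfer.

```python
def find_inconsistencies(system_a: dict, system_b: dict) -> dict:
    """Compare two {object key: attributes} snapshots of the same object type."""
    keys_a, keys_b = system_a.keys(), system_b.keys()
    return {
        # Objects that exist in A but were lost or never arrived in B.
        "missing_in_b": sorted(keys_a - keys_b),
        # Objects that exist in B but were lost or never arrived in A.
        "missing_in_a": sorted(keys_b - keys_a),
        # Objects present in both systems but with different content.
        "differing": sorted(k for k in keys_a & keys_b if system_a[k] != system_b[k]),
    }


# Hypothetical product snapshots from two systems of the landscape.
crm_products = {
    "P-100": {"price_group": "A"},
    "P-200": {"price_group": "B"},
    "P-300": {"price_group": "A"},
}
erp_products = {
    "P-100": {"price_group": "A"},
    "P-200": {"price_group": "C"},
    "P-400": {"price_group": "B"},
}

result = find_inconsistencies(crm_products, erp_products)
print(result)
```

The three result lists correspond to different repair actions: reloading missing objects from the other system, or deciding which version of a differing object is the correct one.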
At this stage, you should determine the business objects for which you need to work out detailed recovery plans due to their criticality. You should roughly identify possible strategies on how to repair inconsistencies for these objects.

Possible options for repairing inconsistencies include:

For logical errors inside a single system:

- Replicate lost data from another source system. The data flow charts of the processes in section 3.1.4 show from which other systems business objects might be recovered.
- Export data from an analysis system (restore done to a sandbox)
- Re-enter lost data in the recovered system
- Re-extract data from middleware components
- Reconstruct data from database indexes
- Run correction reports

For inconsistencies between systems:

- Use compare tools provided by SAP
- Run check and correction reports
- Manual compare and correction

Note: A point-in-time recovery of a database should not be used as a recovery option for logical errors inside a single system, since this induces new problems with data inconsistencies between the systems of a landscape and to the real world.

Discuss the recovery options per business object and identify the recovery strategies for possible partial or complete loss of business objects.

More information on consistency checks and consistency check tools provided by SAP can be found in the Best Practice “Data Consistency Monitoring within SAP Logistics” that will be available at http://service.sap.com/solutionmanagerbp. Section 4.6.1 will provide a detailed example of how recovery options for a business object can be worked out.

Note: To detect inconsistencies as early as possible, and before they result in a possible business disruption, data consistency monitoring should be established (see 3.7).

Example:
The decision is made that only business objects required for the highly critical sales order process shall be subject to detailed recovery planning.
The basic strategy is outlined here while further details will be worked out in the implementation stage.

Table 11: Typical Recovery Options for Business Objects of the Sales Order Process

Business object | Application Recovery Strategy
Business partner | Comparison with DIMa; initial or request load either from CRM or ERP
Product | Comparison with DIMa; initial or request load from ERP
Pricing conditions | Comparison with DIMa; initial or request load from ERP
Sales order | Comparison with DIMa; request load either from CRM or ERP

3.6.5 Recovery Options per Process

A business process may become unavailable due to a failure of its underlying system components or corruptions of its required business data. Recovery options for both error types were discussed in the previous sections.

If the recovery options discussed above are not sufficient to satisfy the criticality of a business process, establishing an alternative procedure (workaround) to maintain at least rudimentary operation of the process can be considered. In section 3.3.1, already existing workarounds in a company were collected as an input for the determination of the support requirements. At this point, the focus lies on possibilities of establishing new workarounds for highly critical parts of business processes.

Service disruptions can also be due to insufficient performance of a business process. If this lasts for a longer period, activation of a workaround can also be a method to circumvent this kind of service disruption.

Workarounds need to be documented in detail and the end users need to be trained so the activation of a workaround will be successful. Since a workaround always implies some limitations and usually requires some more or less expensive post-processing when normal operations are reestablished, the activation of a workaround should be under control of the business continuity manager / business continuity team.
Workarounds can be:
- Paper-based
- Based on remaining systems of a system landscape
- Working with reduced functionality
- A combination of the above
Example: For the sales order process, it might be possible to implement the following alternative business process on a smaller scale. Opportunities can be replicated with a customer program as special quotations to the ERP system (whether custom coding is worth the effort depends on the criticality of the business process). The call center agents could access the ERP system, find the opportunities as quotations of a special transaction type and negotiate orders with customers because products, pricing, business partner hierarchies and the configurator are also available in the ERP system. It is not possible, however, to access the product catalog and use the customer fact sheet. This gives less value to the customer, but orders could be negotiated on a smaller scale.
A requirement to establish this workaround would be to enable the replication of opportunities to ERP and to evaluate and develop reports that enable a stand-by processing of opportunities and orders without product catalog and fact sheet. Training of the call center agents for the ERP environment also needs to be planned.
The following table summarizes the requirements for additional workarounds as determined in this step.
Table 12: Additional Workarounds for Processes
Process | Recovery class | Workaround required for | Description
Sales Order Process | Immediate | Failure of CRM | Current workaround for sales order taking is not sufficient; an alternate procedure is required for opportunity to sales order processing (as described in the text above).
Customer Service Process | Immediate | n/a |
Marketing Process | Intermediate | n/a |
Reporting Process | Gradual / Manual correction | n/a |

3.7 Define Monitoring Objectives
Monitoring is an important element to detect events that may cause a business disruption before the effect becomes visible to the end users. Therefore, the definition of helpful monitoring tools and procedures should also be part of business continuity planning. Monitoring that supports business continuity should comprise:
- Monitoring of component availability
- Monitoring of data replication to a disaster recovery site
- Database and backup monitoring
- Monitoring of error logs
- Business process monitoring
- Monitoring of interfaces and data exchange
- Batch monitoring
- Performance monitoring
- Monitoring of data consistency inside systems and between systems
Consistency monitoring is supported by the Data Consistency Cockpit in SAP Solution Manager, which raises alerts when an anomaly is detected.

3.8 Identify Resources for Recovery Mechanisms
For the detailed elaboration of the BC plans and implementation of the BC measures, technical as well as business staff are needed. The required skill set should be described so that an overview of the team to be involved can be obtained. For execution of a recovery plan, people with a similar skill set need to be part of the recovery team. Agreements have to be negotiated that allow emergency allocation plans for resources such as the key business users or in-house/external database experts from application management or business process operations.
Apart from human resources, activation and execution of recovery mechanisms requires physical equipment, for example project offices for the recovery teams.

3.9 Agree on Recommendations
Before the next stage, the implementation of the business continuity concept, can be started, the continuity approaches determined in this stage need to be agreed on by all involved parties.
A summary of the analysis results, continuity requirements and recommended measures should be presented to the project steering committee and business continuity team. Only after agreement on the proposed measures and approval of the involved costs have been obtained from the steering committee and senior management can the project continue with the implementation and creation of the actual BC plan.

4 Stage 3: Implementation and Testing
Roles
- BCM Project Manager
- IT specialists from application management, business process operations, and SAP technical operations
- Business Process Champions
Output
- Implementation plan
- Organizational structure supporting BCM
- BC master plan
- Crisis management and escalation procedures that may invoke BC plan
- Detailed recovery plans and procedures
- Documentation of risk reduction measures and standby arrangements
- Test plan
- Initial test of continuity concepts
During the implementation phase of a business continuity project, the solutions and recovery options determined and agreed in stage 2 will be implemented and elaborated in detail. Besides the implementation, the documentation of the solutions and procedures plays an equally important role.
Following the ITIL standard, the third stage of the project comprises the following phases:
- Organization: Establish the organization that is responsible for business continuity management
- Implementation planning: Develop implementation plans that describe the structure of the BC plan and assign work packages
- Implement risk reduction measures
- Implement standby arrangements
- Develop recovery plans: Create and document the business continuity and recovery plan
- Develop procedures: Lay down general and detailed procedures for different recovery tasks
- Initial testing: Create test plans and perform initial tests of the BC concept

4.1 Establish Organization
The organization that executes the recovery plans consists of:
- The business continuity manager, who ensures that continuity plans are up-to-date and tested
- The incident management staff that reports a possible disaster case
- A crisis team of senior managers from business (business process champion) and IT (application management) that decides if disaster recovery plans need to be executed
- The recovery team with representatives from business (key business user/business process champion) and IT (application management/SAP technical operations/business process operations). The recovery team should also include key users and end users who ensure a minimum operation of the business process, for example by working on paper or with other workarounds.
SLAs recording the availability agreements for involved departments and partners have to be defined.
Example: Organizational structure for Sales and Marketing recovery in a CRM/ERP landscape
Figure 12: Org. chart of Business Continuity Project

4.2 Develop Implementation Plans
Before developing the business continuity plan and implementing the recovery methods determined and agreed in stage 2, a detailed plan should be set up addressing how these steps will be executed. It has to be defined which plans shall be created, and the overall structure of the BC plan has to be laid out.
Lastly, the implementation plan has to specify who will be responsible for the creation of which parts. The owners of each plan must ensure that they have identified and agreed support and services from other parties.

4.2.1 Recovery Plans
A BC plan usually consists of a master plan and a number of detailed plans for various aspects to be considered in different service disruption scenarios. The master plan is the summary document which provides all general information on the business continuity plan like the BC organization, roles and responsibilities, crisis management and invocation procedure, general guidelines, and so on. Within the scope of this document, detailed plans should then describe:
- Recovery procedures on business object level
- Recovery procedures and workarounds on process level
- Recovery procedures for implemented DR technologies
- Functionality and scope of risk mitigation measures
The results of the analysis phase conducted during the BC project will be included in the respective plans. Appendix 7.2 lists the pieces of information that should be part of a BC plan.

4.3 Crisis Management
The business continuity plan must document how crisis management has to be performed and when the disaster recovery process is invoked. This means that incident management can trigger the invocation of the disaster recovery plan. To invoke the plan, incident management has to classify the problem as endangering the operation of the whole business process involved. The staff performing incident management therefore needs clear guidelines on when an incident is to be classified as endangering the whole process and might thus invoke the disaster recovery plan. The decision whether the plan is actually activated is typically made by a 'crisis management team'. The crisis team should include senior managers from the business and IT support departments, who base their decision on information gathered during the incident management process.
A severe incident can occur at any time, day or night, so it is essential that guidance on the invocation process is readily available. In the case of a physical disruption such as a fire, the decision to invoke a recovery plan is easy, but if there is a technical problem endangering a key business process, the decision is more difficult. In such a case, it is good practice to set a deadline by which the problem must be resolved; otherwise, the recovery plan is invoked. The deadline should incorporate the support requirements for the endangered business process to ensure that the impact of the disruption is acceptable from a business point of view.

4.4 Implement and Document Risk Reduction Measures
In this stage, the risk mitigation measures that were determined and agreed in stage 2 to prevent a service disruption up front will be implemented and documented. The documentation should include:
- The scope of the solution
- Failure types not covered by the solution
- Prerequisites and measures for activation or operation of the solution
- Test and maintenance plans to keep the solution operational
Besides focusing on system components and technical solutions to eliminate SPOFs, the change management process should also be subject to verification or revision, since this is the only way to avoid logical errors or data corruptions.

4.5 Implement and Document Standby Arrangements
In this stage, the standby arrangements that were determined and agreed in stage 2 to reestablish operations in a timely manner after a service disruption will be implemented and documented. These standby arrangements include:
- Technical solutions chosen for the recovery of system components
- Workarounds determined to keep vital business functions operational using an alternate approach or process
For intended business process workarounds, all details and requirements of the alternative process need to be worked out in this stage.
Coming back to the example of section 3.6.5, the workaround described there requires the implementation of a customer-specific report that replicates opportunities to ERP (as quotations with a special type).
The documentation of standby arrangements should include:
- The scope of the solution
- Failure types not covered by the solution
- Prerequisites and measures for activation or operation of the solution
  o For a technical solution like switchover to a DR site: What is required before operations can continue at the DR site? Is data consistency given after a switchover?
  o For a business process workaround: What resources are required? What data is needed, and where does it come from?
- Training of end users on working with an alternate process
- Fallback requirements and procedures: what must be done to return to normal operation after the original systems are available again?
  o For a technical solution like switchover to a DR site, describe how to switch back operations to the primary site
  o For a business process workaround, describe how the data created by the workaround will be incorporated back into the standard process and systems and what follow-up activities are required to complete the missing parts that were not covered by the workaround
- Test and maintenance plans to keep the solution operational

4.6 Develop Recovery Plans
In this phase, the recovery plans will be created according to the structure set in the implementation plan (section 4.2.1). The continuity plan on IT level must give all necessary information on how to ensure that services, facilities and critical systems are either still provided or are recovered in a time frame that is accepted by business. The continuity plan relies on the availability of systems and facilities. It does not only include recovering systems to a certain state but also resolving all inconsistencies between systems to achieve a consistent state of the system landscape, which enables a return to normal business operation.
The priority of services, facilities and systems must be included in the continuity plan to clearly communicate what needs to be done first. The continuity plan itself has to be readily available to all participants in the process. The plan is subject to change control, as a change of business processes must trigger an adaptation of the continuity plan. The plan has to be detailed enough to enable a technical person with basic experience of the involved systems to follow the procedure. Note, however, that basic knowledge is not sufficient for repairing data inconsistencies or logical errors. For such application-related issues, experts with deep knowledge of the business processes, business data and related database objects need to be involved.
For different contingency and recovery scenarios, checklists need to be developed that describe what needs to be done to bring the involved systems of a business process back to normal operation. This includes, for example, data integrity checks that need to be run after technical recovery actions have been carried out for a system.

4.6.1 Example: Extraction of a Recovery Plan for the Incomplete Recovery of an SAP CRM System
Regarding the specific scenario of an incomplete recovery of a system, the plan must include all dependencies of the systems among each other and the objects exchanged between them.
Scenario: A malicious network driver has corrupted the database of the CRM system. A point-in-time recovery of the CRM system is inevitable. All critical business processes are affected by the downtime of the CRM system.
The following steps comprise those parts of a recovery plan which deal with the inconsistencies in the system landscape caused by the point-in-time recovery of the CRM system. It is assumed that the ERP and BW systems continue to operate while CRM is being recovered.
The following procedure is proposed for incomplete recovery of CRM.
The description is not at the full detail level, but can be carried out accurately by experienced staff. Also, the procedure is not complete; it is only used as an example that details some aspects of a full recovery plan.
Procedure:
(1) Stop business operation in CRM
(2) Perform point-in-time recovery of the database of the CRM system (do not start CRM SAP system before next step)
(3) Isolate CRM system (disable communication with BW and ERP)
- Stop outbound queue processing from ERP to CRM: Deregister outbound queues in ERP (transaction SMQS) *
- Disable outbound queue processing from CRM to ERP: Lock respective RFC-user in ERP (transaction SU01)
- Disable data requests from BW to ERP (for example, from transaction RSA1 in BW or from process flow control, also from SM37)
- Disable synchronous data transfers from CRM to BW by locking the corresponding RFC-user in BW
- Start CRM SAP system
- Stop productive operation in CRM by locking all regular users (transaction SU10)
- Stop CRM middleware message processing (transaction MW_MODE)
- For CRM Field scenarios only: Stop CRM replication & realignment queues (transaction SMOHQUEUE)
- Disable outbound queue processing from CRM to ERP: Deregister outbound queues in CRM (transaction SMQS) *
- Disable outbound queue processing from ERP to CRM: Lock respective RFC-user in CRM (transaction SU01)
- Unschedule the BDOC reorganization job MW_REORG in CRM to avoid deletion of BDOC message store
* Deregistering outbound queues may not be sufficient in all cases. Some applications may register queues automatically even if queues are deregistered. To be safe, RFC destinations can be disabled instead, for example by pointing them to an invalid server (transaction SM59).
(4) Handling of RFC queue entries
Queues to ERP in the recovered CRM system: In the CRM system, which was restored to a former point in time, there may be pending RFC queue entries, which were not processed or which were just being processed at that time.
If the queue status is monitored regularly, there should be only a few such queue entries in the restored CRM system. All these queue entries were probably processed completely by the time of the crash and can therefore be deleted from the CRM system (otherwise, they would be processed twice). There are two cases in which such queue entries may not have been processed at the time of the crash:
- If the point to which the CRM system is set back lies only a very short period before the point of the crash
- If a queue could not be processed because it was deactivated or because it was in an error state preventing further entries from being processed, and if this situation was not resolved in the period until the crash occurred.
In such cases, the “contents” of these RFCs could be cross-checked against the ERP data to find out whether the RFCs were processed or not.
Queue entries in ERP: Pending RFC queue entries in the ERP system must not be deleted because they contain the most current data objects, which need to be processed. They should be processed later, after business recovery between ERP and CRM is completed, because they may rely on CRM data, which was lost due to the incomplete recovery. Because the CRM system was isolated above, unintentional processing of these queues is prevented.
(5) Repair data inconsistency with ERP
Due to continued productive operation in ERP and BW, the following must be considered during business recovery:
- New RFC queue entries in ERP, which are created during the period of business recovery, must not be deleted.
- CRM data that is created, for example, by re-entering lost data may cause double postings in other systems. This has to be considered during business recovery.
In this example we will only show the case of business partners, but in a real plan recovery procedures must be described for all business objects.
a.
Business Partners: Leading systems can be ERP and CRM, so objects transferred in both directions have to be considered. The following figure displays the types of inconsistencies that may appear for business partners after an incomplete recovery of CRM. As both systems are leading systems, our goal will be to transfer all inconsistent objects from ERP to CRM because ERP has the newer version. This will also recover objects that were created or changed in CRM.
Figure 13: Inconsistency Cases After Incomplete Recovery of CRM
The CRM Data Integrity Manager (DIMa) can be used to compare business partners at header level (check for existence) or at detail level (check for different field content). Based on the comparison result, you can either request a business partner from ERP into CRM again, or you can send a business partner from CRM to ERP. Please see SAP note 647664 for more details on the DIMa features for comparing ERP customer masters with CRM business partners.
This procedure will resolve all business partner inconsistencies between CRM and ERP except for deletions (cases 3 and 6, see below). It will re-transfer to CRM all business partners created or changed in ERP. In CRM, this will also recreate all business partners that were created or changed in CRM and already replicated to ERP. Only business partners that were not yet transferred to ERP cannot be recovered this way.
Note: Please pay attention to the fact that the data models of the ERP customer master and the CRM business partner are different. This includes attributes that are available in ERP or CRM exclusively. Even if one system is considered the master system, where a new partner is preferably created, there can be subsequent updates in the other system to add further attributes.
Example: A new customer master is created in ERP and automatically loaded to CRM. Afterwards, CRM is used to add special marketing attributes to this customer, which cannot be maintained in ERP.
Consequently, such exclusive attributes cannot be recovered from another system.
For identifying inconsistencies due to deletions (cases 3 and 6), the DIMa tool can be used as well. We assume that there are no direct physical deletions of business partners. Instead, there is first a logical deletion (setting the deletion flag), and archiving runs perform the physical deletion at periodic intervals. The deletion flag is replicated between ERP and CRM. Now, if the business partner was already physically deleted in one system, DIMa would report it as missing in the other system. This unwanted effect can be avoided by using the deletion flag (field LOEVM) as an additional filter for the DIMa comparison. DIMa does not allow restricting the comparison to a specific period of time (a restriction is only possible on the creation date of a business partner).
Note: There can be cases where pending queue entries contain updates that were missing in CRM. When performing a DIMa comparison, such temporary differences would be reported as inconsistencies as well. It can make sense to run multiple iterations of DIMa comparisons to account for data that was stored in queues but was processed successfully later on.
Note: The time for a DIMa comparison can be quite long, comparable to a full initial download from ERP to CRM. If you expect a large number of missing or inconsistent objects, it should be considered whether it makes sense to perform a real initial download instead. In such a case, the pending ERP outbound queue entries can be omitted.
Possible alternative to using DIMa for business partners: By means of change documents for business partner records in ERP, all business partners that were modified since the recovery point in time can be selected in ERP and then transferred to CRM.
Business partner records that are created or modified in ERP after the crash point in time should not be included because they will be transferred to CRM using regular replication mechanisms (objects still in the ERP queues should also be excluded). Change documents are activated in ERP for all business partner fields that are relevant for CRM.
1. In ERP, select all business partners changed between the recovery point in time and the crash point in time. In transaction VD03, the menu functions Environment, Field changes and Environment, Multiple display can be used to display all business partners changed since that day (selection by time is not possible).
Note: A special report may be needed for that purpose if this function is not sufficient.
2. Download all these business partners from ERP to CRM. To avoid duplicates, only missing objects may be transferred.
Note: A special report is needed to automate the business partner download for the objects identified above. This report can call the standard functionality (transaction R3AS4) to synchronize missing changes and/or business partners. This report must pay attention to objects still available in the queues; these must not be re-transferred.
In CRM, you can use the report BUSCHDOC to perform a mass evaluation of change documents for several business partners. This way, you can identify all business partners that have been changed within a given timeframe. Specifically for business partners, there is also a special transaction called CRMM_BUPA_SEND to manually trigger the sending of CRM business partners to the receiving systems, such as a connected ERP backend. With transaction CRMM_BUPA_MAP you can even trigger a new request load from ERP to CRM for a single business partner.
b. Other Business Objects
For all other business objects that are exchanged between CRM and ERP, procedures to check and re-establish data integrity are required as well. For the common exchange objects between ERP and CRM, corresponding DIMa objects are available.
Please see transaction SDIMA_BASIC for a list of DIMa objects with their offered repair options.
(6) Repair data inconsistency with other systems
For all other interfaced systems and all business objects that are exchanged with the CRM system, procedures to check and re-establish data integrity are required (for example BW).
(7) Other corrective actions
- Process pending queue entries in ERP (see (4))
- Check if transports were imported into the CRM system during the period that was lost due to the incomplete recovery
(8) Checks
Execute functional checks of CRM business processes
(9) Restart productive operation in CRM
Now that data consistency has been re-established and all functionality checks were successful, the CRM system can return to productive operation. The communication to other systems can be enabled and users can be released to work with the usual business processes in CRM.

4.7 Develop Recovery Procedures
4.7.1 General Procedures for Different Contingencies
General procedures should lay out guidelines on how to handle different types of contingencies affecting business continuity. These can describe, for example:
- Conditions to be given before performing a switchover to the DR site
- A general guideline for handling logical errors
- A basic procedure for handling data inconsistencies

4.7.1.1 Example: General Guideline to Avoid Incomplete Recovery
In case of logical errors appearing in a system, point-in-time recovery of the system is a technical option to fix the situation, but one with a serious side effect on data consistency between the systems of a system landscape. Thus, a general guideline should restrict and even prevent the usage of this option by creating awareness of its downside. A control instance should be established that needs to consent prior to any incomplete recovery, also weighing the impact from an application point of view.
- Define approval process for point-in-time recovery
- Define roles and assigned people for approval
- Involve engagement of application management
- Define decision criteria for engaging a point-in-time recovery

4.7.1.2 Example: General Procedure for Handling Data Inconsistencies
If, for example, data inconsistencies are reported by end users, a predefined procedure can help with the analysis and resolution of the situation. The following steps provide a general guideline:
1. Inconsistency is reported
2. Understand affected business process, data objects and corresponding interfaces
Note: The documentation created in section 3.1 will be very helpful for this.
3. Analyze if it is only a temporary difference or a real inconsistency
Note: Temporary differences can be caused, for example, if a queue used for data exchange is stopped. Since the data is not transferred, the end user perceives this as a data inconsistency. However, it is not a real inconsistency requiring corrective tools because the situation can be easily resolved by processing the pending messages.
4. Analyze if it is a technical or logical inconsistency
Note: A technical inconsistency is everything that can be found on a database level and needs appropriate correction in the underlying database, while a logical inconsistency is a persistent mismatch that is due to a misunderstanding of the process or a misinterpretation of data. While technical inconsistencies can be identified by technical means like check reports, logical inconsistencies need to be identified by mapping the intended business process to the underlying data structures. They cannot be identified by technical means as the underlying data is consistent on a one-to-one level.
5. Decide if productive use can continue or needs to be interrupted
6. Identify root cause (programming error, non-transactional interface, incorrect error handling, incorrect data entry, no clear leading system, …)
7. Correct root cause
8. Analyze if dependent data is affected (in the same system, in other systems or in the real world)
9. Identify inconsistencies, filter out differences
10. Correct inconsistencies
11. Correct dependent data

4.7.2 Detailed Procedures for Specific Tasks
Recovery procedures to solve specific recovery tasks need to be described on a detailed level. For technical recovery methods on system level, the provided details must enable technical persons to carry out the necessary recovery steps (for example, a database restore and recovery) without specific knowledge of the affected system. For the recovery from data inconsistencies on application level, recovery always requires expert knowledge of the affected application and business objects. Describing the recovery procedures involves the already available tools and/or a specification of what reports need to be written, tested and executed to repair data inconsistencies for highly critical objects determined and agreed during stage 2. It also involves the definition of decision criteria and checks to tell whether a return to productive operation is possible.
Example: In the previous section for recovery of the CRM system after database failure, two data consistency check and correction requirements were identified:
- A tool is needed to compare business objects between ERP and CRM. The comparison must identify completely missing objects and objects that have different content due to missing updates. For common business objects exchanged between ERP and CRM, the Data Integrity Manager (DIMa) can be used.
- For special requirements, the DIMa tool may not be sufficient. For example, a special report has to be developed to select changed business partners in ERP since a certain point in time. In standard ERP, the selection can only be done with the granularity of days. Another special report is needed that downloads all the selected business partners from ERP to CRM. To avoid duplicates, only missing objects may be transferred.
This report can call the standard functionality (transaction R3AS4) to synchronize missing changes and/or business partners. The report must pay attention to objects still available in the queues; these must not be re-transferred.
The IT support departments must implement such special reports needed during disaster recovery.

4.8 Recovery Testing
An important part of the BC project is the creation of test plans and the establishment of tests. Only regular testing can ensure that the recovery solutions will work, that all prerequisites are met and that the people involved have a good understanding of the processes and procedures. Only tests will show gaps and deficiencies of a plan.

4.8.1 Create Test Plan
The test plan has to lay down the scope of testing, the objectives, the test procedures and test schedules. The test plan has to make sure that all aspects of the recovery plan are tested. A test plan will include separate tests for individual parts of the business continuity plan, as well as full tests that can verify the business recovery plan as a whole. A full test has to demonstrate that the whole recovery process is supported by both business and technical departments and that the documented recovery procedures are operational. A full test ensures that standby arrangements are valid, that external partners are integrated reliably in the continuity efforts and that in-house staff understand the procedure and can execute the recovery plan. Certain aspects of the recovery plan can and should be verified during tests:
- Recovery of the business process or system in time
- Capability of staff to execute recovery plan
- Availability of resources (physical and human resources)
- Effectiveness and on-time involvement of external partners

4.8.2 Initial Testing
Initial tests should already be performed for each individual recovery solution in parallel to its implementation.
Immediately following the main milestones of the implementation phase, a more comprehensive test completes stage 3 of the business continuity project, verifying the interaction of the individual recovery solutions in the overall business continuity plan.

5 Stage 4: Operational Management

Roles
• BCM Project Manager
• IT specialists from application management, business process operations, and SAP technical operations
• Business Process Champions
• End Users

Output
• Change control procedure for BC plan
• Training plan
• Test schedule
• Monitoring schedule

Once the previous phases have been completed, it is important to establish the continuity plan in day-to-day business operations. It is very important to verify that the continuity plan is kept up to date after changes to business processes or IT infrastructure. This section describes the main tasks that are important from an operational point of view.

5.1 Create Awareness

Everybody in the organization has to be made aware of the business continuity concept. Everybody also needs to understand the importance of disaster recovery and its impact. The BC plan needs to be shared and must be accessible to all people who will be involved in maintaining business continuity. Furthermore, disaster recovery efforts have to be perceived as requiring routine tasks, such as checking whether the continuity plan needs to change when business processes are altered. Tasks related to continuity management have to be covered by the regular budget in financial planning.

5.2 Establish Education, Trainings and Exercises

To ensure that all people involved in the DR process are able to execute their recovery tasks effectively, training should be scheduled. This applies especially to new people joining the DR team. The training should also be used to establish a common language that enables effective communication between business recovery team members from application and IT-related areas, as well as between team members from different regions.
5.3 Establish a Continuous Review and Change Control Process

To ensure that the business continuity plan stays current, it is mandatory to review the business continuity deliverables every time there is a change to:
• Business processes
• Required service levels
• IT infrastructure
• The overall business or IT strategy

For maintenance of the continuity plan, it is essential to establish a change control procedure and clear responsibilities. It is good practice to include business continuity as a topic in every implementation project that changes business processes or IT processes. This way, possible changes to the continuity plan can be identified and implemented more easily and in a more timely manner.

5.4 Establish Regular Testing

After the initial tests, a regular testing routine has to verify the operational readiness of the continuity plan. Tests should be done as frequently as directed by senior management or audit. SAP recommends testing the continuity plan at least once a year. It is especially necessary to test the continuity plan after changes have been incorporated in the underlying business processes or IT infrastructure. Only regular testing will ensure that reliance on the business continuity plan is well grounded.

5.5 Establish Monitoring and Resolution of Findings

To prevent possible business disruptions, or at least to detect upcoming issues as early as possible, the monitoring objectives and monitoring tasks defined in section 3.7 need to be established in regular operations. If monitoring detects irregularities, measures to resolve the situation need to be initiated. For example, in the area of data consistency monitoring, regular clearing of data inconsistencies should become part of operational management. This can help to prevent such inconsistencies from escalating into business disruptions. Furthermore, regular clearing of data inconsistencies reduces the number of differences and the repair effort after a disaster recovery.
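The testing rule from section 5.4 (test at least once a year, and again after changes to business processes or IT infrastructure) lends itself to an automated reminder alongside the monitoring tasks described above. The following is a minimal sketch under stated assumptions: the function name `recovery_test_due` and its parameters are hypothetical, "once a year" is approximated as 365 days, and a change counts as untested when it was made on or after the date of the last test.

```python
from datetime import date, timedelta


def recovery_test_due(last_test, today, last_change=None, max_interval_days=365):
    """Return True if the continuity plan should be tested again.

    A test is due when no test has been run yet, when a change was made
    on or after the last test, or when the last test is older than
    max_interval_days (assumed default: 365 days for "once a year").
    """
    if last_test is None:
        return True
    if last_change is not None and last_change >= last_test:
        return True
    return today - last_test > timedelta(days=max_interval_days)


if __name__ == "__main__":
    # Hypothetical dates, for illustration only.
    print(recovery_test_due(date(2008, 1, 15), date(2008, 6, 1)))
    print(recovery_test_due(date(2008, 1, 15), date(2008, 6, 1),
                            last_change=date(2008, 3, 1)))
```

A shorter interval can be passed via `max_interval_days` if senior management or audit directs more frequent testing.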
6 Conclusion

Some people say that, for Business Continuity Management, “The planning process is more important than the plan itself.” The truth behind this statement lies in the fact that the lessons learned and the experience gained during the planning process are an important side effect of a business continuity project. The process will often reveal a multitude of findings and insights into weaknesses of current concepts and procedures. Addressing previously unknown issues and risks can already yield higher availability. But of course, the business continuity plan itself is vital for establishing continuity procedures and spreading awareness throughout an organization.

To ensure that a business continuity concept is successfully implemented and “lived” by an organization, awareness and commitment have to be created at the senior management level. The continuity project needs adequate priority and sufficient sponsorship in order to gain the acceptance and commitment of managers and staff.

This document outlined the initiation, the business impact analysis, the determination of recovery options, and the creation of a detailed recovery plan, including initial testing and the tasks necessary in operational management to make the business continuity effort a part of daily operation.

A last word on operational management to ensure the effectiveness of disaster recovery plans: A business continuity plan is only effective if it is continuously kept up to date, so staff and management have to be continually committed to the continuity effort within the organization. Everybody has to be aware of their responsibility for business continuity, so that triggering the required changes to the continuity plan becomes a matter of course whenever a change to processes or IT infrastructure is implemented. To maintain the quality of the continuity plans, management has to control and monitor the activities in the area of continuity management.
7 Appendix

7.1 Template for scoping presentation

Presenter: Business management/IT management
Audience: Senior management
Structure:
• Introduction
• Case studies:
  o Disaster Case 1
    - Impact of disaster case 1
    - Resulting loss without recovery plan
    - Resulting loss with recovery plan
  o Disaster Case 2
    - Impact of disaster case 2
    - Resulting loss without recovery plan
    - Resulting loss with recovery plan
• Costs and Resources involved in a business continuity effort
• Costs of loss without recovery plans against Costs with recovery plan + recovery project costs
• Conclusion

7.2 Contents of a Business Continuity Plan

A business continuity plan, consisting of a master plan and more specific detailed plans, should contain the following documentation:
• Results and documents created in the analysis phase (stage 2)
  o System landscape / architecture
  o Business processes
  o Interfaces and data exchange (include documentation created during stage 2)
  o Business impact analysis
  o Required SLAs
• Organization, Roles and Responsibilities, including
  o DR team
  o decision process
  o contact and distribution lists
• Crisis management, including
  o notification and activation procedure for BC plan
  o damage assessment
• Risk reduction measures, as proposed in stage 2 and implemented in stage 3
• Recovery options and recovery procedures, as proposed in stage 2 and implemented in stage 3
  o Technical standby-arrangements and activation procedures for technical solutions
  o Alternative business processing using workarounds
  o Logical recovery procedures for core business objects (business recovery)
• Procedure to return to normal operations
  o Prerequisites
  o Checks
• Education and training
• Testing
• Review and maintenance of BC plan