Business Continuity Management Whitepaper

Business Continuity Management for SAP System Landscapes
Best Practice for Solution Management
Version Date: May 2008
The newest version of this Best Practice can always be
obtained through the SAP Solution Manager
Table of contents
1 Introduction
  1.1 Goal of Document
  1.2 What is Business Continuity Management?
    1.2.1 What are the failure scenarios covered by Business Continuity?
    1.2.2 System Recovery and Business Recovery – The two major steps of recovery
    1.2.3 What impact can a major business disruption or disaster have?
    1.2.4 What should business continuity plans protect?
    1.2.5 The Course of Disaster Recovery
    1.2.6 The Business Continuity Plan
  1.3 Stages of the Business Continuity Lifecycle
2 Stage 1: Initiation
  2.1 Scoping Study
  2.2 Develop Project Plan
    2.2.1 Project Organization and Control Structure
    2.2.2 Responsibilities
3 Stage 2: Requirement Analysis and Strategy Definition
  3.1 Documentation of System Landscape and Business Processes
    3.1.1 Determine Core Business Processes
    3.1.2 Documentation of Business Processes
    3.1.3 Description of System Landscape and Interfaces
    3.1.4 Description of Data Flow between system components
    3.1.5 Aggregated Data Flow Between System Components
  3.2 Business Impact Analysis
    3.2.1 General Threats and Vulnerabilities
    3.2.2 Costs of Outage and Process Prioritization
    3.2.3 Component Failure Impact Analysis (CFIA)
    3.2.4 CFIA Matrix
  3.3 Risk Assessment
    3.3.1 Support requirements for Business Processes
    3.3.2 Support Requirements for Components
  3.4 The Business Continuity Strategy
  3.5 Risk Mitigation Measures
    3.5.1 Elimination of Single Points of Failure
    3.5.2 Change Management
  3.6 Determine Recovery Options
    3.6.1 Basic recovery categories
    3.6.2 Impact of Technical Recovery and Logical Recovery on Recovery Time
    3.6.3 Recovery Options per System Component
    3.6.4 Recovery Options for Business Objects
    3.6.5 Recovery Options per Process
  3.7 Define Monitoring Objectives
  3.8 Identify Resources for Recovery Mechanisms
  3.9 Agree on Recommendations
4 Stage 3: Implementation and Testing
  4.1 Establish Organization
  4.2 Develop Implementation Plans
    4.2.1 Recovery Plans
  4.3 Crisis Management
  4.4 Implement and Document Risk Reduction Measures
  4.5 Implement and Document Standby Arrangements
  4.6 Develop Recovery Plans
    4.6.1 Example: Extraction of a Recovery Plan for the Incomplete Recovery of an SAP CRM System
  4.7 Develop Recovery Procedures
    4.7.1 General Procedures for Different Contingencies
    4.7.2 Detailed Procedures for Specific Tasks
  4.8 Recovery Testing
    4.8.1 Create Test Plan
    4.8.2 Initial Testing
5 Stage 4: Operational Management
  5.1 Create Awareness
  5.2 Establish Education, Trainings and Exercises
  5.3 Establish a Continuous Review and Change Control Process
  5.4 Establish Regular Testing
  5.5 Establish Monitoring and Resolution of Findings
6 Conclusion
7 Appendix
  7.1 Template for scoping presentation
  7.2 Contents of a Business Continuity Plan
1 Introduction
1.1 Goal of Document
This Best Practice provides the SAP view on establishing a business continuity concept for an SAP
environment, in the style of the ITIL (the IT Infrastructure Library, see http://www.itil.co.uk) approach
for “IT Service Continuity Management”. It outlines a general procedure on how to set up a business
continuity concept for an SAP system landscape and SAP business processes, including the
identification of different risks and failure situations, which impact continuity of operations or data
consistency.
This document introduces a methodology to be used in a Business Continuity Management project,
with the different project phases ranging from the analysis of the business requirements to the
identification of adequate risk mitigation measures and recovery plans, including necessary
documentation and operating procedures.
The continuity requirements are determined working top-down from the requirements of the core
business processes, to the requirements of the underlying systems and technical components.
Following this methodology, customers will gain a deep insight into the SAP environment supporting
their core business functions. Even though customers might already have technical solutions in place to
safeguard the operation of the technical system landscape, a Business Continuity Management
project will yield a sound definition of the requirements and may reveal gaps that have not yet been
addressed.
The approach described in this document does not stop at a technical level but also addresses
possible risks on the application layer, like continuity of business processes, or consistency of data
objects. Resulting from a Business Continuity Management project, a customer will have documented
possible workarounds to sustain core functionality and will have created recovery plans for technical
failures and application-related logical errors.
The concept described in this document is intended to be followed in the early stages of a project. If
no continuity concept was established at that time, one can also be established later on, during
productive operations. Being familiar with the ITIL documentation and its general approach to IT
continuity management is not a prerequisite for working with this document, but might help in
appreciating the different phases we depict for realizing a business continuity project for an SAP
environment.
The SAP Continuity Management Service can support customers with their concepts to safeguard
the continuity of business operations. This service analyzes a customer’s continuity concept, mainly
focusing on technical options to protect the involved systems and application data. It helps to identify
gaps and discusses options to optimize protection against business disruptions due to technical or
application failures. The service can also assist a customer in the planning phase for a business
continuity concept or with reviewing different milestones during a business continuity project. For more
information on the SAP Continuity Management Service see http://service.sap.com/continuity.
1.2 What is Business Continuity Management?
“Business Continuity Management (BCM) is concerned with managing risks to ensure that at all
times an organization can continue operating to, at least, a pre-determined minimum level. The BCM
process involves reducing the risk to an acceptable level and planning for the recovery of business
processes should a risk materialize and a disruption to the business occur.” (Source: ITIL Service
Delivery, chapter 7.1.6)
The main focus of this document will be the continuity of IT services supporting the business (in ITIL
terminology called ITSCM – IT Service Continuity Management), while other services that the
business depends on may also be incorporated in a BCM concept. In the remainder of this document,
we will not distinguish further between the terms BCM and ITSCM.
A business disruption can be caused by an application failure, a system component failure or the loss
of the entire premises where the business operates. Regarding the severity of a disruption, we have to
distinguish between incidents and major business disruptions. According to ITIL, an incident is “any
event which is not part of the standard operation of a service and which causes, or may cause, an
interruption to, or a reduction in, the quality of that service”. While minor incidents can be handled by
the Service Desk (ITIL “Incident Management”), a major business disruption or disaster is an
incident that needs to be reported to the business continuity crisis team, because it could seriously
impact the availability of one or more business processes. A major business disruption that stops a
critical business process from operating may require invoking a business continuity plan.
A business continuity plan (BC plan, also called disaster plan or recovery plan) elaborates on how
to operate critical business processes on a pre-determined minimum acceptable level by using an
alternative process and on how to recover the affected business process or the affected components
back to normal operation. The decision whether an incident will be escalated to a disaster and whether
a business continuity plan will be activated is up to the business continuity crisis team. This decision
will be taken depending on the time and impact of an outage and may differ depending on the
business process being affected.
The main goal of BCM is to establish procedures in an organization that allow handling such major
business disruptions by describing alternative procedures and possible recovery methods as well as
implementing risk reduction measures and recovery technologies.
However, BCM does not end with the creation of disaster recovery plans. There is a whole lifecycle to
BCM, which needs to be established with a business continuity project. Having established business
continuity procedures, BCM needs to ensure that the continuity plans will become part of change
control management. The plans need to be updated whenever changes are applied to business
processes, be it changes to IT or changes in business operation. BCM must also introduce education
and awareness for business continuity throughout the organization and has to establish regular testing
to ensure operability of the described procedures. Each stage of the business continuity lifecycle,
which is further outlined in section 1.3, will be discussed in a separate chapter of this document.
1.2.1 What are the failure scenarios covered by Business Continuity?
In general, BCM needs to address any type of scenario that prevents a company from operating its
critical business processes.
There are three main categories of failure scenarios:
- Technical failure or disaster: This can range from crashes of individual hardware
  components to building fires or flooding of an entire computer center. Technical failures affect
  all business processes that are using the affected component(s).
- Logical failure: Faulty software or incorrect use of software may corrupt data and provoke
  data inconsistencies that cause a disruption to business processes. If, for instance, a
  malicious program deployed in an ERP system corrupts master and transactional data,
  executing an order-to-cash process may become impossible. Misuse of inventory
  management software may also result in a production stoppage, because necessary goods were
  not reordered on time at the factory.
  A logical failure may also be the result of resolving a technical failure: Point-in-time recovery or
  data loss in one system of a federated system landscape will result in data inconsistencies
  between the systems that need to be addressed before resuming regular operations.
  As these examples demonstrate, logical errors or data inconsistencies can have two
  dimensions:
  - inconsistent data within one system (for example, order data was accidentally deleted from
    the database)
  - inconsistent data between two systems of a system landscape (for example, orders are not
    consistent between an ERP and a CRM system)
- Logistical or operational failure (not in scope of this document): Apart from IT processes,
  business operation depends on many operational or logistical aspects. Required staff need to
  be available and facilities need to be accessible. Emergency plans for logistical aspects need
  to make sure that equipment, meeting places and workspaces can be made available in case
  of a disaster.
Since this document focuses on IT Service Continuity Management, logistical or operational aspects
falling into the third category will not be covered in the remainder of this document.
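The two dimensions of data inconsistency named above can be illustrated with a minimal sketch. It assumes that order identifiers and order items have already been extracted from the systems into plain Python collections; the names `OrderItem`, `orphaned_items` and `cross_system_differences` are hypothetical and do not refer to any SAP interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OrderItem:
    """Hypothetical extract of an order item; order_id points to its order header."""
    item_id: str
    order_id: str


def orphaned_items(order_ids: set[str], items: list[OrderItem]) -> list[OrderItem]:
    """Dimension 1: items whose order header is missing within the same system."""
    return [item for item in items if item.order_id not in order_ids]


def cross_system_differences(erp_orders: set[str], crm_orders: set[str]) -> dict[str, set[str]]:
    """Dimension 2: orders that exist in only one of the two systems."""
    return {
        "only_in_erp": erp_orders - crm_orders,
        "only_in_crm": crm_orders - erp_orders,
    }
```

For example, comparing the ERP order set {"1001", "1002"} with the CRM order set {"1002", "1003"} reports "1001" as existing only in ERP and "1003" as existing only in CRM. A real check would also have to compare the content of the orders, not only their existence.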
1.2.2 System Recovery and Business Recovery – The two major steps of recovery
Usually, if a business process is unavailable, the availability of all involved systems needs to be
checked first. If a system is unavailable, system recovery or technical recovery has to reestablish
technical availability of the system as a first step. This can be done, for example, by exchanging a
defective hardware component, by activating a standby system or by restoring a database from a
backup.
In most cases, resolving the technical error will immediately return the systems and processes to
regular operation. However, if the method used to resolve a technical error results in data loss for a
system component (for example, when performing a point-in-time (incomplete) database recovery or
when activating an asynchronous standby solution), system recovery leaves the landscape in a state
that requires further analysis of data consistency between its systems.
Business recovery or logical recovery is always required in case of logical errors or data
inconsistencies appearing either inside one system or between systems of a system landscape. With
inconsistent or outdated data, wrong business decisions might be made, or inconsistencies in the
system may lead to unacceptable situations such as an ERP system sending invoices without
the materials having been delivered to customers.
As described above, logical errors or inconsistencies can be the remains of a technical recovery
procedure but can also be a cause of disaster in their own right. In the latter case, usually only a
subset of business processes is affected because all technical components are available.
If a business process is unavailable due to a logical error (data corruption), the logical error needs to
be repaired. A BC plan should describe different ways to address data corruptions, for example by
extracting the correct data from some specially provided analysis system. Repairing logical errors
usually requires in-depth application knowledge.
Sometimes a technical measure is considered for resolving a logical error: recovering the affected
system to the point in time before the error was introduced by a user error or a faulty program
(database restore followed by a point-in-time recovery). This procedure can indeed remove the logical
error from the system but, as we have seen above, the resulting data loss introduces a new kind of
logical error affecting data consistency between the systems of a federated landscape.
Resolving such inconsistencies again requires business recovery, now between multiple systems.
Note: The main challenge addressed in this document is to distinguish between the two types of
errors, technical and logical, since a business continuity plan needs to address both levels: system
recovery and business recovery.
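To illustrate why a point-in-time recovery of one system triggers business recovery in the landscape, the following minimal sketch lists the documents in a partner system that were changed after the recovery point and are therefore candidates for reconciliation. The extract format (pairs of document ID and last change time) and the function name are assumptions made for this illustration only.

```python
from datetime import datetime


def documents_to_reconcile(partner_documents, recovery_point):
    """Return the IDs of partner-system documents changed after the recovery point.

    partner_documents: iterable of (document_id, last_changed) pairs taken from
    a system that was NOT restored (hypothetical extract format).
    recovery_point: the point in time to which the failed system was recovered.
    """
    return sorted(
        document_id
        for document_id, last_changed in partner_documents
        if last_changed > recovery_point
    )
```

If, for instance, the CRM database is recovered to 10:00, every sales order that ERP exchanged with CRM after 10:00 may no longer have a counterpart in CRM and has to be checked by the application experts.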
1.2.3 What impact can a major business disruption or disaster have?
A major disruption of a business process causes the process to be unavailable for a certain amount of
time. To minimize the downtime, BC plans are developed.
Various technical solutions are available to reduce the impact of a technical failure according to the
required service level and budget, for example relocating systems to another facility due to a fire or
flooding.
In case of a logical error, such technical measures are mostly useless. The duration of the required
recovery steps can be quite unpredictable. Preparing for this scenario, by having detailed
documentation of business processes at hand and by providing a general approach to addressing
such situations, will help to reduce the time for error resolution and reverting to normal operation of the
business process.
1.2.4 What should business continuity plans protect?
The BC plan should protect a company against the unavailability of its core business processes for an
unacceptable period of time due to the loss of key resources of the company. Key resources can be
personnel, computer system components and software, but also power supply, or other technical
facilities such as parts of the premises of the company itself.
The business continuity project has to evaluate which business processes need protection against a
business disruption. Depending on the importance of a process, it has to establish methods to recover
the process in case of a contingency. Critical core processes requiring immediate recovery can be
protected by high availability solutions for their critical system components and by alternative
implementations of the respective business process, to ensure operation after a disruption on a
minimum level acceptable to business. Less critical processes that can be unavailable for several
hours or days without a major impact on business can be sufficiently covered by recovery plans that
use fewer resources than the recovery plans for processes with immediate severe impact.
Since the list of possible disasters or disturbances is practically endless, business continuity plans
should not be based on specific scenarios such as a fire in building X. Instead, they are created on the
assumption that some key resources are lost or unavailable, yielding plans that apply to several
scenarios rather than to a single one. Rather than preparing for very specific error scenarios, it is
more important to clearly understand and document all vital business functions in order to keep the
business running regardless of the specific nature of a disaster.
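The distinction between processes requiring immediate recovery and less critical processes can be sketched as a simple classification by tolerable downtime. The one-hour threshold and all names below are assumptions made for this illustration; the actual limits have to be agreed with the business during the project.

```python
from datetime import timedelta

# Assumption for illustration: a process that must be back within one hour
# counts as "requiring immediate recovery". This document defines no threshold.
IMMEDIATE_RECOVERY_LIMIT = timedelta(hours=1)


def protection_approach(tolerable_downtime: timedelta) -> str:
    """Map the tolerable downtime of a process to the kind of protection described above."""
    if tolerable_downtime <= IMMEDIATE_RECOVERY_LIMIT:
        return "high availability solution plus alternative process implementation"
    return "recovery plan"


def plan_by_urgency(processes: dict[str, timedelta]) -> list[tuple[str, str]]:
    """Return (process, approach) pairs, most time-critical process first."""
    ordered = sorted(processes.items(), key=lambda entry: entry[1])
    return [(name, protection_approach(limit)) for name, limit in ordered]
```

Keying the classification on tolerable downtime rather than on a disaster scenario mirrors the recommendation above: the plan is driven by what the business can tolerate, not by the cause of the outage.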
1.2.5 The Course of Disaster Recovery
Using an example situation, this section describes the different phases that are passed through in the
course of a disaster.
Phase 1: “Incident Management”
A business disruption is detected by end users, who report an incident to the support organization.
The incident is analyzed and assessed as to whether it can be resolved within a certain time span. In
this example, three independent end users report that the CRM system is unavailable. Using the
monitoring cockpit of the CRM system, the application management organization finds that the
application servers are running but the database seems to be unavailable. Assume that this renders
the core business process sales order management unavailable for 500 users. Application management
sends an email informing the users that the CRM system is currently unavailable, so that end users
stop calling for help concerning the CRM system.
Phase 2: “Crisis team decides on invocation of business continuity plans”
Now application management tries to identify the problem with the database server. As this incident is
classified as a major business disruption, the crisis team for the sales order process is informed. A
deadline of 30 minutes is set, after which the business continuity plan is invoked if the business
process, and with it the CRM system, is still unavailable.
Phase 3: “Invocation of the BC plan”
Application Management was unable to restore the database server to normal operation within 30
minutes. However, the cause for the problem was identified. A malicious network driver that was
installed recently corrupted all data in the database. Since the initial deadline of 30 minutes for error
resolution was exceeded, the situation is escalated and the business continuity plan is activated.
Phase 4: “Alternative process implementation”
The now operative business recovery team instructs end users to enter sales orders in the ERP system
instead. Because ERP does not offer all the functionality that is usually available in CRM, the
sales order volume entered is at 50% of the normal volume when using CRM.
Phase 5: “Recovery team executes recovery plans”
The recovery team identifies that the database server as a key resource is completely unavailable. No
partial recovery is possible. In this case, the recovery plan advocates a point-in-time recovery of the
database as a system recovery step. The inconsistencies produced by this incomplete recovery of the
database need to be dealt with in a subsequent business recovery step.
After the database has been made available at system level, the inconsistencies between the CRM and
ERP databases must be repaired. This step is executed by the application experts that are part of the
recovery team. Since the affected objects and necessary activities are documented in the business
continuity plan, the team can immediately start with this extensive work.
Phase 6: “Steps prior to normal operation”
When consistency between the systems is reestablished to a sufficient degree, the recovery team
runs consistency check reports to verify that the CRM system is ready to revert to normal operation.
Functional checks are run to ensure correct operation of the business processes.
Phase 7: “End users start using normal business processes”
Now that recovery has completed and the tests were successful, the end users are instructed to revert
to normal operation using the CRM system. Data that was created using the alternative process needs
to be fed back into the CRM system.
Phase 8: “Lessons-learned”
As a follow-up to the BC plan invocation, the situation leading to the error and the course of the overall
recovery procedure are analyzed to identify possible deficiencies in error protection and recovery
handling. The lessons learned from this case are incorporated into the BC plan to improve the
business continuity concept further.
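The decision taken by the crisis team in phases 2 and 3 can be sketched as a small function. This is a minimal illustration under stated assumptions: the 30-minute default is the deadline from the example above, while the names `Decision` and `crisis_team_decision` are hypothetical. In practice the decision also depends on the impact of the outage and on the affected business process.

```python
from datetime import datetime, timedelta
from enum import Enum


class Decision(Enum):
    RESUME_NORMAL_OPERATION = "resume normal operation"
    CONTINUE_INCIDENT_HANDLING = "continue incident handling"
    INVOKE_BC_PLAN = "invoke business continuity plan"


def crisis_team_decision(deadline_set_at, now, process_available,
                         deadline=timedelta(minutes=30)):
    """Decide how to proceed with a major business disruption.

    deadline_set_at: time at which the crisis team set the deadline.
    now: current time.
    process_available: True once the business process works again.
    """
    if process_available:
        return Decision.RESUME_NORMAL_OPERATION
    if now - deadline_set_at >= deadline:
        # Deadline reached and the process is still unavailable: escalate.
        return Decision.INVOKE_BC_PLAN
    return Decision.CONTINUE_INCIDENT_HANDLING
```

Called 20 minutes after the deadline was set with the process still unavailable, the function returns `Decision.CONTINUE_INCIDENT_HANDLING`; called after 31 minutes, it returns `Decision.INVOKE_BC_PLAN`.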
The following figure provides an overview of the different steps of a disaster recovery procedure to be
established by a business continuity plan:
Figure 1: Steps of a disaster recovery procedure
1.2.6 The Business Continuity Plan
Basically, a business continuity plan must provide answers to the following ‘management’ questions:
1. Which risks am I facing?
2. Which precautions can be taken?
3. How do I proceed in case of a contingency?
4. Will the plan work?
These questions will be answered by the following elements of business continuity planning:
1. Which risks am I facing?
   Risk and Impact Analysis
2. Which precautions can be taken?
   Risk Mitigation and Recovery Options
3. How do I proceed in case of a contingency?
   Recovery Procedures & Priorities
4. Will the plan work?
   Continuous Testing & Change Management
1.3 Stages of the Business Continuity Lifecycle
In order to establish a business continuity plan, it is necessary to run a business continuity project
(BC project). According to the ITIL standard, this project can be split into four main stages as outlined
in the following chart.
Figure 2: Stages of a Business Continuity Project
Source: Office of Government Commerce (OGC) www.itil.co.uk
Each of the following chapters of this document describes one stage of a BC project. At the beginning
of each chapter, a short summary table enumerates the personnel needed in the respective project
stage and lists the main deliverables of this stage.
In a project plan for a BC project, each of these stages resolves into a number of phases and
activities. The general course of a BC project is shown in Table 1. We will be following this structure
throughout this document and describe each of those different phases in more detail.
Table 1: General course of a business continuity project
Task Name                                                                            Section   Duration
Stage 1 – Initiation                                                                 2
  Scoping Study                                                                      2.1
    Define and set strategy
  Develop Project Plan                                                               2.2
    Define project phases
    Define the project organization                                                  2.3
    Define project control structure                                                 2.3
    Identify initial costs
Stage 2 – Requirements Analysis and Strategy Definition                              3
  Requirement and Impact Analysis
  Documentation of System Landscape, Business Processes and Data Exchange            3.1
    Identify critical core business processes to include in BC plan                  3.1.1
    Document core business processes                                                 3.1.2
    Document system landscape                                                        3.1.3
    Document interfaces                                                              3.1.3
    Document data flow for business processes                                        3.1.4/5
  Business Impact Analysis                                                           3.2
    Identify general threats and vulnerabilities                                     3.2.1
    Identify costs of outage and prioritize business processes                       3.2.2
    Conduct component failure impact analysis (CFIA)                                 3.2.3
    Collect existing workarounds                                                     3.2.3
    Create CFIA matrix                                                               3.2.4
  Risk Assessment                                                                    3.3
    Determine required service levels (support requirements) for business processes  3.3.1
    Derive required service levels for components                                    3.3.2
  Business Continuity Strategy                                                       3.4
  Risk Mitigation Measures                                                           3.5
    Elimination of critical single points of failure                                 3.5.1
    Thorough change management                                                       3.5.2
  Determine Recovery Options                                                         3.6
    Guiding principles                                                               3.6.1/2
    Technical solutions / standby arrangements for system components                 3.6.3
    Procedure / correction tools for business objects                                3.6.4
    Possible new workarounds for business processes                                  3.6.5
  Define Monitoring Objectives                                                       3.7
  Project management
  Identify required people and resources                                             3.8
  Agree on recommendations                                                           3.9
Stage 3 – Implementation                                                             4
  Establish Organization (DR team)                                                   4.1
  Develop detailed implementation plans                                              4.2
  Crisis management                                                                  4.3
    Activation procedure for BC plan -- Damage assessment and decision making
    Roles and responsibilities for BC
  Implement and document risk mitigation measures                                    4.4
  Implement and document stand-by arrangements                                       4.5
    Technical measures and corresponding procedures
    Business process workarounds including prerequisites and procedures
  Create recovery plan(s)                                                            4.6/7
    Document recovery options and chosen solutions
    Document solutions and procedures
    Develop individual procedures for systems, processes and business objects
    Create master plan summarizing detailed procedures
  Recovery Testing                                                                   4.8
    Create test plans                                                                4.8.1
    Perform initial testing                                                          4.8.2
Stage 4 – Operational Management                                                     5
  Create awareness                                                                   5.1
  Establish education, training and exercises                                        5.2
  Establish a continuous review and change control process                           5.3
    Ongoing risk evaluation and risk assessment
  Establish regular testing                                                          5.4
  Establish Monitoring and Resolution of Findings                                    5.5
    Error prevention and error detection
    Clearing of inconsistencies
2 Stage 1: Initiation
Roles:
- Senior IT management
- BCM Project Manager
- Business Process Champions
- Recovery Expert from Application Management / Business Process Operations (ability to translate
  business recovery requirements (application) into technical requirements and specifications)
Output:
- Scoping study
- BC project plan
- Initial costs
- Project organization and control structure
To install a successful business continuity concept within an organization, it is essential to establish
awareness and commitment from senior management. The concept has to be fully endorsed to obtain
the acceptance and commitment of management and staff. Business Continuity depends on the
commitment at all levels in the organization and on a definition of their responsibilities. Management
(such as the business process champion or the program management office) needs to continually
monitor and prioritize business continuity activities against operational activities. The overall aim is a
state in which management considers business continuity in relation to, and even prior to, making key
business decisions. This allows a balanced assessment of the risks to be considered in the decision
making process.
An awareness of the need for business continuity planning may be generated from:
- The range of risks for the organization
- The potential business impact that could result from the realization of the risks
- The probability of each of the risks
- Personal responsibilities and liabilities
- External pressures
The best and most effective way to raise senior management awareness is to highlight potential risks
and business impact facing an organization in terms of business failure to meet key performance
indicators or corporate objectives.
As with most IT issues, business continuity crosses organizational boundaries and consumes
management time and financial resources. Sponsorship at the highest level and integration into the IT
structure is paramount to the success of a business continuity project. Without this level of
sponsorship, risks to business continuity include:
- Misalignment with the business and IT strategies, thereby failing to address the true values
  and business risks as perceived by senior management
- Lack of momentum, profile or resources
- Lack of extensive co-operation and input required from management at all levels
For business continuity to be successful within an organization, a suitable organizational structure
needs to be implemented. The roles should be integrated into the existing suite of IT management
responsibilities, like the responsibilities and roles defined by SAP’s E2E Solution Operations standard.
The optimum management structure will:
- Allow responsibilities for ongoing business continuity to be clearly defined and allocated
- Integrate into existing organizational structures, hierarchies and responsibilities
- Allocate responsibilities to functions or individuals that have the necessary presence,
  credibility, skills, knowledge and expertise within the IT organization
- Ensure that the organizational structure that manages business continuity during day-to-day
  operation closely resembles the structure that will execute the recovery mechanisms in case
  of a disaster
- Ensure the business continuity strategy and requirements are integrated with the business
  and IT strategies
Tasks of the initiation stage of a business continuity project:
- Conducting a scoping study
- Establishing a business continuity project plan, project structure and procedures
- Identifying critical business processes
- Establishing the business continuity project team and business continuity responsibilities
2.1 Scoping Study
The scoping study is the initial task of the initiation stage and brings the risks and impacts to
management's attention. The scoping report is used to raise awareness of the need for business
continuity, to identify the business benefits, to generate management commitment and to act as the
starting point for more detailed project plans (stage 1) and business impact plans (stage 2).
A scoping study should describe the impact that some conceivable disaster cases would have on one (or
more) of the most important business processes. It should give an idea of how a disaster plan
could mitigate this impact and compare the resulting costs with and without a recovery plan. A
template for the scoping study is outlined in appendix 7.1.
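The cost comparison with and without a recovery plan can be sketched as simple arithmetic. All figures and names below are hypothetical and serve only to show the shape of the calculation. The sketch assumes that outage costs grow linearly with downtime and that a workaround keeps a fixed share of the business running.

```python
def scoping_comparison(cost_per_hour, hours_without_plan, hours_with_plan,
                       workaround_share, plan_cost):
    """Compare the cost of one disaster case with and without a recovery plan.

    cost_per_hour: loss per hour while the process is completely unavailable.
    hours_without_plan / hours_with_plan: expected outage duration in each case.
    workaround_share: fraction of normal business sustained by the workaround (0..1).
    plan_cost: cost of establishing and maintaining the recovery plan.
    """
    without_plan = cost_per_hour * hours_without_plan
    with_plan = cost_per_hour * hours_with_plan * (1 - workaround_share) + plan_cost
    return {
        "without_plan": without_plan,
        "with_plan": with_plan,
        "saving": without_plan - with_plan,
    }
```

With a hypothetical loss of 10,000 per hour, a 24-hour outage without a plan costs 240,000. With a plan that shortens the outage to 8 hours, sustains 50% of the business through a workaround and costs 50,000 itself, the total is 40,000 + 50,000 = 90,000, a saving of 150,000 for this single disaster case.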
2.2 Develop Project Plan
After the initial awareness is raised and the permanent commitment of senior management is
established, a project plan for the business continuity concept is created, including project structure
and procedures. Table 1, which depicts the general course of a business continuity project, can also
be used as a template for a business continuity project plan. This template needs to be completed by
filling in the estimated duration of the different project steps. Subsequently, the initial costs of the
project need to be determined.
To get a more exact estimation of the duration of individual steps in the project plan as well as
business and IT areas to be involved in the project, it is helpful to already have an idea of the core
business processes that shall be covered in the BC plan (also see section 3.1.1.1 which finally sets
the scope of business processes to be included in the BC project).
2.2.1 Project Organization and Control Structure
After the BC project is approved, the project team is staffed and introduced. Initial briefing sessions for
the project team, stakeholders and the business areas, are always worthwhile to raise awareness,
prompt support and manage expectations. Ideally, these sessions develop into a campaign with
defined methods of communication. Regular feedback to participants demonstrates the progress
achieved as a result of their actions and contributions. Table 2 presents the methods of
communication which need to be established in the BC project.
Table 2: Methods of communication
Method of Communication: Regular briefings to staff on the emergency procedures and guidelines
Goal: Readiness of staff for business continuity action plans

Method of Communication: Regular desktop walkthroughs of the recovery plans
Goal: Familiarity of staff with business continuity plans

Method of Communication: Regular articles in newsletters, notice boards, corporate intranet to
maintain the profile of Business Continuity
Goal: Awareness of business continuity plans

Method of Communication: Inclusion of an overview of the organization’s business, business
processes and Business Continuity mechanisms in the staff induction process
Goal: Knowledge of staff regarding business continuity plans

Method of Communication: Regular progress reports to the Board and regular agenda items on other
management and IT committees
Goal: Update of Board members regarding business continuity plans
The following figure outlines the typical organizational structure for large organizations. It
supports both ongoing management and invocation of business continuity procedures, during the
project and after its completion.
Figure 3: Org chart of the business continuity project / Source: Office of Government Commerce
(OGC) www.itil.co.uk
Management sponsorship at board level is often executed by a person whose responsibilities
encompass most of the organization (for example IT). Day-to-day responsibility for business continuity
is often assigned to a senior manager, who advises the board on a business continuity strategy and
ensures that it is in line with business and IT strategies. The business continuity manager,
together with a team of business area managers (business process champions) and IT service
continuity managers (application management), supervises change control, testing, auditing,
awareness, and training. Steering committees at senior management level co-ordinate business
continuity activities across the organization and support the business continuity manager.
The steering committee should meet regularly to confirm the business continuity strategy is still valid
and discuss changes that could affect the strategy as well as to review programs and procedures.
Management, such as application management within the IT organization, is typically given ownership
of the deliverables that relate to their area of expertise or responsibility. Ownership not only involves
responsibility for ensuring deliverables are met, but also for ensuring that they remain up to date and
fit for purpose as application management also owns the change control management standard.
Invocation of continuity mechanisms and recovery options is usually undertaken by one or several
business continuity teams, focused on specific areas of the IT organization (for example, external
communications, local area networks, servers, and so on). During periods of operational stability, the
service continuity teams play a vital role in the implementation, testing, maintenance and support of
these continuity or recovery procedures and plans.
IT may establish a working group that, typically, fills key roles in the IT recovery process and fills
operational management roles to deal with the continuity and availability management issues.
2.2.2 Responsibilities
Table 3 outlines the typical responsibilities for business continuity during times of normal operations,
as well as crisis operations. These layers of responsibility also correlate with the typical management
structure for business continuity (see org. chart in previous section).
Table 3: Responsibilities
Level | Task
Board | Initiate and Sponsor Business Continuity; Set Strategy and Framework; Allocate Management Resources; Handle external communication
Senior Management | Direct Business Continuity; Define Scope; Create and Maintain Awareness; Co-ordinate Cross-Organizational Responsibilities and Procedures
Management (Business process champion/Program Management Office) | Analyze Business Continuity; Define Requirements and Deliverables; Manage Contracts; Supervise Projects and Operations
Supervisor and Staff (Application Management/Business process champion) | Develop and operate Business Continuity; Develop Requirements; Negotiate Contracts; Operate Procedures
Responsibilities should be clearly defined, communicated to management, and documented in
appropriate role and job descriptions. To ensure continual management of business continuity at an
operational level, an incorporation of specific deliverables into individual staff objectives and
responsibilities is recommended.
Following a disruption to the normal operating environment, management responsibilities change in
line with command, control and operational roles and responsibilities. These include responsibility for
taking the actions needed to invoke the continuity plan.
3 Stage 2: Requirement Analysis and Strategy Definition
Roles:
- BCM Project Manager
- Recovery Expert from Application Management / Business Process Operations (able to translate
  business recovery requirements (application) into technical requirements and specifications)
- Business Process Champions
- Key Users
Output:
- Scope of BC plan
- Documentation of processes, system landscape, interfaces and data exchange
- Risk analysis
- Impact analysis / CFIA matrix
- Service level requirements for processes and components
- Recommended risk mitigation measures
- Recommended recovery options
- Monitoring objectives
- Resource requirements
- Agreed BC strategy
This stage provides the foundation to determine how well an organization will be able to handle a
business process interruption or disaster. As a result, risks to the business operation and their impact
will be well understood, which forms the basis for planning countermeasures as well as emergency
and recovery procedures.
Stage 2 consists of two main tasks:
- Requirement and impact analysis, which identifies threats to continuity of services and
  business processes, assesses the severity of these risks and defines the requirements as
  service levels
- Design of the business continuity strategy, which identifies possibilities to reduce the above
  risks and determines the options to support a recovery
The following figure visualizes the main phases and the main tasks to be included in stage 2:
Figure 4: The main phases and tasks of Stage 2 - Requirements analysis and strategy definition
Top-down Analysis / Bottom-up Recovery
The intent of BCM is to secure availability of
business processes. Business processes require
different applications and systems. These
systems require a technical infrastructure going
down to the electrical power supply. Therefore, to
analyze the requirements for business continuity,
a top-down approach is used to determine which
applications, systems, system components,
hardware, infrastructure components and
services are necessary to keep the critical
business processes running.
On the other hand, if a critical incident or disaster
occurs, recovery of the process functionality
starts bottom-up; beginning with the recovery of
technical infrastructure and system components.
Similarly, protection against failures (risk
mitigation) starts with eliminating critical points of
failure on infrastructure and hardware level up to
system components.
Figure 5: BCM dependencies
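The top-down analysis and the bottom-up recovery described above can be sketched with a small
dependency map. This is only an illustration: the components "Server", "LAN" and "Power" and
their relationships are hypothetical and not taken from the example landscape of this document.
The function names are likewise illustrative.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each item lists what it directly needs.
DEPENDS_ON = {
    "Sales order process": ["CRM", "ERP"],
    "CRM": ["Server", "LAN"],
    "ERP": ["Server", "LAN"],
    "Server": ["Power"],
    "LAN": ["Power"],
    "Power": [],
}


def required_components(process, depends_on):
    """Top-down analysis: everything the process needs, directly or indirectly."""
    needed, stack = set(), [process]
    while stack:
        for dep in depends_on.get(stack.pop(), []):
            if dep not in needed:
                needed.add(dep)
                stack.append(dep)
    return needed


def recovery_order(process, depends_on):
    """Bottom-up recovery: each item is listed after everything it depends on."""
    scope = required_components(process, depends_on) | {process}
    graph = {item: depends_on.get(item, []) for item in scope}
    return list(TopologicalSorter(graph).static_order())
```

Calling recovery_order("Sales order process", DEPENDS_ON) lists the power supply first and the
business process last, which mirrors the recovery sequence from infrastructure up to process level.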
Requirement and Impact Analysis Phase (Sections 3.1 to 3.3)
During the impact analysis phase, threats to the continuity of business operations are collected and
evaluated.
In a first step, the continuity project needs to identify the important core business processes that are
vital for the company’s business and that shall be included in the BC plan.
When the core business processes are named, these processes and the system landscape supporting
them need to be documented, including the interfaces and data objects being used by these
processes. This documentation is not only required as a further input for the BC project, but will also
provide helpful information during a potential later execution of a BC plan. Therefore, this
documentation will constitute an important part of the BC plan.
The next step will identify risks and their impact on business. An estimation of the costs caused by an
outage will allow a prioritization of business processes. The Component Failure Impact Analysis will
then have a closer look at the components that are needed to operate the business processes and will
determine their criticality.
Based on this information, risk assessment can be conducted and the required availability demands
(service levels) can be determined for business processes and for the components of the system
landscape. Workarounds that may be available to substitute a failing business process play an
important role in assessing acceptable outage times. The determined service levels will provide the
input for the next phase, the design of a continuity strategy.
When this phase is finished, detailed documentation and recommendations for the business continuity
approach will be available. The following aspects will be defined and documented for each critical
business process:
- The staffing, skills, and services necessary to enable critical business processes to continue
  operating at an acceptable service level for a limited time
- The time within which the critical business processes should be recovered to fully operational
  level
- Different contingency cases, that is, which systems are most likely to fail and which business
  objects are highly critical, for example, because their failure would affect a large number of
  business processes
Design of the Business Continuity Strategy Phase (Sections 3.4 to 3.6)
Business continuity can be achieved by risk mitigation that tries to avoid a failure or by providing
recovery options that allow a timely recovery from a failure.
Based on the criticality of business processes or system components, this phase will identify adequate
measures to reduce the risk of business interruption due to the occurrence of possible failures as
identified above. Risk mitigation will always have to find a balance between the costs for the measures
to be taken and the potential damage they can prevent. The measures being taken and the
procedures to activate them will be part of the BC plan.
For risks or failure scenarios that will not (due to the costs being too high) or cannot (since no
protection is available) be covered by risk mitigation measures, possible recovery options and
recovery strategies should be identified and then worked out later in the BC plan.
Recovery options are divided into:
- Measures to recover from technical failures
- Procedures and tools to recover from logical failures or data inconsistencies
- Workarounds that can reduce the criticality of a business process disruption by providing
  substitute operations, usually with reduced volume and functionality
As a final step of stage 2, recommendations for risk reduction measures and recovery options need to
be agreed with all involved parties, especially from a management and cost perspective - before the
implementation stage and the creation of the BC plan can be started.
The next sections will describe these different phases in more detail and provide examples for their
output using some hypothetical business processes and a corresponding system landscape.
3.1 Documentation of System Landscape and Business Processes
It is vital to have a good understanding and documentation of the core business processes and the
system landscape. If, for example, a core business process is not operable due to corrupted business
data, resolution of this problem will be considerably faster if the process, the data objects it relies on
and the interfaces it uses are clearly documented. Conversely, if for example one system became
unavailable, this documentation would readily show which business processes would be affected.
Although an overview of the system landscape is often considered the first step of the analysis, we
start with a description of the business processes. This provides the input for documenting the
important system components of the landscape: no component used by the identified core processes
is forgotten, and no component that these processes do not use is included.
3.1.1 Determine Core Business Processes
Together with the application management, identify critical business processes that need to be
covered in a BC plan. All processes should be prioritized. You may want to start with only a single
core process as “proof-of-concept” and extend the concept later on or you may start with a list of
critical processes right from the beginning. This list should be kept as small as possible to restrict the
amount of work and the extent of the BC plan.
The process list is approved by the steering committee and defines the scope of the BC plan.
Example:
The following gives an example list of business processes that we will be using during the course of
this document:
Opportunity/Sales order process
Customer service process
Marketing process
Reporting process
3.1.2 Documentation of Business Processes
A good format to document a business process is the swimlane representation (see example below).
This format makes it easy to identify which business objects are touched by the process and which
business objects are exchanged across system boundaries. The description serves as a basis to
analyze the effects of an incident or disaster case as well as of the scenario ‘incomplete recovery’.
While for logical errors occurring inside a single system, all process steps accessing data objects may
be of interest, for data inconsistencies that are, for example, caused by an incomplete recovery of one
system, only process steps that cause data exchange between systems are relevant (since only data
exchanged between systems may become inconsistent). The latter information will mainly flow into the
documentation of the data flow described in sections 3.1.4 and 3.1.5.
As a result, this step should identify the stages in the process flow that are critical for data consistency
within and between the systems, and should pinpoint potential problem areas in the case of data loss
or inconsistencies. In this regard, the most important information that has to be collected with the help
of the business process descriptions, are process steps during which data of business objects are
saved as well as communication steps during which a data exchange is triggered between the
systems in the system landscape.
Example: Opportunity and sales order processing
Sales orders are maintained in SAP CRM and are uploaded to SAP ERP.
In the sales order process, a call center agent usually starts with an opportunity and calls the
respective customer. The agent discusses a potential order with the customer in detail, based on the
customer fact sheet. To discuss the order, the agent also accesses the product catalog and the
product configuration. If the customer decides to order, the opportunity is copied into a sales order.
The sales order is replicated to SAP ERP, where it is automatically processed.
The business process should be documented in an easily recognizable form for later references as in
Figure 6. SAP recommends using the SAP Solution Manager to document your processes at a central
point.
Figure 6: Example Sales Order Process
3.1.3 Description of System Landscape and Interfaces
Looking at the documentation of the core business processes created in the previous step, all system
components being used by these processes can be identified. With this information, an overview of
the system landscape and the connections (interfaces) between the systems can be created.
Example:
Figure 7 shows an example production landscape of the SAP Business Suite implementation of SAP
CRM using ERP, CRM and BI. In addition to the SAP applications, a typical infrastructure often
includes several other SAP and non-SAP systems. The system landscape consists of several systems
that exchange data with each other via different interfaces. Therefore, it is important to also document
the type of interfaces. In an environment such as this, it is important that the business data that moves
between the systems is consistent and up to date.
SAP recommends using the SAP Solution Manager to document the system landscape and interfaces
for the supported business processes of the scenario in question.
Figure 7: Example System Landscape and Interfaces
3.1.4 Description of Data Flow between system components
To enable the development of disaster recovery procedures that follow an error and a potential partial
loss of business data, it is necessary to identify possible sources of data inconsistency. A prerequisite
for this is a description of the business object data flow for each core business process. This should
also include the flow of master data that supports the process between the system components. The
leading system for each business object should be identified. Objects are sometimes maintained in
more than one system; in these cases, a leading system cannot be defined. It is worth noting that
maintenance in two systems may cause inconsistencies if no mechanism such as cross-system
locking is used, and thus could later disrupt business processes.
Having outlined the business object data flow for each core business process and the objects that
might be affected by inconsistencies, a BC plan can work out and describe actions to address the
resolution of possible inconsistencies.
Example: Business object flow between CRM and ERP for Opportunity/Sales order process
The business objects involved in the opportunity and sales order process are the business
partners, products, product catalog, price conditions, business partner hierarchies and the sales
orders themselves. The following graph shows the data flow of these objects between the SAP CRM
and the SAP ERP system.
Figure 8: Sales Order Process Data Flow
Master data flow:
- Business partners are created and maintained in CRM and ERP and replicated between both
  systems
- Business partner hierarchies, products and pricing conditions are created and maintained in
  the ERP system and are replicated to CRM
- The product catalog is maintained only in the CRM system
Transactional data flow:
- Opportunities are maintained only in CRM
- Sales orders are created and maintained in CRM and are replicated to ERP
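The maintenance information of this example can be recorded in a simple lookup to find the leading
system of each business object and to flag objects that have none. The following is a minimal
sketch; the dictionary layout and the function names are illustrative assumptions, while the
entries follow the data flow listed above.

```python
# Systems in which each business object is maintained (from the example above).
MAINTAINED_IN = {
    "Business partner": {"CRM", "ERP"},
    "Business partner hierarchy": {"ERP"},
    "Product": {"ERP"},
    "Pricing conditions": {"ERP"},
    "Product catalog": {"CRM"},
    "Opportunity": {"CRM"},
    "Sales order": {"CRM"},
}


def leading_system(business_object):
    """Return the leading system, or None if the object is maintained in several systems."""
    systems = MAINTAINED_IN[business_object]
    return next(iter(systems)) if len(systems) == 1 else None


def objects_without_leading_system():
    """Objects maintained in more than one system, which need special attention."""
    return sorted(obj for obj, systems in MAINTAINED_IN.items() if len(systems) > 1)
```

In this example only the business partner has no leading system, so it is the object for which
a mechanism such as cross-system locking matters most.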
3.1.5 Aggregated Data Flow Between System Components
When the business object data flow for all business processes is aggregated for the entire system
landscape, you obtain an overview of the overall data flow. This overview makes it easy to
recognize, for example, the impact of an incomplete recovery that resets one system component to a
previous state, such as one hour in the past.
Example:
The following figure shows the aggregated business object data flow chart for the entire system
landscape resulting from section 3.1.4. The data flow to and from SAP BW is incomplete since this
was not part of the example above.
If now for instance a severe error of the database forced the IT department to perform a point-in-time
recovery of the ERP system, the aggregated business object data flow chart would show for which
business objects this would cause data inconsistencies between the ERP system and the rest of the
system components in the landscape (because the other systems had not been set back in time).
Looking at the relationship to CRM, data consistency for business partners, business partner
hierarchies, pricing conditions, products and sales orders would need to be checked and
reestablished.
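The check described in this example can be sketched as a lookup over the aggregated data flow. The
flow list below only contains the CRM and ERP flows of the sales order example; the tuple layout
and the function name are illustrative assumptions.

```python
# (business object, source system, target system), from the sales order example.
FLOWS = [
    ("Business partner", "CRM", "ERP"),
    ("Business partner", "ERP", "CRM"),
    ("Business partner hierarchy", "ERP", "CRM"),
    ("Product", "ERP", "CRM"),
    ("Pricing conditions", "ERP", "CRM"),
    ("Sales order", "CRM", "ERP"),
]


def objects_to_check(recovered_system, flows):
    """Map each partner system to the objects it exchanges with the recovered system."""
    result = {}
    for business_object, source, target in flows:
        if recovered_system in (source, target):
            partner = target if source == recovered_system else source
            result.setdefault(partner, set()).add(business_object)
    return result
```

For a point-in-time recovery of ERP, objects_to_check("ERP", FLOWS) returns the five objects named
above for the relationship to CRM. Objects that are never exchanged, such as opportunities, do not
appear.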
Figure 9: Aggregated Data Flow
3.2 Business Impact Analysis
Your business processes work as long as all business objects are valid and consistent and all
supporting system components (software and hardware) are available. In the business impact
analysis, you identify the risks that endanger business continuity and the impact they would have on your
core business processes.
In a first step, we look at threats that endanger business operations as a whole, for example, physical
disasters. The next step will analyze the impact that a complete service disruption would have on your
organization, quantifying and qualifying losses this may cause. This analysis will also lead to a
prioritization of the business processes.
However, in most cases, failures do not affect the complete business operation. Failures of single
system components may only leave some business processes or even only parts of business
processes inoperable. Workarounds may be available to replace such failing parts, so a business
process can be continued, even with limited functionality and often with limited transaction volume.
The same applies to logical failures like data inconsistencies which may only partly affect business
processes. In a subsequent step, the Component Failure Impact Analysis (CFIA) will analyze the
business processes and their reliance on technical components and data objects in detail.
3.2.1 General Threats and Vulnerabilities
To prevent a comprehensive disruption of business operations through a failure of IT services, you
need to identify the general threats that your company or the IT campus is exposed to.
To name a few examples, these threats can:
- Be of a regional or local nature (rivers with a higher risk of flooding, earthquakes, a nearby
  airport increasing the risk of plane crashes)
- Involve a higher risk of malicious attacks, for example due to a special company profile
- Lie in accidents resulting from the company’s business or nearby production facilities (for
  example, explosions or fires)
- Be a failure of service providers (for example, if IT is outsourced)
For a reasonable assessment of the risks, the likelihood of occurrence should be rated for each
applicable threat. Risk mitigation in a later stage of the BC project will have to identify appropriate
countermeasures and options to reduce these risks or the impact they would have.
An additional method to identify risks is the Service Outage Analysis, which analyzes incidents which
led to service disruptions in the past and how these were handled (successfully or less successfully).
This provides an insight into threats that may still be imminent.
3.2.2 Costs of Outage and Process Prioritization
The result of an interruption of the business processes has to be measured in terms of quantifiable
and qualifiable losses. The business impact can be expressed as lost income, additional cost and
damaged reputation. The impact should also be distinguished depending on the duration of the
unavailability of a process because costs and impact will escalate over time. While some processes
like a reporting process may have nearly no impact if they are inoperative for several days, other
processes like order entry and order processing may have an immediate impact if they are inoperable.
So business impact analysis has to identify:
- The potential loss that may be caused to the organization because of an interruption of critical
  business processes
- The form that the loss may take: lower income, higher costs, damaged reputation, immediate
  and long-term loss of market share, loss of goodwill, loss of competitive advantage, and so on
- The degree to which the loss is likely to escalate if not addressed in a timely manner
Since it is often difficult to assess losses in the amount of money lost per day or week, it may be easier
to assess the impact on a scale from 1 (very low impact) to 10 (crucial impact that might jeopardize
your company as a whole).
Make a list of your critical core business processes and evaluate the impact if a process is not
operative for a specific period of time.
During this phase, you should also collect and document any existing service level agreements, like
Recovery Time Objectives (RTO), for these processes. The RTO should be reviewed with the
business owners at a later stage, once all surrounding facts are understood (if for example a good
workaround is available for normal process operation, the RTO need not be very small, as we will see
in section 3.3.1).
Using the information obtained in this step, the business processes can be prioritized according to
their criticality and costs of outage. The criticality of a process will be an important criterion for the
preventive measures to protect against contingencies that will be planned at a later phase.
Example:
The following table provides a criticality rating for our example processes:
Table 4: Criticality of processes
Process | Impact after 4 hours | Impact after 1 day | Impact after 2 days | Impact after 1 week | Recovery Time Objective | Priority
Marketing process | 2 | 5 | 6 | 8 | | Critical
Sales order process | 8 | 10 | 10 | 10 | 2 hours | Highly critical
Reporting process | 1 | 1 | 3 | 6 | | Less critical
Customer service process | 4 | 8 | 10 | 10 | | Highly critical
After evaluating your core business processes as outlined in the above example, you can distinguish
between processes that create an instant negative impact, like the sales order process above, and
processes that only yield a negative impact after a considerable amount of time, like the reporting
process.
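The distinction between instant and delayed impact can be derived from the impact ratings with a
few lines of code. The sketch below uses the ratings of the example table; the threshold of 8 for a
severe impact and the ranking rule are assumptions made for illustration, not part of this Best
Practice.

```python
# Impact ratings (1 to 10) after 4 hours, 1 day, 2 days and 1 week, from the example table.
PERIODS = ("4 hours", "1 day", "2 days", "1 week")
IMPACT = {
    "Marketing process": (2, 5, 6, 8),
    "Sales order process": (8, 10, 10, 10),
    "Reporting process": (1, 1, 3, 6),
    "Customer service process": (4, 8, 10, 10),
}


def first_period_at_or_above(scores, threshold):
    """Earliest outage duration at which the impact reaches the threshold, else None."""
    for period, score in zip(PERIODS, scores):
        if score >= threshold:
            return period
    return None


def rank_by_impact(impact):
    """Order processes by impact; tuples compare left to right, so early impact weighs most."""
    return sorted(impact, key=lambda process: impact[process], reverse=True)
```

With the assumed threshold of 8, the sales order process is flagged after 4 hours and the customer
service process after 1 day, while the reporting process never reaches the threshold within one
week.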
3.2.3 Component Failure Impact Analysis (CFIA)
In addition to protection against the wide-spread failure scenarios considered above, continuity of
business processes depends on the availability of all involved hardware and software components of
the underlying IT system landscape and IT business applications. Therefore, the next analysis step
has to pinpoint the most critical hardware and software components supporting the identified business
processes. These components will be a focus of the business continuity concept to be established.
In order to get a general picture of the criticality of all components of the system landscape, the
criticality needs to be collected process by process. For each process, describe the criticality of a
failure of each component involved in the process, rating the impact and describing possible
workarounds that are already in place or that are conceivable. Not all components may be equally
critical for a process, because a process may still be operable without a component. For example, the
ATP check during order entry might be left out if the SCM system is unavailable or, as a workaround,
the ATP check could be done in the ERP system.
If a workaround is available, the description should also tell what percentage of original processing
volume can be covered using this workaround, after what time this workaround is usually activated
and for what period of time this workaround would be sufficient.
The information for this phase must be provided by the different business areas owning the business
process. For collecting this information, each involved business area can be provided with a template
they have to fill in, see example below. The template lists all components that were collected during
the previous analysis steps (section 3.1). Besides system components, this can include infrastructure
components and external systems that a process relies on. To cover the aspect of data consistency,
we also include the data objects in a separate section of this table.
Example:
The following table provides an example of useful information that has to be provided by the business
process owners for all processes in scope of the BC project.
The task of each component should be described briefly and the impact of a component failure should
be considered. The impact would be less critical if a workaround is available, which should also be
noted in the table. The criticality depends on the overall importance of the business process, the
impact a failure has and the possibility of replacing normal procedures by a workaround.
The likelihood of failure and possible countermeasures can only be filled in by the business process
owners if they have specific experience with this process; in general these will be completed on
component level in a later phase by the BC project team.
Table 5: CFIA for the sales order process
Process: Sales order process | Priority: Highly critical
Criticality: red = highly critical, yellow = critical, green = non-critical
Likelihood of Failure: Rare, Occasional, Frequent
Components | Task of Component | Impact of Failure, Workaround | Criticality | Likelihood of Failure | Countermeasures
Application Systems
CRM | | Highly critical, RTO is 2h. Workaround: Orders can be entered directly in ERP; order entry volume will be 20% of normal order volume. Opportunity processing is not possible, will be critical after 4 hours, highly critical after 1 day | Highly critical | Rare |
ERP | | Order entry in CRM can continue. Order processing is not possible, will be critical after 4h | | |
Infrastructure
Telephone | | | Highly critical | Rare |
Power | | | Highly critical | |
LAN | | | Highly critical | |
WAN | | | Highly critical | |
Business objects
Business partner | Provide information for customer contact, provide information for delivery | Object must exist, but some errors in customer data can be corrected | Critical | |
Products | | Order entry only possible with correct product information | Highly critical | |
Pricing conditions | | Order could be created without final prices; corrected prices could be provided later on | Critical | |
Sales order | | Consistency of old sales orders is irrelevant for creating a new one | Non-critical | |
(The table also contains the entries "Frequent" for Likelihood of Failure and "HA solution implemented" for Countermeasures; the row they belong to is not recoverable.)
3.2.4 CFIA Matrix
The CFIA matrix provides a summary view of the criticality of all processes and components. The
information collected in the previous step is consolidated into this matrix. The example below details a
possible structure of a CFIA matrix.
The CFIA matrix allows you to easily identify which processes will be affected and to what extent they
will be affected by the failure of a component or by an issue with data consistency of a data object.
Based on the resulting criticality of a component, appropriate recovery options and protection
measures can be determined for this component.
Example:
Our example lists all processes as columns and all components as rows of the CFIA matrix. We
ordered the business processes with decreasing priority from left to right, according to the criticality
that was determined in section 3.2.2. Again, we include the main data objects in the matrix because
they are important for data consistency considerations and logical recovery on object level.
Each field of the matrix depicts the criticality of a component for a specific business process. The
criticality is noted according to the following schema:
- Green: Non-critical component (because it is used only by non-critical processes or because
  an efficient workaround is in place)
- Yellow: Critical component (strong business impact)
- Red: Highly critical component (process is interrupted)
- A field is left blank if a failure of the component or an inconsistency of the object does not
  impact the process in any way.
You could note the availability of a workaround, for example, by adding a ‘W’ to the field of the matrix.
This would mean that the original criticality was reduced due to the availability of this workaround.
The criticality of the components being used by a business process cannot be higher than the
criticality of the process itself. Therefore, the maximum criticality assigned in a column can be that of
the business process.
The overall criticality of a component is determined by the maximum criticality given for this
component over all columns.
Looking at the rows of the matrix, you can identify the criticality of components and objects that are
frequently used in business processes. The matrix also shows which processes are most vulnerable to
component failures or object inconsistencies by considering the number of entries in the respective
column.
Table 6: CFIA matrix
Components | Overall criticality of component | # of processes | Sales order process | Service process | Marketing process | Reporting process
Criticality of process | | | Highly critical | Highly critical | Critical | Less critical
# of components | | | 6 | 5 | 4 | 4
# of data objects | | | 4 | 2 | 2 | 4
Application Systems
CRM | R | 4 | R | R | Y | G
ERP | R | 2 | R | | | G
Infrastructure
Telephone | R | 3 | R | Y | Y |
Power | R | 4 | R | R | Y | G
LAN | R | 4 | R | R | Y | G
WAN | R | 2 | R | Y | |
Business objects
Business partner | R | 4 | R | R | Y | G
Products | R | 4 | R | R | Y | G
Pricing conditions | Y | 2 | Y | | | G
Sales order | G | 2 | G | | | G
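The rules for filling and reading the CFIA matrix can be expressed compactly in code. The sketch
below uses the Power and Business partner rows of Table 6 plus one hypothetical row ("Print
server") that is deliberately rated too high. The mapping of process criticality to the letters R,
Y and G, and all function names, are assumptions made for illustration.

```python
# Ordered criticality levels; a missing entry means the field is blank.
LEVEL = {"": 0, "G": 1, "Y": 2, "R": 3}

# Assumed mapping: highly critical = R, critical = Y, less critical = G.
PROCESS_CRITICALITY = {"Sales order": "R", "Service": "R", "Marketing": "Y", "Reporting": "G"}

MATRIX = {
    "Power": {"Sales order": "R", "Service": "R", "Marketing": "Y", "Reporting": "G"},
    "Business partner": {"Sales order": "R", "Service": "R", "Marketing": "Y", "Reporting": "G"},
    # Hypothetical row, rated too high on purpose to show the capping rule.
    "Print server": {"Reporting": "R"},
}


def capped(matrix, process_criticality):
    """A field cannot be more critical than the process in whose column it stands."""
    return {
        component: {
            process: min(level, process_criticality[process], key=LEVEL.get)
            for process, level in row.items()
        }
        for component, row in matrix.items()
    }


def overall_criticality(row):
    """Overall criticality of a component: the maximum over all columns."""
    return max(row.values(), key=LEVEL.get, default="")


def entries_per_process(matrix):
    """Number of non-blank fields per column, a rough measure of vulnerability."""
    counts = {}
    for row in matrix.values():
        for process, level in row.items():
            if level:
                counts[process] = counts.get(process, 0) + 1
    return counts
```

After capping, the hypothetical print server is rated G for the reporting process, because a
component cannot be more critical than the less critical process that uses it.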
3.3 Risk Assessment
In this phase, required service levels are determined for business processes and underlying
components. The key figure for business continuity is the required recovery time, given by the
Recovery Time Objective (RTO).
We first determine the requirements on business process level. The input comes from the previous
phases of the analysis. However, if a business process can be replaced by some alternative process
for some period of time, the required RTO may need to be adapted accordingly. From process level,
we then come down to the requirements on component level.
3.3.1 Support requirements for Business Processes
This step identifies how long a critical process can be unavailable, from a business point of view,
without severe negative impact. Since business might be able to operate a business process in an
alternative way, such as a paper-based approach, this needs to be included in the considerations.
These alternatives can increase the time available for recovery of the systems supporting the
disrupted business process in contrast to the time identified in section 3.2.2. Business has to specify
the resulting time requirement, after which the business process has to be fully recovered.
The time for which a workaround is sufficient, until the original process must be recovered completely,
is called the RTO for full recovery. Since a workaround may not be available immediately after a failure,
or should not be activated immediately due to side effects, the RTO for minimum recovery defines how long
a process may be completely unavailable before the workaround must become operational.
In this step, roughly outline such alternative business processes. The details will be elaborated on in
the implementation stage (see chapter 4). For each alternative, describe the staffing, skills, services
and procedures that are needed to operate the workaround.
If no workarounds are available, business only has to provide the RTO after which operation of the
business process has to be fully recovered.
Example:
As an alternative process for the opportunity/sales order process, the ERP system might be used to
enter sales orders in the business process, without using the CRM system if it is unavailable due to a
system failure. The details of the implementation of the alternative process have to be elaborated in
stage 3 (see section 4.5). In this example, the availability of an alternative process for the sales
order process relaxes the support requirement from 2 hours (as previously given in Table 4) to a full
day. Since this workaround only covers a failure of the CRM system, we need to distinguish
between components. The RTO of a specific process may thus differ depending on the component
that is unavailable.
Table 7: Support requirements for business processes
Process                    Criticality      Workaround exists  Requirements for workaround      RTO for    RTO for
                           (from 3.2.2)     for failure of                                      minimum    full
                                                                                                recovery   recovery
Sales Order Process        Highly critical  CRM                Staff needs access to ERP;       2 hours    1 day
                                                               Training in ERP UI necessary
Sales Order Process        Highly critical  n/a                n/a                                         4 hours
Customer Service Process   Highly critical  n/a                n/a                                         6 hours
Marketing Process          Critical         n/a                n/a                                         2 days
Reporting Process          Less critical    n/a                n/a                                         1 week
3.3.2 Support Requirements for Components
Once the required service levels, and the maximum acceptable duration of reduced service
levels, have been determined for the critical business processes in case of emergency, the required
service levels for the underlying components can be derived. This will be the input for the next steps,
the definition of recovery options and risk reduction measures.
The criticality of components is determined by the CFIA matrix. The information that is still missing
before going into the design phase for the BC strategy is the likelihood of a failure of a component and
the times that can be allowed for a recovery of the component (the RTO for each component).
The goal must be to provide information for the upcoming decision whether recovery mechanisms are
sufficient for a component or whether preventive (risk reduction) measures are required due to the
criticality of the supported business processes.
The RTO for a component is given by the minimum ‘RTO for full recovery’ of all business processes
using this component. Please note that based on the availability of workarounds for some processes,
the ‘RTO for full recovery’ might have been relaxed in the previous step (section 3.3.1).
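The derivation rule above can be sketched as follows. This is a minimal illustration, not a definitive implementation: the RTO values are taken from Table 7 and converted to hours, CRM is assumed to support all four example processes (as in the CFIA matrix), and the names `DEFAULT_RTO_H`, `RELAXED_RTO_H` and `component_rto` are hypothetical.

```python
# 'RTO for full recovery' per process in hours (values from Table 7).
DEFAULT_RTO_H = {"Sales order": 4, "Service": 6, "Marketing": 48, "Reporting": 168}

# Relaxed RTO where a workaround covers the failure of one specific component.
RELAXED_RTO_H = {("Sales order", "CRM"): 24}

def component_rto(component, processes):
    """Minimum 'RTO for full recovery' over all processes using the component."""
    return min(RELAXED_RTO_H.get((p, component), DEFAULT_RTO_H[p]) for p in processes)

# Assumption: CRM supports all four example processes.
crm_rto = component_rto("CRM", ["Sales order", "Service", "Marketing", "Reporting"])
print(crm_rto)  # 6: the service process determines the RTO of CRM
```

Keeping the relaxation keyed by (process, component) reflects that a workaround only relaxes the requirement for the component it replaces.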
In addition to the RTO of the underlying business processes, the importance of a component is also
determined by the number of processes relying on it. A component that is used by a high number of
business processes usually has a higher criticality than a component that is used by a single process.
This aspect can require an adaptation of the values determined solely on the RTO of the involved
business processes.
Besides RTO, the Recovery Point Objective (RPO) constitutes an important criterion for recovery of a
component. RPO defines the acceptable data loss during recovery of a component, for example, when
restoring from a backup or when switching to a standby site. In order to ensure data consistency in a
federated system landscape, an RPO of 0 is required. If the RPO is more than 0, technical recovery of
a component will leave the need to analyze and resolve remaining data inconsistencies before
business operations can continue. This means that the recovery time can be considerably longer (also
see 3.6.2).
The RPO must be determined by the business process owners, taking into account the impact of
possible data loss.
Since we again include business objects in the list of components (see table below), we need to
define what we understand as “RTO and RPO of a business object”. If, for example, business objects of
type Products became corrupted due to some software bug, the RTO for Products would define how
long it may take until objects of type Products are available again for use by the core business
processes. In addition, the RPO of a business object would define the amount of tolerable data loss in
case of a contingency. The minimum RPO of all business objects maintained by a component should
equal the overall RPO of that component.
Note: When determining RTO and RPO in this section, it is important to note that the costs to achieve
them are not yet considered. So, when discussing the (technical) solutions to achieve these
goals in sections 3.4 and 3.5, the results obtained in this phase may need to be revised later
on. If for example the costs of the solutions become prohibitive, the business may decide to
review the service level agreements based on a cost/benefit analysis.
Example:
The following table lists the criticality and RTO/RPO for our example components. These result from
the RTO requirements of the business processes. The RPO of an application component can be
determined by assessing the RPO required for all business objects maintained by this component (this
information is contained for example in the documentation created in section 3.1.4)
For the sales order process, the required RTO for CRM relaxes to 1 day due to the workaround
described in 3.3.1. Since the workaround relies on the availability of the ERP system, the RTO for
ERP remains at 2 hours.
Since Products are required for the most important Sales Order Process, their RTO is determined by
the RTO of this business process (2 hours). This underlines the importance of the
business object Product and shows that ways of re-establishing the consistency of Products should be
considered in advance, in case they are corrupted, for example, by an incomplete recovery.
As can be seen from the data flow charts in section 3.1.4, Products are held in ERP and in CRM. So,
in case of corruptions or inconsistencies in one system, it might be possible to recover Products from
the other still operative system (although due to pending (queued) objects that were not yet
transferred between the systems, some objects might not be recovered completely this way).
Therefore, in a business continuity strategy it is important to ensure that the replication of Products
and other objects between systems is working properly to have the current data available in one of the
systems for recovery in case of a contingency in the other system.
The columns ‘Current RTO’ and ‘Current RPO’ can be used to contrast the required RTO/RPO with
the RTO/RPO currently committed by IT. If a committed value is higher than the required value, this
indicates that new solutions will be needed to achieve the requirements or that the requirements
identified so far need to be revised.
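Such a comparison of committed and required values could be automated along the following lines. This is a sketch under assumptions: the required values follow Table 8 (with "~0" approximated as zero), while the committed values and the function name `find_gaps` are hypothetical.

```python
from datetime import timedelta

def hours(n):
    return timedelta(hours=n)

def find_gaps(required, current):
    """Return, per component, which committed values exceed the requirement."""
    gaps = {}
    for name, (req_rto, req_rpo) in required.items():
        cur_rto, cur_rpo = current[name]
        exceeded = []
        if cur_rto > req_rto:
            exceeded.append("RTO")
        if cur_rpo > req_rpo:
            exceeded.append("RPO")
        if exceeded:
            gaps[name] = exceeded
    return gaps

# (RTO, RPO) pairs; the 'current' figures are hypothetical IT commitments.
required = {"ERP": (hours(2), hours(0)), "CRM": (hours(6), hours(0))}
current = {"ERP": (hours(4), hours(0)), "CRM": (hours(6), hours(1))}
print(find_gaps(required, current))  # {'ERP': ['RTO'], 'CRM': ['RPO']}
```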
Table 8: Support requirements for components
Component             Criticality   Likelihood  Required   Required  Current  Current
                      (from 3.2.4)  of failure  RTO        RPO       RTO      RPO
Application Systems
  ERP                 Very high     Low         2 hours    ~0
  CRM                 Very high     Low         6 hours *  ~0
Infrastructure
  Telephone           Very high     Low         2 hours    n/a
  Power               Very high     Medium      2 hours    n/a
  LAN                 Very high     Low                    n/a
  WAN                 Very high     High                   n/a
Business objects
  Business partner    Very high                 6 hours    1 hour
  Products            Very high                 2 hours    1 hour
  Pricing conditions  High                      1 day      1 hour
  Sales order         Low **                    1 day      ~0
* determined by service process since criticality for sales order process was relaxed to 1 day in table 7
** only resulting from this example, which does not regard the usually very critical order to cash process
3.4 The Business Continuity Strategy
Now that the risks and continuity requirements are known, appropriate measures can be determined.
Two basic approaches need to be distinguished:
Risk Mitigation
Recovery
Risk Mitigation concentrates on prevention, trying to avoid a business disruption by eliminating or
reducing the risks identified.
Recovery, on the other hand, deals with reducing the impact a service disruption will have by providing
solutions or strategies that allow resumption of operations as quickly as possible. The goal of recovery
options is to reestablish first minimal (if applicable) and then full business operations within the limits
given by the requirements. Recovery options also come into play if a risk mitigation measure failed.
As stated by ITIL, “an organization that identifies high impacts in the short term will want to
concentrate efforts on preventative risk reduction methods, for example, through full resilience and
fault tolerance, while an organization that has low short-term impacts would be better suited to
comprehensive recovery options”.
The hardest part of this task is to find the right balance between risk reduction and the different
recovery options that are available. This is mostly determined by the costs involved with the different
alternatives.
Risk Mitigation is usually achieved through availability management – redundancy and high availability
measures. The balance has to be found between the:
Costs of the risk mitigation measures (hardware, cluster solutions, and so on) , including
maintenance and testing of the solution
and the
Costs of unavailability (including various aspects as described in section 3.2.2) plus costs of
recovery measures to re-establish operations
If the latter costs are lower, risk mitigation may not be appropriate and recovery options might be
sufficient. However, as a general rule to avoid business disruption, risk mitigation should be preferred
over the invocation of recovery mechanisms.
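The cost comparison described above can be illustrated with a small calculation. This is a simplified sketch: it assumes yearly figures, an expected number of outages per year, and that the mitigation measure fully avoids the outage. All amounts and function names are hypothetical.

```python
def expected_loss(outages_per_year, outage_hours, cost_per_hour, recovery_cost):
    """Expected yearly cost of unavailability plus recovery without mitigation."""
    return outages_per_year * (outage_hours * cost_per_hour + recovery_cost)

def mitigation_is_justified(mitigation_cost_per_year, loss_per_year):
    """Mitigation pays off when it costs no more than the loss it avoids."""
    return mitigation_cost_per_year <= loss_per_year

# Hypothetical figures: one 8-hour outage every two years.
loss = expected_loss(outages_per_year=0.5, outage_hours=8, cost_per_hour=20_000, recovery_cost=10_000)
print(loss)                                    # 85000.0
print(mitigation_is_justified(60_000, loss))   # True
print(mitigation_is_justified(120_000, loss))  # False
```

A purely financial comparison like this is only one input; as stated above, risk mitigation is generally preferred where business disruption must be avoided.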
Recovery options are available on various levels, which differ primarily in the possible speed of
recovery. Determining the appropriate solution also depends mainly on costs, by comparing the:
Costs of the respective recovery solution (standby arrangement, replication, backup and
restore, tape shipping, etc.)
and the
Costs of unavailability through longer recovery time (including various aspects as described in
section 3.2.2) plus costs of recovery to re-establish operations through simpler recovery
measures
The following diagram, which applies to the costs of risk mitigation measures as well as to the costs of
recovery measures, illustrates the dilemma of identifying adequate technical solutions. Technical
measures to reduce risks or to shorten recovery become more expensive the higher the level of
protection they achieve, so it is necessary to balance the costs of these
measures against the costs of disruption and recovery.
Figure 10: Business Continuity Cost Curves
Ideally, the chosen strategy meets the intersection of the two lines; the intersection marks the
maximum allowable outage time.
Note: If the analysis should show that the ‘ideal’ solution that is chosen for business continuity cannot
reach the previously determined service levels from section 3.3, these need to be revised and
adapted in a new round with the business process owners.
In addition to risk mitigation and recovery options, monitoring constitutes an additional building block
for business continuity. Monitoring has to ensure that any disturbances or anomalies will be detected
as early as possible and countermeasures can be initiated before they escalate into major business
disruptions.
3.5 Risk Mitigation Measures
Risk reduction measures are taken into account for contingencies with highest business impact. Their
goal is to avoid the materialization of a risk and prevent a service disruption.
Measures to reduce the risk of a business disruption include for example:
Elimination of Single Points of Failure (SPOFs) through:
o redundancy of hardware components masking the failure of one component
o high availability / cluster solutions allowing a failover of processes to a second server
in case of a server failure, optimally masking any perception of a failure from the end
user
Resilient networks and systems
Uninterruptible power supply
Use of multiple service providers or establishing an alternate service provider for critical
external services
Fire detection and fire suppression installations
Change control and change management
3.5.1 Elimination of Single Points of Failure
A single point of failure (SPOF) poses a risk to business continuity, since the failure of such a
component immediately interrupts business operations. A SPOF typically is a technical component or
service supporting a business process. The elimination of SPOFs can be achieved through
redundancy or high availability (HA) solutions. HA solutions mitigate the risk of failures since they
aim at masking failures by (almost) seamlessly failing over business operation to alternate hardware.
Example measures to eliminate SPOFs in an SAP environment include:
Redundant hardware components (Server, storage system, and so on)
RAID protection
Redundant network components and network routes
Redundant middleware components (multiple load balancers, multiple web servers, and so
on)
Multiple SAP application servers
Failover cluster solutions for database system and SAP central instance / SAP central
services
Parallel database management systems
More information on HA solutions for SAP is available in the SAP Service Marketplace at
http://service.sap.com/ha.
The adequacy of technical solutions to eliminate SPOFs must be determined based on the costs of
the solutions, the likelihood of the failure and the impact of the failure / required service levels. As
discussed above, the costs for risk mitigation should (in general) not be significantly higher than the costs
induced by the risk.
Example:
In the previous sections, the components supporting the most critical sales order process have been
identified. These components are candidates for protection through HA solutions. Even though a
restricted workaround is available in case of an outage of CRM, investing in an HA solution can be
reasonable because a noticeable impact is already perceived after 4 hours due to the inability to
process opportunities. Therefore, the CRM and ERP database and central instance / central services
will be protected by a cluster solution and multiple application servers will be provided for CRM and
ERP for redundancy. The network connections should be redundant, a stand-by power supply should
be available and telephony needs to be secured. See Table 10 for an overview.
3.5.2 Change Management
Logical errors or data inconsistencies usually result from incorrect software or user errors. There is no
technical solution available to protect from such kinds of errors. The only chance is to prevent and
detect such errors before they can affect the production landscape – through thorough change
management and rigid testing in pre-production phases.
3.6 Determine Recovery Options
Recovery options are taken into account for contingencies with lower business impact or for cases
when risk mitigation measures fail or are not possible. Their goal is to reduce the duration of a service
disruption by reestablishing normal operation as quickly as possible and, if applicable, by providing and
activating a workaround in the meantime.
In planning how to recover certain business processes or system components, it is important to
determine the available recovery options – on technology level (failures of system components) and
on application level (logical errors or data inconsistencies).
3.6.1 Basic recovery categories
There are a number of basic options that can be considered for the recovery approach:
Do nothing
If the contingency does not have a business impact, it might be possible that doing nothing is an
option and no recovery is needed.
Example:
(Technical failure): A test system fails that was scheduled to be reinstalled next week. Operation
without the test system is acceptable as it will be available again in a week and no major tests have to
be done during this week.
(Logical failure): A new program corrupted data. The analysis shows that only historical data was
affected. Since all involved processing has already taken place, it is decided that a recovery of this
corrupted data will not be needed.
Manual correction
If the contingency is of a small scale but has some impact on the business process, a manual
correction may be sufficient: data is corrected manually, or a small report corrects inconsistencies in
business objects. This contrasts with a major disruption of a process, which needs a more
sophisticated recovery procedure.
Example:
(Technical failure): A system becomes unavailable for a short period of time while a data upload via a
file interface is in progress. Since the status of the upload is in an unknown state, manual intervention
and restart of the upload is required. Application management identifies the affected objects and
resends them to the system manually.
(Logical failure): End users created sales orders with incorrect tax classification. The impact is that
certain sales orders cannot be processed. The application management team identifies the sales
orders and, after discussion with the end users, corrects the tax classification manually.
Gradual recovery
Gradual recovery or 'cold standby' is applicable if no immediate recovery of the business process is
needed and the organization can operate for up to 72 hours (according to ITIL), or longer, without a
re-establishment of the full business process on the respective system components. When considering
cold standby, the necessary hardware is either already provided at a disaster recovery site or it must
at least be ensured that the necessary hardware can be obtained in time to rebuild the systems.
Example:
(Technical failure): The storage hardware of the reporting system faces a problem. The system must
be restored. As the reporting process is not critical, no spare hardware is available. Due to an
agreement with the storage provider, replacement hardware will be delivered and installed within 2
days. Restore and database recovery of the system will be finished within another 2 days.
Intermediate Recovery
Intermediate recovery or 'warm standby' is necessary if you want to reestablish your business process
within 24 to 72 hours (according to ITIL). This involves at least having spare hardware available at a
remote site, either company-owned or provided as a recovery service. This may also include the
creation of a daily mirror of the production data at the remote site. To become operational, this mirror
only needs database forward recovery for the logs created since then.
Example:
(Technical failure): The marketing process of our examples above can be unavailable for more than 1
day without a severe impact on business but the service process should not be unavailable for more
than 2 days. To ensure a recovery after 24 hours the systems and database files of the affected
systems are mirrored to a remote site. To make the systems available at the remote site, manual
activities have to be performed for database recovery and system restart.
(Logical failure): A database table with business partners is dropped. The corresponding tablespace is
restored to an alternate hardware; the lost data is exported from this ‘analysis system’ and then
imported back into the production system. The whole procedure requires 24 hours until the object
business partners is accessible again.
Immediate Recovery
Immediate recovery or 'hot standby' provides for ‘immediate’ restoration of services/processes and is
usually provided as an extension to intermediate recovery. Immediate recovery
supports the recovery of critical business functions and support areas during the first 24 hours
(according to ITIL) following a service disruption. Nowadays, however, recovery requirements are often in
the range of a few hours or less. For components that require immediate recovery, a loss
of service has an immediate impact on the organization's ability to make money, as is the case for the
sales order process in our previous examples.
Example:
(Technical failure): The sales order process in the examples of this chapter has a severe impact on
business if the ERP system is unavailable for more than 4 hours. To ensure that system operation can
be recovered in less than four hours, a standby database at a remote site is continuously recovered
with logs from the production site. A switch to the standby database would be performed for example
in case of a severe storage system failure when the implemented high-availability solution is not
applicable. The standby database would also be activated if database block corruptions made the
primary database unusable.
When you have decided which recovery option you want to use for which business process disruption
scenario or component failure scenario, you need to detail the applicable recovery method for
recovering a system component or for recovering business objects.
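As a rough aid, the ITIL time ranges quoted in this section can be turned into a simple classification by required RTO. This is only a sketch: the exact boundary handling (24 and 72 hours) is an assumption, the options 'do nothing' and 'manual correction' are not covered because they depend on the scale of the contingency rather than on the RTO, and the function name is hypothetical.

```python
def recovery_category(rto_hours):
    """Map a required RTO (in hours) to a basic recovery category."""
    if rto_hours < 24:
        return "immediate recovery (hot standby)"
    if rto_hours <= 72:
        return "intermediate recovery (warm standby)"
    return "gradual recovery (cold standby)"

# RTOs for full recovery from Table 7, converted to hours.
for process, rto in {"Sales order": 4, "Marketing": 48, "Reporting": 168}.items():
    print(process, "->", recovery_category(rto))
```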
3.6.2 Impact of Technical Recovery and Logical Recovery on Recovery Time
When considering technical recovery options, it is important to recognize possible dependencies on
logical recovery. If technical recovery of a system component like a database restore or switch to a
remote site involves some amount of data loss, data consistency between the systems of a landscape
is no longer guaranteed. This means that following the technical recovery, recovery on application
level (‘logical recovery’) is required to detect and remove these inconsistencies. This has an important
impact on overall recovery time.
Figure 11 shows different recovery scenarios and the corresponding parts contributing to the required
recovery time.
Scenario 1: Technical Failure and Complete Recovery
Following a technical failure, some kind of technical recovery method is applied without any data loss
in the affected system. This could be, for example, a complete recovery of the database following a
database restore or a switchover to a synchronously mirrored disaster recovery site. When the
technical recovery has finished, the system is fully operational again.
Scenario 2: Technical Failure and Incomplete Recovery
Following a technical failure, some kind of technical recovery method is applied that causes some
amount of data loss in the affected system. This could be, for example, an incomplete recovery of the
database following a database restore because some logfiles are unavailable. Some data loss may
also be caused by a switchover to a disaster recovery site that is not synchronously replicated and
cannot be subject to a complete database forward recovery.
When the technical recovery has finished, business recovery has to follow to identify inconsistencies
between this system and the other systems of the landscape. The overall recovery time until the
system and dependent business processes are fully operational is given as the sum of the technical
recovery and the logical recovery.
Scenario 3: Logical Failure is Corrected in the Affected System
A logical failure corrupts some business objects in system 1. The corrupted objects are identified and
repaired directly in this system. The affected business process is operational after this logical recovery
has finished.
Scenario 4: Logical Failure is Corrected Through Point-in-time Recovery of the Affected
System
A logical failure corrupts some business objects in system 1. To avoid the effort of repairing the corrupt
objects directly in the system, a database point-in-time recovery (incomplete recovery) is performed to
the point before that error occurred. After this technical recovery method is finished, the affected
system itself is in a consistent state. However, due to the data loss caused by this operation, data
consistency in relation to the other systems of the landscape is no longer maintained.
The overall recovery time is increased by the time it takes to repair these inconsistencies on
application level. The affected business process is only fully operational after this logical recovery has
finished. Additionally, many other business processes will be affected since their data also became
inconsistent. Data consistency for many other systems not shown in this figure may be affected as
well.
Figure 11: Recovery Options Following a Technical or Logical Failure
These scenarios show that the RTO of a business process can be determined by different phases of a
recovery until the business process is fully operational. The actual recovery time for a process is given
by the RTO of the involved components plus the recovery time to re-establish data consistency, if
required.
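The relationship between the two recovery phases can be written down directly. The following sketch uses hypothetical durations for the four scenarios of Figure 11; only the additive rule itself comes from the text.

```python
from datetime import timedelta

def overall_recovery_time(technical, logical=timedelta(0)):
    """Technical recovery plus the time needed to re-establish data consistency."""
    return technical + logical

# Hypothetical durations (technical, logical) for the four scenarios.
scenarios = {
    "1: technical failure, complete recovery": (timedelta(hours=4), timedelta(0)),
    "2: technical failure, incomplete recovery": (timedelta(hours=4), timedelta(hours=6)),
    "3: logical failure, repaired in place": (timedelta(0), timedelta(hours=5)),
    "4: logical failure, point-in-time recovery": (timedelta(hours=4), timedelta(hours=10)),
}
for name, (technical, logical) in scenarios.items():
    print(name, overall_recovery_time(technical, logical))
```

When checking a solution against a required RTO, it is the sum that has to fit, not the technical part alone.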
3.6.3 Recovery Options per System Component
Recovery of a system component means bringing back a component into an operable state after any
kind of failure. In case of system component failures, you can replace the involved hardware, reinstall
the respective software/system components and restore the application data to reestablish operation.
For database components, you have to define backup and restore strategies that adhere to the
service levels and continuity requirements.
As we saw before, completing the technical recovery does not necessarily mean that the business
processes are already fully operational since data consistency may need to be checked and repaired
(depending on the data currency (RPO) the chosen solution can ensure). If, for example, you perform
an incomplete recovery of a database, you will have to deal with inconsistencies between dependent
systems. For instance, recovering a CRM system with a backup that is one hour old creates
inconsistencies with the ERP system: the ERP system still contains data received from CRM
during the last hour, which the restored CRM system no longer has.
Disaster recovery (DR) typically focuses on technical solutions that allow a resumption of normal
operation of a system component after an outage of the primary data center. Various DR solutions are
offered by the hardware vendors. A DR solution must allow moving or restoring the data to a DR site.
DR solutions can mainly be distinguished according to the recovery level they can provide regarding
recovery time (RTO) and recovery point (RPO).
The following table provides some typical RTO and RPO ranges for the main categories of recovery
solutions.
Table 9: Technical Recovery Solutions
Recovery Solution              Recovery category         RTO (only technical  RPO
                                                         recovery part)
Database Restore and Recovery  Gradual / Intermediate    12 – 72 hours        0 **
Tape shipping - Pickup Truck   Gradual / Intermediate    48 – 168 hours       24 – 168 hours
Access Method (PTAM)
Standby database               Intermediate / Immediate  1 – 8 hours          10 min – 24 hours
(asynchronous log shipping)
Remote point-in-time copies    Intermediate / Immediate  4 – 24 hours         4 – 24 hours
Asynchronous replication       Immediate                 30 min* – 8 hours    0 – 5 minutes
Synchronous replication        Immediate                 5 min* – 8 hours     0 ***
* in combination with HA solutions
** complete recovery
*** depending on continuation policy
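Ranges like those in Table 9 can be used to shortlist solutions for a given requirement. The sketch below is an illustration only: it uses the upper bounds of three rows of the table as worst-case values, which is a conservative assumption, and the names `SOLUTIONS` and `candidates` are hypothetical.

```python
from datetime import timedelta as td

# (solution, worst-case RTO, worst-case RPO): upper bounds of three rows of Table 9.
SOLUTIONS = [
    ("Database restore and recovery", td(hours=72), td(0)),
    ("Asynchronous replication", td(hours=8), td(minutes=5)),
    ("Synchronous replication", td(hours=8), td(0)),
]

def candidates(required_rto, required_rpo):
    """Solutions whose worst-case RTO and RPO stay within the requirements."""
    return [name for name, rto, rpo in SOLUTIONS
            if rto <= required_rto and rpo <= required_rpo]

print(candidates(td(hours=24), td(minutes=5)))
# ['Asynchronous replication', 'Synchronous replication']
print(candidates(td(hours=72), td(0)))
# ['Database restore and recovery', 'Synchronous replication']
```

The shortlist covers only the technical recovery part; time for logical recovery has to be added where the RPO is greater than 0.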
Example:
The following table gives an example of possible risk mitigation and recovery options for the example
components depicted for our business processes:
Table 10: Example Recovery Options for System Components of Example Processes
Component             Recovery   Risk Mitigation Method           Technical Recovery Method
                      Class
Appl. Systems
  ERP Database        Immediate  Cluster solution for DB          Synchronous replication to remote site,
                                 processes                        standby hardware readily available
                                                                  Daily full backup allowing restore within
                                                                  12 hours (including log recovery for 1 day)
  ERP Central         Immediate  Cluster solution for enqueue     Replicated enqueue server
  instance                       and message server
  ERP Appserver       Immediate  Two application servers for      SLA to replace and reinstall appserver
                                 redundancy                       within 4 hours
  CRM Database        Immediate  Cluster solution for DB          Synchronous replication to remote site,
                                 processes                        standby hardware readily available
                                                                  Daily full backup allowing restore within
                                                                  12 hours (including log recovery for 1 day)
  CRM Central         Immediate  Cluster solution for enqueue
  instance                       and message server
  CRM Appserver       Immediate  Two application servers for      SLA to replace and reinstall appserver
                                 redundancy                       within 4 hours
Infrastructure
  Telephone           Immediate  SLA with service provider        Mobile phones for key users
  Power               Immediate  Stand-by power components
                                 readily available
  LAN                 Immediate  Redundant connections,
                                 redundant routers and switches
  WAN                 Immediate  Alternative arrangement using
                                 satellite connection
3.6.4 Recovery Options for Business Objects
Logical errors or data inconsistencies require recovery on application level by repairing business data
on object level. This kind of recovery can be required as a consequence of a preceding technical
recovery that resulted in some amount of data loss (RPO not 0) or as an error scenario in itself.
Business object unavailability can result from business data being deleted or corrupted in such a way
that it is useless for dependent business processes.
Basically, we can distinguish three types of errors:
Logical errors inside a single system
Inconsistencies between systems
Inconsistencies between system and the real world
These inconsistencies have to be repaired manually or by reports. Identifying inconsistencies
and the methods to correct them is a very important task for the experts who create the detailed
recovery plan in the implementation stage (stage 3).
At this stage, you should determine the business objects for which you need to work out detailed
recovery plans due to their criticality. You should roughly identify possible strategies on how to repair
inconsistencies for these objects.
Possible options for repairing inconsistencies include:
For logical errors inside a single system:
Replicate lost data from another source system
The data flow charts of the processes in section 3.1.4 show from which other systems
business objects might be recovered.
Export data from an analysis system (restore done to a sandbox)
Re-enter lost data in the recovered system
Re-extract data from middleware components
Reconstruct data from database indexes
Run correction reports
For inconsistencies between systems
Use compare tools provided by SAP
Run check and correction reports
Manual compare and correction
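The basic idea behind comparing objects between two systems can be sketched generically. This is not the SAP compare tooling: the function, the dictionary representation of business objects and the sample product keys are hypothetical and serve only to show the three kinds of findings a comparison produces.

```python
def compare_objects(system_a, system_b):
    """Classify keys: only in A, only in B, or in both with different content."""
    only_a = sorted(system_a.keys() - system_b.keys())
    only_b = sorted(system_b.keys() - system_a.keys())
    differing = sorted(k for k in system_a.keys() & system_b.keys()
                       if system_a[k] != system_b[k])
    return {"only_in_a": only_a, "only_in_b": only_b, "differing": differing}

# Hypothetical product data held in two systems.
crm = {"P-100": "Pump", "P-200": "Valve", "P-300": "Gasket"}
erp = {"P-100": "Pump", "P-200": "Valve v2", "P-400": "Seal"}
print(compare_objects(crm, erp))
# {'only_in_a': ['P-300'], 'only_in_b': ['P-400'], 'differing': ['P-200']}
```

Which system holds the correct version of a differing object still has to be decided per business object, for example by the leading system shown in the data flow charts.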
Note: A point-in-time recovery of a database should not be used as a recovery option for logical errors
inside a single system, since this induces new data inconsistencies between the
systems of a landscape and with the real world.
Discuss the recovery options per business object and identify the recovery strategies for possible
partial or complete loss of business objects.
More information on consistency checks and consistency check tools provided by SAP can be found
in the Best Practice “Data Consistency Monitoring within SAP Logistics” that will be available at
http://service.sap.com/solutionmanagerbp.
Section 4.6.1 will provide a detailed example of how recovery options for a business object can be
worked out.
Note: To detect inconsistencies as early as possible, and before they will result in a possible business
disruption, data consistency monitoring should be established (see 3.7).
Example:
The decision is made that only business objects required for the highly critical sales order process
shall be subject to detailed recovery planning. The basic strategy is outlined here while further details
will be worked out in the implementation stage.
Table 11: Typical Recovery Options for Business Objects of the Sales Order Process
Business object    | Application Recovery Strategy
Business partner   | Comparison with DIMa; initial or request load either from CRM or ERP
Product            | Comparison with DIMa; initial or request load from ERP
Pricing conditions | Comparison with DIMa; initial or request load from ERP
Sales order        | Comparison with DIMa; request load either from CRM or ERP
3.6.5 Recovery Options per Process
A business process may become unavailable due to a failure of its underlying system components or
corruption of its required business data. Recovery options for both error types were discussed in the
previous sections.
If the recovery options discussed above are not sufficient to satisfy the criticality of a business
process, establishing an alternative procedure (workaround) to maintain at least rudimentary operation
of the process can be considered. In section 3.3.1, already existing workarounds in a company were
collected as an input for the determination of the support requirements.
At this point, the focus lies on possibilities of establishing new workarounds for highly critical parts of
business processes.
Service disruptions can also be due to insufficient performance of a business process. If this lasts for a
longer period, activation of a workaround can also be a method to circumvent this kind of service
disruption.
Workarounds need to be documented in detail and the end users need to be trained so the activation
of a workaround will be successful. Since a workaround always implies some limitations and usually
requires some more or less expensive post-processing when normal operations are reestablished, the
activation of a workaround should be under control of the business continuity manager / business
continuity team.
Workarounds can be:
Paper-based
Based on remaining systems of a system landscape
Working with reduced functionality
A combination of the above
Example:
For the sales order process, it might be possible to implement the following alternative business
process on a smaller scale. It is possible to replicate opportunities with a customer program as special
quotations to the ERP system (whether custom coding is worth the effort depends on the criticality of
the business process). The call center agents could access the ERP
system, find the opportunities as quotations of a special transaction type and negotiate orders with
customers because products, pricing, business partner hierarchies and the configurator are also
available in the ERP system. It is not possible, however, to access the product catalog and use the
customer fact sheet. This gives less value to the customer, but orders could be negotiated on a
smaller scale. A requirement to establish this workaround would be to enable the replication of
opportunities to ERP and to evaluate and develop reports that enable a stand-by processing of
opportunities and orders without product catalog and fact sheet. Training of the call center
agents for the ERP environment also needs to be planned.
The following table summarizes the requirements for additional workarounds as determined in this
step.
Table 12: Additional Workarounds for Processes
Process                  | Recovery class              | Workaround required for | Description
Sales Order Process      | Immediate                   | Failure of CRM          | Current workaround for sales order taking is not sufficient, alternate procedure required for opportunity to sales order processing (as described in the text above).
Customer Service Process | Immediate                   | n/a                     |
Marketing Process        | Intermediate                | n/a                     |
Reporting Process        | Gradual / Manual correction | n/a                     |
3.7 Define Monitoring Objectives
Monitoring is an important element to detect events that may cause a business disruption before the
effect becomes visible to the end users. Therefore, the definition of helpful monitoring tools and
procedures should also be part of business continuity planning.
Monitoring that supports business continuity should comprise:
Monitoring of component availability
Monitoring of data replication to a disaster recovery site
Database and backup monitoring
Monitoring of error logs
Business process monitoring
Monitoring of interfaces and data exchange
Batch monitoring
Performance Monitoring
Monitoring of data consistency inside systems and between systems
Consistency monitoring is supported by the Data Consistency Cockpit in SAP Solution Manager
and raises alerts when an anomaly is detected.
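As a minimal sketch of one item in this list, monitoring of data replication to a disaster recovery site, the following Python fragment compares the replication lag with the agreed recovery point objective (RPO). The function name, the timestamps and the 15-minute RPO are assumptions chosen for illustration; they do not represent an SAP tool or interface.

```python
from datetime import datetime, timedelta


def replication_alert(last_replicated: datetime, now: datetime, rpo: timedelta) -> bool:
    """Return True if the data at the DR site is older than the RPO allows."""
    lag = now - last_replicated
    return lag > rpo


# Hypothetical values: the DR site last received data at 10:00, it is now 10:20.
now = datetime(2008, 5, 1, 10, 20)
last = datetime(2008, 5, 1, 10, 0)
print(replication_alert(last, now, rpo=timedelta(minutes=15)))
```

In this example the lag of 20 minutes exceeds the assumed RPO of 15 minutes, so an alert would be raised.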
3.8 Identify Resources for Recovery Mechanisms
For the detailed elaboration of the BC plans and implementation of the BC measures, technical as well
as business staff is needed. The required skill set should be described so that an overview of the team
to be involved can be obtained.
For execution of a recovery plan, people with a similar skill set need to be part of the recovery team.
Agreements have to be negotiated that allow emergency allocation plans for resources such as the
key business users or in-house/external database experts from application management or business
process operations.
Apart from human resources, activation and execution of recovery mechanisms requires physical
equipment, for example project offices for the recovery teams.
3.9 Agree on Recommendations
Before the next stage, the implementation of the business continuity concept, can be started, the
continuity approaches determined in this stage need to be agreed on by all involved parties. A summary
of the analysis results, continuity requirements and recommended measures should be presented to
the project steering committee and business continuity team.
Only after the agreement on the proposed measures and an approval for the involved costs has been
obtained from the steering committee and senior management, can the project continue with the
implementation and creation of the actual BC plan.
4 Stage 3: Implementation and Testing
Roles:
BCM Project Manager
IT specialists from application management, business process operations, and SAP technical operations
Business Process Champions
Output:
Implementation plan
Organizational structure supporting BCM
BC master plan
Crisis management and escalation procedures that may invoke BC plan
Detailed recovery plans and procedures
Documentation of risk reduction measures and standby arrangements
Test plan
Initial test of continuity concepts
During the implementation phase of a business continuity project, the solutions and recovery options
determined and agreed in stage 2 will be implemented and elaborated in detail. Besides the
implementation, the documentation of the solutions and procedures plays an equally important role.
Following the ITIL standard, the third stage of the project comprises the following phases:
Organization:
Establish the organization that is responsible for business continuity management
Implementation planning:
Develop implementation plans that describe the structure of the BC plan and assign work
packages
Implement risk reduction measures
Implement standby arrangements
Develop recovery plans:
Create and document the business continuity and recovery plan
Develop procedures:
Lay down general and detailed procedures for different recovery tasks
Initial testing:
Create test plans and perform initial tests of the BC concept
4.1 Establish Organization
The organization that executes the recovery plans consists of
The business continuity manager, who ensures that continuity plans are up-to-date and tested
The incident management staff that reports a possible disaster case
A crisis team of senior managers from business (business process champion) and IT
(application management) that decide if disaster recovery plans need to be executed
The recovery team with representatives from business (key business user/business process
champion) and IT (application management/SAP technical operations/business process
operations). The recovery team should also include key users and end users who ensure a
minimum operation of the business process, for example by working on paper or with other
workarounds.
SLAs recording the availability agreements for involved departments and partners have to be defined.
Example: Organizational structure for Sales and Marketing recovery in a CRM/ERP
landscape
Figure 12: Org. chart of Business Continuity Project
4.2 Develop Implementation Plans
Before developing the business continuity plan and implementing the recovery methods determined
and agreed in stage 2, a detailed plan should be set up addressing how these steps will be executed.
It has to be defined which plans shall be created and the overall structure of the BC plan has to be laid
out. Lastly, the implementation plan has to specify who will be responsible for the creation of which
parts. The owners of each plan must ensure that they have identified and agreed support and services
from other parties.
4.2.1 Recovery Plans
A BC plan usually consists of a master plan and a number of detailed plans for various aspects to be
considered in different service disruption scenarios.
The master plan is the summary document which provides all general information on the business
continuity plan like the BC organization, roles and responsibilities, crisis management and invocation
procedure, general guidelines, and so on.
Within the scope of this document, detailed plans should then describe:
Recovery procedures on business object level
Recovery procedures and workarounds on process level
Recovery procedures for implemented DR technologies
Functionality and scope of risk mitigation measures
The results of the analysis phase conducted during the BC project will be included in the respective
plans.
Appendix 7.2 lists the pieces of information that should be part of a BC plan.
4.3 Crisis Management
The business continuity plan must document how crisis management has to be performed and when
the disaster recovery process is invoked. This means that incident management can trigger the
invocation of the disaster recovery plan. To invoke the plan, incident management has to classify the
problem occurring as endangering the operation of the whole business process involved. Thus, the
staff performing incident management has to have clear guidelines when an incident is classified as
endangering the whole process and thus might invoke the disaster recovery plan. The decision
whether the plan is actually activated is typically made by a 'crisis management team'. The crisis team
should include senior managers from the business and IT support departments using information
gathered during the incident management process. A severe incident can occur at any time, day or
night, so it is essential that guidance on the invocation process is readily available.
In the case of a physical disruption such as a fire, the decision to invoke a recovery plan is easy, but if
there is a technical problem endangering a key business process, the decision is more difficult. In
such a case, it is good practice to set a deadline for the problem to be resolved; if the deadline is
missed, the recovery plan is invoked. The deadline should incorporate the support requirements for the
endangered business process to ensure that the impact of the disruption is acceptable from a
business point of view.
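The deadline rule can be illustrated with a small calculation. The Python sketch below assumes that the support requirement is expressed as a maximum tolerable outage and that the expected duration of the recovery itself is known. Both parameters and the function name are assumptions made for illustration; they are not defined in this document.

```python
from datetime import datetime, timedelta


def invocation_deadline(incident_start: datetime,
                        max_tolerable_outage: timedelta,
                        expected_recovery_time: timedelta) -> datetime:
    """Latest time at which the recovery plan must be invoked so that
    recovery still finishes within the tolerable outage."""
    if expected_recovery_time > max_tolerable_outage:
        raise ValueError("recovery takes longer than the tolerable outage")
    return incident_start + (max_tolerable_outage - expected_recovery_time)


# Hypothetical values: incident at 08:00, 8 hours tolerable, 5 hours to recover.
print(invocation_deadline(datetime(2008, 5, 1, 8, 0),
                          timedelta(hours=8),
                          timedelta(hours=5)))
```

With these assumed values, the problem would have to be resolved by 11:00; otherwise the recovery plan is invoked.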
4.4 Implement and Document Risk Reduction Measures
In this stage, the risk mitigation measures that were determined and agreed in stage 2 to prevent a
service disruption up front will be implemented and documented. The documentation should include:
The scope of the solution
Failure types not being covered by the solution
Prerequisites and measures for activation or operation of the solution
Test and maintenance plans to keep the solution operational
Besides focusing on system components and technical solutions to eliminate SPOFs, the change
management process should also be subject to verification or revision, since this is the only chance to
avoid logical errors or data corruptions.
4.5 Implement and Document Standby Arrangements
In this stage, the standby arrangements that were determined and agreed in stage 2 to reestablish
operations in a timely manner after a service disruption will be implemented and documented. These
standby arrangements include:
Technical solutions chosen for the recovery of system components
Workarounds determined to keep vital business functions operational using an alternate
approach or process
For intended business process workarounds, all details and requirements of the alternative process
need to be worked out in this stage. Coming back to the example of section 3.6.5, the workaround
described there requires the implementation of a customer specific report that replicates opportunities
to ERP (as quotations with a special type).
The documentation of standby arrangements should include:
The scope of the solution
Failure types not covered by the solution
Prerequisites and measures for activation or operation of the solution
  o For a technical solution like switchover to a DR site: What is required before
    operations can continue at the DR site? Is data consistency given after a switchover?
  o For a business process workaround: What resources are required? What data is
    needed, where does it come from?
Training of end users on working with an alternate process
Fallback requirements and procedures – what must be done to return to normal operation
after the original systems are available again?
  o For a technical solution like switchover to a DR site, describe how to switch back
    operations to the primary site
  o For a business process workaround, describe how the data created by the
    workaround will be incorporated back into the standard process and systems and
    what follow-up activities are required to complete the missing parts that were not
    covered by the workaround
Test and maintenance plans to keep the solution operational
4.6 Develop Recovery Plans
In this phase, the recovery plans will be created according to the structure set in the implementation
plan (section 4.2.1).
The continuity plan on IT level must give all necessary information on how to ensure that service,
facilities and critical systems are either still provided or are recovered in a time frame that is accepted
by business. The continuity plan relies on the availability of systems and facilities. It does not only
include recovering systems to a certain state but also resolving all inconsistencies between systems to
achieve a consistent state of the system landscape, which enables a return to normal business
operation.
The priority of services, facilities and systems must be included in the continuity plan to clearly
communicate what needs to be done first. The continuity plan itself has to be readily available for all
participants in the process. The plan is subject to change management control, as a change of
business processes must trigger an adaptation of the continuity plan.
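One way to make the priority explicit is to derive the recovery order from the recovery class of each process. The following Python sketch is illustrative only: the process names and recovery classes are taken from Table 12, while the numeric ranking, the shortened class name "Gradual" and the function name are assumptions.

```python
# Recovery classes as used in Table 12; the numeric ranks are an assumption.
# "Gradual" stands for the class "Gradual / Manual correction".
CLASS_RANK = {"Immediate": 0, "Intermediate": 1, "Gradual": 2}


def recovery_order(processes: dict[str, str]) -> list[str]:
    """Sort process names by recovery class, most urgent first.
    Processes of the same class are ordered alphabetically."""
    return sorted(processes, key=lambda name: (CLASS_RANK[processes[name]], name))


processes = {
    "Reporting Process": "Gradual",
    "Sales Order Process": "Immediate",
    "Marketing Process": "Intermediate",
    "Customer Service Process": "Immediate",
}
print(recovery_order(processes))
```

A real plan would usually refine the order within one class, for example by cost of outage, instead of the alphabetical tie-break used here.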
The plan has to be detailed enough to enable a technical person with basic experience of the
involved systems to follow the procedure. Note, however, that for repairing data
inconsistencies or logical errors, basic knowledge is not sufficient. For such application-related issues,
experts with deep knowledge of the business processes, business data and related database objects
need to be involved.
For different contingency and recovery scenarios, checklists need to be developed that describe what
needs to be done to revert involved systems of a business process back to normal operation. This
includes, for example, data integrity checks that need to be run after technical recovery actions have
been carried out for a system.
4.6.1 Example: Extraction of a Recovery Plan for the Incomplete Recovery of
an SAP CRM System
Regarding the specific scenario of an incomplete recovery of a system, the plan must include all
dependencies of the systems among each other and their objects being exchanged.
Scenario:
A malicious network driver has corrupted the database of the CRM system. A point-in-time recovery of
the CRM system is inevitable. All critical business processes are affected by the downtime of the CRM
system.
The following steps comprise those parts of a recovery plan which deal with the inconsistencies in
the system landscape caused by the point-in-time recovery of the CRM system. It is assumed that the
ERP and BW systems continue to operate while CRM is being recovered.
The following procedure is proposed for incomplete recovery of CRM. The description is not at the full
detail level, but can be carried out accurately by experienced staff. Also, the procedure is not
complete; it is only used as an example that details some aspects of a full recovery plan.
Procedure:
(1) Stop business operation in CRM
(2) Perform point-in-time recovery of the database of the CRM system (do not start CRM SAP system
before next step)
(3) Isolate CRM system (disable communication with BW and ERP)
Stop outbound queue processing from ERP to CRM: Deregister outbound queues in ERP
(transaction SMQS) *
Disable outbound queue processing from CRM to ERP: Lock respective RFC-user in ERP
(transaction SU01)
Disable data requests from BW to ERP (for example, from transaction RSA1 in BW or from
process flow control, also from SM37)
Disable synchronous data transfers from CRM to BW by locking the corresponding RFC-user
in BW
Start CRM SAP system
Stop productive operation in CRM by locking all regular users (Transaction SU10)
Stop CRM middleware message processing (transaction MW_MODE)
For CRM Field scenarios only: Stop CRM replication & realignment queues (transaction
SMOHQUEUE)
Disable outbound queue processing from CRM to ERP: Deregister outbound queues in CRM
(transaction SMQS) *
Disable outbound queue processing from ERP to CRM: Lock respective RFC-user in CRM
(transaction SU01)
Unschedule the BDOC reorganization job MW_REORG in CRM to avoid deletion of BDOC
message store
* Deregistering outbound queues may not be sufficient in all cases. Some applications may register queues
automatically even if queues are deregistered. To be safe, RFC destinations can be disabled instead, for
example by pointing them to an invalid server (transaction SM59).
(4) Handling of RFC queue entries
Queues to ERP in the recovered CRM system:
In the CRM system, which was restored to a former point in time, there may be pending RFC
queue entries, which were not processed or which were just being processed at that time. If the
queue status is monitored regularly, there should be only few such queue entries in the restored
CRM system. All these queue entries were probably processed completely at the time of the crash
and can therefore be deleted from the CRM system (otherwise, they would be processed twice).
There are two cases in which such queue entries may not have been processed at the time of the
crash:
- If the point to which the CRM system is set back lies only a very short period before the
  point of the crash
- If a queue could not be processed because it was deactivated or because it was in an
  error state preventing further entries from being processed and if this situation was not
  resolved in the period until the crash occurred.
In such cases, the “contents” of these RFCs could be cross-checked against the ERP data to find
out whether the RFCs were processed or not.
Queue entries in ERP:
Pending RFC queue entries in the ERP system must not be deleted because they contain the most
current data objects, which need to be processed. They should be processed later, after business
recovery between ERP and CRM is completed, because they may rely on CRM data that was
lost due to the incomplete recovery. Because the CRM system was isolated above, unintentional
processing of these queues is prevented.
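The cross-check of pending CRM queue entries against ERP data can be sketched as follows. This is a simplified illustration in Python, not an SAP function: it assumes that each queue entry carries an identifier (here the hypothetical key "id") and that the set of identifiers already processed in ERP has been determined beforehand.

```python
def classify_crm_queue_entries(pending_entries, erp_processed_ids):
    """Split pending CRM queue entries into those that ERP has already
    processed (safe to delete) and those that still need processing."""
    to_delete, to_process = [], []
    for entry in pending_entries:
        if entry["id"] in erp_processed_ids:
            to_delete.append(entry)   # processing again would post twice
        else:
            to_process.append(entry)  # never reached ERP before the crash
    return to_delete, to_process


# Hypothetical queue content of the restored CRM system.
pending = [{"id": "Q1"}, {"id": "Q2"}, {"id": "Q3"}]
delete, process = classify_crm_queue_entries(pending, erp_processed_ids={"Q1", "Q3"})
print([e["id"] for e in delete], [e["id"] for e in process])
```

In practice, the comparison would be made on the contents of the queue entries, which requires application knowledge of the objects involved.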
(5) Repair data inconsistency with ERP
Due to continued productive operation in ERP and BW, the following things must be considered
during business recovery:
- New RFC queue entries in ERP, which are created during the period of business recovery,
  must not be deleted.
- CRM data that is created, for example, by re-entering lost data may cause double posting in
  other systems. This has to be considered during business recovery.
In this example we will only show the case of business partners, but in a real plan recovery
procedures must be described for all business objects.
a. Business Partners:
Leading systems can be ERP and CRM, so we have to regard objects transferred in both
directions. The following figure displays the types of inconsistencies that may appear for business
partners after an incomplete recovery of CRM. As both systems are leading systems, our goal will
be to transfer all inconsistent objects from ERP to CRM because ERP has the newer version. This
will also recover objects that were created or changed in CRM.
Figure 13: Inconsistency Cases After Incomplete Recovery of CRM
The CRM Data Integrity Manager (DIMa) can be used to compare business partners at header
level (check for existence) or on detail level (check for different field content). Based on the
comparison result, you can either request a business partner from ERP into CRM again, or you
can send a business partner from CRM to ERP. Please see SAP note 647664 for more details on
the DIMa features for comparing ERP customer masters with CRM business partners.
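The difference between a comparison at header level and at detail level can be illustrated with a small Python sketch. It does not reflect how DIMa is implemented. It assumes that both systems have been exported into dictionaries keyed by a common partner number and that the records contain only fields available in both systems; the function name and the sample data are hypothetical.

```python
def compare_partners(erp: dict, crm: dict):
    """Compare two {partner_id: {field: value}} snapshots.

    Header level: partner IDs that exist in only one system.
    Detail level: common partners whose field content differs.
    """
    missing_in_crm = sorted(erp.keys() - crm.keys())
    missing_in_erp = sorted(crm.keys() - erp.keys())
    different = sorted(pid for pid in erp.keys() & crm.keys() if erp[pid] != crm[pid])
    return missing_in_crm, missing_in_erp, different


# Hypothetical snapshots after an incomplete recovery of CRM.
erp = {"1": {"name": "A"}, "2": {"name": "B"}, "3": {"name": "C"}}
crm = {"1": {"name": "A"}, "2": {"name": "X"}, "4": {"name": "D"}}
print(compare_partners(erp, crm))
```

Partners missing in CRM or differing in content are candidates for a request load from ERP, while partners missing in ERP are candidates for sending from CRM to ERP.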
This procedure will resolve all business partner inconsistencies between CRM and ERP except for
deletions (cases 3 and 6, see below). It will re-transfer to CRM all business partners created or
changed in ERP. In CRM, this will also recreate all business partners that were created or
changed in CRM and already replicated to ERP. Only business partners that were not yet
transferred to ERP cannot be recovered this way.
Note: Please pay attention to the fact that the data models of the ERP customer master and the
CRM business partner are different. This includes attributes that are available in ERP or
CRM exclusively. Even if one system is considered as master system, where a new partner
is preferably created, there can be subsequent updates in the other system to add further
attributes.
Example: A new customer master is created in ERP and automatically loaded to CRM.
Afterwards, CRM is used to add special marketing attributes to this customer, which cannot
be maintained in ERP.
In consequence this also means that such exclusive attributes cannot be recovered from
another system.
For identifying inconsistencies due to deletions (Cases 3 and 6) the DIMa tool can be used as
well. We assume that there are no direct physical deletions of business partners. Instead, there is
first a logical deletion (setting the deletion flag) and in periodic intervals archiving runs do the
physical deletion. The deletion flag is replicated between ERP and CRM. Now, if the business
partner was already physically deleted in one system, DIMa would report it as missing in the other
system. This unwanted effect can be solved by using the deletion flag (field LOEVM) as additional
filter for the DIMa comparison.
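The effect of using the deletion flag as a filter can be sketched in Python as follows. The field name LOEVM is taken from the text above; the data layout (one dictionary per system, flag value "X" when set) and the function name are assumptions for illustration and do not describe the DIMa filter itself.

```python
def drop_flagged(partners: dict) -> dict:
    """Remove partners whose deletion flag (LOEVM) is set before comparing."""
    return {pid: rec for pid, rec in partners.items() if not rec.get("LOEVM")}


# Hypothetical data: partner 2 is flagged in ERP and already archived in CRM.
erp = {"1": {"LOEVM": ""}, "2": {"LOEVM": "X"}}
crm = {"1": {"LOEVM": ""}}

unfiltered = sorted(erp.keys() - crm.keys())
filtered = sorted(drop_flagged(erp).keys() - crm.keys())
print(unfiltered, filtered)
```

Without the filter, partner 2 is reported as missing in CRM; with the filter, it is no longer reported.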
DIMa does not allow restricting the comparison to a specific period of time (only on the creation
date of a business partner).
Note: There can be cases where pending queue entries contain updates that were missing in
CRM. When performing a DIMa comparison, such temporary differences would be reported
as inconsistencies as well. It can make sense to have multiple iterations of DIMa
comparisons to consider data that was stored in queues but was processed later on
successfully.
Note: The time for a DIMa comparison can be quite long, comparable to a full initial download from
ERP to CRM. If you fear that there is a large number of missing or inconsistent objects, it
should be considered whether it makes sense to perform a real initial download instead. In
such a case, the pending ERP outbound queue entries can be omitted.
Possible alternative to using DIMa for business partners:
By means of change documents for business partner records in ERP, all business partners
that were modified since the recovery point in time can be selected in ERP and then
transferred to CRM. Business partner records that are created or modified in ERP after the
crash point in time must not be included because they will be transferred to CRM using regular
replication mechanisms (objects still in the ERP queues should also be excluded). Change
documents are activated in ERP for all business partner fields that are relevant for CRM.
1. In ERP, select all business partners changed between the recovery point in time and the
crash point in time.
In transaction VD03, the menu paths Environment, Field changes and Environment, Multiple
display can be used to display all business partners changed since that day (selection not
possible by time)
Note: A special report may be needed for that purpose if this function is not sufficient.
2. Download all these business partners from ERP to CRM. To avoid duplicates, only
missing objects should be transferred.
Note: A special report is needed to automate business partner download for the objects
identified above. This report can call the standard functionality (transaction R3AS4) to
synchronize missing changes and/or business partners. This report must pay attention to
objects still available in the queues; these must not be re-transferred.
In CRM, you can use the report BUSCHDOC to perform a mass evaluation of change
documents for several business partners. This way, you can identify all business partners
that have been changed within a given timeframe.
Specifically for business partners, there is also a special transaction called
CRMM_BUPA_SEND, to manually trigger the sending of CRM business partners to the
receiving systems like a connected ERP backend. With transaction CRMM_BUPA_MAP you
can even trigger a new request load from ERP to CRM for a single business partner.
b. Other Business Objects
For all other business objects that are exchanged between CRM and ERP, procedures to check
and re-establish data integrity are required as well. For the common exchange objects between
ERP and CRM, corresponding DIMa objects are available. Please see transaction SDIMA_BASIC
for a list of DIMa objects with their offered repair options.
(6) Repair data inconsistency with other systems
For all other interfaced systems and all business objects that are exchanged with the CRM
system, procedures to check and re-establish data integrity are required (for example BW).
(7) Other corrective actions
- Process pending queue entries in ERP (see (4))
- Check if transports were imported into the CRM system during the period that was lost due to
  the incomplete recovery
(8) Checks
Execute functional checks of CRM business processes
(9) Restart productive operation in CRM
Now that data consistency has been re-established and all functionality checks were successful,
the CRM system can return to productive operation. The communication to other systems
can be enabled and users can be released to work with the usual business processes in CRM.
4.7 Develop Recovery Procedures
4.7.1 General Procedures for Different Contingencies
General procedures should lay out guidelines on how to handle different types of contingencies
affecting business continuity. These can describe for example:
Conditions to be given before performing a switchover to the DR site
A general guideline for handling logical errors
A basic procedure for handling data inconsistencies
4.7.1.1 Example: General Guideline to Avoid Incomplete Recovery
In case of logical errors appearing in a system, point-in-time recovery of the system is a technical
option to fix the situation – but with a serious side-effect on data consistency between the systems of a
system landscape. Thus, a general guideline should restrict and even prevent the usage of this option
by creating awareness for its downside. A control instance should be established that needs to
consent prior to any incomplete recovery, also weighing the impact from an application point of view.
Define approval process for point-in-time recovery
Define roles and assigned people for approval
Involve application management
Define decision criteria for engaging a point-in-time recovery
4.7.1.2 Example: General Procedure for Handling Data Inconsistencies
If, for example, data inconsistencies are reported by end users, a predefined procedure can help with
the analysis and resolution of the situation.
The following steps provide a general guideline:
1. Inconsistency is reported
2. Understand affected business process, data objects and corresponding interfaces
Note: The documentation created in section 3.1 will be very helpful for this.
3. Analyze if it is only a temporary difference or a real inconsistency
Note: Temporary differences can be caused for example if a queue used for data exchange is
stopped. Since the data is not transferred, the end user perceives this as a data inconsistency.
However, it is not a real inconsistency requiring corrective tools because the situation can be
easily resolved by processing the pending messages.
4. Analyze if it is a technical or logical inconsistency
Note: A technical inconsistency is anything that can be found at database level and
needs appropriate correction in the underlying database, while a logical inconsistency is a
persistent mismatch that is due to a misunderstanding of the process or a misinterpretation of
data. While technical inconsistencies can be identified by technical means like check reports,
logical inconsistencies need to be identified by mapping the intended business process to the
underlying data structures. They cannot be identified by technical means as the underlying
data is consistent on a one-to-one level.
5. Decide if productive use can continue or needs to be interrupted
6. Identify root cause (programming error, non-transactional interface, incorrect error handling,
incorrect data entry, no clear leading system, …)
7. Correct root cause
8. Analyze if dependent data is affected (in the same system, in other systems or in the real
world)
9. Identify inconsistencies, filter out differences
10. Correct inconsistencies
11. Correct dependent data
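Step 3 of this guideline, the distinction between a temporary difference and a real inconsistency, can be sketched in Python. The function name and its inputs are assumptions for illustration: it is assumed that a comparison has already shown whether the object differs between the systems and that the keys of objects with pending queue entries are known.

```python
def triage(object_id, differs, pending_queue_ids):
    """Classify a reported difference for one business object.

    Returns "consistent", "temporary" (an update is still waiting in a
    queue and should resolve once the queue is processed) or "real".
    """
    if not differs:
        return "consistent"
    if object_id in pending_queue_ids:
        return "temporary"
    return "real"


# Hypothetical case: a sales order differs, and its update is stuck in a queue.
print(triage("SO-4711", differs=True, pending_queue_ids={"SO-4711"}))
```

Only objects classified as "real" need to pass through the remaining steps; objects classified as "temporary" should be checked again after the pending messages have been processed.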
4.7.2 Detailed Procedures for Specific Tasks
Recovery procedures to solve specific recovery tasks need to be described on a detailed level. For
technical recovery methods on system level, the provided details must enable technical persons to
carry out the necessary recovery steps (for example, a database restore and recovery) without specific
knowledge of the affected system.
For the recovery from data inconsistencies on application level, recovery always requires expert
knowledge of the affected application and business objects. Describing the recovery procedures
involves the already available tools and/or specification of what reports need to be written, tested and
executed to repair data inconsistencies for highly critical objects determined and agreed during stage
2. It also involves the definition of decision criteria and checks to tell whether a return to productive
operation is possible.
Example:
In the previous section for recovery of the CRM system after database failure, two data consistency
check and correction requirements were identified:
A tool is needed to compare business objects between ERP and CRM. The comparison must
identify completely missing objects and objects that have different content due to missing
updates. For common business objects exchanged between ERP and CRM, the Data Integrity
Manager (DIMa) can be used.
For special requirements, the DIMa tool may not be sufficient.
For example, a special report has to be developed to select changed business partners in
ERP since a certain point in time. In standard ERP, the selection can only be done with the
granularity of days.
Another special report is needed that downloads all the selected business partners from ERP
to CRM. To avoid duplicates, only missing objects may be transferred. This report can call the
standard functionality (transaction R3AS4) to synchronize missing changes and/or business
partners. The report must take into account objects that are still in the queues; these must
not be re-transferred.
The IT support departments must implement such special reports needed during disaster recovery.
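The comparison requirement in this example can be sketched as follows. This is a minimal illustration in Python, not an SAP tool or API. The function name, the dictionaries standing in for ERP and CRM extracts, and the set of queued object IDs are all hypothetical. It assumes that each system can provide a mapping from object ID to a comparable content value.

```python
from dataclasses import dataclass, field


@dataclass
class ComparisonResult:
    """Outcome of comparing a leading system (ERP) with a target system (CRM)."""
    missing: set = field(default_factory=set)     # not present in the target at all
    differing: set = field(default_factory=set)   # present, but content differs
    pending: set = field(default_factory=set)     # mismatch, but still in a queue


def compare_objects(erp_objects, crm_objects, queued_ids):
    """Classify ERP objects by their state in CRM.

    erp_objects / crm_objects: dict of object ID -> content value (hypothetical extract)
    queued_ids: IDs still waiting in transfer queues; these must not be re-transferred
    """
    result = ComparisonResult()
    for obj_id, content in erp_objects.items():
        if obj_id in crm_objects and crm_objects[obj_id] == content:
            continue  # consistent, nothing to do
        if obj_id in queued_ids:
            result.pending.add(obj_id)    # the queue will deliver it
        elif obj_id not in crm_objects:
            result.missing.add(obj_id)    # candidate for transfer
        else:
            result.differing.add(obj_id)  # candidate for re-synchronization
    return result


# Illustrative data only
erp = {"BP1": "a", "BP2": "b", "BP3": "c", "BP4": "d"}
crm = {"BP1": "a", "BP2": "x"}
result = compare_objects(erp, crm, queued_ids={"BP4"})
```

Keeping queued objects in a separate category makes it possible to review them after the queues have been processed instead of transferring them a second time.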
4.8 Recovery Testing
An important part of the BC project is the creation of test plans and the establishment of tests. Only
regular testing can ensure that the recovery solutions will work, that all prerequisites are met and that
the people involved have a good understanding of the processes and procedures.
Only tests will show gaps and deficiencies of a plan.
4.8.1 Create Test Plan
The test plan has to lay down the scope of testing, the objectives, the test procedures and test
schedules. The test plan has to make sure that all aspects of the recovery plan are tested.
A test plan will include separate tests for individual parts of the business continuity plan, as well as full
tests that can verify the business recovery plan as a whole. A full test has to demonstrate that the
whole recovery process is supported by both business and technical departments and that the
documented recovery procedures are operational. A full test ensures that standby arrangements are
valid, that external partners are integrated reliably in the continuity efforts and that in-house staff
understands the procedure and can execute the recovery plan.
Certain aspects of the recovery plan can and should be verified during tests:
Recovery of the business process or system in time
Capability of staff to execute recovery plan
Availability of resources (physical and human resources)
Effectiveness and on-time involvement of external partners
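The first aspect, recovery in time, lends itself to a simple automated check of test records. The following Python sketch is only an illustration: the record structure, the field names and the example values are assumptions and are not taken from this document.

```python
from dataclasses import dataclass


@dataclass
class RecoveryTestResult:
    name: str              # recovery procedure that was tested
    target_minutes: int    # agreed recovery time (hypothetical field)
    measured_minutes: int  # time actually needed during the test


def late_recoveries(results):
    """Return the names of all tests that exceeded their agreed recovery time."""
    return [r.name for r in results if r.measured_minutes > r.target_minutes]


# Illustrative values only
results = [
    RecoveryTestResult("CRM database restore", target_minutes=240, measured_minutes=200),
    RecoveryTestResult("ERP failover", target_minutes=60, measured_minutes=95),
]
late = late_recoveries(results)
```

Every entry returned by such a check points to a gap in the plan that needs to be analyzed before the next test.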
4.8.2 Initial Testing
Initial tests should already be performed for each individual recovery solution in parallel to its
implementation.
Immediately following the main milestones of the implementation phase, a more comprehensive test
will complete stage 3 of the business continuity project, verifying the interaction of the
individual recovery solutions in the overall business continuity plan.
5 Stage 4: Operational Management
Roles:
BCM Project Manager
IT specialists from application management, business process operations, and SAP technical operations
Business Process Champions
End Users
Output:
Change control procedure for BC plan
Training plan
Test schedule
Monitoring schedule
After the previous phases have been carried out, it is important to embed the continuity plan in
day-to-day business operations. It is very important to ensure that the continuity plan is kept
up-to-date after changes in business processes or IT infrastructure. This section describes the
main tasks that are important from an operational point of view.
5.1 Create Awareness
Everybody in the organization has to be made aware of the business continuity concept. Also,
everybody needs to understand the importance of disaster recovery and its impact. The BC plan
needs to be shared and must be accessible to all people who will be involved in maintaining
business continuity.
Furthermore, disaster recovery efforts have to be perceived as involving routine tasks, such as
checking whether the continuity plan needs to change when business processes are altered.
Continuity management related tasks have to be covered by the regular budget in financial planning.
5.2 Establish Education, Trainings and Exercises
To ensure that all people involved in the DR process are able to execute their recovery tasks
effectively, training should be scheduled. This applies especially to new people joining the DR
team. This training should also be used to establish a common language that enables effective
communication between business recovery team members from application and IT-related areas as
well as team members from different regions.
5.3 Establish a Continuous Review and Change Control Process
To ensure that the business continuity plan stays current, it is obligatory to review the business
continuity deliverables every time there is a change to:
Business processes
Required service levels
IT infrastructure
The overall business or IT strategy.
For maintenance of the continuity plan, it is essential to establish a change control procedure and
clear responsibilities.
It is good practice to include business continuity as a topic in every implementation project that
changes business processes or IT processes. This way, possible changes to the continuity plan can
be identified and implemented more easily and in a more timely manner.
5.4 Establish Regular Testing
After initial tests, a regular testing routine has to verify the operational readiness of the continuity plan.
Tests should be done as frequently as directed by senior management or audit. SAP recommends
testing the continuity plan at least once a year. It is especially necessary to test the continuity plan
after changes have been incorporated in the underlying business processes or IT infrastructure.
Only regular testing will ensure that reliance on the business continuity plan is well grounded.
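The two triggers named above (a maximum interval between tests and a test after relevant changes) can be expressed as a small scheduling check. The Python sketch below is an assumption about how such a check might look. The function name and parameters are hypothetical, and the default interval of 365 days simply encodes the "at least once a year" recommendation.

```python
from datetime import date, timedelta


def continuity_test_due(last_test, last_change, today,
                        max_interval=timedelta(days=365)):
    """Return True if the continuity plan should be tested again.

    last_test:   date of the last completed test, or None if never tested
    last_change: date of the last change to business processes or
                 IT infrastructure, or None if there has been no change
    """
    if last_test is None:
        return True  # never tested
    if last_change is not None and last_change > last_test:
        return True  # changed since the last test
    return today - last_test > max_interval  # regular interval exceeded


# Illustrative dates only
due_after_change = continuity_test_due(date(2008, 1, 10), date(2008, 3, 1), date(2008, 5, 1))
due_without_change = continuity_test_due(date(2008, 1, 10), None, date(2008, 5, 1))
```

A stricter interval directed by senior management or audit can be passed through the `max_interval` parameter.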
5.5 Establish Monitoring and Resolution of Findings
To prevent possible business disruptions, or at least detect upcoming issues as early as possible, the
monitoring objectives and monitoring tasks defined in section 3.7 need to be established in regular
operations. If irregularities are detected by monitoring, measures to resolve the situation need to be
initiated.
For example, in the area of data consistency monitoring, regular clearing of data inconsistencies
should become part of operational management. This can help to prevent such inconsistencies from
escalating into business disruptions. Furthermore, regular clearing of data inconsistencies reduces
the number of differences and the repair effort after a disaster recovery.
6 Conclusion
Some people say that, for Business Continuity Management,
“The planning process is more important than the plan itself.”
The truth behind this statement lies in the fact that the lessons learned and the experiences gained
during the planning process are an important side-effect of a business continuity project. The process
will often reveal a multitude of findings and insights into weaknesses of current concepts and
procedures. Addressing previously unknown issues and risks can already yield a higher availability.
But of course, the business continuity plan itself is vital for establishing continuity procedures and
spreading awareness throughout an organization.
To ensure that a business continuity concept is successfully implemented and “lived” by an
organization, awareness and commitment must be created at the senior management level. The
continuity project needs adequate priority and sufficient sponsorship in order to gain the
acceptance and commitment of managers and staff.
This document outlined the initiation, the business impact analysis, the determination of recovery
options and the creation of a detailed recovery plan, including initial testing and the tasks necessary in
operational management to make the business continuity effort a part of daily operation.
A last word on operational management to ensure the effectiveness of disaster recovery plans:
A business continuity plan is only effective if it is kept up-to-date, so staff and management
have to be continually committed to the continuity effort within the organization. Everybody has
to be aware of their responsibility for business continuity, so that triggering required changes
to the continuity plan becomes a matter of course whenever a change of processes or IT
infrastructure is implemented. To maintain the quality of the continuity plans, management has to
control and monitor activities in the area of continuity management.
7 Appendix
7.1 Template for scoping presentation
Presenter: Business management/IT management
Audience: Senior management
Structure:
Introduction
Case studies:
   Disaster Case 1
      Impact of disaster case 1
      Resulting loss without recovery plan
      Resulting loss with recovery plan
   Disaster Case 2
      Impact of disaster case 2
      Resulting loss without recovery plan
      Resulting loss with recovery plan
Costs and Resources involved in a business continuity effort
   Costs of loss without recovery plans against costs with recovery plan + recovery project costs
Conclusion
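The cost comparison in this template is simple arithmetic and can be prepared with a few lines of Python. The sketch below only illustrates the comparison named in the template. The function name is hypothetical and the figures are invented for illustration; they are not taken from any case study.

```python
def net_benefit_of_recovery_plan(loss_without_plan, loss_with_plan, project_cost):
    """Compare the loss without a recovery plan against the loss with a
    recovery plan plus the costs of the recovery project.

    A positive result means that the business continuity effort pays off
    for the disaster case considered.
    """
    return loss_without_plan - (loss_with_plan + project_cost)


# Invented figures for one disaster case, all in the same currency
benefit = net_benefit_of_recovery_plan(
    loss_without_plan=1_000_000,
    loss_with_plan=200_000,
    project_cost=300_000,
)
```

The calculation can be repeated for each disaster case in the presentation so that senior management sees one comparable figure per case.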
7.2 Contents of a Business Continuity Plan
A business continuity plan, consisting of a master plan and more specific detailed plans, should
contain the following documentation:
Results and documents created in the analysis phase (stage 2)
   o System landscape / architecture
   o Business processes
   o Interfaces and data exchange (include documentation created during stage 2)
   o Business impact analysis
   o Required SLAs
Organization, Roles and Responsibilities, including
   o DR team
   o decision process
   o contact and distribution lists
Crisis management, including
   o notification and activation procedure for BC plan
   o damage assessment
Risk reduction measures, as proposed in stage 2 and implemented in stage 3
Recovery options and recovery procedures, as proposed in stage 2 and implemented in stage 3
   o Technical standby-arrangements and activation procedures for technical solutions
   o Alternative business processing using workarounds
   o Logical recovery procedures for core business objects (business recovery)
Procedure to return to normal operations
   o Prerequisites
   o Checks
Education and training
Testing
Review and maintenance of BC plan