Building Better Resiliency: Improve your RTO/RPO for Critical Systems

MSI SYSTEMS INTEGRATORS
Building Better Resiliency
Through Data Replication
A presentation by:
Alan Salesman
IT Consultant
August 24, 2010
Abstract
• Are you feeling the pressure for high availability and need quick
recovery without losing data? MSI will review approaches to
improving your Recovery Time Objectives (RTO) and Recovery Point
Objectives (RPO). A discussion of technologies, service definition,
and process will help you understand how your critical systems can
achieve performance and resiliency.
Industry Terms and Definitions
The Disaster
• The catastrophic failure of an IT Service environment caused by an
unplanned event.
That causes…
• A significant and extended failure of an IT
service resulting in a severe impact to the
business through:
• A failure to satisfy a service level
agreement (SLA)
• Loss of market share
• Revenue loss
• Exposure to litigation / legal action
Definitions
Recovery Time Objective (RTO)
– The duration of time, and the service level, within which an IT service must
be restored after a disaster (or disruption) to avoid unacceptable
consequences associated with a break in IT service continuity.
Recovery Point Objective (RPO)
– The point in time to which you must recover an IT service or data, as
defined by your organization. This is generally what an organization
determines is an "acceptable loss" in a disaster situation.
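As a rough illustration of the two definitions, the sketch below checks a single outage against both objectives. The function name `meets_objectives`, the timestamps, and the 24-hour/4-hour targets are hypothetical values chosen for the example, not figures from a real incident.

```python
from datetime import datetime, timedelta

def meets_objectives(last_good_copy, failure, restored, rto, rpo):
    """Return (rto_met, rpo_met) for one outage.

    Data loss is measured from the last recoverable copy to the failure;
    downtime is measured from the failure to service restoration.
    """
    data_loss = failure - last_good_copy
    downtime = restored - failure
    return downtime <= rto, data_loss <= rpo

# Hypothetical outage: last copy at 02:00, failure at 09:00, restored 20 hours later.
failure = datetime(2010, 8, 24, 9, 0)
rto_met, rpo_met = meets_objectives(
    last_good_copy=datetime(2010, 8, 24, 2, 0),
    failure=failure,
    restored=failure + timedelta(hours=20),
    rto=timedelta(hours=24),
    rpo=timedelta(hours=4),
)
print(rto_met, rpo_met)  # True False
```

Here the service came back inside the recovery time objective, but seven hours of data were lost against a four-hour recovery point objective, so only the RTO is met.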
Definitions
Replication
– the process of sharing information so as to ensure consistency between
redundant resources, such as software or hardware components, to
improve reliability, fault-tolerance or accessibility.
Resilient
– An IT service deployment philosophy that describes an IT service that is
able to survive the loss of major components of the IT infrastructure
without presenting a total loss of functionality to the user. (The application
may continue to operate at an altered level of service.)
IT Service Continuity Management Spectrum
[Chart: incident types plotted by Frequency per year (100, 1, 1/100,000) and Costs per Occurrence ($, $$$, $$$$$).]
• Availability Related: Virus, Data Corruption, Worms, Disk Failure, Application Outage, Network Problem, Component Failure
• Disaster Resiliency
• ITSCM – Recovery Related: Power Failure, Building Fire, Terrorism/Civil Unrest, Natural Disaster
Evolution of Disaster Resiliency
The transformation of your IT Service
1990
• IT is a competitive advantage
• Proprietary systems
• Reactive Disaster Recovery
2010
• IT has grown to be a strategic center of companies, not just a cost center
• Globalization and collaboration open IT environment
• Proactive Disaster Resiliency/Data Protection
Pressure from Both Sides
Business Pressure
• Globalization
• "Mobilization" of data
– user access at anytime or anyplace
• All data critical
• Compliance/Regulatory
Time and Expectations
• True 24/7 availability
IT Pressure
• Increased data capacity
• Data growth at 30% per year
• Increased data value
• Maintenance windows few and far between
• Outages are visible, some make the news
Replication Model Key Characteristics
• Data replication is performed for critical data
• Consistency of replicated data is supported by synchronous or asynchronous
techniques
• Recovery point is based on timing of consistency group creation and tape
backup timing
• Recovery environment provides servers and supporting infrastructure for key
applications
• Servers are inactive, except to support non-disk-based replication
• Tests can be performed and should be conducted on a regularly scheduled
basis
• Facilities are provided by a vendor hot site/co-location agreement, company-owned
internal private cloud or external private cloud
• Configurations for infrastructure represent a subset of what is needed to
support production levels, plus what is needed to support the replication method
• Network bandwidth is established to support the volume of data replication
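One way to approach the bandwidth characteristic is a back-of-the-envelope estimate from the daily volume of changed data. The sketch below is only that: the function name `required_mbps`, the 20% overhead allowance, and the 500 GB/day figure are all assumptions for illustration, and a real sizing exercise would also account for peaks in the change rate.

```python
def required_mbps(changed_gb_per_day, replication_window_hours, overhead=1.2):
    """Estimate sustained link bandwidth (megabits/s) for replicating daily changes.

    overhead is an assumed allowance for protocol framing and retransmits.
    """
    bits = changed_gb_per_day * 1e9 * 8 * overhead
    seconds = replication_window_hours * 3600
    return bits / seconds / 1e6

# Hypothetical: 500 GB of changed data per day, replicated continuously over 24 hours.
print(round(required_mbps(500, 24), 1))  # 55.6
```

Shrinking the replication window (for example, replicating only outside the batch cycle) raises the required bandwidth proportionally.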
Infrastructure Replication
Performed at the infrastructure layer (synchronous or asynchronous)
• Covers both geo-proximate and geo-remote dual data center designs.
• Application neutral and lightweight on the operating system.
• Asynchronous delivery can require up to three times the amount of physical storage capacity.
• Requires high bandwidth.
Example services are found in:
• Peer-to-Peer Remote Copy (PPRC) (IBM)
• Fast Remote Mirroring
• FlashCopy
Recommended Infrastructure/Policy:
• Geo-remote data centers with high-bandwidth connectivity
• Redundant/duplicate infrastructure
• A good job scheduler for asynchronous replication
• Defined policy/process for recovery or restarting of services
• Defined process for declaration of disaster
RTO/RPO Target
• RTO target can be 24 hours or less
• RPO target can be less than 4 hours
Client Example for a Recovery Strategy
Client Focus
• Support a sophisticated recovery initiative by
implementing storage replication technology to:
– Reduce Recovery Time (RTO)
– Reduce Recovery Point (RPO)
MSI Role
• Performed a review of IT services included within scope
to determine the impact of the new storage recovery
environment on recovery exercises, outages and daily
processing.
• Developed operational recovery procedure
recommendations to leverage the capabilities of the new
recovery environment and improve upon RTO/RPO
objectives.
• Developed recovery test timeline recommendations to
improve the quality and reduce the duration of recovery
test exercises.
Beginning State
• All mainframe IT services had the same recovery
time and recovery point objectives.
• The recovery time objective was 72 hours, which
could not be achieved.
• Recovery exercises were performed during 64-hour
sessions at a vendor-provided recovery site.
• The test process could be completed within 64
hours, but the process was a subset of what would
be necessary for a full recovery.
Beginning State
• The recovery point objective was stated as being no
greater than 24 hours from time of failure.
• Disaster recovery tapes were not taken off site until 3
pm each day.
• The time between tape creation and off-site rotation
introduced an additional eight hours to the recovery
point, or a total of 32 hours.
• On weekends and holidays, cycles were generally not
scheduled, therefore backups and off-site rotation did
not occur, leaving longer time periods of exposure.
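The recovery point arithmetic above can be written out as a small worked example. The function name and the `skipped_cycles` parameter are illustrative; the two-skipped-cycle weekend case is an assumption, since the slides only say that exposure was longer on weekends and holidays.

```python
def worst_case_recovery_point(backup_interval_h, offsite_delay_h, skipped_cycles=0):
    """Worst-case hours of data loss when recovery depends on off-site tapes.

    skipped_cycles counts consecutive backup cycles that did not run
    (for example, over a weekend or holiday).
    """
    return backup_interval_h * (1 + skipped_cycles) + offsite_delay_h

print(worst_case_recovery_point(24, 8))     # 32, the weekday figure above
print(worst_case_recovery_point(24, 8, 2))  # 80, assuming two skipped cycles
```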
Technology
• (2) DS8300 Storage Subsystems
– FlashCopy
– Global Mirror
• PPRC
• Asynchronous replication
– Storage instances
• A Copy – Production, Local DS8300
• B Copy – Performance, Remote DS8300
• C Copy – Consistency, Remote DS8300
• D Copy – DR Test, Remote DS8300
Strategic Shift to Internal Recovery
• Driven by inflexible recovery vendor:
– Pricing of services
– Definition of roles
– Access to facilities
– Contention for tests and regional disasters
• The existence of a secondary data center
• Mainframe cost mitigated by Capacity BackUp (CBU)
offering
• Open systems also need to be addressed, magnifying
inflexibility
• Cost concerns for extended stay at recovery site
Support Recommendations
• Establish clear ownership of remote storage, to
include:
– FlashCopy schedules.
– Bandwidth utilization.
– Global Mirror and Consistency Group validation and
monitoring process.
– Change control.
Support Recommendations
• Volume specific FlashCopy methods:
– Page datasets and coupling facility datasets should be
replicated once to create an instance on the recovery
platform and never replicated again.
– JES2 Spool and secondary checkpoint should follow
the same FlashCopy scheme as other production
volumes.
– SYSRES volumes should be replicated once per
month, following a successful IPL.
– Development and Test volumes should be replicated,
and FlashCopied once per day.
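The volume-specific recommendations above amount to a lookup from volume class to copy frequency. A minimal sketch of that lookup follows; the key names and frequency labels are invented for illustration and are not DS8300 or FlashCopy settings, and defaulting unknown classes to the production scheme is an assumption.

```python
# Copy frequency per volume class, transcribed from the recommendations above.
# Keys and labels are illustrative names, not storage subsystem parameters.
FLASHCOPY_POLICY = {
    "page_datasets": "once",
    "coupling_facility_datasets": "once",
    "jes2_spool": "production",
    "jes2_secondary_checkpoint": "production",
    "sysres": "monthly_after_ipl",
    "dev_test": "daily",
}

def policy_for(volume_class):
    """Look up the copy frequency; unknown classes fall back to the production scheme."""
    return FLASHCOPY_POLICY.get(volume_class, "production")

print(policy_for("sysres"))         # monthly_after_ipl
print(policy_for("page_datasets"))  # once
```

Keeping the policy in one table gives the owner of remote storage a single place to review under change control.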
Support Recommendations
• Production volume consistency group creation
schedule:
– During prime shift create consistency groups at system
default interval of 1-2 minutes, to support CICS and
DB2.
– Pause Global Mirror operation before the batch cycle and
resume after the cycle, to support batch processing.
– For DR test exercises, perform FlashCopies to DR test
instance at a point in time to satisfy the test
requirements.
– Plan to implement TPC for Replication for advanced
capabilities.
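The effect of this schedule on the recovery point can be approximated with a simple model: while Global Mirror runs, the newest consistency group is at most one interval old; while it is paused for the batch cycle, that age grows with the length of the pause. The function below is a simplified sketch under that assumption, with a hypothetical 90-minute pause; it does not query or model the storage subsystem itself.

```python
def recovery_point_minutes(minutes_since_pause, paused, cg_interval_min=2):
    """Approximate upper bound on the age of the newest consistency group.

    While replication runs, the newest group is at most one interval old.
    While it is paused, the age grows by the time elapsed since the pause.
    """
    if paused:
        return cg_interval_min + minutes_since_pause
    return cg_interval_min

print(recovery_point_minutes(0, paused=False))   # 2
print(recovery_point_minutes(90, paused=True))   # 92
```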
Recovery Point Improvements
[Chart: Recovery Point in Time Comparison. X-axis: Time of Day (13:00 through 11:00 the next day); Y-axis: Recovery Point in Time (0 to 35); series: Before, After.]
Total Recovery Improvement
[Chart: Total Recovery Time to Point of Outage. X-axis: Time of Day (13:00 through 11:00 the next day); Y-axis: RTO + RPO to Point of Outage (0 to 80); series: Before, After.]
Summary
• Shift to internal recovery model using:
– Storage replication
– Capacity backup mainframe
• Shift from “restore” to “restart”
• Significant improvement in recovery objectives
Core Efficiency Technologies
• Deduplication: saves up to 90% for full backups
• Storage Virtualization: pooling and sharing heterogeneous storage resources
• Snapshot: space-efficient imaging for data protection
[Graphic callouts: "Save up to 90%", "Save over 70%"]
Deduplication
• Deduplication removes redundant data blocks from volumes,
regardless of application or protocol
• With deduplication, users can recoup 50% or more of their capacity
for many data sets and environments
• Deduplication can be used for primary, secondary, and archival
storage tiers
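To make the idea of removing redundant blocks concrete, here is a minimal sketch of fixed-size block deduplication using content hashes. It is a toy illustration, not how any particular storage product implements deduplication; the 4096-byte block size and the sample data are assumptions.

```python
import hashlib

def deduplicate(data, block_size=4096):
    """Split data into fixed-size blocks and store each unique block once.

    Returns (store, recipe): store maps a SHA-256 digest to block bytes,
    recipe is the ordered list of digests needed to rebuild the data.
    """
    store, recipe = {}, []
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        digest = hashlib.sha256(block).hexdigest()
        store.setdefault(digest, block)
        recipe.append(digest)
    return store, recipe

def rebuild(store, recipe):
    """Reassemble the original bytes from the block store and recipe."""
    return b"".join(store[d] for d in recipe)

# Hypothetical volume: ten blocks, only two distinct patterns.
data = (b"A" * 4096) * 8 + (b"B" * 4096) * 2
store, recipe = deduplicate(data)
saved = 1 - len(store) / len(recipe)
print(len(recipe), len(store), f"{saved:.0%}")  # 10 2 80%
```

The savings depend entirely on how repetitive the data is, which is why the figures quoted for deduplication vary by data set and environment.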
Snapshots
• Locally retained point-in-time copies of file systems which you can use to protect
data
– Single files or complete backup and recovery
• Block-incremental behavior limits associated storage capacity consumption
• Reliable off-media backups without the need for long backup windows
• Simplifies the process of recovering, duplicating, or archiving data
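The block-incremental behavior can be sketched with a toy copy-on-write volume: a snapshot starts empty and only retains the original contents of blocks that are overwritten afterwards. The `Volume` class below is a hypothetical illustration of that idea, not a model of any specific product.

```python
class Volume:
    """Toy volume whose snapshots record only blocks changed after the snapshot."""

    def __init__(self, blocks):
        self.blocks = list(blocks)
        self.snapshots = []  # each snapshot: {block_index: original_content}

    def snapshot(self):
        """Create an empty snapshot and return its id."""
        self.snapshots.append({})
        return len(self.snapshots) - 1

    def write(self, index, content):
        # Preserve the old block in every snapshot that has not saved it yet.
        for snap in self.snapshots:
            snap.setdefault(index, self.blocks[index])
        self.blocks[index] = content

    def read_snapshot(self, snap_id):
        """Return the volume contents as they were when the snapshot was taken."""
        saved = self.snapshots[snap_id]
        return [saved.get(i, b) for i, b in enumerate(self.blocks)]

vol = Volume(["a", "b", "c", "d"])
snap = vol.snapshot()
vol.write(1, "B")
print(vol.read_snapshot(snap))   # ['a', 'b', 'c', 'd']
print(len(vol.snapshots[snap]))  # 1
```

Only one of the four blocks is retained by the snapshot, which is why capacity consumption tracks the change rate rather than the size of the volume.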
Storage Virtualization
• Designed to improve the flexibility and utilization of your
storage resources
• Pools your storage volumes, files and file systems into a
single reservoir for centralized management
• Works with heterogeneous storage systems
• Reduces the effects of hardware configurations and helps
support business continuity
Where do you start?
• Understand your current Resiliency/Recovery capabilities,
limitations, and risks (BIA, Risk Assessment)
• Develop Service Inventory
– Prioritize and define the value/importance to the business of
that IT Service, i.e. Tier 1, 2, or 3
• Define Data Protection Methods
• Review/Define Service Level Objectives/Agreements
• Document Recovery Time Objective (RTO) and Recovery Point
Objective (RPO) for each IT Service
• Design and Build the resiliency model to support the business
supported IT Services
• Develop Data Migration Plan
Develop Service Inventory
• Define tier-specific patterns for
– Virtualization and Cloud Opportunities
– Daily operational quality of service
– High Availability requirements
– Recovery requirements
– Data protection requirements
• Establish service placement in each tier
• Develop budgetary roll-up of all services by tier
• Perform service delivery chain mapping for critical
systems
Sample Service Mapping
Sample IT Service Tiering
Target RPO/RTO for 2010:
Tier    Tier Definition                 RTO          RPO
Tier 1  Communications/Infrastructure   <24 Hours    <12 Hours
Tier 2  Critical Applications           <72 Hours    <24 Hours
Tier 3  Non-Critical Applications       7-10 Days    <24 Hours
Tier 4  Development / Test              30 Days      <24 Hours
Services by Tier:
Network
- Firewall
- Telecom
- Router
- Switch
- VPN
- ACS
Email
- Exchange
- Blackberry
Phones
- SRST
Phone System
- Call Manager
- Unity
- IPCC
SharePoint (intranet)
- intranet.XXX.org
client websites (w/ SLA)
Infra
License Servers (No Cache)
ProjectWise
SIP (2010)
client websites (no SLA)
Sharepoint (non-critical)
Oracle ERP
XXX Apps
Move-it DMZ
Records Management
License Servers (Cached)
SmartFilter
DMS (2010)
TFS
QC
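Once services are placed in tiers, the tier targets from the sample tiering above can be held in a small table and used to check recovery test results. The sketch below is illustrative: the `TIERS` structure and `within_targets` function are hypothetical names, Tier 3 uses the upper end of its 7-10 day range, and treating every target as an inclusive limit is a simplifying assumption.

```python
from datetime import timedelta

# Targets transcribed from the sample tiering: (definition, RTO, RPO).
# Tier 3 uses the upper end of the 7-10 day range (an assumption).
TIERS = {
    1: ("Communications/Infrastructure", timedelta(hours=24), timedelta(hours=12)),
    2: ("Critical Applications", timedelta(hours=72), timedelta(hours=24)),
    3: ("Non-Critical Applications", timedelta(days=10), timedelta(hours=24)),
    4: ("Development / Test", timedelta(days=30), timedelta(hours=24)),
}

def within_targets(tier, downtime, data_loss):
    """Return (rto_met, rpo_met) for a service placed in the given tier."""
    _, rto, rpo = TIERS[tier]
    return downtime <= rto, data_loss <= rpo

# Hypothetical Tier 2 outage: 48 hours down, 30 hours of data lost.
print(within_targets(2, timedelta(hours=48), timedelta(hours=30)))  # (True, False)
```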
Define Data Protection Patterns
• Develop an inventory of data types supporting
complete service delivery
• Based on data protection requirements, determine
best replication methods
• Develop estimated bandwidth and budgetary roll-up
costs to support the preferred replication methods
• Define roadmap or incremental implementation
approach for desired replication methods which
are outside current approved funding
Data Migration Planning
• Discover characteristics of the source data
• Based on data protection requirements, determine
data migration requirements
• Develop data migration plan
• Deliverables:
– Initial and incremental roadmap for replication
implementation
Conclusion
• There is no one-size-fits-all approach.
• Most businesses do not fully understand how vulnerable they are using
existing recovery processes.
• Companies that are looking to improve their RTO/RPO approach need to:
– Look carefully at their business requirements and risk
– Understand IT demands by understanding the applications, the operating
system, storage, network bandwidth requirements and the total business
impact.
• Relating the business support needs to IT Service Management can assist
them in determining which approach is the most suitable.
– Sometimes a combination of elements drawn from several approaches
works best.
– You can mix and match to tailor a solution that meets the unique
needs of any business unit.