MSI SYSTEMS INTEGRATORS Building Better Resiliency Through Data Replication A presentation by: Alan Salesman IT Consultant August 24, 2010 Abstract • Abstract: Are you feeling the pressure for high availability, need quick recovery time without losing data? MSI will be reviewing approaches to improving your Recovery Time Objectives (RTO) and Recovery Point Objectives(RPO). A discussion around technologies, Service Definition, and Process will help you understand how your critical systems can achieve performance and resiliency. Industry Terms and Definitions The Disaster • The catastrophic failure of an IT Service environment caused by an unplanned event. That causes… • A significant and extended failure of an IT service resulting in a severe impact to the business through: • A failure to satisfy a service level agreement (SLA) • Loss of market share • Revenue loss • Exposure to litigation / legal action Definitions Recovery Time Objective (RTO) – is the duration of time and a service level within which a IT service must be restored after a disaster (or disruption) in order to avoid unacceptable consequences associated with a break in IT Service continuity Recovery Point Objective (RPO) – is the point in time to which you must recover an IT Service or data as defined by your organization. This is generally a definition of what an organization determines is an “acceptable loss" in a disaster situation Definitions Replication – the process of sharing information so as to ensure consistency between redundant resources, such as software or hardware components, to improve reliability, fault-tolerance or accessibility. Resilient – An IT service deployment philosophy that describes an IT service that is able to survive the loss of major components of the IT infrastructure without presenting a total loss of functionality to the user. (The application many continue to operate at an altered level of service.) IT Service Continuity Management Spectrum Frequency per year 100 Virus Availability Related Data Corruption Worms Disk Failure Application Outage Network Problem Component Failure Disaster Resiliency 1 Power Failure Building Fire Terrorism/Civil Unrest Natural Disaster ITSCM – Recovery Related 1/100,000 $ $$$ Costs per Occurrence $$$$$ Evolution of Disaster Resiliency The transformation of your IT Service Evolution of Disaster Resiliency 2010 1990 •IT is a competitive Advantage • IT has grown to be a strategic center of companies, not just a cost center •Proprietary Systems • Globalization and collaboration open IT environment •Reactive Disaster Recovery • Proactive Disaster Resiliency/Data Protection Pressure from Both Sides Business Pressure •Globalization •‘”Mobilization” of Data – user access at anytime or anyplace •All data critical •Compliance/Regulatory Time and Expectations •True 24/7 availability IT Pressure •Increased Data Capacity •Data growth at 30% per year •Increased Data Value •Maintenance windows few and far between •Outages are visible, some make the news Replication Model Key Characteristics • • • • • • • • • Data replication is performed for critical data Consistency of replicated data is supported by synchronous or asynchronous techniques Recovery point is based on timing of consistency group creation and tape backup timing Recovery environment provides servers and supporting infrastructure for key applications Servers are inactive, excepts to support non-disk based replication Tests can be performed and should be conducted on a regularly scheduled basis Facilities are provided by a vendor hot site/co-location agreement, company owned internal private cloud or external private cloud Configurations for infrastructure represent a subset of what is needed to support production levels, plus what is needed to support replication method Network bandwidth is established to support the volume of data replication Infrastructure Replication Performed at the infrastructure layer (synchronous or asynchronous) • • • • Covers both geo-proximate or Geo-Remote dual data center designs. Application neutral and lightweight on the operating system. Asynchronous delivery can require up to three times the amount of physical storage capacity. Requires an high bandwidth Example Services are found in: • • • Peer-to-Peer Remote Copy (PPRC) (IBM) Fast Remote Mirroring FlashCopy Recommended Infrastructure/Policy: • • • • • Geo-Remote data centers with high bandwidth connectivity Redundant/Duplicate Infrastructure A good job scheduler for asynchronous replication Defined Policy/process for recovery or restarting of Services Defined Process for Declaration of Disaster RTO/RPO Target • • Target can be 24 hours or less for RTO Target can be less than 4 hour RPO Client Example for a Recovery Strategy Client Focus • Support sophisticated recovery initiative by implementing storage replication technology to: – Reduce Recovery Time (RTO) – Recovery Point (RPO) MSI Role • Performed a review of IT services included within scope to determine the impact of the new storage recovery environment on recovery exercises, outages and daily processing. • Developed operational recovery procedure recommendations to leverage the capabilities of the new recovery environment and improve upon RTO/RPO objectives. • Developed recovery test timeline recommendations to improve the quality and reduce the duration of recovery test exercises. Beginning State • All mainframe IT services have the same recovery time and recovery point objectives. • Recovery time objective was 72 hours, which could not be achieved. • Recovery exercises were performed during 64 hour sessions at a vendor provided recovery site. • The test process could be completed within 64 hours, but the process was a subset of what would be necessary for a full recovery. Beginning State • The recovery point objective was stated as being no greater than 24 hours from time of failure. • Disaster recovery tapes were not taken off site until 3 pm each day. • The time between tape creation and off-site rotation introduced an additional eight hours to the recovery point, or a total of 32 hours. • On weekends and holidays, cycles were generally not scheduled, therefore backups and off-site rotation did not occur, leaving longer time periods of exposure. Technology • (2) DS8300 Storage Subsystems – Flashcopy – Global Mirror • PPRC • Asynchronous replication – Storage instances • • • • A Copy - Production Local DS8300 B Copy – Performance Remote DS8300 C Copy – Consistency Remote DS8300 D Copy – DR Test Remote DS8300 Strategic Shift to Internal Recovery • Driven by inflexible recovery vendor: – – – – Pricing of services Definition of roles Access to facilities Contention for tests and regional disasters • The existence of a secondary data center • Mainframe cost mitigated by Capacity BackUp (CBU) offering • Open systems also need to be addressed, magnifying inflexibility • Cost concerns for extended stay at recovery site Support Recommendations • Establish clear ownership of remote storage, to include: – FlashCopy schedules. – Bandwidth utilization. – Global Mirror and Consistency Group validation and monitoring process. – Change control. Support Recommendations • Volume specific FlashCopy methods: – Page datasets and coupling facility datasets should be replicated once to create an instance on the recovery platform and never replicated again. – JES2 Spool and secondary checkpoint should follow the same FlashCopy scheme as other production volumes. – SYSRES volumes should be replicated once per month, following a successful IPL. – Development and Test volumes should be replicated, and FlashCopied once per day. Support Recommendations • Production volume consistency group creation schedule: – During prime shift create consistency groups at system default interval of 1-2 minutes, to support CICS and DB2. – Pause Global Mirror operation before cycle and resume after cycle, to support batch processing. – For DR test exercises, perform FlashCopies to DR test instance at a point in time to satisfy the test requirements. – Plan to implement TPC for Replication for advanced capabilities. Time of Day 13:00 14:00 15:00 16:00 17:00 18:00 19:00 20:00 21:00 22:00 23:00 0:00 1:00 2:00 3:00 4:00 5:00 6:00 7:00 8:00 9:00 10:00 11:00 Recovery Point in Time Recovery Point Improvements Recovery Point in Time Comparison 35 30 25 20 15 Before After 10 5 0 Time of Day 13:00 14:00 15:00 16:00 17:00 18:00 19:00 20:00 21:00 22:00 23:00 0:00 1:00 2:00 3:00 4:00 5:00 6:00 7:00 8:00 9:00 10:00 11:00 RTO + RPO to Point of Outage Total Recovery Improvement Total Recovery Time to Point of Outage 80 70 60 50 40 30 Before After 20 10 0 Summary • Shift to internal recovery model using: – Storage replication – Capacity backup mainframe • Shift from “restore” to “restart” • Significant improvement in recovery objectives Core Efficiency Technologies Save up to 90% Deduplication Save over 70% Saves up to 90% for full backups Storage Virtualization Pooling and sharing heterogeneous storage resources Snapshot Space efficient imaging for data protection Deduplication • Deduplication removes redundant data blocks from volumes, regardless of application or protocol • With deduplication, users can recoup 50% or more of their capacity for many data sets and environments • Deduplication can be used for primary, secondary, and archival storage tiers Snapshots • Locally retained point-in time copies of file systems which you can use to protect data – Single files or complete backup and recovery • Block-incremental behavior limits associated storage capacity consumption • Reliable off-media backups without the need for long backup windows • Simplifies the process of recovering, duplicating, or archiving data Storage Virtualization • Designed to improve the flexibility and utilization of your storage resources • Pools your storage volumes, files and file systems into a single reservoir for centralized management • Works with heterogeneous storage systems • Reduce the effects of hardware configurations and helps support business continuity Where do you start? • Understand your current Resiliency/Recovery capabilities, limitations, and risks (BIA, Risk Assessment) • Develop Service Inventory – Prioritize and define the value/importance to the business of that IT Service, i.e. Tier 1, 2, or 3 • Define Data Protection Methods • Review/Define Service Level Objectives/Agreements • Document Recovery Time Objective (RTO) and Recovery Point Objective (RPO) for each IT Service • Design and Build the resiliency model to support the business supported IT Services • Develop Data Migration Plan Develop Service Inventory • Define tier specific patterns for – – – – – Virtualization and Cloud Opportunities Daily operational quality of service High Availability requirements Recovery requirements Data protection requirements • Establish service placement in each tier • Develop budgetary roll-up of all services by tier • Perform service delivery chain mapping for critical systems Sample Service Mapping Sample IT Service Tiering Tier 1 Tier Definitons: Target RPO/RTO for 2010: RTO: <24 Hours Services by Tier: Tier 2 RPO: <12 Hours RTO: <72 Hours RPO: <24 Hours Tier 3 RTO: 7-10 Days Tier 4 RPO: <24 Hours RTO: 30 DaysRPO: <24 Hours Communications/Infrastructure Critical Applications Non-Critical Applications Development / Test Network - Firewall - Telecom - Router - Switch - VPN - ACS Email - Exchange - Blackberry Phones - SRST Phone System - Call Manager - Unity - IPCC SharePoint (intranet) - intranet.XXX.org client websites (w/ SLA) Infra License Servers (No Cache) ProjectWise SIP (2010) client websites (no SLA) Sharepoint (non-critical) Oracle ERP XXX Apps Move-it DMZ Records Management License Servers (Cached) SmartFilter DMS (2010) TFS QC Define Data Protection Patterns • Develop an inventory of data types supporting complete service delivery • Based on data protection requirements, determine best replication methods • Develop estimated bandwidth and budgetary roleup costs to support the preferred replication methods • Define roadmap or incremental implementation approach for desired replication methods which are outside current approved funding Data Migration Planning • Discover characteristics of the source data • Based on data protection requirements, determine data migration requirements • Develop data migration plan • Deliverables: – Initial and incremental roadmap for replication implementation Conclusion • • • • There is no one-size-fits-all approach Most businesses do not fully understand how vulnerable they are using existing recovery processes. Companies who are looking to improve their RTO/RPO approach need to: – Look carefully at their business requirements and risk – Understand IT demands by understanding the applications, the operating system, storage, network bandwidth requirements and the total business impact. Relating the business support needs to IT Service Management can assist them in determining which approach is the most suitable. – Sometimes a combination of drawing elements from several approaches works best. – You can mix and match in order to tailor a solution that meets unique needs of any business unit