Business Continuity Management
Why DR is not enough!
Christopher LaVesser
GE HealthCare
February 15th 2012

Welcome
Welcome ISACA members and friends!
Before we start
• Please silence cell phones and mute remote phones
For local attendees
• WIFI
• Rest facilities
• Food and beverage

Announcements
• February newsletter
• Membership Renewal Deadline
• International Volunteer Opportunities – Deadline, Feb 29
• Exam Prep Review Course
Nominations for elections

Upcoming ISACA KM Events
BMIS - Business Model for Information Security - Mar 14
• 3 CPE, 1:30 PM start time
Ethical Hacking - Security Processes and Procedures - Apr 11
2nd Annual GRC Symposium - May 16
• 8:00AM - 5:00PM
• 8 CPE
• Ken Vander Wal, ISACA International President
• Robert Stroud, ISACA Strategic Advisory Council, itSMF International Board

About GE
• 4 businesses operating in more than 160 countries … 125+ years
• Over 300,000 employees worldwide
• 2010 revenue $150B
Energy Infrastructure
• Power & Water
• Energy Services
• Oil & Gas
Technology Infrastructure
• Aviation
• Healthcare
• Transportation
GE Capital
Home & Business Solutions
• Appliances
• Intelligent Platforms
• Lighting

About Chris LaVesser
IT Service Continuity Manager – GE Healthcare
Responsible for:
• ITSCM Strategy, BCM Alignment, Awareness and Communication
• ITSCM Critical Business Process Alignment
• Data Center Failover and Recovery Plans
• ITSCM Vitality – Architecture, Strategic Platforms, Solutions and Automation
• Hosted Solutions DR Service Catalog
• DR Council
• IT Recovery Plan Approvals
Prior to GE …
• 2004-2011: ProHealth Care, Aurora Healthcare. Information systems strategy; application delivery and recovery strategy
• 2003-2004: Department of Defense (consultant to). Healthcare IT strategy and EMR deployment
• 1996-2003: Hospital Corpsman, United States Navy. Pre-hospital emergency care, preventive medicine, clinic operations, women’s health, information systems development and strategy
• Education: Bachelor of Science, Business Administration; MBA

Agenda
• Business Continuity and Disaster Recovery – SWWC (So What, Who Cares)?
• Business Continuity Methodology
• ITSCM at GE Healthcare
• Assurance and Audit Framework
• DR Technologies and Virtualization: Saving Time & Money
• Justifying and selling ITSCM to your executives

Agenda
Business Continuity and Disaster Recovery - SWWC (So What, Who Cares)?
• Why are we talking about DR … AGAIN?
• Probability ≠ Consequence: What is your cost of downtime?
• Terminology
• Implementing BCP Projects: Issues, Risks and Challenges

So What, Who Cares?
Why are we talking about Disaster Recovery … AGAIN?

Why are we talking about DR?
DR/BC Drivers
• More data than ever before, big data
• More tech-supported workflows
• Strict data/availability requirements, regulatory issues
• Interconnectedness, customer/user disruption and consequence
• Corporate reputation and image
Increasing reliance on availability
DR Competition in the Cloud

Reality Check
Probability ≠ Consequence
• 93% of companies that lost the use of their data center for 10 days or more filed for bankruptcy within one year of the disaster (Source: National Archives & Records Administration, Washington D.C.).
• Of the companies experiencing disasters, 43% never reopen, and 29% close within two years (Source: McGladrey & Pullen, LLP).
• The average cost of downtime as a function of labor (for large organizations) is $1,010,536 per hour (Source: Meta Group Study).
• “Post” Katrina, over 100 healthcare agencies – which served over 21,000 patients combined – in the New Orleans area were flooded or permanently destroyed.
(Source: TMC Healthcare Technology)

Lessons Learned
“Never confuse the PROBABILITY of failure with the CONSEQUENCE of failure.”
~ Rear Admiral Stephen Turcotte, Space Shuttle Columbia Disaster Interagency Investigation Board ~

The Big Guys Fall…Harder

What is your cost of downtime?
• Productivity: Lost shifts, Lost days, Wasted supplies, Missed shipments, Service failures
• Expense: Temporary employees, Temporary facilities, Extra equipment, Non-productive overhead
• Revenue: Compensatory payments, Missed sales, Contract failures, Billing losses
• Reputation: Customers, Creditors, Suppliers, Employees, Distributors, Trade groups, Markets, Competitors
• Performance: Revenue recognition, Cash flow, Missed discounts, Reporting failures
Know Your Downtime Cost Per:
• Hour
• Day
• Week
• Month

So What, Who Cares?
Why are we talking about Disaster Recovery … AGAIN?
We rely HEAVILY on system availability.

BCP ≠ DR ≠ ITSCM
Business Continuity Plans (BCP)
• Define the minimum critical requirements for service delivery and acceptable operating conditions during times of unexpected outage.
• Designed to mitigate potential loss and serve to minimize disruptions and financial impact during even minor events.
Disaster Recovery (DR) / Disaster Recovery Plans (DRP)
• Are IT-focused plans designed to restore operability of the target system, applications, or computer facility at an alternate site after a major and usually catastrophic event.
• Concentrate on deploying technology, writing scripts, and doing annual drills to practice recovery of IT systems after a disaster.
DR is necessary but not sufficient.
IT Service Continuity Management (ITSCM)
• The ‘technical’ subcomponent of BCP.
• Takes the IT-related business threats and addresses them through risk reduction, response, recovery, or avoidance actions.
• Governs the execution of Business Continuity Management requirements.

Business Continuity Management
Is not just another name for Disaster Recovery, it includes…
• Disaster Recovery – IT systems
  • Hardware Restoration – Devices, Network, Communications
  • Software Recovery – OS, Applications, Data
• System Reliability and Availability
  • Security
  • Updates / Patches
  • Interoperability
  • Redundancy
• Business Recovery and Resumption
  • Corporate escalation procedures
  • Total site recovery plans
  • Partial site recovery plans
• Contingency Planning
• Crisis Management
BCM goes beyond BCP and also covers management aspects such as policy, training and awareness, maintenance and exercise, and continuous improvement, as well as understanding the organization and embedding BCM into its culture.

Evolution of Business Continuity
 | 1980s | 1990s | 2000s
Business Focus | Traditional | Dot.com | eBusiness
Requirements | Restore, Recover | High Availability | 24x7, Scalable
Driven By | Regulation | E-Commerce | Competition
Magnified by | Disaster | Absence of Brick and Mortar | E-Commerce
Recovery | Hardware | Hardware and Data | Hardware, Data, Applications
Expectation | Days / Hours | Minutes / Seconds | Minutes / Seconds
Decision | Optional | Recommended | Mandatory

Why are we talking about DR?
• Trends and changes in Business Continuity Management
  • Consistency across acquired and outsourced teams
  • Shifting Environments and Responsibility – new media, new threats (social media, cloud, wireless/mobile devices)
  • Integration between incident management and corporate communications
  • Regulatory and contractual obligations can be met
  • Catastrophic event scenarios - protests and extreme weather
  • Cyberspace attacks (zero-day virus) that impact customer-facing sites, partners and service providers on a much broader scale
  • Testing:
    • Multi-day tests to ease burden on normal business operations
    • Multi-scenario, involves partners/providers
    • Continuity and communications plans (including social media)

Issues, Challenges and Risks
• Senior management commitment and involvement
• Lack of thorough understanding of the data dynamics and dependencies involved in data recovery by BCM practitioners
• Inappropriate approach in executing BCM processes
• Incorrect and/or inappropriate assumptions in formulating business continuity and disaster recovery plans
• Perception that DR is an IT problem
Source: Key Issues, Challenges and Resolutions in Implementing Business Continuity Projects

Issues, Challenges and Risks
• Senior management commitment and involvement
  • Delegation by Senior Management
    • Reduces visibility
    • Leads to lack of serious attention and cross-departmental cooperation
  • BCM Implementation for the Wrong Reasons
    • Compliance driven not business-risk driven
  • Business/IT Disconnect
    • Failure to align system capability with business needs and growth projections results in solution gaps and performance issues
  • Technology-only Approach Toward Resilience
    • People, processes and resources must be part of the equation
  • Absence of a Single BCM Framework Across Multiple Offices
    • For consistency, start with enterprise-wide standard/framework
  • Lack of Consensus Between Senior Management and Operations Management – RPO/RTO

Issues, Challenges and Risks
• Lack of thorough understanding of the data dynamics and dependencies involved in data recovery by BCM practitioners
  • Incomplete Understanding of Data Recovery Requirements
    • Upstream and downstream systems
    • End user computing systems (spreadsheets, batch processes)
    • Synchronized recovery of transactional data
    • Permissions, profiles, configurations, license keys
  • Failure to Consider Full Recovery
    • Return to normal can be as painful as the failover itself

Issues, Challenges and Risks
• Inappropriate approach in executing BCM processes
  • Location-based Risk Assessments
    • Building- or site-based assessment is not as effective as service/process-based risk assessment
    • Co-location arrangements share HVAC, physical security, etc.
  • Inappropriate BIA Approach
    • Analysis of individual business applications conducted in silos; business owners tend to overstate the importance of their function
  • Equal Weight (Risk Priority Number) Assigned to All Risk Attributes
    • Failure Modes and Effects Analysis (FMEA) used alone may result in unnecessary investments for low-severity risks
Risk 3 is more critical than the other two risks. When risks are prioritized for treatment, the effective way to establish risk acceptance criteria is to use both RPN and criticality.

Issues, Challenges and Risks
• Incorrect and/or inappropriate assumptions in formulating business continuity and disaster recovery plans
  • Failure to Consider All Relevant Assumptions and Limiting Factors
    • Disasters do not occur in isolation: multiple systems and processes are impacted by disasters
    • Multiple businesses and service providers can be impacted by disasters
    • Competition for scarce resources in disasters can negatively impact recovery times
    • Local/regional disasters impact availability of key human resources – people first.
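
The earlier point about using both RPN and criticality can be made concrete with a small sketch. It assumes the common FMEA convention that RPN = severity × occurrence × detection, and it takes criticality as severity × occurrence; both formulas, the `Risk` type and the three sets of 1-10 scores are hypothetical illustrations, chosen so that all three risks share the same RPN.

```python
from typing import NamedTuple


class Risk(NamedTuple):
    """One FMEA line item; all scores are hypothetical 1-10 ratings."""
    name: str
    severity: int     # how bad the effect is if the failure happens
    occurrence: int   # how often the failure is expected
    detection: int    # how hard the failure is to detect (10 = hardest)

    @property
    def rpn(self) -> int:
        # Assumed convention: Risk Priority Number = S x O x D
        return self.severity * self.occurrence * self.detection

    @property
    def criticality(self) -> int:
        # Assumed convention: criticality = S x O (detection left out)
        return self.severity * self.occurrence


RISKS = [
    Risk("Risk 1", severity=2, occurrence=10, detection=5),
    Risk("Risk 2", severity=5, occurrence=4, detection=5),
    Risk("Risk 3", severity=10, occurrence=5, detection=2),
]

# Rank by criticality first and use RPN only to break ties.
ranked = sorted(RISKS, key=lambda r: (r.criticality, r.rpn), reverse=True)

for r in ranked:
    print(f"{r.name}: RPN={r.rpn}, criticality={r.criticality}")
```

With RPN alone the three risks look identical; adding criticality surfaces the high-severity item, which is the behavior the slide argues for.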

Agenda
Business Continuity Methodology
• Awareness / Organization and Planning / Site Evaluation
• Risk Assessment
• Business Impact Analysis
• Strategy Development
• Recovery Plan Development
• Validation/Testing
• Recovery Plan Implementation, Maintenance

Business Continuity Methodology
[Diagram: methodology cycle with the stages Awareness, Organization & Planning, Site Evaluation, Risk Assessment, Business Impact Analysis, Strategy Development, Recovery Plan Development, Validation and Testing Procedures, Recovery Plan Training, Recovery Plan Approval, Recovery Plan Implementation, Maintenance Program]
Most BCM activity is limited to just DR testing and, if you are lucky, DR training.

Business Continuity Methodology
Awareness
• Over 90% of BCP projects fail at the onset because of a lack of awareness and support from upper management.
• First goal is to help develop awareness to confirm commitment.
• Trending towards board-level initiation of BCP – risk, image
• Integrate into Operating Mechanisms:
  • C-Level Accountability Reviews
  • Architecture Review Boards and Coaching Sessions
  • Project Steering/Tollgate Reviews
  • Project Manager Training Programs
• Shouldn’t always be driven by IT.

Business Continuity Methodology
Organization & Planning
• Similar to other large projects
  • Obtain management approval for funding, personnel, etc.
  • Develop overall program plans: implementation plan, communications plan, change control, maintenance and continuous improvement plans
• For BC/DR Programs - Develop standards and guidelines
  • Scope (of the system portfolio: Must, Should, Could)
  • Roles and Responsibilities
  • Document creation, review, approvals, storage, maintenance
  • Initiating, Aligning and Engaging Resources
  • Create templates
    • Crisis Management
    • Business Impact Analysis / Business Continuity Plans
    • Disaster Recovery Plans and Disaster Recovery Test Plans
    • Test Summary Reports
  • Determine test cycles: what to test, when and how often

Business Continuity Methodology
Organization & Planning, continued
• For discrete BC/DR projects
  • Develop project plan
  • Initiate/align/engage resources
    • Business Owners
    • IT Owners
    • Technology Owners
  • Create/update documents
  • Obtain approval of documents
  • Test Business Continuity Plans
  • Test Recovery Plans
  • Conduct After Action Review – What went wrong and what went right.
  • Close issues / mitigate gaps
  • Update documents
  • Action plan for improvement (issues and scope)
  • Schedule next test – Retest, regular test cycle, change/release cycle

Business Continuity Methodology
Site Evaluation(s)
• Identify and document your business locations
• What business functions are performed there?
  • Manufacturing / Industrial
  • Sales / Retail
  • Administration / Finance
  • Customer Service / Call Center
  • Data Center
• What vital records (paper, electronic) are stored there?
• Identify and document Business Systems – systems needed (e.g. accounting – are manual processes defined, existing procedures documented)
• Identify and document IT Systems - hardware, software, recovery plans, telecommunication, people, process, procedures, etc.

Business Continuity Methodology
Risk Assessment
Should be performed/reviewed annually
• What are the business vulnerabilities?
  • Hardware
  • Information / Data
  • Systems and Processes
  • Buildings
  • People
  • Partnerships / Suppliers
  • Other
• Determine if there have been past interruptions
• What is the potential for future disruptions?
• What is the cost of that disruption if it is known?
• Need to consider factors unique to local environment/community.
• Evaluate and rate the probability of business interruptions….

Risk Assessment
Evaluate and Rate Probability of Business Interruptions
• Physical
  • Human Error
  • Hardware Failure
  • Fire, Smoke, Water
  • Loss of Power
  • Malicious Attack
• Natural
  • Flood / Tsunami
  • Tornado / Hurricane
  • Volcano
  • Earthquake
  • Sink Holes
• Electronic
  • Bugs
  • Viruses
  • Hacking
  • Sabotage
  • Denial of Service
• Other
  • Strike
  • Material Shortage
  • Supplier Loss
  • Geo-Political Events
  • Endemic/Pandemics
  • More…

Risk Assessment
Considerations in analyzing risk will include the following:
• Investigating the history and frequency of particular types of disasters (often versus seldom).
• Determining the degree of predictability of the disaster.
• Analyzing the speed of onset of the disaster (sudden versus gradual).
• Determining the amount of forewarning associated with the disaster.
• Estimating the duration of the disaster.

Risk Assessment, Example
Natural Threats | Description | Probability of annual occurrence | Impact (0-3) | Risk Value (0-10)
Internal flooding | Dry pipe: Probability of inadvertent activation is 10 raised to -6. Raised floor has all cooling fluids below surface. Drainage tiles are in place and have proven to work in the 2010 flood that surrounded building. | 0 | 2 | 0
External flooding | One flood has occurred in last 5 years. Cars were floating in parking lot. Water did seep into conduit; however, drain design prevented any damage. | 3 | 1 | 3
Internal fire | There have not been any internal fires and data center has dry pipe. Building has gas suppression and water to save building. Based on historical accident data, in combination with site-specific data, such as the mass and arrangement of the cabling, the probability of a fire occurring in the data center area is usually estimated to be “somewhat likely” (0.0001 - 0.01/yr). A fire in such a data center, particularly under the raised floor, could result in significant downtime - in excess of a week or more. This would typically be attributed to the mass of cabling in many areas and the toxic and corrosive combustion products resulting from a PVC fire. | 0.1 | 1 | 0.1
External fire | According to NFPA, the probability of an external commercial fire that could potentially spread to other buildings is .452% | 0.0452 | 3 | 0.1356
Seismic activity | The 1947 Wisconsin earthquake took place on May 6, 1947 immediately south of Milwaukee at 4:25 a.m. It was the largest tremor to be historically documented in Wisconsin, though it was not recorded by seismographs. No serious property damage resulted here or elsewhere, and the Milwaukee Gas Light Co. reported no breaks or trouble in the gas system. | 0 | 2 | 0
High winds | There are redundant telecom entrance lines that are below ground. However, they are below ground on company property but are above at carrier. | 10 | 1 | 10
Snow and ice storms | There are redundant telecom entrance lines. However, they are below ground on GE property but are above at carrier. | 9 | 1 | 9
Tornado | Since 1950, 14 tornadoes have been reported in Milwaukee County. The earliest tornado in the City of Milwaukee was March 8, 2000. The F1 tornado with winds up to 110mph started near the airport and then went through Cudahy and St Francis. The strongest ever recorded tornado in Milwaukee was an F2 on August 4, 1980. The Tower building can withstand an F2. | 0.2 | 3 | 0.6
Hurricane | N/A | 0 | 3 | 0
Epidemic | In the last 30 years, there have been the Avian Influenza and the 1993 Milwaukee Cryptosporidium outbreak, a significant distribution of the Cryptosporidium protozoan in Milwaukee, Wisconsin, and the largest waterborne disease outbreak in documented United States history. | 1 | 1 | 1
Tidal Wave | Tower is sufficiently inland. | 0 | 0 | 0
Typhoon | No recorded typhoon in Midwest | 0.002 | 0 | 0

Risk Map
[Figure: Milwaukee Site Natural Threats. Scatter plot of Probability (vertical axis) against Impact (horizontal axis), with points for Internal fire, Tornado, External fire, High winds, Snow and ice storms, External flooding and Epidemic.]

Business Impact Assessment
Purpose
This BIA documents the critical business processes of the system to be used in disaster recovery planning.
• Identify all the mission critical applications and functions in the environment.
• Priority Planning – Identify core biz apps & systems. Rate them. Mission critical if it must be restored in less than 2 hours because the financial impact is so great.
• Classifying the core function. Quantifying the impact. Determine recovery priority.
• Business Function – more in depth. Core functions. (quantify the impact on manufacturing plant lines)
Through this analysis, the following are documented:
• The potential impact on the business if the system or parts of the system were unavailable, as it applies to disaster recovery planning.
• The business continuity plan to develop workarounds for business users to operate during the system disruption time.
• Recovery objectives will state the length of time the business can tolerate an outage and how much data the business can risk losing.
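
The Risk Value column in the Milwaukee natural-threats example is consistent with multiplying the probability column by the impact column (for instance 0.0452 × 3 = 0.1356). A minimal sketch of that calculation follows, using the figures from the example table; the `Threat` type and the ranking step are illustrative assumptions, not part of the original assessment.

```python
from typing import NamedTuple


class Threat(NamedTuple):
    name: str
    probability: float  # "Probability of annual occurrence" column
    impact: int         # "Impact (0-3)" column

    @property
    def risk_value(self) -> float:
        # Assumed formula: risk value = probability x impact
        return self.probability * self.impact


# Figures copied from the example table.
THREATS = [
    Threat("Internal flooding", 0.0, 2),
    Threat("External flooding", 3.0, 1),
    Threat("Internal fire", 0.1, 1),
    Threat("External fire", 0.0452, 3),
    Threat("Seismic activity", 0.0, 2),
    Threat("High winds", 10.0, 1),
    Threat("Snow and ice storms", 9.0, 1),
    Threat("Tornado", 0.2, 3),
    Threat("Hurricane", 0.0, 3),
    Threat("Epidemic", 1.0, 1),
    Threat("Tidal Wave", 0.0, 0),
    Threat("Typhoon", 0.002, 0),
]

# Highest risk value first, so the threats that need treatment lead the list.
ranked = sorted(THREATS, key=lambda t: t.risk_value, reverse=True)

for t in ranked:
    if t.risk_value > 0:
        print(f"{t.name}: {t.risk_value:.4f}")
```

Threats with a zero risk value are left out of the printout, which matches the set of points that appear on the risk map.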
Results will drive development of strategy and scope.

Business Impact Assessment
Scope of the BIA
• Description of the system
• What is in scope – Applications, Modules, Interfaces, Peripherals
• What is not in scope – Upstream/downstream systems
• Assumptions – Some systems your plan(s) depend on are available
The results of this analysis can be used for the following:
• To identify the business requirements for the user acceptance segment of DR testing
• To create the disaster recovery (DR) test plan
• To develop strategic site-based IT Service Continuity planning; application information can be used to compare applications located at a site and assign an appropriate and objective recovery priority level

Business Impact Assessment
• What business functions use this system?
• At what point during the year is the system used most frequently?
• Is this system in scope for any compliance regulations?
• Did any unplanned service disruptions occur in the past 2 years associated with the system that caused an impact to the business and/or users of the system?
  • What was the impact of the disruption?
• Should be reviewed/updated annually, and after any major change/event.

Business Impact Assessment
Critical business process: […] Describe the process within the application. Example: Create purchase order.
Process identifier number: [SYSID-0#]
This sub process aligns to the following Business Process(es): Go To Market (GTM); New Product Intro (NPI); Inquiry to Order (ITO); Order to Remittance (OTR); Close the Books (CTB); Service; Post Market; Customer Loyalty; Other ____________________
Critical business process owner: Should be listed in the Communications Plan.
System dependencies for this process: List any interfaces, applications, and/or other data inputs needed for this process. This information will help trace applications and their respective DR plans to each other.
What is the input of this process? […] Describe inputs required for this process to work. This may be an input from a dependency listed above.
What is the output of this process? […] Describe outputs required for this process to work. This may be an output from a dependency listed above.
How frequently does this process occur? […]
If this process were unavailable, what would be the operational impact? […]
What else is affected if this process were unavailable? […]
If this process were unavailable, what are the compliance risks? […]
In the event of a disaster, does this process need to be available? [Yes/No] If yes, this business process shall trace to a test script within the application’s DR test plan.
Is there a workaround that can be used until the application is fully restored? [Yes/No] If yes, explain that workaround in Section 7 of this document.
These processes will be tested during the DR test, after the recovery steps, to ensure the critical processes are available in a DR environment.

Business Continuance Plan
What process does this workaround replace? Is this workaround for the entire application? There may be one workaround for the entire application, or different workarounds documented for specific processes.
Process identifier number: This workaround aligns to business process [#]
Workaround steps: Describe the business continuity plan in the boxes below for short-, medium-, and long-term. Provide details for an appropriate sequence of actions to be executed in the event that the application is not available for the determined period of time. With agreement from the system owner, it is acceptable that the plan may be to take no action, and wait until the application is recovered. The boxes below state the length of time that is considered short, medium and long term, respectively. If the length of time is different during a period in which the application is used more frequently, also document that length of time.
Short-term: Describe the steps that would be used in the short-term. Is there a Work Instruction for this business continuance process? Y/N Has this process been tested? Y/N After x hours/days
Medium-term: Describe the steps that would be used in the medium-term. Is there a Work Instruction for this business continuance process? Y/N Has this process been tested? Y/N After x hours/days
Long-term: Describe the steps that would be used in the long-term. Is there a Work Instruction for this business continuance process? Y/N Has this process been tested? Y/N After x hours/days

Business Continuance Plan
Return to Normal
How will this critical business process return to normal?
• Is this a return-to-normal process for the entire application? There may be one for the entire application, or different processes and quality checks documented for specific business processes.
• If no workaround exists, or is not possible, complete one of these tables for the critical business process and state briefly how the lack of a workaround influences the Recovery Point Objective (RPO) or Recovery Time Objective (RTO) listed below.
Return-to-normal steps:
• Describe the process to be followed to ensure data created/captured during downtime are properly entered into the system for each critical business process, such that normal workflows and reporting with the system can be restored.
• How will these data be verified after the return-to-normal process is complete?
• Who is responsible for verifying the data?
Recovery data source: e.g. Spreadsheet name, journal location, responsible party?
Prerequisites:
• Is there a prerequisite to resuming normal user activity? Y/N
• If yes, does this process have to be completed before allowing normal users back on system?
Dependencies: Which return-to-normal processes must be completed before this process is executed? List in order, and indicate “none” for those with no dependencies.
Communication: How will completion be communicated?

Recovery Objectives
RPO: The maximum amount of data that may be lost when service is restored after an interruption. Measured by the point in the past to which you could restore data from a backup.
RTO: The speed with which systems and processes must be in operation following an incident. This is measured by the maximum time allowed for recovery of an IT service following an interruption.

BCP Communications Plan
Internal Contacts
Contact (or group) role | Contact (or distribution list) name | How to notify?
External (Vendor) Contacts
Contact (or group) role | Contact (or distribution list) name | How to notify?
Communication plan: (How to remain in contact? Is there a physical meeting place? If a teleconference will be used, what is the number?)
Communication plan. Primary point of contact. Who invokes the plan and who needs to know?

BCP Risk
Risk | Mitigation
• Workarounds are not available for some/all of the critical business processes.
• There is a risk that resources that should be notified that alternate processing is in effect will not receive timely notification if contact and distribution lists are not current.
• Unable to contact or engage sufficient resources required for business continuity plan.
• While the continuity plan is in effect, there is the risk of non-communication or delayed communication of critical issues and detail on them to all parties involved in resolution.
• Business Continuance Plans and workaround processes have not been tested.
• There is a risk that some end users may continue using alternate methods once the system is available again.
• There is a risk of manually captured data being incorrectly entered into the system, or not being entered at all, once the system is available.
• There is a risk that applications and systems that the application depends on have not detailed their business-specific continuity plans and do not have alternate tools and processes in place to deal with an unplanned outage situation.
• There is a regulatory / compliance risk of …. Describe if applicable

Business Continuity Methodology
Strategy Development
• Identify systems or critical needs that will financially justify the strategy moving forward.
• Based on Risk Probability and Business impact, level of disaster, risk appetite, solution will be defined.
• Balance the cost of risk reduction measures and recovery actions
  • Backup and recovery strategy
  • Off-Site Storage
  • Eliminate single points of failure
  • Multiple outsourcing providers – SLAs, Disaster SLAs
  • Resilient IT systems and networks
  • Hot site / High Availability Site, Warm/Cold Site
  • Change controls
  • Greater security control
  • Better service disruption detection
  • Continual improvement

Strategy Development – Recovery Options
• Do nothing
• Manual Workarounds - Administrative actions take a lot of resources
• Reciprocal Arrangements - Agree to use the infrastructure of another organization
• Gradual Recovery – Cold Standby
  • An empty room available (in house or outsourced), mobile or fixed, where IT infrastructure can be rebuilt
  • Takes longer than 72 hours
• Intermediate Recovery - Warm Standby
  • Contract with 3rd party recovery organization to use their infrastructure in a contingency situation (Sungard)
  • 24-72 hours to recover
• Fast Recovery – Hot Standby
  • Recovery site with infrastructure only
• Immediate Recovery – Hot Standby / High Availability
  • Recovery site – full redundancy infrastructure, mirrored data

Business Continuity Methodology
Recovery Plan Development
• Identify resources that enable the business functions and their outputs to be generated
  • Business resources (e.g. fax, photocopiers, staff, application software etc)
  • Support resources (e.g. network drives, servers, PABX etc)
  • Infrastructure resources (e.g. computer rooms, storage facilities, etc)
• Develop the recovery plan based on the Strategy and approved budget.
  • e.g. data center costs, business (interruption) insurance, etc. could be justified against the projected losses.
Validation of BCM method(s) maintain service and compliance levels Recovery Plan • Overview • Document Purpose • Assumptions • Dependencies/Prerequisites • Inclusions • Exclusions • Limitations • Roles And Responsibilities • Test Items • Test Setup – Prerequisites, How conducted, Who engaged/how • Components And Functions To Be Tested • Infrastructure Recovery Segment • User Acceptance Segment – Traceability to BIA • Components And Functions Not To Be Tested Recovery Plan • Disaster Recovery Physical And Technical Requirements • Disaster Recovery Requirements • Backup Requirements • Other Disaster Recovery Requirements • Physical And Technical Environmental Needs • Physical Environment • Physical Application Architecture • Technical Environment • Risk Strategy • Training Requirements • Test Scripts • Execution Of Test Scripts • Incident Management • Retest Procedure • Acceptance Criteria For Test Scripts Recovery Plan • Acceptance Criteria – Test scripts passed, RPO/RTO met • Return To Normal Operations – Back out changes, restore/revert connections, take down test DBs, Business As Usual (BAU) mode • Exit Criteria – Test scripts approved, issues closed, BAU mode • Qualified Infrastructure • At Time Of Disaster – How is test plan different from the recovery plan? 
• References • Internal Quality References • Project Documentation References and Work Instructions • Glossary: Acronyms/Definitions Used
Test Scripts: DR Rehearsal Tasks (columns: Task, Owner, Actual Start Time (CST), Actual End Time (CST), Elapsed time)
Disaster • Disaster Occurs • Activate Command Center (Owner: CCT Team) • Damage Assessment (Owner: ComputeFacilities) • Disaster Declaration (Owner: CTO)
Disaster Mobilization • Initiate Tcon / WebEx (Owner: DR Leader) • Notification to Business (Owner: CCT) • Escalation (Owner: CCT)
Command Center • Invoke Business Communication Plan (Owner: DR Leader) • Mobilize Recovery Teams (Owner: CCT / SP DR Lead) • Review and Verify Customer Specific Info (Owner: CCT & Compute) • Initiate Recovery Process (Owner: CCT) • Identify alternate location and target h/w (Owner: ComputeFacilities) • Engage the Facilities team to rack, power and network the equipment (Owner: ComputeFacilities) • Rebuild the Catalogue for NetBackup to get list of tapes to recall from Iron Mountain (Owner: Storage Admin) • Request Iron Mountain rep to deliver tapes (Owner: Storage Admin) • Re-import the tapes to appropriate data center tape library (Owner: Storage Admin)
Server restore • Root – Operating System (Owner: Compute - Server) • SAN / NAS configuration (Owner: Compute – SAN) • Application Directories (Owner: Compute - Server) • Application Data (Owner: Compute - Server) • Engage networking to make DNS changes if necessary, reconfigure F5 (Owner: Compute - Server) • Reboot after all configurations (Owner: Compute - Server)
NAS/SAN Recovery • Export file systems (Owner: NAS Admin) • Open SRA for the exports and quotas (Owner: NAS Admin) • Mount NAS Points to Directories on Servers (Owner: Unix Admin)
Database restore • Engage DBA Team (Owner: DBA) • Restore PRODdb to DRdb DR instance (Owner: Data Protection Team) • Startup the recovered DRdb (Owner: DBA)
Web Server & App Server restore • Configure Application on DR target (Owner: Web Services) • Configure Connection Pools (Owner: Web Services)
Where applicable Citrix • Check published application is available (Owner: Citrix Team)
Validation • H/W validation (Owner: Compute) • Database validation - Status check (Owner: Compute)
Test Application • Test Script 1 (Owner: Functional Team) • Test Script 2 (Owner: Functional Team) • Test Script 3 (Owner: Functional Team) • Test Script 4 (Owner: Functional Team) • Test Script 5 (Owner: Functional Team) • Calculate RPO / RTO (Owner: DR Leader)
Return to Normal Operations • Revert connections (Owner: Networking Team) • NAS Request to release temporary storage (Owner: Storage Team) • Take down DR database (Owner: DBA Team) • Disable access to DR environment (Owner: Networking Team) • Resume functions (Owner: Business) • Return home (Owner: ALL) • Close the TCON (Owner: CCT) • Document findings (Owner: ALL)
Recovery Objective vs. Recovery Results • RPO (Hours): objective 1:00, result 0:00 • RTO (Hours): objective 24:00, result 0:00
Business Continuity Methodology
Validation and Testing Procedures • The plan must be tested to: ensure completeness and accuracy of recovery steps; uncover any problems or overlooked issues; verify recovery objectives can be met
Testing Approaches • Table Top - Roundtable review (virtual dry run) of plan to check completeness and begin training; also used when a physical test is too risky (full failover), cost prohibitive (total loss of datacenter) or impossible to simulate (pandemic) • Partial Test – Segments can be tested separately – backup / recovery • Full Test: either on weekends or during a planned outage, business functions, IT systems and communication methods are interrupted and the plan is tested; an alternative is to go off site and restore without access to the main facility.
Business Continuity Methodology
Recovery Plan Implementation • Once the plan has been tested and “approved”, it is then implemented • The plan must be stored in a secure common location, and that location communicated to business departments. • Procedures on when it should be accessed are communicated.
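The "Calculate RPO / RTO" step and the elapsed-time columns of the rehearsal sheet can be sketched in a few lines of Python. This is a minimal sketch, not the deck's actual tooling: the function names, the timestamp format and the sample timestamps are hypothetical, and only the 1:00 (RPO) and 24:00 (RTO) objectives come from the rehearsal sheet. It assumes achieved RPO is measured from the last recoverable data point to the disaster, and achieved RTO from the disaster to service restoration.

```python
from datetime import datetime, timedelta

# Objectives as listed on the rehearsal sheet: RPO 1:00, RTO 24:00 (hours).
RPO_OBJECTIVE = timedelta(hours=1)
RTO_OBJECTIVE = timedelta(hours=24)

TIME_FORMAT = "%Y-%m-%d %H:%M"  # assumed log format (hypothetical)


def elapsed(start: str, end: str) -> timedelta:
    """Elapsed time between an 'Actual Start Time' and an 'Actual End Time'."""
    return datetime.strptime(end, TIME_FORMAT) - datetime.strptime(start, TIME_FORMAT)


def recovery_results(last_recoverable_data: str,
                     disaster_occurred: str,
                     service_restored: str) -> dict:
    """Compare achieved RPO/RTO against the objectives."""
    rpo = elapsed(last_recoverable_data, disaster_occurred)  # data-loss window
    rto = elapsed(disaster_occurred, service_restored)       # outage duration
    return {
        "rpo": rpo,
        "rto": rto,
        "rpo_met": rpo <= RPO_OBJECTIVE,
        "rto_met": rto <= RTO_OBJECTIVE,
    }


# Hypothetical rehearsal timestamps, for illustration only.
results = recovery_results(
    last_recoverable_data="2012-02-15 07:30",
    disaster_occurred="2012-02-15 08:00",
    service_restored="2012-02-16 02:15",
)
for name, value in results.items():
    print(f"{name}: {value}")
```

The same `elapsed` helper could fill the "Elapsed time" column for each rehearsal task, which makes it easier to see which recovery step consumes most of the RTO budget.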
Maintenance Program • Very important to keep it updated and review regularly • Very important to have a risk-based approach to frequency and scope of testing (criticality, availability, vulnerability) • Define the change control • Recommend a centralized department that encompasses security, asset management, Business Continuity Planning
Agenda: ITSCM at GE Healthcare • Current State of DR • Goals, Objectives and Strategic Imperatives • Why not just DR? • ITSCM – A new philosophy and new imperative • Virtualizing DR
What IF … we lost one of our key data centers? • Multiple businesses impacted • Thousands of systems not available • Recovery time 60-90 days (estimate) • 1 M transactions/day don't happen!
Recovery Goals • Tier 1 recovery < 24 hours • Tier 2 recovery < 72 hours
HISTORICAL SHIFT Consolidation to key locations for cost out
STRATEGIC Strengthen key locations + Leverage virtualization technology + Test and audit, address obsolescence
Current State of DR • Fragmented and Imbalanced • Every application for itself • Compliance focus • Strong facilities and tools
Fragmented and Imbalanced • Too much focus on the application • Too much focus on the smoking hole scenario • Too little focus on the smoking hole scenario • Wildly different approaches, understandings, motivations, etc.
Every Application For Itself • Don’t we have an “easy” button? • App-specific DR plans, tech, tests • Poorly-understood requirements • No standard approach – historically • It’s based on “metal” classification… …no it isn’t.
Compliance focus… SOX Enterprise GE Corporate: ITCF HIPAA HITECH JCAHO Contract Hosted Technology Solutions
Strong facilities and tools • Strategic Data Centers: Milwaukee; Waukesha; Buc, France; Grove, UK; Beijing, China; Chicago • Recovery Tools / Knowledge: Data Domain; Oracle RAC & SQL Polyserve; Virtualization; H/A Clusters; Redundancy; Log Shipping • How can we leverage these?
Goals and Objectives Strategic Imperatives IT Service Continuity Mgmt.
Disaster Recovery & Testing Critical Business Process Alignment Meet Compliance Requirements
Proactive measures reducing risk of disruption to IT services • Ensure business continuity by reducing the impact of disruptive events • Reduce IT vulnerabilities/risk to the business through analysis and risk mgmt. • Prevent loss of reputational value & customer and user confidence • Create tight integration between ITSCM and BCM
Automate the compliance process…eliminate complexity • Define and understand DR architecture options (SRM, HA, Bare Metal, etc.) • Implement technology to simplify DR Testing and Recovery • Define DR architecture standards … enforce through Enterprise Architecture
Understand business impact of disruption to IT services • Establish application mappings to servers and datacenters (2011-2012) • Map applications to critical business processes (2012-2014) • Institutionalize mechanisms to establish and maintain control of relationships
Win at compliance 100% of the time • Define, align with, and meet GEHC enterprise, Managed Solutions, and Corporate Enterprise Risk Management compliance requirements for Disaster Recovery
Why not just DR? • 1900+ applications • Complex architecture • Large, cumbersome, outdated plans • Constantly cycling technology • Constantly adding/retiring systems • Knowledge too far removed from process • DR alone is never enough…
Why not just DR?
1900+ Applications • Inefficient & Expensive • Technology & Process Separation • Compliance Scope: 2011: ~50 Apps, 2012: ~150 Apps • Constant Application Change: Constantly +/- Applications • DR alone won’t get us there…
A new philosophy – Not just DR • Leverage what we’re doing well • Balance focus on smoking hole • Standardize Compliance • DR-ready platforms/technologies • Proactive risk management • BC plans span residual risks (diagram labels: Improve Recoverability; Remove / Reduce Risk; Tightly-Bound BCP Planning; Recovery-Ready Infrastructure)
The ITSCM Imperative (diagram labels: Competitors; NPI; CTB; Ensure Continuous Service; Service; ITO; OTR; Enterprise; Hosted Technologies)
ITSCM: TSO Focus Areas, Guided by DR Council (IT Service Continuity Management; IT Disaster Recovery & Crisis Management)
• Reduce Risk: Platform Risk Remediation; Data Center Risk Remediation; HTS Contract Review
• Improve Recoverability: Recovery Tools (SRM, Data Domain); Virtualization; Balanced Replication Strategy
• Reduce Cost: Recovery Tools (SRM, Data Domain); Virtualization; Balanced Replication Strategy
• Ensure Compliance: SOX-DR; ITCF-DR; HTS-DR
• Also listed: Data Center Core Services Restoration Plans; Standardization
Continuous & iterative: Drive ITSCM maturity 2012 2014
ITSCM Continuous Improvement (diagram labels: New Products/Services; New Markets/Suppliers; Capabilities; App/Data Classification; Business Continuity Planning; New Regulations; Risk Appetite; Targets; Process Improvement; Test Results; Obsolete Technology; Budget; Disaster Recovery Planning; Resource Availability; Priorities; New Technology)
Agenda: Assurance and Audit Framework • GE Compliance Framework – Mapping DR to regulations • ITSCM Framework - COBIT • ITSCM Assessment Checklist
GEHC Compliance Framework Healthcare Industry Standards Area HIPAA FDA CFR IT Continuity Plans 164.308(a)(7)(ii)(B) 164.308(a)(7)(ii)(E) 164.310(a)(2)(i) Critical IT Resources 164.308(a)(7)(ii)(C) Maintenance of the IT Continuity Plan 164.308(a)(7)(ii)(D) Testing of the IT Continuity Plan 164.308(a)(7)(ii)(D) SD 4.5.5.2 SD 4.5.5.3 SD
App K SD 4.4.5.2 SD 4.5.5.4 XI. B SD 4.5.5.4 IT Continuity Plan Training Distribution of the IT Continuity Plan IT Services Recovery and Resumption Post-resumption Review ITIL SD 4.5 SD 4.5.5.1 CSI 5.6.3 IT Continuity Framework Offsite Backup Storage Industry best practices 21CFR11 164.308(a)(7)(ii)(A) 164.310(d)(2)(iv) XI. C 21CFR11 SD 4.5.5.3 SD 4.5.5.4 SD 4.5.5.3 SD 4.5.5.4 SD 4.5.5.3 SD 4.5.5.4 SD 4.4.5.2 SD 4.5.5.4 SD 4.5.5.2 SO 5.2.3 SD 4.5.5.3 SD 4.5.5.4 Internal ISO/IEC 27002:2005 InfoSec 6.1.6 6.1.7 14.1.1 12.1.4 14.1.2 14.1.4 6.1.6 12.1.2 6.1.7 12.1.3 14.1.3 14.1.1 12.1.1 14.1.2 14.1.5 12.1.6 14.1.5 12.1.5 14.1.5 12.1.5 14.1.5 12.1.1 14.1.1 14.1.3 12.1.5 10.5.1 12.1.1 14.1.5 12.1.5
ITSCM Framework - COBIT (columns: COBIT Section, Area, Details (abbreviated))
• DS 4.1 IT Continuity Framework: Develop a framework for IT continuity to support enterprise wide business continuity management using a consistent process
• DS 4.2 IT Continuity Plans: Develop IT continuity plans based on the framework and designed to reduce the impact of a major disruption on key business functions and processes
• DS 4.3 Critical IT Resources: Focus attention on items specified as most critical in the IT continuity plan to build in resilience and establish priorities in recovery situations
• DS 4.4 Maintenance of the IT Continuity Plan: Encourage IT management to define and execute change control procedures to ensure that the IT continuity plan is kept up to date and continually reflects actual business requirements.
• DS 4.5 Testing of the IT Continuity Plan: Test the IT continuity plan on a regular basis to ensure that IT systems can be effectively recovered, shortcomings are addressed and the plan remains relevant
• DS 4.6 IT Continuity Plan Training: Provide all concerned parties with regular training sessions regarding the procedures and their roles and responsibilities in case of an incident or disaster
• DS 4.7 Distribution of the IT Continuity Plan: Determine that a defined and managed distribution strategy exists to ensure that plans are properly and securely distributed and available to appropriately authorized interested parties when and where needed
• DS 4.8 IT Services Recovery and Resumption: Plan the actions to be taken for the period when IT is recovering and resuming services. Ensure that the business understands IT recovery times and the necessary technology investments to support business recovery and resumption needs.
• DS 4.9 Offsite Backup Storage: Store offsite all critical backup media, documentation and other IT resources necessary for IT recovery and business continuity plans
• DS 4.10 Post-resumption Review: Determine whether IT management has established procedures for assessing the adequacy of the plan in regard to the successful resumption of the IT function after a disaster, and update the plan accordingly.
ITSCM Assessment Checklist Item Category 1 2 3 4 Business impact analysis Critical process Communication 5 COBIT Control Objective section Has a Business Impact Analysis (BIA) been conducted? If yes, when was the last update? DS 4.1 DS 4.1 DS 4.2 DS 4.2 6 Workaround DS 4.1 7 Backup site DS 4.9 8 Level of service DS 4.8 DS 4.2 Role and responsibility 10 DS 4.3 11 Operational procedure 12 Critical software and hardware 13 Support equipment BIA DS 4.2 9 Assessment Reference Are critical processes documented and included in the Disaster Recovery Plan (DRP)? Is a communication plan included? Are several communication channels included?
Are call trees and lists, staff names, and recovery procedures documented automated and/or manual? Are there layers of contingencies such as IT and/or manual workarounds documented? Does the DRP provide an alternate site for recovery? Does the DRP specify the level of service (which the business owner has agreed to be acceptable) to be provided while in recovery mode? Does the DRP have distinct management and staff assignment of responsibilities immediately following a disaster and continuing through the period of reestablishment of normal operations? Does the facilities section have predetermined contracts to recover facilities and/or rebuild plans for critical computing equipment and business area workstations? N/A BIA BIA BIA BIA BIA DRER BIA BIA - DS 4.1 Are the operational procedures documented in a systematic fashion that will allow recovery to be achieved in a timely and orderly way? DREP DS 4.3 Does the DRP identify hardware and software critical to recover the mission critical business and/or functions? DREP DS 4.3 Does the DRP identify necessary support equipment (forms, spare parts, office equipment, etc.) to recover the mission critical business and/or functions? BIA
ITSCM Assessment Checklist (columns: Item, Category, COBIT section, Control Objective, Assessment Reference)
• 14 Infra backup (DS 4.3): Is there a back-up generator to support critical systems, technical staff and business area workstations?
• 15 DRP availability (DS 4.7): Is a current copy of the DRP maintained off-site? Reference: NA
• 16 DRP availability (DS 4.7): Is the off-site DRP copy up-to-date? Reference: NA
• 17 DRP availability (DS 4.4): Do all users of the DRP have ready access to a current copy and/or copies at all times? Reference: NA
• 18 Critical data backup (DS 4.1): Are all critical or important data required to support the business being backed-up? How often is the backup? Reference: DREP
• 19 Training (DS 4.6): Does the training, testing/exercise plan list exercise type, sequence, and frequency of occurrence? Reference: -
• 20 Training (DS 4.6): Do all employees responsible for the execution of the DRP receive training? Reference: -
• 21 Testing (DS 4.5): Does the business conduct exercise(s) of the DRP at least annually? Reference: -
• 22 Testing (DS 4.5): Does the document cover method(s) used to test the DRP? Reference: DRER / BIA
• 23 Corrective actions (DS 4.10): Has the DRP corrective action plan been completed and closed? Reference: DRER
• 24 Review and approval (DS 4.1): Are there DRP maintenance procedures and schedules? Reference: DRPlanning
• 25 Change management (DS 4.4): Has the summary of changes made to the plan since last submission been documented? Reference: DREP
Agenda: Disaster Recovery Technologies and Virtualization • VMware – Site Recovery Manager • Data Domain • DR in the Cloud
Virtualizing Disaster Recovery • Site Recovery Manager • What is it? Allows VMs to be moved across sites; digitizes DR Testing Process • Who can use it? Any virtualized infrastructure could be a potential candidate • How does it work? Leverages Data Replication and vMotion
Site Recovery Manager: Automate the recovery process…eliminate complexity • Rapid DR for critical applications and systems • Hardware agnostic, flexible DR design • Eliminate human error through DR automation • Lower DR costs through virtualization
Virtualizing Disaster Recovery • Benefits of Site Recovery Manager • Simplifies recovery process • Reduces complexity and stress • Reduces recovery time • Backup Agents and missed open files are not an issue with an image backup of a virtual server guest • Most SANs have a built-in replication option • Virtual server disk images can be replicated to a remote DR site: allows you to pre-stage servers at remote DR site; consolidating servers reduces hardware (and possibly colocation) costs of remote DR site • Ability to test and fine tune DR procedures with significantly smaller investment • Ideally suited to VMware cluster environment running SAN • Supports many-to-one failover using shared recovery sites
Site Recovery Manager (columns: Recovery Method, Estimated Recovery Time)
• Bare Metal Recovery with Physical Servers: 23 Hours
• Bare Metal Recovery with Virtual Servers: 11.7 Hours
• Bare Metal Recovery with Pre-Staged Images: 1.2 Hours
Which do you prefer?
Virtualizing Disaster Recovery • DR in the Cloud • What is it? Just starting to emerge; connectivity to compute and storage resources hosted on remote, scalable, elastic, multi-tenancy clouds; based on an OPEX model • Who can use it? Suited to all organizations, ranging from small and medium businesses to large enterprises • How does it work? VMs are hosted in a third-party facility and are essentially operating from the cloud
DR Competition in the Cloud
Data Domain • What is it? Disk-based backups with built-in intelligence; Centralized Backup, Management and Reporting • Who can use it? Organizations with large/growing data stores (and budgets); companies offering cloud services; multi-site / multi-data center organizations • How does it work? Deduplication reduces data to smallest size to optimize transfer and storage; disk devices replacing linear tape
Data Domain Benefits • Benefits of Data Domain • Centralized Backup, Management and Reporting • Automate Backup and Recovery Process • 10% of images on tape are unrecoverable (this is optimistic) • Lower cost of operation • 229% increase in ability to run concurrent backups • 16 staff hours/week saved by not changing tapes • Increased reliability • Provides off-site replication for BCP/DR • Reduction in hardware footprint • Eliminates tapes and trucks • Reduced storage costs (30:1 data compression) • Reduced WAN bandwidth: 40:1 reduction for WAN transfer • Improved recovery (restore) time • Data restores in minutes vs. hours/days • Backups available despite community disruption
Legacy Technology …Tape • Challenges at GEHC • Massive data growth (store multiple copies) • Mechanical failures • Reliant on local personnel • Longer recovery times • DR via trucks • Accidental loss • Regulatory compliance challenges • GEHC is replacing all tape backups with 88 DataDomain
Deduplication • Opportunity • Offsite backups 365 days per year • Data securely encrypted at rest and in transit • Devices highly redundant • Improved backup success rates • Not reliant on local personnel • Simplifies Compliance and auditing • Leveraged heavily for application migrations and centralisation
GEHC Enterprise Architecture (diagram labels: Remote Site A; Remote Site B; Backup Server; Application Servers; Data Domain Device; WAN; Replicate Remote Office Data; Data Protection Hubs Buc/Beijing/Milwaukee; Backup Servers; Satellite sites; DataDomain Dedupe Device 90)
Site Deployment Status • US remote sites (45+ sites): 100% • EU: 90% • AP: 75% • Outstanding: Milwaukee Q2 ’12; Waukesha Q2 ’12; Buc Q2 ’12; Cardiff Q2 ’12; Beijing Q2 ’12; Bangalore Q3 ’12
Agenda: Justifying and Selling ITSCM to your executives • Audience • Drivers • Facts • Methods
Justifying – The audience • Who is the audience(s)? Board of Directors; Executive leadership (“C monsters”); Technical leadership; Line of Business/Unit (LOB) leadership; Clients / partners / regulators; Others • May be selling to multiple audiences
Justifying – The drivers • What are the drivers (designed to the audience) • What is a driver? • Some of your work is already done • BIA – quality, depth, accuracy, completeness • Validate the drivers – make sure aimed at right target • Big, long term project requires long term vision • What’s now, what’s next • “Skating to where puck will be” • “Hard” and “soft” drivers
Justifying – The facts • Current benchmarks – where are you now?
• Gap analysis • Use specific, hard metrics wherever possible • SLA, contractual, implicit/explicit requirements • “STROLE” model (risk and reward for each): Strategic – soft, but meaningful; Technical; Reputational – image; Operational; Legal – risks, requirements and regulations; Economic – money
Justifying – The methods • The story – the Business Case, which explains: • What is our situation? (Exec overview) • Is there a problem? How big? Worth solving? • How do we know? (facts) • What if we don’t do it, or delay it? (drivers) • What are eligible options? Tradeoffs? • What will it cost? (Cost/benefit and friends) • What are your recommendations? (the plan) • What will we get/gain/avoid? (benefits)
Lessons Learned • Do • Engage the right SMEs early in the process • System Owners, Architecture, Technology Owners • Communicate to Stakeholders • Form a Business Continuity Steering / DR Council • Leverage “Table Top” or Dry Run when appropriate • Conduct After Action Reviews (AAR) • Formalize, document and approve results • Link ITSM to Change Management to keep plans, scope and recovery options up to date
Lessons Learned • Don’t… • Expect it to work the first time • Try to boil the ocean • Build the IT solution before you define the critical business solution • Rely on plans alone, “planning” is everything • Confuse probability with consequence
So What, Who Cares? Why are we talking about Disaster Recovery … AGAIN? Ensuring continuous service determines not just recoverability, but survival.
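For the "What will it cost?" and "What will we get/gain/avoid?" questions in the business case, one way to put hard numbers on consequence is to price the estimated recovery times from the Site Recovery Manager slide. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the hourly downtime cost is an input that each organization has to supply (the value used below is the Meta Group average quoted earlier in the deck for large organizations), and the three recovery times are the deck's own estimates. It prices consequence only and says nothing about probability.

```python
# Estimated recovery times (hours) from the Site Recovery Manager slide.
RECOVERY_HOURS = {
    "Bare metal recovery with physical servers": 23.0,
    "Bare metal recovery with virtual servers": 11.7,
    "Bare metal recovery with pre-staged images": 1.2,
}


def downtime_cost(hours: float, cost_per_hour: float) -> float:
    """Cost of an outage lasting `hours` at a flat hourly rate (a simplification)."""
    return hours * cost_per_hour


def compare(cost_per_hour: float) -> list:
    """Return (method, hours, outage cost, cost avoided vs. slowest method)."""
    baseline = max(RECOVERY_HOURS.values())
    rows = []
    for method, hours in RECOVERY_HOURS.items():
        cost = downtime_cost(hours, cost_per_hour)
        avoided = downtime_cost(baseline - hours, cost_per_hour)
        rows.append((method, hours, cost, avoided))
    return rows


# Hourly figure quoted in the deck (Meta Group, large organizations);
# replace with the organization's own BIA-derived number.
for method, hours, cost, avoided in compare(cost_per_hour=1_010_536.0):
    print(f"{method}: {hours} h, outage cost ${cost:,.0f}, avoided ${avoided:,.0f}")
```

A flat hourly rate is a deliberate simplification; a BIA typically shows cost rising non-linearly with outage duration, so the per-method figures are best read as a lower bound for longer outages.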
References
ISACA
• Business Continuity and Assurance Program - http://www.isaca.org/KnowledgeCenter/ITAF-IT-Assurance-Audit-/Audit-Programs/Documents/WAPBC-Mgmt1Sept2011.doc
• BC / DR Planning Community - http://www.isaca.org/Groups/ProfessionalEnglish/business-continuity-disaster-recovery-planning/Pages/Overview.aspx
• Business Model for Information Security (BMIS) - http://www.isaca.org/Knowledge-Center/BMIS/Pages/Business-Model-forInformation-Security.aspx
Disaster Recovery International - https://www.drii.org/
Disaster Recovery Journal - http://www.drj.com/
Questions?
Closing comments (if any)