Disaster Recovery and Business Continuity Planning in a University Environment

Mardecia Bell
Ann Harris

Copyright Mardecia Bell/Ann Harris 2005. This work is the intellectual property of the authors. Permission is granted for this material to be shared for non-commercial, educational purposes, provided that this copyright statement appears on the reproduced materials and notice is given that the copying is by permission of the authors. To disseminate otherwise or to republish requires written permission from the authors.

The realization that one data center was a single point of failure for both the central academic and administrative IT environments prompted NC State University to implement a disaster recovery strategy for communications and critical applications residing in the mainframe and open systems computing environments.

History/Timeline
1997: Initiated with the administrative environment; Mainframe environment recovery test
1999: Y2K - Business Continuity concept; Acquired central repository software (LDRPS)
2001: Scheduled annual Mainframe recovery test; Included communications & academic environment
2002: Expanded to include Enterprise Business Continuity/Disaster Recovery Planning
2004: Successful DR test of ERP systems
2005: Co-processing of production services began in Data Center II

Implementation Steps
• Gain Sponsorship
• Establish Steering Committees
• Develop University Policy/Regulation
• Create DR Structure/Establish Staffing
• Market Program
• Establish Central Repository
• Review & Test Plans Regularly

Gain Sponsorship
• Office of the President – University System
• Chancellor
• Executive Management
  – Present your Business Case
  – Identify the roles involved
  – Provide Executive Summary of BC/DR Program
  – Present Statement of Work and Project Plan
• Add responsibilities to staff work plans

Establish Steering Committees
• IT Steering Committee
• Business/Service Steering Committee
• Both committees are composed of
  – Vice Chancellor/Vice Provost Level
  – Representatives from Critical Areas of the Campus
  – Ex Officio members from IT areas
• Mission of IT Steering Committee
  – Provide guidance and oversight for the combined academic and administrative Disaster Recovery Plan.

Policy/Regulations/Rule
• Develop a Policy or Regulation to affirm the mandate and promote cooperation

Divide Campus Into Groupings
• Space/Facilities
• Teaching and Academic Programs
• Academic IT
• Administrative IT
• Environmental Health and Public Safety
• Business Administration
• Research Programs
• Student Affairs
• Extension and Engagement

Resource Projections
• Hire Full-Time Business Continuity and Disaster Recovery Personnel
  – Director of Business Continuity (plus 1 Business Analyst)
  – Admin IT DR Coordinator (plus 1 Business Analyst)
  – Academic DR Coordinator (part-time)
• Add BC/DR responsibilities to work plan of existing staff
• Identify Coordinators for each business unit

Marketing
• Present at campus departmental meetings
• Create a Website
• Utilize listservs
• Campus Newspaper
• Network with peer institutions
• Remain abreast of industry standards
• Attend conferences, workshops and seminars

Establish Central Information Repository

Continuous Implementation

Accomplishments
• Disaster Recovery and Business Continuity Plan
• Risk Assessments for Critical Business Units
• Successful Mainframe Recovery Tests
• Designed and implemented infrastructure for central computing environment (academic & administrative) in secondary data center
• Implementation of recovery strategies in secondary data center
• Creation of Administrative IT Disaster Recovery Unit

Illustration of Various DR Deployments
[Figure: four deployment patterns]
• Fault-tolerant cluster (file and print services)
• Co-processing and load-balancing (ERP)
• Distributed deployment (hosted systems)
• Data replication (mainframe)

Enterprise Resource Planning (ERP) Deployment
• Financial System
• Human Resources (Version 8.8)
• Student Information System (under construction)
[Figure: Campus Users connect to DC I and DC II; the diagram shows Web Servers, Application Servers, DB Servers, and Batch Servers, with data on a Storage Area Network.]

Summary and Future Steps
[Figure: services shown in DC I and DC II, including Novell Directory Services, Novell Email/Calendar, Anti-SPAM, Citrix, Backup/vaulting, Hosted systems, Active Directory / Windows, Storage Area Network, Infrastructure, Database Server, Web Server, Development Server, Mainframe Server, and the ERP Web, Application, DB Server, and Batch tiers. DC I also lists File/Print, User Home.]

Administrative IT Disaster Recovery Unit Mission
• Ensure minimal risk of major disruptions to critical University systems and processes in the event that all or part of its computer operations are rendered inoperable.
• Ensure timely recovery of infrastructure and services in the event of a disruption.
• Ensure that business continuity plans are available and viable relative to its scenario.
Risk Management
• Identify
• Mitigate
• Process Mapping

Risk Management

Risk Mitigation
• Prioritize Actions
• Evaluate recommended Control Options
• Conduct Cost-Benefit Analysis
• Select Controls
• Assign Responsibility
• Develop Safeguard Implementation Plan
• Implement Selected Controls

Risk Assessment (NIST SP 800-30)
• System Characterization
• Threat Identification
• Vulnerability Identification
• Control Analysis
• Likelihood Determination
• Impact Analysis
• Risk Determination
• Control Recommendations
• Results Documentation

Process Mapping

Infrastructure
• Total DR through distributed high availability
• Client Recovery Solutions
• Application Restoration
• Establish collaborative partnerships with other Universities

Client Recovery Solution(s)

Application Restoration
• Event
• Time
• Scope of Impact
  – Infrastructure
  – Software
  – Hardware

Collaborative Partnerships

Vaulting
• Readily accessible
• Secure
• Onsite
• Offsite

Critical Business Units
• Advancement Services
• All Campus Network
• Budget Office
• College of Agriculture and Life Sciences - Personnel Office
• ComTech - Data Networking
• ComTech - Telecommunications
• Contracts and Grants
• Controller's Office
• Enterprise Application and Database Services
• EH&S - Business Continuity
• EH&S - Campus Police
• EH&S - Emergency Response
• EH&S - Environmental Affairs
• EH&S - Health and Safety
• EH&S - Industrial Hygiene
• EH&S - Insurance and Risk Management
• EH&S - Radiation Safety
• EH&S - Transportation
• EH&S - Waste Management
• Enrollment Management - Admissions
• Enrollment Management - Office of Scholarships & Financial Aid
• Enrollment Management - Registration and Records
• Enterprise Technology Services and Support
• Facilities - Construction Management
• Facilities - Design and Construction Services
• Facilities - Operations
• Facilities - University Architect
• Fire Protection
• Foundations Accounting & Investments
• HR - Benefits
• HR - Employment & Compensation
• HR - Human Resource Information Management
• HR - Payroll
• ITD - Business Services
• ITD - Computer Operations
• ITD - Computer Services
• ITD - Systems
• Libraries - Administration
• Materials Management - Materials Support
• Materials Management - Purchasing
• Materials Management - University Graphics
• Real Estate
• Student Health Services
• University Cashier's Office
• University Dining
• University Housing

Business Continuity Planning

Communication
• Consistency in plan updating
• Training
• Partnering
• Emergency Communication standardization
  – Call Trees
  – Mobile Devices
  – Website
  – Incident Command System
  – Call Center

Incident Report Plan

IT Disaster Categorization
• Category 1: A single person or group in a Critical Business Unit (CBU) is unable to perform their critical functions
• Category 2: An entire CBU is unable to perform its critical functions
• Category 3: Multiple CBUs are unable to perform their critical functions
• Category 4: Non-CBUs are unable to perform their critical functions
• Category 5: A widespread event that impacts the entire University

Goals
• Total DR through distributed high availability
• Standardized Emergency Communications
• Immediate Client Recovery Solutions
• Improved RTO

Ann Harris
Asst Dir, Administrative IT Disaster Recovery
919-515-9228
ann_harris@ncsu.edu
http://www.fis.ncsu.edu/dr
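
Illustrative sketch: IT Disaster Categorization

The five categories under "IT Disaster Categorization" above could be applied programmatically, for example in an incident report form. The following is a minimal sketch and not part of the NC State program. The `Incident` fields, the `categorize` function, and the rule that the highest-numbered matching category wins are all assumptions made for illustration; the slides do not say how overlapping conditions are resolved.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class DisasterCategory(IntEnum):
    """The five categories named on the IT Disaster Categorization slide."""
    GROUP_WITHIN_CBU = 1   # a single person or group in a CBU
    ENTIRE_CBU = 2         # one whole CBU
    MULTIPLE_CBUS = 3      # more than one CBU
    NON_CBUS = 4           # units that are not CBUs
    UNIVERSITY_WIDE = 5    # widespread event, entire University


@dataclass
class Incident:
    # Hypothetical report fields; none of these names come from the slides.
    group_within_cbu_affected: bool = False
    cbus_fully_affected: int = 0
    non_cbus_affected: bool = False
    university_wide: bool = False


def categorize(incident: Incident) -> Optional[DisasterCategory]:
    """Return the highest-numbered category whose condition holds.

    Assumption: categories are ordered by increasing breadth, so the
    checks run from Category 5 down to Category 1.
    """
    if incident.university_wide:
        return DisasterCategory.UNIVERSITY_WIDE
    if incident.non_cbus_affected:
        return DisasterCategory.NON_CBUS
    if incident.cbus_fully_affected >= 2:
        return DisasterCategory.MULTIPLE_CBUS
    if incident.cbus_fully_affected == 1:
        return DisasterCategory.ENTIRE_CBU
    if incident.group_within_cbu_affected:
        return DisasterCategory.GROUP_WITHIN_CBU
    return None  # nothing reported as unable to perform critical functions


if __name__ == "__main__":
    report = Incident(cbus_fully_affected=1)
    print(categorize(report).name)
```

Using an ordered enumeration keeps the category numbers from the slide intact, so a report can be compared or sorted by category without a separate lookup table.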