[Project Name]
System Backup, Recovery, and Failover Procedures

Document Overview
Prepared By:
Prepared For:
Date Created:
Last Updated:

RASCI Alignment
(R)esponsible: Technical Lead
(A)uthority: Technical Manager
(S)upport: ITPD Team
(C)onsult: ITPD Management
(I)nform: Project Manager

SYSTEM BACKUP, RECOVERY, AND FAILOVER
PROJECT NAME

Revision Log
Revision   Date         Description of Revision
1.0        10/10/2008   Initial Draft
1.1        10/16/2009   Updated Draft
1.2        10/20/2009   Updated Draft
1.3        11/10/2009   Updated Draft
1.4        11/12/2009   Updated Draft
1.5        12/4/2009    Updated Draft – changes for grammar, a few more examples, other minor changes.
1.6        12/4/2009    Accepted the changes plus other minor changes
Initials (revisions 1.0 to 1.5, as recorded): AL, AL, AL, AL, AK
Initials (revision 1.6): AL

File Name: QUAL042_PROJ Backup Recovery Failover Procedures_20091207_v1.0_1.6.docx
Last Saved: 12/4/2009

Table of Contents
1) OVERVIEW
1.1) OBJECTIVE
1.2) STRATEGY
1.3) ASSUMPTIONS
2) BACKUP SERVICE
3) INSTALLATION OF COMPONENTS
3.1) INSTALLATION OF OPERATING SYSTEM FOR APPLICATION SERVERS
3.2) INSTALLATION OF APPLICATIONS
4) OPERATIONS
4.1) BACKUP REQUIREMENTS
4.2) BACKUP, RECOVERY, AND FAILOVER SITUATIONS
4.3) MONITORING ALERTS
4.4) MONITORING ALERT CLASSIFICATION
4.5) MONITORING ALERT DETAIL
5) TROUBLESHOOTING
5.1) TROUBLESHOOTING PROCEDURES
5.2) ERROR CODE CLASSIFICATION
5.3) ERROR CODE DETAIL
5.4) VERIFY SITUATION
6) PROCEDURES
6.1) BACKUP PROCEDURES
6.2) SHUTDOWN PROCEDURES
6.3) FAILOVER PROCEDURES
6.4) RECOVERY PROCEDURES
6.5) START-UP PROCEDURES
7) DOCUMENT SIGN OFF

1) Overview:

1.1) Objective
This document describes the backup, recovery, and failover processes that will be put in place when a new or updated application is implemented. It also sets out the procedures for backing up, recovering, and failing over the application.
<Below are some examples of topics that could be addressed:
Backup Requirements
Dependencies and Impacts
Alerts and Resolution Procedures
Troubleshooting Procedures
Recovery
Failover
Instructions for areas in question or requiring more detail are provided at the beginning of each section where appropriate. They are colored in blue and should be deleted after completing the document, including this paragraph. Remember, areas can always be flagged as N/A, but be prepared to defend that decision.>

1.2) Strategy
Describe the underlying strategy of the backup, recovery, and failover processes. This is a high-level overview of what the processes are meant to accomplish, not a place for extensive detail; describe only the idea(s) behind this document. Include the different backup scenarios (data, system files, etc.) if relevant, and explain why the backup is not meant to restore a single file, unless that is your design goal.

1.3) Assumptions
The following table identifies the assumptions regarding backup, recovery, and failover activities. It also defines the appropriate contact(s) for addressing those activities, such as a vendor or a separate internal organization.
<Use the table below to list the assumptions related to backup, recovery, and failover activities>

Activity              Internal/External       Organization
Installation of TSM   Internal/IT SERVICES    SEAWSS/SEAUNIX

2) Backup Service
<Please refer to the IT SERVICES documentation regarding the Backup Service and the machine(s) on which the services are deployed. Backups are scheduled in a 12-hour time window. That window could fall in either AM or PM hours, so be aware of which window you are in.>
<If there are any non-standard procedures, please include the information here>

3) Installation of Components
<This section is likely to be applicable only if building a system from scratch>

3.1) Installation of Operating System for Application Servers
Refer to the <insert doc name> for installation procedures for:
1. <OS- pack X>

3.2) Installation of Applications
In most cases the installations are done by different IT SERVICES support teams (SEAWSS, SEASQL, etc.). Please make sure that you include only the comments that are relevant.
Refer to the <insert doc name> for installation procedures for:
1. <ERA Click Commerce
2. Tivoli Client (TSM)
3. Tivoli Data Protection (TDP) client
4. NetBackup Client>

4) Operations

4.1) Backup Requirements
<The backup software used for backup is TSM version 5.5. This software initially communicates with the Backup Master Server over 13782/tcp. After the initial connection, the process will use a random port below 1024/tcp (unless the port range is restricted in the “bp.conf” file). Please include the backups across all environments>
The actual backup procedures can be found in the Procedures section (section 6.1, Backup Procedures).
Identify what is backed up (system registry, OS, data, file system, etc.) and how often.
What is the retention policy for the backups?
How many copies of the backups are stored? (Be careful of mismatches such as a 30-day retention with only 4 copies stored. Retention and storage should match.)
Where are the backup files stored (on site (list where), off site (list where))?
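The caution about retention and stored copies comes down to simple arithmetic, which could be checked as in the sketch below. This is an illustration only: the function names `copies_needed` and `retention_matches`, and the assumption that exactly one backup copy is produced per backup interval, are hypothetical and not part of this template.

```python
import math


def copies_needed(retention_days: int, interval_days: int) -> int:
    """Number of stored copies required to cover the retention period.

    Assumes (for illustration) that one backup copy is produced every
    `interval_days` days.
    """
    if retention_days <= 0 or interval_days <= 0:
        raise ValueError("retention_days and interval_days must be positive")
    return math.ceil(retention_days / interval_days)


def retention_matches(retention_days: int, interval_days: int, copies_stored: int) -> bool:
    """True if the stored copies are enough to cover the retention period."""
    return copies_stored >= copies_needed(retention_days, interval_days)
```

For a weekly full backup, a 30-day retention needs 5 copies under these assumptions, so `retention_matches(30, 7, 4)` reports a mismatch.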
The backup process should kick off based on the following schedule:

Start Time or Prerequisite   Frequency   Content
12:01 AM                     Daily       Back up only the deltas (incremental backup)
12:01 AM, Saturday           Weekly      Back up all records (full backup)

4.2) Backup, Recovery, and Failover Situations
The situations outlined below address any scenarios that might invoke backup, recovery, and failover procedures. Please name the contact person from the SEATSM team (if relevant) who will monitor the backup logs for data integrity, and state how the application team will be notified of a backup event (email, phone call, etc.).
All situations encountered by the testing team have been documented below. The following has been documented for each:
Situation – Defines what happens to the <name of box being addressed> that creates the impacts. This is the trigger that initiates backup/recovery/failover procedures.
Application Impacted – Name of the application that is impacted by the situation.
Description of the Impact – Defines how the identified application is impacted.

Situation                       Application Impacted   Description of Impact        Procedures
1. <Box> shuts down             Oracle                 Will shut down with errors   Follow restart procedures
2. Network Down                 Application XYZ        Will shut down with errors   Follow shut down procedures
3. <Box> is not responding      Application XYZ        Will shut down with errors
Catastrophic hardware failure   Application XYZ

4.3) Monitoring Alerts
This section identifies what is being monitored:
Backup monitoring – <ensure that the tasks mentioned in section 4.1 are monitored.>
Functional monitoring – <ensure that the hosted application is up and responsive to user requests>

4.4) Monitoring Alert Classification
The following classification has been defined to categorize monitoring alerts:
Error Message – Error message issued by any applicable monitoring service as it appears on the console in the data center.
Severity – Definition of the severity of the defined alert. Use the following as a guideline in assigning the severity:

Severity   Description
HIGH       A system outage or complete loss of functionality.
           Loss of crucial functionality
           No workaround exists for the problem
           Keeps people from correctly performing their jobs
MEDIUM     Loss of non-crucial functionality
           Warning message that could eventually result in loss of crucial functionality
LOW        Inconvenience to the operator
           Warning message that will not result in loss of functionality

Description – Description of the alert based on experience in addressing it.
Resolution – Step-by-step resolution procedures to get to the root of the problem. This may refer to another component document if the troubleshooting crosses over multiple components.
NOTE: When adding new alerts and resolutions, place each alert in alphabetical order by its error message.

4.5) Monitoring Alert Detail
The following alerts have been detected in running test scripts throughout system and end-to-end testing.
This is an example; please enumerate the alerts if they are available for your systems.
<Each alert encountered has been documented and includes the following information:>
Trap/Alert Message:
Severity:
Description:
Resolution:

5) Troubleshooting

5.1) Troubleshooting Procedures
Before deciding on an aggressive approach such as replacing the hardware after a failure, attempt some common-sense tasks first. Also consult the SLA (Service Level Agreement) document to determine how aggressive the approach should be in terms of time. If the permissible time frame under the SLA is very short, failover could be the preferred response, if it is possible.
<Here are some examples of common-sense approaches that could be taken:
Ensure that all required services are running; if they are not, try to restart the affected service. If the service does not start manually, try rebooting the machine that hosts it.
Make sure all necessary files are in the proper place.
Wait 5 minutes and check again (network glitch).
Check that the power cords are still attached.>

5.2) Error Code Classification
The following classification has been defined to categorize error codes. It applies to the OS, application, backup services, etc.
Error Code – Error code as it appears in the box/application being addressed in this document.
Severity – Definition of the severity of the defined error. Use the following as a guideline in assigning the severity:

Severity   Description
HIGH       A system outage or complete loss of functionality.
           Loss of crucial functionality
           No workaround exists for the problem
           Keeps people from correctly performing their jobs
MEDIUM     Loss of non-crucial functionality
           Warning message that could eventually result in loss of crucial functionality
LOW        Inconvenience to the operator
           Warning message that will not result in loss of functionality

Description – Description of the error based on experience in addressing it.
Resolution – Step-by-step resolution procedures to get to the root of the problem. This may refer to another component document if the troubleshooting crosses over multiple components.

5.3) Error Code Detail
When adding new errors and resolutions, place each error in order of its code.

Error Message: ORA – 1205: Cannot….
Severity: HIGH
Description: Connectivity is down….
Resolution: Call network admin to resolve connectivity issue

Error Message: ORA – 1307: Connecting to server…
Severity: MEDIUM
Description: Warning message indicating…
Resolution: No action required. If the error occurs once every 5 minutes or more often, call the DBA for assistance.

5.4) Verify Situation
If the above situations or errors occur, confirm the situation:
1. Review error logs and store in __ location for future reference.
Please enumerate the steps that you take before escalating the situation (e.g., verify system/backup log files, verify the consistency of the backups, etc.)
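One way to apply the frequency rule from section 5.3 while reviewing logs is to compare the times at which an error recurs. The following is a minimal sketch, not a prescribed procedure: the function name `needs_escalation`, the assumption that timestamps have already been extracted from the log, and the reading of the rule as "two consecutive occurrences no more than 5 minutes apart" are all assumptions made for illustration.

```python
from datetime import datetime, timedelta


def needs_escalation(timestamps, window=timedelta(minutes=5)):
    """Return True if any two consecutive occurrences are at most `window` apart.

    `timestamps` is an iterable of datetime objects for one error code
    (hypothetical input; how they are parsed from the log is not covered here).
    """
    ordered = sorted(timestamps)
    return any(
        later - earlier <= window
        for earlier, later in zip(ordered, ordered[1:])
    )


if __name__ == "__main__":
    # Illustrative occurrences: the first two are 3 minutes apart.
    occurrences = [
        datetime(2009, 12, 4, 0, 1),
        datetime(2009, 12, 4, 0, 4),
        datetime(2009, 12, 4, 1, 30),
    ]
    print(needs_escalation(occurrences))
```

A single occurrence, or occurrences spaced further apart than the window, does not trigger escalation under this reading.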
6) Procedures

6.1) Backup Procedures
<Insert procedures for each component>

6.2) Shutdown Procedures
<Insert procedures for each component>

6.3) Failover Procedures
<Insert procedures for each component>

6.4) Recovery Procedures
<Insert procedures for each component>

6.5) Start-Up Procedures
<Insert procedures for each component>

7) Document Sign Off
Phase: Design
The (Deliverable Name) document has been reviewed and found to be consistent with the specifications and/or documented project requirements. The signature below documents acceptance of this document and/or work product by the signing authority.

Organization: University of Chicago________________ Contractor________________
Approved by:
Signature: ___________________________________________________________________
Name: ______________________________________________________________________
Title:
Date:

Organization: University of Chicago________________ Contractor________________
Approved by:
Signature: ___________________________________________________________________
Name: ______________________________________________________________________
Title:
Date: