System Backup and Recovery Plan

[Project Name]
System Backup, Recovery, and Failover Procedures
Document Overview
Prepared By:
Prepared For:
Date Created:
Last Updated:
RASCI Alignment
(R)esponsible: Technical Lead
(A)uthority: Technical Manager
(S)upport: ITPD Team
(C)onsult: ITPD Management
(I)nform: Project Manager
SYSTEM BACKUP, RECOVERY, AND FAILOVER
PROJECT NAME
Revision Log
Revision
Date
Initials
1.0
1.1
1.2
1.3
1.4
1.5
10/10/2008
10/16/2009
10/20/2009
11/10/2009
11/12/2009
12/4/2009
AL
AL
AL
AL
AK
1.6
12/4/2009
AL
Description of Revision
Initial Draft
Updated Draft
Updated Draft
Updated Draft
Updated Draft
Updated Draft – changes for grammar, a few more examples,
other minor changes.
Accepted the changes plus other minor changes
File Name: QUAL042_PROJ Backup Recovery Failover Procedures_20091207_v1.0_1.6.docx
Last Saved: 12/4/2009
Page 2 of 2
Table of Contents
1) OVERVIEW
1.1) OBJECTIVE
1.2) STRATEGY
1.3) ASSUMPTIONS
2) BACKUP SERVICE
3) INSTALLATION OF COMPONENTS
3.1) INSTALLATION OF OPERATING SYSTEM FOR APPLICATION SERVERS
3.2) INSTALLATION OF APPLICATIONS
4) OPERATIONS
4.1) BACKUP REQUIREMENTS
4.2) BACKUP, RECOVERY, AND FAILOVER SITUATIONS
4.3) MONITORING ALERTS
4.4) MONITORING ALERT CLASSIFICATION
4.5) MONITORING ALERT DETAIL
5) TROUBLESHOOTING
5.1) TROUBLESHOOTING PROCEDURES
5.2) ERROR CODE CLASSIFICATION
5.3) ERROR CODE DETAIL
5.4) VERIFY SITUATION
6) PROCEDURES
6.1) BACKUP PROCEDURES
6.2) SHUTDOWN PROCEDURES
6.3) FAILOVER PROCEDURES
6.4) RECOVERY PROCEDURES
6.5) START-UP PROCEDURES
7) DOCUMENT SIGN OFF
1) Overview:
1.1) Objective
The objective of this document is to define the backup, recovery, and failover processes that will be put
in place when a new or updated application is implemented. The document also lists the procedures for
backing up, recovering, and failing over the application.
<Below are examples of topics that could be addressed:
• Backup Requirements
• Dependencies and Impacts
• Alerts and Resolution Procedures
• Troubleshooting Procedures
• Recovery
• Failover
Instructions for areas in question or requiring more detail are provided at the beginning of each
section where appropriate. They are colored in blue and should be deleted after the document is
completed, including this paragraph. Remember, areas can always be flagged as N/A,
but be prepared to defend that decision.>
1.2) Strategy
This section is an opportunity to describe the underlying strategy of the backup, recovery, and failover
processes. It is a high-level overview of what the processes are meant to accomplish. This is not the
place for extensive detail, only for the idea(s) behind this document.
Please include the different backup scenarios (data, system files, etc.) if relevant, and
explain why backup is not meant to restore a single file, unless that is your design goal.
1.3) Assumptions
The following table identifies the assumptions regarding backup, recovery, and failover activities. It also
defines the appropriate contact(s) for addressing those activities, such as a vendor or a separate
internal organization.
<Use the table below to list the assumptions related to backup, recovery, and failover activities>
Activity | Internal/External | Organization
Installation of TSM | Internal/IT SERVICES | SEAWSS/SEAUNIX
2) Backup Service
<Please refer to the IT SERVICES documentation regarding the Backup Service and the machine(s) on
which the services are deployed. Backups are scheduled within a 12-hour time window, which can fall
in either the AM or the PM hours, so be aware of which window you are in.>
<If there are any non-standard procedures, please include the information here.>
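As a rough illustration of the 12-hour window mentioned above, the following Python sketch checks whether a backup job that starts at a given time would finish inside its window. It assumes the two windows are midnight to noon (AM) and noon to midnight (PM); the function names and the example durations are hypothetical and are not part of the IT SERVICES Backup Service.

```python
from datetime import datetime, timedelta


def window_bounds(start: datetime) -> tuple[datetime, datetime]:
    """Return the (begin, end) of the 12-hour window that contains `start`.

    Assumption: the AM window runs 00:00-12:00 and the PM window 12:00-24:00.
    """
    midnight = start.replace(hour=0, minute=0, second=0, microsecond=0)
    noon = midnight + timedelta(hours=12)
    if start < noon:
        return midnight, noon
    return noon, midnight + timedelta(days=1)


def fits_in_window(start: datetime, expected_duration: timedelta) -> bool:
    """True if a job starting at `start` is expected to end within its window."""
    _, window_end = window_bounds(start)
    return start + expected_duration <= window_end


# A job starting at 12:01 AM with a 3-hour estimate stays inside the AM window.
print(fits_in_window(datetime(2009, 12, 4, 0, 1), timedelta(hours=3)))
# A job starting at 10:00 AM with the same estimate would run past noon.
print(fits_in_window(datetime(2009, 12, 4, 10, 0), timedelta(hours=3)))
```

A check like this may help when deciding whether a long-running backup should be moved to an earlier start time.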
3) Installation of Components
<This section is likely to be applicable only if building a system from scratch>
3.1) Installation of Operating System for Application Servers
Refer to the <insert doc name> for installation procedures for:
1. <OS- pack X>
3.2) Installation of Applications
In most cases the installations are done by different IT SERVICES support teams (SEAWSS,
SEASQL, etc.). Please make sure that you include only the comments that are relevant.
Refer to the <insert doc name> for installation procedures for:
1. <ERA Click Commerce
2. Tivoli Client (TSM)
3. Tivoli Data Protection (TDP) client
4. NetBackup Client>
4) Operations
4.1) Backup Requirements
<The backup software used is TSM version 5.5. This software initially
communicates with the Backup Master Server over 13782/tcp. After the
initial connection, the process uses a random port below 1024/tcp (unless the port range is
restricted in the “bp.conf” file). Please include the backups across all environments.>
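Because the initial connection depends on port 13782/tcp being reachable, a quick connectivity check can rule out firewall problems before a backup is investigated further. The sketch below uses only the Python standard library; the function name and the host name in the comment are hypothetical, and the check only confirms that a TCP connection can be opened, not that the backup service itself is healthy.

```python
import socket


def can_reach(host: str, port: int = 13782, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


# Example call (hypothetical host name, replace with the Backup Master Server):
# can_reach("backup-master.example.org")
```

The same function could be pointed at other ports if the port range is restricted in the configuration.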

• The actual backup procedures can be found in section 6 (Procedures).
• Identify what is backed up (system registry, OS, data, file system, etc.) and how often.
• What is the retention policy for the backups?
• How many copies of the backups are stored? (Watch for mismatches such as a 30-day retention
with only 4 copies stored. Retention and storage should match.)
• Where are the backup files stored (on site (list where), off site (list where))?
The backup process should kick off based on the following schedule:
Start Time or Prerequisite | Frequency | Content
12:01 AM | Daily | Backup only the deltas (incremental backup)
12:01 AM, Saturday | Weekly | Backup all records (full backup)
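The example schedule and the retention warning above can be expressed as two small checks. This is a minimal Python sketch under stated assumptions: the full backup is assumed to replace the incremental one on Saturdays, and one stored copy is assumed to cover the interval until the next copy is taken. The function names are hypothetical.

```python
from datetime import date


def backup_type(day: date) -> str:
    """Return the kind of backup the example schedule runs on `day`."""
    # date.weekday() counts from Monday = 0, so Saturday is 5.
    return "full" if day.weekday() == 5 else "incremental"


def copies_cover_retention(retention_days: int, copies: int,
                           days_between_copies: int = 1) -> bool:
    """True if the stored copies span at least the retention period."""
    return copies * days_between_copies >= retention_days


print(backup_type(date(2009, 12, 5)))      # a Saturday
print(copies_cover_retention(30, 4))       # 30-day retention, 4 daily copies
print(copies_cover_retention(28, 4, 7))    # 28-day retention, 4 weekly copies
```

With daily copies, the 30-day retention and 4 stored copies from the warning do not match, which is the kind of gap this check is meant to surface.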
4.2) Backup, Recovery, and Failover Situations
The situations outlined below address any scenarios that might invoke backup, recovery, and failover
procedures. Please name the contact person from the SEATSM team (if relevant) who will monitor the
backup logs for data integrity, and state how the application team will be notified of a backup
event (email, phone call, etc.).
All situations encountered by the testing team have been documented below. The following has been
documented for each:
• Situation – Defines what happens to the <name of box being addressed> that creates the impacts.
This is the trigger that initiates backup/recovery/failover procedures.
• Application Impacted – Name of the application that is impacted by the situation.
• Description of the Impact – Defines how the identified application is impacted.
Situation | Application Impacted | Description of Impact | Procedures
1. <Box> shuts down | Oracle | Will shut down with errors | Follow restart procedures
2. Network Down | Application XYZ | Will shut down with errors | Follow shut down procedures
3. <Box> is not responding | Application XYZ | Will shut down with errors |
Catastrophic hardware failure | Application XYZ | |
4.3) Monitoring Alerts
This section identifies what is being monitored:
Backup monitoring – <ensure that the tasks mentioned in section 4.1 are monitored.>
Functional monitoring – <ensure that the hosted application is up and responsive to user requests.>
4.4) Monitoring Alert Classification
The following classification has been defined to categorize monitoring alerts:
• Error Message – Error message issued by any applicable monitoring service as it appears on the
console in the data center.
• Severity – Definition of the severity of the defined alert. Use the following as a guideline in
assigning the severity:
Severity | Description
HIGH | A system outage or complete loss of functionality; loss of crucial functionality; no workaround exists for the problem.
MEDIUM | Keeps people from correctly performing their jobs; loss of non-crucial functionality; warning message that could eventually result in loss of crucial functionality.
LOW | Inconvenience to the operator; warning message that will not result in loss of functionality.
• Description – Description of the alert based on experience in addressing it.
• Resolution – Step-by-step resolution procedures to get to the root of the problem. This may refer to
another component document if the troubleshooting crosses over multiple components.
NOTE: When adding new alerts and resolutions, place the alert alphabetically in order of its error
message.
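The classification above maps naturally onto a small record type. The following Python sketch is one possible way to hold alerts and keep them in alphabetical order by error message, as the NOTE requires. The class names, field names, and the sample alert texts are hypothetical and are not taken from any monitoring product.

```python
import bisect
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    HIGH = "HIGH"
    MEDIUM = "MEDIUM"
    LOW = "LOW"


@dataclass
class Alert:
    error_message: str
    severity: Severity
    description: str
    resolution: str


def add_alert(catalog: list[Alert], alert: Alert) -> None:
    """Insert `alert` so the catalog stays sorted by error message (case-insensitive)."""
    keys = [entry.error_message.lower() for entry in catalog]
    index = bisect.bisect_right(keys, alert.error_message.lower())
    catalog.insert(index, alert)


catalog: list[Alert] = []
# Sample entries are placeholders for illustration only.
add_alert(catalog, Alert("Disk full", Severity.HIGH, "sample", "sample"))
add_alert(catalog, Alert("Backup failed", Severity.HIGH, "sample", "sample"))
add_alert(catalog, Alert("Network down", Severity.MEDIUM, "sample", "sample"))
print([entry.error_message for entry in catalog])
```

Keeping the list sorted at insertion time means the alert detail section can be generated in the required order without a separate sorting step.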
4.5) Monitoring Alert Detail
The following alerts were detected while running test scripts throughout system and end-to-end
testing. This is an example; please enumerate the alerts if they are available for your systems.
<Each alert encountered has been documented and includes the following information:>
Trap/Alert Message:
Severity:
Description:
Resolution:
5) Troubleshooting
5.1) Troubleshooting Procedures
Before deciding on an aggressive approach such as replacing the hardware in the event of a
failure, some common-sense tasks should be attempted. The SLA (Service Level Agreement)
document should also be consulted to determine how aggressive the approach should be (time-wise).
If the permissible time frame per the SLA is very short, failover could be the preferred response, if it
is possible.
<Here are some examples of possible common-sense approaches that could be taken:
• Ensure that all required services are running. If they are not, try to restart the
affected service. If the service does not start manually, try rebooting the machine that
hosts it.
• Make sure all necessary files are in the proper place.
• Wait 5 minutes and check again (network glitch).
• Check that the power cords are still attached.>
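The "wait and check again" step above can be sketched as a small retry helper. This is an illustrative Python sketch, not a prescribed procedure: the function name is hypothetical, the 300-second default mirrors the 5-minute wait, and the `check` argument stands for whatever health check applies to the system (for example, a service status query).

```python
import time
from typing import Callable


def check_with_retry(check: Callable[[], bool],
                     attempts: int = 2,
                     wait_seconds: float = 300.0,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Run `check` up to `attempts` times, waiting between failed attempts.

    Returns True as soon as one attempt succeeds, False if all attempts fail.
    The `sleep` parameter exists so the wait can be replaced in tests.
    """
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:
            sleep(wait_seconds)
    return False
```

If the helper returns False after the configured attempts, the SLA time frame can then guide whether to continue troubleshooting or move to failover.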
5.2) Error Code Classification
The following classification has been defined to categorize error codes. It applies to the OS,
application, backup services, etc.
• Error Code – Error code as it appears in the box/application being addressed in this document.
• Severity – Definition of the severity of the defined error. Use the following as a guideline in assigning the
severity:
Severity | Description
HIGH | A system outage or complete loss of functionality; loss of crucial functionality; no workaround exists for the problem.
MEDIUM | Keeps people from correctly performing their jobs; loss of non-crucial functionality; warning message that could eventually result in loss of crucial functionality.
LOW | Inconvenience to the operator; warning message that will not result in loss of functionality.
• Description – Description of the error based on experience in addressing it.
• Resolution – Step-by-step resolution procedures to get to the root of the problem. This may refer to
another component document if the troubleshooting crosses over multiple components.
5.3) Error Code Detail
When adding new errors and resolutions, place the error in order of its code.
Error Message: ORA – 1205: Cannot….
Severity: HIGH
Description: Connectivity is down….
Resolution: Call network admin to resolve connectivity issue

Error Message: ORA – 1307: Connecting to server…
Severity: MEDIUM
Description: Warning message indicating…
Resolution: No action required. If the error occurs once every 5 minutes or more often,
call the DBA for assistance.
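The resolution for the MEDIUM example above depends on how often the error occurs. One way to make that rule checkable is sketched below in Python. The sketch assumes that occurrence timestamps have already been extracted from the logs, and it interprets "once every 5 minutes or more often" as an average rate over an observation window; the 30-minute window and the function name are assumptions, not part of the original rule.

```python
from datetime import datetime, timedelta


def should_call_dba(occurrences: list[datetime],
                    now: datetime,
                    window: timedelta = timedelta(minutes=30),
                    threshold_interval: timedelta = timedelta(minutes=5)) -> bool:
    """True if the error occurred at least once per `threshold_interval`,
    on average, during the `window` ending at `now`."""
    recent = [t for t in occurrences if now - window < t <= now]
    # Dividing two timedeltas yields a float, e.g. 30 min / 5 min = 6.0.
    required = window / threshold_interval
    return len(recent) >= required
```

For example, six occurrences in the last 30 minutes would meet the threshold, while two occurrences would not.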
5.4) Verify Situation
If the above situations or errors occur, confirm the situation:
1. Review error logs and store in __ location for future reference.
Please enumerate the steps that you take before escalating the situation (e.g., verify
system/backup log files, verify the consistency of the backups, etc.).
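One possible way to verify the consistency of a backup is to compare a checksum of the original file with a checksum of a test-restored copy. The Python sketch below uses SHA-256 from the standard library. The function names are hypothetical, and the approach assumes that a restored copy is available on disk; backup products may offer their own verification features, which should take precedence.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_matches(source: Path, restored: Path) -> bool:
    """True if the restored file has the same content as the source file."""
    return sha256_of(source) == sha256_of(restored)
```

A mismatch here would be a reason to escalate, since it suggests the backup cannot be relied on for recovery.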
6) Procedures
6.1) Backup Procedures
<Insert procedures for each component>
6.2) Shutdown Procedures
<Insert procedures for each component>
6.3) Failover Procedures
<Insert procedures for each component>
6.4) Recovery Procedures
<Insert procedures for each component>
6.5) Start-Up Procedures
<Insert procedures for each component>
7) Document Sign Off
Phase: Design
The (Deliverable Name) document has been reviewed and found to be consistent with the specifications
and/or documented project requirements. The signature below documents acceptance of this document
and/or work product by the signing authority.
Organization: University of Chicago________________
Contractor________________
Approved by:
Signature: ___________________________________________________________________
Name: ______________________________________________________________________
Title:
Date:
Organization: University of Chicago________________
Contractor________________
Approved by:
Signature: ___________________________________________________________________
Name: ______________________________________________________________________
Title:
Date: