ITCP and DR Program Audit Program and Recovery Checklist

ROBERT K. DUGGAN, CPA, CIA, CISA
ITCP/DRP often doesn’t work.
We discover it doesn’t work when we really need it to work.
We pay a fortune to maintain it (Tiers 4-6: $400K-$2M and up!).
DR test recoveries are fun!





IBM defines Tiers 1-6 for CICS operating on z/OS, based on configuration.
Tiers 1-3: recovery times from more than 24 hours up to 1 week.
Tiers 4-6: recovery in under 24 hours. This range runs from large manufacturers/distributors with continuous processing needs and low tolerance for business downtime to instantaneous recovery (Tier 6: banks, with zero downtime tolerance). See IBM.com for more information.
Today’s example is a Tier 4 scenario for medium to large organizations with a 24-hour RTO requirement for critical applications. (If you have a mainframe, you most likely need Tier 3 or higher.)
Recovery of critical platforms and applications in under 24 hours: key success factors and evaluation steps are similar across the tiers.
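The two recovery bands described above can be written as a small lookup. This is a minimal Python sketch that assumes only the bands stated on the slide (Tiers 1-3: more than 24 hours up to 1 week; Tiers 4-6: under 24 hours). The function name `tier_band_for_rto`, the handling of exactly 24 hours, and the handling of windows longer than a week are illustrative assumptions, not IBM definitions.

```python
def tier_band_for_rto(rto_hours: float) -> str:
    """Return the tier band from the slide that matches a required RTO in hours."""
    if rto_hours < 0:
        raise ValueError("RTO cannot be negative")
    if rto_hours <= 24:
        # Assumption: an RTO of exactly 24 hours is grouped with Tiers 4-6,
        # in line with the Tier 4 example that has a 24-hour RTO requirement.
        return "Tier 4-6"
    if rto_hours <= 7 * 24:
        # More than 24 hours, up to 1 week.
        return "Tier 1-3"
    # Assumption: the slide states no band for windows longer than one week.
    return "outside stated tiers"


if __name__ == "__main__":
    for hours in (0, 24, 72, 200):
        print(hours, "->", tier_band_for_rto(hours))
```

The required RTO itself should come from the Business Impact Analysis; the lookup only places that number in a band.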



Determined by Business Impact Analysis and Risk Assessment
RTO (Recovery Time Objective) / RPO (Recovery Point Objective)
Recovery of critical platforms and applications: regardless of tier or platform, key success factors and evaluation steps are similar for all tiers. Only the configuration and RTO change.



Walkthrough (“Tabletop”) – Scenario with roles and responsibilities
Functional Exercise – Verify the effectiveness of the backup by platform
Off-Site Test Restore – Verify the effectiveness of the IT DR plan offsite at the test center
Two different things, but ITDR and BCP are severely impaired without each other.




Should occur well before the offsite test
Include the vendor team
Follow-up process with platform owners/DR team and vendor team to resolve issues noted prior to the actual test restore
Audit interviews the platform support teams, IT Director, and DR Manager assigned as part of planning to get an understanding of objectives and where the process is on an evolutionary scale






Call tree notification system dysfunctional or not at vendor; call trees incomplete or not defined
Persons who can declare a disaster not defined or poorly separated (or the wrong people); vendor cannot take action under contractual terms
Support teams or backups for key members not defined
Approval process for changes to DR documents
DR documents not current and available at vendor/on secure website
Vendor in same geographic area





Step-by-step instructions for platform owner/vendor operators are not crystal clear
No clear assignment of responsibilities or documented procedures for key platform owners
No clear assignment of responsibility for vendor personnel or appropriate training on platforms
Backups for key personnel not defined
Business impact analysis and risk assessment not current, or tier of recovery is insufficient. Example: a distributor switches from a call center to a web application/proprietary remote order entry system

Vendor personnel or backup recovery personnel cannot restore the system
- Port mapping / system documentation not complete / up to date
- Insufficient remote software / hardware support level
- Vendor hardware is insufficient
- Insufficient procedures / lack of clean updated scripts
- Poorly trained recovery personnel




Backup not really effective: verify successful recovery of each platform using a checklist, and document the verification method (system and volume information in header screens). PS: Don’t ask for screenshots in the middle of a DR test. Just capture platform, LPAR, times, and volume information, and observe/confirm effective validation.
Application recovery not verified during the 24-hour test, or inaccurate RTO
Inaccurate system documentation leads to failure to meet RTO
Port mapping is inaccurate/not maintained properly by hardware support
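One way to keep the per-platform verification consistent is a simple record holding the items named above (platform, LPAR, times, volume information, verification method). The following is a minimal Python sketch; the class name `PlatformVerification`, its field names, and the helper `missing_items` are hypothetical, and the sample values are invented for illustration only.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class PlatformVerification:
    """One checklist row per recovered platform (field names are assumptions)."""
    platform: Optional[str] = None
    lpar: Optional[str] = None
    restore_start: Optional[str] = None
    restore_end: Optional[str] = None
    volume_info: Optional[str] = None
    verification_method: Optional[str] = None


def missing_items(record: PlatformVerification) -> List[str]:
    """List the checklist fields that were left empty during the test."""
    return [f.name for f in fields(record) if not getattr(record, f.name)]


if __name__ == "__main__":
    # Hypothetical example: the auditor noted platform and LPAR only.
    row = PlatformVerification(platform="z/OS", lpar="LPAR1")
    print(missing_items(row))
```

Filling a row like this by observation during the test avoids interrupting the recovery team for screenshots while still leaving evidence of what was validated.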





Restore personnel cannot follow scripts without assistance from the company platform team
Test results not verified by DR Test Manager/DR Manager, or test leader is not independent/does not rotate by test
Teams do not complete the verification checklist or keep testing notes; it is an evolving process that needs to build
Teams do not update DR instructions with lessons learned following the test restore. It is an expensive process, so there should be a post-restore review with a follow-up task list
Teams do not accurately capture RT/RP (actual recovery time/recovery point) or evaluate them against true RTO/RPO by platform and application
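Capturing RT/RP and comparing them with RTO/RPO is simple arithmetic on timestamps. The sketch below is a minimal Python illustration under stated assumptions: actual recovery time is measured from disaster declaration to verified restore, and actual recovery point is the gap between the last recoverable backup and the declaration. The function name `evaluate_recovery` and its parameters are hypothetical, and the sample timestamps are invented.

```python
from datetime import datetime, timedelta


def evaluate_recovery(declared_at: datetime,
                      restored_at: datetime,
                      last_backup_at: datetime,
                      rto: timedelta,
                      rpo: timedelta) -> dict:
    """Compare actual recovery time/point against RTO/RPO for one platform."""
    # Assumption: RT runs from declaration to verified restore.
    actual_rt = restored_at - declared_at
    # Assumption: RP is the data-loss window before the declaration.
    actual_rp = declared_at - last_backup_at
    return {
        "actual_rt": actual_rt,
        "actual_rp": actual_rp,
        "rto_met": actual_rt <= rto,
        "rpo_met": actual_rp <= rpo,
    }


if __name__ == "__main__":
    result = evaluate_recovery(
        declared_at=datetime(2024, 1, 1, 8, 0),
        restored_at=datetime(2024, 1, 2, 6, 0),
        last_backup_at=datetime(2023, 12, 31, 20, 0),
        rto=timedelta(hours=24),
        rpo=timedelta(hours=8),
    )
    print(result)
```

Running the same calculation per platform and per application, rather than once for the whole test, shows which components actually missed their objectives.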

www.searchdisasterrecovery.com

www.IBM.com
Be sure to find me on LinkedIn