ART for TSM Admins

AUTOMATED RESTORE TESTING FOR TIVOLI STORAGE MANAGER
TSMworks, Inc.
Based in Research Triangle area, NC, USA
IBM Advanced Business Partner
Big fans of Tivoli Storage Manager
Broad experience with Fortune 500 clients
What’s the Problem?
The Recovery Gap
The Gartner Group:
“30% of all backups are not recoverable.”
The Yankee Group:
“40% were unable to recover data.”
Symantec:
“…one in four tests fail.”
A few reasons why recoveries fail
Wisdom from TSM experts and User Groups:
• Node wasn’t backed up, because:
• It fell off the schedule
• It was never even registered to TSM (“rogue” server)
• Node was backed up “successfully”, but:
• Some critical files were excluded
• Files (often databases) were in-use or locked
• Nothing was mounted on the mount point
• Windows Journaling Service failed
• Retention was too short
... and many more at www.tsmworks.com
Do Reporting Tools Help?
• Node wasn’t backed up, because:
• YES: It wasn’t scheduled by TSM (or anything else)
• NO: It was never registered to TSM (“rogue” server)
• Node was backed up “successfully”, but:
• NO: Some critical files were excluded
• YES: Files (often databases) were in-use or locked
• NO: Nothing was mounted on the mount point
• NO: Windows Journaling Service failed
• NO: Retention was too short
• MOSTLY NO: … many more at www.tsmworks.com
So, test the backups (if you’re serious about recovery).
Bare-metal restore testing is ideal...
[Figure: bare-metal restore stages: TSM, OS, Filesystem, Database, Reboot, Test]
…but unfeasible for 500 machines daily.
Sampling restores is very workable
This technique always uncovers problems.
And it’s quite feasible.
ART: Automated Restore Testing
• ART restores a few files, chosen at random,
from every computer TSM backs up.
• Restored files come to the ART VM.
Production nodes are untouched.
• ART doesn’t restore huge files, hammer TSM,
or disrupt migration, reclamation, etc.
• And … ART usually uncovers huge amounts of
wasted storage.
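The sampling idea in the bullets above can be sketched in a few lines of Python. This is only an illustration, not ART's actual code: the function name `pick_sample`, the 50 MB size cap, and the sample inventory are all assumptions made up for the example.

```python
import random

def pick_sample(files, max_bytes=50 * 1024 * 1024, count=3, rng=None):
    """Pick up to `count` random files that are no larger than `max_bytes`.

    `files` is a list of (path, size_in_bytes) tuples for one node.
    The size cap (an assumed value) keeps huge files out of the test restore.
    """
    if rng is None:
        rng = random.Random()
    candidates = [path for path, size in files if size <= max_bytes]
    return rng.sample(candidates, min(count, len(candidates)))

# Hypothetical backup inventory for a single node.
inventory = [
    ("/etc/hosts", 412),
    ("/var/db/dump.full", 900 * 1024**3),  # far too big to sample
    ("/home/app/config.yml", 2048),
    ("/home/app/report.pdf", 350_000),
]
print(pick_sample(inventory, count=2))
```

Filtering by size before sampling is one simple way to avoid restoring huge files; a real tool would also need to throttle its load on the TSM server.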
ART architecture
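[Figure: ART architecture. A VMware Virtual Machine running a Linux OS contains the Web Server (login, setup, run tests, annotate, upgrade, send logs), the Rails Engine, MySQL, LDAP, the TSM config. files, dsmc, dsmadmc, and ssh, curl, etc. Over the Network it reaches the LDAP Server, TSM Servers, SMTP Server, and the TSMworks site.]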
• ART is a self-contained Virtual Appliance.
How ART installs
Download it from our website, start it up on your ESX farm, and point it at your TSM Server.
• Your ESX team usually does the install. Easy.
How ART works
1. Discover all clients.
2. For each client, restore one file, selected at random.
3. Record results for each test restore.
4. Show results on dashboard.
[Figure: Your TSM site. The ART Web server and Database use dsmadmc and dsmc to reach the TSM servers, which connect over the Network/SAN to Client 1, Client 2, Client 3.]
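The four steps above might look roughly like the loop below. It is a minimal sketch under stated assumptions: `run_sweep`, `RestoreResult`, and `fake_restore` are hypothetical names, and the `restore` callable stands in for whatever actually performs the restore (the slides mention dsmc, but no real command is invoked here).

```python
import random
from dataclasses import dataclass

@dataclass
class RestoreResult:
    node: str
    path: str
    passed: bool
    message: str

def run_sweep(backed_up_files, restore, rng=None):
    """One sweep: restore one random file per node and record each outcome.

    backed_up_files: dict of node name -> list of backed-up file paths.
    restore: callable (node, path) -> (passed, message); a hypothetical
             stand-in for the real restore step.
    """
    if rng is None:
        rng = random.Random()
    results = []
    for node, paths in sorted(backed_up_files.items()):
        if not paths:
            # Nothing to restore is itself a finding worth reporting.
            results.append(RestoreResult(node, "", False, "no backed-up files found"))
            continue
        path = rng.choice(paths)
        passed, message = restore(node, path)
        results.append(RestoreResult(node, path, passed, message))
    return results

def fake_restore(node, path):
    # Pretend every restore works except on the made-up node "db02".
    if node == "db02":
        return False, "volume unavailable"
    return True, "ok"

site = {"web01": ["/etc/hosts"], "db02": ["/data/a.dbf", "/data/b.dbf"], "old03": []}
for r in run_sweep(site, fake_restore):
    print(r.node, "PASS" if r.passed else "FAIL", r.message)
```

Passing the restore step in as a function keeps the loop testable without a TSM server; the recorded results are what a dashboard would then summarize.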
Dashboard
• Each bar shows the results of one “Sweep”,
one testing pass through all the nodes.
• One sweep may take hours or days. You can
break it up to run, say, 2 PM to 6 PM daily.
(Not usually necessary.)
ART for Auditors
• Storage auditors can use the “Passed” section
to see proof that each node was really tested.
• From the list of successful nodes, drill down to see that each filespace was tested, and that the test succeeded.
ART for TSM Admins
• Unlike Auditors, TSM administrators will want
to look at the errors, not the successes.
• ART shows a short list of root causes, rather than the long list of nodes that failed. Click the message text to see activity log detail about the failure.
• Example: a restore fails due to a missing tape. The sensible admin will fix all missing tapes, preventing failures on many other nodes.
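Collapsing a long list of failed nodes into a short list of root causes amounts to grouping failures by their message. The sketch below shows one way to do that; the function name `group_by_cause` and the failure messages are invented for illustration and are not real TSM messages.

```python
from collections import defaultdict

def group_by_cause(failures):
    """Group (node, message) pairs by message, most widespread cause first."""
    groups = defaultdict(list)
    for node, message in failures:
        groups[message].append(node)
    # Sort by number of affected nodes (descending), then by message text.
    return sorted(groups.items(), key=lambda item: (-len(item[1]), item[0]))

# Made-up failures from one sweep.
failures = [
    ("db02", "tape VOL001 unavailable"),
    ("web07", "tape VOL001 unavailable"),
    ("app11", "file excluded from backup"),
    ("db05", "tape VOL001 unavailable"),
]
for message, nodes in group_by_cause(failures):
    print(f"{len(nodes)} node(s): {message}")
```

Here four failures reduce to two causes, and the missing tape is listed first because fixing it clears the most nodes.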
• Click the Error Code to get plain-language
help and advice on what to do, if you’re not
familiar with TSM.
• If you have Rogue-server detection licensed, ART will find servers that are on your network, but are not registered to TSM.
• No reporting tool that talks only to TSM will
ever find these rogue servers.
• These IP Addresses are on the network, but
TSM doesn’t know about them.
• Click the IP address to see network analysis of these potential problems, including what kind of OS they use, and which ports they listen on.
• This helps distinguish important IPs (servers)
from irrelevant ones (printers, laptops, etc.).
• Document what the irrelevant IPs are, and
ignore them on future sweeps.
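At its core, rogue-server detection is a set difference: addresses seen on the network, minus those TSM knows about, minus those already documented as irrelevant. The sketch below assumes the three lists are already available; how ART actually scans the network is not described in these slides, and `find_rogues` is a hypothetical name.

```python
import ipaddress

def find_rogues(network_ips, registered_ips, ignored_ips=()):
    """Return IPs seen on the network that are neither registered nor ignored."""
    rogues = set(network_ips) - set(registered_ips) - set(ignored_ips)
    # Sort numerically rather than as text, so 10.0.0.9 precedes 10.0.0.10.
    return sorted(rogues, key=ipaddress.ip_address)

# Made-up example data.
seen_on_network = ["10.0.0.5", "10.0.0.12", "10.0.0.9", "10.0.0.10"]
registered_to_tsm = ["10.0.0.5"]
documented_printers = ["10.0.0.12"]
print(find_rogues(seen_on_network, registered_to_tsm, documented_printers))
```

The ignore list is what makes later sweeps quieter: once a printer or laptop is documented, it stops appearing as a potential problem.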
ART for Storage Trimming
• A side effect of testing: ART finds junk storage.
• Reduce your storage footprint by 20-50%.
Where does junk come from?
• Ancient policies are way too conservative
• Full database dumps retained for a year
• TDP agents don’t delete old backups
• Clusters are set up wrong
• Filesystems get renamed
• Users back up remote disks
• Decommissioned machines aren’t deleted
• This filesystem uses 127 TB, yet has not been
backed up in three weeks. Why not? Call
Chris Chiang. Either back it up, or delete it,
and reclaim 127 TB.
• This 33 GB drive uses 1.3 TB of backups.
Why? Call J Williamson and decide.
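Both findings above come down to simple checks on per-filespace figures. The sketch below is illustrative only: the function names and the 21-day threshold (three weeks) are assumptions, and decimal units (1 TB = 1000 GB) are assumed for the arithmetic.

```python
from datetime import date, timedelta

GB = 1000**3
TB = 1000**4

def is_stale(last_backup, today, max_age_days=21):
    """True if the filespace has gone `max_age_days` or more without a backup."""
    return (today - last_backup) >= timedelta(days=max_age_days)

def backup_ratio(backup_bytes, capacity_bytes):
    """How many times its own capacity a drive occupies in backup storage."""
    return backup_bytes / capacity_bytes

# A filespace last backed up three weeks ago is flagged as stale.
print(is_stale(date(2024, 1, 1), today=date(2024, 1, 22)))

# The 33 GB drive with 1.3 TB of backups holds roughly 39 times its capacity.
print(round(backup_ratio(1.3 * TB, 33 * GB), 1))
```

A ratio that high usually points at one of the causes listed earlier, such as over-long retention or full dumps kept for a year, which is why the next step is a conversation with the owner rather than an automatic delete.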
• The Almost Full view shows where OS or
Applications may soon crash due to low disk
space.
• If you change the sorting or filtering, a “save as..” link appears, so you can give your custom report a new name and have it on a new tab.
Benefits and Pricing
• Easy install
• Supports storage audits
• Prevents restore problems
• Finds rogue servers
• Trims waste storage
• Pricing: $35/node (less for larger sites).
Free trial = Free Health Check
• ART’s free trial tests 20% of your site.
• Use it as a free Health Check, with our
compliments.
Download