AUTOMATED RESTORE TESTING FOR TIVOLI STORAGE MANAGER

TSMworks, Inc.
• Based in Research Triangle area, NC, USA
• IBM Advanced Business Partner
• Big fans of Tivoli Storage Manager
• Broad experience with Fortune 500 clients

What’s the Problem? The Recovery Gap
• The Gartner Group: “30% of all backups are not recoverable.”
• The Yankee Group: “40% were unable to recover data.”
• Symantec: “…one in four tests fail.”

A few reasons why recoveries fail
Wisdom from TSM experts and User Groups:
• Node wasn’t backed up, because:
    • It fell off the schedule
    • It was never even registered to TSM (“rogue” server)
• Node was backed up “successfully”, but:
    • Some critical files were excluded
    • Files (often databases) were in-use or locked
    • Nothing was mounted on the mount point
    • Windows Journaling Service failed
    • Retention was too short
• ... and many more at www.tsmworks.com

Do Reporting Tools Help?
• Node wasn’t backed up, because:
    • YES: It wasn’t scheduled by TSM (or anything else)
    • NO: It was never registered to TSM (“rogue” server)
• Node was backed up “successfully”, but:
    • NO: Some critical files were excluded
    • YES: Files (often databases) were in-use or locked
    • NO: Nothing was mounted on the mount point
    • NO: Windows Journaling Service failed
    • NO: Retention was too short
• MOSTLY NO: … many more at www.tsmworks.com

So, test the backups.
(If you’re serious about recovery).

Bare-metal restore testing is ideal...
[Diagram: TSM, OS, Filesystem, Database, Reboot, Test]
…but unfeasible for 500 machines daily.

Sampling restores is very workable
• This technique always uncovers problems.
• And it’s quite feasible.

ART: Automated Restore Testing
• ART restores a few files, chosen at random, from every computer TSM backs up.
• Restored files come to the ART VM. Production nodes are untouched.
• ART doesn’t restore huge files, hammer TSM, or disrupt migration, reclamation, etc.
• And … ART usually uncovers huge amounts of wasted storage.
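The sampling idea above (a few random files per node, skipping huge ones) can be sketched in a few lines of Python. This is only an illustration, not ART's actual code: the `BackedUpFile` record, the `pick_test_files` function, and the 50 MB size cap are all assumptions made for the example, and the file inventory is assumed to have been obtained from TSM already.

```python
import random
from typing import List, NamedTuple, Optional, Sequence


class BackedUpFile(NamedTuple):
    """Hypothetical record for one file in a node's backup inventory."""
    path: str
    size_bytes: int


def pick_test_files(
    files: Sequence[BackedUpFile],
    count: int = 1,
    max_bytes: int = 50 * 1024 * 1024,  # assumed cap so huge files are never restored
    rng: Optional[random.Random] = None,
) -> List[BackedUpFile]:
    """Choose up to `count` files at random, ignoring files above `max_bytes`."""
    if rng is None:
        rng = random.Random()
    # Drop oversized files first so a test restore stays cheap for TSM.
    candidates = [f for f in files if f.size_bytes <= max_bytes]
    # Never ask for more files than are available.
    return rng.sample(candidates, min(count, len(candidates)))
```

Passing a seeded `random.Random` makes a sweep reproducible, which can help when a failed test restore needs to be repeated on the same file.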
ART architecture
[Diagram labels: Web Server, login, MySQL, LDAP, setup, run tests, TSM config. files, annotate, dsmc, upgrade, send logs, dsmadmc, Linux OS (ssh, curl, etc.), Network, Rails Engine, LDAP Server, TSM Servers, SMTP Server, TSMworks site, VMware Virtual Machine]
• ART is a self-contained Virtual Appliance.

How ART installs
• Download it (our website)
• Start it up (your ESX farm)
• Point it at TSM (your TSM server)
• Your ESX team usually does the install. Easy.

How ART works
1. Discover all clients
2. For each client, restore one file, selected at random
3. Record results for each test restore
4. Show results on dashboard
[Diagram labels: Your TSM site, Web server, Database, dsmadmc, dsmc, Network/SAN, TSM servers, Client 1, Client 2, Client 3]
It’s all we do.

Dashboard
• Each bar shows the results of one “Sweep”, one testing pass through all the nodes.
• One sweep may take hours or days. You can break it up to run, say, 2 PM to 6 PM daily. (Not usually necessary.)

ART for Auditors
• Storage auditors can use the “Passed” section to see proof that each node was really tested.
• From the list of successful nodes, drill down …
• … to see that each filespace was tested…
• … and that the test succeeded.

ART for TSM Admins
• Unlike Auditors, TSM administrators will want to look at the errors, not the successes.
• ART shows a short list of root causes, rather than the long list of nodes that failed. Click the message text…
• … to see activity log detail about the failure.
• Ex: restore fails due to a missing tape. The sensible Admin will fix all missing tapes, preventing failures on many other nodes.
• Click the Error Code to get plain-language help and advice on what to do, if you’re not familiar with TSM.
• If you have Rogue-server detection licensed, ART will find servers that are on your network, but are not registered to TSM.
• No reporting tool that talks only to TSM will ever find these rogue servers.
• These IP Addresses are on the network, but TSM doesn’t know about them.
• Click the IP address to see network analysis of these potential problems…
• … including what kind of OS they use, and which ports they listen on.
• This helps distinguish important IPs (servers) from irrelevant ones (printers, laptops, etc.).
• Document what the irrelevant IPs are, and ignore them on future sweeps.

ART for Storage Trimming
• A side effect of testing: ART finds junk storage.
• Reduce your storage footprint by 20-50%.

Where does junk come from?
• Ancient policies are way too conservative
• Full database dumps retained for a year
• TDP agents don’t delete old backups
• Clusters are set up wrong
• Filesystems get renamed
• Users back up remote disks
• Decommissioned machines aren’t deleted.

ART for Storage Trimming
• This filesystem uses 127 TB, yet has not been backed up in three weeks. Why not? Call Chris Chiang. Either back it up, or delete it, and reclaim 127 TB.
• This 33 GB drive uses 1.3 TB of backups. Why? Call J Williamson and decide.
• The Almost Full view shows where OS or Applications may soon crash due to low disk space.
• If you change the sorting or filtering, a “save as..” link appears …
• So you can give your custom report a new name …
• And have it on a new tab.

Benefits and Pricing
• Easy install
• Supports storage audits
• Prevents restore problems
• Finds rogue servers
• Trims waste storage
• Pricing: $35/node (less for larger sites).

Free trial = Free Health Check
• ART’s free trial tests 20% of your site.
• Use it as a free Health Check, with our compliments.
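As a closing illustration of the storage-trimming examples earlier in the deck (a filesystem holding 127 TB that has not been backed up in three weeks, and a 33 GB drive holding 1.3 TB of backups), here is a rough Python sketch of how such candidates might be flagged. It is not ART's implementation: the `Filespace` record, the `junk_candidates` function, the 21-day staleness window, and the 10x size ratio are all assumptions chosen for the example.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Sequence, Tuple


@dataclass
class Filespace:
    """Hypothetical summary of one filespace as known to the backup server."""
    node: str
    name: str
    capacity_gb: float   # size of the source filesystem
    stored_gb: float     # backup storage it currently occupies
    last_backup: date    # date of the most recent backup


def junk_candidates(
    filespaces: Sequence[Filespace],
    today: date,
    stale_after: timedelta = timedelta(days=21),  # assumed: "three weeks"
    max_ratio: float = 10.0,                      # assumed: stored/capacity limit
) -> List[Tuple[Filespace, str]]:
    """Return (filespace, reason) pairs worth a phone call to the owner."""
    findings = []
    for fs in filespaces:
        # Storage still held for something that is no longer being backed up.
        if today - fs.last_backup >= stale_after:
            findings.append((fs, "stale"))
        # Far more backup data than the source disk could plausibly justify.
        if fs.capacity_gb > 0 and fs.stored_gb / fs.capacity_gb > max_ratio:
            findings.append((fs, "oversized"))
    return findings
```

Neither flag proves the storage is junk; as on the slides, the result is a short list of owners to ask before anything is deleted.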