IBM® System i™ & System Storage™ DS8000™ Recovery Handbook

This document can be found on the web at www.ibm.com/support/techdocs
Search for document number WPxxxxxx under the category “White papers”.

Version 1.0, September 5th, 2007
IBM ATS System Storage Europe
Ingo Dimmer

© IBM Copyright, 2007 http://www-03.ibm.com/support/techdocs/atsmastr.nsf/WebIndex/WPxxxxxx
IBM System i & System Storage DS8000 Recovery Handbook Version 1.0, 09/05/2007 Page 1/19

Purpose

The purpose of this recovery handbook is to provide storage administrators with handy reference information for troubleshooting the IBM® System Storage™ DS8000™ in a System i™ CopyServices environment. This white paper focuses on troubleshooting volume access problems and FlashCopy®, MetroMirror and GlobalMirror problems, including failover/failback. It is meant to provide guidance in addition to the official IBM product documentation to help quickly diagnose and recover from failure situations.

To gain the most benefit from this recovery handbook, it is suggested that this document be taken as a template for developing a customized version specific to the customer’s current System i, SAN and DS8000 storage configuration.

In addition to the technical procedures provided for failure isolation and recovery, it is strongly recommended that customers using a disaster recovery or high availability setup develop their own decision criteria for switching to the disaster recovery or backup site. Augmenting the technical procedures with unambiguous site swap decision criteria, defined responsibilities and duration targets is important to help minimize the overall recovery time. Defined duration targets which support the decision for a site swap should include the effort for checking recovery site data consistency and for failure analysis, so that the expected recovery time for the production site can be compared with the known recovery time for unplanned site outages.
Disclaimer Notice & Trademarks

THE INFORMATION PROVIDED IN THIS DOCUMENT IS DISTRIBUTED "AS IS" WITHOUT ANY WARRANTY, EITHER EXPRESS OR IMPLIED. IBM EXPRESSLY DISCLAIMS ANY WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE OR NON-INFRINGEMENT.

IBM shall have no responsibility to update this information. While IBM has reviewed each item for accuracy in a specific situation, there is no guarantee that the same or similar procedure will work elsewhere. The provision of the information contained herein is not intended to, and does not, grant any right or license under any IBM patents or copyrights. Inquiries regarding patent or copyright licenses should be made, in writing, to:

IBM Director of Licensing
IBM Corporation
North Castle Drive
Armonk, NY 10504-1785
U.S.A.

IBM, the IBM logo, System Storage, FlashCopy, System i, System i5 and i5/OS are trademarks of International Business Machines Corporation in the United States, other countries, or both. Other company, product and service names may be trademarks or service marks of others.
Table of Contents

Purpose
Disclaimer Notice & Trademarks
1 Basic Error Determination and Trouble Shooting
1.1 Display DS8000 Serviceable Events
1.2 Reviewing DS8000 SNMP Traps for CopyServices Events
1.3 Troubleshooting Volume Access Problems
1.3.1 From System i Side
1.3.2 From DS8000/HMC Side
1.3.3 From SAN Fabric Side
1.4 Troubleshooting FlashCopy Problems
1.5 Troubleshooting MetroMirror Problems
1.5.1 Volume Suspends or Consistency Group Freezes
1.5.2 Path Failures
1.5.3 Primary Site Disaster Scenarios
1.5.4 MetroMirror Failover/Failback Procedures
1.6 Trouble Shooting GlobalMirror Problems
1.6.1 GM Session, PPRC Path and GlobalCopy Failures
1.6.2 GlobalMirror Session Failover/Failback Procedures
1.6.3 Failing over a Subset of Volumes via Pausing the GM Session
1.7 Problem Data Collection
1.7.1 From System i Side
1.7.2 From DS8000 Side
1.7.3 From SAN Side
1.8 References

1 Basic Error Determination and Trouble Shooting

1.1 Display DS8000 Serviceable Events

A serviceable event is created on the DS8000 HMC for each storage unit problem.

1) Log on to the DS8000 HMC with user ID customer and password cust0mer
2) Display serviceable events on the DS8000 HMC via Service Applications → Service Focal Point → Manage Serviceable Events.
   An example is shown in Figure 1 below:

Figure 1: DS8000 Serviceable Events

3) Refer to the DS8000 Service Information Center → Messages and codes → Entry table for all messages and codes for further information:
   http://publib.boulder.ibm.com/infocenter/dsichelp/ds8000sv/index.jsp

Note: If the DS8000 HMC has been configured for call-home outbound communication, it automatically opens a Problem Management Hardware (PMH) record for any DS8000 serviceable event which needs further attention from IBM support. The Serviceable Events window (see Figure 1) shows the problem reference number (PMH #), which the IBM remote support service representative will take care of.

1.2 Reviewing DS8000 SNMP Traps for CopyServices Events

For DS8000 CopyServices events specifically, it is useful to review the information provided with the following SNMP traps:

Trap 1xx for PPRC link events
   o Trap 100: Remote mirror and copy links degraded*
   o Trap 101: Remote mirror and copy links are inoperable*
   o Trap 102: Remote mirror and copy links are operational
Trap 20x for PPRC volume events
   o Trap 200: LSS pair consistency group remote mirror and copy pair error*
   o Trap 202: Primary remote mirror and copy devices on the LSS were suspended because of an error*
Trap 210-220 for GlobalMirror events
   o Trap 210: Global Mirror initial consistency group successfully formed
   o Trap 211: Global Mirror session is in a fatal state
   o Trap 212: Global Mirror consistency group failure - Retry will be attempted
   o Trap 213: Global Mirror consistency group successful recovery
   o Trap 214: Global Mirror master terminated
   o Trap 215: Global Mirror FlashCopy at remote site unsuccessful
   o Trap 216: Global Mirror slave termination unsuccessful
   o Trap 217: Global Mirror paused
   o Trap 218: Global Mirror number of consistency group failures exceed threshold
   o Trap 219: Global Mirror first successful consistency group after prior failures
   o Trap 220: Global Mirror number of FlashCopy commit failures exceed threshold

Refer to the DS8000 Information Center → Troubleshooting → Generic and specific alert traps for further information such as CopyServices event reason codes:
http://publib.boulder.ibm.com/infocenter/dsichelp/ds8000ic/index.jsp

Note: Setting up the DS8000 HMC for SNMP notification to a customer-provided SNMP manager software application is highly recommended, especially in System i DS8000 CopyServices environments, because SNMP is the only way in which DS8000 CopyServices events can be reported in an open systems host environment.

1.3 Troubleshooting Volume Access Problems

Storage access loss problems can originate from any component in the I/O chain, which consists of the System i server, the SAN environment and the DS8000 storage subsystem. For a thorough analysis, all of these components should be checked independently as described below.

1.3.1 From System i Side

Refer to the following subsections for failure isolation:

   Access loss to a SYSBAS disk unit → see section 1.3.1.1
   Access loss to an IASP disk unit → see section 1.3.1.2
   Loss of a redundant path to a multi-path disk unit → see section 1.3.1.3

System i storage access loss problems are logged in the System i Product Activity Log (PAL) and/or the QSYSOPR message queue.

Note: Use iSeries Navigator → My Connections → systemname → Basic Operations → Printer Output to easily transfer a spool file to a PC for problem data collection.

1.3.1.1 Access Loss to a SYSBAS Disk Unit

Loss of access to SYSBAS disk unit(s) is indicated by SRC A6xx0255 or A6xx0266 being posted, with System i entering a freeze state.
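When reference codes are collected from several partitions, the two SRC patterns above can be picked out mechanically. The following is only a minimal sketch in Python and not part of any IBM tooling: the function name is hypothetical, and it assumes that the "xx" in A6xx0255 / A6xx0266 stands for any two hexadecimal digits.

```python
import re

# Assumption: "xx" in A6xx0255 / A6xx0266 stands for any two hexadecimal digits.
_SYSBAS_ACCESS_LOSS = re.compile(r"A6[0-9A-F]{2}(?:0255|0266)")


def is_sysbas_access_loss(src: str) -> bool:
    """Return True if src looks like an A6xx0255 or A6xx0266 reference code.

    Hypothetical helper for illustration only; input is trimmed and upper-cased
    before matching, and the whole string must match the pattern.
    """
    return _SYSBAS_ACCESS_LOSS.fullmatch(src.strip().upper()) is not None


if __name__ == "__main__":
    # Example codes; the last one is the IASP-related SRC and must not match.
    for code in ("A6000255", "a6c10266", "B6000266"):
        print(code, is_sysbas_access_loss(code))
```

A check like this only flags candidate partitions; the details still have to be read from the reference code words on the HMC.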
Regaining access to the missing disk unit(s) is critical for System i to become operational again. Once access has been restored, System i automatically resumes operation from the point where it lost access. Otherwise it remains in its freeze state indefinitely, and the only recovery is to power down and restore the whole system from backup.

The following steps describe how to get information about the missing SYSBAS disk unit(s) when System i has entered a freeze state:

1) Log on to the System i5 HMC
2) Select the menu Server and Partition → Server Management, right-click the i5/OS partition which has SRC A6xx0255/0266 posted and select Properties
3) In the “Partition Properties” window select the tab Reference Code, select the current A6xx0255/0266 reference code from the list and click Details
4) Word 8 of the “Reference Code Details” shows the volume S/N of one of the missing disk unit(s), as shown for DS8000 LUN ID 0x1000 in Figure 2 below. Word 9 indicates whether the last operational FC path was lost (SRC 21073002) or whether access was lost to the volume itself (SRC 21073100). This information can be used for further failure isolation from the SAN and DS8000 side.

Figure 2: System i Reference Code Details

1.3.1.2 Access Loss to an IASP Disk Unit

An access loss to an IASP disk unit causes an SRC B6000266 PAL entry and an automatic vary-off of the IASP after 20 minutes, indicated by message CPIB711 “ASP device xxxx failed.” in the QSYSOPR message queue.

1) Review the System i Product Activity Log (see section 1.3.1.5) and display details for the SRC B6000266 entry showing the DS8000 volume S/N to which access was lost.
2) Refer to sections 1.3.2 and 1.3.3 for further failure isolation from the DS8000 and SAN side.
3) After recovery from the IASP disk unit access loss, try to vary on the IASP again via
   VRYCFG CFGOBJ(IASP_name) CFGTYPE(*DEV) STATUS(*ON)

1.3.1.3 Loss of a redundant Path to a Multi-Path Disk Unit

A lost path to a multi-path disk unit is indicated by message ID CPPEA33 “Warning - An external storage subsystem disk unit connection has failed.” and/or CPI096E "Disk unit connection is missing", posted for every device of the lost path (re-posted every hour), and by an SRC 21073002 PAL entry.

Use the following steps to help isolate a lost path problem:

1) Review the QSYSOPR message queue (see section 1.3.1.4) for message CPPEA33 to get the resource name DMPxxx for a disk unit with a failing path.
2) Access System Service Tools to get the System i IOA’s physical location and the WWPN of the failing FC path by issuing the i5/OS command STRSST and selecting 1. Start a service tool → 7. Hardware service manager → 3. Locate resource by resource name. Enter the DMPxxx resource name, select option 8=Associated packaging resource(s), then option 5=Display detail. The System i IOA physical location is displayed in the Unit ID and Card fields and the WWPN in the Worldwide Port Name field
3) Locate the System i IOA of the failed FC path in System Service Tools’ Hardware Service Manager → 1. Packaging hardware resources and ensure that the IOP/IOA is in "operational" state; if not, try an IOP reset/re-IPL and, if needed, engage your service provider for further assistance
4) Ensure the status LEDs of the System i FC IOA are solid green and flashing yellow (link up); flashing green indicates that the link is down, and other states typically indicate a HW problem
5) Ensure the affected System i IOA is logged into the SAN and the DS8000 (Brocade command switchShow; Cisco command show flogi database; DSCLI command lshostconnect -login, which may not represent the current login status unless the port is reset by switching to another topology and back using setioport -topology [fc-al | scsi-fcp] port_ID); if not, ensure the switch FC port and the DS8000 FC port are in "online" status (DSCLI command lsioport)
6) For any recovered lost path, verify it has been recognized by i5/OS via message CPPEA35 “Informational only. A connection to an external storage subsystem disk unit has been restored.” or an SRC 27873140 PAL entry.

1.3.1.4 Reviewing the i5/OS System Operator Message Queue

1) Log on to the i5/OS system and display the system operator message queue by issuing the command
   DSPMSG QSYSOPR
2) Use option 5=Display details to display details for a selected message
3) To print out the QSYSOPR message queue for problem data collection, issue the command
   DSPMSG MSGQ(QSYSOPR) OUTPUT(*PRINT)
   Note: This print-out doesn’t include the message details.

1.3.1.5 Reviewing the System i Product Activity Log

1) Access System Service Tools (SST) by using the command STRSST
2) Select 1. Start a service tool from the "System Service Tools (SST)" screen
3) Select 1. Product activity log from the "Start a Service Tool" screen
4) Select option 1. Analyze log from the "Product Activity Log" screen
5) Enter "3" for Log (3 = Magnetic media log) and the timeframe of the log in the "Select Subsystem Data" screen
6) Enter "3" for Report type (3 = Print options) and "Y" for including optional statistical entries in the "Select Analysis Report Options" screen
7) Enter "4" for Report type (4 = Print full report) and "Y" for including hexadecimal data in the "Select Options for Printed Report" screen
8) Press F3 repeatedly and ENTER to exit from SST
9) Display the generated PAL spool file by running the command
   DSPSPLF FILE(QPCSMPRT) SPLNBR(*LAST)

1.3.1.6 Creating a System i HSM System Configuration List Printout

Having a current System i Hardware Service Manager (HSM) system configuration list printout available on paper is highly recommended as reference information to easily associate System i disk resource names with the corresponding DS8000 volume S/N and the System i IOA WWPN.

Use the following steps to create an HSM configuration list (see Figure 3):

1) Access system service tools by using the command STRSST
2) Select option 1. Start a service tool from the "System Service Tools (SST)" screen
3) Select option 7. Hardware Service Manager from the "Start a Service Tool" screen
4) Press F6=Print configuration from the "Hardware Service Manager" screen
5) Select the default "Format" option 1=132 characters wide and "Information printed" option 1=Packaging resources sorted by location and press ENTER
6) Press F3 repeatedly and ENTER to exit from SST
7) To ease problem determination for System i access loss problems, store a soft copy of the HSM system configuration list on another system and keep a paper printout of it together with this recovery handbook.
Figure 3: Example Excerpt from HSM System Configuration List

1.3.2 From DS8000/HMC Side

The following items should be verified from the DS8000 side to help isolate a System i access loss problem:

1) Verify that all Fibre Channel IOAs of the affected System i LPAR are logged into the DS8000 (DSCLI command lshostconnect -login, which may not represent the current login status unless the port is reset by switching to another topology and back using setioport -topology [fc-al | scsi-fcp] port_ID); if not, ensure the DS8000 FC ports are in "online" status (DSCLI commands lsioport, lshba storage_image_ID) and there is no SAN connectivity problem (see section 1.3.3)
2) Verify all System i DS8000 volumes are in “online/normal” state (DSCLI command lsfbvol)
3) Verify relevant DS8000 resources are in online or normal state (DSCLI commands: lsrank, lsarray, lsddm storage_image_ID, lsda storage_image_ID). An IBM storage CE may check the DS8000 resource states on the HMC by selecting Service Applications → Service Focal Point → Service Utilities, highlighting the SF, selecting Selected → View Storage Facility State (end of call), and then selecting any "FAILED" test and clicking Details for further information

1.3.3 From SAN Fabric Side

Perform the following steps as a sanity check to isolate a System i access loss from the SAN perspective:

1) Ensure that both the System i IOA and its corresponding DS8000 host adapter are logged into the SAN fabric (Brocade command switchShow; Cisco command show flogi database; use DSCLI command lsioport to get the DS8000 adapter WWPN)
2) Ensure the switch zoning is correct so that the System i host initiator and the DS8000 storage target can “see” each other (Brocade command cfgShow; Cisco command show zoneset active)
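The zoning check in step 2 above comes down to one question: is there at least one active zone that contains both the System i IOA WWPN and the DS8000 host adapter WWPN? The following Python sketch illustrates that check only. It does not parse Brocade or Cisco output; the zone names and WWPNs are made up, and the dictionary is assumed to have been transcribed by hand from the cfgShow or show zoneset active output.

```python
from typing import Dict, List, Set


def _norm(wwpn: str) -> str:
    """Normalize a WWPN so that 10:00:... and 1000... compare equal."""
    return wwpn.replace(":", "").strip().lower()


def zones_connecting(zones: Dict[str, Set[str]],
                     initiator_wwpn: str,
                     target_wwpn: str) -> List[str]:
    """Return the sorted names of all zones containing both WWPNs.

    Hypothetical helper: 'zones' maps a zone name to its member WWPNs.
    An empty result means initiator and target cannot "see" each other.
    """
    wanted = {_norm(initiator_wwpn), _norm(target_wwpn)}
    return sorted(
        name for name, members in zones.items()
        if wanted <= {_norm(m) for m in members}
    )


if __name__ == "__main__":
    # Made-up example data, not taken from any real fabric.
    active_zones = {
        "z_sysi_ioa1_ds8k_p0": {"10:00:00:00:c9:aa:bb:01",
                                "50:05:07:63:0a:00:00:01"},
        "z_other_host": {"10:00:00:00:c9:aa:bb:02",
                         "50:05:07:63:0a:00:00:01"},
    }
    print(zones_connecting(active_zones,
                           "10:00:00:00:C9:AA:BB:01",
                           "500507630a000001"))
```

Normalizing the WWPNs before comparing avoids false mismatches caused purely by notation, since switches and the DSCLI may print the same port name with or without colons and in different case.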
1.4 Troubleshooting FlashCopy Problems

With FlashCopy being a DS8000 internal CopyServices function, the possibilities for troubleshooting from a user perspective are limited. Nonetheless, check the following items, mainly to help exclude a user error with FlashCopy in the context of System i:

1) For a FlashCopy establish failure with DS8000 message ID “CMUN03049E mkflash: source_volID:target_volID: Copy Services operation failure: incompatible volumes”, ensure that both the FlashCopy source and target volumes are the same i5/OS volume model, i.e. they have the same capacity and protection mode, which is either “protected” (models A0x) or “unprotected” (models A8x) (DSCLI command lsfbvol; volume model information is shown in the “DeviceMTM” column output)
2) For a FlashCopy establish failure with DS8000 message ID “CMUN03035E mkflash: source_volID:target_volID: Copy Services operation failure: feature not installed”, ensure that the FlashCopy license key (PTC feature #72xx) is installed (check via DSCLI command lskey).
3) For a System i backup host IPL failure or an IASP vary-on failure from FlashCopy target volumes, verify the following holds true:
   a. The FlashCopy of SYSBAS or of the entire System i disk space was taken while the System i production host accessing the FlashCopy source volumes was powered off. For a FlashCopy of a System i independent auxiliary storage pool (IASP), ensure the IASP was varied off before establishing or re-synchronizing its FlashCopy relationships. This is the only way to ensure that all System i modified data in memory is flushed to disk storage for a consistent state and a clean IPL or IASP vary-on from the backup host accessing the FlashCopy target volumes.
   b. Ensure that, except for a GlobalMirror B to C volume relationship, the FlashCopy relationship was NOT created using the target write inhibit mode (DSCLI command lsflash source_volID; “TargetWriteEnabled” column output should show “Enabled”)
4) For a FlashCopy establish failure, ensure the following:
   a. Both FlashCopy source and target volumes are in “online / normal” state (DSCLI command lsfbvol)
   b. The rule that a FlashCopy target volume can be in only one FlashCopy relationship is not violated. If the specified source is already used as a FlashCopy target volume, message ID “CMUN03008E mkflash: source_volID:target_volID: Copy Services operation failure: cascading FlashCopy prohibited” is posted. If the specified target is already used as a target volume in an existing FlashCopy relationship, message ID “CMUN03042E mkflash: source_volID:target_volID: Copy Services operation failure: already a FlashCopy target” is posted.

1.5 Troubleshooting MetroMirror Problems

Refer to the corresponding subsection below to troubleshoot PPRC volume, path or primary site disaster failures.

1.5.1 Volume Suspends or Consistency Group Freezes

Perform the following steps to troubleshoot DS8000 PPRC suspend or consistency group freeze problems:

1) Review the SNMP trap 202 (or 200) message to find out the suspend reason code*
2) Check the current PPRC volume pair states on the primary and secondary DS8000 (DSCLI command lspprc source_vol_ID:target_vol_ID)
3) Ensure all defined PPRC paths are available (DSCLI command lspprcpath 00-FF showing "Success" state for the paths)
4) Correct any volume error states and resume PPRC (DSCLI command resumepprc -type mmir source_vol_ID:target_vol_ID or, if required, use rmpprc … and mkpprc …)

1.5.2 Path Failures

Follow the steps below to troubleshoot PPRC path failures:

1) Review the SNMP trap 100/101 message to find out the link failure reason code*
2) Check the current PPRC link states (DSCLI command lspprcpath 00-FF)
3) For any failed paths, based on the link failure reason code, make sure to rule out any connectivity errors:
   a. Ensure the primary and secondary DS8000 PPRC FC ports are online (DSCLI command lsioport)
   b. Ensure the corresponding PPRC FC link is up
      i. Primary and secondary DS8000 PPRC FC port LEDs show solid green and flashing yellow; flashing green indicates that the link is down, and other states typically indicate a HW problem
      ii. If a SAN is used for the PPRC connections, ensure both primary and secondary DS8000 FC ports are logged into the SAN switch and the zoning is correct so they can “see” each other (Brocade commands switchShow and cfgShow; Cisco commands show flogi database and show zoneset active)

1.5.3 Primary Site Disaster Scenarios

In case of a primary site disaster when no PPRC “freeze” automation software is used, proceed as follows to check whether there is consistent data on the PPRC secondary site before deciding on a potential MetroMirror failover (see section 1.5.4):

1) Query the status of the PPRC primary and secondary volumes and paths (DSCLI commands lspprc source_vol_ID:target_vol_ID and lspprcpath 00-FF)
2) Refer to Figure 4: MetroMirror Volume Swap Decision Matrix to check eligibility for a site swap

Case 1
   PPRC primary volume state (GUI or DSCLI): DUPLEX
   PPRC secondary volume state (GUI or DSCLI): DUPLEX
   PPRC path state, communication to secondary: PPRC paths O.K.
   SNMP traps (ALL hosts configured for SNMP need to be checked): NONE
   Host I/O state (disk error presented? I/O halted?): NO ERRORS
   Can PPRC be swapped (primary to secondary volume)? YES
   Actions to restore access and later the PPRC relationship: N/A

Case 2
   PPRC primary volume state (GUI or DSCLI): DUPLEX
   PPRC secondary volume state (GUI or DSCLI): DUPLEX
   PPRC path state, communication to secondary: PPRC paths partially lost, comm_to_sec. O.K.
   SNMP traps (ALL hosts configured for SNMP need to be checked): Trap 100 (pprc links degraded) will be posted
   Host I/O state (disk error presented? I/O halted?): NO ERRORS, possibly performance degradation
   Can PPRC be swapped (primary to secondary volume)? YES
   Actions to restore access and later the PPRC relationship: FIX PPRC path problems

Case 3
   PPRC primary volume state (GUI or DSCLI): DUPLEX to SUSPENDED (when last pprc path lost)
   PPRC secondary volume state (GUI or DSCLI): DUPLEX
   PPRC path state, communication to secondary: PPRC paths ALL LOST, comm_to_sec. FAILED
   SNMP traps (ALL hosts configured for SNMP need to be checked): Trap 101 (pprc links down), Trap 202 (Primary device susp)
   Host I/O state (disk error presented? I/O halted?): NO ERRORS, but PRIMARY VOL will go Suspended
   Can PPRC be swapped (primary to secondary volume)? *NO*
   Actions to restore access and later the PPRC relationship: FIX PPRC path problems and RESYNC PPRC afterwards

Case 4
   PPRC primary volume state (GUI or DSCLI): SUSPENDED
   PPRC secondary volume state (GUI or DSCLI): DUPLEX
   PPRC path state, communication to secondary: comm_to_sec. FAILED at ONE time for all pprc paths
   SNMP traps (ALL hosts configured for SNMP need to be checked): Trap 101 (pprc links down), Trap 202 (Primary device susp)
   Host I/O state (disk error presented? I/O halted?): NO ERRORS, but PRIMARY suspended
   Can PPRC be swapped (primary to secondary volume)? *NO*
   Actions to restore access and later the PPRC relationship: FIX the reason for volume suspend and RESYNC PPRC afterwards

Case 5
   PPRC primary volume state (GUI or DSCLI): "UNKNOWN"
   PPRC secondary volume state (GUI or DSCLI): DUPLEX
   PPRC path state, communication to secondary: PPRC paths O.K., comm_to_sec. O.K.
   SNMP traps (ALL hosts configured for SNMP need to be checked): Permanent I/O error, but NO PPRC suspend SNMP traps
   Host I/O state (disk error presented? I/O halted?): SCSI_Disk errors for "unknown" volumes
   Can PPRC be swapped (primary to secondary volume)? YES
   Actions to restore access and later the PPRC relationship: Fix PRIMARY to restore volume access, re-establish PPRC

Case 6
   PPRC primary volume state (GUI or DSCLI): "UNKNOWN"
   PPRC secondary volume state (GUI or DSCLI): DUPLEX and SUSPENDED
   PPRC path state, communication to secondary: PPRC paths O.K., comm_to_sec. O.K.
   SNMP traps (ALL hosts configured for SNMP need to be checked): Trap 202 (Primary device suspended)
   Host I/O state (disk error presented? I/O halted?): SCSI_Disk errors for "unknown" volumes
   Can PPRC be swapped (primary to secondary volume)? *NO*
   Actions to restore access and later the PPRC relationship: Fix PRIMARY to restore volume access, then RESYNC

Case 7
   PPRC primary volume state (GUI or DSCLI): SUSPENDED
   PPRC secondary volume state (GUI or DSCLI): SUSPENDED or OFFLINE to primary
   PPRC path state, communication to secondary: PPRC paths O.K.
   SNMP traps (ALL hosts configured for SNMP need to be checked): Trap 202 (Primary device suspended)
   Host I/O state (disk error presented? I/O halted?): NO ERRORS
   Can PPRC be swapped (primary to secondary volume)? *NO* (IF device problem on secondary!!!)
   Actions to restore access and later the PPRC relationship: Check affected DBs to decide WHAT volumes are to be used (pri/sec)

Figure 4: MetroMirror Volume Swap Decision Matrix

1.5.4 MetroMirror Failover/Failback Procedures

This section shows the required steps for a MetroMirror failover from the production to the recovery site and a later failback from the recovery to the original production site using DSCLI commands.

1.5.4.1 MetroMirror Failover from the Production to the Recovery Site

1) Follow the steps in section 1.5.3 to determine eligibility for a failover with consistent data on the recovery site before proceeding
2) For a practice failover only: stop all host I/O on the production site and suspend the PPRC volume relationships
   (pausepprc source_volume_ID:target_volume_ID)
3) From the recovery site, fail over PPRC to terminate PPRC, with the target volumes becoming suspended source volumes available for host I/O
   (failoverpprc -type mmir source_volume_ID:target_volume_ID)
4) Start the recovery site host systems

1.5.4.2 MetroMirror Failback from the Recovery to the original Production Site

1) From the recovery site, verify PPRC volume and path states before failing back:
   a. Ensure PPRC volumes on the recovery site are in source suspended state (lspprc source_volume_ID:target_volume_ID); if not, use pausepprc source_volume_ID:target_volume_ID / failoverpprc -type mmir source_volume_ID:target_volume_ID
   b. Ensure PPRC paths from the recovery to the production site are established (lspprcpath 00-FF; to establish paths if needed, use mkpprcpath -remotewwnn wwnn -srclss source_LSS -tgtlss target_LSS source_port_ID:target_port_ID)
2) From the recovery site, fail back MetroMirror to re-synchronize changes from the recovery site back to the production site
   Important: Stop host I/O on the original production site and make sure to specify the recovery site volumes as source volumes and the original production site volumes as target!
   (failbackpprc -type mmir source_volume_ID:target_volume_ID)
3) Verify that the PPRC volumes are in full-duplex state again (lspprc source_volume_ID:target_volume_ID)
4) From the production site, after the PPRC volumes are in full-duplex state again, fail over MetroMirror from the production to the recovery site:
   a. Ensure PPRC paths from the production to the recovery site are established (lspprcpath 00-FF; to establish paths if needed, use mkpprcpath -remotewwnn wwnn -srclss source_LSS -tgtlss target_LSS source_port_ID:target_port_ID)
   b. Fail over the PPRC target volumes on the production site to become suspended primaries
      (failoverpprc -type mmir source_volume_ID:target_volume_ID)
   c. Re-establish the MetroMirror relationships between the original production and recovery site
      Important: Stop host I/O on the recovery site and make sure to specify the original production site volumes as source volumes and the recovery site volumes as target!
      (failbackpprc -type mmir source_volume_ID:target_volume_ID)
   d. Verify that all MetroMirror volumes are in full-duplex state again (lspprc source_volume_ID:target_volume_ID)
5) Start the original production site host systems

1.6 Trouble Shooting GlobalMirror Problems

Refer to the corresponding subsection below for troubleshooting GlobalMirror session, PPRC path and GlobalCopy failures and for failing over GlobalMirror in the case of total path or primary site disasters.

1.6.1 GM Session, PPRC Path and GlobalCopy Failures

The following checkpoints help to isolate partial GlobalMirror failures (see section 1.6.2 for the failover procedure in case of a primary DS8000 or total path failure):

1) Check any corresponding SNMP CopyServices alert traps (refer to section 1.2)
2) Use the following DSCLI commands to diagnose GM problems:
   a. Check if the GM session is in "running" state (showgmir master_controlpath_LSS)
   b. Check the “last failure" for the GM session and whether it was associated with a particular LSS (showgmir -metrics master_controlpath_LSS)
   c. Check the status of the PPRC paths (lspprcpath 00-FF)
   d. Check the status of the GlobalCopy pairs, especially looking for any suspend or simplex states and any large number of out-of-sync tracks (lssession LSS_ID, lspprc -l source_vol_ID:target_vol_ID, showgmiroos session_ID)
   e. Check the status of the FlashCopy relationships (lsflash -l source_vol_ID:target_vol_ID)

1.6.2 GlobalMirror Session Failover/Failback Procedures

This section describes the required steps for a GlobalMirror session failover from the production to the recovery site and a later failback from the recovery to the original production site using DSCLI commands.

1.6.2.1 GlobalMirror Failover from the Production to the Recovery Site

1) If the production machine is still accessible, prepare for a “clean” GM failover:
   a. Pause GlobalMirror (consistency group) processing (pausegmir -lss LSS_ID -session session_ID)
   b. Verify that GlobalMirror is in "paused", i.e. GlobalCopy, state (showgmir master_controlpath_LSS)
   c. Suspend the GlobalCopy relationships (pausepprc source_vol_ID:target_vol_ID)
2) From the recovery site, fail over GlobalCopy to the recovery site to terminate PPRC, with the target volumes becoming suspended primaries
   (failoverpprc -type gcp source_vol_ID:target_vol_ID)
3) Ensure consistent data on the recovery site C volumes
   Note: For a planned site swap where host I/O was stopped before pausing or removing the GM session, this step is not required as the C volumes should be consistent.
   a. Query the consistency group state for the C volumes (lsflash source_vol_ID:target_vol_ID, looking for the “Sequence Number” and “revertible” information)
   b. Compare the C volumes’ “Sequence Number” and “revertible” information with Table 1 below and perform the required action to ensure consistent data on the C volumes.

   Sequence Numbers?   Revertible Volumes?            Required Action
   equal               all disabled                   none
   different           some enabled / some disabled   For all revertible C volumes: revertflash source_volume_ID
   equal               all                            For all C volumes: revertflash source_volume_ID
   equal               some enabled / some disabled   For all (revertible) C volumes: commitflash source_volume_ID

   Table 1: GlobalMirror FlashCopy Consistency Group Validation

4) Create consistent data on the recovery site B volumes via FlashCopy fast reverse restore (FRR)
   (reverseflash -fast -tgtpprc source_vol_ID:target_vol_ID, specifying the B volumes as sources and the C volumes as targets)
5) Wait for FlashCopy background copy completion by checking that the relationships have ended (lsflash source_vol_ID:target_vol_ID)
6) Re-establish the FlashCopy relationship between the B and C volumes to have meaningful data on the C volumes after the FRR and to prepare for re-establishing GM (in a disaster situation using "nocp" may not be desired)
   (mkflash -tgtinhibit -record -nocp source_vol_ID:target_vol_ID)
7) Start the recovery site host systems from the consistent B volumes

1.6.2.2 GlobalMirror Failback from the Recovery to the original Production Site

1) Ensure the production site A volumes are offline to all hosts
2) From the recovery site, ensure PPRC paths are established between the recovery and production site (lspprcpath 00-FF; to establish paths if needed, use mkpprcpath -remotewwnn wwnn -srclss source_LSS -tgtlss target_LSS source_port_ID:target_port_ID)
   From the recovery site, fail back GlobalCopy to the production site to re-synchronize changes from the recovery site back to the original production site
   Important: Stop host I/O on the original production site and make sure to specify the recovery site volumes as source volumes and the original production site volumes as target!
   (failbackpprc -type gcp source_vol_ID:target_vol_ID)
3) Stop host I/O to the recovery site B volumes
4) Verify that GlobalCopy synchronization from B to A has completed, i.e. “out of sync tracks” are zero for all GlobalCopy volumes (lspprc source_vol_ID:target_vol_ID)
5) From the production site re-establish GlobalCopy from the production to the recovery site:
   a. Ensure PPRC paths are established between the production and recovery site (lspprcpath 00-FF; if needed, establish paths using mkpprcpath -remotewwnn wwnn -srclss source_LSS -tgtlss target_LSS source_port_ID:target_port_ID)
   b. Fail over PPRC on the production site, with the target volumes on the production site becoming suspended primaries (failoverpprc -type gcp source_vol_ID:target_vol_ID)
   c. Fail back GlobalCopy on the production site to re-establish the GlobalCopy relationships between production and recovery site
      Note: Specify the original production site volumes as source volumes and the recovery site volumes as targets!
      (failbackpprc -type gcp source_vol_ID:target_vol_ID)
6) Re-start GlobalMirror processing:
   a. List the defined GlobalMirror sessions for the specified LSSs (lssession LSS_ID)
   b. Check whether GlobalMirror has been paused or stopped (showgmir master_controlpath_LSS)
   c. Resume the GlobalMirror session if it was paused before the site switch (resumegmir -lss LSS_ID -session session_ID)
      or
      re-start the GlobalMirror session if it was stopped before the site switch (mkgmir -lss LSS_ID -session session_ID)
   d. Verify that GlobalMirror is in "running" copy state (showgmir master_controlpath_LSS)
7) Re-start production site host systems from the A volumes

1.6.3 Failing over a Subset of Volumes via Pausing the GM Session

Proceed as follows to fail over only a subset of GlobalMirror volumes from the production to the recovery site. This currently works only when using DSCLI commands:
1) Pause the GlobalMirror session (pausegmir -lss LSS_ID -session session_ID)
2) Remove the volumes to be failed over from the GlobalMirror session (chsession -action remove -volume source_volume_ID -lss LSS_ID -session session_ID)
3) Restart the GlobalMirror session (resumegmir -lss LSS_ID -session session_ID)
4) Go to section 1.6.2.1 step 1c and proceed to perform the GlobalMirror failover only for the previously removed volumes.

1.7 Problem Data Collection

The following general information is typically required for further problem assistance by IBM support:
- detailed problem description
- impact (access loss, data loss, performance degradation)
- date/time of problem occurrence
- recovery actions tried and results
- date/time of any recent changes in the environment
- actual vs.
  expected or previous performance and type of workload (performance problems only)

1.7.1 From System i Side

The following information is typically required by IBM support from the System i side for any access loss, data loss or performance degradation problems:
- affected DS8000 volume IDs and System i host WWPNs
- i5/OS and cumPTF level (WRKPTFGRP)
- i5/OS HSM system configuration list (see section 1.3.1.6)
- difference between system time [DSPSYSVAL SYSVAL(QTIME)] and real time
- Product Activity Log (see section 1.3.1.5)
- QSYSOPR messages (see section 1.3.1.4)
- SAN connectivity diagram
- Collection Services "System report -> Disk utilization" data (performance problems only)

1.7.2 From DS8000 Side

The following information is typically required by IBM support from the DS8000 side for any access loss, data loss or performance degradation problems:
- DS8000 microcode level(s) (DSCLI command lsserver -l)
- information about the DS8000 CopyServices configuration (primary and secondary S/N, usage of MM, GM and/or FLC)
- DS8000 Data Collection on Demand (PE package) (typically offloaded by IBM support)
- forced DS8000 LPAR statesaves (typically forced by IBM support; for PPRC problems, for both primary and secondary)

1.7.3 From SAN Side

The following information is typically required by IBM support from the SAN side for any access loss problems:
- SAN switch logs (e.g. Brocade supportShow or Cisco show tech-support detail)
- SAN layout diagram

1.8 References

- IBM System Storage DS8000 Service Information Center, http://publib.boulder.ibm.com/infocenter/dsichelp/ds8000sv/index.jsp
- IBM System Storage DS8000 Information Center, http://publib.boulder.ibm.com/infocenter/dsichelp/ds8000ic/index.jsp
- IBM System Storage DS8000 Command-Line Interface User’s Guide (SC26-7625), http://www1.ibm.com/support/docview.wss?rs=1114&context=HW2C2&dc=DB500&q1=ssg1*&uid=ssg1S1002949&loc=en_US&cs=utf-8&lang=en
- IBM System Storage DS8000 Messages Reference (GC26-7914), http://www-1.ibm.com/support/docview.wss?uid=ssg1S7001164&aid=1
- IBM DS8000 Preventive Service Planning Information, http://www1.ibm.com/support/docview.wss?rs=1113&context=HW2B2&dc=DB500&uid=ssg1S1002949&loc=en_US&cs=utf-8&lang=en
- IBM System Storage DS8000 Host System Attachment Guide (SC26-7917), http://www-1.ibm.com/support/docview.wss?uid=ssg1S7001161&aid=1
- i5/OS V5R4 Information Center, http://publib.boulder.ibm.com/infocenter/iseries/v5r4/index.jsp
- IBM System Storage DS8000 Series: Implementing CopyServices in Open Environments (SG24-6788), http://www.redbooks.ibm.com/abstracts/sg246788.html?Open
- iSeries and IBM TotalStorage: A Guide to Implementing External Disk on eServer i5 (SG24-7120), http://www.redbooks.ibm.com/abstracts/sg247120.html?Open
- PCI and PCI-X Placement Rules for IBM System i models (REDP-4011-03), http://www.redbooks.ibm.com/redpieces/abstracts/redp4011.html
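
Illustration: Table 1 decision logic

The consistency group validation in Table 1 (section 1.6.2.1, step 3b) is a small decision table, and it may help to see it written out as code when building a customized version of this handbook. The following Python sketch is only an illustration under stated assumptions: the CVolume type, its field names and the required_action function are hypothetical and not part of DSCLI, the sequence number and revertible state are assumed to have been read manually from the lsflash output, and the "all" entry in the third table row is read as "all enabled". It only builds the command strings named in Table 1 and does not run them.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CVolume:
    """FlashCopy state of one B:C pair, as read manually from lsflash output.

    Hypothetical helper type for this sketch; it is not a DSCLI structure.
    """
    source_volume_id: str   # FlashCopy source (B volume) ID
    sequence_number: str    # "Sequence Number" reported for the relationship
    revertible: bool        # True if "revertible" is reported as enabled


def required_action(volumes: List[CVolume]) -> List[str]:
    """Return the command strings Table 1 calls for (empty list = no action)."""
    if not volumes:
        raise ValueError("no C volumes given")

    seq_equal = len({v.sequence_number for v in volumes}) == 1
    revertible = [v for v in volumes if v.revertible]
    all_enabled = len(revertible) == len(volumes)

    if seq_equal and not revertible:
        # Row 1: equal / all disabled -> none
        return []
    if seq_equal and all_enabled:
        # Row 3: equal / all (assumed: all enabled) -> revertflash for all
        return [f"revertflash {v.source_volume_id}" for v in volumes]
    if seq_equal:
        # Row 4: equal / some enabled, some disabled -> commitflash
        return [f"commitflash {v.source_volume_id}" for v in revertible]
    if revertible and not all_enabled:
        # Row 2: different / some enabled, some disabled -> revertflash
        return [f"revertflash {v.source_volume_id}" for v in revertible]

    # Any other combination is not listed in Table 1.
    raise ValueError("combination not covered by Table 1")
```

For example, with made-up volume IDs, required_action([CVolume("1000", "46A8", True), CVolume("1001", "46A7", False)]) corresponds to the second table row and returns a revertflash command for volume 1000 only. Combinations that Table 1 does not list raise an error instead of guessing an action, which seems the safer choice in a recovery situation.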