System i & DS8000 Recovery Handbook

IBM® System i™ & System Storage™ DS8000™
Recovery Handbook
This document can be found on the web, www.ibm.com/support/techdocs
Search for document number WPxxxxxx under the category of “White papers”.
Version 1.0
September 5th, 2007
IBM ATS System Storage Europe
Ingo Dimmer
© IBM Copyright, 2007
http://www-03.ibm.com/support/techdocs/atsmastr.nsf/WebIndex/WPxxxxxx
IBM System i & System Storage DS8000 Recovery Handbook
Version 1.0, 09/05/2007
Page 1/19
Purpose
The purpose of this recovery handbook is to provide some handy reference information to
storage administrators for troubleshooting the IBM® System Storage™ DS8000™ in a
System i™ CopyServices environment.
This white paper focuses on troubleshooting volume access problems and FlashCopy®,
MetroMirror and GlobalMirror problems, including failover/failback. It is meant to
supplement the official IBM product documentation and to help quickly diagnose and
recover from failure situations.
To gain the most benefit from this recovery handbook it is suggested that this document is
taken as a template for developing a customized version specific to the customer’s current
System i, SAN and DS8000 storage configuration.
In addition to the provided technical procedures for failure isolation and recovery it is
strongly recommended that customers using a disaster recovery or high availability setup
develop their own decision criteria for switching to the disaster recovery or backup site.
Augmenting the technical procedures by unambiguous site swap decision criteria with defined
responsibilities and duration targets is important to help minimize the overall recovery time.
Defined duration targets which support the decision for a site swap should include the efforts
for checking recovery site data consistency and for failure analysis to compare expected
recovery time for the production site versus known recovery time for unplanned site outages.
Disclaimer Notice & Trademarks
THE INFORMATION PROVIDED IN THIS DOCUMENT IS DISTRIBUTED "AS IS"
WITHOUT ANY WARRANTY, EITHER EXPRESS OR IMPLIED. IBM EXPRESSLY
DISCLAIMS ANY WARRANTIES OF MERCHANTABILITY, FITNESS FOR A
PARTICULAR PURPOSE OR NON-INFRINGEMENT.
IBM shall have no responsibility to update this information.
While IBM has reviewed each item for accuracy in a specific situation, there is no guarantee
that the same or similar procedure will work elsewhere.
The provision of the information contained herein is not intended to, and does not, grant any
right or license under any IBM patents or copyrights. Inquiries regarding patent or copyright
licenses should be made, in writing, to:
IBM Director of Licensing
IBM Corporation
North Castle Drive
Armonk, NY 10504-1785
U.S.A.
IBM, the IBM logo, System Storage, FlashCopy, System i, System i5 and i5/OS are
trademarks of International Business Machines Corporation in the United States, other
countries, or both.
Other company, product and service names may be trademarks or service marks of others.
Table of Contents
Purpose
Disclaimer Notice & Trademarks
1 Basic Error Determination and Trouble Shooting
1.1 Display DS8000 Serviceable Events
1.2 Reviewing DS8000 SNMP Traps for CopyServices Events
1.3 Troubleshooting Volume Access Problems
1.3.1 From System i Side
1.3.2 From DS8000/HMC Side
1.3.3 From SAN Fabric Side
1.4 Troubleshooting FlashCopy Problems
1.5 Troubleshooting MetroMirror Problems
1.5.1 Volume Suspends or Consistency Group Freezes
1.5.2 Path Failures
1.5.3 Primary Site Disaster Scenarios
1.5.4 MetroMirror Failover/Failback Procedures
1.6 Trouble Shooting GlobalMirror Problems
1.6.1 GM Session, PPRC Path and GlobalCopy Failures
1.6.2 GlobalMirror Session Failover/Failback Procedures
1.6.3 Failing over a Subset of Volumes via Pausing the GM Session
1.7 Problem Data Collection
1.7.1 From System i Side
1.7.2 From DS8000 Side
1.7.3 From SAN Side
1.8 References
1 Basic Error Determination and Trouble Shooting
1.1 Display DS8000 Serviceable Events
A serviceable event is created on the DS8000 HMC for each storage unit problem.
1) Log on to the DS8000 HMC with user ID customer and password cust0mer
2) Display serviceable events on the DS8000 HMC via Service Applications → Service
Focal Point → Manage Serviceable Events.
An example is shown in Figure 1 below:
Figure 1: DS8000 Serviceable Events
3) Refer to the DS8000 Service Information Center → Messages and codes → Entry
table for all messages and codes for further information:
http://publib.boulder.ibm.com/infocenter/dsichelp/ds8000sv/index.jsp
Note: If the DS8000 HMC has been configured for call-home outbound communication it
will automatically open a Problem Management Hardware record for a DS8000 serviceable
event which needs further attention from IBM support. The Serviceable Events window (see
Figure 1) will show the problem reference number (PMH #) which the IBM remote support
service representative will take care of.
1.2 Reviewing DS8000 SNMP Traps for CopyServices Events
Specifically for DS8000 CopyServices events reviewing the information provided with the
following SNMP traps is useful:

• Trap 1xx for PPRC link events
o Trap 100: Remote mirror and copy links degraded*
o Trap 101: Remote mirror and copy links are inoperable*
o Trap 102: Remote mirror and copy links are operational

• Trap 20x for PPRC volume events
o Trap 200: LSS pair consistency group remote mirror and copy pair error*
o Trap 202: Primary remote mirror and copy devices on the LSS were suspended
because of an error*

• Trap 210-220 for GlobalMirror events
o Trap 210: Global Mirror initial consistency group successfully formed
o Trap 211: Global Mirror session is in a fatal state
o Trap 212: Global Mirror consistency group failure - Retry will be attempted
o Trap 213: Global Mirror consistency group successful recovery
o Trap 214: Global Mirror master terminated
o Trap 215: Global Mirror FlashCopy at remote site unsuccessful
o Trap 216: Global Mirror slave termination unsuccessful
o Trap 217: Global Mirror paused
o Trap 218: Global Mirror number of consistency group failures exceed
threshold
o Trap 219: Global Mirror first successful consistency group after prior failures
o Trap 220: Global Mirror number of FlashCopy commit failures exceed
threshold
Refer to the DS8000 Information Center → Troubleshooting → Generic and specific
alert traps for further information like CopyServices event reason codes:
http://publib.boulder.ibm.com/infocenter/dsichelp/ds8000ic/index.jsp
Note: Setting up the DS8000 HMC for SNMP notification to a customer provided SNMP
manager software application is highly recommended especially in System i DS8000
CopyServices environments because SNMP is the only way in which DS8000 CopyServices
events can be reported in an OpenSystem host environment.
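The trap numbers listed in this section can be kept in a small lookup table so that a script attached to the SNMP manager can label incoming traps. The following Python sketch is illustrative only: the names DS8000_COPY_SERVICES_TRAPS, trap_category and describe_trap are hypothetical, and how the trap number is extracted from a received SNMP message depends on the SNMP manager in use and is not shown here.

```python
# Trap numbers and texts are copied from the list in section 1.2.
# All names in this sketch are hypothetical helpers, not part of any IBM tool.
DS8000_COPY_SERVICES_TRAPS = {
    100: "Remote mirror and copy links degraded",
    101: "Remote mirror and copy links are inoperable",
    102: "Remote mirror and copy links are operational",
    200: "LSS pair consistency group remote mirror and copy pair error",
    202: "Primary remote mirror and copy devices on the LSS were suspended because of an error",
    210: "Global Mirror initial consistency group successfully formed",
    211: "Global Mirror session is in a fatal state",
    212: "Global Mirror consistency group failure - Retry will be attempted",
    213: "Global Mirror consistency group successful recovery",
    214: "Global Mirror master terminated",
    215: "Global Mirror FlashCopy at remote site unsuccessful",
    216: "Global Mirror slave termination unsuccessful",
    217: "Global Mirror paused",
    218: "Global Mirror number of consistency group failures exceed threshold",
    219: "Global Mirror first successful consistency group after prior failures",
    220: "Global Mirror number of FlashCopy commit failures exceed threshold",
}


def trap_category(trap_number):
    """Map a trap number to the three groups used in this section."""
    if 100 <= trap_number <= 199:
        return "PPRC link event"
    if 200 <= trap_number <= 209:
        return "PPRC volume event"
    if 210 <= trap_number <= 220:
        return "GlobalMirror event"
    return "unknown"


def describe_trap(trap_number):
    """Return a one-line description for a trap number, if it is listed above."""
    text = DS8000_COPY_SERVICES_TRAPS.get(trap_number)
    if text is None:
        return f"Trap {trap_number}: not listed in this handbook"
    return f"Trap {trap_number} ({trap_category(trap_number)}): {text}"


if __name__ == "__main__":
    for number in (101, 202, 217):
        print(describe_trap(number))
```

Reason codes delivered with a trap are not covered by this table; they still have to be looked up in the DS8000 Information Center.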
1.3 Troubleshooting Volume Access Problems
Storage access loss problems can originate from all components in the I/O chain consisting of
System i server, SAN environment and DS8000 storage subsystem. For a thorough analysis
all these components should be checked independently as described below.
1.3.1 From System i Side
Refer to the following subsections for failure isolation:
• Access loss to a SYSBAS disk unit → see section 1.3.1.1
• Access loss to an IASP disk unit → see section 1.3.1.2
• Loss of a redundant path to a multi-path disk unit → see section 1.3.1.3
System i storage access loss problems are logged in the System i Product Activity Log (PAL)
and/or QSYSOPR message queue.
Note: Use iSeries Navigator → My Connections → systemname → Basic Operations →
Printer Output to easily transfer a spool file to a PC for problem data collection.
1.3.1.1 Access Loss to a SYSBAS Disk Unit
Loss of access to SYSBAS disk unit(s) is indicated by SRC A6xx0255 or A6xx0266 being
posted, with System i entering a freeze state. Regaining access to the missing disk unit(s) is
critical for System i to become operational again. Once access has been restored,
System i automatically resumes operation from the point where it lost access. Otherwise it
remains in its freeze state indefinitely, and the only recovery is to power down and restore
the whole system from backup. The following steps describe how to get
information about the missing SYSBAS disk unit(s) when System i has entered a freeze state:
1) Log on to the System i5 HMC
2) Select the menu Server and Partition → Server Management and right-click the
i5/OS partition which has SRC A6xx0255/0266 posted selecting Properties
3) In the “Partition Properties” window select the tab Reference Code selecting the
current A6xx0255/0266 reference code from the list and clicking on Details
4) Word 8 of the “Reference Code Details” shows the volume S/N of one of the missing
disk unit(s), as shown for DS8000 LUN ID 0x1000 in Figure 2 below. Word 9
indicates whether the last operational FC path was lost (SRC 21073002) or whether
access to the volume itself was lost (SRC 21073100). This information can be used
for further failure isolation from the SAN and DS8000 side.
Figure 2: System i Reference Code Details
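The reference codes named in the steps above can be checked mechanically once they have been read from the HMC. The Python sketch below is a minimal illustration under stated assumptions: the function names are hypothetical, the "xx" in A6xx0255/A6xx0266 is assumed to stand for any two hexadecimal digits, and the codes are assumed to be typed in by the operator, since reading them from the HMC programmatically is not covered here.

```python
import re

# Assumption: "xx" in A6xx0255 / A6xx0266 stands for any two hexadecimal digits.
SYSBAS_ACCESS_LOSS = re.compile(r"A6[0-9A-F]{2}(0255|0266)")

# Meanings of Word 9 as described in step 4 of this section.
WORD9_MEANING = {
    "21073002": "last operational FC path was lost",
    "21073100": "access to the volume itself was lost",
}


def is_sysbas_access_loss(src):
    """True if the SRC matches the SYSBAS access loss codes A6xx0255 / A6xx0266."""
    return SYSBAS_ACCESS_LOSS.fullmatch(src.strip().upper()) is not None


def explain_word9(word9):
    """Translate Word 9 of the Reference Code Details into the handbook wording."""
    return WORD9_MEANING.get(word9.strip(), "not described in this handbook")


if __name__ == "__main__":
    print(is_sysbas_access_loss("A6000255"))
    print(explain_word9("21073002"))
```

A Word 9 value of 21073002 points towards the SAN and path checks, while 21073100 points towards the DS8000 volume checks.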
1.3.1.2 Access Loss to an IASP Disk Unit
An access loss to an IASP disk unit causes an SRC B6000266 PAL entry and, after 20 minutes,
an automatic vary-off of the IASP, indicated by message CPIB711 “ASP device xxxx failed.”
in the QSYSOPR message queue.
1) Review the System i Product Activity Log (see section 1.3.1.5) and display details for
the SRC B6000266 entry showing the DS8000 volume S/N to which access was lost.
2) Refer to sections 1.3.2 and 1.3.3 for further failure isolation from DS8000 and SAN
side.
3) After recovery of the IASP disk unit access loss, try to vary on the IASP again via
VRYCFG CFGOBJ(IASP_name) CFGTYPE(*DEV) STATUS(*ON)
1.3.1.3 Loss of a redundant Path to a Multi-Path Disk Unit
A lost path to a multi-path disk unit is indicated by message ID CPPEA33 “Warning - An
external storage subsystem disk unit connection has failed.” and/or CPI096E "Disk unit
connection is missing", posted for every device of the lost path (re-posted every hour), and
by an SRC 21073002 PAL entry. Work through the following steps to help isolate a lost path problem:
1) Review the QSYSOPR message queue (see section 1.3.1.4) for message CPPEA33 to
get the resource name DMPxxx for a disk unit with a failing path.
2) Access System Service Tools to get the System i IOA’s physical location and WWPN
of the failing FC path by issuing the i5/OS command STRSST and selecting 1. Start a
service tool → 7. Hardware service manager → 3. Locate resource by resource
name. Enter the DMPxxx resource name, select option 8=Associated packaging
resource(s), then option 5=Display detail to get System i IOA physical location
displayed by the Unit ID and Card fields and WWPN displayed by the Worldwide
Port Name field
3) Locate the System i IOA of the failed FC path in System Service Tools’ Hardware
Service Manager → 1. Packaging hardware resources and ensure that the IOP/IOA
is in "operational" state. If not, try an IOP reset/re-IPL and engage your
service provider for further assistance if needed
4) Ensure the status LEDs of the System i FC IOA are solid green and flashing
yellow (link up). Flashing green indicates that the link is down; other states
typically indicate a HW problem
5) Ensure the affected System i IOA is logged into the SAN and DS8000
(DSCLI command lshostconnect -login may not represent the current login
status unless the port is reset via switching to/back from another topology using
setioport -topology [fc-al | scsi-fcp] port_ID; Brocade
command switchShow; Cisco command show flogi database). If not,
ensure the switch FC port and DS8000 FC port are in "online" status (DSCLI command
lsioport)
6) For any recovered lost path verify that it has been recognized by i5/OS via message
CPPEA35 “Informational only. A connection to an external storage subsystem disk
unit has been restored.” or an SRC 27873140 PAL entry.
1.3.1.4 Reviewing the i5/OS System Operator Message Queue
1) Log on to the i5/OS system and display the system operator message queue by issuing
the command DSPMSG QSYSOPR
2) Use option 5=Display details to display details for a selected message
3) To print the QSYSOPR message queue for problem data collection, issue the
command DSPMSG MSGQ(QSYSOPR) OUTPUT(*PRINT)
Note: This printout does not include the message details.
1.3.1.5 Reviewing the System i Product Activity Log
1) Access System Service Tools (SST) by using the command STRSST
2) Select 1. Start a service tool from the "System Service Tools (SST)" screen
3) Select 1. Product activity log from the "Start a Service Tool" screen
4) Select option 1. Analyze log from the "Product Activity Log" screen
5) Enter "3" for Log (3 = Magnetic media log) and timeframe of log in the "Select
Subsystem Data" screen
6) Enter "3" for Report type (3 = Print options) and "Y" for including optional statistical
entries in the "Select Analysis Report Options" screen
7) Enter "4" for Report type (4 = Print full report) and "Y" for including hexadecimal
data in the "Select Options for Printed Report" screen
8) Press F3 repeatedly and ENTER to exit from SST
9) Display the generated PAL spool file via running the command DSPSPLF
FILE(QPCSMPRT) SPLNBR(*LAST)
1.3.1.6 Creating a System i HSM System Configuration List Printout
Having a current System i Hardware Service Manager (HSM) system configuration list
printout available on paper is highly recommended as reference information to easily
associate System i disk resource names with the corresponding DS8000 volume S/N and the
System i IOA WWPN.
Use the following steps to create an HSM configuration list (see Figure 3):
1) Access system service tools by using the command STRSST
2) Select option 1. Start a service tool from the "System Service Tools (SST)" screen
3) Select option 7. Hardware Service Manager from the "Start a Service Tool" screen
4) Press F6=Print configuration from the "Hardware Service Manager" screen
5) Select the default "Format" option 1=132 characters wide and "Information printed"
option 1=Packaging resources sorted by location and press ENTER
6) Press F3 repeatedly and ENTER to exit from SST
7) To ease problem determination for System i access loss problems store a soft-copy of
the HSM system configuration list on another system and a paper printout of it
together with this recovery handbook.
Figure 3: Example Excerpt from HSM System Configuration List
1.3.2 From DS8000/HMC Side
The following items should be verified from DS8000 side to help isolate a System i access
loss problem:
1) Verify that all Fibre Channel IOAs from the affected System i LPAR are logged into the
DS8000
(DSCLI command lshostconnect -login may not represent the current login
status unless the port is reset via switching to/back from another topology using
setioport -topology [fc-al | scsi-fcp] port_ID). If not, ensure
the DS8000 FC ports are in "online" status (DSCLI commands lsioport, lshba
storage_image_ID) and there is no SAN connectivity problem (see section
1.3.3)
2) Verify all System i DS8000 volumes are in “online/normal” state
(DSCLI command lsfbvol)
3) Verify relevant DS8000 resources are in online or normal state
(DSCLI commands: lsrank, lsarray, lsddm storage_image_ID, lsda
storage_image_ID;
An IBM storage CE may check the DS8000 resource states on the HMC by selecting
Service Applications → Service Focal Point → Service Utilities, highlighting SF,
then Selected → View Storage Facility State (end of call),
and selecting any "FAILED" test and clicking on Details for further information)
1.3.3 From SAN Fabric Side
Perform the following steps as a sanity check to isolate System i access loss from SAN
perspective:
1) Ensure that both the System i IOA and its corresponding DS8000 host adapter are
logged into the SAN fabric
(Brocade command switchShow; Cisco command show flogi database;
use DSCLI command lsioport to get the DS8000 adapter WWPN)
2) Ensure the switch zoning is correct so that System i host initiator and DS8000 storage
target can “see” each other
(Brocade command cfgShow; Cisco command show zoneset active)
1.4 Troubleshooting FlashCopy Problems
With FlashCopy being a DS8000 internal CopyServices function the possibilities for
troubleshooting from a user perspective are limited.
Nonetheless check the following items mainly to help exclude a user error with FlashCopy in
context with System i:
1) For a FlashCopy establish failure with DS8000 message ID “CMUN03049E mkflash:
source_volID:target_volID: Copy Services operation failure: incompatible volumes”
ensure that both the FlashCopy source and target volumes are the same i5/OS volume
model, i.e. they have the same capacity and protection mode which is either
“protected” (models A0x) or “unprotected” (models A8x)
(DSCLI command lsfbvol; volume model information is shown by “DeviceMTM”
column output)
2) For a FlashCopy establish failure with DS8000 message ID “CMUN03035E mkflash:
source_volID:target_volID: Copy Services operation failure: feature not installed”
ensure that the FlashCopy license key (PTCs feature #72xx) is installed
(check via DSCLI command lskey).
3) For a System i backup host IPL or an IASP vary-on failure from FlashCopy target
volumes verify the following holds true:
a. A FlashCopy of SYSBAS or of the entire System i disk space was taken
while the System i production host accessing the FlashCopy source volumes
was powered off. For a FlashCopy of a System i independent auxiliary
storage pool (IASP), the IASP was varied off before its FlashCopy
relationships were established or re-synchronized.
This is the only way to ensure that all System i modified data in memory is
flushed to disk storage, giving a consistent state and a clean IPL or IASP
vary-on from the backup host accessing the FlashCopy target volumes.
b. Except for a GlobalMirror B to C volume relationship, the
FlashCopy relationship was NOT created using the target write inhibit mode
(DSCLI command lsflash source_volID; the “TargetWriteEnabled”
column output should show “Enabled”)
4) For a FlashCopy establish failure ensure the following:
a. Both FlashCopy source and target volumes are in “online / normal” state
(DSCLI command lsfbvol)
b. The rule that a FlashCopy target volume can be in only one
FlashCopy relationship is not violated.
If the specified source is already used as a FlashCopy target volume, message
ID “CMUN03008E mkflash: source_volID:target_volID: Copy Services
operation failure: cascading FlashCopy prohibited” is posted. If the specified
target is already used as target volume in an existing FlashCopy relationship,
message ID “CMUN03042E mkflash: source_volID:target_volID: Copy Services
operation failure: already a FlashCopy target” is posted.
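The four mkflash message IDs discussed in this section can be collected in a lookup table, so that a failed FlashCopy establish can be matched to the corresponding check. The Python sketch below is illustrative only: the names FLASHCOPY_ESTABLISH_ERRORS and advise are hypothetical, the hints paraphrase this section, and the volume IDs in the usage example are made-up placeholders.

```python
# Message IDs and reasons are taken from section 1.4; the hints paraphrase it.
# FLASHCOPY_ESTABLISH_ERRORS and advise() are hypothetical names for this sketch.
FLASHCOPY_ESTABLISH_ERRORS = {
    "CMUN03049E": (
        "incompatible volumes",
        "check with lsfbvol (DeviceMTM column) that source and target are the "
        "same i5/OS volume model",
    ),
    "CMUN03035E": (
        "feature not installed",
        "check with lskey that the FlashCopy license key is installed",
    ),
    "CMUN03008E": (
        "cascading FlashCopy prohibited",
        "the specified source is already used as a FlashCopy target volume",
    ),
    "CMUN03042E": (
        "already a FlashCopy target",
        "the specified target is already a target in an existing FlashCopy relationship",
    ),
}


def advise(message_line):
    """Return the handbook hint for the first known message ID found in a line."""
    for message_id, (reason, hint) in FLASHCOPY_ESTABLISH_ERRORS.items():
        if message_id in message_line:
            return f"{message_id} ({reason}): {hint}"
    return "Message ID not covered in this section"


if __name__ == "__main__":
    # Volume IDs 1000 and 1100 are placeholders for illustration only.
    sample = ("CMUN03042E mkflash: 1000:1100: Copy Services operation failure: "
              "already a FlashCopy target")
    print(advise(sample))
```

The table covers only the messages named in this handbook; any other message ID should be looked up in the DS8000 Information Center.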
1.5 Troubleshooting MetroMirror Problems
Refer to the corresponding subsection below to troubleshoot PPRC volume, path or primary
site disaster failures.
1.5.1 Volume Suspends or Consistency Group Freezes
Perform the following steps to troubleshoot DS8000 PPRC suspend or consistency group
freeze problems:
1) Review the SNMP trap 202 (or 200) message to find out the suspend reason code*
2) Check the current PPRC volume pair states on primary and secondary DS8000
(DSCLI command lspprc source_vol_ID:target_vol_ID)
3) Ensure all defined PPRC paths are available
(DSCLI command lspprcpath 00-FF showing "Success" state for the paths)
4) Correct any volume error states and resume PPRC
(DSCLI command resumepprc -type mmir
source_vol_ID:target_vol_ID, or use rmpprc … and mkpprc …
if required)
1.5.2 Path Failures
Follow the steps below to troubleshoot PPRC path failures:
1) Review the SNMP trap 100/101 message to find out the link failure reason code*
2) Check the current PPRC link states
(DSCLI command lspprcpath 00-FF)
3) For any failed paths based on the link failure reason code make sure to exclude any
connectivity errors:
a. Ensure the primary and secondary DS8000 PPRC FC ports are online
(DSCLI command lsioport)
b. Ensure the corresponding PPRC FC link is up
i. Primary and secondary DS8000 PPRC FC port LEDs showing solid
green and flashing yellow – flashing green indicates that the link is
down and other states typically indicate a HW problem
ii. If a SAN is used for the PPRC connections ensure both primary and
secondary DS8000 FC ports are logged into the SAN switch and the
zoning is correct so they can “see” each other
(Brocade commands switchShow and cfgShow; Cisco commands
show flogi database and show zoneset active)
1.5.3 Primary Site Disaster Scenarios
In case of a primary site disaster when no PPRC “freeze” automation software is used, proceed
as follows to check whether there is consistent data on the PPRC secondary site before deciding
on a potential MetroMirror failover (see section 1.5.4):
1) Query the status of the PPRC primary and secondary volumes and paths
(DSCLI commands lspprc source_vol_ID:target_vol_ID and
lspprcpath 00-FF)
2) Refer to Figure 4: MetroMirror Volume Swap Decision Matrix to check eligibility for
a site swap

| PPRC Primary Volume State (GUI or DSCLI) | PPRC Secondary Volume State (GUI or DSCLI) | PPRC Path State, Communication to Secondary | SNMP Traps (ALL hosts that are configured for SNMP need to be checked) | Host I/O State (Disk Error Presented?, I/O Halted?) | CAN PPRC be SWAPPED (primary to secondary volume)? | Actions to restore access and then later PPRC relationship |
|---|---|---|---|---|---|---|
| DUPLEX | DUPLEX | PPRC paths O.K. | NONE | NO ERRORS | YES | N/A |
| DUPLEX | DUPLEX | PPRC paths partially lost, comm_to_sec. O.K. | Trap 100 (pprc links degraded) will be posted | NO ERRORS, possibly performance degradation | YES | FIX PPRC path problems |
| DUPLEX to SUSPENDED (when last pprc path lost) | DUPLEX | PPRC paths ALL LOST, comm_to_sec. FAILED | Trap 101 (pprc links down), Trap 202 (Primary device suspended) | NO ERRORS, but PRIMARY VOL will go Suspended | *NO* | FIX PPRC path problems and RESYNC PPRC afterwards |
| SUSPENDED | DUPLEX | comm_to_sec. FAILED at ONE time for all pprc paths | Trap 101 (pprc links down), Trap 202 (Primary device suspended) | NO ERRORS, but PRIMARY suspended | *NO* | FIX the reason for volume suspend and RESYNC PPRC afterwards |
| "UNKNOWN" | DUPLEX | PPRC paths O.K., comm_to_sec. O.K. | Permanent I/O error, but NO PPRC suspend SNMP traps | SCSI_Disk Errors for "unknown" volumes | YES | Fix PRIMARY to restore volume access, re-establish PPRC |
| "UNKNOWN" | DUPLEX and SUSPENDED | PPRC paths O.K., comm_to_sec. O.K. | Trap 202 (Primary device suspended) | SCSI_Disk Errors for "unknown" volumes | *NO* | Fix PRIMARY to restore volume access, then RESYNC |
| SUSPENDED | SUSPENDED or OFFLINE to primary | PPRC paths O.K. | Trap 202 (Primary device suspended) | NO ERRORS | *NO* (IF device problem on secondary!!!) | Check affected DBs to decide WHAT volumes to be used (pri/sec) |

Figure 4: MetroMirror Volume Swap Decision Matrix
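The YES/NO column of Figure 4 can be condensed into a simple rule for a first-pass check. The Python sketch below is one possible reading of the matrix, not an IBM rule: a swap is treated as eligible only if the primary state is DUPLEX or "UNKNOWN" and every secondary volume is DUPLEX. The function name swap_eligible is hypothetical, the state strings are the labels used in Figure 4 rather than literal lspprc output, and the path, SNMP trap and host I/O columns are deliberately not modelled, so the full matrix still has to be consulted.

```python
# A simplified reading of the YES/NO column in Figure 4 (assumption, see lead-in).
# State strings are the labels used in the matrix, not literal DSCLI output.
PRIMARY_STATES_ALLOWING_SWAP = {"DUPLEX", "UNKNOWN"}


def swap_eligible(primary_state, secondary_states):
    """Return True if Figure 4 answers YES for this combination of volume states.

    primary_state    -- "DUPLEX", "SUSPENDED" or "UNKNOWN"
    secondary_states -- iterable with the state of every secondary volume
    """
    secondaries = {state.upper() for state in secondary_states}
    primary_ok = primary_state.upper() in PRIMARY_STATES_ALLOWING_SWAP
    # Every secondary must be DUPLEX; a mix of DUPLEX and SUSPENDED means NO.
    return primary_ok and secondaries == {"DUPLEX"}


if __name__ == "__main__":
    print(swap_eligible("DUPLEX", ["DUPLEX", "DUPLEX"]))
    print(swap_eligible("UNKNOWN", ["DUPLEX", "SUSPENDED"]))
```

A primary that went from DUPLEX to SUSPENDED when the last path was lost is passed as "SUSPENDED", which yields NO as in the third row of the matrix.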
1.5.4 MetroMirror Failover/Failback Procedures
This section shows the required steps for a MetroMirror failover from the production to the
recovery site and later failback from the recovery to the original production site using DSCLI
commands.
1.5.4.1 MetroMirror Failover from the Production to the Recovery Site
1) Follow the steps in section 1.5.3 to determine eligibility for a failover with consistent
data on the recovery site before proceeding
2) For a practice failover only: stop all host I/O on the production site and suspend the
PPRC volume relationships
(pausepprc source_volume_ID:target_volume_ID)
3) From the recovery site fail over PPRC to terminate PPRC, with the target volumes
becoming suspended source volumes available for host I/O
(failoverpprc -type mmir
source_volume_ID:target_volume_ID)
4) Start recovery site host systems
1.5.4.2 MetroMirror Failback from the Recovery to the original Production Site
1) From the recovery site verify PPRC volume and path states before failing back:
a. Ensure PPRC volumes on the recovery site are in source suspended state
(lspprc source_volume_ID:target_volume_ID);
if not, use pausepprc source_volume_ID:target_volume_ID /
failoverpprc -type mmir
source_volume_ID:target_volume_ID
b. Ensure PPRC paths from recovery to production site are established
(lspprcpath 00-FF; to establish paths if needed use mkpprcpath
-remotewwnn wwnn -srclss source_LSS -tgtlss target_LSS
source_port_ID:target_port_ID)
2) From the recovery site fail back MetroMirror to re-synchronize changes from the
recovery site back to the production site
Important: Stop host I/O on the original production site and make sure to specify the recovery site
volumes as source volumes and the original production site volumes as target!
(failbackpprc -type mmir
source_volume_ID:target_volume_ID)
3) Verify that PPRC volumes are in full-duplex state again
(lspprc source_volume_ID:target_volume_ID)
4) From the production site, after PPRC volumes are in full-duplex state again, fail over
MetroMirror from the production to the recovery site:
a. Ensure PPRC paths from production to recovery site are established
(lspprcpath 00-FF; to establish paths if needed use mkpprcpath
-remotewwnn wwnn -srclss source_LSS -tgtlss
target_LSS source_port_ID:target_port_ID)
b. Fail over PPRC target volumes on the production site to become suspended
primaries
(failoverpprc -type mmir
source_volume_ID:target_volume_ID)
c. Re-establish the MetroMirror relationships between the original production
and recovery site
Important: Stop host I/O on the recovery site and make sure to specify the original production
site volumes as source volumes and the recovery site volumes as target!
(failbackpprc -type mmir
source_volume_ID:target_volume_ID)
d. Verify that all MetroMirror volumes are in full-duplex state again
(lspprc source_volume_ID:target_volume_ID)
5) Start original production site host systems
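Because the source:target order of the volume pairs changes with the site a command is issued from, it can help to generate the command sequence from one list of volume pairs. The Python sketch below only builds command strings in the short form used in this section; it does not call DSCLI. The function names and the volume IDs are hypothetical, further parameters such as storage image IDs are omitted as in the steps above, and passing the recovery site volume as source to failoverpprc is an assumption derived from the ordering this section states explicitly for failbackpprc.

```python
# Builds the DSCLI command strings of sections 1.5.4.1 and 1.5.4.2 as text only.
# Function names and volume IDs are hypothetical; nothing is executed.


def metro_mirror_failover(pairs, practice=False):
    """pairs: list of (production_volume_ID, recovery_volume_ID) tuples.

    Returns a list of (site_to_issue_from, command) tuples.
    """
    steps = []
    if practice:
        # Practice failover only: suspend the pairs from the production site.
        for production, recovery in pairs:
            steps.append(("production", f"pausepprc {production}:{recovery}"))
    for production, recovery in pairs:
        # Assumption: the recovery site volume is given as source here.
        steps.append(("recovery", f"failoverpprc -type mmir {recovery}:{production}"))
    return steps


def metro_mirror_failback(pairs):
    """Command sequence for failing back to the original production site."""
    steps = []
    for production, recovery in pairs:
        # Re-synchronize recovery -> production (recovery volumes as source).
        steps.append(("recovery", f"failbackpprc -type mmir {recovery}:{production}"))
    for production, recovery in pairs:
        # After full duplex: production volumes become suspended primaries.
        steps.append(("production", f"failoverpprc -type mmir {production}:{recovery}"))
    for production, recovery in pairs:
        # Re-establish the original direction production -> recovery.
        steps.append(("production", f"failbackpprc -type mmir {production}:{recovery}"))
    return steps


if __name__ == "__main__":
    volume_pairs = [("1000", "1100")]  # placeholder IDs for illustration
    for site, command in metro_mirror_failover(volume_pairs, practice=True):
        print(f"{site:10s} {command}")
    for site, command in metro_mirror_failback(volume_pairs):
        print(f"{site:10s} {command}")
```

The verification steps (lspprc, lspprcpath) and the instructions to stop host I/O are intentionally left out of the generated list and still have to be performed between the commands as described above.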
1.6 Trouble Shooting GlobalMirror Problems
Refer to the corresponding subsection below for troubleshooting GlobalMirror session, PPRC
path and GlobalCopy failures and for failing over GlobalMirror in the case of total path or
primary site disasters.
1.6.1 GM Session, PPRC Path and GlobalCopy Failures
The following checkpoints help to isolate partial GlobalMirror failures (see section
1.6.2 for the failover procedure in case of a primary DS8000 or total path failure):
1) Check any corresponding SNMP CopyServices alert traps
(refer to section 1.2)
2) Use the following DSCLI commands to diagnose GM problems:
a. Check if the GM session is in "running" state
(showgmir master_controlpath_LSS)
b. Check the "last failure" for the GM session and whether it was associated with
a particular LSS
(showgmir -metrics master_controlpath_LSS)
c. Check the status of the PPRC paths
(lspprcpath 00-FF)
d. Check the status of the GlobalCopy pairs, especially looking for any suspend or
simplex states and any large number of out-of-sync tracks
(lssession LSS_ID,
lspprc -l source_vol_ID:target_vol_ID,
showgmiroos session_ID)
e. Check the status of the FlashCopy relationships
(lsflash -l source_vol_ID:target_vol_ID)
1.6.2 GlobalMirror Session Failover/Failback Procedures
The required steps for a GlobalMirror session failover from the production to the recovery site
and later failback from the recovery to the original production site using DSCLI commands
are described in this section.
1.6.2.1 GlobalMirror Failover from the Production to the Recovery Site
1) If the production machine is still accessible prepare for a “clean” GM failover:
a. Pause GlobalMirror (consistency group) processing
(pausegmir -lss LSS_ID -session session_ID)
b. Verify that GlobalMirror is in "paused" (i.e. GlobalCopy) state
(showgmir master_controlpath_LSS)
c. Suspend GlobalCopy relationships
(pausepprc source_vol_ID:target_vol_ID)
2) From the recovery site fail over GlobalCopy to the recovery site to terminate PPRC,
with target volumes becoming suspended primaries
(failoverpprc -type gcp source_vol_ID:target_vol_ID)
3) Ensure consistent data on the recovery site C volumes
Note: For a planned site swap where host I/O was stopped before pausing or
removing the GM session, this step is not required because the C volumes should be
consistent.
a. Query the consistency group state for the C volumes
(lsflash source_vol_ID:target_vol_ID, looking for the
b. Compare the C volumes’ “Sequence Number” and “revertible” information
with Table 1 below and perform the required action to ensure consistent data
on the C volumes.
| Sequence Numbers? | Revertible Volumes? | Required Action |
|---|---|---|
| equal | all disabled | none |
| different | some enabled / some disabled | For all revertible C volumes: revertflash source_volume_ID |
| equal | all enabled | For all C volumes: revertflash source_volume_ID |
| equal | some enabled / some disabled | For all (revertible) C volumes: commitflash source_volume_ID |

Table 1: GlobalMirror FlashCopy Consistency Group Validation
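The decision in Table 1 can be expressed as a small function once the sequence number and revertible state of every C volume have been collected. The Python sketch below is a minimal illustration: the function name required_action and the input format (one tuple per C volume) are hypothetical, and collecting the values from the lsflash output is assumed to be done by hand. Combinations that Table 1 does not list raise an error instead of guessing an action.

```python
# Encodes the four rows of Table 1. The function name and the input format are
# hypothetical; the values are assumed to be read manually from lsflash output.


def required_action(c_volumes):
    """c_volumes: list of (sequence_number, revertible) tuples, one per C volume.

    Returns the required action from Table 1 as text.
    """
    if not c_volumes:
        raise ValueError("no C volumes given")
    sequence_equal = len({sequence for sequence, _ in c_volumes}) == 1
    revertible = [flag for _, flag in c_volumes]

    if sequence_equal and not any(revertible):
        return "none"
    if sequence_equal and all(revertible):
        return "revertflash for all C volumes"
    if sequence_equal:
        # Some, but not all, volumes are revertible and sequence numbers are equal.
        return "commitflash for all revertible C volumes"
    if any(revertible) and not all(revertible):
        # Sequence numbers differ and only some volumes are revertible.
        return "revertflash for all revertible C volumes"
    raise ValueError("combination not covered by Table 1")


if __name__ == "__main__":
    # Sequence numbers are placeholders for illustration only.
    print(required_action([("46E8", True), ("46E8", False)]))
    print(required_action([("46E8", True), ("46E7", False)]))
```

Raising an error for unlisted combinations keeps the sketch from suggesting a revert or commit in a situation the handbook does not describe.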
4) Create consistent data on recovery site B volumes via FlashCopy fast reverse restore
(reverseflash -fast -tgtpprc source_vol_ID:target_vol_ID,
specifying the B volumes as sources and the C volumes as targets)
5) Wait for FlashCopy background copy completion by checking that the relationships have
ended
(lsflash source_vol_ID:target_vol_ID)
6) Re-establish the FlashCopy relationship between B and C volumes to have meaningful
data on the C volumes after the fast reverse restore (FRR) and to prepare for
re-establishing GM (in a disaster situation using "nocp" may not be desired)
(mkflash -tgtinhibit -record -nocp
source_vol_ID:target_vol_ID)
7) Start recovery site host systems from consistent B volumes
1.6.2.2 GlobalMirror Failback from the Recovery to the original Production Site
1) Ensure production site A volumes are offline to all hosts
2) From the recovery site ensure PPRC paths are established between the recovery and
production site
(lspprcpath 00-FF, to establish paths if needed use mkpprcpath
–remotewwnn wwnn –srclss source_LSS –tgtlss target_LSS
source_port_ID:target_port_ID)
From the recovery site fail back GlobalCopy to the production site to re-synchronize
changes from the recovery site back to the original production site
Important: Stop host I/O on the original production site and make sure to specify the recovery site
volumes as source volumes and the original production site volumes as target!
(failbackpprc –type gcp source_vol_ID:target_vol_ID)
3) Stop host I/O to the recovery site B volumes
4) Verify that GlobalCopy synchronization from B to A has completed, i.e. “out of sync
tracks” are zero for all GlobalCopy volumes
(lspprc source_vol_ID:target_vol_ID)
5) From the production site re-establish GlobalCopy from the production to the recovery
site:
a. Ensure PPRC paths are established between the production and recovery site
(lspprcpath 00-FF, to establish paths if needed use mkpprcpath
–remotewwnn wwnn –srclss source_LSS –tgtlss target_LSS
source_port_ID:target_port_ID)
b. Fail over PPRC on the production site with target volumes on the production
site becoming suspended primaries
(failoverpprc –type gcp source_vol_ID:target_vol_ID)
c. Fail back GlobalCopy on the production site to re-establish the GlobalCopy
relationships between production and recovery site
Note: Specify the original production site volumes as source volumes and the recovery site
volumes as target!
(failbackpprc -type gcp source_vol_ID:target_vol_ID)
6) Re-start GlobalMirror processing:
a. List defined GlobalMirror sessions for the specified LSSs
(lssession LSS_ID)
b. Check if GlobalMirror has been paused or stopped
(showgmir master_controlpath_LSS)
c. Resume the GlobalMirror session if it was paused before the site switch
(resumegmir –lss LSS_ID –session session_ID)
or
Re-start the GlobalMirror session if it was stopped before the site switch
(mkgmir –lss LSS_ID –session session_ID)
d. Verify that GlobalMirror is in "running" copy state
(showgmir master_controlpath_LSS)
7) Re-start production site host systems from A volumes
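Two checks in this procedure lend themselves to a short sketch: the step 4 test that all “out of sync tracks” are zero, and the step 6c choice between resuming and re-starting the GlobalMirror session. Both function names and the input formats are assumptions made for illustration; the track counts are expected to be read from the lspprc output by the operator or by a separate parser that is not shown. The command strings repeat the options used in the steps above with a plain ASCII hyphen and have not been checked against a particular DSCLI release.

```python
from typing import Dict, List


def pairs_not_in_sync(out_of_sync_tracks: Dict[str, int]) -> List[str]:
    """Return the GlobalCopy pair IDs whose out-of-sync track count is not zero."""
    return [pair for pair, tracks in out_of_sync_tracks.items() if tracks != 0]


def gm_restart_command(state_before_switch: str, lss_id: str, session_id: str) -> str:
    """Choose the step 6c command from the session state before the site switch."""
    if state_before_switch == "paused":
        return f"resumegmir -lss {lss_id} -session {session_id}"
    if state_before_switch == "stopped":
        return f"mkgmir -lss {lss_id} -session {session_id}"
    raise ValueError(f"unexpected GlobalMirror state: {state_before_switch!r}")
```

Step 5 should only be started once pairs_not_in_sync returns an empty list for all GlobalCopy volumes.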
1.6.3 Failing over a Subset of Volumes via Pausing the GM Session
Proceed as follows to fail over only a subset of GlobalMirror volumes from the production to
the recovery site. This procedure currently works only when using DSCLI commands:
1) Pause the GlobalMirror session
(pausegmir –lss LSS_ID –session session_ID)
2) Remove the volumes to be failed over from the GlobalMirror session
(chsession –action remove –volume source_volume_ID –lss
LSS_ID –session session_ID)
3) Restart the GlobalMirror session
(resumegmir –lss LSS_ID –session session_ID)
4) Go to section 1.6.2.1 step 1c and proceed to perform the GlobalMirror failover only
for the previously removed volumes.
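Steps 1 to 3 above can be written down as an ordered command list, for example to prepare a reviewed script before the site switch. The helper below is a sketch only: its name and parameters are hypothetical, it merely fills the placeholders of the commands quoted in the steps (using a plain ASCII hyphen), and it does not run anything or add further options that an actual DSCLI invocation may need.

```python
from typing import List


def subset_failover_commands(
    lss_id: str, session_id: str, volume_ids: List[str]
) -> List[str]:
    """Build the pause / remove / resume sequence for a subset of volumes."""
    if not volume_ids:
        raise ValueError("at least one volume must be removed from the session")

    # Step 1: pause the GlobalMirror session.
    commands = [f"pausegmir -lss {lss_id} -session {session_id}"]
    # Step 2: remove each volume that is to be failed over.
    commands += [
        f"chsession -action remove -volume {vol} -lss {lss_id} -session {session_id}"
        for vol in volume_ids
    ]
    # Step 3: restart the GlobalMirror session for the remaining volumes.
    commands.append(f"resumegmir -lss {lss_id} -session {session_id}")
    return commands
```

The session is resumed only after all removals, so the remaining volumes continue to form consistency groups while the removed ones are failed over as described in step 4.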
1.7 Problem Data Collection
The following general information is typically required for further problem assistance by IBM
support:
• detailed problem description
• impact (access loss, data loss, performance degradation)
• date/time of problem occurrence
• recovery actions tried and results
• date/time of any recent changes in the environment
• actual vs. expected or previous performance and type of workload
  (performance problems only)
1.7.1 From System i Side
The following information is typically required by IBM support from the System i side for any
access loss, data loss or performance degradation problems:
• affected DS8000 volume IDs and System i host WWPNs
• i5/OS and cumPTF level (WRKPTFGRP)
• i5/OS HSM system configuration list (see section 1.3.1.6)
• difference between system time [DSPSYSVAL SYSVAL(QTIME)] and real time
• Product Activity Log (see section 1.3.1.5)
• QSYSOPR messages (see section 1.3.1.4)
• SAN connectivity diagram
• Collection Services "System report -> Disk utilization" data
  (performance problems only)
1.7.2 From DS8000 Side
The following information is typically required by IBM support from the DS8000 side for any
access loss, data loss or performance degradation problems:
• DS8000 microcode level(s) (DSCLI command lsserver –l)
• Information about DS8000 CopyServices configuration
  (primary and secondary S/N, usage of MM, GM and/or FLC)
• DS8000 Data Collection on Demand (PE package)
  (typically offloaded by IBM support)
• forced DS8000 LPAR statesaves
  (typically forced by IBM support, for PPRC problems for both primary & secondary)
1.7.3 From SAN Side
The following information is typically required by IBM support from the SAN side for any access
loss problems:
• SAN switch logs
  (e.g. Brocade supportShow or Cisco show tech-support detail)
• SAN layout diagram
1.8 References
• IBM System Storage DS8000 Service Information Center,
  http://publib.boulder.ibm.com/infocenter/dsichelp/ds8000sv/index.jsp
• IBM System Storage DS8000 Information Center,
  http://publib.boulder.ibm.com/infocenter/dsichelp/ds8000ic/index.jsp
• IBM System Storage DS8000 Command-Line Interface User’s Guide (SC26-7625),
  http://www1.ibm.com/support/docview.wss?rs=1114&context=HW2C2&dc=DB500&q1=ssg1*&uid=ssg1S1002949&loc=en_US&cs=utf-8&lang=en
• IBM System Storage DS8000 Messages Reference (GC26-7914),
  http://www-1.ibm.com/support/docview.wss?uid=ssg1S7001164&aid=1
• IBM DS8000 Preventive Service Planning Information,
  http://www1.ibm.com/support/docview.wss?rs=1113&context=HW2B2&dc=DB500&uid=ssg1S1002949&loc=en_US&cs=utf-8&lang=en
• IBM System Storage DS8000 Host System Attachment Guide (SC26-7917),
  http://www-1.ibm.com/support/docview.wss?uid=ssg1S7001161&aid=1
• i5/OS V5R4 Information Center,
  http://publib.boulder.ibm.com/infocenter/iseries/v5r4/index.jsp
• IBM System Storage DS8000 Series: Implementing CopyServices in Open Environments
  (SG24-6788),
  http://www.redbooks.ibm.com/abstracts/sg246788.html?Open
• iSeries and IBM TotalStorage: A Guide to Implementing External Disk on eServer i5
  (SG24-7120),
  http://www.redbooks.ibm.com/abstracts/sg247120.html?Open
• PCI and PCI-X Placement Rules for IBM System i models (REDP-4011-03),
  http://www.redbooks.ibm.com/redpieces/abstracts/redp4011.html