TIVOLI NETCOOL PERFORMANCE MANAGER – HIGH AVAILABILITY TEST FOR TNPM 1.3.3 WIRELINE
High Availability for Tivoli Netcool Performance Manager 1.3.3 – Wireline using High Availability Disaster Recovery (HADR)
Document version v1.0, February 11, 2014

1. Introduction

This document describes the test validation performed on Tivoli Netcool Performance Manager (TNPM) 1.3.3 Wireline on Red Hat Enterprise Linux (RHEL) in a High Availability (HA) configuration using Tivoli System Automation for Multiplatforms (TSAM) with the High Availability Disaster Recovery (HADR) feature.

This document covers:
- A high-level description of the test environment configuration and setup.
- Issues and challenges encountered.
- Details of the test validation of TSAM-HADR for IBM DB2 database redundancy in relation to the TNPM applications.

1.1 Target Audience

This document provides a limited view of the High Availability testing done for TNPM 1.3.3. It is not intended to capture configuration and setup details, because the product implementation differs for each customer based on their requirements.

1.2 References

For detailed information on Tivoli Netcool Performance Manager 1.3.3 Wireline, see
http://publib.boulder.ibm.com/infocenter/tivihelp/v8r1/topic/com.ibm.tnpm_1.3.3.doc/welcome_tnpm_db2_support.html

For detailed information on System Automation for Multiplatforms installation and configuration, see
http://publib.boulder.ibm.com/infocenter/tivihelp/v3r1/topic/com.ibm.samp.doc_3.2.1/HALICG21.pdf

IBM Redbook on HADR:
http://www.redbooks.ibm.com/redbooks/pdfs/sg247363.pdf

Tivoli Netcool Performance Manager – Wireline HAM Deployment Guide.

To learn how Tivoli Netcool Performance Manager 1.3.3 Wireline SNMP Dataload High Availability works, see
http://publib.boulder.ibm.com/infocenter/tivihelp/v8r1/topic/com.ibm.tnpm_1.3.3.doc/install_db2_support/ctnpm_installguide_chapterusingthehighavailabilitymanager-08-01.html

For detailed instructions on installing Tivoli Integrated Portal (TIP) in a load-balanced environment, see
http://publib.boulder.ibm.com/infocenter/tivihelp/v15r1/index.jsp?topic=/com.ibm.tip.doc/ctip_config_ha_ovw.html

For detailed reference information on System Automation for Multiplatforms installation and configuration for Linux on TNPM 1.3.1, see
https://www-304.ibm.com/software/brandcatalog/ismlibrary/details?catalog.label=1TW10NP58#

2. Prerequisites

This section describes the prerequisites required to design, install, and configure TNPM for high availability.

2.1 Operating System

Use RHEL 5.9, 64-bit, for Tivoli System Automation for Multiplatforms-based TNPM high availability.
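As a quick sanity check of this prerequisite on each server, the release and architecture can be confirmed from the shell (a minimal sketch using standard RHEL commands; the expected values follow the requirement above):

$ cat /etc/redhat-release     # expect: Red Hat Enterprise Linux Server release 5.9
$ uname -m                    # expect: x86_64 for the 64-bit requirement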
2.2 Prerequisites for Netcool/Proviso

Refer to the following information before you install Netcool/Proviso:

- "Supported Operating Systems and Modules" section in
  http://publib.boulder.ibm.com/infocenter/tivihelp/v8r1/topic/com.ibm.tnpm_1.3.3.doc/config_recommendations/ctnpm_configrec_guide_db2.html
- "Pre-Installation Setup Tasks" section in
  http://publib.boulder.ibm.com/infocenter/tivihelp/v8r1/topic/com.ibm.tnpm_1.3.3.doc/install_db2_support/ctnpm_installguide_preinstallationsetuptasks-04-04.html

2.3 Prerequisites for System Automation for Multiplatforms

For details of the prerequisites for installing System Automation for Multiplatforms, refer to the "Preparing for installation" section in
http://publib.boulder.ibm.com/infocenter/tivihelp/v3r1/topic/com.ibm.samp.doc_3.2.1/HALICG21.pdf

For additional reference, see the white paper on HADR and TSA information pointers:
https://www.ibm.com/developerworks/community/blogs/DB2LUWAvailability/entry/hadr_and_tsa_information_pointers?lang=en

2.4 Highly available storage

TNPM high availability requires a highly available shared storage array that protects data integrity. Shared disk storage devices, multi-hosted SCSI storage arrays, and SAN-based devices are generally used for a high availability solution.

3. High Availability Architecture for TNPM

This section describes the TSAM-HADR implementation on the DB2 database server.

3.1 Prerequisites and setup summary

For procedures to configure and set up the HADR environment, see the links below (Figure 1). For any issues and concerns, consult the on-site DBA expert.

Figure 1: Important HADR links
http://pic.dhe.ibm.com/infocenter/db2luw/v10r1/index.jsp?topic=%2Fcom.ibm.db2.luw.admin.ha.doc%2Fdoc%2Fr0051349.html
http://pic.dhe.ibm.com/infocenter/db2luw/v10r1/index.jsp?topic=%2Fcom.ibm.db2.luw.admin.ha.doc%2Fdoc%2Ft0011725.html

Alternatively, Section 5 of the following document was followed during the HADR configuration. The HADR topology configured was the "Single Network HADR topology", automated using the db2haicu XML mode.
http://public.dhe.ibm.com/software/dw/data/dm-0908hadrdb2haicu/HADR_db2haicu.pdf

Requirements

To set up HADR, the following requirements must be in place:

- HADR is a DB2 feature available in all DB2 editions except DB2 Express-C. For the standby and auxiliary standby databases, DB2 licensing depends on whether they are used as hot, warm, or cold standby. For more information, consult an IBM Information Management marketing representative.
- IBM Tivoli System Automation for Multiplatforms (SA MP) is integrated with the IBM DB2 server as part of the DB2 High Availability Feature on the AIX, Linux, and Solaris operating systems. You can choose to install Tivoli SA MP while installing the DB2 server. Check the license terms for using IBM Tivoli SA MP integrated with the IBM DB2 server.
- The operating system on the primary and standby servers should be the same version, including patches. You can violate this rule for a short time during a rolling upgrade, but take extreme caution.
- A TCP/IP interface must be available between the HADR host machines.
- The DB2 version and level must be identical on both the primary and the standby databases.
- The DB2 software for both the primary and the standby databases must be the same bit size (32-bit or 64-bit).
- Buffer pool sizes on the primary and the standbys should be the same. If you build the standby database by restoring a backup copy taken from the primary, the buffer pool sizes are the same because this information is included in the database backup. If you are using the reads on standby feature, you must configure the buffer pool on the primary so that the active standby can accommodate both log replay and read applications.
- The primary and standby databases must have the same database name, which means they must be in different instances.
- Table spaces must be identical on the primary and standby databases, including:
  – Table space type (DMS or SMS)
  – Table space size
  – Container path
  – Container size
  – Container file type (raw device or file system)
- The amount of space allocated for log files should also be the same on both the primary and standby databases.
- The database manager and instance profile registry variable settings of the primary and standby should be the same.
- For LOAD with the COPY YES option, if a shared NFS path is used, the ownership of the path should match the instance owner. Ensure that the shared path is accessible from both servers.
- The system clocks of the HADR primary and standby nodes must be synchronized.
- The database configuration parameter CONNECT_PROC should be set to NULL (DB2 10.1 FP1 bug).
- HADR requires the database to use archival logging.
- Set the LOGINDEXBUILD parameter so that index creation, recreation, or reorganization operations are logged, by running the following command:
  db2 update database configuration for sample using LOGINDEXBUILD ON

Creating the NFS shared directory and mounting it

In the HADR setup we create an instance (db2) on both the primary and standby servers. The UID and GID of the db2 user must be the same on the primary and standby for the NFS share. To check this, try accessing a file on the NFS share from both servers. If the owner and group of the file resolve correctly, you are good to go.

$ ls -ltr /db2_load/hadr_shr | tail -1
-rw-r----- 1 db2 db2iadm 8429568 Jan 13 14:34 PV.4.db2.DBPART000.20140113143428.001

The directory /db2_load/hadr_shr is created on the primary (10.55.236.167) and then NFS mounted on the standby (10.55.236.163).

[db2@tnpminlnx0303 milind]$ df -h
Filesystem                        Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup00-LogVol01   128G   48G   74G  40% /
/dev/hda1                         487M   18M  444M   4% /boot
tmpfs                             7.9G   12K  7.9G   1% /dev/shm
10.55.236.167:/db2_load/hadr_shr  128G   34G   87G  28% /db2_load/hadr_shr

All HADR-related messages are logged in the DB2 diagnostic log /opt/db2/sqllib/db2dump/db2diag.log.
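These checks can be run quickly from the shell on both servers (a minimal sketch; the user, host names, and paths are the ones used in this setup):

# Run on both tnpminlnx0307 (primary) and tnpminlnx0303 (standby)
$ id db2                                  # UID and GID must match on both servers
$ ls -l /db2_load/hadr_shr | tail -1      # owner and group should resolve to db2:db2iadm
$ grep -i hadr /opt/db2/sqllib/db2dump/db2diag.log | tail -5    # recent HADR messages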
Automatic failover and enablement

HADR setup settings. The following port values are used in /etc/services for the HADR setup:

db2_hadr_1    50010/tcp
db2_hadr_2    50011/tcp

Primary server: tnpminlnx0307 (IP: 10.55.236.167)
Standby server: tnpminlnx0303 (IP: 10.55.236.163)
Virtual IP: 10.55.236.191

HADR settings on tnpminlnx0307

[db2@tnpminlnx0307 milind]$ db2 get db cfg for pv | grep -i hadr
HADR database role = PRIMARY
HADR local host name (HADR_LOCAL_HOST) = tnpminlnx0307
HADR local service name (HADR_LOCAL_SVC) = 50010
HADR remote host name (HADR_REMOTE_HOST) = tnpminlnx0303
HADR remote service name (HADR_REMOTE_SVC) = 50011
HADR instance name of remote server (HADR_REMOTE_INST) = db2
HADR timeout value (HADR_TIMEOUT) = 60
HADR target list (HADR_TARGET_LIST) =
HADR log write synchronization mode (HADR_SYNCMODE) = SYNC
HADR spool log data limit (4KB) (HADR_SPOOL_LIMIT) = 0
HADR log replay delay (seconds) (HADR_REPLAY_DELAY) = 0
HADR peer window duration (seconds) (HADR_PEER_WINDOW) = 120

[db2@tnpminlnx0307 milind]$ db2set -all
[i] DB2_DEFERRED_PREPARE_SEMANTICS=YES
[i] DB2_COMPATIBILITY_VECTOR=ORA
[i] DB2_LOAD_COPY_NO_OVERRIDE=COPY YES to /db2_load/hadr_shr
[i] DB2COMM=TCPIP
[i] DB2AUTOSTART=YES
[g] DB2SYSTEM=tnpminlnx0307.persistent.co.in
[g] DB2INSTDEF=db2
[g] DB2ADMINSERVER=dasusr1

HADR settings on tnpminlnx0303

[db2@tnpminlnx0303 milind]$ db2 get db cfg for pv | grep -i hadr
HADR database role = STANDBY
HADR local host name (HADR_LOCAL_HOST) = tnpminlnx0303
HADR local service name (HADR_LOCAL_SVC) = 50011
HADR remote host name (HADR_REMOTE_HOST) = tnpminlnx0307
HADR remote service name (HADR_REMOTE_SVC) = 50010
HADR instance name of remote server (HADR_REMOTE_INST) = db2
HADR timeout value (HADR_TIMEOUT) = 60
HADR target list (HADR_TARGET_LIST) =
HADR log write synchronization mode (HADR_SYNCMODE) = SYNC
HADR spool log data limit (4KB) (HADR_SPOOL_LIMIT) = 0
HADR log replay delay (seconds) (HADR_REPLAY_DELAY) = 0
HADR peer window duration (seconds) (HADR_PEER_WINDOW) = 120

[db2@tnpminlnx0303 milind]$ db2set -all
[i] DB2_DEFERRED_PREPARE_SEMANTICS=YES
[i] DB2_COMPATIBILITY_VECTOR=ORA
[i] DB2_LOAD_COPY_NO_OVERRIDE=COPY YES to /db2_load/hadr_shr
[i] DB2COMM=TCPIP
[i] DB2AUTOSTART=YES
[g] DB2SYSTEM=tnpminlnx0303.persistent.co.in
[g] DB2INSTDEF=db2

Note: In this HADR setup, the standby database will be in ROLL-FORWARD PENDING state.

Configuring HADR for automatic failover by using the Tivoli SA MP cluster manager

Run on both servers:
$ preprpnode tnpminlnx0307 tnpminlnx0303

Configuration setting files:
/opt/db2/milind/167.xml (used on the primary)
/opt/db2/milind/163.xml (used on the standby)

On the standby:
$ db2haicu -f /opt/db2/milind/163.xml

On the primary:
$ db2haicu -f /opt/db2/milind/167.xml

The following fields are edited accordingly; for example, 167.xml is shown below. For the HADRDBSet, the "localHost" and "remoteHost" values need to be updated accordingly. 10.55.236.191 is the virtual IP used.
<PhysicalNetwork physicalNetworkName="db2_public_network_0" physicalNetworkProtocol="ip">
  <Interface interfaceName="eth0" clusterNodeName="tnpminlnx0307">
    <IPAddress baseAddress="10.55.236.167" subnetMask="255.255.252.0" networkName="db2_public_network_0"/>
  </Interface>
  <Interface interfaceName="eth0" clusterNodeName="tnpminlnx0303">
    <IPAddress baseAddress="10.55.236.163" subnetMask="255.255.252.0" networkName="db2_public_network_0"/>
  </Interface>
</PhysicalNetwork>

<ClusterNode clusterNodeName="tnpminlnx0307"/>
<ClusterNode clusterNodeName="tnpminlnx0303"/>

<DB2Partition dbpartitionnum="0" instanceName="db2">
</DB2Partition>

<HADRDBSet>
  <HADRDB databaseName="PV" localInstance="db2" remoteInstance="db2" localHost="tnpminlnx0307" remoteHost="tnpminlnx0303" />
  <VirtualIPAddress baseAddress="10.55.236.191" subnetMask="255.255.252.0" networkName="db2_public_network_0"/>
</HADRDBSet>

Automatic client reroute setting with the virtual IP

Run on both servers:
$ db2 update alternate server for database pv using hostname 10.55.236.191 port 50000

3.2 Updating the required db2auth for TNPM

The db2auth needs to be examined in TNPM 1.3.3 so that proper access is granted to the TNPM users (PV_ADMIN, PV_LOIS, PV_GUI, and so on) and remains properly set up during failover. Section 4, "Issues and Challenges", describes some of the troubleshooting. These authorization changes were implemented after the TNPM standalone installation, when the redundant DB2 database was attached to the standalone environment and TSAM-HADR was configured later on.

4. Issues and Challenges

Issue 1: Inventory displayed an error when running profiles: the MIB browser was unable to ping or resolve network connectivity with a device on the same VLAN (note the "Host unreachable" through the virtual IP 10.55.236.191). A related issue was seen when trying to save a custom formula into a formula group; the user configuration screen showed SQLSTATE=42501.

Solution: Grant statement –
GRANT EXECUTE ON FUNCTION "PV_GUI"."SEQUENCE_RU_MULTI"(NUMBER) TO ROLE "PV_GUI_ROLE"

Issue 2: When saving a changed state in Request Editor, an error is displayed.

Solution: Grant access by running
GRANT EXECUTE ON FUNCTION "PV_GUI"."SEQUENCE_DC"(NUMBER) TO ROLE "PV_GUI_ROLE"

Issue 3: Database Information showed a refresh action error. This is an existing issue in TNPM 1.3.3 and has nothing to do with the TSAM or HADR configuration. The user needs to work around this refresh issue by relaunching the ProvisoInfo Browser UI.

Issue 4: During DB failover (triggered by a server reboot), the Datachannel component LDR became unresponsive.

Solution: Appended the following to /etc/fstab to enable automatic mounting of the NFS directory during reboot. However, this means that when 10.55.236.167 is rebooting, the secondary DB needs to wait for that server to be up before it can mount the NFS directory again.
On the secondary DB:

[db2@tnpminlnx0303 db2dump]$ cat /etc/fstab
/dev/VolGroup00/LogVol01          /                   ext3    defaults        1 1
LABEL=/boot                       /boot               ext3    defaults        1 2
tmpfs                             /dev/shm            tmpfs   defaults        0 0
devpts                            /dev/pts            devpts  gid=5,mode=620  0 0
sysfs                             /sys                sysfs   defaults        0 0
proc                              /proc               proc    defaults        0 0
/dev/VolGroup00/LogVol00          swap                swap    defaults        0 0
10.55.236.167:/db2_load/hadr_shr  /db2_load/hadr_shr  nfs     defaults        0 0

Issue 5: Issue 4 can further cause the active and standby DBs to become unsynchronized. This might also be due to a partition being left in a locked state.

Log entries from proviso.log:
V1:129 2014.02.06-00.03.51 UTC LDR.1-20274:4848 STOREDPROCERR GYMDC10107W ADDTRUNCPART - [IBM][CLI Driver][DB2/LINUXX8664] SQL0438N Application raised error or warning with diagnostic text: "Error adding partition". SQLSTATE=UD164
V1:130 2014.02.06-00.03.51 UTC LDR.1-20274:4848 ERROR GYMDC10004F Shutting down the image with error: [IBM][CLI Driver][DB2/LINUXX8664] SQL0438N Application raised error or warning with diagnostic text: "Error adding partition". SQLSTATE=UD164
V1:131 2014.02.06-00.03.51 UTC LDR.1-20274:4848 ERROR GYMDC10004F Walkback written to: /opt/proviso/datachannel/log/walkback-LDR.1-20274-2014.02.06-00.03.51.log
V1:132 2014.02.06-00.03.52 UTC LDR.1-20274:4848 ERROR GYMDC10004F Shutting down the image with error: [IBM][CLI Driver][DB2/LINUXX8664] SQL0438N Application raised error or warning with diagnostic text: "Error adding partition". SQLSTATE=UD164

Run the following to check:
$ db2 list tablespaces | more

Check the affected table space:
Tablespace ID = 71
Name = C01010002014020200
Type = Database managed space
Contents = All permanent data. Large table space.
State = 0x0100
  Detailed explanation: Restore pending

Solution: Rebuild the standby DB and replicate it back from the active DB.

Issue 6: DataView was unable to write user changes to the DB2 database. This was due to a missing PV_LOIS permission. The first exception from the SystemOut.log shows:

PVRcDataUpdateRow getNextSequence() [WebContainer : 1] SQLException
com.ibm.db2.jcc.am.SqlSyntaxErrorException: DB2 SQL Error: SQLCODE=-551, SQLSTATE=42501, SQLERRMC=PV_LOIS;EXECUTE;PV_LOIS.SEQUENCE, DRIVER=4.14.111

This code actually executes "select sequence from DUAL" to get the next index to insert. On 10.55.236.167:

db2 => select sequence from DUAL
SQL0551N "PV_LOIS" does not have the required authorization or privilege to perform operation "EXECUTE" on object "PV_LOIS.SEQUENCE". SQLSTATE=42501

With the required EXECUTE privilege in place, the same query returns the next sequence value:

db2 => select sequence from DUAL
1
-----------------------
100002711
  1 record(s) selected.

Solution:
GRANT EXECUTE ON FUNCTION "PV_LOIS"."SEQUENCE"() TO ROLE "PV_LOIS_ROLE"

5. Test validation done on TSAM-HADR DB2 for TNPM 1.3.3

TSAM version used:
[db2@tnpminlnx0307 ~]$ samlicm -s
Product: IBM Tivoli System Automation for Multiplatform 3.2

The test validation concentrates on DB2 database availability. Data reliability and application downtime/stability are examined. The results do not establish whether the downtime is within an acceptable range, nor do they assess application stability over longer periods (for example, over a month). The following types of failover are documented.
a) The active DB2 database server is failed over manually using a takeover command.
b) The active DB2 database server is failed over due to a server reboot.
c) The active DB2 database server recovers from manually killed db2 process IDs (no failover is triggered).

The following sequence of activities is carried out for each type of failover:

1) Precheck on the TNPM applications:
   - The Datamart, Datachannel, and Dataload applications are running, processing, and collecting data. The latest BOF data is being processed in the LDR /output directory.
   - The PVM UI is able to perform normal provisioning and configuration routines: run Inventory, view current elements/subelements via Resource Editor, toggle Request Editor formulas between active and idle, and so on.
2) Check the TSAM status: run commands such as "lssam" and "db2pd -db PV -hadr", and check the db2diag.log to confirm that no errors in database activities are occurring.
3) Execute the failover type.
4) Postcheck on the TSAM status.
5) Postcheck on the TNPM applications.
6) Determine the downtime of the TNPM applications before they recover back to service, without any restart or initialization by the user.

Precheck on the TNPM applications

Run the following to check the current status of each DB server:

[db2@tnpminlnx0303 ~]$ db2pd -db PV -hadr
Database Member 0 -- Database PV -- Active -- Up 16 days 17:36:30 -- Date 2014-02-06-12.03.01.748667
HADR_ROLE = PRIMARY
REPLAY_TYPE = PHYSICAL
HADR_SYNCMODE = SYNC
STANDBY_ID = 1
LOG_STREAM_ID = 0
HADR_STATE = PEER
PRIMARY_MEMBER_HOST = tnpminlnx0303
PRIMARY_INSTANCE = db2
PRIMARY_MEMBER = 0
STANDBY_MEMBER_HOST = tnpminlnx0307
STANDBY_INSTANCE = db2
STANDBY_MEMBER = 0
HADR_CONNECT_STATUS = CONNECTED
HADR_CONNECT_STATUS_TIME = 02/05/2014 12:28:29.531645 (1391583509)

[db2@tnpminlnx0307 ~]$ db2pd -db PV -hadr
Database Member 0 -- Database PV -- Standby -- Up 0 days 23:34:40 -- Date 2014-02-06-12.03.04.107670
HADR_ROLE = STANDBY
REPLAY_TYPE = PHYSICAL
HADR_SYNCMODE = SYNC
STANDBY_ID = 0
LOG_STREAM_ID = 0
HADR_STATE = PEER
PRIMARY_MEMBER_HOST = tnpminlnx0303
PRIMARY_INSTANCE = db2
PRIMARY_MEMBER = 0
STANDBY_MEMBER_HOST = tnpminlnx0307
STANDBY_INSTANCE = db2
STANDBY_MEMBER = 0
HADR_CONNECT_STATUS = CONNECTED
HADR_CONNECT_STATUS_TIME = 02/05/2014 12:28:28.955942 (1391583508)

Use "lssam" to cross-check which server is running the active DB. The output shows on which server the resources are Online (the primary) and on which server they are Offline (the standby).

Check that random TNPM application connections to the active DB are working prior to the failover. All of these activities involve communicating with the active DB2 database.

1. PVM GUI is launched - SUCCESS
2. Datachannel components (CNS, LOG, CMGR, AMGR) startup - SUCCESS
3. Inventory run for a profile - SUCCESS
4. Resource Editor able to view elements/subelements - SUCCESS
5. Request Editor able to view/disable/enable formulas - SUCCESS
6. Snapshot of the current data timestamp in LDR

Note: The window header shows that Resource Editor is connected to the DB through the virtual IP 10.55.236.191. No errors were seen in Request Editor.
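The TSAM and HADR status checks above can also be condensed into a quick command-line precheck on either node (a minimal sketch; the grep patterns match fields in the db2pd output shown above):

$ lssam | grep -i online      # quick filter showing which resources are reported Online, and on which node
$ db2pd -db PV -hadr | grep -E 'HADR_ROLE|HADR_STATE|HADR_CONNECT_STATUS'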
dccmd status all

The Datachannel visuals are running.

Last data in the LDR /output; the last timestamp shows 5 February 2014.

Failover Type 1: Failover of the active DB (triggered via the command line) to the standby DB.

Topology: Active DB on tnpminlnx0307 (10.55.236.167), standby DB on tnpminlnx0303 (10.55.236.163), virtual IP 10.55.236.191, NFS mount /db2_load/hadr_shr emulating the storage device, and the TNPM DB client on tnpminlnx0306 (10.55.236.166).

As the db2 instance owner, run the following command on the standby DB server to initiate the takeover:

$ db2 takeover hadr on database pv

The resulting db2diag.log events are captured in the attached file Db2Diag-Feb7-failover-167-to-163.txt (taken from /opt/db2/sqllib/db2dump). The attached file Feb7-167to163.txt contains the output from running "lssam" and "db2pd -db pv -hadr".

Downtime during the manual takeover trigger:
a) Resource Editor - 1-2 minutes
b) Request Editor - 2-3 minutes
c) Inventory profile run - 8-10 minutes. This is because the Datachannel visual processes start to see communication errors and then recover.

No Datachannel, Datamart UI, or Dataload components needed to be restarted or bounced. During recovery, walkback files are generated.

Validate that data is still being collected, processed, and loaded by Datachannel into the new active database.

1. PVM GUI is launched - SUCCESS
2. Datachannel components (CNS, LOG, CMGR, AMGR) startup - SUCCESS
3. Inventory run for a profile - SUCCESS
4. Resource Editor able to view elements/subelements - SUCCESS
5. Request Editor able to view/disable/enable formulas - SUCCESS

Snapshot of the current data timestamp in LDR.

dccmd status does not show any "unresponsive" or "flow control asserted" entries. The Dataload process is still up (check the process ID date). Check the LDR.1 /output.

The contents of the latest BOF data processed in the LDR /output show that the hourly run dates are still consistent, with no interruption from the failover/takeover over the past 6 hours of data collection.

Summary of test results (manual trigger)
a) Data reliability
   - Data collection was not interrupted for the SNMP collector polling at 15-minute intervals. The LDR.1 /output shows consistent data file sizes (the data stream is almost constant). The SNMP collector process (pvmd_3002) remained up and running during the takeover.
b) Disaster recovery type
   - Manual takeover (graceful failover).
   - Recovery/downtime is approximately under 10 minutes. This is determined from "dccmd status" and proviso.log, based on how quickly the application recovers from the lost connection to the active DB and reconnects to the standby. It is also checked from the GUI, based on how quickly application services such as Request Editor, Resource Editor, and the Inventory Tool recover.
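If the original roles need to be restored after this test, the takeover can be repeated in the opposite direction once the pair is back in PEER state (a minimal sketch; run as the db2 instance owner on the node that is now the standby, tnpminlnx0307):

$ db2pd -db PV -hadr | grep HADR_STATE    # wait until the state reports PEER
$ db2 takeover hadr on database pv        # graceful role switch back to this node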
Failover Type 2: Failover of the active DB (triggered by a reboot of the active DB server) to the standby DB.

Topology: Standby DB on tnpminlnx0307 (10.55.236.167), active DB on tnpminlnx0303 (10.55.236.163), virtual IP 10.55.236.191, NFS mount /db2_load/hadr_shr emulating the storage device, and the TNPM DB client on tnpminlnx0306 (10.55.236.166).

Log in to the active DB2 server as root and run the "init 6" command to reboot the server:

$ init 6

The db2diag.log output is captured in the attached file Feb7-failover-163-to-167.txt (taken from /opt/db2/sqllib/db2dump). The attached file Feb7-163to167.txt contains the output from running "lssam" and "db2pd -db pv -hadr".

Downtime during the server reboot trigger:
a) Resource Editor - 10-11 minutes
b) Request Editor - 10-11 minutes
c) Inventory profile run - 15-16 minutes. This is because the Datachannel visual processes start to see communication errors and then recover.

No Datachannel, Datamart UI, or Dataload components needed to be restarted or bounced. PBL (Plan Builder) was unresponsive for 15 minutes before it recovered; walkback files are generated.

Validate that data is still being collected, processed, and loaded by Datachannel into the new active database.

1. PVM GUI is launched - SUCCESS
2. Datachannel components (CNS, LOG, CMGR, AMGR) startup - SUCCESS
3. Inventory run for a profile - SUCCESS
4. Resource Editor able to view elements/subelements - SUCCESS
5. Request Editor able to view/disable/enable formulas - SUCCESS

Snapshot of the current data timestamp in LDR.

The dccmd command is used to trace the recovery of the Datachannel components during the failover. Finally, the PBL component recovers. The Dataload process is still up. Check the LDR.1 /output; note that the hourly loader data has been successfully loaded into DB2.

Summary of test results (reboot)
a) Data reliability
   - Data collection was not interrupted for the SNMP collector polling at 15-minute intervals. The LDR.1 /output shows consistent data file sizes (the data stream is almost constant). The SNMP collector process (pvmd_3002) remained up and running during the takeover.
b) Disaster recovery type
   - Automatic failover.
   - Recovery/downtime is approximately under 16 minutes. This is determined from "dccmd status" and proviso.log, based on how quickly the application recovers from the lost connection to the active DB and reconnects to the standby. It is also checked from the GUI, based on how quickly application services such as Request Editor, Resource Editor, and the Inventory Tool recover.

Failover Type 3: The active DB recovers when db2 process IDs are killed at random.

Topology: Active DB on tnpminlnx0307 (10.55.236.167), standby DB on tnpminlnx0303 (10.55.236.163), virtual IP 10.55.236.191, NFS mount /db2_load/hadr_shr emulating the storage device, and the TNPM DB client on tnpminlnx0306 (10.55.236.166).

The reason for validating this scenario is to confirm that the active DB recovers killed db2 processes on its own, without any failover being triggered.

Log in to the active DB2 server as root and list the running db2 processes:
$ ps -ef | grep db2

Randomly kill some of the db2 process IDs (note that the process date is Feb 6):

$ kill -9 5229 5231 5252 5245 12829

This kills db2sysc, db2ckpwd, db2acd, db2vend, and db2fmp.

Downtime after the DB2 processes are killed:
a) Resource Editor - 2 minutes
b) Request Editor - 4 minutes
c) Inventory profile run - 6 minutes

All Datachannel components recover after 6 minutes, with LDR the last component to restart. DLDR and LDR walkback files are generated.

Validate that data is still being collected, processed, and loaded by Datachannel into the active database.

1. PVM GUI is launched - SUCCESS
2. Datachannel components (CNS, LOG, CMGR, AMGR) startup - SUCCESS
3. Inventory run for a profile - SUCCESS
4. Resource Editor able to view elements/subelements - SUCCESS
5. Request Editor able to view/disable/enable formulas - SUCCESS

Snapshot of the current data timestamp in LDR.

After the recovery, new process IDs are respawned for the db2 processes.

lssam status after recovery.

db2pd -db pv -hadr output after recovery.

Summary of test results (db2 processes killed)
a) Data reliability
   - Data collection was not interrupted for the SNMP collector polling at 15-minute intervals. The LDR.1 /output shows consistent data file sizes (the data stream is almost constant). The SNMP collector process (pvmd_3002) remained up and running.
b) Disaster recovery type
   - No failover is triggered.
   - Recovery/downtime is approximately under 6 minutes. This is determined from "dccmd status" and proviso.log, based on how quickly the application recovers from the lost connection to the active DB. It is also checked from the GUI, based on how quickly application services such as Request Editor, Resource Editor, and the Inventory Tool recover.

Data validation from DataView using Diagnostic View, to observe the data trending.

The CSV for the chart data was exported and compared with the BOF data being processed in the LDR.1 /output. The chart shows the timestamp data trending for 11 and 12 February 2014. A random comparison of the data from DataView and the BOF results is shown below.

Note: The BOF data is collected at GMT+0, while the Diagnostic View chart runs in the IST timezone (GMT+5:30); the difference is 5 hours and 30 minutes.

Metric ID (MID) = 10587 is "IP Out Request", and Resource ID (RID) = 200000202 is tnpmsun1z10.persistent.co.in. The Notepad text file is the result of bofDump; the Excel CSV is from Diagnostic View (for example: 11 Feb 2014, 3.45 am, Value=233.7815).
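To line up a BOF timestamp with the chart's timezone, GNU date on the Linux servers can perform the conversion (a small sketch; the sample timestamp is illustrative):

$ TZ='Asia/Kolkata' date -d '2014-02-10 22:15 UTC'
Tue Feb 11 03:45:00 IST 2014

That is, a sample written at 22:15 GMT appears at 3.45 am IST on the following day in the Diagnostic View chart.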
Another check was done on a different metric, MID = 10585, IP Packets Received, for the same resource (for example: 11 Feb 2014, 6.30 am, Value=280.9591).

The following are taken from a DataView report filtered for the same resource, tnpmsun1z10.

The drill-down DataView report shows metric 10585 (IP Packets Received) and 10587 (IP Out Request).

The exported CSV for MID=10587, IP Out Request (for example: 11 Feb 2014, 3.45 am, Value=233.7815).

The exported CSV for MID=10585, IP Packets Received (for example: 11 Feb 2014, 6.30 am, Value=280.9591).