Tivoli Netcool Performance Manager – High Availability Test for TNPM 1.3.3 Wireline
High Availability for Tivoli Netcool Performance Manager 1.3.3 – Wireline using High Availability Disaster Recovery (HADR)
Document version v1.0
February 11, 2014
1. Introduction
This document describes the test validation done on Tivoli Netcool Performance Manager 1.3.3 Wireline on Red Hat Enterprise Linux (RHEL) in a High Availability (HA) configuration using Tivoli System Automation for Multiplatforms (TSAM) with the High Availability Disaster Recovery (HADR) feature.
This document covers:
 A high-level description of the test environment configuration and setup.
 Issues and challenges encountered.
 Details of the test validation of TSAM-HADR DB2 database redundancy in relation to the TNPM applications.
1.1 Target Audience
This document is intended to provide a limited view of the High Availability testing done for TNPM 1.3.3. It is not intended to capture configuration and setup details, because the product implementation differs for each customer based on their requirements.
1.2 References
 For detailed information on Tivoli Netcool Performance Manager 1.3.3 Wireline, see http://publib.boulder.ibm.com/infocenter/tivihelp/v8r1/topic/com.ibm.tnpm_1.3.3.doc/welcome_tnpm_db2_support.html
 For detailed information on System Automation for Multiplatforms installation and configuration, see http://publib.boulder.ibm.com/infocenter/tivihelp/v3r1/topic/com.ibm.samp.doc_3.2.1/HALICG21.pdf
 For the HADR Redbook, see http://www.redbooks.ibm.com/redbooks/pdfs/sg247363.pdf
 Tivoli Netcool Performance Manager – Wireline HAM Deployment Guide
 To learn how Tivoli Netcool Performance Manager 1.3.3 – Wireline SNMP Dataload High Availability works, see http://publib.boulder.ibm.com/infocenter/tivihelp/v8r1/topic/com.ibm.tnpm_1.3.3.doc/install_db2_support/ctnpm_installguide_chapterusingthehighavailabilitymanager-08-01.html
 For detailed instructions on installing Tivoli Integrated Portal (TIP) in a load-balanced environment, see http://publib.boulder.ibm.com/infocenter/tivihelp/v15r1/index.jsp?topic=/com.ibm.tip.doc/ctip_config_ha_ovw.html
 For detailed reference information on System Automation for Multiplatforms installation and configuration for Linux on TNPM 1.3.1, see https://www-304.ibm.com/software/brandcatalog/ismlibrary/details?catalog.label=1TW10NP58#
2. Prerequisites
This section describes the prerequisites required to design, install, and configure
TNPM for high availability.
2.1 Operating System
Use RHEL 5.9, 64-bit, for Tivoli System Automation for Multiplatforms-based TNPM high availability.
2.2 Prerequisites for Netcool/Proviso
Refer to the following information before you install Netcool/Proviso:
 "Supported Operating Systems and Modules" section in http://publib.boulder.ibm.com/infocenter/tivihelp/v8r1/topic/com.ibm.tnpm_1.3.3.doc/config_recommendations/ctnpm_configrec_guide_db2.html
 "Pre-Installation Setup Tasks" section in http://publib.boulder.ibm.com/infocenter/tivihelp/v8r1/topic/com.ibm.tnpm_1.3.3.doc/install_db2_support/ctnpm_installguide_preinstallationsetuptasks-04-04.html
2.3 Prerequisites for System Automation for Multiplatforms
For details of the prerequisites to install System Automation for Multiplatforms, refer to the "Preparing for installation" section in http://publib.boulder.ibm.com/infocenter/tivihelp/v3r1/topic/com.ibm.samp.doc_3.2.1/HALICG21.pdf
For additional reference, see the white paper on HADR and TSA information pointers: https://www.ibm.com/developerworks/community/blogs/DB2LUWAvailability/entry/hadr_and_tsa_information_pointers?lang=en
2.4 Highly available storage
TNPM high availability requires a highly available shared storage array that protects data integrity. Shared disk storage devices, multi-hosted SCSI storage arrays, and SAN-based devices are generally used in high availability solutions.
3. High Availability Architecture for TNPM
This section describes the TSAM-HADR implementation on the DB2 database server.
3.1 Prerequisites and setup summary
Figure 1
Important HADR links, with procedures to configure and set up the HADR environment (for any issues and concerns, consult the on-site DBA expert):
http://pic.dhe.ibm.com/infocenter/db2luw/v10r1/index.jsp?topic=%2Fcom.ibm.db2.luw.admin.ha.doc%2Fdoc%2Fr0051349.html
http://pic.dhe.ibm.com/infocenter/db2luw/v10r1/index.jsp?topic=%2Fcom.ibm.db2.luw.admin.ha.doc%2Fdoc%2Ft0011725.html
Alternatively, Section 5 of the following document was followed during the HADR configuration (Figure 1). The topology configured was the "Single Network HADR topology", automated using the db2haicu XML mode:
http://public.dhe.ibm.com/software/dw/data/dm-0908hadrdb2haicu/HADR_db2haicu.pdf
Requirements
To set up HADR, you must have the following requirements in place (a quick sanity-check sketch follows this list):
 HADR is a DB2 feature available in all DB2 editions except DB2 Express-C. For the standby and auxiliary standby databases, DB2 licensing depends on whether they are used as hot, warm, or cold standby. For more information, consult an IBM Information Management marketing representative.
 IBM® Tivoli® System Automation for Multiplatforms (SA MP) is integrated with the IBM DB2® server as part of the DB2 High Availability Feature on AIX, Linux, and Solaris operating systems.
 You can choose to install Tivoli SA MP while installing the DB2 server. Check the license terms for using IBM Tivoli SA MP integrated with the IBM DB2 server.
 The operating system on the primary and standby databases must be the same version, including patches. You can violate this rule for a short time during a rolling upgrade, but do so with extreme caution.
 A TCP/IP interface must be available between the HADR host machines.
 The DB2 version and level must be identical on both the primary and the standby databases.
 The DB2 software for both the primary and the standby databases must be the same bit size (32-bit or 64-bit).
 Buffer pool sizes on the primary and the standby should be the same. If you build the standby database by restoring a backup copy of the primary, the buffer pool sizes are the same because this information is included in the database backup. If you are using the reads on standby feature, you must configure the buffer pools on the primary so that the active standby can accommodate log replay and read applications.
 The primary and standby databases must have the same database name, which means they must be in different instances.
 Table spaces must be identical on the primary and standby databases, including:
– Table space type (DMS or SMS)
– Table space size
– Container path
– Container size
– Container file type (raw device or file system)
 The amount of space allocated for log files should also be the same on both the primary and standby databases.
 The database manager and instance profile registry variable settings of the primary and standby should be the same.
 For LOAD with the COPY YES option, if a shared NFS path is used, the ownership of the path must match the instance owner. Ensure that the shared path is accessible from both servers.
 The system clocks of the HADR primary and standby nodes must be synchronized.
 The database parameter CONNECT_PROC should be set to NULL (DB2 10.1 FP1 bug).
 HADR requires the database to use archival logging.
 Set the LOGINDEXBUILD parameter so that index creation, re-creation, and reorganization operations are logged, by running the following command:
db2 update database configuration for sample using LOGINDEXBUILD ON
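As a quick sanity check of the items above (a minimal sketch; the database name pv matches this test setup and should be adjusted as needed), the version, registry, and logging prerequisites can be compared on the primary and standby before HADR is configured:
$ db2level                                        # DB2 version and level must match on both servers
$ db2set -all                                     # registry variables should match on both servers
$ db2 get db cfg for pv | grep -i LOGARCHMETH1    # archival logging must be enabled
$ db2 get db cfg for pv | grep -i LOGINDEXBUILD   # should report ON after the update above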





Creating and mounting the NFS shared directory
In the HADR setup, a db2 instance is created on both the primary and standby servers. The UID and GID of the db2 user must be the same on the primary and standby for the NFS share to work. To check this, try accessing a file on the NFS share from both servers; if the owner and group of the file are set correctly, you are good to go.
$ ls -ltr /db2_load/hadr_shr | tail -1
-rw-r----- 1 db2 db2iadm 8429568 Jan 13 14:34 PV.4.db2.DBPART000.20140113143428.001
The directory /db2_load/hadr_shr is created on the primary (10.55.236.167) and then NFS mounted on the standby (10.55.236.163), as sketched below.
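A minimal sketch of how the UID/GID check and the NFS export/mount can be done; the export options shown here are assumptions for illustration and should be adapted to your environment:
# On both servers: confirm the db2 user has the same UID and GID
$ id db2
# On the primary (10.55.236.167): export the shared directory
$ echo "/db2_load/hadr_shr 10.55.236.163(rw,sync,no_root_squash)" >> /etc/exports
$ exportfs -ra
# On the standby (10.55.236.163): mount the share and verify ownership
$ mkdir -p /db2_load/hadr_shr
$ mount -t nfs 10.55.236.167:/db2_load/hadr_shr /db2_load/hadr_shr
$ ls -ld /db2_load/hadr_shr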
[db2@tnpminlnx0303 milind]$ df -h
Filesystem                        Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup00-LogVol01   128G   48G   74G  40% /
/dev/hda1                         487M   18M  444M   4% /boot
tmpfs                             7.9G   12K  7.9G   1% /dev/shm
10.55.236.167:/db2_load/hadr_shr  128G   34G   87G  28% /db2_load/hadr_shr
All HADR-related messages are logged in the DB2 diagnostic log /opt/db2/sqllib/db2dump/db2diag.log
Automatic failover and enablement
HADR setup settings:
Port number values used in /etc/services for the HADR setup:
db2_hadr_1 50010/tcp
db2_hadr_2 50011/tcp
Primary server: tnpminlnx0307 (IP: 10.55.236.167)
Standby server: tnpminlnx0303 (IP: 10.55.236.163)
Virtual IP: 10.55.236.191
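The HADR database configuration values shown below can be set with db2 update db cfg. A minimal sketch for the primary (tnpminlnx0307), assuming database PV and the service names above; the standby is configured with the local/remote host and service values reversed:
$ db2 update db cfg for pv using HADR_LOCAL_HOST tnpminlnx0307 HADR_LOCAL_SVC db2_hadr_1 HADR_REMOTE_HOST tnpminlnx0303 HADR_REMOTE_SVC db2_hadr_2 HADR_REMOTE_INST db2 HADR_SYNCMODE SYNC HADR_TIMEOUT 60 HADR_PEER_WINDOW 120
# On the standby (tnpminlnx0303), start HADR first:
$ db2 start hadr on db pv as standby
# Then on the primary (tnpminlnx0307):
$ db2 start hadr on db pv as primary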
HADR Setting on tnpminlnx0307
[db2@tnpminlnx0307 milind]$ db2 get db cfg for pv | grep -i hadr
HADR database role = PRIMARY
HADR local host name (HADR_LOCAL_HOST) = tnpminlnx0307
HADR local service name (HADR_LOCAL_SVC) = 50010
HADR remote host name (HADR_REMOTE_HOST) = tnpminlnx0303
HADR remote service name (HADR_REMOTE_SVC) = 50011
HADR instance name of remote server (HADR_REMOTE_INST) = db2
HADR timeout value (HADR_TIMEOUT) = 60
HADR target list (HADR_TARGET_LIST) =
HADR log write synchronization mode (HADR_SYNCMODE) = SYNC
HADR spool log data limit (4KB) (HADR_SPOOL_LIMIT) = 0
HADR log replay delay (seconds) (HADR_REPLAY_DELAY) = 0
HADR peer window duration (seconds) (HADR_PEER_WINDOW) = 120
[db2@tnpminlnx0307 milind]$ db2set -all
[i] DB2_DEFERRED_PREPARE_SEMANTICS=YES
[i] DB2_COMPATIBILITY_VECTOR=ORA
[i] DB2_LOAD_COPY_NO_OVERRIDE=COPY YES to /db2_load/hadr_shr
[i] DB2COMM=TCPIP
[i] DB2AUTOSTART=YES
[g] DB2SYSTEM=tnpminlnx0307.persistent.co.in
[g] DB2INSTDEF=db2
[g] DB2ADMINSERVER=dasusr1
HADR Setting on tnpminlnx0303
[db2@tnpminlnx0303 milind]$ db2 get db cfg for pv | grep -i hadr
HADR database role = STANDBY
HADR local host name (HADR_LOCAL_HOST) = tnpminlnx0303
HADR local service name (HADR_LOCAL_SVC) = 50011
HADR remote host name (HADR_REMOTE_HOST) = tnpminlnx0307
HADR remote service name (HADR_REMOTE_SVC) = 50010
HADR instance name of remote server (HADR_REMOTE_INST) = db2
HADR timeout value (HADR_TIMEOUT) = 60
HADR target list (HADR_TARGET_LIST) =
HADR log write synchronization mode (HADR_SYNCMODE) = SYNC
HADR spool log data limit (4KB) (HADR_SPOOL_LIMIT) = 0
HADR log replay delay (seconds) (HADR_REPLAY_DELAY) = 0
HADR peer window duration (seconds) (HADR_PEER_WINDOW) = 120
[db2@tnpminlnx0303 milind]$ db2set -all
[i] DB2_DEFERRED_PREPARE_SEMANTICS=YES
[i] DB2_COMPATIBILITY_VECTOR=ORA
[i] DB2_LOAD_COPY_NO_OVERRIDE=COPY YES to /db2_load/hadr_shr
[i] DB2COMM=TCPIP
[i] DB2AUTOSTART=YES
[g] DB2SYSTEM=tnpminlnx0303.persistent.co.in
[g] DB2INSTDEF=db2
Note: In this HADR setup, the standby database remains in rollforward pending state.
Configuring HADR for automatic failover by using the Tivoli SA MP cluster manager
Run on both servers:
$ preprpnode tnpminlnx0307 tnpminlnx0303
Configuration setting files:
/opt/db2/milind/167.xml
/opt/db2/milind/163.xml
On the standby:
$ db2haicu -f /opt/db2/milind/163.xml
On the primary:
$ db2haicu -f /opt/db2/milind/167.xml
The following fields are edited accordingly; for example, 167.xml is shown below. In the HADRDBSet section, the "localHost" and "remoteHost" values need to be updated accordingly. 10.55.236.191 is the virtual IP used.
<PhysicalNetwork physicalNetworkName="db2_public_network_0" physicalNetworkProtocol="ip">
<Interface interfaceName="eth0" clusterNodeName="tnpminlnx0307">
<IPAddress baseAddress="10.55.236.167" subnetMask="255.255.252.0"
networkName="db2_public_network_0"/>
</Interface>
<Interface interfaceName="eth0" clusterNodeName="tnpminlnx0303">
<IPAddress baseAddress="10.55.236.163" subnetMask="255.255.252.0"
networkName="db2_public_network_0"/>
</Interface>
</PhysicalNetwork>
<ClusterNode clusterNodeName="tnpminlnx0307"/>
<ClusterNode clusterNodeName="tnpminlnx0303"/>
<DB2Partition dbpartitionnum="0" instanceName="db2">
</DB2Partition>
<HADRDBSet>
<HADRDB databaseName="PV" localInstance="db2" remoteInstance="db2"
localHost="tnpminlnx0307" remoteHost="tnpminlnx0303" />
<VirtualIPAddress baseAddress="10.55.236.191" subnetMask="255.255.252.0"
networkName="db2_public_network_0"/>
</HADRDBSet>
Automatic client reroute setting with virtual IP
Run on both servers:
$ db2 update alternate server for database pv using hostname 10.55.236.191 port 50000
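After the db2haicu runs and the alternate server update, the setup can be checked with the following commands (a sketch; the exact output varies by environment):
$ lssam                  # the DB2 instance, HADR, and virtual IP resource groups should be listed
$ db2pd -db pv -hadr     # HADR_STATE should report PEER
$ db2 list db directory  # the PV entry should show alternate server hostname 10.55.236.191 and port 50000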
3.2 Updating the required db2auth for TNPM
The db2auth settings need to be examined in TNPM 1.3.3 so that the proper access is granted to the TNPM users (PV_ADMIN, PV_LOIS, PV_GUI, and so on) and failover is properly set up and configured. Section 4, "Issues and Challenges", describes some of the related troubleshooting.
These authorization changes were implemented after the TNPM standalone installation, when the redundant DB2 database was attached to the standalone environment and TSAM-HADR was configured.
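As an illustration of the kind of check and fix that was applied (a sketch; the catalog query is generic DB2 and the role names follow the TNPM schema used in section 4):
$ db2 connect to pv
# List EXECUTE privileges already held by the TNPM roles/users
$ db2 "SELECT GRANTEE, SCHEMA, SPECIFICNAME, EXECUTEAUTH FROM SYSCAT.ROUTINEAUTH WHERE GRANTEE LIKE 'PV_%'"
# Grant a missing privilege, following the pattern used for the issues below
$ db2 "GRANT EXECUTE ON FUNCTION PV_GUI.SEQUENCE_RU_MULTI(NUMBER) TO ROLE PV_GUI_ROLE"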
4. Issues and Challenges
Issue 1:
Inventory displayed an error when running profiles: the MIB browser was unable to ping or resolve network connectivity with a device on the same VLAN. Note the "Host unreachable" error when going through the virtual IP 10.55.236.191.
The following issue was also seen when trying to save a custom formula into a formula group: the user configuration screen showed SQLSTATE=42501.
Solution:
Run the following grant statement:
GRANT EXECUTE ON FUNCTION "PV_GUI"."SEQUENCE_RU_MULTI"(NUMBER) TO ROLE "PV_GUI_ROLE"
Issue 2:
When saving a changed state in Request Editor, the following error is displayed.
Solution:
Grant execute access:
GRANT EXECUTE ON FUNCTION "PV_GUI"."SEQUENCE_DC"(NUMBER) TO ROLE "PV_GUI_ROLE"
Issue 3:
Database Information showed a refresh action error. This is an existing issue in TNPM 1.3.3 and has nothing to do with the TSAM or HADR configuration. Users need to work around this refresh issue by relaunching the ProvisoInfo Browser UI.
Issue 4:
During DB failover (triggered by a server reboot), the Datachannel component LDR became unresponsive.
Solution:
Append the following to /etc/fstab to enable automatic mounting of the NFS directory during reboot. However, this means that while 10.55.236.167 is rebooting, the secondary DB has to wait for that server to come back up before it can mount the NFS directory again.
On the secondary DB:
[db2@tnpminlnx0303 db2dump]$ cat /etc/fstab
/dev/VolGroup00/LogVol01          /                    ext3    defaults        1 1
LABEL=/boot                       /boot                ext3    defaults        1 2
tmpfs                             /dev/shm             tmpfs   defaults        0 0
devpts                            /dev/pts             devpts  gid=5,mode=620  0 0
sysfs                             /sys                 sysfs   defaults        0 0
proc                              /proc                proc    defaults        0 0
/dev/VolGroup00/LogVol00          swap                 swap    defaults        0 0
10.55.236.167:/db2_load/hadr_shr  /db2_load/hadr_shr   nfs     defaults        0 0
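After editing /etc/fstab, the entry can be verified without waiting for a reboot (a minimal check sketch):
$ mount -a                       # mounts everything in /etc/fstab that is not already mounted
$ df -h /db2_load/hadr_shr       # confirm the share is mounted from 10.55.236.167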
Issue 5:
Issue 4 can further cause the active and standby DBs to become unsynchronized. This might also be due to a partition being left in a locked state. Logs from proviso.log:
V1:129 2014.02.06-00.03.51 UTC LDR.1-20274:4848 STOREDPROCERR GYMDC10107W ADDTRUNCPART - [IBM][CLI
Driver][DB2/LINUXX8664] SQL0438N Application raised error or warning with diagnostic text: "Error adding partition".
SQLSTATE=UD164
V1:130 2014.02.06-00.03.51 UTC LDR.1-20274:4848 ERROR GYMDC10004F Shutting down the image with error:
[IBM][CLI Driver][DB2/LINUXX8664] SQL0438N Application raised error or warning with diagnostic text: "Error adding
partition". SQLSTATE=UD164
V1:131 2014.02.06-00.03.51 UTC LDR.1-20274:4848 ERROR GYMDC10004F Walkback written to:
/opt/proviso/datachannel/log/walkback-LDR.1-20274-2014.02.06-00.03.51.log
V1:132 2014.02.06-00.03.52 UTC LDR.1-20274:4848 ERROR GYMDC10004F Shutting down the image with error:
[IBM][CLI Driver][DB2/LINUXX8664] SQL0438N Application raised error or warning with diagnostic text: "Error adding
partition". SQLSTATE=UD164
Run the following to check:
$ db2 list tablespaces | more
Check for a tablespace left in a pending state, for example:
Tablespace ID        = 71
Name                 = C01010002014020200
Type                 = Database managed space
Contents             = All permanent data. Large table space.
State                = 0x0100
  Detailed explanation:
    Restore pending
Solution: Rebuild the standby DB and resynchronize it with the active DB, as sketched below.
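One way to rebuild the standby and re-establish HADR is sketched here, assuming an online backup of PV taken on the current primary and restored on the standby (paths and the timestamp are illustrative):
# On the primary:
$ db2 backup db pv online to /db2_load/hadr_shr include logs
# On the standby, using the timestamp reported by the backup:
$ db2 restore db pv from /db2_load/hadr_shr taken at <timestamp> replace history file
$ db2 start hadr on db pv as standby
# Back on the primary:
$ db2 start hadr on db pv as primary
$ db2pd -db pv -hadr      # wait until HADR_STATE reports PEER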
Issue 6:
DataView was unable to write user changes to the DB2 database. This was due to a missing PV_LOIS permission. The first exception in SystemOut.log shows:
PVRcDataUpdateRow getNextSequence() [WebContainer : 1] SQLException com.ibm.db2.jcc.am.SqlSyntaxErrorException: DB2 SQL Error: SQLCODE=-551, SQLSTATE=42501, SQLERRMC=PV_LOIS;EXECUTE;PV_LOIS.SEQUENCE, DRIVER=4.14.111
This call actually executes "select sequence from DUAL" to get the next index to insert. On 10.55.236.167, the query initially fails:
db2 => select sequence from DUAL
SQL0551N "PV_LOIS" does not have the required authorization or privilege to perform operation "EXECUTE" on object "PV_LOIS.SEQUENCE". SQLSTATE=42501
Once the privilege is granted, the same query succeeds:
db2 => select sequence from DUAL
1
-----------
  100002711
1 record(s) selected.
Solution: GRANT EXECUTE ON FUNCTION "PV_LOIS"."SEQUENCE"() TO ROLE "PV_LOIS_ROLE"
5. Test validation done on TSAM-HADR DB2 for TNPM 1.3.3
TSAM Version used
[db2@tnpminlnx0307 ~]$ samlicm -s
Product: IBM Tivoli System Automation for Multiplatform 3.2
The test validation concentrates on DB2 database availability. Data reliability and application downtime/stability are examined.
The results do not conclude whether the downtime is in an acceptable range, nor do they establish application stability over longer periods (for example, over a month).
The following types of failover are documented:
a) The active DB2 database server is failed over manually by using a takeover command.
b) The active DB2 database server is failed over because of a server reboot.
c) The active DB2 database server recovers after db2 process IDs are killed manually (no failover is triggered).
The following sequence of activities is carried out for each type of failover:
1) Precheck on the TNPM applications
- The Datamart, Datachannel, and Dataload applications are running, processing, and collecting data. The latest BOF data is being processed in the LDR /output directory.
- The PVM UI is able to perform normal provisioning and configuration routines: run Inventory, view current subelements/elements via Resource Editor, toggle Request Editor formulas between active and idle, and so on.
2) Check the TSAM status
- Run commands such as "lssam" and "db2pd -db PV -hadr", and check the db2diag.log to confirm that no database errors are occurring (a check sketch follows this list).
3) Execute the failover type.
4) Postcheck the TSAM status.
5) Postcheck the TNPM applications.
6) Determine the downtime before the TNPM applications recover back to service, without any "restart" or "initialization" by the user.
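A minimal sketch of the commands used for the pre- and post-checks in steps 1, 2, 4, and 5 (the paths and host roles follow this test setup and may differ in other environments):
# On the DB servers, as the db2 user:
$ lssam                                                 # cluster resource group states
$ db2pd -db PV -hadr                                    # HADR role and state (expect PEER)
$ tail -f /opt/db2/sqllib/db2dump/db2diag.log           # watch for HADR or database errors
# On the Datachannel server:
$ dccmd status all                                      # all Datachannel components should be running
$ tail -f /opt/proviso/datachannel/log/proviso.log      # watch for connection errors during failover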
Precheck on the TNPM applications
Run the following to check the current status of each DB server:
[db2@tnpminlnx0303 ~]$ db2pd -db PV -hadr
Database Member 0 -- Database PV -- Active -- Up 16 days 17:36:30 -- Date 2014-02-06-12.03.01.748667
HADR_ROLE = PRIMARY
REPLAY_TYPE = PHYSICAL
HADR_SYNCMODE = SYNC
STANDBY_ID = 1
LOG_STREAM_ID = 0
HADR_STATE = PEER
PRIMARY_MEMBER_HOST = tnpminlnx0303
PRIMARY_INSTANCE = db2
PRIMARY_MEMBER = 0
STANDBY_MEMBER_HOST = tnpminlnx0307
STANDBY_INSTANCE = db2
STANDBY_MEMBER = 0
HADR_CONNECT_STATUS = CONNECTED
HADR_CONNECT_STATUS_TIME = 02/05/2014 12:28:29.531645 (1391583509)
[db2@tnpminlnx0307 ~]$ db2pd -db PV -hadr
Database Member 0 -- Database PV -- Standby -- Up 0 days 23:34:40 -- Date 2014-02-06-12.03.04.107670
HADR_ROLE = STANDBY
REPLAY_TYPE = PHYSICAL
HADR_SYNCMODE = SYNC
STANDBY_ID = 0
LOG_STREAM_ID = 0
HADR_STATE = PEER
PRIMARY_MEMBER_HOST = tnpminlnx0303
PRIMARY_INSTANCE = db2
PRIMARY_MEMBER = 0
STANDBY_MEMBER_HOST = tnpminlnx0307
STANDBY_INSTANCE = db2
STANDBY_MEMBER = 0
HADR_CONNECT_STATUS = CONNECTED
HADR_CONNECT_STATUS_TIME = 02/05/2014 12:28:28.955942 (1391583508)
Use "lssam" to check which DB is currently active. The output shows which server's HADR resource is Online (the active/primary DB) and which is Offline (the standby).
Check that TNPM application connections to the active DB are working prior to the failover. All of the following activities involve communicating with the active DB2:
1. PVM GUI is launched - SUCCESS
2. Datachannel components (CNS, LOG, CMGR, AMGR) startup - SUCCESS
3. Inventory run for a profile - SUCCESS
4. Resource Editor able to view elements/subelements - SUCCESS
5. Request Editor able to view/disable/enable formulas - SUCCESS
6. Snapshot of the current data timestamp in LDR
Note: The window title bar shows that Resource Editor is connected to the DB via the virtual IP 10.55.236.191.
No errors are seen in Request Editor.
Output of "dccmd status all":
The Datachannel visual processes are running.
Last data in the LDR /output directory; the last timestamp shows 5 Feb 2014.
Failover Type 1: Failover of the active DB (triggered via the command line) to the standby DB.
[Topology for this test: tnpminlnx0307 (10.55.236.167) is the active DB and tnpminlnx0303 (10.55.236.163) is the standby DB; both use the NFS mount /db2_load/hadr_shr (emulating a storage device). The TNPM DB client (tnpminlnx0306, IP 10.55.236.166) connects through the virtual IP 10.55.236.191.]
As the db2 user, run the following command at the DB2 prompt on the standby DB server to initiate the takeover:
$ db2 takeover hadr on database pv
The resulting db2diag.log events and status output were captured in the following files:
Db2Diag-Feb7-failover.txt and Db2Diag-Feb7-failover-167-to-163.txt (from /opt/db2/sqllib/db2dump)
Feb7-167to163.txt (output from "lssam" and "db2pd -db pv -hadr")
Downtime during the manual "takeover" trigger:
a) Resource Editor - 1-2 minutes
b) Request Editor - 2-3 minutes
c) Inventory profile run - 8-10 minutes
This is because the Datachannel visual processes start to report communication errors and then recover.
No Datachannel, Datamart UI, or Dataload components needed to be restarted or bounced.
During recovery, walkback files are generated.
Validate that data is still being collected, processed, and loaded by Datachannel into the new active database:
1. PVM GUI is launched - SUCCESS
2. Datachannel components (CNS, LOG, CMGR, AMGR) startup - SUCCESS
3. Inventory run for a profile - SUCCESS
4. Resource Editor able to view elements/subelements - SUCCESS
5. Request Editor able to view/disable/enable formulas - SUCCESS
6. Snapshot of the current data timestamp in LDR
"dccmd status" does not show any "unresponsive" or "flow control asserted" components.
The Dataload process is still up (check the process ID start date).
Check the LDR.1 /output directory.
The contents of the latest BOF data processed in the LDR /output directory.
Note that the hourly run dates are still consistent, with no interruption from the failover/takeover during the past 6 hours of data collection.
Summary of test results (manual trigger)
a) Data reliability
- Data collection is not interrupted for the SNMP collector with 15-minute polling. LDR.1 /output shows a consistent data file size (as the data stream is almost constant).
- The SNMP collector process (pvmd_3002) stays up and running during the takeover.
b) Disaster recovery type
- Manual takeover (graceful failover)
- Recovery/downtime is approximately <10 minutes.
This is determined from "dccmd status" and proviso.log, based on how fast the application recovers from losing the connection to the active DB and reconnecting to the standby DB. It is also checked from the GUI, based on how fast application services such as Request Editor, Resource Editor, and the Inventory tool recover.
Failover Type 2: Failover of the active DB (triggered by rebooting the active DB server) to the standby DB.
[Topology for this test: tnpminlnx0303 (10.55.236.163) is now the active DB and tnpminlnx0307 (10.55.236.167) is the standby DB; both use the NFS mount /db2_load/hadr_shr (emulating a storage device). The TNPM DB client (tnpminlnx0306, IP 10.55.236.166) connects through the virtual IP 10.55.236.191.]
Log in to the active DB2 server as root and run "init 6" to reboot the server:
$ init 6
The resulting db2diag.log events and status output were captured in the following files:
Db2Diag-Feb7-failover.txt (from /opt/db2/sqllib/db2dump)
Feb7-163to167.txt and Feb7-failover-163-to-167.txt (output from "lssam" and "db2pd -db pv -hadr")
Downtime during the server reboot trigger:
a) Resource Editor - 10-11 minutes
b) Request Editor - 10-11 minutes
c) Inventory profile run - 15-16 minutes
This is because the Datachannel visual processes start to report communication errors and then recover.
No Datachannel, Datamart UI, or Dataload components needed to be restarted or bounced.
PBL (Plan Builder) was unresponsive for 15 minutes before it recovered.
Walkback files are generated.
Validate that data is still being collected, processed, and loaded by Datachannel into the new active database:
1. PVM GUI is launched - SUCCESS
2. Datachannel components (CNS, LOG, CMGR, AMGR) startup - SUCCESS
3. Inventory run for a profile - SUCCESS
4. Resource Editor able to view elements/subelements - SUCCESS
5. Request Editor able to view/disable/enable formulas - SUCCESS
6. Snapshot of the current data timestamp in LDR
The dccmd command is used to trace the recovery of the Datachannel components during the failover.
Finally, the PBL component recovers.
The Dataload process is still up.
Check the LDR.1 /output directory.
Note that the hourly loader data has been successfully loaded into DB2.
Summary of test results (reboot)
a) Data reliability
- Data collection is not interrupted for the SNMP collector with 15-minute polling. LDR.1 /output shows a consistent data file size (as the data stream is almost constant).
- The SNMP collector process (pvmd_3002) stays up and running during the failover.
b) Disaster recovery type
- Automatic failover
- Recovery/downtime is approximately <16 minutes.
This is determined from "dccmd status" and proviso.log, based on how fast the application recovers from losing the connection to the active DB and reconnecting to the standby DB. It is also checked from the GUI, based on how fast application services such as Request Editor, Resource Editor, and the Inventory tool recover.
Failover Type 3: The active DB recovers when db2 process IDs are randomly killed.
[Topology for this test: tnpminlnx0307 (10.55.236.167) is the active DB and tnpminlnx0303 (10.55.236.163) is the standby DB; both use the NFS mount /db2_load/hadr_shr (emulating a storage device). The TNPM DB client (tnpminlnx0306, IP 10.55.236.166) connects through the virtual IP 10.55.236.191.]
The reason for validating this is to confirm that the active DB recovers the killed db2 processes on its own, without any failover being triggered.
Log in to the active DB2 server as root and list the db2 processes:
$ ps -ef | grep db2
Randomly kill some of the db2 process IDs (note that the process start date is Feb 6):
$ kill -9 5229 5231 5252 5245 12829
This kills db2sysc, db2ckpwd, db2acd, db2vend, and db2fmp.
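To watch the killed engine processes being respawned (a simple check sketch; the process names are the ones listed above):
$ watch -n 5 "ps -ef | grep -E 'db2sysc|db2acd|db2fmp' | grep -v grep"   # new PIDs with a current start time indicate respawn
$ db2pd -db pv -hadr                                                     # confirm the database is still PRIMARY and in PEER state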
Downtime after the DB2 processes are killed:
a) Resource Editor - 2 minutes
b) Request Editor - 4 minutes
c) Inventory profile run - 6 minutes
All Datachannel components recover after 6 minutes, with LDR the last component to restart.
DLDR and LDR walkback files are generated.
Validate that data is still being collected, processed, and loaded by Datachannel into the active database:
1. PVM GUI is launched - SUCCESS
2. Datachannel components (CNS, LOG, CMGR, AMGR) startup - SUCCESS
3. Inventory run for a profile - SUCCESS
4. Resource Editor able to view elements/subelements - SUCCESS
5. Request Editor able to view/disable/enable formulas - SUCCESS
6. Snapshot of the current data timestamp in LDR
After the recovery, new process IDs are respawned for the killed db2 processes.
lssam status after recovery:
db2pd -db pv -hadr
Summary of test results (db2 processes killed)
a) Data reliability
- Data collection is not interrupted for the SNMP collector with 15-minute polling. LDR.1 /output shows a consistent data file size (as the data stream is almost constant).
- The SNMP collector process (pvmd_3002) stays up and running during the recovery.
b) Disaster recovery type
- No failover is triggered.
- Recovery/downtime is approximately <6 minutes.
This is determined from "dccmd status" and proviso.log, based on how fast the application recovers from losing the connection to the active DB. It is also checked from the GUI, based on how fast application services such as Request Editor, Resource Editor, and the Inventory tool recover.
Data validation from DataView using Diagnostic View, to observe the data trending.
The CSV for the above chart data was exported and compared with the BOF data processed in the LDR.1 /output directory.
The chart shows the timestamped data trending for 11 and 12 Feb 2014.
A random comparison of the data from DataView and the BOF results is shown below.
Note: The BOF data is collected at GMT+0 while the DataView chart runs in the IST timezone (GMT+5:30); the difference is 5 hours and 30 minutes. Metric ID (MID) 10587 is "IP Out Request", and Resource ID (RID) 200000202 is tnpmsun1z10.persistent.co.in.
The Notepad text file is the result from bofDump; the Excel CSV is from Diagnostic View (for example: 11 Feb 2014, 3.45 am, Value=233.7815).
Another check on a different metric, MID 10585 (IP Packets Received), for the same resource (for example: 11 Feb 2014, 6.30 am, Value=280.9591).
The following are taken from a DataView report, filtered for the same resource (tnpmsun1z10).
The drill-down DataView report shows metric 10585 (IP Packets Received) and 10587 (IP Out Request).
The exported CSV for MID 10587, IP Out Request (for example: 11 Feb 2014, 3.45 am, Value=233.7815).
The exported CSV for MID 10585, IP Packets Received (for example: 11 Feb 2014, 6.30 am, Value=280.9591).