VCS (Veritas Cluster Server) Beginners
Lesson – Cluster Membership & I/O Fencing
by Ramdev · Published August 19, 2011 · Updated July 23, 2016
The current members of the cluster are the systems that are actively participating in the
cluster. It is critical for HAD to accurately determine current cluster membership in
order to take corrective action on system failure and maintain overall cluster topology.
A change in cluster membership is one of the starting points of the logic that determines
whether HAD needs to perform any fault handling in the cluster. There are two aspects to
cluster membership: the initial joining of the cluster, and how membership is determined once
the cluster is up and running.
Before going to the actual topic, I would like to talk about one general point that is useful for
every learning mind.
Every day we deal with many different technologies that were built by some brilliant human
minds to provide solutions to problems that existed before those technologies were introduced.
During the phase of learning, if our focus is just on the features of the technology without
making any attempt to investigate the reason behind the existence of those features, we will
always end up with half knowledge. It is always wise to spend some time understanding the
history of any technology once you choose to master it.
Initial joining of systems to cluster membership
When the cluster initially boots, LLT determines which systems are sending heartbeat signals, and
passes that information to GAB. GAB uses this information in the process of seeding the cluster
membership.
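As a quick way to see this stage in practice, the following commands give a minimal sketch (exact output varies by platform and VCS version) of the LLT heartbeat view and the resulting GAB membership on a node:
# lltstat -nvv | more     (lists every configured node and the state of each LLT link to it)
# gabconfig -a            (shows GAB port memberships)
Sample gabconfig -a output on a seeded two node cluster would look roughly like this (generation numbers will differ):
GAB Port Memberships
===============================
Port a gen a36e0003 membership 01
Port b gen a36e0006 membership 01
Port h gen a36e0009 membership 01
Port a is GAB itself, port b is the fencing module, and port h is HAD.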
Seeding a Cluster
Seeding a new cluster is nothing but ensuring that a new cluster starts up with the correct
number of cluster nodes configured, just to avoid a single cluster starting as multiple subclusters.
Cluster seeding happens as follows:


When the cluster initially boots, all the nodes are in the unseeded state.
GAB on each system checks the total number of systems configured in /etc/gabtab with the entry
/sbin/gabconfig -c -nx
(x is replaced with the total number of cluster nodes; see the sample /etc/gabtab below).
When GAB on each system detects that the correct number of systems is running, based on the
number declared in /etc/gabtab and input from LLT, it seeds.
HAD then starts on each seeded system. HAD will only run on a system that has seeded.
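For example, on a two node cluster /etc/gabtab would typically contain a single line (a sketch; the node count must match your own cluster):
/sbin/gabconfig -c -n2
Once both nodes are booted and exchanging LLT heartbeats, gabconfig -a on either node should show port a with membership 01, indicating the cluster has seeded.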
Manual seeding of a Cluster Node:
Manual seeding of a cluster node is not a recommended option unless the system administrator is
sure about the consequences. It is required only in rare situations, for example when a cluster
node is down for maintenance while the rest of the cluster boots.
Before seeding the cluster manually, make sure that the remaining nodes are able to send and
receive cluster heartbeats to each other successfully. This is important to avoid a possible
cluster network partition when the new cluster node joins later.
The command used to seed the cluster nodes manually is
# /sbin/gabconfig -c -x
This will seed all the nodes in communication with the node where this command is run.
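A hedged example of the manual seeding workflow, assuming node-1 is the node that is down for maintenance:
# lltstat -nvv            (confirm the surviving nodes see each other's LLT links as UP)
# /sbin/gabconfig -c -x   (force seeding on the nodes that are currently communicating)
# gabconfig -a            (verify that port a membership now lists the seeded nodes)
Run this only after confirming the missing node is really down; otherwise you risk the network partition described above.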
Ongoing cluster membership
Once the cluster is up and running, a system remains an active member of the cluster as long as
peer systems receive a heartbeat signal from that system over the cluster interconnect. A change
in cluster membership is determined as follows:
When LLT on a system no longer receives heartbeat messages from a system on any of the
configured LLT interfaces for a predefined time, LLT informs GAB of the heartbeat loss from
that specific system. This predefined time is 16 seconds by default, but it can be tuned with the
set-timer peerinact directive described in the llttab manual page (see the sample /etc/llttab after this list).
When LLT informs GAB of a heartbeat loss, the systems that remain in the cluster coordinate to
agree on which systems are still actively participating in the cluster and which are not. This
happens during a time period known as the GAB Stable Timeout (5 seconds by default). VCS has
specific error handling that takes effect in case the systems do not agree.
GAB marks the failed system as DOWN, excludes it from the cluster membership, and delivers
the membership change to the fencing module.
The fencing module performs membership arbitration to ensure that there is no split brain
situation and that only one functional, cohesive cluster continues to run.
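As an illustration, a typical /etc/llttab sets the node identity, the cluster ID, the heartbeat links, and optionally the peer-inactive timer. This is only a sketch: the node name, cluster number, and device names are assumptions for the example. The peerinact value is in hundredths of a second, so 1600 corresponds to the 16-second default:
set-node node-1
set-cluster 100
link bge1 /dev/bge:1 - ether - -
link bge2 /dev/bge:2 - ether - -
set-timer peerinact:1600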
We will discuss all of the above points in detail in this post.
The diagram below explains how data access to the shared resources happens during regular
functioning of the cluster. Once the cluster is properly seeded and configured with high-priority
as well as low-priority cluster interconnects, it starts functioning in the expected manner.
In the diagram you can see two cluster nodes, node-1 and node-2, interconnected with LLT
heartbeat links and each running a copy of HAD (the VCS engine).
HAD, together with GAB and LLT, makes sure that each node accesses the shared resources in
a controlled manner so that there is no conflict in access.
Whenever there is a node failure in the cluster, VCS automatically fails over the service groups
and resources from the failed node to a working node of the cluster.
Like any other technology, VCS also had challenges in dealing with some exceptional situations,
such as trouble with the cluster interconnects, a cluster node, or HAD, as opposed to an actual
cluster node failure. The problem in these scenarios is that VCS cannot differentiate a cluster
node failure from a cluster interconnect or HAD failure unless there is a logical solution prepared for it.
The brilliant minds behind VCS initially came up with the following two solutions to deal with
the two scenarios below.
Scenario 1: The cluster interconnects fail one by one, leaving only the last interconnect working
In the ideal case, whenever LLT on a system no longer receives heartbeat messages from another
system on any of the configured LLT interfaces, GAB reports a change in membership to the VCS
engine.
When a cluster node has trouble with its interconnects and has only one interconnect link
remaining to the cluster, GAB can no longer reliably discriminate between loss of a system and
loss of the network. The reliability of that system's membership is considered at risk. In this
situation, a special membership category called jeopardy membership is assigned to the cluster
node that is left with a single cluster interconnect.
When a system is placed in jeopardy membership status, two actions occur:
Service groups running on the system are placed in the autodisabled state. A service group in
the autodisabled state may fail over on a resource or group fault, but it cannot fail over on a
system fault until the autodisabled flag is manually cleared by the administrator (see the example
commands after this list).
VCS operates the system as a single-system cluster. Other systems in the cluster are
partitioned off in a separate cluster membership.
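A minimal sketch of what the administrator typically does in this situation (the group name appsg and the system name node-1 are placeholders):
# hastatus -sum                         (the affected group shows as AutoDisabled on the jeopardy node)
# hagrp -autoenable appsg -sys node-1   (clears the autodisabled flag once you are sure node-1 is healthy)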
Scenario 2: The HAD daemon fails on one cluster node
Daemon Down Node Alive (DDNA) is a condition in which the VCS high availability daemon
(HAD) on a node fails, but the node is running. When HAD fails, the hashadow process tries to
bring HAD up again. If the hashadow process succeeds in bringing HAD up, the system leaves the
DDNA membership and joins the regular membership.
In a DDNA condition, VCS does not have information about the state of service groups on the
node. So, VCS places all service groups that were online on the affected node in the autodisabled
state. The service groups that were online on the node cannot fail over.
Manual intervention is required to enable failover of autodisabled service groups. The
administrator must release the resources running on the affected node, clear resource faults, and
bring the service groups online on another node.
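A hedged outline of that manual intervention, again with placeholder names (appsg, node-1, node-2):
# hagrp -autoenable appsg -sys node-1   (allow the autodisabled group to be acted on again)
# hagrp -clear appsg -sys node-1        (clear any faulted resources left behind)
# hagrp -online appsg -sys node-2       (bring the group online on a healthy node)
Before doing this, make sure the underlying resources (disk groups, IP addresses, application processes) have really been released on the affected node.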
The above two solutions help VCS deal with most of the problems with the cluster interconnects
and HAD, but there is one really challenging scenario where they do not work, and a better
solution is needed. Of course, the VCS minds offered an effective solution for that as well.
Let us first discuss the problem, and then we will go to the solution.
Scenario 3: All cluster interconnects fail at the same time, and the cluster splits into multiple
subclusters
As we discussed earlier, HAD (the VCS engine) is the brain of the cluster, and each node of the
cluster runs one copy of HAD loaded into its memory. This VCS engine makes all the cluster
nodes work together under predefined rules to access the shared resources and provide high
availability to the applications.
When a cluster node disconnects from the main cluster because all of its cluster interconnects
fail at the same time, it forms a subcluster, and the copy of HAD running in its memory starts
acting like a second brain of the cluster. This second brain (the HAD of the disconnected node)
starts competing with the original brain (the HAD of the main cluster) to gain control of the
cluster resources.
We know the result when a human brain splits into two and each half tries to control the body:
it ultimately makes the person sick to the point of death. The same rule applies to the cluster,
and this condition can lead to data destruction on the shared resources. The VCS brains named
this condition the “SPLIT BRAIN” condition.
If you look at the diagram above: at step (1) all cluster interconnects fail, at step (2) the HAD
daemons running on the two nodes start acting like separate brains, and finally at step (3) both
nodes try to access the shared resources forcibly.
Then what is the solution for the split brain condition? The answer is “Membership Arbitration”.
Membership Arbitration
Membership arbitration is nothing but a set of rules to be followed whenever a cluster member
completely disconnects from the other cluster members.
Membership arbitration is necessary on a perceived membership change because systems may
falsely appear to be down. When LLT on a system no longer receives heartbeat messages from
another system on any configured LLT interface, GAB marks the system as DOWN. However, if
the cluster interconnect network failed, a system can appear to be failed when it actually is not. In
most environments when this happens, it is caused by an insufficient cluster interconnect network
infrastructure, usually one that routes all communication links through a single point of failure.
If all the cluster interconnect links fail, it is possible for one cluster to separate into two subclusters,
each of which does not know about the other subcluster. The two subclusters could each carry out
recovery actions for the departed systems. This is termed split brain.
In a split brain condition, two systems could try to import the same storage and cause data
corruption, have an IP address up in two places, or mistakenly run an application in two places at
once.
Membership arbitration guarantees against such split brain conditions.
There are two components in membership arbitration:
1. Fencing Module
2. Coordinator Disks
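For reference, the fencing module is typically driven by two small configuration files; this is only a sketch, and the disk group name vxfencoorddg is an assumed example:
/etc/vxfendg   - contains just the name of the coordinator disk group, for example: vxfencoorddg
/etc/vxfenmode - selects the fencing mode and disk policy, for example:
vxfen_mode=scsi3
scsi3_disk_policy=dmp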
The diagram below explains how the fencing module starts during cluster startup.
The fencing module starts up as follows:
The coordinator disks are placed in a disk group. This allows the fencing startup script to use
Veritas Volume Manager (VxVM) commands to easily determine which disks are coordinator
disks and what paths exist to those disks. This disk group is never imported, and it is not used
for any other purpose.
Step 1. The fencing startup script on each system uses VxVM commands to populate the file
/etc/vxfentab with the paths available to the coordinator disks.
Step 2. The fencing driver examines GAB port b for membership information.
Step 3. If no other systems are up and running, this is the first system up, and its coordinator
disk configuration is considered the correct one.
Steps 5, 6 and 7. When a new member joins and its fencing module starts, it checks GAB port b
for the existing nodes and finds that node-1 is already running in the cluster.
Step 8. Node-2 then requests the coordinator disk configuration from node-1. The system with
the lowest LLT ID responds with the list of coordinator disk serial numbers. If there is a match,
the new member joins the cluster. If there is not a match, vxfen enters an error state and the new
member is not allowed to join. This process ensures that all systems communicate with the same
coordinator disks.
How does the fencing driver determine whether a possible preexisting split brain condition exists?
It does so by verifying that any system that has keys on the coordinator disks can also be seen in
the current GAB membership. If this verification fails, the fencing driver prints a warning to the
console and the system log, and it does not start.
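If this check ever blocks startup after a genuine outage (for example, stale keys left behind by nodes that crashed), VCS ships a vxfenclearpre utility that can remove those old registrations from the coordinator and data disks. Treat it as a last resort, to be run only after confirming that no other subcluster is alive; the exact path of the utility varies with the VCS version:
# vxfenclearpre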
Final step: If all verifications pass, the fencing driver on each system registers its keys with each
coordinator disk. (I have shown this task as steps 4 and 9 in the diagram, but they should actually
be the last steps; sorry for that.)
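To verify that fencing came up as described, the following checks are commonly used (a sketch; output is abbreviated and will differ on your systems):
# cat /etc/vxfentab    (lists the paths to the coordinator disks populated at startup)
# gabconfig -a         (port b membership indicates the fencing module is up on the listed nodes)
# vxfenadm -d          (reports the fencing mode, for example SCSI3, and the cluster-wide fencing state)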
How does the fencing algorithm deal with cluster interconnect failures?
From the diagram above we can understand the function of the fencing algorithm as follows:
Step 1. When node-1 fails (due to the cluster interconnect failure), node-2 initiates the fencing
operation.
Step 2. The GAB module on node-2 determines that node-1 has failed due to the loss of heartbeat
signal reported by LLT. GAB passes the membership change to the fencing module on each system
in the cluster.
Step 3. Node-2 gains control of the coordinator disks by ejecting the key registered by node-1
from each coordinator disk. The ejection takes place one disk at a time, in the order of the
coordinator disks' serial numbers. When the fencing module on node-2 successfully controls the
coordinator disks, HAD carries out any policy associated with the membership change.
Step 4. Node-1 is blocked from accessing the shared storage if that storage was configured in a
service group that has now been taken over and imported by node-2.
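You can observe the registration keys involved in this sequence directly on the coordinator disks. A hedged example (the -s option reads keys in recent VCS releases; older releases used -g instead):
# vxfenadm -s all -f /etc/vxfentab
Before the failure, every coordinator disk lists one key per cluster node; after node-2 completes the ejection, node-1's key is gone from all of the coordinator disks.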
So far so good: the VCS designers provided us with good solutions to deal with this complicated
split brain condition. Now the question is, do the difficulties end here? The answer is “No”.
There are some other scenarios where membership arbitration (using the fencing module and
coordinator disks) alone cannot provide data protection in the cluster. They are:



A system hang causes the kernel to stop processing for a period of time.
The system resources are so busy that the heartbeat signal is not sent.
A break and resume function is supported by the hardware and executed; dropping the system to
the system controller level with a break command can result in a heartbeat signal timeout.
In these types of situations, the systems are not actually down, and may return to the cluster after
cluster membership has been recalculated. This could result in data corruption as a system could
potentially write to disk before it determines it should no longer be in the cluster.
Combining membership arbitration with data protection of the shared storage eliminates all of the
above possibilities for data corruption.
Data protection fences off (removes access to) the shared data storage from any system that is not
a current and verified member of the cluster. Access is blocked by the use of SCSI-3 persistent
reservations.
Membership arbitration combined with data protection is termed I/O Fencing.
From the I/O fencing diagram above you can notice that the shared disks are configured with
SCSI-3 persistent reservations enabled. Enabling SCSI-3 PR along with the membership
arbitration techniques guarantees data protection in the rare scenarios mentioned above.
What is SCSI-3 Persistent Reservation?
SCSI-3 Persistent Reservation (SCSI-3 PR) supports device access from multiple systems, or from
multiple paths from a single system. At the same time it blocks access to the device from other
systems, or other paths.
VCS logic determines when to online a service group on a particular system. If the service group
contains a disk group, the disk group is imported as part of the service group being brought online.
When using SCSI-3 PR, importing the disk group puts registration and reservation on the data
disks. Only the system that has imported the storage with SCSI-3 reservation can write to the
shared storage. This prevents a system that did not participate in membership arbitration from
corrupting the shared storage.
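The switch that ties membership arbitration to this data-disk protection is the UseFence cluster attribute: when it is set to SCSI3, VCS imports the service group's disk groups with SCSI-3 registrations and reservations. A minimal sketch of how it appears in main.cf (the cluster name is a placeholder and other cluster attributes are omitted):
cluster mycluster (
UseFence = SCSI3
)
You can confirm the current setting with: # haclus -value UseFence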
SCSI-3 PR ensures persistent reservations across SCSI bus resets.
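Before relying on SCSI-3 PR, the shared disks themselves must support it. VCS provides the vxfentsthdw utility for this purpose; in its default interactive mode it prompts for two node names and a disk visible from both, then runs register, reserve, and eject checks against that disk and reports whether it is suitable for fencing. Note that the default test overwrites data on the disk, so run it only on unused LUNs (a read-only mode is available in most versions):
# vxfentsthdw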
*** This ends the post here. Please drop your comments ****
Announcement
I believe most of you already know that Symantec is going to release VCS 6.0 very soon. Before
releasing the actual product, Symantec is giving users an opportunity to understand and discuss
its features directly with the VCS development team through a Symantec Connect group.
Symantec is sharing videos with the group members presenting the new features of VCS 6.0. If
you want to experience VCS 6.0 and its features, I would recommend you join the group.
Instructions to join the beta program on the Symantec Connect group: please go through the link below.
https://www-secure.symantec.com/connect/groups/storage-foundation-and-veritas-cluster-server60-beta-program
Note: The primary requirement is that you have an NDA in place, for which we have a document
you can fill out.
Ramdev
I started unixadminschool.com (aka gurkulindia.com) in 2009 as my own personal reference blog,
and later I realized that my learnings might be helpful for other unix admins if I managed my
knowledge base in a more user-friendly format. The result is today's unixadminschool.com. You
can connect with me at https://www.linkedin.com/in/unixadminschool/

27 Responses (26 comments, 1 pingback)
Ramesh (December 24, 2011): Very good article… Is there any way to set up VCS on a laptop using VMware?
kk (December 9, 2013): Yes Ramesh, you can set it up by installing the VCS software in a Solaris virtual box.
Madhav (August 7, 2012): Good article… the thing I liked most and totally agree with is: “During the phase of learning, if our focus is just on the features of the technology without making any attempt to investigate the reason behind the existence of those features, we will always end up with half knowledge. It is always wise to spend some time understanding the history of any technology once you choose to master it.”
Ramdev (August 13, 2012): @Madhav – Thanks for the comment.
vivek (September 10, 2012): Great explanation! Thanks for all your effort. The article best explains the technology behind VCS :)
vivek (September 10, 2012): It would be helpful if you could post an article on WebLogic technologies and the application hosting process. Thanks!
Ramdev (September 11, 2012): Vivek, I will ask our middleware folks for this article.
Daniel (October 27, 2012): Excellent article. Extremely useful. What is an NDA? You mention it in the instructions to join the beta program at the bottom of the article.
Ramdev (October 27, 2012): Hi Daniel, NDA stands for Non-Disclosure Agreement. VCS 6.0 is already in the market, and the beta program is no longer valid.
RamaRao (November 21, 2012): The step-by-step explanation is very good and helpful for new VCS learners. Thanks for your efforts.
Ramdev (November 22, 2012): Ramarao, thank you.
Pavanisastry (January 2, 2013): Thanks to Ramdev.
sagar.parit (January 30, 2013): Hi, I want to start learning VCS, but there are so many documents on the net that I am confused about which one is good for understanding VCS. Please help me.
Ramdev (January 30, 2013): Hi, I would recommend going with this document – http://sfdoccentral.symantec.com/sf/5.1/solaris/pdf/vcs_admin.pdf
sagar.parit (January 30, 2013): Hi, thanks for the quick reply, sir. But honestly, it is 720 pages. Which topics must be read, and which can I exclude? The main reason is that my desktop is 32-bit, so I am unable to run VCS on my PC, and in the office it won't be possible. So which topics must be read to understand VCS and also to help face interviews? Again, many thanks for the reply.
Ramdev (January 31, 2013): Sagar, for the initial learning you can focus on the following: cluster heartbeats (know about GAB and LLT); cluster daemons and cluster startup scripts; cluster service groups and resources; the ha* commands related to cluster operations (starting, stopping, switchover, freeze); the cluster configuration file main.cf; and the different cluster states, service group states, and resource states.
sagar.parit (January 31, 2013): Hi, many thanks for giving me the best path to learn and start with VCS.
Maheshbabu (April 26, 2013): Nice explanation of VCS I/O fencing and the split brain condition. Sir, keep posting good concepts on Linux also; it will be very useful for beginners. Please do the needful.
Ravi Joshi (March 15, 2014): Very nice article for understanding how VCS works. It is a nice thing that you started the blog; sharing knowledge is always good. Keep it up. Ravi
akaky (March 27, 2014): What a great explanation! I am a beginner VCS admin. I couldn't understand why “GAB can no longer reliably discriminate between loss of a system and loss of the network.” Why can VCS not differentiate a system fault from a network fault in the jeopardy situation?
pavani (March 31, 2014): Good article, it helped me a lot; thanks for explaining clearly.
Srinivas (September 13, 2014): Excellent article, which gives a very good idea of VCS behaviour. Thanks a lot, Ram…
sivakumar (September 23, 2014): Hi Ramdev, it is an excellent document. I configured a two-node cluster on my laptop using Openfiler shared storage. I have one doubt: how can we validate I/O fencing on VMware Workstation 9 (which I am using)? I have done lots of searching but am unable to find an answer. Can you advise?
sivakumar (September 23, 2014): Is VxVM certification a mandatory prerequisite for VCS certification?