Which tasks are performed by the OVO heartbeat polling and which are not?
+ HBP checks availability of managed nodes
+ HBP checks the availability of some core processes
DCE: opcctla, opcmsga, DCE RPC daemon (rpcd)
HTTPS: ovbbccb, opcmsga, ovcd
+ HBP checks whether messages are buffered by the opcmsga.
+ HBP creates the same error messages for each agent platform.
- HBP does NOT check whether processes like opcle, opcmsgi, opctrapi,
coda, opcmona etc. are running. This functionality is covered by the control
processes. The control programs send an appropriate message if one of their
child processes dies or is killed.
The control programs opcctla (DCE) and ovcd (HTTPS) generate no message
when a child process is gracefully stopped (e.g. ovc -stop opcacta).
The control programs restart aborted child processes.
- For HTTPS agents only: HBP does NOT check whether SSL communication
can be established between OVO server and agent. The HBP requests are based
on HTTP (not HTTPS, for performance reasons).
-------------------------------------------------------------------------------
Basics about OVO Heartbeat Polling
* Basic HBP algorithm (Polling Type "Normal"):
The OVO server periodically sends HBP requests to the agents. For each agent
you can configure in the admin GUI the interval at which requests are sent,
as well as the request type. By default, a node is first checked with
ping packets; in the second phase the OVO-required infrastructure elements
on the node itself are checked via RPC (communication broker, control process,
message agent). As soon as the agent's status changes (positive or
negative; node down/up, core process down/up), a heartbeat message is
created.
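To make the two-phase sequence concrete, here is a minimal shell sketch of the
"Normal" polling cycle for a single node. It is purely illustrative: the real
logic lives in the OVO server processes, and rpc_check and report are
hypothetical placeholders, not real OVO commands.

    #!/bin/sh
    # Illustrative sketch only; rpc_check/report are hypothetical placeholders.
    prev=up
    while true; do
        if ! ping -c 1 "$NODE" >/dev/null 2>&1; then
            state="node down"             # phase 1: ICMP check failed
        elif ! rpc_check "$NODE"; then
            state="core process down"     # phase 2: broker/control/message agent via RPC
        else
            state=up
        fi
        # a heartbeat message is created only on a status change
        if [ "$state" != "$prev" ]; then
            report "$NODE: $state"
        fi
        prev=$state
        sleep "$HBP_INTERVAL"
    done
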
* The "RPC Only" mode:
The HBP behavior changes when the "Polling Type" field is switched from "Normal"
(default) to "RPC Only" in the "Modify Node" screen. In the "Normal" mode,
ping packages + remote procedure calls are used to verify the node status.
"RPC Only" means, that no ping packages (ICMP protocol) are sent, but only
DCE RPCs or respectively BBC RPCs (for the HTTPS agent). The quality of error
messages is better if ping is used as well, but in firewall scenarios the
ICMP protocol is typically blocked and it doesn't make sense to use ping then.
For all cases where an RPC reaches the node, there's no difference in the
possible error messages for "Normal" or "RPC Only" mode.
If a node is behind a firewall and ping packages are blocked, the
following problem happens in case "Normal" mode is accidentally used.
OVO always reports that the node is down, because the ping requests
will never be responded (ping is used as initial check-method here).
The correct HBP settings for a node behind a firewall (with ICMP blocked) are:
"Modify Node -> Polling Type" = "RPC Only"
"Modify Node -> Agent Sends Alive Packets" = "No"
* The "Agent Sends Alive Packets" button in the "Modify Node screen":
The "agent Sends Alive Packets" feature has no impact on the generated HBP
error messages. This flag is for HBP network load and performance improvements
only.
The feature works on ICMP protocol level. Means that it's unusable in
firewall environments where ping is blocked. If the flag is enabled, then
the agents sends in an interval of 2/3 of the regular HBP interval a
ping reply package to its primary server (only to the primary server, not
to further MoM servers).
If the OVO server gets the alive-ping-packages early enough, it doesn't
start own HBP activities. The performance improvement compared to the normal
RPC-based HBP is about 90%. But, the feature is only usable in intranets
with no ping-blocking firewalls between OVO server and agents.
The feature is reflected by the OPC_HBP_INTERVAL_ON_AGENT setting on the
agent. Value of "-1" means that it's switched off. Don't change the setting
on the agent itself (don't use e.g. ovconfchg -ns eaagt -set
OPC_HBP_INTERVAL_ON_AGENT <value_in_seconds>). The value corresponds to
the
"Heartbeat Interval" in the "Modify Node" screen and is automatically
updated on the agent as soon a change happens in the GUI.
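To verify what the GUI pushed to the node, you can read the value there.
Assuming the standard ovconfget syntax of the HTTPS agent, a read-only check
looks like this:

    # ovconfget eaagt OPC_HBP_INTERVAL_ON_AGENT

This prints e.g. "-1" when "Agent Sends Alive Packets" is switched off for the
node; reading the value is safe, only changing it is discouraged.
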
* The "No Polling" mode:
If you select "No Polling", no heartbeat polling will be done at all, not even
if you use "Agent Sends Alive Packets" (the Agent will still send the packages,
but the server won't do anything if it doesn't get a package in time).
* The Heartbeat Flag:
Independent of the heartbeat type, OVO also maintains a heartbeat flag. When
you add a node, this flag is FALSE and no heartbeat polling is done for the
node. After you have successfully installed the agent software using the GUI
or the inst.sh script, the heartbeat flag is set to TRUE and heartbeat polling
starts. If an agent is de-installed using the GUI or inst.sh, the heartbeat
flag is set to FALSE again.
If you install agents manually, the heartbeat flag is set to TRUE when you
perform the step opcsw -installed <node>.
The status of the heartbeat flag is displayed for information in the "Modify
Node" screen after "Heartbeat Monitoring", for example: Heartbeat Monitoring
(Enabled). It cannot be changed in the GUI.
The heartbeat flag can be set to TRUE using "opchbp -start <node>", and it can
be set to FALSE using "opchbp -stop <node>".
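For example, to disable heartbeat polling for a node during a planned
maintenance window and re-enable it afterwards (the node name is a
placeholder):

    # /opt/OV/bin/OpC/opchbp -stop mynode.example.com
    ... maintenance ...
    # /opt/OV/bin/OpC/opchbp -start mynode.example.com
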
* Auto-acknowledgement of heartbeat polling messages:
A heartbeat message acknowledges any previous heartbeat messages for the same
node, so you never have more than one heartbeat message per node in the
browser at a time.
Note that the message key and the message key correlation are not forwarded to
the message interceptor if internal message filtering is used. If you want to
use internal message filtering, you will need to define your own message keys
and message key correlations in the opcmsg template that is used.
* Criticality of heartbeat messages:
The most critical problem is always reported. You won't get the error
"message agent is down" when the system is down; instead you get "node down",
although the message agent isn't running either when the system is down.
* Frequency of error messages:
HBP sends an error message only once (except when the OPC_HBP_CONTINOUS_ERRORS
flag is set). It doesn't remind you again every day that a problem exists.
To set the flag for recurring HBP error messages:
# ovconfchg -ovrg server -ns opc -set OPC_HBP_CONTINOUS_ERRORS TRUE
The OVO server must be restarted to activate the setting. The advantage of the
setting is that you always have an up-to-date message in the browser about
HBP-detected problems. The disadvantage is that the /var/opt/OV/log/System.txt
logfile is filled with one message per problematic node per interval. If you
don't use "count and suppress duplicates", there will also be more messages in
the history message table.
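The restart can be done with the standard ovstop/ovstart commands on the
management server; treat this as one common sequence, as the exact component
names can vary with the installation:

    # ovstop opc
    # ovstart opc
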
* Command-line utility and GUI:
All node-specific HBP configurations, except setting the heartbeat flag, can
be done in the OVO admin GUI. Modifications are synchronized with the OVO
server processes at runtime.
Besides that, there is the utility /opt/OV/bin/OpC/opchbp to enable/disable
heartbeat polling on a node or node group basis (by setting the heartbeat
flag), or to view the actual settings.
The heartbeat interval and type can only be viewed, not modified, by opchbp.
Here are the meanings of the heartbeat type values as displayed by opchbp:
0x0 - No Polling
0x1 - RPC Only
0x3 - Normal
0x4 - No Polling, Agent Sends Alive Packets
0x5 - RPC Only, Agent Sends Alive Packets
0x7 - Normal, Agent Sends Alive Packets
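The values form a bitmask; judging from the table, bit 0x1 selects RPC
polling, bit 0x2 adds the ping phase ("Normal"), and bit 0x4 marks "Agent
Sends Alive Packets". A small shell sketch that decodes a value (the bit
layout is inferred from the table above, not taken from official
documentation):

    # decode_hbp_type: map an opchbp heartbeat-type value to its meaning
    decode_hbp_type() {
        t=$(( $1 ))
        case $(( t & 3 )) in
            0) mode="No Polling" ;;
            1) mode="RPC Only" ;;
            3) mode="Normal" ;;
            *) mode="unknown" ;;
        esac
        [ $(( t & 4 )) -ne 0 ] && mode="$mode, Agent Sends Alive Packets"
        echo "$mode"
    }
    decode_hbp_type 0x5    # prints: RPC Only, Agent Sends Alive Packets
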
-------------------------------------------------------------------------------
The table below lists the OVO messages that can result from OVO heartbeat
polling. It can be used as a reference if you want to set up a template for
internal message filtering (using the OPC_INT_MSG_FLT setting).
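Assuming OPC_INT_MSG_FLT is enabled like the other server-side settings shown
in this document (this is an assumption; verify against your patch
documentation), switching internal message filtering on would look like this:

    # ovconfchg -ovrg server -ns opc -set OPC_INT_MSG_FLT TRUE
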
-------------------------------------------------------------------------------
HTTPS Agent
-------------------------------------------------------------------------------
Messages describing a problem
OpC40-404: Message agent on node <node> is not running. (since 8.23 server patch)
OpC40-433: The llbd/rpcdaemon on node <node> is down. (Before 8.13 server patch)
OpC40-1913: OV Communication Broker (ovbbccb) on node <node> is down. (Since
8.13 server patch)
OpC40-434: Routing packages via gateway <node> to node <node> failed
(NET_UNREACHABLE)
OpC40-435: Routing packages via gateway <node> to node <node> failed
(HOST_UNREACHABLE)
OpC40-436: Node <node> is probably down. Contacting it with ping packages failed.
(since 8.23 server patch)
OpC40-1900: The local core ID for node <node> is not the same as the core ID for this
node stored in the OVO database!
OpC40-1901: Node <node> does not have a security certificate installed!
OpC40-1902: Security certificate deployment pending on node <node>.
OpC40-1903: Security certificate deployment on node <node> denied!
OpC40-1904: OV Control Daemon is not running on node <node>!
OpC40-1905: Message Agent on node <node> is buffering messages.
OpC40-1906: Message Agent on node <node> is buffering messages for this
Management Server.
OpC40-1911: Event/Action RPC server (Message Agent) is not running on node
<node>. (before 8.23 server patch)
OpC40-1911: Failed to contact node <node> with BBC. Probably the node is down or
there's a network problem. (since 8.23 server patch)
Messages describing a (return to) normal situation
OpC40-1907: Message Agent on node <node> is no longer buffering messages.
OpC40-1908: Core ID on node <node> has been aligned with the value stored in the
OVO database.
OpC40-1909: Security certificate has been installed on node <node>.
OpC40-1910: OV Control Daemon on node <node> is now running.
OpC40-1912: Event/Action RPC server (Message Agent) is now running on node
<node>. (before 8.23 server patch)
OpC40-1912: Successfully contacted the OVO agent on node <node> via BBC. (since
8.23 server patch)
-------------------------------------------------------------------------------
DCE Agent
-------------------------------------------------------------------------------
Messages describing a problem
OpC40-404: Message agent on node <node> is not running.
OpC40-405: Control agent on node <node> isn't accessible.
OpC40-431: The control agent on node <node> is registered at the llbd or rpcdaemon but
it is not running.
OpC40-432: The llbd on node <node> seems to be down. The OVO mgmt-server cannot
contact the managed node by using NCS.
OpC40-433: The llbd/rpcdaemon on node <node> is down.
OpC40-434: Routing packages via gateway <node> to node <node> failed
(NET_UNREACHABLE)
OpC40-435: Routing packages via gateway <node> to node <node> failed
(HOST_UNREACHABLE)
OpC40-436: Node <node> is probably down. Contacting it with ping packages failed.
OpC40-441: Message agent with process-id <pid> aborted on node <node>.
OpC40-1410: The Message Agent on node <node> is buffering messages for this
Management Server.
OpC40-1411: The Message Agent on node <node> is buffering messages.
Messages describing a (return to) normal situation
OpC40-462: Control agent on node <node> is now running.
OpC40-1408: The Message Agent on node <node> is now running.
-------------------------------------------------------------------------------
Which heartbeat message is generated when:
Error cases:
- a node is down or there's a network problem:
in RPC-only mode:
DCE: OpC40-432 (llbd seems to be down)
HTTPS: OpC40-1911 (Failed to contact node with BBC)
when ping can be used (no "RPC Only" set):
DCE and HTTPS: OpC40-436 (Node is probably down)
(OpC40-434 and OpC40-435 in case of some network problems).
- the DCE RPC daemon or the HTTPS BBC communication broker isn't running:
DCE: DCE daemon (rpcd process) is down: OpC40-432 (llbd on node seems to be
down)
Note that it's the same message as NODE_DOWN in RPC-only mode.
HTTPS: BBC communication broker (ovbbccb process) not running:
OpC40-1913 (OV Communication Broker is down)
E.g. when stopping the OVO agent via "ovc -kill".
HTTPS: BBC communication broker hanging or halted:
OpC40-1911 (Failed to contact node with BBC)
Note that it's the same message as NODE_DOWN in RPC-only mode.
The message can be provoked by sending signal SIGSTOP to the ovbbccb
process (see the reproduction sketch after this list).
- message agent not running:
DCE: OpC40-441 (Message agent aborted) or OpC40-404 (Message agent is not
running)
Note: in case of "opcagt -stop" (regular stop of opcmsga) no heartbeat
message is generated (a difference to the HTTPS agent). A message is
only created when the message agent aborted.
HTTPS: OpC40-404 (Message agent is not running)
Note: whether the message agent aborted or was regularly stopped
(e.g. via "ovc -stop opcmsga"), both cases lead to the same error
message. The difference is caused by the different architecture of
the DCE and HTTPS agents: on DCE the heartbeat requests are handled
by opcctla (control agent), on HTTPS by opcmsga.
- control process not running:
DCE: OpC40-405 (Control agent isn't accessible)
Can be provoked by stopping the DCE agent via "opcagt -kill".
HTTPS: OpC40-1904 (ovcd process not running)
If this message appears, it means that the OVO message agent is
running, but the "ovcd" process is not.
Can be provoked by killing or halting the ovcd process.
- OVO agent completely stopped via opcagt -kill (or ovc -kill for HTTPS):
DCE: OpC40-405 (control agent down)
HTTPS: OpC40-1913 (ovbbccb down)
- OVO agent processes stopped, but control process still running
(opcagt -stop, or respectively ovc -stop / ovc -stop EA):
DCE: no error message (this is not regarded as an error on DCE agents)
HTTPS: OpC40-404 (message agent down)
- message agent buffering:
DCE: (functionality works only for Unix nodes)
OpC40-1410 (opcmsga is buffering for the OVO server that does the HBP)
Critical. Communication from OVO server to agent is working, but not
from agent to server. The OVO server is up and running but the agent
can't send messages. Can be provoked by killing the opcmsgrd process.
OpC40-1411 (opcmsga is buffering messages for any other OVO server)
Warning. Happens in MoM scenarios, e.g. if a certain OVO server is
temporarily unavailable.
HTTPS:
OpC40-1906 (opcmsga is buffering for the OVO server that does the HBP)
Critical. Communication from OVO server to agent is working, but not
from agent to server. The OVO server is up and running but the agent
can't send messages. Can be provoked by killing the opcmsgrb process.
OpC40-1905 (opcmsga is buffering messages for any other OVO server)
Warning. Happens in MoM scenarios, e.g. if a certain OVO server is
temporarily unavailable.
- core ID mismatch detected (HTTPS agents only):
HTTPS: OpC40-1900 (Core ID mismatch)
This problem disturbs the OVO server -> agent communication. It does
not disturb the agent -> server communication itself, but messages will
not be added to the database and are removed due to the mismatch
(see the core ID check sketch after this list).
Success cases:
- OVO agent up and running again:
DCE: OpC40-462 (Control agent is now running) or OpC40-1408
(Message Agent is now running) if message agent was down
HTTPS: OpC40-1912 (Successfully contacted the OVO agent via BBC)
- message agent no longer buffering:
DCE: OpC40-1408 (Message Agent is now running)
HTTPS: OpC40-1907 (Message Agent is no longer buffering messages)
- Core ID has been corrected (HTTPS agent only):
HTTPS: OpC40-1908 (Core ID has been aligned)
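As referenced above, the "hanging broker" case (OpC40-1911) can be reproduced
in a lab by halting ovbbccb with SIGSTOP and resuming it later. The sketch
assumes pgrep is available on the managed node:

    # kill -STOP $(pgrep ovbbccb)    (broker halted; OpC40-1911 after the next HBP cycle)
    # kill -CONT $(pgrep ovbbccb)    (broker resumed; success message after the next cycle)

For the core ID mismatch case, a quick agent-side check is to print the local
core ID with the standard ovcoreid tool and compare it with the value stored
for the node in the OVO database (e.g. as shown in the admin GUI):

    # ovcoreid
    (prints the local core ID of the HTTPS agent)
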
-------------------------------------------------------------------------------
What's different and what's the same with DCE and HTTPS HBP:
- On DCE, opcctla handles the HBP requests, whereas on HTTPS opcmsga does.
- On DCE, opcctla checks whether its child process "opcmsga" is running; on
HTTPS, opcmsga checks via RPC whether "ovcd" is running.
- "Message agent not running" is treated as an error on HTTPS, whereas it's
not regarded as an error on DCE if opcmsga was regularly stopped.
- The HTTPS HBP also informs about certificate and core ID problems.
- The HTTPS agent heartbeat RPCs use HTTP (not HTTPS, for performance
reasons); the DCE agent RPCs use DCE. The ping calls are identical for
DCE and HTTPS.
= "RPC Only" mode exists for DCE and HTTPS with the same semantics.
= DCE and HTTPS inform in the same manner whether opcmsga is buffering.
= "Agent Sends Alive Packets" works in the same manner for DCE and HTTPS.
-------------------------------------------------------------------------------
Note that the 8.14 server patch introduced several heartbeat polling changes.
From the patch text:
- Introduced new config parameter
OPC_HBP_CONTINOUS_ERRORS. If OPC_HBP_CONTINOUS_ERRORS
is set to TRUE, heartbeat polling errors and buffering
messages will be sent each polling interval and not
only once.
- Added the buffering messages OpC40-1410 and
OpC40-1411 to the heartbeat msgkey correlation list.
- Made sure for OVO8 that the different heartbeat
messages correlate each other like with OVO7,
where it makes sense.
- Introduced new config parameter
OPC_HBP_NORMAL_START_MSG. If OPC_HBP_NORMAL_START_MSG
is set to TRUE, ovoareqsdr will send an agent is now
running message when started for each running node
with HBP enabled (like it was for HTTPS agents with
OVO8). The default is FALSE.
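Assuming OPC_HBP_NORMAL_START_MSG is set the same way as the other server-side
parameters in this document, enabling it would look like this:

    # ovconfchg -ovrg server -ns opc -set OPC_HBP_NORMAL_START_MSG TRUE
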
-------------------------------------------------------------------------------
Note that the 8.23 server patch introduced several heartbeat polling changes.
From the patch text:
- re-introduced messages 40-404 (message agent down) and 40-436 (node
unreachable even with ping packages) for the HTTPS agents
(just like they existed for the DCE agent). Heartbeat messages are
more precise now.
- Fixed the problem that the /var/opt/OV/log/System.txt file was
filled up with low-level heartbeat messages.
-------------------------------------------------------------------------------
Note that the 8.25 server patch introduced the following heartbeat polling
change: a new option, -interval, was added to the opchbp CLI for setting the
heartbeat interval:
opchbp -interval <interval>
Sets the heartbeat interval to the specified period. The interval value
must be specified in the format 00h00m00s.
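For example, to set a five-minute heartbeat interval (passing the node name as
an argument is an assumption based on how the -start/-stop options take node
names):

    # /opt/OV/bin/OpC/opchbp -interval 00h05m00s mynode.example.com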