Which tasks are performed by OVO heartbeat polling and which are not?

+ HBP checks the availability of managed nodes.
+ HBP checks the availability of some core processes.
  DCE:   opcctla, opcmsga, DCE RPC daemon (rpcd)
  HTTPS: ovbbccb, opcmsga, ovcd
+ HBP checks whether messages are buffered by opcmsga.
+ HBP generates the same error messages on every agent platform.

- HBP does NOT check whether processes like opcle, opcmsgi, opctrapi, coda, opcmona etc. are running. This functionality is covered by the control processes. The control programs send an appropriate message if one of their child processes dies or is killed. The control programs opcctla (DCE) and ovcd (HTTPS) generate no message when a subprocess is stopped gracefully (e.g. ovc -stop opcacta). The control programs restart aborted subprocesses.
- For HTTPS agents only: HBP does NOT check whether SSL communication can be established between the OVO server and the agent. The HBP requests are based on HTTP (not HTTPS, for performance reasons).

-------------------------------------------------------------------------------
Basics about OVO Heartbeat Polling

* Basic HBP algorithm (Polling Type "Normal"):
The OVO server periodically sends HBP requests to the agents. For each agent you can configure in the admin GUI the interval at which requests are sent, as well as the request type. By default, a node is first checked with ping packets; in the second phase the OVO-required infrastructure elements on the node itself are checked via RPC (communication broker, control process, message agent). As soon as a status change happens with the agent (positive or negative; node down/up, core process down/up), a heartbeat message is created.

* The "RPC Only" mode:
The HBP behavior changes when the "Polling Type" field is switched from "Normal" (default) to "RPC Only" in the "Modify Node" screen. In "Normal" mode, ping packets plus remote procedure calls are used to verify the node status. "RPC Only" means that no ping packets (ICMP protocol) are sent; only DCE RPCs (or BBC RPCs for the HTTPS agent) are used. The quality of the error messages is better if ping is used as well, but in firewall scenarios the ICMP protocol is typically blocked, and it makes no sense to use ping then. In all cases where an RPC reaches the node, there is no difference in the possible error messages between "Normal" and "RPC Only" mode.
If a node is behind a firewall and ping packets are blocked, the following problem occurs when "Normal" mode is used by accident: OVO always reports that the node is down, because the ping requests are never answered (ping is used as the initial check method here). The correct HBP settings for a node behind a firewall (with ICMP blocked) are:
  "Modify Node -> Polling Type" = "RPC Only"
  "Modify Node -> Agent Sends Alive Packets" = "No"

* The "Agent Sends Alive Packets" button in the "Modify Node" screen:
The "Agent Sends Alive Packets" feature has no impact on the generated HBP error messages. This flag is for HBP network load and performance improvements only. The feature works on the ICMP protocol level, which means it is unusable in firewall environments where ping is blocked. If the flag is enabled, the agent sends a ping reply packet to its primary server at an interval of 2/3 of the regular HBP interval (only to the primary server, not to further MoM servers). If the OVO server gets these alive packets early enough, it does not start its own HBP activities. The performance improvement compared to the normal RPC-based HBP is about 90%. However, the feature is only usable in intranets with no ping-blocking firewalls between the OVO server and the agents.
The feature is reflected by the OPC_HBP_INTERVAL_ON_AGENT setting on the agent. A value of "-1" means that it is switched off. Don't change the setting on the agent itself (don't use e.g. ovconfchg -ns eaagt -set OPC_HBP_INTERVAL_ON_AGENT <value_in_seconds>). The value corresponds to the "Heartbeat Interval" in the "Modify Node" screen and is automatically updated on the agent as soon as a change happens in the GUI.
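To verify which value the server pushed, the setting can be read on the agent without modifying it. A minimal read-only check, assuming the standard ovconfget utility that ships with the HTTPS agent:

  # ovconfget eaagt OPC_HBP_INTERVAL_ON_AGENT

The command prints the configured interval in seconds; "-1" means the alive-packet feature is switched off for this node.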
* The "No Polling" mode:
If you select "No Polling", no heartbeat polling is done at all, not even if you use "Agent Sends Alive Packets" (the agent still sends the packets, but the server does nothing if it doesn't get a packet in time).

* The heartbeat flag:
Independent of the polling type, OVO also maintains a heartbeat flag. When you add a node, this flag is FALSE and no heartbeat polling is done for this node. After you have successfully installed the agent software using the GUI or the inst.sh script, the heartbeat flag is set to TRUE and heartbeat polling starts. If an agent is de-installed using the GUI or inst.sh, the heartbeat flag is set to FALSE again. If you install agents manually, the heartbeat flag is set to TRUE when you perform the step opcsw -installed <node>. The status of the heartbeat flag is displayed for information in the "Modify Node" screen after "Heartbeat Monitoring", for example: Heartbeat Monitoring (Enabled). It cannot be changed in the GUI, though. The heartbeat flag can be set to TRUE using "opchbp -start <node>", and it can be set to FALSE using "opchbp -stop <node>".
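For illustration, a typical sequence on the OVO server after a manual agent installation might look like this (node1.example.com is a placeholder node name; the -status option for viewing the settings is an assumption here, check the opchbp man page of your patch level):

  # opcsw -installed node1.example.com     heartbeat flag set to TRUE, polling starts
  # opchbp -stop node1.example.com         flag set to FALSE, polling suspended
  # opchbp -start node1.example.com        flag set back to TRUE, polling resumes
  # opchbp -status node1.example.com       view the current HBP settings for the node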
* Auto-acknowledgement of heartbeat polling messages:
A heartbeat message acknowledges any previous heartbeat messages for the same node, so you never have more than one heartbeat message per node in the browser at a time. Note that the message key and the message key correlation are not forwarded to the message interceptor if internal message filtering is used. If you want to use internal message filtering, you need to define your own message keys and message key correlations in the opcmsg template that is used.

* Criticality of heartbeat messages:
The most critical problem is always reported. You won't get the error "message agent is down" when the system is down; instead you get "node down", although the message agent is not running either when the system is down.

* Frequency of error messages:
HBP sends an error message only once (except when the OPC_HBP_CONTINOUS_ERRORS flag is set). It does not remind you again every day that a problem exists. To set the flag for recurring HBP error messages:
  # ovconfchg -ovrg server -ns opc -set OPC_HBP_CONTINOUS_ERRORS TRUE
The OVO server must be restarted to activate the setting. The advantage of the setting is that you always have an up-to-date message in the browser about HBP-detected problems. The disadvantage is that the /var/opt/OV/log/System.txt logfile is filled with one message per problematic node per interval. If you don't use the count-and-suppress-duplicates feature, there will also be more messages in the history message table.

* Command-line utility and GUI:
All node-specific HBP configuration, except setting the heartbeat flag, can be done in the OVO admin GUI. Modifications are synchronized with the OVO server processes at runtime. Besides that, there is the utility /opt/OV/bin/OpC/opchbp to enable/disable heartbeat polling on a node or node group basis (by setting the heartbeat flag), or to view the current settings. The heartbeat interval and type can only be viewed, not modified, with opchbp. The meanings of the heartbeat type value as displayed by opchbp are:
  0x0 - No Polling
  0x1 - RPC Only
  0x3 - Normal
  0x4 - No Polling, Agent Sends Alive Packets
  0x5 - RPC Only, Agent Sends Alive Packets
  0x7 - Normal, Agent Sends Alive Packets

-------------------------------------------------------------------------------
The table below lists the OVO messages that can result from OVO heartbeat polling. It can be used as a reference if you want to set up a template for internal message filtering (using the OPC_INT_MSG_FLT setting).
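Internal message filtering is switched on at the agent side. A sketch of how this is typically done, assuming the eaagt namespace on HTTPS agents (as for the other agent settings above) and the opcinfo file on DCE agents:

  HTTPS agent: # ovconfchg -ns eaagt -set OPC_INT_MSG_FLT TRUE
  DCE agent:   add the line "OPC_INT_MSG_FLT TRUE" to the agent's opcinfo file

Remember to define your own message keys and key correlations in the opcmsg template, as described under auto-acknowledgement above.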
-------------------------------------------------------------------------------
HTTPS Agent
-------------------------------------------------------------------------------
Messages describing a problem:
OpC40-404:  Message agent on node <node> is not running. (since 8.23 server patch)
OpC40-433:  The llbd/rpcdaemon on node <node> is down. (before 8.13 server patch)
OpC40-1913: OV Communication Broker (ovbbccb) on node <node> is down. (since 8.13 server patch)
OpC40-434:  Routing packages via gateway <node> to node <node> failed (NET_UNREACHABLE)
OpC40-435:  Routing packages via gateway <node> to node <node> failed (HOST_UNREACHABLE)
OpC40-436:  Node <node> is probably down. Contacting it with ping packages failed. (since 8.23 server patch)
OpC40-1900: The local core ID for node <node> is not the same as the core ID for this node stored in the OVO database!
OpC40-1901: Node <node> does not have a security certificate installed!
OpC40-1902: Security certificate deployment pending on node <node>.
OpC40-1903: Security certificate deployment on node <node> denied!
OpC40-1904: OV Control Daemon is not running on node <node>!
OpC40-1905: Message Agent on node <node> is buffering messages.
OpC40-1906: Message Agent on node <node> is buffering messages for this Management Server.
OpC40-1911: Event/Action RPC server (Message Agent) is not running on node <node>. (before 8.23 server patch)
OpC40-1911: Failed to contact node <node> with BBC. Probably the node is down or there's a network problem. (since 8.23 server patch)

Messages describing a (return to) normal situation:
OpC40-1907: Message Agent on node <node> is no longer buffering messages.
OpC40-1908: Core ID on node <node> has been aligned with the value stored in the OVO database.
OpC40-1909: Security certificate has been installed on node <node>.
OpC40-1910: OV Control Daemon on node <node> is now running.
OpC40-1912: Event/Action RPC server (Message Agent) is now running on node <node>. (before 8.23 server patch)
OpC40-1912: Successfully contacted the OVO agent on node <node> via BBC. (since 8.23 server patch)

-------------------------------------------------------------------------------
DCE Agent
-------------------------------------------------------------------------------
Messages describing a problem:
OpC40-404:  Message agent on node <node> is not running.
OpC40-405:  Control agent on node <node> isn't accessible.
OpC40-431:  The control agent on node <node> is registered at the llbd or rpcdaemon but it is not running.
OpC40-432:  The llbd on node <node> seems to be down. The OVO mgmt-server cannot contact the managed node by using NCS.
OpC40-433:  The llbd/rpcdaemon on node <node> is down.
OpC40-434:  Routing packages via gateway <node> to node <node> failed (NET_UNREACHABLE)
OpC40-435:  Routing packages via gateway <node> to node <node> failed (HOST_UNREACHABLE)
OpC40-436:  Node <node> is probably down. Contacting it with ping packages failed.
OpC40-441:  Message agent with process-id <pid> aborted on node <node>.
OpC40-1410: The Message Agent on node <node> is buffering messages for this Management Server.
OpC40-1411: The Message Agent on node <node> is buffering messages.

Messages describing a (return to) normal situation:
OpC40-462:  Control agent on node <node> is now running.
OpC40-1408: The Message Agent on node <node> is now running.
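As a quick cross-check of the mapping below, the message-agent case is easy to provoke on an HTTPS test node (run the commands on the agent; wait for the next polling cycle before looking for the message; message texts as of the 8.23 server patch):

  # ovc -stop opcmsga      the next poll reports OpC40-404 (Message agent is not running)
  # ovc -start opcmsga     the next poll reports OpC40-1912 (Successfully contacted the OVO agent via BBC)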
-------------------------------------------------------------------------------
Which heartbeat message is generated when:

Error cases:
- A node is down or there's a network problem:
  In "RPC Only" mode:
    DCE: OpC40-432 (llbd seems to be down)
    HTTPS: OpC40-1911 (Failed to contact node with BBC)
  When ping can be used (no "RPC Only" set):
    DCE and HTTPS: OpC40-436 (Node is probably down); OpC40-434 and OpC40-435 in case of certain network problems.
- The DCE RPC daemon or the HTTPS BBC communication broker isn't running:
  DCE, rpcd process down: OpC40-432 (llbd on node seems to be down). Note that this is the same message as NODE_DOWN in "RPC Only" mode.
  HTTPS, ovbbccb process not running: OpC40-1913 (OV Communication Broker is down), e.g. when stopping the OVO agent via "ovc -kill".
  HTTPS, ovbbccb hanging or halted: OpC40-1911 (Failed to contact node with BBC). Note that this is the same message as NODE_DOWN in "RPC Only" mode. The message can be provoked by sending the signal SIGSTOP to the ovbbccb process.
- Message agent not running:
  DCE: OpC40-441 (Message agent aborted) or OpC40-404 (Message agent is not running). Note: in case of "opcagt -stop" (regular stop of opcmsga) no heartbeat message is generated (a difference to the HTTPS agent); a message is only created when the message agent aborts.
  HTTPS: OpC40-404 (Message agent is not running). Note: whether the message agent aborted or was stopped regularly (e.g. via "ovc -stop opcmsga"), both lead to the same error message.
  The difference is caused by the different architecture of the DCE and HTTPS agents: on DCE the heartbeat requests are handled by opcctla (control agent), on HTTPS by opcmsga.
- Control process not running:
  DCE: OpC40-405 (Control agent isn't accessible). Can be provoked by stopping the DCE agent via "opcagt -kill".
  HTTPS: OpC40-1904 (ovcd process not running). If this message appears, the OVO message agent is running but the ovcd process is not. Can be provoked by killing or halting the ovcd process.
- OVO agent completely stopped via "opcagt -kill" (or "ovc -kill" for HTTPS):
  DCE: OpC40-405 (control agent down)
  HTTPS: OpC40-1913 (ovbbccb down)
- OVO agent processes stopped, but control process still running ("opcagt -stop", respectively "ovc -stop" or "ovc -stop EA"):
  DCE: no error message (this is not regarded as an error on DCE agents)
  HTTPS: OpC40-404 (message agent down)
- Message agent buffering:
  DCE (this functionality works only for Unix nodes):
    OpC40-1410 (opcmsga is buffering for the OVO server that does the HBP). Critical. Communication from the OVO server to the agent is working, but not from the agent to the server: the OVO server is up and running but the agent can't send messages. Can be provoked by killing the opcmsgrd process.
    OpC40-1411 (opcmsga is buffering messages for any other OVO server). Warning. Happens in MoM scenarios, e.g. if a certain OVO server is temporarily unavailable.
  HTTPS:
    OpC40-1906 (opcmsga is buffering for the OVO server that does the HBP). Critical. Communication from the OVO server to the agent is working, but not from the agent to the server: the OVO server is up and running but the agent can't send messages. Can be provoked by killing the opcmsgrb process.
    OpC40-1905 (opcmsga is buffering messages for any other OVO server). Warning. Happens in MoM scenarios, e.g. if a certain OVO server is temporarily unavailable.
- Core ID mismatch detected (HTTPS agents only):
  HTTPS: OpC40-1900 (Core ID mismatch). This problem disturbs the OVO server -> agent communication. It does not disturb the agent -> server communication itself, but messages will not be added to the database; they are discarded because of the mismatch.

Success cases:
- OVO agent up and running again:
  DCE: OpC40-462 (Control agent is now running), or OpC40-1408 (Message Agent is now running) if the message agent was down
  HTTPS: OpC40-1912 (Successfully contacted the OVO agent via BBC)
- Message agent no longer buffering:
  DCE: OpC40-1408 (Message Agent is now running)
  HTTPS: OpC40-1907 (Message Agent is no longer buffering messages)
- Core ID has been corrected (HTTPS agents only):
  HTTPS: OpC40-1908 (Core ID has been aligned)

-------------------------------------------------------------------------------
What's different and what's the same with DCE and HTTPS HBP:
- On DCE, opcctla handles the HBP requests; on HTTPS, opcmsga does.
- The DCE opcctla checks whether its child process opcmsga is running; the HTTPS opcmsga checks via RPC whether ovcd is running.
- "Message agent not running" is treated as an error on HTTPS, whereas it is not regarded as an error on DCE if opcmsga was stopped regularly.
- The HTTPS HBP also informs about certificate and core ID problems.
- The HTTPS agent heartbeat RPCs use HTTP (not HTTPS, for performance reasons); the DCE agent RPCs use DCE. The ping calls are identical for DCE and HTTPS.
= The "RPC Only" mode exists for DCE and HTTPS with the same semantics.
= DCE and HTTPS inform in the same manner whether opcmsga is buffering.
= "Agent Sends Alive Packets" works in the same manner for DCE and HTTPS.

-------------------------------------------------------------------------------
Note that the 8.14 server patch introduced several heartbeat polling changes. From the patch text:
- Introduced the new config parameter OPC_HBP_CONTINOUS_ERRORS. If OPC_HBP_CONTINOUS_ERRORS is set to TRUE, heartbeat polling errors and buffering messages are sent each polling interval and not only once.
- Added the buffering messages OpC40-1410 and OpC40-1411 to the heartbeat msgkey correlation list.
- Ensured for OVO 8 that the different heartbeat messages correlate with each other as in OVO 7, where this makes sense.
- Introduced the new config parameter OPC_HBP_NORMAL_START_MSG. If OPC_HBP_NORMAL_START_MSG is set to TRUE, ovoareqsdr sends an "agent is now running" message when it is started, for each running node with HBP enabled (as was already the behavior for HTTPS agents with OVO 8). The default is FALSE.
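Like OPC_HBP_CONTINOUS_ERRORS, OPC_HBP_NORMAL_START_MSG is a server-side setting; a sketch following the same pattern as above (presumably a server restart is needed to activate it as well):

  # ovconfchg -ovrg server -ns opc -set OPC_HBP_NORMAL_START_MSG TRUE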
-------------------------------------------------------------------------------
Note that the 8.23 server patch introduced several heartbeat polling changes. From the patch text:
- Re-introduced the messages OpC40-404 (message agent down) and OpC40-436 (node unreachable even with ping packages) for the HTTPS agents (just like they existed for the DCE agent). Heartbeat messages are more precise now.
- Fixed the problem that the /var/opt/OV/log/System.txt file was filled up with low-level heartbeat messages.

-------------------------------------------------------------------------------
Note that the 8.25 server patch introduced the following heartbeat polling change:
- A new option -interval was added to the opchbp CLI for setting the heartbeat interval:
  opchbp -interval <interval>
  Sets the heartbeat interval to the specified period. The interval value must be specified in the format 00h00m00s.
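For example, to set a five-minute heartbeat interval (node1.example.com is a placeholder; it is assumed here that -interval is combined with a node list like the other opchbp options):

  # opchbp -interval 00h05m00s node1.example.com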