ESM monitoring of Gateway node results in repeated alerts for "NODE.operatingStatus value UNKNOWN is out of tolerance"

Article ID: 42891

Updated On:

Products

STARTER PACK-7 CA Rapid App Security CA API Gateway

Issue/Introduction

Solution

Background

The CA Enterprise Service Manager (ESM) can monitor one or more CA API Gateway appliances and clusters. This monitoring covers system operating status as well as system resource utilization levels such as storage, CPU, and memory. Under certain circumstances ESM may be unable to poll the operating status of a Gateway node. This article describes the steps for resolving a situation where an ESM implementation is unable to monitor the operating status of a Gateway node in a monitored cluster. The inability to monitor the running state--while anomalous--does not negatively impact the availability of the Gateway. In most circumstances the Gateway will still be capable of processing message traffic.

Presentation

The following log entry may appear in the ESM logs when the behavior is being exhibited:
com.l7tech.server.processcontroller.monitoring.MonitoringKernelImpl: NODE.operatingStatus value UNKNOWN is out of tolerance (not equal to RUNNING)
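To gauge how often the alert is firing, the log entry above can be counted with a short script. The following is a minimal sketch in Python; the function name count_status_alerts is hypothetical, and it assumes only that the message text appears in each log line exactly as quoted above. The location and overall layout of the ESM log files are not stated in this article, so the caller supplies the lines.

```python
import re

# Message text as quoted in this article; the rest of the log line layout is an assumption.
STATUS_PATTERN = re.compile(
    r"NODE\.operatingStatus value (\w+) is out of tolerance \(not equal to RUNNING\)"
)


def count_status_alerts(lines):
    """Count out-of-tolerance operating-status entries, grouped by reported value."""
    counts = {}
    for line in lines:
        match = STATUS_PATTERN.search(line)
        if match:
            value = match.group(1)
            counts[value] = counts.get(value, 0) + 1
    return counts
```

An open file handle for an ESM log can be passed directly as the lines argument, since iterating over a text file yields one line at a time.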

The Monitored Properties dashboard of ESM may display the following status for the Operating Status parameter of a particular Gateway node:

<Please see attached file for image>

A screen capture of the Monitored Properties dashboard in ESM

Resolution

This issue occurs when a Gateway node or cluster is under a significant amount of load. A large quantity of sustained message processing traffic may result in the Gateway being unable to report its running state to ESM. The UNKNOWN operating state will persist until the sustained traffic has subsided long enough for the Gateway to continue processing traffic.

The Gateway cluster may need additional nodes added to compensate for unexpected or surging message traffic. This would distribute more traffic over more nodes, which would allow each node to have more reserve resources for processing requests. Increasing the amount of system resources available to each node in the cluster--specifically CPU and memory--would also alleviate the issue by providing more resources to handle surging utilization.

If modifying or adding to the infrastructure is not feasible, the impacted Gateway nodes can be configured to wait longer before falsely reporting an unknown operating state. To implement these changes, execute the following procedure from the privileged shell of each impacted node:

  1. Open the host properties configuration file (/opt/SecureSpan/Controller/etc/host.properties) for editing
  2. Add the following properties:
     • host.sampler.timeout.fast.connect=30000
     • host.sampler.timeout.fast.read=60000
  3. Restart the Gateway appliance

These properties change the read and connect timeouts in milliseconds for the inter-process communication links for the Gateway application. These do not impact the read and connection timeouts for normal message traffic processing. It may be necessary to adjust these values upwards or downwards based on the necessary responsiveness and load on the system. 60000 milliseconds may be too long in some circumstances (resulting in false negatives) or too aggressive in other circumstances (resulting in false positives). The values of "30000" and "60000" are starting points for evaluation.
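When several nodes need the same change, the edit to host.properties can be scripted. The following is a minimal sketch in Python, not a vendor-supplied tool: the helper names merge_properties and update_host_properties are hypothetical, and it assumes the file uses plain key=value lines. It sets each property once, replaces an existing value rather than appending a duplicate, and leaves all other lines untouched.

```python
from pathlib import Path

# Values taken from this article; they are starting points for evaluation, not tuned settings.
SAMPLER_TIMEOUTS = {
    "host.sampler.timeout.fast.connect": "30000",
    "host.sampler.timeout.fast.read": "60000",
}


def merge_properties(text, updates):
    """Return properties text in which every key in `updates` is defined exactly once."""
    pending = dict(updates)
    merged = []
    for line in text.splitlines():
        key = line.split("=", 1)[0].strip() if "=" in line else None
        if key in updates:
            if key in pending:
                # First definition of the key: replace its value in place.
                merged.append(f"{key}={pending.pop(key)}")
            # Any later duplicate definition of the same key is dropped.
            continue
        merged.append(line)
    # Keys that were not present in the file are appended at the end.
    merged.extend(f"{key}={value}" for key, value in pending.items())
    return "\n".join(merged) + "\n"


def update_host_properties(path):
    """Rewrite the file at `path`, keeping the previous contents in a .bak copy."""
    target = Path(path)
    original = target.read_text()
    target.with_name(target.name + ".bak").write_text(original)
    target.write_text(merge_properties(original, SAMPLER_TIMEOUTS))
```

Calling update_host_properties("/opt/SecureSpan/Controller/etc/host.properties") from the privileged shell would apply the values above; the Gateway appliance still needs to be restarted afterwards, as in step 3.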

Environment

Release:
Component: APIESM

Attachments

1558722721328000042891_sktwi1f5rjvs16wk6.jpeg