The CA Enterprise Service Manager (ESM) can monitor one or more CA API Gateway appliances and clusters. Monitoring covers the system operating status as well as resource utilization levels such as storage, CPU, and memory. Under certain circumstances, ESM may be unable to poll the operating status of a Gateway node. This article describes the steps for resolving a situation in which an ESM implementation cannot monitor the operating status of a Gateway node in a monitored cluster. The inability to monitor the running state, while anomalous, does not negatively impact the availability of the Gateway. In most circumstances the Gateway can still process message traffic.
The following log entry may appear in the ESM logs when the behavior is being exhibited:
com.l7tech.server.processcontroller.monitoring.MonitoringKernelImpl: NODE.operatingStatus value UNKNOWN is out of tolerance (not equal to RUNNING)
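To check how often this entry occurs, the log can be scanned for the message text shown above. The following is a minimal sketch, not a CA-provided tool: the function name find_status_alerts is hypothetical, and the log file path must be supplied by the reader because this article does not state where ESM writes its logs.

```python
import re
import sys

# Message fragment taken from the log entry quoted in this article.
PATTERN = re.compile(r"NODE\.operatingStatus value (\w+) is out of tolerance")


def find_status_alerts(lines):
    """Return (line_number, reported_status) for each out-of-tolerance entry."""
    hits = []
    for number, line in enumerate(lines, start=1):
        match = PATTERN.search(line)
        if match:
            hits.append((number, match.group(1)))
    return hits


if __name__ == "__main__" and len(sys.argv) > 1:
    # Assumption: the first argument is the path to an ESM log file.
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        for number, status in find_status_alerts(handle):
            print(f"line {number}: operatingStatus={status}")
```

Run it with the log file path as the only argument. Many hits clustered around periods of peak traffic would be consistent with the load-related cause described in this article.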
The Monitored Properties dashboard of ESM may display the following status for the Operating Status parameter of a particular Gateway node:
<Please see attached file for image>
This issue occurs when a Gateway node or cluster is under a significant amount of load. A large quantity of sustained message processing traffic may leave the Gateway unable to report its running state to ESM. The UNKNOWN operating state will persist until the sustained traffic has subsided long enough for the Gateway to report its running state to ESM again.
The Gateway cluster may need additional nodes to compensate for unexpected or surging message traffic. Adding nodes spreads the traffic across more of them, leaving each node more reserve resources for processing requests. Increasing the system resources available to each node in the cluster, specifically CPU and memory, would also alleviate the issue by providing more headroom for surges in utilization.
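The effect of adding nodes can be estimated with simple arithmetic. The sketch below is illustrative only and is not a CA sizing formula: it assumes traffic is spread evenly across nodes, and the function name, the per-node capacity, and the target utilization are all hypothetical values the reader would replace with measurements from their own environment.

```python
import math


def nodes_needed(peak_requests_per_sec, per_node_capacity, target_utilization=0.7):
    """Smallest node count that keeps each node at or below the target
    utilization, assuming traffic is distributed evenly across nodes."""
    if per_node_capacity <= 0 or not 0 < target_utilization <= 1:
        raise ValueError("capacity must be positive and utilization in (0, 1]")
    # Capacity each node may use while still keeping some reserve.
    usable_per_node = per_node_capacity * target_utilization
    return max(1, math.ceil(peak_requests_per_sec / usable_per_node))


# Hypothetical figures: a 5000 req/s peak, nodes that saturate at 1000 req/s,
# and a goal of keeping each node at or below 70% of that capacity.
print(nodes_needed(5000, 1000, 0.7))
```

A lower target utilization leaves more reserve on each node for tasks such as status reporting, at the cost of running more nodes.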
If modifying or adding to the infrastructure is not feasible, the impacted Gateway nodes can be configured to wait longer before falsely reporting an unknown operating state. Execute the following procedure from the privileged shell of each impacted node to implement these changes: