Alarm is not triggering for CPU threshold violations on the device

Article ID: 420518


Products

  • DX Unified Infrastructure Management (Nimsoft / UIM)
  • CA Unified Infrastructure Management On-Premise (Nimsoft / UIM)
  • CA Unified Infrastructure Management SaaS (Nimsoft / UIM)

Issue/Introduction

We have observed that an alarm is not triggering for CPU threshold violations on the device `<robot_name>`, despite a clear threshold breach.
 
To troubleshoot this, we have already performed the following steps:

* Executed `flushdns` on both the hub and the device.
* Restarted the ems, data_engine, and nas probes.
* Restarted the Nimbus service and the cdm probe on the device.
 
Upon reviewing the NAS logs, we found the following error message:

_nas.log:
Sep 25 09:18:27:835 [11776] 1 nas: ptNetIpToHost - getaddrinfo failed for Windows Robot Ping-<robot_name> with return value 11001
Sep 25 09:18:27:835 [11776] 2 nas: nsLookup ip:Windows Robot Ping-<robot_name> => 'Windows Robot Ping-<robot_name>' in 0ms

Environment

  • UIM 23.4.3
  • cdm 8.03
  • MCS/alarm policy

Cause

  • The robot (system) appears to have been unavailable at the time of the event: the nas probe could not resolve its name to an IP address.

Resolution

Based on the following entries in _nas.log:

nas: ptNetIpToHost - getaddrinfo failed for Windows Robot Ping-<robot_name> with return value 11001

nas: nsLookup ip:Windows Robot Ping-<robot_name> => 'Windows Robot Ping-<robot_name>' in 0ms

Return value 11001 is the Windows Sockets error WSAHOST_NOT_FOUND ("host not found"). It commonly occurs when the name (`<robot_name>`) cannot be resolved to an IP address. Make sure the name specified in the ping command can be resolved through the local hosts file, DNS queries, or NetBIOS name resolution. If the name cannot be resolved, the ping command returns error code 11001.
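To confirm whether the name resolves from the hub or the robot, a quick check outside of UIM can help. The following is a minimal sketch, not part of the product: the helper name `check_resolution` is hypothetical, and it only relies on Python's standard `socket.getaddrinfo`, which uses the operating system resolver (hosts file, DNS, and so on).

```python
import socket


def check_resolution(name):
    """Try to resolve `name` with the operating system resolver.

    Returns (True, [ip, ...]) when the name resolves, or
    (False, "<error code>: <message>") when it does not.
    On Windows an unknown host is typically reported with code 11001.
    """
    try:
        infos = socket.getaddrinfo(name, None)
    except socket.gaierror as exc:
        # The lookup failed: report the resolver's error code and text.
        return False, f"{exc.errno}: {exc.strerror}"
    # Each entry ends with a sockaddr tuple whose first item is the IP address.
    addresses = sorted({info[4][0] for info in infos})
    return True, addresses
```

Running `check_resolution("<robot_name>")` with the real robot name on the hub should return `True` and at least one address once name resolution is working again; a `False` result points back to the hosts file, DNS, or NetBIOS configuration.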

The system was therefore unavailable during that time frame, so no alarm could be generated by the cdm probe. There may have been an issue with DNS, a network route, a system reboot, or some other network-related anomaly at the time.
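To narrow down the time frame, it can help to list every name resolution failure recorded in the nas log. The sketch below is illustrative only: the function name `find_resolution_failures` is hypothetical, and the pattern is derived solely from the log lines quoted in this article, so other nas versions or log levels may need a different expression.

```python
import re

# Pattern modelled on the nas log lines quoted in this article (an assumption
# about the general log format, not an official specification).
FAILURE = re.compile(
    r"(?P<timestamp>\w{3} +\d{1,2} \d{2}:\d{2}:\d{2}:\d{3}) .*"
    r"getaddrinfo failed for (?P<name>.+?) ?with return value (?P<code>\d+)"
)


def find_resolution_failures(lines):
    """Return (timestamp, name, return_code) for each getaddrinfo failure."""
    failures = []
    for line in lines:
        match = FAILURE.search(line)
        if match:
            failures.append(
                (match["timestamp"], match["name"].strip(), int(match["code"]))
            )
    return failures
```

For example, passing an open file handle for the nas log to `find_resolution_failures` yields one tuple per failed lookup, which can then be compared with the time of the missed CPU alarm.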

Final result: alarms were being successfully sent and received to/from the robot.