Disk full alert not generated and alarm notification not sent

Products

DX Unified Infrastructure Management (Nimsoft / UIM)
CA Unified Infrastructure Management On-Premise (Nimsoft / UIM)
CA Unified Infrastructure Management SaaS (Nimsoft / UIM)

Issue/Introduction

Today we faced a problem where a Disk Full alert was not generated and the notification was not sent.

We have configured the Cluster probe with a disk monitoring profile, but UIM did not report a disk-full condition even though monitoring was enabled on the disks.

Can you please help us to verify the configuration of the probe?

We also see unusual Cluster probe messages on both servers of the cluster:

Feb 29 13:59:58:516 [5352] cluster: (ConnectAllNodes) - node=<Server1> failed: permission denied
Feb 29 14:00:22:532 [12624] cluster: nimVerifyLogin request verify_login failed ##.##.#.##/48000 (permission denied)
Feb 29 14:00:22:532 [12624] cluster: verify login - cmd=set_node_up frm=##.##.#.###/41123 failed
Feb 29 14:00:58:860 [3204] cluster: ConnectToNode - connect to node=<Server1> failed: permission denied
Feb 29 14:00:58:860 [3204] cluster: (ConnectAllNodes) - node=<Server1> failed: permission denied
Feb 29 14:01:22:876 [12624] cluster: nimVerifyLogin request verify_login failed ##.##.#.##/48000 (permission denied)
Feb 29 14:01:22:876 [12624] cluster: verify login - cmd=set_node_up frm=##.##.#.###/41155 failed
Feb 29 14:01:58:157 [9688] cluster: ConnectToNode - connect to node=<Server1> failed: permission denied
Feb 29 14:01:58:157 [9688] cluster: (ConnectAllNodes) - node=<Server1> failed: permission denied

Feb 29 14:01:58:161 [16240] cluster: nimVerifyLogin request verify_login failed ##.##.#.###/48000 (permission denied)
Feb 29 14:01:58:161 [16240] cluster: verify login - cmd=set_node_up frm=##.##.#.##/56497 failed
Feb 29 14:02:22:178 [0744] cluster: ConnectToNode - connect to node=<Server2> failed: permission denied
Feb 29 14:02:22:178 [0744] cluster: (ConnectAllNodes) - node=<Server2> failed: permission denied
Feb 29 14:02:58:429 [16240] cluster: nimVerifyLogin request verify_login failed ##.##.#.###/48000 (permission denied)
Feb 29 14:02:58:429 [16240] cluster: verify login - cmd=set_node_up frm=##.##.#.##/56510 failed
Feb 29 14:03:22:449 [16372] cluster: ConnectToNode - connect to node=<Server2> failed: permission denied
Feb 29 14:03:22:449 [16372] cluster: (ConnectAllNodes) - node=<Server2> failed: permission denied

Environment

DX UIM 20.4 CU6

Cause

Unknown

Resolution

This is a common problem or question from customers.

The robot cannot write to the local file system when disk space is low or the disk is full. This is not specific to UIM, and the behavior is widely documented online.

This normally happens when a disk fills up rapidly, for example because of a core or crash file. Unless the cdm probe's interval check happens to run while some disk space is still available, the disk will already be full by the time the check runs, and the cdm probe will be unable to write the alarm to the log and the queue.

Not being able to write to the file system prevents the alarm from being generated and sent. This can happen with any monitoring solution on any OS.

Solutions or workarounds to this scenario may include one or more of the following:

One preventive measure is to install and keep the robot on a separate drive or file system from the application that could fill the storage. This is not always possible, because only a single drive or file system may be available.

Proactive monitoring that includes baselining and lower-value thresholds, with alarms of increasing severity raised at stepped percentage levels as the disk fills.

A dedicated response (manual intervention) to an earlier alarm by the systems administration team.

Automatic remediation via a nas Auto-Operator (AO) profile that runs scripts or commands to temporarily free up space on the affected drive or file system and sends a critical alarm via email and text message.

If the application on the same system is considered 'critical,' then that application should be on its own file system and separate from the UIM robot.

If the application on the same system is less critical, the robot still won't be able to report to its hub when the disk fills, but the hub will raise a robot inactive alarm.

Last but not least, please also make sure that there is a mechanism to send local alarms from a given robot under a hub to another upstream hub, e.g., ATTACH and GET queues, or alarm forwarding and replication.
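The stepped-threshold idea listed above can be illustrated with a minimal Python sketch. This is not the cdm probe's implementation or configuration format. The threshold percentages, the severity labels, and the function names `severity_for_usage` and `percent_used` are assumptions chosen purely for illustration. The point is that an alarm is raised well before the disk reaches 100 %, while the robot can still write it.

```python
import shutil
from typing import Optional

# Hypothetical stepped thresholds: (minimum percent used, severity label).
# Ordered from most to least severe so the first match wins.
# These values are illustrative only, not cdm defaults.
STEPS = [
    (95.0, "critical"),
    (90.0, "major"),
    (80.0, "minor"),
    (70.0, "warning"),
]


def severity_for_usage(percent: float) -> Optional[str]:
    """Return the severity for a given disk usage percentage, or None."""
    for threshold, severity in STEPS:
        if percent >= threshold:
            return severity
    return None


def percent_used(path: str) -> float:
    """Return the percentage of space used on the file system holding path."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return 100.0 * usage.used / usage.total
```

Calling `severity_for_usage(percent_used("."))` would classify the current drive. With steps like these, a disk that fills gradually raises a warning at 70 % used, long before writes start to fail. A disk that fills within a single check interval can still skip every step, which is why the other measures in this list remain relevant.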

For <example_hostname> I can see:

2025-02-08 13:19:04.0 1 MS95158741-38851 2025-02-08 13:19:04.0 5 critical Disk Free (%) on K:\ for <example_hostname> is at 0.00 %. It has violated threshold of 5.00 percent Disk 1.1.1.1 <example_hostname1> <example_hostname>.example.net cdm <example_hostname> xxxxx xxxxx xxxxx xxxxx |xxxxx| |XXX|ZZZ|PROD|MSSQL| 0 18000 0 as#standard.alarm.format.tot 432 <removed_encrypted_string>

However, there was no trace of any cdm configuration entries or alarm rules/messages for <example_hostname2>.

Please also refer to the information in this KB Article:
UIM Robot Unable to Spool Message

Please also follow the KB Article/guide to configure your cluster probe monitoring.
How to install and configure UIM probes for cluster monitoring
cluster IM configuration

Please deploy the latest GA version of the probe, currently v3.72, which includes all fixes, including security vulnerability fixes.
Cluster Release Notes

Check the NAS_TRANSACTION_LOG table for the alarm(s) you did not receive the first time the disk was full. The alarm may have been handled by a pre-processing script or rule, or closed by a nas AO profile.

SELECT * FROM NAS_TRANSACTION_LOG WHERE message LIKE '%<example_alarm_msg_sub_string>%';

The issue (Disk Full alert not generated and notification not sent) did not recur and is not reproducible.