Cluster disks stop monitoring occasionally


Article ID: 125250


Products

DX Unified Infrastructure Management (Nimsoft / UIM)

Issue/Introduction

Monitoring of Windows cluster disks stops on occasion, e.g., after patching or a reboot. We typically don't notice quickly, so it has been hard to quantify. When I reviewed all of the robots in our environment running the cluster probe several months ago, I found that almost half of them had had some or all of their disk monitoring turned off at some point in the past. (QoS data for those disks was in the database, so they had been monitored previously.) I've been watching our clusters closely for the past several months to learn more about this issue, and I've seen it recur a few times. For example, the Q: drive was being monitored yesterday but is no longer monitored today. The alarm history indicates that the cluster failed over to the B node for a short time and then failed back to the A node.

Environment

Release: UIM 20.4* / 23.4*, UIM v8.5.1
Component: Cluster probe and CDM probes

Cause

- cdm probe configuration: fixed_default is set to active = yes on the cluster nodes

Resolution

It is not recommended to have fixed_default set to active = yes in the cdm probes on the cluster nodes. 

The cluster probe is responsible for updating the cdm.cfg file so that monitoring of the shared disks follows them between nodes.

If possible, test whether the issue is still reproducible with active = no in the <fixed_default> section of the cdm probes.

<disk>
   <fixed_default>
      active = no
   </fixed_default>
</disk>

From the Known Issues section of the cdm probe release notes: 

"When running this probe in a clustered environment, setting the flag /disk/fixed_default/active to yes, causes problems with the disks that appear and disappear with the resource groups. This flag is only available through raw configuration or by directly modifying the cdm.cfg file."
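
To audit several cluster nodes for this setting, the check can be scripted. The following is a minimal sketch, not a supported tool. It assumes cdm.cfg uses nested <section> ... </section> blocks with key = value lines, as in the snippet above. The function names read_cfg_value and fixed_default_is_active are hypothetical helpers written for this example and are not part of UIM.

```python
def read_cfg_value(cfg_text, path):
    """Return the value stored at a slash-separated path such as
    '/disk/fixed_default/active', or None if the key is not present.

    Assumption: sections open with <name> and close with </name>,
    and keys are written as 'key = value' on their own line.
    """
    *sections, key = path.strip("/").split("/")
    stack = []  # names of the sections currently open
    for raw in cfg_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("</"):
            # Closing tag: leave the innermost open section.
            if stack:
                stack.pop()
        elif line.startswith("<") and ">" in line:
            # Opening tag: enter a new section.
            stack.append(line[1:line.index(">")].strip())
        elif "=" in line and stack == sections:
            name, _, value = line.partition("=")
            if name.strip() == key:
                return value.strip()
    return None


def fixed_default_is_active(cfg_text):
    """True when /disk/fixed_default/active is set to yes."""
    value = read_cfg_value(cfg_text, "/disk/fixed_default/active")
    return value is not None and value.lower() == "yes"


# Illustrative configuration text, not taken from a real robot.
SAMPLE = """
<disk>
   <fixed_default>
      active = yes
   </fixed_default>
</disk>
"""

if __name__ == "__main__":
    if fixed_default_is_active(SAMPLE):
        print("fixed_default is active; set active = no on cluster nodes")
    else:
        print("fixed_default is not active")
```

To use it against a real node, read the contents of that robot's cdm.cfg into a string and pass it to fixed_default_is_active. The sketch only reads the file; the change itself is still made through raw configuration or by editing cdm.cfg, as the release note describes.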