Windows cluster disks occasionally stop being monitored, e.g., after patching/reboot. We typically don't notice quickly, so the problem has been hard to quantify. When I looked through all of the robots in our environment running the cluster probe several months ago, I found that almost half of them had some or all of their disk monitoring turned off at some point in the past. (There was QoS in the database, so monitoring of those disks had been active previously.)

I've been watching our clusters closely for the past several months to learn more about this issue, and I've seen it repeat a few times. For example, the Q: drive was being monitored yesterday but is no longer monitored today. The alarm history indicates that the cluster failed over to the B node for a short time and then failed back to the A node.
- cdm probe configuration
- UIM v8.5.1
- cdm v6.30
- cluster v3.50
- database servers - VMs running Windows 2008 R2 Enterprise
- all clusters have only two nodes and run MS SQL Server 2008 R2 (SP2) Enterprise
It is not recommended to have fixed_default set to active = yes in the cdm probes on the cluster nodes.
The cluster probe is responsible for changing the cdm.cfg file to move the shared disk monitoring.
If possible, test whether the issue is still reproducible after setting the following in the fixed_default section of the cdm probes:

active = no
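For reference, the relevant portion of cdm.cfg (as edited through Raw Configure) would look roughly like the fragment below. This is a sketch of the section layout only; the other keys normally present in the file are omitted:

```
<disk>
   <fixed_default>
      active = no
   </fixed_default>
</disk>
```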
From the Known Issues section of the cdm probe release notes:
"When running this probe in a clustered environment, setting the flag /disk/fixed_default/active to yes causes problems with the disks that appear and disappear with the resource groups. This flag is only available through raw configuration or by directly modifying the cdm.cfg file."
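Since the flag lives only in cdm.cfg, one way to audit a batch of collected cdm.cfg files is to parse the probe's simple `<section>` / `key = value` layout and report the /disk/fixed_default/active value on each node. The sketch below is illustrative, not official tooling: the function names are my own, and the parser ignores comments and quoting edge cases:

```python
def parse_cfg(text):
    """Parse the <section> ... </section> key = value format used by
    UIM probe .cfg files into nested dicts (simplified sketch)."""
    root = {}
    stack = [root]
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("</"):
            # Close the current section.
            stack.pop()
        elif line.startswith("<"):
            # Open a new nested section.
            section = {}
            stack[-1][line.strip("<>")] = section
            stack.append(section)
        elif "=" in line:
            key, _, value = line.partition("=")
            stack[-1][key.strip()] = value.strip()
    return root

def fixed_default_active(cfg_text):
    """Return the /disk/fixed_default/active value, or None if unset."""
    cfg = parse_cfg(cfg_text)
    return cfg.get("disk", {}).get("fixed_default", {}).get("active")

sample = """
<disk>
   <fixed_default>
      active = yes
   </fixed_default>
</disk>
"""
print(fixed_default_active(sample))  # -> yes
```

A value of "yes" (or None on very old configs where the default applied) on a cluster node is the condition the release note warns about; those nodes are the ones to flip to active = no and then re-check after the next failover.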