We noticed on Friday that all of our servers had the CPU and Memory thresholds unchecked in their CDM probes.
The MCS profiles for the servers were set up correctly, though, with the CPU threshold at 95% and the memory thresholds at 90%. The fix is easy: reapply the MCS profile at the parent level and the thresholds are restored. I can't find a cause for why this happened, though, and that is very troubling. We only noticed the issue because someone asked about a threshold setting. Digging deeper, we saw that every server had this problem.
We have the main MCS profiles at the robot group level (AIX robots, Linux robots, Windows robots), and we also have more specific groups of just a few servers with a higher-priority MCS profile. All of the profiles were affected across the board, in both our prod and non-prod environments.
Release: 20.4
The profile-level configurations were undisturbed, but on the cdm probe the metric thresholds were unexpectedly unchecked on all devices in UIM.
We checked the ssrv2audittrail table, but it shows no traces of changes from the MCS profile.
We also checked the Windows events from the date of the incident, but all we found was a hub.exe crash the day before, prior to when the thresholds were supposedly unchecked, so that crash can likely be discounted.
Manual changes were ruled out as the cause: with hundreds of devices in the environment, unchecking the thresholds by hand on every one of them is not plausible.
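Because hundreds of devices are involved, checking each cdm configuration by hand is impractical. As a minimal sketch only, the Python below parses configuration text in the nested <section> / key = value style and lists every threshold section that is not active. The section names (cpu, memory, alarm, error), the active and threshold keys, and the sample text are assumptions made for illustration, not the confirmed layout of a real cdm configuration, and how the text is collected from each robot is left out.

```python
import re


def parse_cfg(text):
    """Parse nested <section> ... </section> text with key = value lines into dicts."""
    root = {}
    stack = [root]
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if re.fullmatch(r"</(.+)>", line):
            # Closing tag: step back up to the parent section.
            if len(stack) > 1:
                stack.pop()
            continue
        opened = re.fullmatch(r"<([^/].*)>", line)
        if opened:
            # Opening tag: create a child section and descend into it.
            child = {}
            stack[-1][opened.group(1)] = child
            stack.append(child)
            continue
        if "=" in line:
            key, _, value = line.partition("=")
            stack[-1][key.strip()] = value.strip()
    return root


def inactive_thresholds(section, path=()):
    """Return paths of sections that define a threshold but are not active."""
    found = []
    active = section.get("active")
    is_active = isinstance(active, str) and active.lower() == "yes"
    if isinstance(section.get("threshold"), str) and not is_active:
        found.append("/".join(path))
    for name, child in section.items():
        if isinstance(child, dict):
            found.extend(inactive_thresholds(child, path + (name,)))
    return found


# Hypothetical sample text; key and section names are assumptions.
SAMPLE = """
<cpu>
   <alarm>
      <error>
         active = no
         threshold = 95
      </error>
   </alarm>
</cpu>
<memory>
   <alarm>
      <error>
         active = yes
         threshold = 90
      </error>
   </alarm>
</memory>
"""

if __name__ == "__main__":
    print(inactive_thresholds(parse_cfg(SAMPLE)))
```

Run against the configuration text from each robot, a check along these lines would flag devices whose thresholds are switched off even though the MCS profile still defines them.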
We also checked the mcs recon probe, but it is not enabled, so we still do not know why this happened to the cdm probes.
We suggested increasing the Java memory of the mon_config_service probe, since its Java min and max were set to only 1024 and 2048 respectively. We set them to 2048 and 4096 and cold-started the probe.
The MCS profile configuration for the cdm thresholds remained the same, but the cdm GUI showed the configured thresholds as greyed out/empty. Furthermore, only a single cdm CPU/memory alarm occurred on the given date, which is not normal for the environment.
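The drop to a single alarm in a day was only noticed after the fact. As a rough sketch, a daily check like the one below could flag unusually quiet days sooner. The function name, the 10% cutoff, and the example counts are all hypothetical, and the daily counts would have to come from whatever alarm export is available in the environment.

```python
from statistics import median


def quiet_days(daily_counts, fraction=0.1):
    """Return the days whose alarm count is below `fraction` of the median count."""
    if not daily_counts:
        return []
    typical = median(daily_counts.values())
    return sorted(day for day, n in daily_counts.items() if n < fraction * typical)


# Made-up counts for illustration only: one day with a single alarm.
EXAMPLE_COUNTS = {"day-1": 40, "day-2": 38, "day-3": 1, "day-4": 42, "day-5": 45}

if __name__ == "__main__":
    print(quiet_days(EXAMPLE_COUNTS))
```

The median is used rather than the mean so that one abnormal day does not drag the baseline down and hide itself.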
To work around the issue, the customer reapplied the MCS profile at the parent level. The customer made a change to one of the cdm MCS profiles, Setup cdm, and saved it, after which the issue went away (the cdm GUI displayed the thresholds again and alarms were generated again as they normally are each day).
We set the cdm loglevel on one of the machines to 5 and logsize to 500000 for now, in case the issue occurs again.
An exhaustive search of KB articles, cases, and the Broadcom DX UIM community posts did not turn up any similar issues.
The customer will continue to monitor the environment for some time, and if the issue happens again we will check the cdm and mon_config_service logs. The root cause remains unknown, and this appears to have been a rare glitch.