DX UIM OpenShift Metrics and Alerting Discrepancies

Products

DX Unified Infrastructure Management (Nimsoft / UIM)

Issue/Introduction

Users may report that alerts for OpenShift metrics (e.g., QOS_OPENSHIFT_NODE_CPU_USAGE_PERCENTAGE, QOS_OPENSHIFT_NODE_MEMORY_PRESSURE) are not triggering in DX Unified Infrastructure Management (UIM) even when manual CLI checks (like oc describe node) suggest a threshold breach.

Specifically, the following Quality of Service (QOS) metrics may appear inconsistent:

QOS_OPENSHIFT_NODE_SPEC_UNSCHEDULABLE
QOS_OPENSHIFT_NODE_NETWORK_UNAVAILABLE
QOS_OPENSHIFT_NODE_MEMORY_PRESSURE
QOS_OPENSHIFT_NODE_CPU_USAGE_PERCENTAGE

Environment

Product: DX Unified Infrastructure Management (UIM)
Probe: openshift
Versions: 2.01, 3.00

Cause

The discrepancy typically arises because the openshift probe's cluster-info pod calculates percentages differently from the standard OpenShift CLI tools.

Metric Collection and Logic

The cluster-info pod running in the cluster is responsible for collecting CPU and memory usage metrics, which the OpenShift probe receives via APIs.
CPU and Memory Requests: If you compare the CPU and Memory Request values between the collected QoS data and the output of oc describe node <node_name>, they will match for any node.
Limit Discrepancies: Limit values may not match because they are calculated differently in the pod. For example, if a limit is not defined for a specific pod, the Node limit is used as the default. This often results in different percentage calculations for the same node.
Percentage Calculations: The percentages shown in a node description are ratios based on Requests and Limits. The cluster-info pod, however, calculates percentages from actual usage, which is aggregated directly from containerinfo running in the openshift probe daemonset app-container-monitor.
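The effect of the points above can be illustrated with a minimal sketch. It is not the probe's actual implementation: the node capacity, the pod figures, and the helper names (effective_limit, percent) are all hypothetical and chosen only to show why a request-based percentage and a usage-based percentage differ for the same node, and how defaulting a missing pod limit to the node limit changes the limit total.

```python
from typing import Optional

NODE_CPU_MILLICORES = 4000  # hypothetical node CPU capacity

# Hypothetical pods: (request, limit, actual usage) in millicores.
# A limit of None means the pod defines no limit.
PODS = [
    (500, 1000, 300),
    (1000, None, 1600),
]


def effective_limit(limit: Optional[int], node_limit: int) -> int:
    """Fall back to the node limit when a pod defines no limit (assumed rule)."""
    return node_limit if limit is None else limit


def percent(value: float, total: float) -> float:
    """Return value as a percentage of total, rounded to two decimals."""
    return round(100.0 * value / total, 2)


requests = sum(req for req, _, _ in PODS)
usage = sum(used for _, _, used in PODS)

# Limit totals with and without the node-limit default.
declared_limits = sum(lim for _, lim, _ in PODS if lim is not None)
defaulted_limits = sum(effective_limit(lim, NODE_CPU_MILLICORES) for _, lim, _ in PODS)

request_pct = percent(requests, NODE_CPU_MILLICORES)  # based on what pods reserve
usage_pct = percent(usage, NODE_CPU_MILLICORES)       # based on what pods consume

print(f"request-based: {request_pct}%  usage-based: {usage_pct}%")
print(f"limits declared: {declared_limits}m  with node default: {defaulted_limits}m")
```

With these sample figures the request-based value is 37.5% while the usage-based value is 47.5%, so a threshold of 40% would appear breached by one calculation and not by the other.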

Why Values Differ

You may see differences in rounded percentages because of these node limit differences.
Because of these variations in how Limits and usage are calculated, the percentage values will differ from those seen in the top command or standard OpenShift descriptions.

Resolution

1. Verify Metric Calculation Source

Confirm that the threshold is based on actual usage rather than Request/Limit ratios. If you require alerts based on Request/Limit ratios, ensure the specific QOS for those attributes is being monitored.

2. Validate Threshold Configuration

Ensure the thresholds are set correctly for the metric type:

Unschedulable: Normal is 0. Set threshold to > 0 to alert when the node becomes unschedulable.
Network Unavailable: Normal is 0. Set threshold to 1 (or > 0) to alert on network issues.
Memory Pressure: This is a boolean state in OpenShift; ensure the threshold is set to alert when the value is 1.
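The threshold rules above can be expressed as a small sketch. The THRESHOLDS table and the should_alert helper are hypothetical and only model the comparison logic; they are not part of the probe or of any UIM API.

```python
import operator

# Hypothetical threshold table: metric name -> (comparison, threshold value).
# All three metrics report 0 in the normal state and 1 when the condition is true.
THRESHOLDS = {
    "QOS_OPENSHIFT_NODE_SPEC_UNSCHEDULABLE": (operator.gt, 0),
    "QOS_OPENSHIFT_NODE_NETWORK_UNAVAILABLE": (operator.gt, 0),
    "QOS_OPENSHIFT_NODE_MEMORY_PRESSURE": (operator.ge, 1),
}


def should_alert(metric: str, value: float) -> bool:
    """Return True when the sample value breaches the metric's threshold."""
    compare, threshold = THRESHOLDS[metric]
    return compare(value, threshold)


for name in THRESHOLDS:
    print(name, "normal:", should_alert(name, 0), "breached:", should_alert(name, 1))
```

A threshold written the wrong way round for a 0/1 metric (for example, greater than 1) can never fire, which is a common reason these alerts stay silent.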

3. Analyze Raw Data with PRD

To verify what the probe is actually recording:

Open the Performance Report Designer (PRD).
Select the affected node and the specific QOS metric.
Set the Aggregation to Raw.
Enable Show aggregate values.
Review the data points to see whether the recorded value actually reached or exceeded the threshold (e.g., 90.0 for CPU usage) during the expected alert window.
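If the raw data points are copied out of PRD, the same review can be done programmatically. The sketch below assumes the samples are available as (timestamp, value) pairs; the sample values, the time window, and the breaches helper are hypothetical and serve only to illustrate the check.

```python
from datetime import datetime


def breaches(samples, threshold, start, end):
    """Return the samples inside [start, end] whose value meets or exceeds threshold."""
    return [(ts, value) for ts, value in samples
            if start <= ts <= end and value >= threshold]


# Hypothetical raw CPU usage samples, in percent.
SAMPLES = [
    (datetime(2024, 1, 1, 10, 0), 72.4),
    (datetime(2024, 1, 1, 10, 5), 89.9),
    (datetime(2024, 1, 1, 10, 10), 91.2),
    (datetime(2024, 1, 1, 10, 15), 64.0),
]

hits = breaches(SAMPLES, 90.0,
                datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 15))

for ts, value in hits:
    print(f"{ts.isoformat()} value={value}")
```

In this sample only the 10:10 reading qualifies. The 89.9 reading is close to the threshold but does not breach it, which is the kind of near miss that explains an alert that was expected but never raised.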

4. Update Monitoring Image (If Metrics are 0.00)

If metrics consistently report 0.00 despite known activity, update the monitoring image version in your values.yaml to a release that includes the latest metric processing fixes.

Configure Openshift Monitoring Best Practices | Known issues | Requirements

Additional Information

OpenShift Metrics