VCF management services refers to the collection of management components running on the VCF services runtime in the VCF stack. This dashboard monitors the health of the services runtime to address the following use cases:
Are there any backup failures? (Backup Failures)
This indicates the average CPU and memory usage across all nodes in the VCF services runtime. Thresholds for overall memory, node memory, and disk utilization are set as follows:
<= 80%: Good
80-90%: Warning
90-95%: Severe
> 95%: Critical
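The severity bands listed above can be expressed as a small lookup. The following is a minimal sketch, not part of the product: the function name `classify_utilization` is hypothetical, and because the listed ranges share their endpoints, it assumes each boundary value (90 and 95) belongs to the lower band.

```python
def classify_utilization(percent: float) -> str:
    """Map a utilization percentage to a severity band.

    Assumption: shared boundary values fall into the lower band,
    so 90 is "Warning" and 95 is "Severe".
    """
    if not 0 <= percent <= 100:
        raise ValueError(f"utilization must be between 0 and 100, got {percent}")
    if percent <= 80:
        return "Good"
    if percent <= 90:
        return "Warning"
    if percent <= 95:
        return "Severe"
    return "Critical"


if __name__ == "__main__":
    for value in (42.0, 85.0, 93.5, 97.0):
        print(f"{value:5.1f}% -> {classify_utilization(value)}")
```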
Problem: The overall utilization of the VCF services runtime consistently exceeds predefined operational thresholds, which could impact the components' functionality.
Potential Causes:
Recommended action:
Component status is considered "unhealthy" if any replica within the deployment is down. In an HA deployment, this often does not impact overall functionality. This status is updated every 15 minutes.
In most cases, replicas auto-recover within this timeframe, which clears the temporary unhealthy status. It is recommended to observe the operational state of components over a period of 30 minutes.
Problem: A component is not functioning properly and displays an unhealthy status for more than 30 minutes.
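The health rule and the 30-minute observation period can be sketched as follows. This is an illustration under stated assumptions, not the product's implementation: the function names are hypothetical, and the sketch assumes that covering 30 minutes at one status update every 15 minutes requires three consecutive unhealthy samples.

```python
from typing import Sequence

STATUS_INTERVAL_MINUTES = 15     # status update interval stated in the text
OBSERVATION_WINDOW_MINUTES = 30  # recommended observation period


def component_is_unhealthy(replicas_up: Sequence[bool]) -> bool:
    """A component is unhealthy if any replica in the deployment is down."""
    return not all(replicas_up)


def needs_investigation(unhealthy_history: Sequence[bool]) -> bool:
    """Return True if the newest samples were unhealthy for the whole window.

    unhealthy_history holds one flag per status update, oldest first.
    Assumption: samples at t, t+15 and t+30 minutes span the 30-minute window.
    """
    samples_needed = OBSERVATION_WINDOW_MINUTES // STATUS_INTERVAL_MINUTES + 1
    recent = unhealthy_history[-samples_needed:]
    return len(recent) == samples_needed and all(recent)


if __name__ == "__main__":
    print(component_is_unhealthy([True, False, True]))
    print(needs_investigation([False, True, True, True]))
```

A single unhealthy sample is treated as transient here, which mirrors the note that replicas usually auto-recover within one update interval.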
Potential Causes:
Recommended action:
The peak memory utilization represents the highest memory usage recorded by a component, while the average memory utilization reflects the component's typical or overall memory consumption.
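To make the distinction concrete, the sketch below computes both figures from a series of memory utilization samples. The function name `summarize_memory` and the sample values are hypothetical and used only for illustration.

```python
from statistics import mean
from typing import Dict, Sequence


def summarize_memory(samples: Sequence[float]) -> Dict[str, float]:
    """Return the peak (highest) and average memory utilization of the samples."""
    if not samples:
        raise ValueError("at least one sample is required")
    return {"peak": max(samples), "average": mean(samples)}


if __name__ == "__main__":
    # Hypothetical utilization samples, in percent.
    print(summarize_memory([40.0, 55.0, 91.0, 62.0]))
```

A large gap between the two values suggests short spikes, while a peak and average that are both high suggest sustained load.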
Problem: A component is consistently experiencing high memory utilization (spiking and remaining high) because it is processing a workload that exceeds its currently configured capacity or threshold.
Potential Causes:
A single component is consistently processing a workload that exceeds the capacity it is currently configured for, causing its memory utilization to spike and remain high. This often indicates that the component is reaching the threshold for its dedicated function.
Recommended action:
This indicates the disk space utilization for storage volumes used by components. A volume group comprises multiple storage volumes for redundancy and distribution of data.
Problem: Disk space utilization for a storage volume is nearing or has reached its capacity, impacting the performance of the component.
Potential Causes:
The storage volume allocated to a particular component is nearing its capacity due to the accumulated size of the component's data (e.g., retained logs, historical metrics, binaries). If the volume reaches capacity, it could impact the health of the component.
Recommended action:
Component traffic passes through the gateway services. This section indicates the health and memory utilization of the gateway instances. Logs management has its own dedicated gateway services.
Problem:
Potential Causes:
Recommended action:
Backup Failures
Problem:
The scheduled or on-demand backup operation for the management services components has failed to complete successfully.
Potential causes:
Recommended Action:
Nodes Health
Management services run on multiple nodes. This section indicates the health and utilization of the services runtime nodes.
Problem: The nodes could be in an unhealthy state, causing health or performance issues for some components or preventing instances of other components from running.
Potential causes:
Multiple conditions can result in a node being unhealthy: extreme memory pressure, excessive disk pressure, persistent or frequent network connectivity problems, or a non-functional core OS service.
Recommended Action:
Start troubleshooting node health problems by reviewing the VCF Operations node status metrics for the nodes with health problems. Search in the VCF Operations inventory by node name and select the VCF Management Service Node object, not the Virtual Machine Object. Review the following metrics in the Status group
If the disk pressure metric reports a 1, the condition is typically caused by the node not being able to process logs quickly enough. If the condition is observed for 15 or more minutes, navigate to the Virtual Machine object for the node and review network and storage latency. Elevated latency makes destaging take longer. If these conditions are not present, check whether the datastore on which the virtual machine is running is impacted by an APD or PDL condition: use the topology viewer to switch to the object representing the datastore, and check for APD or PDL alerts. Otherwise, use the topology tab to determine the components running on the node, and reduce the load on those components until the problem clears, or add component instances if appropriate.
If kernel deadlock or Kubelet not ready is observed, wait for the Management Services Runtime to replace the node. The node problem detector replaces unhealthy nodes after a period of time. The replacement is triggered after 3 minutes and can take up to 1.5 hours.
If memory pressure is observed and is sustained over an hour or more, and only one node is impacted, reduce the load imposed on the node. If the problem is observed for multiple nodes, scale up the cluster to a larger size, if not already running large. To determine the components running on a node, search for the VCF Management Service Node object with the node name, navigate to it, and use the topology viewer.
If none of the four conditions mentioned above is occurring, search for the virtual machine object with the same name as the node, navigate to the object, and review the networking stats for network errors and dropped packets.
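The node troubleshooting guidance above amounts to a decision flow, which can be sketched as follows. This is a minimal illustration, not a product API: the `NodeStatus` class, its field names, the `triage` function, and the action strings are all hypothetical. The "keep monitoring" result for a condition that is present but has not yet reached its duration threshold is also an assumption.

```python
from dataclasses import dataclass
from typing import List

DISK = "Review VM network and storage latency, then check the datastore for APD/PDL alerts"
WAIT = "Wait for the runtime to replace the node (triggered after 3 minutes, up to 1.5 hours)"
REDUCE = "Reduce the load imposed on the node"
SCALE_UP = "Scale up the cluster to a larger size, if not already running large"
NETWORK = "Review the VM networking stats for network errors and dropped packets"
MONITOR = "Keep monitoring; no duration threshold reached yet"  # assumption


@dataclass
class NodeStatus:
    """Hypothetical snapshot of the status metrics for one node."""
    disk_pressure_minutes: int = 0    # how long disk pressure has reported 1
    kernel_deadlock: bool = False
    kubelet_not_ready: bool = False
    memory_pressure_minutes: int = 0  # how long memory pressure has persisted


def triage(node: NodeStatus, nodes_with_memory_pressure: int = 1) -> List[str]:
    """Return the recommended actions for a node, in the order of the text."""
    actions: List[str] = []
    if node.disk_pressure_minutes >= 15:
        actions.append(DISK)
    if node.kernel_deadlock or node.kubelet_not_ready:
        actions.append(WAIT)
    if node.memory_pressure_minutes >= 60:
        # One affected node: reduce load. Several affected nodes: scale up.
        actions.append(SCALE_UP if nodes_with_memory_pressure > 1 else REDUCE)
    if actions:
        return actions
    any_condition = node.disk_pressure_minutes > 0 or node.memory_pressure_minutes > 0
    return [MONITOR] if any_condition else [NETWORK]


if __name__ == "__main__":
    print(triage(NodeStatus(disk_pressure_minutes=20, kubelet_not_ready=True)))
```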