VCF Management Services: Health and Diagnostics



Article ID: 417255


Updated On:

Products

VCF Operations

Issue/Introduction

VCF management services refers to the collection of management components running on the VCF services runtime in the VCF stack. This dashboard monitors the health of the services runtime to address the following use cases:

  1. Is the management services runtime experiencing high load? (Management Services Utilization Summary)
  2. Are there any unhealthy nodes in the services runtime? (Nodes Health)
  3. Are the services runtime gateways healthy? (Gateway Health)
  4. Are there any unhealthy replicas in component deployments? (Component Health)
  5. What is the resource utilization of each component?
  6. What is the storage volume utilization of each component? (Component Volume Utilization)
  7. Are there any backup failures? (Backup Failures)

Resolution

Management Services Utilization Summary

This indicates the average CPU and memory usage across all nodes in the VCF services runtime. Thresholds for the overall memory, nodes’ memory and disk utilization are set as follows:

  • <= 80%: Good
  • 80-90%: Warning
  • 90-95%: Severe
  • > 95%: Critical
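The threshold bands above can be sketched as a small classifier. This is only an illustration: the function name `classify_utilization` is hypothetical, and because the article does not say which band the exact boundary values 90% and 95% belong to, the sketch assumes each boundary value falls into the lower band (consistent with "<= 80% is good" and ">95% - Critical").

```python
def classify_utilization(percent):
    """Map a utilization percentage to the severity bands listed above.

    Assumption: a value exactly on a boundary (80, 90, 95) belongs to the
    lower band. The article only states this explicitly for 80 and 95.
    """
    if not 0 <= percent <= 100:
        raise ValueError("utilization must be between 0 and 100")
    if percent <= 80:
        return "Good"
    if percent <= 90:
        return "Warning"
    if percent <= 95:
        return "Severe"
    return "Critical"


# Example: a few sample readings and the band each one falls into.
for reading in (72.0, 85.5, 93.0, 97.2):
    print(reading, classify_utilization(reading))
```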

 

Problem: The overall utilization of the VCF services runtime consistently exceeds the predefined operational thresholds, which could impact component functionality.

  • Threshold Breaches: Utilization levels are in the "Severe" (90-95%) or "Critical" (>95%) range for overall memory, nodes’ CPU, and memory utilization.

Potential Causes:

  • This typically indicates an increase in the components' functional load (e.g., API calls, data processing) that the current cluster cannot handle without additional resources.

Recommended action:

  • Scale the VCF services runtime to the next size through VCF Operations > Build > Lifecycle > VCF services runtime > Actions.
    • This adds more resources (CPU/memory) to the cluster to accommodate the increased load.

 

Components’ Health

Component status is considered "unhealthy" if any replica within the deployment is down. In case of HA deployment, this often does not impact overall functionality. This status is updated every 15 minutes.

In most cases, replicas auto-recover within this timeframe, which clears the temporary unhealthy status. It is recommended to observe the operational state of components for 30 minutes.
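The 30-minute observation window can be sketched as a check over the periodic status samples. This is a minimal illustration, not the dashboard's own logic: the sample format (a list of timestamp and health-flag pairs) and the function names are hypothetical, and the sketch treats a continuous unhealthy run of 30 minutes or longer as needing investigation.

```python
from datetime import datetime, timedelta


def unhealthy_duration(samples):
    """Return how long the component has been continuously unhealthy.

    `samples` is a list of (timestamp, is_healthy) tuples sorted from
    oldest to newest (hypothetical format). The duration is measured from
    the first sample of the most recent unhealthy run to the latest sample.
    """
    if not samples or samples[-1][1]:
        return timedelta(0)
    latest = samples[-1][0]
    run_start = latest
    for timestamp, is_healthy in reversed(samples):
        if is_healthy:
            break
        run_start = timestamp
    return latest - run_start


def needs_investigation(samples, window=timedelta(minutes=30)):
    """True when the latest unhealthy run spans at least `window`."""
    return unhealthy_duration(samples) >= window


# Example with status samples taken every 15 minutes.
start = datetime(2025, 1, 1, 12, 0)
history = [
    (start, True),
    (start + timedelta(minutes=15), False),
    (start + timedelta(minutes=30), False),
    (start + timedelta(minutes=45), False),
]
print(needs_investigation(history))
```

With a 15-minute update interval, a single unhealthy sample is not enough to trigger the check; the status has to stay unhealthy across consecutive samples.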

Problem: The component is not functioning properly and has displayed an unhealthy status for more than 30 minutes.

Potential Causes:

  • Operations: A patch, upgrade, or other maintenance procedure such as a scale operation is currently in progress. During these operations, components may be temporarily restarted, leading to a "down" status.
  • Infrastructure: An infrastructure issue is occurring, such as an ESXi host or VM experiencing disk pressure. These issues are visible in the Topology view.
  • Service Instability: If there are no maintenance operations and the component status remains continuously down for 30 minutes, it suggests an unexpected failure, crash, or deadlock within the service.

Recommended action:

  • Check Maintenance: Ensure there is no patch or upgrade in progress. If maintenance is running, the temporary unavailability is expected.
  • Check Infrastructure Health: Open the Topology for management services from Inventory and resolve any alerts shown there.
  • Restart Services: If the component is continuously down for an extended period, try restarting the affected services via the Lifecycle action to attempt recovery. If the issue persists, proceed to log collection and support ticket creation.

 

Components’ Memory Utilization

The peak memory utilization represents the highest memory usage recorded by a component, while the average memory utilization reflects the component's typical or overall memory consumption.

Problem: A component is consistently experiencing high memory utilization (spiking and remaining high) because it is processing a workload that exceeds its current configured capacity/threshold.

Potential Causes:

A single component is consistently processing a workload that exceeds its configured capacity, causing its memory utilization to spike and remain high. This often indicates that the component is reaching the limit for its dedicated function.

Recommended action:

  • If this behavior is consistent (not a temporary peak), resize the component (which will allocate more memory/CPU to the specific component) using the Lifecycle component action.
  • If the component has its own dashboard (e.g., the Logs management dashboard), refer to that component-specific dashboard for memory limits and other remediations.
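The difference between a temporary peak and consistently high memory utilization can be sketched as follows. This is an illustration under stated assumptions: the function name `summarize_memory`, the 90% cutoff, and the "80% of samples" rule for calling usage sustained are all hypothetical choices, not values defined by the dashboard.

```python
from statistics import mean


def summarize_memory(samples_pct, high_pct=90.0, sustained_fraction=0.8):
    """Summarize memory utilization samples (percentages) for one component.

    Returns the peak (highest sample), the average, and whether usage looks
    sustained rather than a temporary spike. The defaults for `high_pct`
    and `sustained_fraction` are illustrative assumptions.
    """
    if not samples_pct:
        raise ValueError("at least one sample is required")
    high_count = sum(1 for sample in samples_pct if sample >= high_pct)
    return {
        "peak": max(samples_pct),
        "average": mean(samples_pct),
        "sustained_high": high_count / len(samples_pct) >= sustained_fraction,
    }


# A single spike: high peak, moderate average, not sustained.
print(summarize_memory([40, 45, 97, 42]))
# Mostly high samples: a candidate for resizing the component.
print(summarize_memory([92, 95, 91, 96, 88]))
```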

 

Components’ Volume Utilization

This indicates the disk space utilization for storage volumes used by components. A volume group comprises multiple storage volumes for redundancy and distribution of data.

Problem: Disk space utilization for a storage volume is nearing or has reached capacity, impacting the performance of the component.

Potential Causes:

The storage volume allocated to a particular component is nearing its capacity due to the accumulated size of the component's data (e.g., retained logs, historical metrics, binaries). If the volume reaches capacity, it could impact the health of the component.

Recommended action:

  • Resize the Storage Volume (increase the disk capacity) for the affected component using the Lifecycle components action.
  • If the component has its own dashboard (e.g., the Logs management dashboard), refer to that component-specific dashboard for storage limits and other remediations.

 

Gateway Services Health

Component traffic passes through the gateway services. This section indicates the health and memory utilization of the gateway instances. Logs management has its own dedicated gateway services.

Problem:

  1. The gateway service is unhealthy, so the components are non-operational or performing poorly.
  2. Slower-than-expected network response or significant delay when interacting with the management services. This is a symptom of a slowdown or performance issue.
  3. High utilization (above the critical level) of the gateway services, resulting in performance issues.

Potential Causes:

  1. An increase in the overall workload being processed by the components. High request volume can lead to queue buildup and delayed processing, manifesting as network latency from the client's perspective.
  2. Issues in the underlying network infrastructure which may be impacting response times.

Recommended action:

  • If the gateway service is down, refer to this KB for restarting the gateway service.
  • If this slowdown is consistent with high utilization, it may be necessary to add additional capacity to handle the increased load. This can be done in one of two ways:
    • Increase the size of the VCF Management Services cluster.
    • Add Virtual IPs (VIPs) to the associated component for additional load balancing of incoming requests across the available cluster nodes.
  • If the slowdown is not related to increased load, confirm that there are no outstanding issues with the virtual or physical networking layer.
  • If the component has its own dashboard (e.g., the Logs management dashboard), refer to that component-specific dashboard for memory limits and other remediations.

 

Backup Failures

Problem:

The scheduled or on-demand backup operation for the management services components has failed to complete successfully.

Potential causes:

  • SFTP Reachability: The configured Secure File Transfer Protocol (SFTP) target server is not reachable (e.g., network connectivity issue, firewall blocking, or the SFTP service is down), or authentication to the SFTP server is failing (e.g., incorrect credentials).
  • Storage Capacity: The SFTP storage location is full, preventing the backup archive from being written.
  • Internal Service Issues: An internal service issue within the management component or the backup process itself caused a failure.
  • Component Health: The component is unhealthy. A backup may not run while the component is unhealthy.

Recommended Action:

 

  • Check the SFTP server: Confirm that the SFTP server is reachable and that there is sufficient disk space available. If the backup failed due to an authentication failure, update the password used by VCF Management Services in the backup settings in VCF Operations > Build > Lifecycle > Backup and Restore.
  • Check component health: A backup may not run if the component is unhealthy. Refer to the components section of this document for remediation.
  • Review Task or Logs: Refer to the task details or Logs for the failed backup job in the management interface to check for immediate error messages.
  • Escalate: If the cause is not readily apparent from task details or logs, raise a support ticket with the collected error information.
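As a first pass at the SFTP reachability check above, a plain TCP connection test can rule out basic network or firewall problems. This is a minimal sketch using only the Python standard library: the function name `sftp_port_reachable` and the host name in the example are hypothetical, and port 22 is assumed as the usual SSH/SFTP port. It does not verify credentials or free disk space on the server.

```python
import socket


def sftp_port_reachable(host, port=22, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened.

    This only confirms network reachability of the SSH/SFTP port. It does
    not test authentication or available storage on the server.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return False


if __name__ == "__main__":
    # "sftp.example.com" is a placeholder; use the configured backup target.
    print(sftp_port_reachable("sftp.example.com"))
```

If the port is reachable but backups still fail, the remaining causes listed above (authentication, storage capacity, component health) are the next things to check.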

 

Nodes Health

Management services run on multiple nodes. This section indicates the health and utilization of the services runtime nodes.

Problem: One or more nodes are in an unhealthy state, causing health or performance issues for some components or preventing instances of other components from running.

Potential causes:

Multiple conditions can result in a node being unhealthy: extreme memory pressure, excessive disk pressure, persistent or frequent network connectivity problems, or a non-functional core OS service.


Recommended Action:

Start troubleshooting node health problems by reviewing the VCF Operations node status metrics for the nodes with health problems. Search in the VCF Operations inventory by node name and select the VCF Management Service Node object, not the Virtual Machine object. Review the following metrics in the Status group:

  • Disk pressure
  • Kernel deadlock
  • Kubelet not ready
  • Memory pressure

If the disk pressure metric is reporting a 1, the condition is typically caused by the node not being able to process logs quickly enough. If the condition is observed for 15 or more minutes, navigate to the Virtual Machine object for the node and review network and storage latency. Elevated latency makes destaging take longer. If latency is not elevated, check whether the datastore on which the virtual machine is running is impacted by an All Paths Down (APD) or Permanent Device Loss (PDL) condition: use the topology viewer to switch to the object representing the datastore, and check for APD or PDL alerts. Otherwise, use the topology tab to determine the components running on the node, and reduce the load on those components until the problem clears, or add component instances if appropriate.

If kernel deadlock or Kubelet not ready is observed, wait for the Management Services Runtime to replace the node. The node problem detector replaces unhealthy nodes after a period of time. The replacement is triggered after 3 minutes and can take up to 1.5 hours.

If memory pressure is observed and is sustained for an hour or more, and only one node is impacted, reduce the load imposed on the node. If the problem is observed on multiple nodes, scale up the cluster to a larger size, if it is not already running at the large size. To determine the components running on a node, search for the VCF Management Service Node object with the node name, navigate to it, and use the topology viewer.

If none of the four conditions above is occurring, search for the virtual machine object with the same name as the node, navigate to the object, and review the networking stats for network errors and dropped packets.
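The triage order described in this section can be summarized as a small decision function. This is a sketch only: the function name `triage_node`, the dictionary keys, and the `multiple_nodes_affected` flag are hypothetical stand-ins for the Status group metrics, not identifiers exposed by VCF Operations, and the time conditions mentioned above (15 minutes for disk pressure, one hour for memory pressure) are left to the operator.

```python
def triage_node(status, multiple_nodes_affected=False):
    """Return the recommended actions for one node's status metrics.

    `status` maps hypothetical metric keys to 0 or 1, mirroring the four
    Status group metrics: disk_pressure, kernel_deadlock,
    kubelet_not_ready, memory_pressure.
    """
    actions = []
    if status.get("disk_pressure"):
        actions.append(
            "Review VM network and storage latency, check the datastore for "
            "APD or PDL alerts, then reduce load or add component instances"
        )
    if status.get("kernel_deadlock") or status.get("kubelet_not_ready"):
        actions.append("Wait for the services runtime to replace the node")
    if status.get("memory_pressure"):
        if multiple_nodes_affected:
            actions.append("Scale up the cluster to a larger size")
        else:
            actions.append("Reduce the load imposed on the node")
    if not actions:
        # None of the four conditions is present.
        actions.append(
            "Review the VM networking stats for errors and dropped packets"
        )
    return actions


# Example: memory pressure reported on a single node.
print(triage_node({"memory_pressure": 1}))
```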