Recurring alerts in vCenter: "vSAN Performance Service is not enabled" and "vSAN Cluster Configuration Consistency"

Products

VMware vSAN

Issue/Introduction

On the vCenter UI events page, the user receives repeated warnings about the vSAN cluster's performance service that read as follows.
- vSAN Health Test 'Checks status of vSAN Performance Service' status changed from 'green' to 'yellow'
- Alarm 'vSAN cluster alarm 'vSAN Cluster Configuration Consistency'' on "vSAN_Cluster_Name" changed from Green to Yellow.

On the next iteration of the vSAN Skyline Health performance service check for the same cluster, the vCenter UI events page shows the following.

- vSAN Health Test 'Checks status of vSAN Performance Service' status changed from 'yellow' to 'green'
- Alarm 'vSAN cluster alarm 'vSAN Cluster Configuration Consistency'' on "vSAN_Cluster_Name" changed from Yellow to Green.
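The yellow-then-green pattern above can be counted from exported event text. The following is a minimal sketch, assuming the event messages are available as plain strings in the exact wording shown in this article; `count_flaps` is a hypothetical helper for illustration and not part of any VMware tool.

```python
import re

# Matches the health-test event wording quoted in this article.
TRANSITION_RE = re.compile(
    r"vSAN Health Test '(?P<test>[^']+)' status changed "
    r"from '(?P<old>\w+)' to '(?P<new>\w+)'"
)


def count_flaps(messages):
    """Count green->yellow->green cycles per health test (hypothetical helper)."""
    warned = set()   # tests currently in a yellow episode
    flaps = {}       # test name -> number of completed cycles
    for msg in messages:
        match = TRANSITION_RE.search(msg)
        if not match:
            continue  # alarm lines and unrelated events are skipped
        test, old, new = match.group("test", "old", "new")
        if old == "green" and new == "yellow":
            warned.add(test)
        elif old == "yellow" and new == "green" and test in warned:
            warned.discard(test)
            flaps[test] = flaps.get(test, 0) + 1
    return flaps
```

A test that shows many such cycles while the service is reported as enabled under Configure > Services matches the behavior described in this article.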

When the user checks vSAN Skyline Health, a warning appears for "vSAN cluster configuration consistency" and "Performance service status". The alert disappears after the next vSAN Skyline Health iteration.
The health alert is also displayed on the vSAN cluster summary page.

When the user reviews the vSAN performance service status for the same cluster in the vCenter UI under vSAN Cluster > Configure > Services, the performance service is shown as enabled, healthy, and compliant.

Environment

VMware vSAN 7.0.x

VMware vSAN 8.0.x

Cause

The health alert is generated because the vSAN health check was unable to locate the vSAN performance stats object during that iteration. The object is nevertheless present: the subsequent health check iteration succeeds, and a manual check under vSAN Cluster > Configure > Services shows that the performance service is enabled, healthy, and compliant, as mentioned in the previous section.

The health check fails to retrieve the performance stats object on some iterations because of the timeout defined in the code for the API call behind this health check.
The vSAN health service encounters a timeout when invoking the "QueryStatsObjectInformation" API, whose timeout is configured as 10 seconds. In the log sample, the thread (766390) executing the "QueryStatsObjectInformation" API call took more than 10 seconds to complete, which caused the API call to fail and generated the health alert. (The thread number may differ depending on the environment.)
The excerpt below is taken from vmware-vsan-health-service.log, which can be found in the "var/log/vmware/vsan-health" folder in the vCenter support bundle.

2024-05-10T22:33:06.524Z ERROR vsan-mgmt[3244250] [VsanHealthThreadMgmt::join opID=noOpId] Not all tasks are finished with timeout 10
Traceback (most recent call last):
  File "xx/xx/xx/xx/xx/xx.py", line 408, in join
  File "/xx/xx/xx/xx/xx.py", line 241, in as_completed
    raise TimeoutError(
concurrent.futures._base.TimeoutError: 4 (of 4) futures unfinished
.

2024-05-10T22:33:37.382Z INFO vsan-mgmt[766390] [VsanVcPerformanceManagerImpl::QueryClusterHealth opID=noOpId] QueryClusterHealth objInfo: (vim.cluster.VsanObjectInformation) {
directoryName = 'unknown'
}
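The traceback above comes from Python's standard `concurrent.futures.as_completed`, which raises `TimeoutError` when the futures do not all finish within the given timeout. The following is a minimal sketch of that mechanism only; `slow_query` and `run_with_deadline` are hypothetical stand-ins and do not reproduce the vSAN health service code or the real API.

```python
import concurrent.futures
import time


def slow_query(delay_s):
    """Hypothetical stand-in for a slow backend call (not the vSAN API)."""
    time.sleep(delay_s)
    return delay_s


def run_with_deadline(delays, timeout_s):
    """Run one task per delay; return (results, error_message_or_None)."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=len(delays))
    futures = [pool.submit(slow_query, d) for d in delays]
    results = []
    error = None
    try:
        # as_completed raises TimeoutError once timeout_s has elapsed
        # while some futures are still pending.
        for fut in concurrent.futures.as_completed(futures, timeout=timeout_s):
            results.append(fut.result())
    except concurrent.futures.TimeoutError as exc:
        error = str(exc)
    finally:
        pool.shutdown(wait=True)  # let the worker threads finish cleanly
    return results, error
```

Calling `run_with_deadline([0.5] * 4, 0.1)` yields an empty result list and an error message of the same form as the one in the log: the tasks themselves still complete, but the caller has already stopped waiting for them.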

++ Some calls on the same thread (766390) take more than 10 seconds, which results in the health alert on the vCenter events tab and in vSAN Skyline Health. (The thread number may differ depending on the environment.)

2024-05-10T22:33:37.368Z INFO vsan-mgmt[766390] [VsanPyVmomiProfiler::logProfile opID=noOpId] VsanVcObjectHelper.isMismatch: 11.39s, 11.41s, 4.48s, 4.47s, 4.50s, 4.51s
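To check many such profiler lines quickly, the durations can be extracted and compared against the 10-second timeout. This is a minimal sketch, assuming the line format matches the single sample above and that each listed value is the duration of one call; `slow_calls` is a hypothetical helper.

```python
import re

# Format inferred from the one sample line in this article (assumption).
PROFILE_RE = re.compile(
    r"\[VsanPyVmomiProfiler::logProfile[^\]]*\]\s+"
    r"(?P<name>[\w.]+):\s+(?P<times>.+)$"
)
TIME_RE = re.compile(r"(\d+(?:\.\d+)?)s")


def slow_calls(line, threshold_s=10.0):
    """Return (call_name, durations above threshold_s), or None if no match."""
    match = PROFILE_RE.search(line)
    if match is None:
        return None
    durations = [float(t) for t in TIME_RE.findall(match.group("times"))]
    return match.group("name"), [d for d in durations if d > threshold_s]
```

For the sample line above, the helper reports the two values that exceed the threshold, 11.39 and 11.41, for `VsanVcObjectHelper.isMismatch`.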

When we examine the vmware-vsan-health-summary-result.log file, which can be found in the "var/log/vmware/vsan-health" folder in the vCenter support bundle, we expect to find the following entries for the performance service check failure. Some log excerpts are clipped to improve readability (the date, time, cluster name, ESXi host name, and thread number may vary depending on the environment).

++ Here the thread number is 766390, the thread that ran the health check for the specific cluster.

2024-05-10T22:33:40.222Z INFO vsan-mgmt[766390] [VsanHealthSummaryLogUtil::PrintHealthResult opID=noOpId] Cluster xxx Overall Health : yellow

Group cluster health : yellow

Test consistentconfig health : yellow
Issues: Host Disk Issue Recommendation
(Host-xxx, '', PerformanceServiceIsTurnedOnInClusterConfiguration,ButItIsNotEnabledYet., Auto-RemediationIsEnabled.See'AskVmware'ForMoreInformation.),

Group perfsvc health : yellow
Test perfsvcstatus health : yellow
Details: Result Status
(Yellow, PerformanceServiceIsDisabled)
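The summary log can be scanned for these two tests to see how often, and for which clusters, they turn yellow. The sketch below assumes the layout of the clipped excerpt above (a `PrintHealthResult` header line followed by `Test <name> health : <status>` lines); the real file may contain more detail, and `non_green_tests` is a hypothetical helper.

```python
import re

# Header and test-line patterns inferred from the clipped excerpt (assumption).
HEADER_RE = re.compile(
    r"^(?P<ts>\S+)\s+INFO\s+vsan-mgmt\[\d+\].*PrintHealthResult[^\]]*\]\s+"
    r"Cluster\s+(?P<cluster>.+?)\s+Overall Health\s*:\s*\w+"
)
TEST_RE = re.compile(r"^\s*Test\s+(?P<test>\w+)\s+health\s*:\s*(?P<status>\w+)")


def non_green_tests(lines, tests=("perfsvcstatus", "consistentconfig")):
    """Return (timestamp, cluster, test, status) for each non-green test line."""
    current = None  # (timestamp, cluster) of the most recent header line
    hits = []
    for line in lines:
        header = HEADER_RE.match(line)
        if header:
            current = (header.group("ts"), header.group("cluster"))
            continue
        test = TEST_RE.match(line)
        if (test and current and test.group("test") in tests
                and test.group("status") != "green"):
            hits.append(current + (test.group("test"), test.group("status")))
    return hits
```

Comparing the returned timestamps with the slow profiler entries in vmware-vsan-health-service.log helps confirm that the yellow results coincide with the API timeout rather than with a real performance service problem.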

Resolution

This is a rare situation in which the API call times out because of the large number of clusters that a particular vCenter is responsible for managing.
The user can safely disregard and suppress the health alert, since the log snippets in the Cause section show that there is no problem with the performance service. To temporarily silence a specific vSAN health check, see the KB article "Silencing a vSAN health check".