Unable to query vSAN health information due to Aria Operations API overload

Article ID: 325350

Updated On:

Products

VCF Operations/Automation (formerly VMware Aria Suite)
VMware vSAN

Issue/Introduction

  • Unable to query vSAN health information. Check vSphere Client logs for details.

  • On the vCenter Server (accessed via SSH), core files for core.vsanvcmgmtd-wor are continuously generated under the /storage/core directory.
  • The vsan-health support bundle shows many API calls, not only queryVsanPerf but also others such as queryHostStatusEx, getHclInfo, isEventQueueFull, waitForVsanHealthGenerationIdChange, and queryObjectIdentities, all of which were very slow:
vsansystem.1: 2023-03-16T22:12:18.554Z info vsansystem[#######] [vSAN@#### sub=AdapterServer opId=########-####] Invoking 'queryVsanPerf' on 'vsan-performance-manager' session '########-####-####-####-#############' active 16
vsansystem.1: 2023-03-16T22:12:18.554Z verbose vsansystem[#######] [vSAN@#### sub=PyBackedMO opId=########-####] Enter vim.cluster.VsanPerformanceManager.queryVsanPerf, Pending: 17
vsansystem.1: 2023-03-16T22:23:29.904Z info vsansystem[#######] [vSAN@#### sub=PyBackedMO opId=########-####] Exit vim.cluster.VsanPerformanceManager.queryVsanPerf (671349 ms)
vsansystem.1: 2023-03-16T22:23:29.905Z warning vsansystem[#######] [vSAN@#### sub=IO.Connection opId=########-####] Failed to write buffer to stream; <io_obj p:0x000000##########, h:124, <TCP '127.0.0.1 : 9096'>, <TCP '###.###.###.### : 0'>> e: 32(Broken pipe), async: false, duration: 0msec
vsansystem.1: 2023-03-16T22:23:29.906Z error vsansystem[#######] [vSAN@#### sub=VsanSoapSvc.HTTPService opId=########-####] Failed to write to response stream; <<io_obj p:0x000000##########, h:124, <TCP '127.0.0.1 : 9096'>, <TCP '###.###.###.###: 0'>>, ########-####-####-####-#############>, N7Vmacore15SystemExceptionE(Broken pipe: The communication pipe/socket is explicitly closed by the remote service.)
vsansystem.1: 2023-03-16T22:23:29.907Z error vsansystem[#######] [vSAN@#### sub=AdapterServer opId=########-####] Failed to send response to the client: N7Vmacore11IOExceptionE(System exception while transmitting HTTP Response:
vsansystem.1: 2023-03-16T22:23:29.908Z info vsansystem[#######] [vSAN@#### sub=IO.Connection opId=########-####] Failed to shutdown socket; <io_obj p:0x000000##########, h:124, <TCP '127.0.0.1 : 9096'>, <TCP '##.###.###.###: 0'>>, e: 104(shutdown: Connection reset by peer)
2023-03-16T22:19:40.999Z warning vsanvcmgmtd[#####] [vSAN@#### sub=Py2CppStub opId=########] Exit host-#####::vim.cluster.VsanPerformanceManager.queryNodeInformation (1115332 ms)
2023-03-16T22:19:41.101Z warning vsanvcmgmtd[#####] [vSAN@#### sub=Py2CppStub opId=########] Exit host-#####::vim.host.VsanSystemEx.queryHostStatusEx (1089715 ms)
2023-03-16T22:19:42.217Z warning vsanvcmgmtd[#####] [vSAN@#### sub=Py2CppStub opId=########] Exit host-#####::vim.host.VsanHealthSystem.getHclInfo (1085903 ms)
2023-03-16T22:19:42.221Z warning vsanvcmgmtd[#####] [vSAN@#### sub=Py2CppStub opId=########] Exit host-#####::vim.host.VsanHostEventsProcessor.isEventQueueFull (1028006 ms)
2023-03-16T22:19:42.221Z warning vsanvcmgmtd[#####] [vSAN@#### sub=Py2CppStub opId=########] Exit host-#####::vim.host.VsanHealthSystem.waitForVsanHealthGenerationIdChange (1083920 ms)
2023-03-16T22:19:43.429Z warning vsanvcmgmtd[#####] [vSAN@#### sub=Py2CppStub opId=########] Exit host-#####::vim.host.VsanHealthSystem.getHclInfo (1027108 ms)
2023-03-16T22:19:46.593Z warning vsanvcmgmtd[#####] [vSAN@#### sub=Py2CppStub opId=###-########-####] Exit host-#####::vim.cluster.VsanObjectSystem.queryObjectIdentities (1023049 ms)
  • From the vCenter log file /var/log/vmware/vsan-health/vmware-vsan-health-service-#####.log:
2023-03-16T21:38:04.223Z ERROR vsan-mgmt[#####] [VsanClusterHealthSystemImpl::PerHostQueryObjectHealthSummary opID=noOpId] Error to query object health for host #####
Traceback (most recent call last):
File "bora/vsan/health/esx/pyMo/VsanClusterHealthSystemImpl.py", line 973, in PerHostQueryObjectHealthSummary
File "/usr/lib/vmware/site-packages/pyVmomi/VmomiSupport.py", line 595, in <lambda>
self.f(*(self.args + (obj,) + args), **kwargs)
File "/usr/lib/vmware/site-packages/pyVmomi/VmomiSupport.py", line 385, in _InvokeMethod
return self._stub.InvokeMethod(self, info, args)
VsanHealthThreadMgmt.TimeoutException
  • These entries show that the vsan-health service was slow. Restarting the vsan-health service with the following command from the vCenter console does not help.

Command: vmon-cli -r vsan-health

The support bundle from the stats primary host shows many of the following log entries:

vsanmgmt.2: 2023-03-17T12:42:35.055Z info vsand[#######] [opID=########-#### statsdb::QueryStats] table: VirtualMachine, startTime: 2023-03-17 12:00:34.732000+00:00, endTime: 2023-03-17 12:05:34.732000+00:00
vsanmgmt.2: 2023-03-17T12:42:35.144Z info vsand[#######] [opID=########-#### statsdb::QueryStats] table: VirtualMachine, startTime: 2023-03-17 11:49:40.576000+00:00, endTime: 2023-03-17 11:54:40.576000+00:00
vsanmgmt.2: 2023-03-17T12:42:35.234Z info vsand[#######] [opID=########-#### statsdb::QueryStats] table: VirtualMachine, startTime: 2023-03-17 11:49:41.200000+00:00, endTime: 2023-03-17 11:54:41.200000+00:00
vsanmgmt.2: 2023-03-17T12:42:35.321Z info vsand[#######] [opID=########-#### statsdb::QueryStats] table: VirtualMachine, startTime: 2023-03-17 11:49:38.274000+00:00, endTime: 2023-03-17 11:54:38.274000+00:00

Based on the log pattern, the queries above are the VM queries issued by Aria Operations.
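
To confirm how widespread the slow calls are (and to re-check after applying the resolution), the long "Exit ... (NNNNNN ms)" entries in the vsanvcmgmtd/vsansystem logs can be counted. The short Python sketch below is illustrative only and is not a VMware tool; the regular expression and the 60-second threshold are assumptions that may need adjusting for your log format.

import re
import sys
from collections import Counter

# Matches entries such as:
#   Exit host-12345::vim.cluster.VsanPerformanceManager.queryVsanPerf (671349 ms)
EXIT_PATTERN = re.compile(r"Exit (?:[-\w]+::)?(?P<method>[\w.]+) \((?P<ms>\d+) ms\)")

def slow_calls(log_path, threshold_ms=60000):
    """Count vSAN API methods whose logged exit time exceeds threshold_ms."""
    counts = Counter()
    with open(log_path, errors="replace") as log:
        for line in log:
            match = EXIT_PATTERN.search(line)
            if match and int(match.group("ms")) >= threshold_ms:
                counts[match.group("method")] += 1
    return counts

if __name__ == "__main__":
    # Example: python3 slow_vsan_calls.py vsanvcmgmtd.log
    for method, count in slow_calls(sys.argv[1]).most_common():
        print(f"{count:6d}  {method}")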

  • The vCenter support bundle file /commands/lstool.txt shows that Aria Operations is integrated:
Attributes:
 Capabilities: VC-trusts
-------------------------------------------------------
 Name: com.vmware.vrops.label
 Description: com.vmware.vrops.summary
 Service Product: com.vmware.cis
 Service Type: com.vmware.vrops
 Service ID: #####-####-####-####-###########_com.vmware.vrops
 Site ID: default-first-site
 Owner ID: vpxd-######-####-###-####-########@vsphere.local
 Version: 6.7.0.000000
 Endpoints:
 Type: com.vmware.cis.common.resourcebundle
 Protocol: https
 URL: https://<hostname>:443/catalog/com.vmware.vrops_catalog.zip
 Endpoint Attributes:
 com.vmware.cis.common.resourcebundle.basename: cis.vcextension.com_vmware_vrops.ResourceBundle

Environment

VMware vRealize Operations 8.3.x
VMware vSAN 7.x
VMware vSAN 8.x

Cause

  • The slowness is caused by vROps vSAN adapter API calls.
  • Querying metrics for multiple VMs in a single API call is not recommended: in a large-scale setup a cluster can contain thousands of VMs, in addition to roughly 400 query specs for disk-group/cache-disk/capacity-disk entities, so a single call can take many minutes to complete (see the sketch below).
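
For comparison, the sketch below shows how a client could issue the same vSAN performance queries in small batches instead of one call covering every VM. It is a minimal illustration, not part of Aria Operations: it assumes the vSAN Management SDK for Python (pyVmomi plus the SDK's vsanmgmtObjects bindings), obtaining the vim.cluster.VsanPerformanceManager object is assumed to follow the SDK samples, and the entityRefId format and batch size are illustrative assumptions.

from pyVmomi import vim
import vsanmgmtObjects  # from the vSAN Management SDK; registers the vim.cluster.* vSAN types

def batched(items, size):
    """Yield successive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def query_vm_perf_in_batches(perf_manager, cluster, vm_entity_refs,
                             start_time, end_time, batch_size=20):
    """Query vSAN VM performance a few entities at a time (batch size is illustrative)."""
    results = []
    for chunk in batched(vm_entity_refs, batch_size):
        specs = [
            vim.cluster.VsanPerfQuerySpec(
                entityRefId=ref,        # e.g. "virtual-machine:<uuid>" (assumed format)
                startTime=start_time,
                endTime=end_time,
            )
            for ref in chunk
        ]
        # One moderate request per chunk keeps each vsanvcmgmtd/vsansystem call short,
        # instead of a single request that occupies worker threads for many minutes.
        # Method name as used in the vSAN Management SDK Python samples; this is the
        # queryVsanPerf call seen in the vsansystem logs above.
        results.extend(perf_manager.VsanPerfQueryPerf(querySpecs=specs, cluster=cluster))
    return results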

Resolution

Upgrade to Aria Operations 8.10 or Later

Description: Upgrading to vROps 8.10 or later provides a more flexible resolution. This version introduces performance optimizations and allows granular configuration changes via the UI, eliminating the need for global property file modifications.

Procedure

  1. Upgrade Aria Operations: Upgrade the Aria Operations environment to version 8.10 or later.
  2. Validate Performance: Monitor the environment to determine whether the issue persists when querying vSAN options.
    • If the issue is resolved, no further action is required.
    • If the issue persists, proceed to Step 3.
  3. Disable VM Performance Data Collection (Conditional): If high load is still observed, disable VM performance data collection for the specific vSAN adapter instance.
    • Impact: Disabling this setting will result in the loss of a single metric: Percentage of Consumers facing Disk Latency (%) on the vSAN Datastore object.

    Configuration Steps:

    1. Navigate to Data Sources > Integrations.
    2. Select the specific vSAN adapter instance and click Edit.
    3. Expand Advanced Settings.
    4. Locate the Disable VM Performance Data Collection option.
    5. Enable this setting and save the configuration.

Note: Unlike the workaround in the Additional Information section, this method applies only to the specific adapter instance modified, rather than affecting all instances globally.

Additional Information

Workaround: Disabling vSAN VM Performance Data Querying (vROps < 8.10)

Description: For vRealize Operations (vROps) versions earlier than 8.10, vSAN VM discovery must be disabled via a configuration property file to stop the system from querying vSAN VM performance data.

Impact Statement

Warning: Implementing this change will result in the loss of VM performance data collection and the vSAN "Storage Policy compliance status" property on vSAN VMs. Ensure this data is not required before proceeding.

Procedure: To disable vSAN VM discovery, perform the following steps on EACH node in the vROps cluster (a scripted version of the file edit is sketched after the note below):

  1. Navigate to the following configuration file: /usr/lib/vmware-vcops/user/plugins/inbound/VirtualAndPhysicalSANAdapter3/conf/config.properties
  2. Locate the property ENABLE_VM_DISCOVERY.
  3. Change the value to false.
  4. Save and close the file.
  5. Log in to the vROps UI.
  6. Stop and then Start the vSAN adapter instance monitoring the target vSAN environment to apply the changes.

Note: This configuration change is global; it will affect all vSAN adapter instances once they are restarted.
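
Where the same edit must be repeated on many nodes, steps 2 through 4 can be scripted. The following Python sketch is illustrative only: run it locally on each vROps node as a user with write access to the file, then restart the adapter instance from the UI as in step 6. The path is the one from step 1, and the script deliberately makes no change if the property is not found.

import re
from pathlib import Path

# Path from step 1 of the procedure above.
CONFIG = Path("/usr/lib/vmware-vcops/user/plugins/inbound/"
              "VirtualAndPhysicalSANAdapter3/conf/config.properties")

def disable_vm_discovery(config_path=CONFIG):
    """Set ENABLE_VM_DISCOVERY=false in the adapter's config.properties."""
    text = config_path.read_text()
    if not re.search(r"^ENABLE_VM_DISCOVERY\s*=", text, flags=re.MULTILINE):
        print("ENABLE_VM_DISCOVERY not found; no change made.")
        return
    updated = re.sub(r"^ENABLE_VM_DISCOVERY\s*=.*$", "ENABLE_VM_DISCOVERY=false",
                     text, flags=re.MULTILINE)
    config_path.write_text(updated)
    print("ENABLE_VM_DISCOVERY set to false; restart the vSAN adapter instance from the vROps UI.")

if __name__ == "__main__":
    disable_vm_discovery()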

Comparison of Resolution Paths

The following table outlines the impact differences between maintaining the current version versus upgrading to vROps 8.10 or later:

vROps Version     | Resolution Method                                                                                                                                              | Data Loss Impact
Earlier than 8.10 | Edit a configuration property file on every node (global change).                                                                                             | High: VM performance data collection and the "Storage Policy compliance status" property on vSAN VMs are lost.
8.10 and later    | 1. Upgrade (may fix the issue automatically). 2. If the issue persists, disable VM performance data collection via the UI Advanced Settings (instance specific). | Low: Only the "Percentage of Consumers facing Disk Latency (%)" metric is lost.

Symptoms & Impact

If this issue is not resolved, the following symptoms may be observed in the environment:

  • Performance: Significant slowness when loading vSAN options on the Cluster.
  • Error Messages: The vSphere UI displays the alert: "Unable to query vSAN health information. Check vSphere Client logs for details."